CrawlBot

CrawlBot is a Scrapy-based project that crawls specified domains and extracts webpage components such as titles, headings, images, and links. It supports dynamic configuration: you choose which spider to run and pass its start URLs at run time.

Installation

To install the CrawlBot package, use pip:

```bash
pip install crawl_bot
```

Usage

Spiders

This project includes the following spiders:

  • BasicSpider: A basic spider that extracts titles, headings, images, and links. It is run under the name basic_spider in the usage examples.
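
To make the extracted components concrete, here is a minimal sketch that pulls the same kinds of data out of an HTML string. It deliberately uses Python's standard-library `html.parser` instead of Scrapy selectors, so it is not BasicSpider's actual code; the names `PageComponentParser` and `extract_components` are hypothetical.

```python
from html.parser import HTMLParser


class PageComponentParser(HTMLParser):
    """Collects the title, headings, image sources, and link targets of a page."""

    HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__()
        self.title = ""
        self.headings = []
        self.images = []
        self.links = []
        self._current = None  # tag whose text is currently being captured
        self._buffer = []     # text chunks seen inside that tag

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title" or tag in self.HEADING_TAGS:
            self._current = tag
            self._buffer = []
        elif tag == "img" and attrs.get("src"):
            self.images.append(attrs["src"])
        elif tag == "a" and attrs.get("href"):
            self.links.append(attrs["href"])

    def handle_data(self, data):
        # Text may arrive in several chunks, so accumulate it.
        if self._current is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag != self._current:
            return
        text = "".join(self._buffer).strip()
        if tag == "title":
            self.title = text
        else:
            self.headings.append(text)
        self._current = None


def extract_components(html):
    """Return the page components as a plain dictionary."""
    parser = PageComponentParser()
    parser.feed(html)
    parser.close()
    return {
        "title": parser.title,
        "headings": parser.headings,
        "images": parser.images,
        "links": parser.links,
    }
```

In a real Scrapy spider the same fields would typically come from CSS or XPath selectors on the response object inside the spider's `parse` callback.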

Command-Line Usage

You can run the spiders from the command line using the run_spider command. Replace <spider_name> with the name of the spider you want to run and provide the start URLs:

```bash
run_spider <spider_name> ...
```

Example:

```bash
run_spider basic_spider http://example.com http://another-example.com
```
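
The command takes a spider name followed by one or more start URLs. As an illustration only, a command-line interface of this shape could be parsed with `argparse` roughly as follows. This is an assumption about the argument layout, not the package's actual implementation, and `parse_args` is a hypothetical helper.

```python
import argparse


def parse_args(argv=None):
    """Parse a spider name followed by one or more start URLs."""
    parser = argparse.ArgumentParser(
        prog="run_spider",
        description="Run a named spider against the given start URLs.",
    )
    parser.add_argument("spider_name", help="name of the spider to run")
    parser.add_argument(
        "start_urls",
        nargs="+",  # require at least one URL
        metavar="start_url",
        help="URL the spider should start crawling from",
    )
    return parser.parse_args(argv)


# Parse the same arguments as the example command above.
args = parse_args(
    ["basic_spider", "http://example.com", "http://another-example.com"]
)
print(args.spider_name, args.start_urls)
```

Using `nargs="+"` makes the parser reject an invocation that names a spider but gives no URL, which is usually the desired behaviour for a crawler entry point.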

Programmatic Usage

You can also run the spiders programmatically from another Python script:

```python
from crawl_bot.run_spider import run_spider

spider_name = 'basic_spider'
start_urls = ['http://example.com', 'http://another-example.com']
run_spider(spider_name, start_urls)
```

Project Structure

Here is an overview of the project structure:

  • scrapy.cfg: Scrapy configuration file.
  • my_scrapy_project/: Directory containing the Scrapy project.
    • items.py: Defines the items that will be scraped.
    • middlewares.py: Custom middlewares for the Scrapy project.
    • pipelines.py: Pipelines for processing scraped data.
    • settings.py: Configuration settings for the Scrapy project.
    • spiders/: Directory containing the spiders.
      • basic_spider.py: Basic spider implementation.
      • another_spider.py: Another example spider.
  • run_spider.py: Script to run the spiders.
  • setup.py: Setup script for installing the package.
  • MANIFEST.in: Configuration for including additional files in the package.
  • README.md: Project documentation.
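
To show how items.py and pipelines.py usually fit together in a Scrapy project, here is a minimal sketch. It assumes a dataclass item (a form Scrapy accepts) and the conventional `process_item(self, item, spider)` pipeline method. The names `PageItem` and `CleanPagePipeline` and the field layout are hypothetical and not taken from CrawlBot's source.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PageItem:
    """Hypothetical item holding the components scraped from one page."""

    url: str
    title: str = ""
    headings: List[str] = field(default_factory=list)
    images: List[str] = field(default_factory=list)
    links: List[str] = field(default_factory=list)


class CleanPagePipeline:
    """Trims whitespace and removes duplicate URLs while preserving order."""

    def process_item(self, item, spider):
        item.title = item.title.strip()
        # Drop headings that are empty after trimming.
        item.headings = [h.strip() for h in item.headings if h.strip()]
        # dict.fromkeys keeps the first occurrence of each URL, in order.
        item.images = list(dict.fromkeys(item.images))
        item.links = list(dict.fromkeys(item.links))
        return item
```

A pipeline like this only runs once it is listed in the `ITEM_PIPELINES` setting in settings.py.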

Contributing

We welcome contributions to CrawlBot! If you have an idea for a new feature or have found a bug, please open an issue or submit a pull request. Here's how you can contribute:

  1. Fork the repository.
  2. Create a new branch: `git checkout -b my-feature-branch`
  3. Make your changes and commit them: `git commit -m 'Add some feature'`
  4. Push to the branch: `git push origin my-feature-branch`
  5. Open a pull request.

Please ensure your code adheres to the project's coding standards and includes appropriate tests.

License

This project is licensed under the MIT License. See the LICENSE file for details.
