# CrawlBot
CrawlBot is a Scrapy-based project designed to crawl specified domains and extract various webpage components such as titles, headings, images, and links. This project supports dynamic configuration and can be used to run different spiders with specified start URLs.
## Table of Contents

- Installation
- Usage
- Project Structure
- Contributing
- License
## Installation
To install the CrawlBot package, use pip:
```bash
pip install crawl_bot
```
## Usage
### Spiders
This project includes the following spiders:
- **BasicSpider**: A basic spider that extracts titles, headings, images, links, etc.
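For orientation, here is a minimal sketch of what a spider like BasicSpider might look like. The selectors and yielded fields are illustrative assumptions, not the package's exact implementation:

```python
import scrapy


class BasicSpider(scrapy.Spider):
    # Hypothetical sketch; the actual basic_spider in crawl_bot may differ.
    name = 'basic_spider'

    def __init__(self, start_urls=None, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.start_urls = start_urls or []

    def parse(self, response):
        # Extract common page components with CSS selectors.
        yield {
            'url': response.url,
            'title': response.css('title::text').get(),
            'headings': response.css('h1::text, h2::text').getall(),
            'images': response.css('img::attr(src)').getall(),
            'links': response.css('a::attr(href)').getall(),
        }
```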
### Command-Line Usage
You can run the spiders from the command line using the `run_spider` command. Replace `<spider_name>` with the name of the spider you want to run and provide the start URLs:
```bash
run_spider <spider_name> <start_url> ...
```
Example:
```bash
run_spider basic_spider http://example.com http://another-example.com
```
### Programmatic Usage
You can also run the spiders programmatically from another Python script:
```python
from crawl_bot.run_spider import run_spider

spider_name = 'basic_spider'
start_urls = ['http://example.com', 'http://another-example.com']
run_spider(spider_name, start_urls)
```
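Under the hood, a helper like `run_spider` typically wraps Scrapy's `CrawlerProcess`. The sketch below is an assumption about its general shape; the actual `crawl_bot.run_spider` implementation may differ:

```python
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings


def run_spider(spider_name, start_urls):
    """Run the named spider against the given start URLs (illustrative sketch)."""
    process = CrawlerProcess(get_project_settings())
    # Pass the start URLs through to the spider's constructor.
    process.crawl(spider_name, start_urls=start_urls)
    process.start()  # Blocks until the crawl finishes.
```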
## Project Structure
Here is an overview of the project structure:
- `scrapy.cfg`: Scrapy configuration file.
- `my_scrapy_project/`: Directory containing the Scrapy project.
  - `items.py`: Defines the items that will be scraped (see the sketch after this list).
  - `middlewares.py`: Custom middlewares for the Scrapy project.
  - `pipelines.py`: Pipelines for processing scraped data.
  - `settings.py`: Configuration settings for the Scrapy project.
  - `spiders/`: Directory containing the spiders.
    - `basic_spider.py`: Basic spider implementation.
    - `another_spider.py`: Another example spider.
- `run_spider.py`: Script to run the spiders.
- `setup.py`: Setup script for installing the package.
- `MANIFEST.in`: Configuration for including additional files in the package.
- `README.md`: Project documentation.
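As an illustration of the `items.py` layer, a Scrapy Item for the components the spiders extract might look like the following. The class and field names are assumptions for illustration, not the package's exact schema:

```python
import scrapy


class PageItem(scrapy.Item):
    # Hypothetical item definition; field names are illustrative.
    url = scrapy.Field()
    title = scrapy.Field()
    headings = scrapy.Field()
    images = scrapy.Field()
    links = scrapy.Field()
```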
## Contributing
We welcome contributions to CrawlBot! If you have an idea for a new feature or have found a bug, please open an issue or submit a pull request. Here's how you can contribute:
1. Fork the repository.
2. Create a new branch: `git checkout -b my-feature-branch`
3. Make your changes and commit them: `git commit -m 'Add some feature'`
4. Push to the branch: `git push origin my-feature-branch`
5. Open a pull request.
Please ensure your code adheres to the project's coding standards and includes appropriate tests.
## License
This project is licensed under the MIT License. See the LICENSE file for details.