A Python library for ethical web crawling
Ethicrawl
A Python library for ethical web crawling that respects robots.txt rules, maintains proper rate limits, and provides powerful tools for web scraping.
Project Goals
Ethicrawl is built on the principle that web crawling should be:
- Ethical: Respect website owners' rights and server resources
- Safe: Prevent accidental overloading of servers or violation of policies
- Powerful: Provide a complete toolkit for professional web crawling
- Extensible: Support customization for diverse crawling needs
Key Features
- Robots.txt Compliance: Automatic parsing and enforcement of robots.txt rules
- Rate Limiting: Built-in, configurable request rate management
- Sitemap Support: Parse and filter XML sitemaps to discover content
- Domain Control: Explicit whitelisting for cross-domain access
- Flexible Configuration: Easily configure all aspects of crawling behavior
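To make the first three features concrete, here is a minimal sketch of the underlying ideas using only the Python standard library. It is not Ethicrawl's implementation or API. The names `ROBOTS_TXT`, `SITEMAP_XML`, `MinIntervalLimiter`, `sitemap_urls`, and the `ExampleBot` user agent are assumptions made up for illustration, and the robots.txt and sitemap contents are hypothetical and parsed from memory so that no network access is needed.

```python
import time
import xml.etree.ElementTree as ET
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content (illustration only).
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

# Hypothetical sitemap content (illustration only).
SITEMAP_XML = """\
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/page.html</loc></url>
  <url><loc>https://example.com/private/secret.html</loc></url>
</urlset>
"""


class MinIntervalLimiter:
    """Enforce a minimum number of seconds between consecutive requests."""

    def __init__(self, min_interval):
        self.min_interval = min_interval
        self._last = None  # monotonic timestamp of the previous request

    def wait(self):
        """Sleep just long enough to honour the interval; return seconds slept."""
        delay = 0.0
        if self._last is not None:
            elapsed = time.monotonic() - self._last
            delay = max(0.0, self.min_interval - elapsed)
        if delay > 0:
            time.sleep(delay)
        self._last = time.monotonic()
        return delay


def sitemap_urls(xml_text):
    """Return every <loc> URL listed in a sitemap document."""
    ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.findall("sm:url/sm:loc", ns)]


# Parse the robots.txt rules, then keep only the sitemap URLs they allow.
robots = RobotFileParser()
robots.parse(ROBOTS_TXT.splitlines())
allowed = [u for u in sitemap_urls(SITEMAP_XML) if robots.can_fetch("ExampleBot", u)]
print(allowed)  # ['https://example.com/page.html']
```

A crawler built this way would call `wait()` before each request, for example with `MinIntervalLimiter(robots.crawl_delay("ExampleBot"))` to honour the site's declared crawl delay.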
Installation
Install the latest version from PyPI:
pip install ethicrawl
For development:
# Clone the repository
git clone https://github.com/ethicrawl/ethicrawl.git
# Navigate to the directory
cd ethicrawl
# Install in development mode
pip install -e .
Quick Start
from ethicrawl import Ethicrawl
from ethicrawl.error import RobotDisallowedError
# Create and bind to a domain
ethicrawl = Ethicrawl()
ethicrawl.bind("https://example.com")
# Get a page - robots.txt rules automatically respected
try:
    response = ethicrawl.get("/page.html")
except RobotDisallowedError:
    print("The site prohibits fetching the page")
# Release resources when done
ethicrawl.unbind()
Documentation
Comprehensive documentation is available at https://ethicrawl.github.io/ethicrawl/
License
Apache 2.0 License - See LICENSE file for details.
File details
Details for the file ethicrawl-1.0.0b1.tar.gz.
File metadata
- Download URL: ethicrawl-1.0.0b1.tar.gz
- Upload date:
- Size: 53.3 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.10.16
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | a412b746d0bde4e1b7c5e09a408c6c9246c43778e11050ddb8bed58aab145a54 |
| MD5 | 1cf889c72379794bf28f6696c1bdc59a |
| BLAKE2b-256 | 331ece16cd629a43545a1a01f0e7ed721de021071b4ec9308ab3d5b08d840520 |
File details
Details for the file ethicrawl-1.0.0b1-py3-none-any.whl.
File metadata
- Download URL: ethicrawl-1.0.0b1-py3-none-any.whl
- Upload date:
- Size: 72.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.10.16
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 1367e02123f4bd6625f3d79d68c2699c4ab616a4502e0977652542f2656cbfeb |
| MD5 | 7e68edfbc7d1dfe567ee17ca2f323630 |
| BLAKE2b-256 | 8cc0c1bced6c034449a58c67d675f049d223b3c2adfb3bd928a9cc481c552a98 |