A Python library for ethical web crawling

Ethicrawl

Ethicrawl is a Python library for ethical, professional-grade web crawling. It automatically respects robots.txt, enforces rate limits, and offers robust sitemap parsing and domain control—making it easy to build reliable and responsible crawlers.

Project Goals

Ethicrawl is built on the principle that web crawling should be:

  • Ethical by Design: Automatically respects robots.txt and rate limits, ensuring responsible web crawling.
  • Server-Safe: Prevents accidental overloading with built-in safeguards.
  • Feature-Rich: Includes robust sitemap parsing, domain control, and flexible configuration.
  • Extensible & Customizable: Easily adapts to diverse crawling needs through flexible settings and clean architecture.

Key Features

  • Robots.txt Compliance: Automatic parsing and enforcement of robots.txt rules
  • Rate Limiting: Built-in, configurable request rate management
  • Sitemap Support: Parse and filter XML sitemaps to discover content
  • Domain Control: Explicit whitelisting for cross-domain access
  • Flexible Configuration: Easily configure all aspects of crawling behavior
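
Ethicrawl's rate limiter is built in, and its exact API is not shown on this page. The underlying technique, though, is simply enforcing a minimum interval between successive requests. A stdlib-only sketch of that idea (class and method names here are illustrative, not Ethicrawl's):

```python
import time


class MinIntervalLimiter:
    """Enforce a minimum delay between successive requests."""

    def __init__(self, min_interval: float):
        self.min_interval = min_interval
        self._last_request = 0.0  # monotonic timestamp of the previous request

    def wait(self) -> float:
        """Sleep just long enough to honor the interval; return seconds slept."""
        now = time.monotonic()
        elapsed = now - self._last_request
        slept = 0.0
        if elapsed < self.min_interval:
            slept = self.min_interval - elapsed
            time.sleep(slept)
        self._last_request = time.monotonic()
        return slept


# Pace a loop to at most ~10 requests per second
limiter = MinIntervalLimiter(min_interval=0.1)
start = time.monotonic()
for _ in range(3):
    limiter.wait()
elapsed = time.monotonic() - start
```

The first call returns immediately (no prior request to space against); the remaining calls each sleep roughly `min_interval`, so the loop above takes at least ~0.2 seconds.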

Documentation

Comprehensive documentation is available at https://ethicrawl.github.io/ethicrawl/

Installation

Install the latest version from PyPI:

pip install ethicrawl

For development:

# Clone the repository
git clone https://github.com/ethicrawl/ethicrawl.git

# Navigate to the directory
cd ethicrawl

# Create and activate a virtual environment (optional but recommended)
python -m venv venv
source venv/bin/activate  # Linux/Mac
venv\Scripts\activate     # Windows

# Install in development mode
pip install -e .

Quick Start

from ethicrawl import Ethicrawl
from ethicrawl.error import RobotDisallowedError

# Create and bind to a domain
ethicrawl = Ethicrawl()
ethicrawl.bind("https://example.com")

# Get a page - robots.txt rules automatically respected
try:
    response = ethicrawl.get("https://example.com/page.html")
except RobotDisallowedError:
    print("robots.txt disallows fetching this page")

# Release resources when done
ethicrawl.unbind()
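
The robots.txt enforcement that `get()` performs above happens inside the library. For a sense of what such a check involves, the standard library's `urllib.robotparser` implements the same kind of rule matching; the rules string below is a made-up example, not fetched from any real site:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt rules for illustration only
rules = """\
User-agent: *
Disallow: /private/
Allow: /
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# A compliant crawler consults can_fetch() before every request
print(parser.can_fetch("*", "https://example.com/public/page.html"))   # True
print(parser.can_fetch("*", "https://example.com/private/data.html"))  # False
```

A disallowed path is the situation the `RobotDisallowedError` in the Quick Start represents: the library checks the rules first and refuses the fetch rather than sending the request.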

License

Apache 2.0 License - See LICENSE file for details.
