A Python library for ethical web crawling

Ethicrawl

Ethicrawl is a Python library for ethical, professional-grade web crawling. It automatically respects robots.txt, enforces rate limits, and offers robust sitemap parsing and domain control—making it easy to build reliable and responsible crawlers.

Project Goals

Ethicrawl is built on the principle that web crawling should be:

  • Ethical by Design: Automatically respects robots.txt and rate limits, ensuring responsible web crawling.
  • Server-Safe: Prevents accidental overloading with built-in safeguards.
  • Feature-Rich: Includes robust sitemap parsing, domain control, and flexible configuration.
  • Extensible & Customizable: Easily adapts to diverse crawling needs through flexible settings and clean architecture.

Key Features

  • Robots.txt Compliance: Automatic parsing and enforcement of robots.txt rules
  • Rate Limiting: Built-in, configurable request rate management
  • Sitemap Support: Parse and filter XML sitemaps to discover content
  • Domain Control: Explicit whitelisting for cross-domain access
  • Flexible Configuration: Easily configure all aspects of crawling behavior
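
Ethicrawl's rate limiter is built in, and its exact API is not shown on this page. The underlying technique, though, is simply enforcing a minimum interval between successive requests. A stdlib-only sketch of that idea (class and method names here are illustrative, not Ethicrawl's):

```python
import time


class MinIntervalLimiter:
    """Enforce a minimum delay between successive requests."""

    def __init__(self, min_interval: float):
        self.min_interval = min_interval
        self._last_request = 0.0  # monotonic timestamp of the previous request

    def wait(self) -> float:
        """Sleep just long enough to honor the interval; return seconds slept."""
        now = time.monotonic()
        elapsed = now - self._last_request
        slept = 0.0
        if elapsed < self.min_interval:
            slept = self.min_interval - elapsed
            time.sleep(slept)
        self._last_request = time.monotonic()
        return slept


# Pace a loop to at most ~10 requests per second
limiter = MinIntervalLimiter(min_interval=0.1)
start = time.monotonic()
for _ in range(3):
    limiter.wait()
elapsed = time.monotonic() - start
```

The first call returns immediately (no prior request to space against); the remaining calls each sleep roughly `min_interval`, so the loop above takes at least ~0.2 seconds.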

Documentation

Comprehensive documentation is available at https://ethicrawl.github.io/ethicrawl/

Installation

Install the latest version from PyPI:

pip install ethicrawl

For development:

# Clone the repository
git clone https://github.com/ethicrawl/ethicrawl.git

# Navigate to the directory
cd ethicrawl

# Create and activate a virtual environment (optional but recommended)
python -m venv venv
source venv/bin/activate  # Linux/Mac
venv\Scripts\activate     # Windows

# Install in development mode
pip install -e .

Quick Start

from ethicrawl import Ethicrawl
from ethicrawl.error import RobotDisallowedError

# Create and bind to a domain
ethicrawl = Ethicrawl()
ethicrawl.bind("https://example.com")

# Get a page - robots.txt rules automatically respected
try:
    response = ethicrawl.get("https://example.com/page.html")
except RobotDisallowedError:
    print("robots.txt disallows fetching this page")

# Release resources when done
ethicrawl.unbind()
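
The robots.txt enforcement that `get()` performs above happens inside the library. For a sense of what such a check involves, the standard library's `urllib.robotparser` implements the same kind of rule matching; the rules string below is a made-up example, not fetched from any real site:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt rules for illustration only
rules = """\
User-agent: *
Disallow: /private/
Allow: /
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# A compliant crawler consults can_fetch() before every request
print(parser.can_fetch("*", "https://example.com/public/page.html"))   # True
print(parser.can_fetch("*", "https://example.com/private/data.html"))  # False
```

A disallowed path is the situation the `RobotDisallowedError` in the Quick Start represents: the library checks the rules first and refuses the fetch rather than sending the request.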

License

Apache 2.0 License - See LICENSE file for details.
