A Python library for ethical web crawling

Ethicrawl

A Python library for ethical web crawling that respects robots.txt rules, maintains proper rate limits, and provides powerful tools for web scraping.

Project Goals

Ethicrawl is built on the principle that web crawling should be:

  • Ethical: Respect website owners' rights and server resources
  • Safe: Prevent accidental overloading of servers or violation of policies
  • Powerful: Provide a complete toolkit for professional web crawling
  • Extensible: Support customization for diverse crawling needs

Key Features

  • Robots.txt Compliance: Automatic parsing and enforcement of robots.txt rules
  • Rate Limiting: Built-in, configurable request rate management
  • Sitemap Support: Parse and filter XML sitemaps to discover content
  • Domain Control: Explicit whitelisting for cross-domain access
  • Flexible Configuration: Easily configure all aspects of crawling behavior
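
To make the first two features concrete, here is a minimal standard-library sketch of what robots.txt enforcement and rate limiting involve. It is an illustration of the general technique, not Ethicrawl's implementation: the `ROBOTS_TXT` content, the `ExampleBot` user agent, and the `build_parser` and `MinIntervalLimiter` names are all hypothetical and are not part of the Ethicrawl API.

```python
import time
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, parsed locally so no network access is needed.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Return a parser loaded with the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


class MinIntervalLimiter:
    """Enforce a minimum gap (in seconds) between consecutive requests."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock    # injectable so the limiter can be tested without waiting
        self._sleep = sleep
        self._last = None      # time at which the previous request was allowed

    def wait(self) -> float:
        """Block until the next request is allowed; return the seconds slept."""
        now = self._clock()
        delay = 0.0
        if self._last is not None:
            delay = max(0.0, self.min_interval - (now - self._last))
            if delay > 0:
                self._sleep(delay)
        self._last = now + delay
        return delay


if __name__ == "__main__":
    parser = build_parser(ROBOTS_TXT)
    # Fall back to a 1-second gap if the site declares no Crawl-delay.
    limiter = MinIntervalLimiter(parser.crawl_delay("ExampleBot") or 1)

    for url in ("https://example.com/page.html",
                "https://example.com/private/data.html"):
        if parser.can_fetch("ExampleBot", url):
            limiter.wait()
            print("allowed:", url)  # a real crawler would issue the request here
        else:
            print("disallowed by robots.txt:", url)
```

A crawler built this way checks every URL against the parsed rules before requesting it and spaces its requests by at least the declared crawl delay. Ethicrawl performs both steps automatically, as the Quick Start example shows.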

Installation

Install the latest version from PyPI:

pip install ethicrawl

For development:

# Clone the repository
git clone https://github.com/ethicrawl/ethicrawl.git

# Navigate to the directory
cd ethicrawl

# Install in development mode
pip install -e .

Quick Start

from ethicrawl import Ethicrawl
from ethicrawl.error import RobotDisallowedError

# Create and bind to a domain
ethicrawl = Ethicrawl()
ethicrawl.bind("https://example.com")

# Get a page - robots.txt rules automatically respected
try:
    response = ethicrawl.get("/page.html")
except RobotDisallowedError:
    print("The site prohibits fetching the page")

# Release resources when done
ethicrawl.unbind()

Documentation

Comprehensive documentation is available at https://ethicrawl.github.io/ethicrawl/

License

Apache 2.0 License - See LICENSE file for details.
