
DevDoc Crawler

A CLI tool to crawl developer documentation websites and save each page as a Markdown file.

This tool uses crawl4ai to perform deep crawling and extract content suitable for ingestion into RAG pipelines or direct use with LLMs.
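
Under the hood, a crawl amounts to configuring crawl4ai with a deep-crawl strategy and iterating over the results. The sketch below is illustrative only: it assumes crawl4ai's AsyncWebCrawler, CrawlerRunConfig, and BFSDeepCrawlStrategy APIs, and names and defaults may differ between crawl4ai versions and from what devdocs-crawler actually does.

import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.deep_crawling import BFSDeepCrawlStrategy

async def crawl(start_url: str) -> None:
    # Breadth-first crawl, one level beyond the start URL, staying on the original domain.
    config = CrawlerRunConfig(
        deep_crawl_strategy=BFSDeepCrawlStrategy(max_depth=1, include_external=False),
        stream=True,  # yield each page as soon as it finishes
    )
    async with AsyncWebCrawler() as crawler:
        async for result in await crawler.arun(start_url, config=config):
            if result.success:
                # result.markdown carries the extracted Markdown for the page.
                print(result.url, len(str(result.markdown)))

asyncio.run(crawl("https://docs.python.org/3/"))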

Features

  • Crawls websites starting from a given URL.
  • Uses crawl4ai's deep crawling (BFSDeepCrawlStrategy by default).
  • Stays within the original domain (does not follow external links).
  • Saves the markdown content of each successfully crawled page.
  • Organizes output into a subdirectory named after the crawled domain (e.g., output_dir/docs_example_com/).
  • Attempts to preserve the URL path structure within the domain subdirectory (see the path-mapping sketch after this list).
  • Offers a streaming mode (--stream, enabled by default) to process pages as they arrive.
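
As a concrete illustration of the output layout described above, here is a small path-mapping example. The helper below is purely illustrative (url_to_output_path and the exact sanitization rules are assumptions, not the tool's actual code); it shows the general idea of deriving a file path from a page URL.

from pathlib import Path
from urllib.parse import urlparse

def url_to_output_path(url: str, output_dir: Path) -> Path:
    # Hypothetical helper mirroring the documented layout, e.g.
    # https://docs.example.com/api/client -> output_dir/docs_example_com/api/client.md
    parsed = urlparse(url)
    domain_dir = parsed.netloc.replace(".", "_").replace(":", "_")
    page_path = parsed.path.strip("/") or "index"
    return output_dir / domain_dir / f"{page_path}.md"

print(url_to_output_path("https://docs.example.com/api/client", Path("./devdocs_crawler_output")))
# devdocs_crawler_output/docs_example_com/api/client.md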

Installation (Recommended: pipx)

Using pipx is recommended as it installs the tool and its dependencies in an isolated environment, preventing conflicts with other Python projects.

# Ensure you have Python 3.12+ and pipx installed (pip install pipx)

pipx install devdocs-crawler

# To upgrade later:
pipx upgrade devdocs-crawler

Alternative Installation (pip)

You can also install using pip directly (ideally within a virtual environment):

# Ensure you have Python 3.12+ installed

pip install devdocs-crawler

Usage

# Basic usage (crawl depth 1, stream enabled by default)
# Saves to ./devdocs_crawler_output/<domain_name>/
# Example: Saves to ./devdocs_crawler_output/docs_python_org/
devdocs-crawler https://docs.python.org/3/

# Specify a different base output directory
# Example: Saves to ./python_docs/docs_python_org/
devdocs-crawler https://docs.python.org/3/ -o ./python_docs

# Example: Crawl Neo4j GDS docs (depth 2)
# Example: Saves to ./devdocs_crawler_output/neo4j_com/
devdocs-crawler https://neo4j.com/docs/graph-data-science/current/ -d 2

# Example: Disable streaming
devdocs-crawler https://docs.example.com --no-stream

Options:

  • start_url: (Required) The starting URL for the crawl (must include a scheme such as https://).
  • -o, --output DIRECTORY: Base directory to save crawl-specific subdirectories (default: ./devdocs_crawler_output).
  • -d, --depth INTEGER: Crawling depth beyond the start URL (0 = start URL only, 1 = start URL + linked pages, etc.) (default: 1).
  • --max-pages INTEGER: Maximum total number of pages to crawl (default: no limit).
  • --stream / --no-stream: Streaming mode processes pages as they arrive. Enabled by default. Use --no-stream to disable it and process all pages after the crawl finishes (see the sketch after this list).
  • -v, --verbose: Increase logging verbosity (-v for INFO, -vv for DEBUG). Default is WARNING.
  • --version: Show the package version and exit.
  • -h, --help: Show the help message and exit.
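
With --no-stream, pages are not handled one by one; everything is collected first and processed once the crawl finishes. At the crawl4ai level the difference is roughly stream=False, as in this hedged sketch (same API assumptions as the earlier example; the real implementation may differ):

import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.deep_crawling import BFSDeepCrawlStrategy

async def batch_crawl(start_url: str) -> None:
    # --no-stream behaviour: collect all results first, then process them.
    config = CrawlerRunConfig(
        deep_crawl_strategy=BFSDeepCrawlStrategy(max_depth=1, include_external=False),
        stream=False,
    )
    async with AsyncWebCrawler() as crawler:
        results = await crawler.arun(start_url, config=config)  # a list when stream=False
        for result in results:
            if result.success:
                print(result.url)

asyncio.run(batch_crawl("https://docs.example.com"))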

Development

  1. Clone the repository: git clone https://github.com/youssef-tharwat/devdocs-crawler (Replace with your fork if contributing)
  2. Navigate to the project directory: cd devdocs-crawler
  3. Install uv: If you don't have it, install uv (e.g., pip install uv or see uv installation docs).
  4. Create environment & Install: Use uv to create an environment and install dependencies (including dev dependencies). Requires Python 3.12+.
    uv venv # Creates .venv
    uv sync --dev # Syncs based on pyproject.toml
    
    (Alternatively, if you prefer manual venv: python3.12 -m venv .venv, source .venv/bin/activate, then uv pip install -e .[dev])
  5. Activate the environment:
    • macOS/Linux: source .venv/bin/activate
    • Windows: .venv\Scripts\activate

Now you can run the tool using devdocs-crawler from within the activated environment.

You can run linters and formatters:

ruff check .
ruff format .

And run tests (if/when tests are added):

pytest

Building and Publishing (using uv)

  1. Ensure your pyproject.toml has the correct version number and author details.
  2. Build the distributions:
    uv build
    
    This creates wheel and source distributions in the dist/ directory.
  3. Publish to PyPI (requires a PyPI account and an API token configured with uv):
    uv publish
    
    You can also publish to TestPyPI by pointing uv at the TestPyPI upload endpoint (for example, uv publish --publish-url https://test.pypi.org/legacy/). See uv publish --help for more options, including providing tokens via environment variables or arguments.

Contributing

Contributions are welcome! If the repository includes a CONTRIBUTING.md file, please follow the guidelines there.

License

This project is licensed under the MIT License - see the LICENSE file for details.

