
DevDoc Crawler

A CLI tool to crawl developer documentation websites and save each page as a Markdown file.

This tool uses crawl4ai to perform deep crawling and extract content suitable for ingestion into RAG pipelines or direct use with LLMs.
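
Under the hood, a crawl amounts to configuring crawl4ai with a deep-crawl strategy and iterating over the results. The sketch below is illustrative only: it assumes crawl4ai's AsyncWebCrawler, CrawlerRunConfig, and BFSDeepCrawlStrategy APIs, and names and defaults may differ between crawl4ai versions and from what devdocs-crawler actually does.

import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.deep_crawling import BFSDeepCrawlStrategy

async def crawl(start_url: str) -> None:
    # Breadth-first crawl, one level beyond the start URL, staying on the original domain.
    config = CrawlerRunConfig(
        deep_crawl_strategy=BFSDeepCrawlStrategy(max_depth=1, include_external=False),
        stream=True,  # yield each page as soon as it finishes
    )
    async with AsyncWebCrawler() as crawler:
        async for result in await crawler.arun(start_url, config=config):
            if result.success:
                # result.markdown carries the extracted Markdown for the page.
                print(result.url, len(str(result.markdown)))

asyncio.run(crawl("https://docs.python.org/3/"))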

Features

  • Crawls websites starting from a given URL.
  • Uses crawl4ai's deep crawling (BFSDeepCrawlStrategy by default).
  • Stays within the original domain (does not follow external links).
  • Saves the markdown content of each successfully crawled page.
  • Organizes output into a subdirectory named after the crawled domain (e.g., output_dir/docs_example_com/).
  • Attempts to preserve the URL path structure within the domain subdirectory (see the path-mapping sketch after this list).
  • Offers a streaming mode (--stream, enabled by default) to process pages as they arrive.
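
As a concrete illustration of the output layout described above, here is a small path-mapping example. The helper below is purely illustrative (url_to_output_path and the exact sanitization rules are assumptions, not the tool's actual code); it shows the general idea of deriving a file path from a page URL.

from pathlib import Path
from urllib.parse import urlparse

def url_to_output_path(url: str, output_dir: Path) -> Path:
    # Hypothetical helper mirroring the documented layout, e.g.
    # https://docs.example.com/api/client -> output_dir/docs_example_com/api/client.md
    parsed = urlparse(url)
    domain_dir = parsed.netloc.replace(".", "_").replace(":", "_")
    page_path = parsed.path.strip("/") or "index"
    return output_dir / domain_dir / f"{page_path}.md"

print(url_to_output_path("https://docs.example.com/api/client", Path("./devdocs_crawler_output")))
# devdocs_crawler_output/docs_example_com/api/client.md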

Installation (Recommended: pipx)

Using pipx is recommended as it installs the tool and its dependencies in an isolated environment, preventing conflicts with other Python projects.

# Ensure you have Python 3.12+ and pipx installed (pip install pipx)

pipx install devdocs-crawler

# To upgrade later:
pipx upgrade devdocs-crawler

Alternative Installation (pip)

You can also install using pip directly (ideally within a virtual environment):

# Ensure you have Python 3.12+ installed

pip install devdocs-crawler

Usage

# Basic usage (crawl depth 1, stream enabled by default)
# Saves to ./devdocs_crawler_output/<domain_name>/
# Example: Saves to ./devdocs_crawler_output/docs_python_org/
devdocs-crawler https://docs.python.org/3/

# Specify a different base output directory
# Example: Saves to ./python_docs/docs_python_org/
devdocs-crawler https://docs.python.org/3/ -o ./python_docs

# Example: Crawl Neo4j GDS docs (depth 2)
# Example: Saves to ./devdocs_crawler_output/neo4j_com/
devdocs-crawler https://neo4j.com/docs/graph-data-science/current/ -d 2

# Example: Disable streaming
devdocs-crawler https://docs.example.com --no-stream

Options:

  • start_url: (Required) The starting URL for the crawl (must include a scheme such as https://).
  • -o, --output DIRECTORY: Base directory to save crawl-specific subdirectories (default: ./devdocs_crawler_output).
  • -d, --depth INTEGER: Crawling depth beyond the start URL (0 = start URL only, 1 = start URL + linked pages, etc.) (default: 1).
  • --max-pages INTEGER: Maximum total number of pages to crawl (default: no limit).
  • --stream / --no-stream: Streaming mode processes pages as they arrive. Enabled by default. Use --no-stream to disable it and process all pages after the crawl finishes (see the sketch after this list).
  • -v, --verbose: Increase logging verbosity (-v for INFO, -vv for DEBUG). Default is WARNING.
  • --version: Show the package version and exit.
  • -h, --help: Show the help message and exit.
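
With --no-stream, pages are not handled one by one; everything is collected first and processed once the crawl finishes. At the crawl4ai level the difference is roughly stream=False, as in this hedged sketch (same API assumptions as the earlier example; the real implementation may differ):

import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.deep_crawling import BFSDeepCrawlStrategy

async def batch_crawl(start_url: str) -> None:
    # --no-stream behaviour: collect all results first, then process them.
    config = CrawlerRunConfig(
        deep_crawl_strategy=BFSDeepCrawlStrategy(max_depth=1, include_external=False),
        stream=False,
    )
    async with AsyncWebCrawler() as crawler:
        results = await crawler.arun(start_url, config=config)  # a list when stream=False
        for result in results:
            if result.success:
                print(result.url)

asyncio.run(batch_crawl("https://docs.example.com"))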

Development

  1. Clone the repository: git clone https://github.com/youssef-tharwat/devdocs-crawler (Replace with your fork if contributing)
  2. Navigate to the project directory: cd devdocs-crawler
  3. Install uv: If you don't have it, install uv (e.g., pip install uv or see uv installation docs).
  4. Create environment & Install: Use uv to create an environment and install dependencies (including dev dependencies). Requires Python 3.12+.
    uv venv # Creates .venv
    uv sync --dev # Syncs based on pyproject.toml
    
    (Alternatively, if you prefer manual venv: python3.12 -m venv .venv, source .venv/bin/activate, then uv pip install -e .[dev])
  5. Activate the environment:
    • macOS/Linux: source .venv/bin/activate
    • Windows: .venv\Scripts\activate

Now you can run the tool using devdocs-crawler from within the activated environment.

You can run linters and formatters:

ruff check .
ruff format .

And run tests (if/when tests are added):

pytest

Building and Publishing (using uv)

  1. Ensure your pyproject.toml has the correct version number and author details.
  2. Build the distributions:
    uv build
    
    This creates wheel and source distributions in the dist/ directory.
  3. Publish to PyPI (requires a PyPI account and an API token configured with uv):
    uv publish
    
    You can also publish to TestPyPI by pointing uv at the TestPyPI upload endpoint (for example, uv publish --publish-url https://test.pypi.org/legacy/). See uv publish --help for more options, including providing tokens via environment variables or arguments.

Contributing

Contributions are welcome! If the repository includes a CONTRIBUTING.md file, please follow the guidelines there.

License

This project is licensed under the MIT License - see the LICENSE file for details.

