Universal documentation crawler that converts HTML pages to Markdown with internal link correction

scrapy-mth

A Scrapy-based universal documentation crawler that converts HTML documentation sites to Markdown format, with automatic internal link rewriting to local .md relative paths. Supports multiple converter engines (markitdown / html2text), path whitelist filtering, and automatic media file download.
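To make the link-rewriting idea concrete, here is a minimal sketch of how an internal link might be turned into a relative .md path. It is not the project's actual implementation: the function names url_to_md_path and rewrite_link, the index.md convention for directory URLs, and the on-disk layout are all assumptions made for illustration.

```python
import posixpath
from urllib.parse import urljoin, urlparse


def url_to_md_path(url: str) -> str:
    """Map a page URL to a local Markdown path (hypothetical layout)."""
    path = urlparse(url).path
    if path == "" or path.endswith("/"):
        path += "index.html"  # assumption: directory URLs become index pages
    if path.endswith((".html", ".htm")):
        path = path.rsplit(".", 1)[0]  # drop the HTML extension
    return path.lstrip("/") + ".md"


def rewrite_link(page_url: str, href: str, domain: str) -> str:
    """Return a relative .md link for internal targets, else href unchanged."""
    target = urljoin(page_url, href)  # resolve relative hrefs against the page
    parsed = urlparse(target)
    if parsed.scheme not in ("http", "https") or parsed.netloc != domain:
        return href  # external or non-HTTP link: leave untouched
    here = posixpath.dirname(url_to_md_path(page_url)) or "."
    rel = posixpath.relpath(url_to_md_path(target), start=here)
    # Keep any #fragment so in-page anchors still work after conversion.
    return rel + ("#" + parsed.fragment if parsed.fragment else "")
```

For example, a link to ../api/index.html#usage on the page https://example.com/docs/guide/intro.html would become ../api/index.md#usage under these assumptions, while links to other domains pass through unchanged.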

Parameters

On the command line, each parameter is passed as a flag with hyphens in place of underscores (for example, start_urls becomes --start-urls).

start_urls: Starting URLs (comma-separated)
allowed_domains: Domains the crawler may visit (comma-separated)
deny_patterns: Regular-expression patterns for URLs to skip (comma-separated)
allow_paths: Allowed path prefixes (comma-separated); only URLs whose path starts with one of these prefixes are processed
body_selector: CSS selector for the main HTML content (default: "main, article, .content, .document, .body, body")
output_dir: Output directory (default: "~/.config/doc_crawler/_docs/{domain_name}", where {domain_name} is extracted from start_urls)
converter_engine: Converter engine (default: "markitdown"; alternative: "html2text")
single_page: Single-page mode (default: "false"; set to "true" to crawl only the start page without following links)
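As an illustration of the comma-separated values and the default output_dir described above, the following sketch shows one way they could be handled. The helper names split_csv and default_output_dir are hypothetical, and taking {domain_name} from the hostname of the first start URL is an assumption; the source only says it is extracted from start_urls.

```python
from pathlib import Path
from urllib.parse import urlparse


def split_csv(value: str) -> list[str]:
    """Split a comma-separated parameter value, dropping blanks and spaces."""
    return [item.strip() for item in value.split(",") if item.strip()]


def default_output_dir(start_urls: str) -> Path:
    """Build the documented default path for a start_urls value."""
    urls = split_csv(start_urls)
    if not urls:
        raise ValueError("start_urls must contain at least one URL")
    # Assumption: the domain name is the hostname of the first start URL.
    domain_name = urlparse(urls[0]).hostname or "unknown"
    return Path("~/.config/doc_crawler/_docs").expanduser() / domain_name
```

With this sketch, a start_urls value of "https://docs.astral.sh/uv/" would resolve to a directory named docs.astral.sh under ~/.config/doc_crawler/_docs. One consequence of splitting on commas this way is that a deny pattern containing a literal comma would be divided into two patterns.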

Install via uv

uv tool install git+https://github.com/zwidny/doc_crawler.git

After installation, you can use the doc_crawler command from any directory.

Usage Examples

# Crawl AKShare documentation
doc_crawler --start-urls "https://akshare.akfamily.xyz" \
  --allowed-domains "akshare.akfamily.xyz" \
  --deny-patterns "/_sources/" \
  --body-selector "main, article, .content, .document, .body" \
  --output-dir "_docs/akshare_markdown"

Single-page mode

doc_crawler --start-urls "https://build123d.readthedocs.io/en/stable/examples_1.html" \
  --single-page true \
  --body-selector ".wy-nav-content" \
  --output-dir "single_page_output"

Path whitelist filtering

doc_crawler --start-urls "https://opencode.ai/docs/zh-cn/" \
  --allow-paths "/docs/zh-cn/" \
  --body-selector "main, article, .content" \
  --output-dir "_docs/opencode_docs_zh_cn"
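The interplay of the allowed domains, deny patterns, and path whitelist can be sketched as a single predicate. This is an illustrative approximation rather than the crawler's real code: the function name should_crawl is hypothetical, and matching deny patterns against the full URL while matching prefixes against only the URL path are assumptions.

```python
import re
from urllib.parse import urlparse


def should_crawl(url, allowed_domains, deny_patterns, allow_paths):
    """Decide whether a URL passes the domain, deny, and whitelist filters."""
    parsed = urlparse(url)
    # An empty allowed_domains list is treated as "no domain restriction".
    if allowed_domains and parsed.hostname not in allowed_domains:
        return False
    # Any matching deny pattern rejects the URL (searched anywhere in it).
    if any(re.search(pattern, url) for pattern in deny_patterns):
        return False
    # An empty allow_paths list is treated as "no whitelist".
    if allow_paths and not any(parsed.path.startswith(p) for p in allow_paths):
        return False
    return True
```

With allow_paths set to ["/docs/zh-cn/"], a URL such as https://opencode.ai/docs/zh-cn/config/ passes, while https://opencode.ai/docs/en/ is rejected because its path falls outside the whitelist.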

Crawl with html2text engine

doc_crawler --start-urls "https://akshare.akfamily.xyz/" \
  --allowed-domains "akshare.akfamily.xyz" \
  --deny-patterns "/_sources/" \
  --body-selector "main, article, .content, .document, .body" \
  --converter-engine "html2text" \
  --output-dir "_docs/akshare_markdown_html2text"

More examples

# Crawl build123d docs
doc_crawler --start-urls "https://build123d.readthedocs.io/en/stable/" \
  --deny-patterns "/_sources/,/latest/" \
  --body-selector ".wy-nav-content" \
  --output-dir "_docs/build123d"

# Crawl Docusaurus docs
doc_crawler --start-urls "https://docusaurus.io/docs" \
  --allow-paths "/docs" \
  --body-selector ".col.docItemCol_n6xZ" \
  --output-dir "_docs/docusaurus"

# Crawl uv documentation
doc_crawler --start-urls "https://docs.astral.sh/uv/" \
  --body-selector ".md-content" \
  --output-dir "_docs/uv"

Release history

0.1.2: May 12, 2026
0.1.1: May 12, 2026
0.1.0: May 6, 2026 (this version)

Download files

Source Distribution

html_docs_crawler-0.1.0.tar.gz (14.8 kB)

Uploaded May 6, 2026 Source

Built Distribution

html_docs_crawler-0.1.0-py3-none-any.whl (16.6 kB)

Uploaded May 6, 2026 Python 3

File details

Details for the file html_docs_crawler-0.1.0.tar.gz.

File metadata

Download URL: html_docs_crawler-0.1.0.tar.gz
Upload date: May 6, 2026
Size: 14.8 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.9.28 (publish subcommand, on Ubuntu 24.04 "noble")

File hashes

Hashes for html_docs_crawler-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`8bfb32e9e12afcad5d7742cf7ea7401ca49206c2ef9d6959b577cedce0cc19f9`
MD5	`dc5a49f05d5411de4284eef1e51012e9`
BLAKE2b-256	`4ee9c122b6a01a0dde9044857c6cc41ae4f0f18f943cf4fcecb92c54508b4299`
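A downloaded file can be checked against a published digest with the standard library alone. The sketch below is one way to do it; the helper names sha256_of and matches_published are illustrative and not part of this package.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 65536) -> str:
    """Return the hex SHA256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        # Read fixed-size chunks until read() returns an empty bytes object.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published(path: Path, expected_hex: str) -> bool:
    """Compare a file's SHA256 digest with a published hex digest."""
    return sha256_of(path) == expected_hex.strip().lower()
```

Calling matches_published with the path to the downloaded html_docs_crawler-0.1.0.tar.gz and the SHA256 value from the table above should return True for an intact download.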

File details

Details for the file html_docs_crawler-0.1.0-py3-none-any.whl.

File metadata

Download URL: html_docs_crawler-0.1.0-py3-none-any.whl
Upload date: May 6, 2026
Size: 16.6 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.9.28 (publish subcommand, on Ubuntu 24.04 "noble")

File hashes

Hashes for html_docs_crawler-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`c0c1229cb406aba70033a0bb539212b8ffcd852b8b0ca7eee795150c1a319058`
MD5	`e0b6b3934d23b7fdf1e37f64f248a755`
BLAKE2b-256	`200f14ded0c6f735474be33c57af9f3976ffabf12ffc1ec8eaa81cd829f5b0ca`
