Universal documentation crawler that converts HTML pages to Markdown with internal link correction
scrapy-mth
A Scrapy-based universal documentation crawler that converts HTML documentation sites to Markdown format, with automatic internal link rewriting to local .md relative paths. Supports multiple converter engines (markitdown / html2text), path whitelist filtering, and automatic media file download.
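The internal link rewriting described above can be illustrated with a small standard-library sketch. This is not the project's actual implementation: the helper names `url_to_md_path` and `rewrite_link` are hypothetical, and the mapping rules (directory-style URLs become `index.md`, the last file extension is replaced by `.md`) are assumptions made for illustration.

```python
import posixpath
from urllib.parse import urljoin, urlsplit


def url_to_md_path(url: str) -> str:
    """Map a page URL to a local Markdown path (hypothetical mapping)."""
    path = urlsplit(url).path
    if path == "" or path.endswith("/"):
        path += "index.html"  # assumption: directory-style URLs become index.md
    root, _ext = posixpath.splitext(path)
    return root.lstrip("/") + ".md"


def rewrite_link(page_url: str, href: str, allowed_domain: str) -> str:
    """Rewrite an internal href to a relative .md path; keep other links as-is."""
    absolute = urljoin(page_url, href)
    parts = urlsplit(absolute)
    if parts.netloc != allowed_domain:
        return href  # external or non-HTTP link (e.g. mailto:): unchanged
    target = url_to_md_path(absolute)
    current_dir = posixpath.dirname(url_to_md_path(page_url))
    relative = posixpath.relpath(target, start=current_dir or ".")
    if parts.fragment:
        relative += "#" + parts.fragment  # preserve in-page anchors
    return relative


print(rewrite_link(
    "https://example.com/docs/guide/intro.html",
    "../api/index.html#top",
    "example.com",
))
```

Computing the path relative to the current page's directory, rather than to the output root, keeps the links valid when the output folder is moved or opened from another location.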
Parameters
- start_urls: Starting URLs (comma-separated)
- allowed_domains: Allowed domains (comma-separated)
- deny_patterns: Regex deny patterns (comma-separated)
- allow_paths: Allowed path prefixes (comma-separated); only URLs whose path starts with one of these prefixes are processed
- body_selector: CSS selector for the main HTML content (default: "main, article, .content, .document, .body, body")
- output_dir: Output directory (default: "~/.config/doc_crawler/_docs/{domain_name}", where {domain_name} is extracted from start_urls)
- converter_engine: Converter engine (default: "markitdown"; alternative: "html2text")
- single_page: Single-page mode (default: "false"; set to "true" to crawl a single page without following links)
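As a rough illustration of how the comma-separated values and the default output directory might be handled, here is a minimal sketch. The helpers `split_csv` and `default_output_dir` are hypothetical names, and taking {domain_name} from the hostname of the first start URL is an assumption; the description only says it is extracted from start_urls.

```python
from pathlib import Path
from urllib.parse import urlsplit


def split_csv(value: str) -> list[str]:
    """Split a comma-separated option into trimmed, non-empty items."""
    return [item.strip() for item in value.split(",") if item.strip()]


def default_output_dir(start_urls: str) -> Path:
    """Build the default output directory for a set of start URLs.

    Assumption: {domain_name} is the hostname of the first start URL.
    """
    first_url = split_csv(start_urls)[0]
    domain_name = urlsplit(first_url).hostname or "unknown"
    return Path("~/.config/doc_crawler/_docs").expanduser() / domain_name


print(split_csv("/_sources/, /latest/"))
print(default_output_dir("https://docs.astral.sh/uv/"))
```

One consequence of comma-separated parsing like this is that a deny pattern containing a literal comma (for example a `{1,3}` quantifier) would be split into two patterns.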
Install via UV
uv tool install git+https://github.com/zwidny/doc_crawler.git
After installation, you can use the doc_crawler command from any directory.
Usage Examples
# Crawl AKShare documentation
doc_crawler --start-urls "https://akshare.akfamily.xyz" \
--allowed-domains "akshare.akfamily.xyz" \
--deny-patterns "/_sources/" \
--body-selector "main, article, .content, .document, .body" \
--output-dir "_docs/akshare_markdown"
Single-page mode
doc_crawler --start-urls "https://build123d.readthedocs.io/en/stable/examples_1.html" \
--single-page true \
--body-selector ".wy-nav-content" \
--output-dir "single_page_output"
Path whitelist filtering
doc_crawler --start-urls "https://opencode.ai/docs/zh-cn/" \
--allow-paths "/docs/zh-cn/" \
--body-selector "main, article, .content" \
--output-dir "_docs/opencode_docs_zh_cn"
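The combined effect of the domain, deny-pattern, and path-prefix options can be sketched as a single predicate. This is only an illustration under stated assumptions, not the crawler's actual logic: `should_crawl` is a hypothetical helper, and both the order of the checks and the choice to match deny patterns against the full URL are assumptions.

```python
import re
from urllib.parse import urlsplit


def should_crawl(url, allowed_domains, allow_paths, deny_patterns):
    """Return True if a URL passes all three filters (hypothetical sketch)."""
    parts = urlsplit(url)
    # Domain filter: an empty list means no restriction.
    if allowed_domains and parts.hostname not in allowed_domains:
        return False
    # Deny patterns: assumed to be regexes searched anywhere in the URL.
    if any(re.search(pattern, url) for pattern in deny_patterns):
        return False
    # Path whitelist: the URL path must start with one of the prefixes.
    if allow_paths and not any(parts.path.startswith(p) for p in allow_paths):
        return False
    return True


print(should_crawl(
    "https://opencode.ai/docs/zh-cn/cli/",
    ["opencode.ai"], ["/docs/zh-cn/"], [],
))
```

With the whitelist set to "/docs/zh-cn/", a page such as "https://opencode.ai/docs/en/" would be skipped because its path does not start with the allowed prefix.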
Crawl with html2text engine
doc_crawler --start-urls "https://akshare.akfamily.xyz/" \
--allowed-domains "akshare.akfamily.xyz" \
--deny-patterns "/_sources/" \
--body-selector "main, article, .content, .document, .body" \
--converter-engine "html2text" \
--output-dir "_docs/akshare_markdown_html2text"
More examples
# Crawl build123d docs
doc_crawler --start-urls "https://build123d.readthedocs.io/en/stable/" \
--deny-patterns "/_sources/,/latest/" \
--body-selector ".wy-nav-content" \
--output-dir "_docs/build123d"
# Crawl Docusaurus docs
doc_crawler --start-urls "https://docusaurus.io/docs" \
--allow-paths "/docs" \
--body-selector ".col.docItemCol_n6xZ" \
--output-dir "_docs/docusaurus"
# Crawl uv documentation
doc_crawler --start-urls "https://docs.astral.sh/uv/" \
--body-selector ".md-content" \
--output-dir "_docs/uv"
File details
Details for the file html_docs_crawler-0.1.0.tar.gz.
File metadata
- Download URL: html_docs_crawler-0.1.0.tar.gz
- Upload date:
- Size: 14.8 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.9.28 {"installer":{"name":"uv","version":"0.9.28","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"Ubuntu","version":"24.04","id":"noble","libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":null}
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 8bfb32e9e12afcad5d7742cf7ea7401ca49206c2ef9d6959b577cedce0cc19f9 |
| MD5 | dc5a49f05d5411de4284eef1e51012e9 |
| BLAKE2b-256 | 4ee9c122b6a01a0dde9044857c6cc41ae4f0f18f943cf4fcecb92c54508b4299 |
File details
Details for the file html_docs_crawler-0.1.0-py3-none-any.whl.
File metadata
- Download URL: html_docs_crawler-0.1.0-py3-none-any.whl
- Upload date:
- Size: 16.6 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.9.28 {"installer":{"name":"uv","version":"0.9.28","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"Ubuntu","version":"24.04","id":"noble","libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":null}
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | c0c1229cb406aba70033a0bb539212b8ffcd852b8b0ca7eee795150c1a319058 |
| MD5 | e0b6b3934d23b7fdf1e37f64f248a755 |
| BLAKE2b-256 | 200f14ded0c6f735474be33c57af9f3976ffabf12ffc1ec8eaa81cd829f5b0ca |