crawl2md

Crawl documentation websites and convert them to Markdown files.

crawl2md is a command-line tool that:

Crawls documentation websites using breadth-first search
Extracts the main content from each page
Converts HTML to clean Markdown
Adds optional Obsidian-compatible YAML frontmatter
Mirrors the URL structure as a local directory tree
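
As a rough illustration of the breadth-first crawl order listed above, here is a minimal sketch. It is not crawl2md's actual code: `bfs_crawl` and the in-memory `LINKS` graph are hypothetical stand-ins for fetching pages over HTTP and extracting their links.

```python
from collections import deque

# Hypothetical site: each URL maps to the links found on that page.
LINKS = {
    "https://docs.example.com/": [
        "https://docs.example.com/intro/",
        "https://docs.example.com/guide/",
    ],
    "https://docs.example.com/intro/": ["https://docs.example.com/guide/"],
    "https://docs.example.com/guide/": [
        "https://docs.example.com/guide/api/",
        "https://docs.example.com/",
    ],
    "https://docs.example.com/guide/api/": [],
}


def bfs_crawl(start, get_links, max_pages=None):
    """Return URLs in breadth-first visiting order, each visited once."""
    queue = deque([start])
    seen = {start}
    order = []
    while queue:
        if max_pages is not None and len(order) >= max_pages:
            break
        url = queue.popleft()
        order.append(url)
        for link in get_links(url):
            # Only enqueue URLs that have not been discovered before.
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return order


order = bfs_crawl("https://docs.example.com/", lambda u: LINKS.get(u, []))
```

Pages closer to the start URL are visited first, so a page limit cuts off the deepest pages rather than an arbitrary branch.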

Installation

Using pip

pip install .

Using pipx (recommended for CLI tools)

pipx install .

Development install

make dev-install
# or
pip install -e .

Installing the man page

sudo make man-install

Quick Start

Crawl an entire documentation site:

crawl2md -s https://docs.example.com/

Crawl only a specific section:

crawl2md -s https://docs.example.com/tutorial/ -p /tutorial

Crawl with custom output directory and tags:

crawl2md -s https://docs.example.com/ -o my-docs -t docs reference

Crawl with plain output (disables default TUI):

crawl2md -s https://docs.example.com/ --no-tui

Crawl with deduplication to skip duplicate content:

crawl2md -s https://docs.example.com/ --dedupe

Usage

crawl2md [options]

Options:
  -s, --start URL           Starting URL to crawl (required)
  -b, --base BASE_URL       Base URL to constrain crawling
  -o, --output DIR          Output directory (default: docs-md)
  -t, --tags TAG [TAG ...]  Tags for YAML frontmatter
  -p, --restrict-prefix PREFIX  Only crawl paths starting with PREFIX
  -e, --exclude-patterns PATTERN [...]  Exclude URLs matching patterns
  -d, --delay SECONDS       Delay between requests (default: 0.3)
  -m, --max-pages N         Maximum pages to process
  --no-frontmatter          Disable YAML frontmatter
  --user-agent STRING       Custom User-Agent header
  -v, --verbose             Enable verbose logging
  --no-tui                  Disable TUI, use plain output (TUI is default)
  --dedupe                  Enable content deduplication
  --scroll-lines N          Lines to scroll per keypress in TUI (default: auto)
  --max-log-lines N         Maximum log buffer size in TUI (default: unlimited)
  --version                 Show version and exit
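
The scoping options (`--base`, `--restrict-prefix`, `--exclude-patterns`) can be pictured as a filter applied to every discovered URL. The sketch below is an assumption about their semantics, not the tool's implementation: `should_crawl` is a hypothetical helper, the base URL is matched as a plain prefix, and exclude patterns are treated as plain substrings (the real tool may use regular expressions or globs).

```python
from urllib.parse import urlparse


def should_crawl(url, base_url, restrict_prefix=None, exclude_patterns=()):
    """Decide whether a discovered URL is in scope (assumed semantics)."""
    # Assumption: the base URL constrains crawling by simple prefix match.
    if not url.startswith(base_url):
        return False
    path = urlparse(url).path
    # Assumption: the restrict prefix is compared against the URL path only.
    if restrict_prefix and not path.startswith(restrict_prefix):
        return False
    # Assumption: each exclude pattern is a plain substring of the URL.
    return not any(pattern in url for pattern in exclude_patterns)
```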

Output Format

Directory Structure

URLs are mapped to local files:

| URL | File |
| --- | --- |
| `https://example.com/` | `docs-md/index.md` |
| `https://example.com/intro/` | `docs-md/intro.md` |
| `https://example.com/guide/` | `docs-md/guide/index.md` |
| `https://example.com/guide/api/` | `docs-md/guide/api.md` |
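
One rule that reproduces the table above is: the site root becomes `index.md`, a page with known child pages becomes `<dir>/index.md`, and any other page becomes `<name>.md`. This rule is inferred from the four rows only and may not match what crawl2md actually does; `url_to_file` and its `known_urls` parameter are hypothetical.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def url_to_file(url, known_urls, output_dir="docs-md"):
    """Map a URL to a Markdown path under output_dir (inferred rule)."""
    path = urlparse(url).path.strip("/")
    if not path:
        # The site root maps to index.md at the top of the output tree.
        return str(PurePosixPath(output_dir) / "index.md")
    # Assumption: a page with known child pages becomes <dir>/index.md.
    prefix = path + "/"
    has_children = any(
        urlparse(other).path.strip("/").startswith(prefix) for other in known_urls
    )
    if has_children:
        return str(PurePosixPath(output_dir) / path / "index.md")
    return str(PurePosixPath(output_dir) / (path + ".md"))
```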

YAML Frontmatter

By default, each file includes Obsidian-compatible frontmatter:

---
title: "Page Title"
source: https://example.com/page/
created: 2025-12-01
tags:
  - docs
  - tutorial
---

Use --no-frontmatter to disable this.
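
To make the frontmatter layout concrete, the following sketch renders the block shown above from its fields. `build_frontmatter` is a hypothetical helper written for illustration, and the title escaping is an assumption about how quoting might be handled.

```python
from datetime import date


def build_frontmatter(title, source, tags, created=None):
    """Render a YAML frontmatter block like the one shown above."""
    created = created or date.today()
    # Escape backslashes and double quotes so the quoted title stays valid YAML.
    safe_title = title.replace("\\", "\\\\").replace('"', '\\"')
    lines = [
        "---",
        f'title: "{safe_title}"',
        f"source: {source}",
        f"created: {created.isoformat()}",
    ]
    if tags:
        lines.append("tags:")
        lines.extend(f"  - {tag}" for tag in tags)
    lines.append("---")
    return "\n".join(lines) + "\n"
```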

Interactive TUI Mode

The curses-based TUI provides real-time monitoring and control for long-running crawls. TUI mode is enabled by default; use --no-tui for plain output.

TUI Features

Real-time Statistics:

Pages processed, files saved, duplicates skipped, errors
Current URL being crawled
Queue size and elapsed time
Current crawl speed (delay between requests)

Interactive Controls:

| Key | Action | Description |
| --- | --- | --- |
| `q` | Quit | Stop crawl and exit |
| `p` | Pause/Resume | Pause or resume the crawl |
| `h` | Help | Toggle help overlay with all controls |
| `m` | Menu | Open mid-crawl configuration menu |
| `c` | Center | Re-center queue view on current item |
| `u` | URL Toggle | Switch between path-only and full URL display |
| `↑/↓` | Scroll | Scroll log window up/down by one line |
| `PgUp/PgDn` | Page Scroll | Scroll log window by half page |
| `Home/End` | Jump | Jump to oldest/newest logs |
| `Esc` | Close | Close help overlay or config menu |

Mouse Support:

Scroll wheel to scroll log window

Adaptive Layout:

Works on terminals as small as 1 line (graceful degradation)
Automatically adjusts panel visibility based on terminal size
Shows warning when terminal is too small for full view
Handles terminal resize without crashes

Error Handling:

Terminal always restored on exit (even on crashes)
Crawler errors displayed in overlay panel
Clean exit with 'q' even in error state

TUI Screenshot

[⠋] Pages: 1234 | Saved: 1180 | Dups: 54 | Errors: 0 | Queue: 23 | Elapsed: 05:32 | Speed: 0.5s
Current: ...example.com/docs/advanced/configuration#authentication
Fetching: https://example.com/docs/setup
  Saved: docs-md/setup.md
  Queued: https://example.com/docs/install
Fetching: https://example.com/docs/install
  DUPLICATE body: https://example.com/docs/install → same as docs-md/setup.md
q:quit  p:pause  c:center  u:url  m:menu  ↑/↓:scroll  h:help

Troubleshooting TUI

Terminal Issues:

If terminal appears broken after crash, run: reset
Ensure terminal supports Unicode (for spinner animation)
Minimum terminal size: 1 line (but 10+ lines recommended for full view)

Performance:

TUI updates at ~10 Hz (every 100ms)
No significant overhead on crawler performance
Safe to use on long-running crawls (hours+)

Content Deduplication

Skip saving duplicate content to avoid redundant files:

crawl2md -s https://docs.example.com/ --dedupe

How it works:

Computes a SHA-256 hash of the Markdown body (excluding frontmatter)
First occurrence is saved normally
Subsequent pages with identical content are skipped
Duplicates counter incremented in stats/logs
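
The steps above can be sketched in a few lines. The `Deduplicator` class is hypothetical and only illustrates the hashing idea; it assumes the caller passes the Markdown body without frontmatter, so that differing `source` or `created` fields do not hide duplicates.

```python
import hashlib


class Deduplicator:
    """Track SHA-256 hashes of Markdown bodies to detect repeats."""

    def __init__(self):
        self._seen = {}  # hash digest -> path of the first file saved
        self.duplicates = 0

    def check(self, body, path):
        """Return None if body is new, else the path of the first copy."""
        digest = hashlib.sha256(body.encode("utf-8")).hexdigest()
        first = self._seen.get(digest)
        if first is None:
            # First occurrence: remember it so later copies can be skipped.
            self._seen[digest] = path
            return None
        self.duplicates += 1
        return first
```

A caller would save the file when `check` returns `None` and otherwise log the duplicate together with the path of the first copy.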

Use cases:

Documentation sites with mirrors/aliases
Sites with "print" versions of pages
Multi-language sites with untranslated pages

Configuration

Configuration can be provided from the following sources, listed from highest to lowest priority:

CLI arguments (highest priority)
Environment variables
Config file (crawl2md.toml)
Built-in defaults (lowest priority)
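
This priority order amounts to a layered lookup in which the first source that sets a value wins. The sketch below shows one way to express that with `collections.ChainMap`; `resolve_config` and the `DEFAULTS` dictionary are hypothetical, with the default values taken from the option list where it states them and `dedupe` assumed to be off unless enabled.

```python
from collections import ChainMap

# Assumed defaults: output and delay come from the documented option list.
DEFAULTS = {"output": "docs-md", "delay": 0.3, "dedupe": False}


def resolve_config(cli, env, file_cfg, defaults=DEFAULTS):
    """Merge settings so that earlier sources win over later ones."""
    # Drop unset (None) values so they do not shadow lower-priority sources.
    layers = [
        {k: v for k, v in layer.items() if v is not None}
        for layer in (cli, env, file_cfg)
    ]
    return dict(ChainMap(*layers, defaults))
```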

Config File

Create crawl2md.toml in your working directory or ~/.config/crawl2md/:

[crawl2md]
start_url = "https://docs.example.com/"
output = "my-docs"
tags = ["docs", "reference"]
delay = 0.5
verbose = true
no_tui = false
dedupe = true
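
Reading such a file takes only the standard library on Python 3.11+, or the `tomli` backport on older versions. The sketch below is illustrative: `load_config` is a hypothetical helper, and returning the first file found among the searched directories is an assumption about lookup order.

```python
import sys
from pathlib import Path

if sys.version_info >= (3, 11):
    import tomllib
else:  # Python < 3.11 needs the tomli backport listed under Requirements
    import tomli as tomllib


def load_config(search_dirs):
    """Return the [crawl2md] table from the first crawl2md.toml found."""
    for directory in search_dirs:
        candidate = Path(directory) / "crawl2md.toml"
        if candidate.is_file():
            # TOML files must be opened in binary mode for tomllib.load.
            with candidate.open("rb") as handle:
                return tomllib.load(handle).get("crawl2md", {})
    return {}
```

A call such as `load_config([Path.cwd(), Path.home() / ".config" / "crawl2md"])` would check the working directory before the user configuration directory.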

Environment Variables

| Variable | Description |
| --- | --- |
| `CRAWL2MD_START_URL` | Starting URL |
| `CRAWL2MD_BASE_URL` | Base URL constraint |
| `CRAWL2MD_OUTPUT` | Output directory |
| `CRAWL2MD_TAGS` | Comma-separated tags |
| `CRAWL2MD_RESTRICT_PREFIX` | Path prefix filter |
| `CRAWL2MD_EXCLUDE_PATTERNS` | Comma-separated URL exclusion patterns |
| `CRAWL2MD_DELAY` | Request delay (seconds) |
| `CRAWL2MD_MAX_PAGES` | Max pages to process |
| `CRAWL2MD_NO_FRONTMATTER` | Disable frontmatter ("1" or "true") |
| `CRAWL2MD_USER_AGENT` | Custom User-Agent |
| `CRAWL2MD_VERBOSE` | Enable verbose mode ("1" or "true") |
| `CRAWL2MD_NO_TUI` | Disable TUI ("1" or "true") |
| `CRAWL2MD_DEDUPE` | Enable deduplication ("1" or "true") |
| `CRAWL2MD_SCROLL_LINES` | Lines to scroll per keypress in TUI |
| `CRAWL2MD_MAX_LOG_LINES` | Maximum log buffer size in TUI |
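
Environment values are always strings, so the comma-separated and boolean variables need a parsing step. The sketch below covers three of the variables; `parse_bool`, `parse_list`, and `read_env` are hypothetical helpers, and the case-insensitive, whitespace-tolerant matching is an assumption.

```python
def parse_bool(value):
    """Treat "1" or "true" as enabled, per the table above."""
    # Assumption: matching is case-insensitive and ignores whitespace.
    return value.strip().lower() in {"1", "true"}


def parse_list(value):
    """Split a comma-separated value, dropping empty items."""
    return [item.strip() for item in value.split(",") if item.strip()]


def read_env(environ):
    """Collect a few CRAWL2MD_* settings from a mapping such as os.environ."""
    settings = {}
    if "CRAWL2MD_TAGS" in environ:
        settings["tags"] = parse_list(environ["CRAWL2MD_TAGS"])
    if "CRAWL2MD_DELAY" in environ:
        settings["delay"] = float(environ["CRAWL2MD_DELAY"])
    if "CRAWL2MD_DEDUPE" in environ:
        settings["dedupe"] = parse_bool(environ["CRAWL2MD_DEDUPE"])
    return settings
```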

Development

Install dev dependencies:

pip install -e ".[dev]"

Run checks:

make format - Format code with Black
make lint - Lint with Ruff
make typecheck - Type check with mypy
make test - Run tests with pytest (tests are in tests/)
make check - Run all checks (format check, lint, typecheck)

Requirements

Python 3.9+
requests
beautifulsoup4
markdownify
tomli (Python < 3.11 only)

Limitations

Designed for static HTML documentation sites
Does not execute JavaScript (no headless browser)
Does not download images or rewrite internal links

License

MIT

This version

0.1.0

Dec 19, 2025

Download files

File details

Details for the file crawl2md-0.1.0.tar.gz.

File metadata

Download URL: crawl2md-0.1.0.tar.gz
Upload date: Dec 19, 2025
Size: 65.8 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.11.14

File hashes

Hashes for crawl2md-0.1.0.tar.gz
| Algorithm | Hash digest |
| --- | --- |
| SHA256 | `75788b609db8cd82068fd92ae6cc44cc5c129c238378a845589e4adab6a48833` |
| MD5 | `6e71e771786a74177f7269ea430f2308` |
| BLAKE2b-256 | `b55b4b9dafe0f75c5510442a383d3a777e8c6eba2e112d0fcfe227cd3635a7f9` |

File details

Details for the file crawl2md-0.1.0-py3-none-any.whl.

File metadata

Download URL: crawl2md-0.1.0-py3-none-any.whl
Upload date: Dec 19, 2025
Size: 56.4 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.11.14

File hashes

Hashes for crawl2md-0.1.0-py3-none-any.whl
| Algorithm | Hash digest |
| --- | --- |
| SHA256 | `3e935b2b9bb8c2ebe08eb1dac15018fea599acfb2ac61e57e61ec8d30678810e` |
| MD5 | `ae7001bab4d9b0e05d1c579e7203941a` |
| BLAKE2b-256 | `d56723975798a17c382950bca66bcf4f5b9bf3df4270e9a9ce6121423d4dc678` |
