crawl2md

Crawl documentation websites and convert them to Markdown files.

crawl2md is a command-line tool that:

  • Crawls documentation websites using breadth-first search
  • Extracts the main content from each page
  • Converts HTML to clean Markdown
  • Adds optional Obsidian-compatible YAML frontmatter
  • Mirrors the URL structure as a local directory tree
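
The breadth-first order named in the first bullet can be illustrated with a small sketch. This is not crawl2md's code: `bfs_order` and the in-memory `site` mapping are hypothetical stand-ins for fetching pages over HTTP and extracting their links.

```python
from collections import deque


def bfs_order(start, links, max_pages=None):
    """Return pages in breadth-first order starting from `start`.

    `links` maps each URL to the URLs it links to (a stand-in for
    fetching a page and extracting its links).
    """
    queue = deque([start])
    seen = {start}  # URLs already queued, so nothing is visited twice
    order = []
    while queue:
        if max_pages is not None and len(order) >= max_pages:
            break
        url = queue.popleft()  # FIFO pop is what makes this breadth-first
        order.append(url)
        for nxt in links.get(url, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return order


# Hypothetical site: each key links to the pages in its list.
site = {
    "/": ["/intro/", "/guide/"],
    "/intro/": ["/guide/"],
    "/guide/": ["/guide/api/", "/"],
}
print(bfs_order("/", site))
```

Pages closest to the start URL are processed first, so a page limit such as `max_pages` cuts off the deepest pages rather than an arbitrary branch.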

Installation

The commands below install from a source checkout; run them from the project root.

Using pip

pip install .

Using pipx (recommended for CLI tools)

pipx install .

Development install

make dev-install
# or
pip install -e .

Installing the man page

sudo make man-install

Quick Start

Crawl an entire documentation site:

crawl2md -s https://docs.example.com/

Crawl only a specific section:

crawl2md -s https://docs.example.com/tutorial/ -p /tutorial

Crawl with custom output directory and tags:

crawl2md -s https://docs.example.com/ -o my-docs -t docs reference

Crawl with plain output (disables default TUI):

crawl2md -s https://docs.example.com/ --no-tui

Crawl and skip pages whose content duplicates an earlier page:

crawl2md -s https://docs.example.com/ --dedupe

Usage

crawl2md [options]

Options:
  -s, --start URL                       Starting URL to crawl (required)
  -b, --base BASE_URL                   Base URL to constrain crawling
  -o, --output DIR                      Output directory (default: docs-md)
  -t, --tags TAG [TAG ...]              Tags for YAML frontmatter
  -p, --restrict-prefix PREFIX          Only crawl paths starting with PREFIX
  -e, --exclude-patterns PATTERN [...]  Exclude URLs matching patterns
  -d, --delay SECONDS                   Delay between requests (default: 0.3)
  -m, --max-pages N                     Maximum pages to process
  --no-frontmatter                      Disable YAML frontmatter
  --user-agent STRING                   Custom User-Agent header
  -v, --verbose                         Enable verbose logging
  --no-tui                              Disable TUI, use plain output (TUI is default)
  --dedupe                              Enable content deduplication
  --scroll-lines N                      Lines to scroll per keypress in TUI (default: auto)
  --max-log-lines N                     Maximum log buffer size in TUI (default: unlimited)
  --version                             Show version and exit
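
How the scoping options (-b, -p, -e) might combine can be sketched as below. The helper `should_crawl` is hypothetical, and two behaviors are assumptions rather than documented facts: that the base URL constrains by plain URL prefix, and that an exclusion pattern matches when it appears anywhere in the URL. The real tool may use different matching rules.

```python
from urllib.parse import urlparse


def should_crawl(url, base_url, restrict_prefix=None, exclude_patterns=()):
    """Decide whether `url` is in scope (hypothetical helper)."""
    # Assumption: the base URL constrains crawling by simple URL prefix.
    if not url.startswith(base_url):
        return False
    # --restrict-prefix: compare against the path component only.
    if restrict_prefix and not urlparse(url).path.startswith(restrict_prefix):
        return False
    # Assumption: a pattern "matches" when it occurs anywhere in the URL.
    return not any(pattern in url for pattern in exclude_patterns)


base = "https://docs.example.com/"
print(should_crawl("https://docs.example.com/tutorial/intro/", base, "/tutorial"))
print(should_crawl("https://docs.example.com/api/", base, "/tutorial"))
```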

Output Format

Directory Structure

URLs are mapped to local files:

URL                             File
https://example.com/            docs-md/index.md
https://example.com/intro/      docs-md/intro.md
https://example.com/guide/      docs-md/guide/index.md
https://example.com/guide/api/  docs-md/guide/api.md
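
The table does not spell out the mapping rule. One reading that reproduces all four rows is: the site root becomes index.md, a page with other pages beneath it becomes <dir>/index.md, and a leaf page becomes <name>.md. The sketch below implements that reading only; `url_to_file` and its `all_urls` argument are hypothetical, and crawl2md may decide this differently.

```python
from urllib.parse import urlparse


def url_to_file(url, all_urls, output_dir="docs-md"):
    """Map a URL to a Markdown path (one reading of the table above)."""
    path = urlparse(url).path.strip("/")
    if not path:
        # Site root becomes index.md at the top of the output directory.
        return f"{output_dir}/index.md"
    # Assumption: a page that has other pages beneath it is written as
    # <dir>/index.md, while a leaf page is written as <name>.md.
    prefix = path + "/"
    has_children = any(
        urlparse(other).path.strip("/").startswith(prefix) for other in all_urls
    )
    if has_children:
        return f"{output_dir}/{path}/index.md"
    return f"{output_dir}/{path}.md"


urls = [
    "https://example.com/",
    "https://example.com/intro/",
    "https://example.com/guide/",
    "https://example.com/guide/api/",
]
for u in urls:
    print(u, "->", url_to_file(u, urls))
```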

YAML Frontmatter

By default, each file includes Obsidian-compatible frontmatter:

---
title: "Page Title"
source: https://example.com/page/
created: 2025-12-01
tags:
  - docs
  - tutorial
---

Use --no-frontmatter to disable this.
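
For readers who want to produce compatible files from their own scripts, a frontmatter block of the shape shown above can be generated with the standard library alone. This is a minimal sketch, not crawl2md's implementation; `render_frontmatter` is a hypothetical name, and the title escaping is an assumption about what keeps the YAML valid.

```python
from datetime import date


def render_frontmatter(title, source, tags, created=None):
    """Build a YAML frontmatter block shaped like the example above."""
    created = created or date.today()
    # Escape backslashes and double quotes so the quoted title stays valid YAML.
    safe_title = title.replace("\\", "\\\\").replace('"', '\\"')
    lines = [
        "---",
        f'title: "{safe_title}"',
        f"source: {source}",
        f"created: {created.isoformat()}",
    ]
    if tags:
        lines.append("tags:")
        lines.extend(f"  - {tag}" for tag in tags)
    lines.append("---")
    return "\n".join(lines) + "\n"


print(render_frontmatter(
    "Page Title",
    "https://example.com/page/",
    ["docs", "tutorial"],
    created=date(2025, 12, 1),
))
```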

Interactive TUI Mode

The curses-based TUI provides real-time monitoring and control for long-running crawls. TUI mode is enabled by default; use --no-tui for plain output.

TUI Features

Real-time Statistics:

  • Pages processed, files saved, duplicates skipped, errors
  • Current URL being crawled
  • Queue size and elapsed time
  • Current crawl speed (delay between requests)

Interactive Controls:

Key        Action        Description
q          Quit          Stop crawl and exit
p          Pause/Resume  Pause or resume the crawl
h          Help          Toggle help overlay with all controls
m          Menu          Open mid-crawl configuration menu
c          Center        Re-center queue view on current item
u          URL Toggle    Switch between path-only and full URL display
↑/↓        Scroll        Scroll log window up/down by one line
PgUp/PgDn  Page Scroll   Scroll log window by half page
Home/End   Jump          Jump to oldest/newest logs
Esc        Close         Close help overlay or config menu

Mouse Support:

  • Scroll wheel to scroll log window

Adaptive Layout:

  • Works on terminals as small as 1 line (graceful degradation)
  • Automatically adjusts panel visibility based on terminal size
  • Shows warning when terminal is too small for full view
  • Handles terminal resize without crashes

Error Handling:

  • Terminal always restored on exit (even on crashes)
  • Crawler errors displayed in overlay panel
  • Clean exit with 'q' even in error state

TUI Screenshot

[⠋] Pages: 1234 | Saved: 1180 | Dups: 54 | Errors: 0 | Queue: 23 | Elapsed: 05:32 | Speed: 0.5s
Current: ...example.com/docs/advanced/configuration#authentication
Fetching: https://example.com/docs/setup
  Saved: docs-md/setup.md
  Queued: https://example.com/docs/install
Fetching: https://example.com/docs/install
  DUPLICATE body: https://example.com/docs/install → same as docs-md/setup.md
q:quit  p:pause  c:center  u:url  m:menu  ↑/↓:scroll  h:help

Troubleshooting TUI

Terminal Issues:

  • If the terminal appears broken after a crash, run: reset
  • Ensure terminal supports Unicode (for spinner animation)
  • Minimum terminal size: 1 line (but 10+ lines recommended for full view)

Performance:

  • TUI updates at ~10 Hz (every 100ms)
  • No significant overhead on crawler performance
  • Safe to use on long-running crawls (hours+)

Content Deduplication

Skip saving duplicate content to avoid redundant files:

crawl2md -s https://docs.example.com/ --dedupe

How it works:

  • Computes SHA256 hash of markdown body (excluding frontmatter)
  • First occurrence is saved normally
  • Subsequent pages with identical content are skipped
  • The duplicates counter is incremented in the stats and logs
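
The steps above can be sketched in a few lines. This is an illustration, not crawl2md's implementation: the `Deduper` class and its method names are hypothetical. The caller is expected to pass the Markdown body before any frontmatter is prepended, so that pages differing only in title or source URL still count as duplicates.

```python
import hashlib


class Deduper:
    """Track SHA256 digests of Markdown bodies that were already saved."""

    def __init__(self):
        self.seen = {}       # digest -> path of the first file saved
        self.duplicates = 0  # number of pages skipped so far

    def check(self, body, path):
        """Return None if `body` is new, else the path it duplicates."""
        digest = hashlib.sha256(body.encode("utf-8")).hexdigest()
        first = self.seen.get(digest)
        if first is None:
            self.seen[digest] = path  # first occurrence: remember and save
            return None
        self.duplicates += 1          # identical body: skip and count it
        return first


deduper = Deduper()
print(deduper.check("# Setup\n", "docs-md/setup.md"))
print(deduper.check("# Setup\n", "docs-md/install.md"))
```

Keeping the path of the first occurrence makes it possible to log which file a skipped page matched, as in the "same as docs-md/setup.md" line of the TUI screenshot.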

Use cases:

  • Documentation sites with mirrors/aliases
  • Sites with "print" versions of pages
  • Multi-language sites with untranslated pages

Configuration

Configuration can be provided via:

  1. CLI arguments (highest priority)
  2. Environment variables
  3. Config file (crawl2md.toml)
  4. Built-in defaults (lowest priority)
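
The priority order can be modeled as a layered lookup in which the first source that defines a key wins. The sketch below uses `collections.ChainMap` for this; it shows the precedence idea only and is not how crawl2md is known to be written. The `resolve` helper and the sample values are hypothetical, except for the defaults `docs-md` and `0.3`, which come from the Options list; a default of `False` for dedupe is an assumption.

```python
from collections import ChainMap


def resolve(cli_args, environment, config_file, defaults):
    """Combine the four sources; earlier mappings take priority."""
    # ChainMap searches its mappings left to right and returns the first hit.
    return ChainMap(cli_args, environment, config_file, defaults)


defaults = {"output": "docs-md", "delay": 0.3, "dedupe": False}
config_file = {"output": "my-docs", "delay": 0.5}
environment = {"delay": 1.0}
cli_args = {"dedupe": True}

settings = resolve(cli_args, environment, config_file, defaults)
print(settings["output"], settings["delay"], settings["dedupe"])
```

Here `output` comes from the config file, `delay` from the environment (which outranks the config file), and `dedupe` from the command line.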

Config File

Create crawl2md.toml in your working directory or ~/.config/crawl2md/:

[crawl2md]
start_url = "https://docs.example.com/"
output = "my-docs"
tags = ["docs", "reference"]
delay = 0.5
verbose = true
no_tui = false
dedupe = true
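
A file in this format can be read from Python with `tomllib` (standard library from Python 3.11) or the `tomli` backport listed under Requirements. The sketch below parses an inline copy of part of the example; `load_settings` is a hypothetical helper, not part of crawl2md's API.

```python
try:
    import tomllib  # standard library on Python 3.11+
except ModuleNotFoundError:
    import tomli as tomllib  # backport with the same interface

EXAMPLE = """
[crawl2md]
start_url = "https://docs.example.com/"
output = "my-docs"
tags = ["docs", "reference"]
delay = 0.5
dedupe = true
"""


def load_settings(text):
    """Parse TOML text and return the [crawl2md] table (empty if absent)."""
    return tomllib.loads(text).get("crawl2md", {})


settings = load_settings(EXAMPLE)
print(settings["output"], settings["tags"], settings["delay"])
```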

Environment Variables

Variable                   Description
CRAWL2MD_START_URL         Starting URL
CRAWL2MD_BASE_URL          Base URL constraint
CRAWL2MD_OUTPUT            Output directory
CRAWL2MD_TAGS              Comma-separated tags
CRAWL2MD_RESTRICT_PREFIX   Path prefix filter
CRAWL2MD_EXCLUDE_PATTERNS  Comma-separated URL exclusion patterns
CRAWL2MD_DELAY             Request delay (seconds)
CRAWL2MD_MAX_PAGES         Max pages to process
CRAWL2MD_NO_FRONTMATTER    Disable frontmatter ("1" or "true")
CRAWL2MD_USER_AGENT        Custom User-Agent
CRAWL2MD_VERBOSE           Enable verbose mode ("1" or "true")
CRAWL2MD_NO_TUI            Disable TUI ("1" or "true")
CRAWL2MD_DEDUPE            Enable deduplication ("1" or "true")
CRAWL2MD_SCROLL_LINES      Lines to scroll per keypress in TUI
CRAWL2MD_MAX_LOG_LINES     Maximum log buffer size in TUI
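
Because environment variables are always strings, the boolean and comma-separated values need converting before use. The following sketch shows one way to do that for three of the variables. The helper names are hypothetical, and ignoring case and surrounding whitespace is an assumption; the table only states that "1" or "true" enables a flag.

```python
def parse_bool(value):
    """Treat "1" or "true" as enabled, as the table describes."""
    # Assumption: matching ignores case and surrounding whitespace.
    return value.strip().lower() in {"1", "true"}


def parse_list(value):
    """Split a comma-separated value, dropping empty entries."""
    return [item.strip() for item in value.split(",") if item.strip()]


def settings_from_env(env):
    """Read a few CRAWL2MD_* variables from a mapping such as os.environ."""
    settings = {}
    if "CRAWL2MD_TAGS" in env:
        settings["tags"] = parse_list(env["CRAWL2MD_TAGS"])
    if "CRAWL2MD_DELAY" in env:
        settings["delay"] = float(env["CRAWL2MD_DELAY"])
    if "CRAWL2MD_DEDUPE" in env:
        settings["dedupe"] = parse_bool(env["CRAWL2MD_DEDUPE"])
    return settings


sample = {
    "CRAWL2MD_TAGS": "docs, reference",
    "CRAWL2MD_DELAY": "0.5",
    "CRAWL2MD_DEDUPE": "true",
}
print(settings_from_env(sample))
```

Taking the environment as a plain mapping keeps the function easy to test; in a real program, `os.environ` can be passed in directly.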

Development

Install dev dependencies:

pip install -e ".[dev]"

Run checks:

  • make format - Format code with Black
  • make lint - Lint with Ruff
  • make typecheck - Type check with mypy
  • make test - Run tests with pytest (tests are in tests/)
  • make check - Run all checks (format check, lint, typecheck)

Requirements

  • Python 3.9+
  • requests
  • beautifulsoup4
  • markdownify
  • tomli (Python < 3.11 only)

Limitations

  • Designed for static HTML documentation sites
  • Does not execute JavaScript (no headless browser)
  • Does not download images or rewrite internal links

License

MIT
