Convert web pages and HTML to clean Markdown

crawl4md

crawl4md is a minimal, clean CLI tool that crawls web pages or sitemaps and converts them into structured Markdown files.

The project is intentionally designed to stay simple, deterministic, and easy to extend — without unnecessary complexity or hidden behavior.


Philosophy

  • Minimal: only what is needed, nothing more
  • Deterministic: same input → same output
  • Transparent: no magic, clear processing steps
  • Composable: ideal as a building block for pipelines (e.g. RAG)

Features

  • Crawl from:
    • sitemap.xml
    • explicit page lists
  • Clean Markdown output (via crawl4ai, markdown-fit mode)
  • Deterministic file structure based on URL paths
  • YAML-based project configuration
  • CLI-first workflow (uv-compatible)
  • Clear, readable progress output
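
Sitemap-driven crawling ultimately reduces to collecting the <loc> URLs from a sitemap.xml document. As a rough illustration (this is a hypothetical stdlib sketch, not crawl4md's actual sitemap code), the extraction step could look like:

```python
# Hypothetical sketch: pull page URLs out of a sitemap.xml document.
# crawl4md's own sitemap handling may differ in detail.
import xml.etree.ElementTree as ET

SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def extract_urls(sitemap_xml: str) -> list[str]:
    """Return all <loc> URLs from a sitemap document."""
    # Encode to bytes so documents with an XML encoding declaration parse cleanly.
    root = ET.fromstring(sitemap_xml.encode("utf-8"))
    return [loc.text for loc in root.iter(f"{SITEMAP_NS}loc")]

example = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://pydantic.dev/</loc></url>
  <url><loc>https://pydantic.dev/articles</loc></url>
</urlset>"""

print(extract_urls(example))
# ['https://pydantic.dev/', 'https://pydantic.dev/articles']
```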

Installation

There are two ways to use crawl4md.

Use the Batch Crawler

If you want to use the project directly for batch crawling via crawl.yml, clone the repository:

git clone git@github.com:ixnode/crawl4md.git && cd crawl4md

Then continue with the configuration section below.

Use the Python Package

If you want to build your own tooling on top of crawl4md, install it as a package:

pip install crawl4md

Or with uv:

uv add crawl4md

For local development inside the repository:

uv sync

Configuration

The CLI reads a crawl.yml file from the current working directory.

Create it from the example:

cp crawl.yml.example crawl.yml

Minimal example:

projects:
    planes:
        type: pages
        crawl:
            parse_type: markdown-fit
        sources:
            - https://de.wikipedia.org/wiki/Boeing_707
            - https://de.wikipedia.org/wiki/Boeing_717
        preprocessing:
            markdown:
                enabled: true
                remove_html_comments: true
                normalize_whitespace: true

    pydantic:
        type: sitemap
        crawl:
            parse_type: markdown-fit
        sources:
            - https://pydantic.dev/sitemap.xml
        preprocessing:
            markdown:
                enabled: false

Available project settings:

  • type: pages or sitemap
  • sources: list of page URLs or sitemap URLs
  • crawl.parse_type: markdown or markdown-fit
  • preprocessing.markdown.enabled: enables Markdown cleanup
  • preprocessing.markdown.*: optional cleanup rules such as ensure_h1, remove_html_comments, remove_reference_sections, and normalize_whitespace

For the full configuration, see crawl.yml.example.
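
If you consume a parsed crawl.yml in your own tooling, the settings above imply a small validation step. A minimal sketch, assuming the config has already been parsed into a dict (the `validate_project` helper is hypothetical, not part of crawl4md):

```python
# Hypothetical validator for one parsed crawl.yml project entry.
# crawl4md performs its own config loading; this only illustrates
# the expected shape of the settings listed above.

VALID_TYPES = {"pages", "sitemap"}
VALID_PARSE_TYPES = {"markdown", "markdown-fit"}

def validate_project(name: str, project: dict) -> list[str]:
    """Return human-readable problems; an empty list means the entry looks valid."""
    problems = []
    if project.get("type") not in VALID_TYPES:
        problems.append(f"{name}: type must be 'pages' or 'sitemap'")
    if not project.get("sources"):
        problems.append(f"{name}: sources must list at least one URL")
    parse_type = project.get("crawl", {}).get("parse_type")
    if parse_type not in VALID_PARSE_TYPES:
        problems.append(f"{name}: crawl.parse_type must be 'markdown' or 'markdown-fit'")
    return problems

planes = {
    "type": "pages",
    "crawl": {"parse_type": "markdown-fit"},
    "sources": ["https://de.wikipedia.org/wiki/Boeing_707"],
}
print(validate_project("planes", planes))  # []
```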


Usage

After cloning the repository and creating crawl.yml, use:

crawl planes
crawl pydantic

Or with uv inside the project:

uv run crawl planes
uv run crawl pydantic

Python API

crawl4md can also be used as a Python package.

The public classes are:

  • MarkdownFetcher
  • MarkdownConverter
  • ParseType
  • MarkdownPreprocessingConfig

Configure Parse Type

Use ParseType to control how Markdown is generated:

  • "markdown": raw markdown output
  • "markdown-fit": cleaned and reduced markdown output via crawl4ai

Configure Preprocessing

Use MarkdownPreprocessingConfig to enable optional cleanup steps.

Simple example:

from crawl4md import MarkdownPreprocessingConfig

config = MarkdownPreprocessingConfig(
    enabled=True,
    remove_html_comments=True,
    normalize_whitespace=True,
)
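
To make the individual flags concrete, here is what `remove_html_comments` and `normalize_whitespace` conceptually do. This is a hypothetical stdlib sketch of the cleanup rules, not crawl4md's actual preprocessing code:

```python
# Conceptual sketch of two cleanup rules; crawl4md's implementation may differ.
import re

def remove_html_comments(markdown: str) -> str:
    """Strip leftover <!-- ... --> comments from converted Markdown."""
    return re.sub(r"<!--.*?-->", "", markdown, flags=re.DOTALL)

def normalize_whitespace(markdown: str) -> str:
    """Trim trailing spaces and collapse runs of blank lines to a single one."""
    lines = [line.rstrip() for line in markdown.splitlines()]
    text = "\n".join(lines)
    return re.sub(r"\n{3,}", "\n\n", text).strip() + "\n"

raw = "# Title  \n\n\n<!-- nav -->\nBody text\n"
print(normalize_whitespace(remove_html_comments(raw)))
```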

Fetch Markdown From a URL

Use MarkdownFetcher if you want to fetch a page and directly receive Markdown.

from crawl4md import MarkdownFetcher, MarkdownPreprocessingConfig

config = MarkdownPreprocessingConfig(enabled=True)
fetcher = MarkdownFetcher(config=config, parse_type="markdown-fit")

markdown = fetcher.fetch_sync("https://example.com")
print(markdown)

Async version:

import asyncio

from crawl4md import MarkdownFetcher, MarkdownPreprocessingConfig

config = MarkdownPreprocessingConfig(enabled=True)
fetcher = MarkdownFetcher(config=config, parse_type="markdown-fit")

markdown = asyncio.run(fetcher.fetch("https://example.com"))
print(markdown)

Convert HTML to Markdown

Use MarkdownConverter if you already have HTML and only want the conversion step.

from crawl4md import MarkdownConverter, MarkdownPreprocessingConfig

html = "<html><body><h1>Hello</h1><p>World</p></body></html>"

config = MarkdownPreprocessingConfig(enabled=True, ensure_h1=True)
converter = MarkdownConverter(config=config, parse_type="markdown")

markdown = converter.convert_sync(html=html, url="https://example.com")
print(markdown)

Async version:

import asyncio

from crawl4md import MarkdownConverter, MarkdownPreprocessingConfig

html = "<html><body><h1>Hello</h1><p>World</p></body></html>"

config = MarkdownPreprocessingConfig(enabled=True, ensure_h1=True)
converter = MarkdownConverter(config=config, parse_type="markdown")

markdown = asyncio.run(
    converter.convert(html=html, url="https://example.com")
)
print(markdown)

Output Structure

Markdown files are stored deterministically based on the URL path:

docs/<project>/<url-path>.md

Example:

docs/planes/wiki/Boeing_707.md

Rules:

  • Domain is ignored
  • URL path is preserved
  • Root paths (/) are written as index.md
  • Query parameters are ignored
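
These rules can be sketched in a few lines. Note this is an illustrative stand-in for crawl4md's path mapping (the `output_path` helper is hypothetical), not the library's actual code:

```python
# Hypothetical sketch of the URL-to-file mapping described above.
from urllib.parse import urlsplit

def output_path(project: str, url: str) -> str:
    """Map a crawled URL to its Markdown output path (domain and query ignored)."""
    path = urlsplit(url).path.strip("/")
    if not path:
        return f"docs/{project}/index.md"
    return f"docs/{project}/{path}.md"

print(output_path("planes", "https://de.wikipedia.org/wiki/Boeing_707"))
# docs/planes/wiki/Boeing_707.md
```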

Example Output

1/2 Crawl https://de.wikipedia.org/wiki/Boeing_707
- Fetching ... done
- Processing ... done
- Writing docs/planes/wiki/Boeing_707.md ... done

Use Cases

  • RAG data ingestion
  • Website snapshotting
  • Knowledge base generation
  • Offline documentation

Project Structure

src/crawl4md/
├─ cli.py
├─ config.py
├─ sitemap.py
├─ crawler.py
├─ paths.py
└─ writer.py

Notes

  • No recursive crawling (by design)
  • No hidden caching or transformations
  • Focus on clean Markdown output only

License

This project is licensed under the MIT License. See the LICENSE.md file for details.


Built on top of crawl4ai

This project builds on the excellent crawl4ai library and extends it with a simpler batch-oriented workflow for repeatable Markdown exports.

Why use crawl4md as a complement to crawl4ai:

  • project-based batch crawling via crawl.yml
  • support for both page lists and sitemap-driven crawls
  • deterministic output paths for generated Markdown files
  • optional Markdown cleanup rules for better downstream text quality
  • a small CLI and Python API focused on URL or HTML to Markdown workflows
  • clearer separation between fetching, conversion, preprocessing, and writing

In short: crawl4ai provides the powerful crawling and Markdown generation foundation, while crawl4md adds a lightweight structure around it for batch jobs, cleaner output, and easier integration into documentation or RAG pipelines.
