# crawl4md

> Convert web pages and HTML to clean Markdown

crawl4md is a minimal, clean CLI tool that crawls web pages or sitemaps and converts them into structured Markdown files.
The project is intentionally designed to stay simple, deterministic, and easy to extend, without unnecessary complexity or hidden behavior.
## Philosophy

- Minimal: only what is needed, nothing more
- Deterministic: same input → same output
- Transparent: no magic, clear processing steps
- Composable: ideal as a building block for pipelines (e.g. RAG)
## Features

- Crawl from `sitemap.xml` or explicit page lists
- Clean Markdown output (via `crawl4ai`, markdown-fit mode)
- Deterministic file structure based on URL paths
- YAML-based project configuration
- CLI-first workflow (uv-compatible)
- Clear, readable progress output
## Installation

There are two ways to use crawl4md.

### Use the Batch Crawler

If you want to use the project directly for batch crawling via `crawl.yml`, clone the repository:

```bash
git clone git@github.com:ixnode/crawl4md.git && cd crawl4md
```

Then continue with the configuration section below.

### Use the Python Package

If you want to build your own tooling on top of crawl4md, install it as a package:

```bash
pip install crawl4md
```

Or with uv:

```bash
uv add crawl4md
```

For local development inside the repository:

```bash
uv sync
```
## Configuration

The CLI reads a `crawl.yml` file from the current working directory.
Create it from the example:

```bash
cp crawl.yml.example crawl.yml
```

Minimal example:
```yaml
projects:
  planes:
    type: pages
    crawl:
      parse_type: markdown-fit
    sources:
      - https://de.wikipedia.org/wiki/Boeing_707
      - https://de.wikipedia.org/wiki/Boeing_717
    preprocessing:
      markdown:
        enabled: true
        remove_html_comments: true
        normalize_whitespace: true
  pydantic:
    type: sitemap
    crawl:
      parse_type: markdown-fit
    sources:
      - https://pydantic.dev/sitemap.xml
    preprocessing:
      markdown:
        enabled: false
```
Available project settings:

- `type`: `pages` or `sitemap`
- `sources`: list of page URLs or sitemap URLs
- `crawl.parse_type`: `markdown` or `markdown-fit`
- `preprocessing.markdown.enabled`: enables Markdown cleanup
- `preprocessing.markdown.*`: optional cleanup rules such as `ensure_h1`, `remove_html_comments`, `remove_reference_sections`, and `normalize_whitespace`

For the full configuration, see `crawl.yml.example`.
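As a sanity check before running a crawl, the settings above can be validated once the YAML is parsed into a plain dictionary. The following sketch is illustrative only: `validate_project` and its rules are hypothetical helpers mirroring the documented settings, not part of the crawl4md API.

```python
# Sketch: validating a parsed crawl.yml structure (illustrative, not crawl4md's own code).
VALID_TYPES = {"pages", "sitemap"}
VALID_PARSE_TYPES = {"markdown", "markdown-fit"}

def validate_project(name: str, project: dict) -> list[str]:
    """Return a list of human-readable problems (empty list = valid)."""
    problems = []
    if project.get("type") not in VALID_TYPES:
        problems.append(f"{name}: 'type' must be one of {sorted(VALID_TYPES)}")
    parse_type = project.get("crawl", {}).get("parse_type")
    if parse_type not in VALID_PARSE_TYPES:
        problems.append(f"{name}: 'crawl.parse_type' must be one of {sorted(VALID_PARSE_TYPES)}")
    if not project.get("sources"):
        problems.append(f"{name}: 'sources' must list at least one URL")
    return problems

# A dict mirroring the "planes" project from the YAML example above.
config = {
    "projects": {
        "planes": {
            "type": "pages",
            "crawl": {"parse_type": "markdown-fit"},
            "sources": ["https://de.wikipedia.org/wiki/Boeing_707"],
        }
    }
}

for name, project in config["projects"].items():
    assert validate_project(name, project) == []
```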
## Usage

After cloning the repository and creating `crawl.yml`, use:

```bash
crawl planes
crawl pydantic
```

Or with uv inside the project:

```bash
uv run crawl planes
uv run crawl pydantic
```
## Python API

crawl4md can also be used as a Python package.
The public classes are:

- `MarkdownFetcher`
- `MarkdownConverter`
- `ParseType`
- `MarkdownPreprocessingConfig`
### Configure Parse Type

Use `ParseType` to control how Markdown is generated:

- `"markdown"`: raw markdown output
- `"markdown-fit"`: cleaned and reduced markdown output via `crawl4ai`
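Since the two parse types are plain strings, a defensive caller can reject anything else up front. The `Literal` alias and `as_parse_type` helper below are hypothetical, shown only to model the two documented values; crawl4md's actual `ParseType` definition may differ.

```python
from typing import Literal, get_args

# Illustrative sketch only: models the two documented parse types as a Literal.
# This is NOT crawl4md's own ParseType definition.
ParseTypeSketch = Literal["markdown", "markdown-fit"]

def as_parse_type(value: str) -> str:
    """Reject anything other than the two documented parse types."""
    if value not in get_args(ParseTypeSketch):
        raise ValueError(f"unknown parse type: {value!r}")
    return value

assert as_parse_type("markdown-fit") == "markdown-fit"
```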
### Configure Preprocessing

Use `MarkdownPreprocessingConfig` to enable optional cleanup steps.
Simple example:

```python
from crawl4md import MarkdownPreprocessingConfig

config = MarkdownPreprocessingConfig(
    enabled=True,
    remove_html_comments=True,
    normalize_whitespace=True,
)
```
### Fetch Markdown From a URL

Use `MarkdownFetcher` if you want to fetch a page and directly receive Markdown.

```python
from crawl4md import MarkdownFetcher, MarkdownPreprocessingConfig

config = MarkdownPreprocessingConfig(enabled=True)
fetcher = MarkdownFetcher(config=config, parse_type="markdown-fit")

markdown = fetcher.fetch_sync("https://example.com")
print(markdown)
```
Async version:

```python
import asyncio

from crawl4md import MarkdownFetcher, MarkdownPreprocessingConfig

config = MarkdownPreprocessingConfig(enabled=True)
fetcher = MarkdownFetcher(config=config, parse_type="markdown-fit")

markdown = asyncio.run(fetcher.fetch("https://example.com"))
print(markdown)
```
### Convert HTML to Markdown

Use `MarkdownConverter` if you already have HTML and only want the conversion step.

```python
from crawl4md import MarkdownConverter, MarkdownPreprocessingConfig

html = "<html><body><h1>Hello</h1><p>World</p></body></html>"

config = MarkdownPreprocessingConfig(enabled=True, ensure_h1=True)
converter = MarkdownConverter(config=config, parse_type="markdown")

markdown = converter.convert_sync(html=html, url="https://example.com")
print(markdown)
```
Async version:

```python
import asyncio

from crawl4md import MarkdownConverter, MarkdownPreprocessingConfig

html = "<html><body><h1>Hello</h1><p>World</p></body></html>"

config = MarkdownPreprocessingConfig(enabled=True, ensure_h1=True)
converter = MarkdownConverter(config=config, parse_type="markdown")

markdown = asyncio.run(
    converter.convert(html=html, url="https://example.com")
)
print(markdown)
```
## Output Structure

Markdown files are stored deterministically based on the URL path:

```
docs/<project>/<url-path>.md
```

Example:

```
docs/planes/wiki/Boeing_707.md
```

Rules:

- Domain is ignored
- URL path is preserved
- `/` → `index.md`
- Query parameters are ignored
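These rules can be sketched in a few lines with the standard library. This is an illustrative reconstruction of the documented mapping, not crawl4md's actual `paths.py`, which may handle edge cases differently.

```python
from urllib.parse import urlsplit

def output_path(project: str, url: str) -> str:
    """Sketch of the documented mapping: drop the domain and query string,
    keep the URL path, and map the root path to index.md."""
    path = urlsplit(url).path.strip("/")  # domain and query string are ignored
    if not path:
        path = "index"                    # "/" becomes index.md
    return f"docs/{project}/{path}.md"

print(output_path("planes", "https://de.wikipedia.org/wiki/Boeing_707"))
# docs/planes/wiki/Boeing_707.md
```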
Example Output
1/2 Crawl https://de.wikipedia.org/wiki/Boeing_707
- Fetching ... done
- Processing ... done
- Writing docs/planes/wiki/Boeing_707.md ... done
## Use Cases

- RAG data ingestion
- Website snapshotting
- Knowledge base generation
- Offline documentation
## Project Structure

```
src/crawl4md/
├─ cli.py
├─ config.py
├─ sitemap.py
├─ crawler.py
├─ paths.py
└─ writer.py
```
## Notes

- No recursive crawling (by design)
- No hidden caching or transformations
- Focus on clean Markdown output only
## License

This project is licensed under the MIT License. See the LICENSE.md file for details.

## Authors

- Björn Hempel (bjoern@hempel.li) - Initial work - https://github.com/bjoern-hempel
## Built on top of crawl4ai

This project builds on the excellent crawl4ai library and extends it with a simpler batch-oriented workflow for repeatable Markdown exports.

Why use crawl4md as a complement to crawl4ai:

- project-based batch crawling via `crawl.yml`
- support for both page lists and sitemap-driven crawls
- deterministic output paths for generated Markdown files
- optional Markdown cleanup rules for better downstream text quality
- a small CLI and Python API focused on URL or HTML to Markdown workflows
- clearer separation between fetching, conversion, preprocessing, and writing

In short: crawl4ai provides the powerful crawling and Markdown generation foundation, while crawl4md adds a lightweight structure around it for batch jobs, cleaner output, and easier integration into documentation or RAG pipelines.