domdown extracts the main content from web pages and returns cleaned HTML, optional Markdown, and structured metadata.
domdown
domdown turns article-like web pages into clean, structured Markdown.
It is built for pages where the shape matters: long-form posts, research writeups, technical blogs, security reports, and other content-heavy pages that need to become readable Markdown without losing useful structure.
What it does
domdown takes care of the full HTML-to-Markdown pipeline:
- Parses messy web HTML
- Selects the main article content
- Removes navigation, promo blocks, and other chrome
- Extracts metadata
- Preserves images, tables, code blocks, links, and lists
- Optionally emits YAML frontmatter
- Renders the final Markdown document
The result is Markdown that is ready to read, reuse, archive, or feed into another model.
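To make the stages above concrete, here is a toy sketch built only on Python's standard `html.parser`. It is not domdown's implementation: the `CHROME_TAGS` set, the `TinyArticleToMarkdown` class, and the `tiny_html_to_markdown` helper are all hypothetical names chosen for illustration, and the sketch handles only headings, paragraphs, and list items.

```python
from html.parser import HTMLParser

# Tags treated as page chrome in this sketch (an assumption, not domdown's actual list).
CHROME_TAGS = {"nav", "aside", "footer", "script", "style"}
BLOCK_TAGS = ("h1", "h2", "h3", "p", "li")


class TinyArticleToMarkdown(HTMLParser):
    """Toy converter: skips chrome and renders h1-h3, p, and li as Markdown."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.skip_depth = 0   # > 0 while inside a chrome element
        self.prefix = None    # Markdown prefix of the block being read
        self.buffer = []      # text fragments of the current block
        self.lines = []       # finished Markdown lines

    def handle_starttag(self, tag, attrs):
        if tag in CHROME_TAGS:
            self.skip_depth += 1
        elif self.skip_depth == 0:
            if tag in ("h1", "h2", "h3"):
                self.prefix = "#" * int(tag[1]) + " "
            elif tag == "p":
                self.prefix = ""
            elif tag == "li":
                self.prefix = "- "

    def handle_endtag(self, tag):
        if tag in CHROME_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif self.skip_depth == 0 and tag in BLOCK_TAGS and self.prefix is not None:
            # Collapse runs of whitespace before emitting the line.
            text = " ".join("".join(self.buffer).split())
            if text:
                self.lines.append(self.prefix + text)
            self.prefix, self.buffer = None, []

    def handle_data(self, data):
        if self.skip_depth == 0 and self.prefix is not None:
            self.buffer.append(data)


def tiny_html_to_markdown(html: str) -> str:
    parser = TinyArticleToMarkdown()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.lines)


sample = """
<body>
  <nav>Home Pricing Docs</nav>
  <article>
    <h1>Credential theft campaign expands</h1>
    <p>Researchers observed a new wave of phishing infrastructure.</p>
    <ul><li>Windows targets increased.</li></ul>
  </article>
</body>
"""
print(tiny_html_to_markdown(sample))
```

The navigation text is dropped and the heading, paragraph, and list item survive. A real pipeline also has to score candidate containers, resolve URLs, and handle tables, images, and code blocks, which this sketch deliberately leaves out.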
Why it exists
Most pages are not written like clean documents. They mix article content with menus, banners, share widgets, related links, and other page furniture.
domdown is designed for cases where you want the content to stay faithful to the original page while still producing a clean Markdown output that is easy to consume downstream.
Example
from domdown import DomdownOptions, html_to_markdown
html = """
<html>
<head>
<title>Credential theft campaign expands</title>
<meta name="description" content="A concise security article." />
<link rel="canonical" href="https://example.com/research/campaign" />
</head>
<body>
<nav>Home Pricing Docs</nav>
<article>
<h1>Credential theft campaign expands</h1>
<p>Researchers observed a new wave of phishing infrastructure.</p>
<figure>
<img src="/images/chart.png" alt="Campaign infrastructure chart" />
<figcaption>Campaign infrastructure by week.</figcaption>
</figure>
<ul>
<li>Windows targets increased.</li>
<li>Linux staging remained stable.</li>
</ul>
</article>
</body>
</html>
"""
markdown = html_to_markdown(
    html,
    DomdownOptions(base_url="https://example.com/research/campaign"),
)
print(markdown)
Output:
---
title: Credential theft campaign expands
source: "https://example.com/research/campaign"
description: A concise security article.
---
# Credential theft campaign expands
Researchers observed a new wave of phishing infrastructure.

Campaign infrastructure by week.
- Windows targets increased.
- Linux staging remained stable.
What it preserves
domdown is optimized for article-style pages where useful structure should survive the conversion:
- Titles and headings
- Visible author and publication metadata
- Canonical URLs and source references
- Images and captions
- Tables and code blocks
- Inline links and emphasized text
- Lists, quotes, and other document structure
Using domdown
Client usage
Use html_to_markdown() when you only need the final Markdown document as a string.
from domdown import DomdownOptions, html_to_markdown
markdown = html_to_markdown(
    html,
    DomdownOptions(
        base_url="https://example.com/post",
        emit_frontmatter=False,
    ),
)
When emit_frontmatter is True (the default), the returned string contains YAML frontmatter followed by the Markdown body.
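If downstream code needs the frontmatter and the body separately but only has the combined string, it can split on the `---` delimiters. The following is a minimal sketch that assumes the frontmatter is delimited by `---` lines, as in the example output above; `split_frontmatter` is a hypothetical helper, not part of domdown.

```python
from __future__ import annotations


def split_frontmatter(document: str) -> tuple[str | None, str]:
    """Return (frontmatter, body); frontmatter is None when absent."""
    lines = document.splitlines()
    if not lines or lines[0].strip() != "---":
        return None, document
    for i in range(1, len(lines)):
        if lines[i].strip() == "---":
            frontmatter = "\n".join(lines[1:i])
            body = "\n".join(lines[i + 1:]).lstrip("\n")
            return frontmatter, body
    # No closing delimiter found: treat the whole string as body.
    return None, document


doc = """---
title: Credential theft campaign expands
source: "https://example.com/research/campaign"
---
# Credential theft campaign expands
"""
frontmatter, body = split_frontmatter(doc)
print(frontmatter)
print(body)
```

When structured access matters, the pipeline API described below is likely the better fit, since its result exposes the frontmatter and the Markdown body as separate fields.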
API usage
Use HtmlToMarkdownPipeline when you want structured output.
from domdown import DomdownOptions, HtmlToMarkdownPipeline
pipeline = HtmlToMarkdownPipeline(
    DomdownOptions(base_url="https://example.com/post")
)
result = pipeline.run(html)
print(result.document)
print(result.markdown)
print(result.cleaned_html)
print(result.frontmatter)
print(result.warnings)
HtmlToMarkdownResult exposes:
| Field | Type | Description |
|---|---|---|
| markdown | str | Markdown rendered from the selected content. |
| cleaned_html | str \| None | HTML after parsing, selection, cleaning, and preservation. |
| metadata | HtmlMetadata \| None | Normalized metadata extracted from the source HTML. |
| frontmatter | str \| None | YAML frontmatter when enabled. |
| document | str \| None | Final document string, including frontmatter when enabled. |
| warnings | tuple[str, ...] | Non-fatal pipeline warnings. |
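Because document is typed as `str | None` while markdown is always a `str`, calling code may want a fallback. The sketch below shows one way to do that. `final_text` is a hypothetical helper, the `SimpleNamespace` object is only a stand-in with the same field names as HtmlToMarkdownResult, and the warning text is made up; the conditions under which document is actually None are not stated here.

```python
from types import SimpleNamespace


def final_text(result) -> str:
    """Prefer the full document; fall back to the Markdown body."""
    for warning in result.warnings:
        # Warnings are non-fatal, so report them and keep going.
        print(f"warning: {warning}")
    return result.document if result.document is not None else result.markdown


# Stand-in for a pipeline result; real code would pass pipeline.run(html).
fake_result = SimpleNamespace(
    markdown="# Title",
    document=None,
    warnings=("example warning text",),  # invented message for illustration
)
print(final_text(fake_result))
```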
HtmlMetadata exposes:
| Field | Type |
|---|---|
| title | str \| None |
| site_name | str \| None |
| source | str \| None |
| author | tuple[str, ...] |
| published | str \| None |
| created | str \| None |
| description | str \| None |
| tags | tuple[str, ...] |
| language | str \| None |
| canonical_url | str \| None |
| image | str \| None |
Options
DomdownOptions controls parsing, cleanup, metadata extraction, and output shaping.
| Option | Default | Behavior |
|---|---|---|
| base_url | None | Source URL used for metadata and relative URL resolution. |
| created | None | Creation date to include in metadata/frontmatter. |
| extract_metadata | True | Enables metadata extraction. |
| emit_frontmatter | True | Prepends YAML frontmatter to document. |
| prefer_article_body | True | Prefers article-like containers during selection. |
| author_priority | "visible" | Chooses visible author text before metadata unless set otherwise. |
| frontmatter_tags | () | Extra tags to include in generated frontmatter. |
| preserve_images | True | Keeps images for Markdown rendering. |
| preserve_tables | True | Keeps tables for Markdown rendering. |
| preserve_code_blocks | True | Keeps code/preformatted blocks. |
| strip_hidden | True | Removes hidden or non-visible elements. |
| remove_selectors | () | CSS selectors to remove. |
| keep_selectors | () | CSS selectors to protect during cleaning. |
| unwrap_selectors | () | CSS selectors whose wrapper is removed while children remain. |
Example:
from domdown import DomdownOptions
options = DomdownOptions(
    base_url="https://example.com/article",
    emit_frontmatter=True,
    preserve_images=True,
    remove_selectors=(".share-widget", ".newsletter-signup"),
)
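The base_url option matters most for pages that use relative links and image paths. As a rough illustration of what relative URL resolution means, the sketch below uses the standard library's `urllib.parse.urljoin`. Whether domdown resolves URLs in exactly this way is an assumption, and `resolve` is a hypothetical helper.

```python
from urllib.parse import urljoin


def resolve(base_url: str, href: str) -> str:
    """Resolve href against base_url using standard URL-joining rules."""
    return urljoin(base_url, href)


base_url = "https://example.com/research/campaign"

# Root-relative path: keeps the host, replaces the whole path.
print(resolve(base_url, "/images/chart.png"))
# Document-relative path: replaces only the last path segment.
print(resolve(base_url, "chart.png"))
# Absolute URL: returned unchanged.
print(resolve(base_url, "https://cdn.example.net/a.png"))
```

Without a base URL, a path such as `/images/chart.png` has no host to attach to, which is why passing base_url is worthwhile whenever the HTML came from a known address.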
Real-world coverage
domdown includes curated real-world HTML/Markdown pairs under tests/real/ to protect the pipeline against regressions on live site shapes.
- html/ stores the captured HTML for each case.
- raw/ stores the expected Markdown output for the same case.
- manifest.json declares the cases and their relative fixture paths.
To run the real-example suite:
pytest tests/real/test_real_examples.py -q
Public API
domdown exports these names from domdown.__init__:
from domdown import (
    DomdownOptions,
    HtmlMetadata,
    HtmlToMarkdownPipeline,
    HtmlToMarkdownResult,
    html_to_markdown,
)
Installation
Install from this repository:
pip install git+https://github.com/juanmcristobal/domdown.git
Install locally for development:
git clone https://github.com/juanmcristobal/domdown.git
cd domdown
pip install -e ".[dev]"
Runtime dependencies:
- beautifulsoup4
- lxml
- soupsieve
- httpx
Support & Connect
- ⭐ Star the repo if you found it useful
- ☕ Support me: Say thanks by buying me a coffee! https://buymeacoffee.com/juanmcristobal
- 💼 Open to work: https://www.linkedin.com/in/jmcristobal/
History
0.1.0 (2026-05-21)
- First release.
0.1.1 (2026-05-31)
- Fix release workflow checkout for PyPI publish.