extracts the main content from web pages and returns cleaned HTML, optional markdown, and structured metadata.

These details have been verified by PyPI

Project links

Homepage

GitHub Statistics

Maintainers

jmcristobal

These details have not been verified by PyPI

Intended Audience
- Developers
Natural Language
- English
Programming Language

Project description

domdown

domdown turns article-like web pages into clean, structured Markdown.

It is built for pages where the shape matters: long-form posts, research writeups, technical blogs, security reports, and other content-heavy pages that need to become readable Markdown without losing useful structure.

What it does

domdown takes care of the full HTML-to-Markdown pipeline:

- Parses messy web HTML
- Selects the main article content
- Removes navigation, promo blocks, and other chrome
- Extracts metadata
- Preserves images, tables, code blocks, links, and lists
- Optionally emits YAML frontmatter
- Renders the final Markdown document

The result is Markdown that is ready to read, reuse, archive, or feed into another model.
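
The selection and cleanup steps can be illustrated with a toy extractor built on the standard library's `html.parser`. This is a minimal sketch and not domdown's implementation: the class name `ArticleTextExtractor`, the helper `extract_article_text`, and the fixed set of chrome tags are all assumptions made for illustration.

```python
from html.parser import HTMLParser

# Tags treated as page chrome in this sketch (an assumption, not domdown's rule set).
CHROME_TAGS = {"nav", "header", "footer", "aside", "script", "style"}


class ArticleTextExtractor(HTMLParser):
    """Collect text found inside <article>, skipping chrome tags nested within it."""

    def __init__(self) -> None:
        super().__init__()
        self.article_depth = 0  # > 0 while inside an <article>
        self.chrome_depth = 0   # > 0 while inside a chrome tag
        self.chunks = []        # text fragments kept so far

    def handle_starttag(self, tag, attrs):
        if tag == "article":
            self.article_depth += 1
        elif tag in CHROME_TAGS:
            self.chrome_depth += 1

    def handle_endtag(self, tag):
        if tag == "article" and self.article_depth:
            self.article_depth -= 1
        elif tag in CHROME_TAGS and self.chrome_depth:
            self.chrome_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        # Keep text only when inside an article and outside any chrome tag.
        if text and self.article_depth and not self.chrome_depth:
            self.chunks.append(text)


def extract_article_text(html: str) -> list:
    parser = ArticleTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.chunks
```

A real pipeline also has to handle pages without an `<article>` element, which this sketch ignores.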

Why it exists

Most pages are not written like clean documents. They mix article content with menus, banners, share widgets, related links, and other page furniture.

domdown is designed for cases where you want the content to stay faithful to the original page while still producing a clean Markdown output that is easy to consume downstream.

Example

from domdown import DomdownOptions, html_to_markdown

html = """
<html>
  <head>
    <title>Credential theft campaign expands</title>
    <meta name="description" content="A concise security article." />
    <link rel="canonical" href="https://example.com/research/campaign" />
  </head>
  <body>
    <nav>Home Pricing Docs</nav>
    <article>
      <h1>Credential theft campaign expands</h1>
      <p>Researchers observed a new wave of phishing infrastructure.</p>
      <figure>
        <img src="/images/chart.png" alt="Campaign infrastructure chart" />
        <figcaption>Campaign infrastructure by week.</figcaption>
      </figure>
      <ul>
        <li>Windows targets increased.</li>
        <li>Linux staging remained stable.</li>
      </ul>
    </article>
  </body>
</html>
"""

markdown = html_to_markdown(
    html,
    DomdownOptions(base_url="https://example.com/research/campaign"),
)

print(markdown)

Output:

---
title: Credential theft campaign expands
source: "https://example.com/research/campaign"
description: A concise security article.
---
# Credential theft campaign expands

Researchers observed a new wave of phishing infrastructure.

![Campaign infrastructure chart](https://example.com/images/chart.png)

Campaign infrastructure by week.

- Windows targets increased.
- Linux staging remained stable.
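
The image URL in the output is absolute even though the source HTML uses `/images/chart.png`. That matches standard URL resolution against `base_url`, which the snippet below reproduces with `urllib.parse.urljoin`. Whether domdown calls `urljoin` internally is an assumption; the sketch only shows the resolution rule.

```python
from urllib.parse import urljoin

base_url = "https://example.com/research/campaign"

# A root-relative path replaces the whole path of the base URL.
root_relative = urljoin(base_url, "/images/chart.png")

# A path-relative reference is resolved against the base URL's directory.
path_relative = urljoin(base_url, "images/chart.png")

print(root_relative)  # https://example.com/images/chart.png
print(path_relative)  # https://example.com/research/images/chart.png
```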

What it preserves

domdown is optimized for article-style pages where useful structure should survive the conversion:

- Titles and headings
- Visible author and publication metadata
- Canonical URLs and source references
- Images and captions
- Tables and code blocks
- Inline links and emphasized text
- Lists, quotes, and other document structure

Using domdown

Client usage

Use html_to_markdown() when you only need the final Markdown document as a string.

from domdown import DomdownOptions, html_to_markdown

markdown = html_to_markdown(
    html,
    DomdownOptions(
        base_url="https://example.com/post",
        emit_frontmatter=False,
    ),
)

When emit_frontmatter is True (the default), the returned string contains YAML frontmatter followed by the Markdown body.
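
If a caller holds only the combined string and later needs the two parts separately, the frontmatter can be split off at the `---` delimiters seen in the example output. The helper `split_frontmatter` below is hypothetical and not part of domdown.

```python
from typing import Optional


def split_frontmatter(document: str) -> tuple[Optional[str], str]:
    """Return (frontmatter, body); frontmatter is None when the document has none."""
    lines = document.splitlines(keepends=True)
    if not lines or lines[0].strip() != "---":
        return None, document
    # Find the closing delimiter after the opening one.
    for index in range(1, len(lines)):
        if lines[index].strip() == "---":
            frontmatter = "".join(lines[1:index])
            body = "".join(lines[index + 1:])
            return frontmatter, body
    # Opening delimiter without a closing one: treat everything as body.
    return None, document
```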

API usage

Use HtmlToMarkdownPipeline when you want structured output.

from domdown import DomdownOptions, HtmlToMarkdownPipeline

pipeline = HtmlToMarkdownPipeline(
    DomdownOptions(base_url="https://example.com/post")
)
result = pipeline.run(html)

print(result.document)
print(result.markdown)
print(result.cleaned_html)
print(result.frontmatter)
print(result.warnings)

HtmlToMarkdownResult exposes:

| Field | Type | Description |
| --- | --- | --- |
| `markdown` | `str` | Markdown rendered from the selected content. |
| `cleaned_html` | `str \| None` | HTML after parsing, selection, cleaning, and preservation. |
| `metadata` | `HtmlMetadata \| None` | Normalized metadata extracted from the source HTML. |
| `frontmatter` | `str \| None` | YAML frontmatter when enabled. |
| `document` | `str \| None` | Final document string, including frontmatter when enabled. |
| `warnings` | `tuple[str, ...]` | Non-fatal pipeline warnings. |

HtmlMetadata exposes:

| Field | Type |
| --- | --- |
| `title` | `str \| None` |
| `site_name` | `str \| None` |
| `source` | `str \| None` |
| `author` | `tuple[str, ...]` |
| `published` | `str \| None` |
| `created` | `str \| None` |
| `description` | `str \| None` |
| `tags` | `tuple[str, ...]` |
| `language` | `str \| None` |
| `canonical_url` | `str \| None` |
| `image` | `str \| None` |
Options

DomdownOptions controls parsing, cleanup, metadata extraction, and output shaping.

| Option | Default | Behavior |
| --- | --- | --- |
| `base_url` | `None` | Source URL used for metadata and relative URL resolution. |
| `created` | `None` | Creation date to include in metadata/frontmatter. |
| `extract_metadata` | `True` | Enables metadata extraction. |
| `emit_frontmatter` | `True` | Prepends YAML frontmatter to `document`. |
| `prefer_article_body` | `True` | Prefers article-like containers during selection. |
| `author_priority` | `"visible"` | Chooses visible author text before metadata unless set otherwise. |
| `frontmatter_tags` | `()` | Extra tags to include in generated frontmatter. |
| `preserve_images` | `True` | Keeps images for Markdown rendering. |
| `preserve_tables` | `True` | Keeps tables for Markdown rendering. |
| `preserve_code_blocks` | `True` | Keeps code/preformatted blocks. |
| `strip_hidden` | `True` | Removes hidden or non-visible elements. |
| `remove_selectors` | `()` | CSS selectors to remove. |
| `keep_selectors` | `()` | CSS selectors to protect during cleaning. |
| `unwrap_selectors` | `()` | CSS selectors whose wrapper is removed while children remain. |

Example:

from domdown import DomdownOptions

options = DomdownOptions(
    base_url="https://example.com/article",
    emit_frontmatter=True,
    preserve_images=True,
    remove_selectors=(".share-widget", ".newsletter-signup"),
)

Real-world coverage

domdown includes curated real-world HTML/Markdown pairs under tests/real/ to protect the pipeline against regressions on live site shapes.

- html/ stores the captured HTML for each case.
- raw/ stores the expected Markdown output for the same case.
- manifest.json declares the cases and their relative fixture paths.
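
A manifest-driven layout like this lends itself to a loader that turns each declared case into a pair of fixture paths. The sketch below assumes a manifest shaped like `{"cases": [{"name": ..., "html": ..., "markdown": ...}]}`. The real schema of manifest.json is not documented here, so the key names and the `load_cases` helper are hypothetical.

```python
import json
from pathlib import Path


def load_cases(manifest_path: Path) -> list:
    """Return (name, html_path, markdown_path) tuples for each declared case."""
    root = manifest_path.parent
    manifest = json.loads(manifest_path.read_text(encoding="utf-8"))
    cases = []
    for case in manifest["cases"]:  # assumed key name
        # Fixture paths are taken as relative to the manifest's directory.
        cases.append((case["name"], root / case["html"], root / case["markdown"]))
    return cases
```

A list in this shape could be passed to `pytest.mark.parametrize` so that each captured page becomes its own test case.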

To run the real-example suite:

pytest tests/real/test_real_examples.py -q

Public API

domdown exports these names from domdown.__init__:

from domdown import (
    DomdownOptions,
    HtmlMetadata,
    HtmlToMarkdownPipeline,
    HtmlToMarkdownResult,
    html_to_markdown,
)

Installation

Install from this repository:

pip install git+https://github.com/juanmcristobal/domdown.git

Install locally for development:

git clone https://github.com/juanmcristobal/domdown.git
cd domdown
pip install -e ".[dev]"

Runtime dependencies:

- beautifulsoup4
- lxml
- soupsieve
- httpx

Support & Connect

⭐ Star the repo if you found it useful
☕ Support me: Say thanks by buying me a coffee! https://buymeacoffee.com/juanmcristobal
💼 Open to work: https://www.linkedin.com/in/jmcristobal/

History

0.1.0 (2026-05-21)

First release.

0.1.1 (2026-05-31)

Fix release workflow checkout for PyPI publish.

Project details

Release history

- 0.3.0 (Jun 3, 2026)
- 0.1.1 (May 31, 2026), this version

Download files

Source Distribution

domdown-0.1.1.tar.gz (8.6 MB)

Uploaded May 31, 2026 Source

Built Distribution

domdown-0.1.1-py3-none-any.whl (47.5 kB)

Uploaded May 31, 2026 Python 3

File details

Details for the file domdown-0.1.1.tar.gz.

File metadata

Download URL: domdown-0.1.1.tar.gz
Upload date: May 31, 2026
Size: 8.6 MB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for domdown-0.1.1.tar.gz
| Algorithm | Hash digest |
| --- | --- |
| SHA256 | `68441615d3e106cd9d1492c33e7a762aa96487139c55c501c4496651a651f312` |
| MD5 | `17b232b56a71dc6fac21b19d21f5b6c1` |
| BLAKE2b-256 | `cfa206966806eab8a9ef8bdcecc764d13173994e14e4af6008c2b01665e59e96` |
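
A downloaded file can be checked against the SHA256 digest listed for domdown-0.1.1.tar.gz using only the standard library. This is a sketch: `sha256_of_file` is an illustrative helper, not part of domdown or pip, and the local path in the usage comment is a placeholder.

```python
import hashlib
from pathlib import Path

# SHA256 digest published for domdown-0.1.1.tar.gz.
EXPECTED_SHA256 = "68441615d3e106cd9d1492c33e7a762aa96487139c55c501c4496651a651f312"


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash the file in chunks so large archives are not read into memory at once."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Usage (assumes the sdist was downloaded to the current directory):
# assert sha256_of_file(Path("domdown-0.1.1.tar.gz")) == EXPECTED_SHA256
```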

Provenance

The following attestation bundles were made for domdown-0.1.1.tar.gz:

Publisher: release.yml on juanmcristobal/domdown

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: domdown-0.1.1.tar.gz
- Subject digest: 68441615d3e106cd9d1492c33e7a762aa96487139c55c501c4496651a651f312
- Sigstore transparency entry: 1678694432
- Sigstore integration time: May 31, 2026
Source repository:
- Permalink: juanmcristobal/domdown@546bbc8df1e82e5446d98330038a0f9bb38a83ac
- Branch / Tag: refs/tags/v0.1.1
- Owner: https://github.com/juanmcristobal
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@546bbc8df1e82e5446d98330038a0f9bb38a83ac
- Trigger Event: push

File details

Details for the file domdown-0.1.1-py3-none-any.whl.

File metadata

Download URL: domdown-0.1.1-py3-none-any.whl
Upload date: May 31, 2026
Size: 47.5 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for domdown-0.1.1-py3-none-any.whl
| Algorithm | Hash digest |
| --- | --- |
| SHA256 | `4fac755b5d6fd9aea06ff08f4d436835a774809feb6c86dc61df3d7675c46ba8` |
| MD5 | `da3ee19dbba084ce1692f7cbaf335097` |
| BLAKE2b-256 | `d99295a1ab3e8afdabe55e0ad16cd733b2663fc38f7d016c9af953bab3f4dc18` |

Provenance

The following attestation bundles were made for domdown-0.1.1-py3-none-any.whl:

Publisher: release.yml on juanmcristobal/domdown

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: domdown-0.1.1-py3-none-any.whl
- Subject digest: 4fac755b5d6fd9aea06ff08f4d436835a774809feb6c86dc61df3d7675c46ba8
- Sigstore transparency entry: 1678694683
- Sigstore integration time: May 31, 2026
Source repository:
- Permalink: juanmcristobal/domdown@546bbc8df1e82e5446d98330038a0f9bb38a83ac
- Branch / Tag: refs/tags/v0.1.1
- Owner: https://github.com/juanmcristobal
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@546bbc8df1e82e5446d98330038a0f9bb38a83ac
- Trigger Event: push
