domdown extracts the main content from web pages and returns cleaned HTML, optional Markdown, and structured metadata.

domdown

domdown turns article-like web pages into clean, structured Markdown.

It is built for pages where the shape matters: long-form posts, research writeups, technical blogs, security reports, and other content-heavy pages that need to become readable Markdown without losing useful structure.

What it does

domdown takes care of the full HTML-to-Markdown pipeline:

  • Parses messy web HTML
  • Selects the main article content
  • Removes navigation, promo blocks, and other chrome
  • Extracts metadata
  • Preserves images, tables, code blocks, links, and lists
  • Optionally emits YAML frontmatter
  • Renders the final Markdown document

The result is Markdown that is ready to read, reuse, archive, or feed into another model.
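
The selection and cleanup steps listed above can be illustrated with a small standard-library sketch. This is not domdown's implementation (the runtime dependencies suggest it builds on beautifulsoup4 and lxml); the class name, the function name, and the set of skipped tags are assumptions made only for illustration.

```python
from html.parser import HTMLParser


class ArticleTextExtractor(HTMLParser):
    """Collect text found inside <article>, skipping page chrome.

    Illustrative only: SKIP_TAGS is an assumed list, not domdown's rules.
    """

    SKIP_TAGS = {"nav", "aside", "footer", "script", "style"}

    def __init__(self):
        super().__init__()
        self.article_depth = 0  # greater than 0 while inside an <article>
        self.skip_depth = 0     # greater than 0 while inside a skipped tag
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self.skip_depth += 1
        elif tag == "article":
            self.article_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self.skip_depth:
            self.skip_depth -= 1
        elif tag == "article" and self.article_depth:
            self.article_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        # Keep text only when it is inside an article and outside chrome.
        if text and self.article_depth and not self.skip_depth:
            self.chunks.append(text)


def extract_article_text(html):
    """Return the text chunks of the article content, in document order."""
    parser = ArticleTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.chunks


sample = (
    "<body><nav>Home Pricing Docs</nav>"
    "<article><h1>Title</h1><p>Body text.</p><nav>Share</nav></article>"
    "</body>"
)
print(extract_article_text(sample))  # ['Title', 'Body text.']
```

The sketch only shows why selection and chrome removal are separate concerns: navigation outside the article is never selected, and navigation nested inside it still has to be cleaned away.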

Why it exists

Most pages are not written like clean documents. They mix article content with menus, banners, share widgets, related links, and other page furniture.

domdown is designed for cases where you want the content to stay faithful to the original page while still producing a clean Markdown output that is easy to consume downstream.

Example

from domdown import DomdownOptions, html_to_markdown

html = """
<html>
  <head>
    <title>Credential theft campaign expands</title>
    <meta name="description" content="A concise security article." />
    <link rel="canonical" href="https://example.com/research/campaign" />
  </head>
  <body>
    <nav>Home Pricing Docs</nav>
    <article>
      <h1>Credential theft campaign expands</h1>
      <p>Researchers observed a new wave of phishing infrastructure.</p>
      <figure>
        <img src="/images/chart.png" alt="Campaign infrastructure chart" />
        <figcaption>Campaign infrastructure by week.</figcaption>
      </figure>
      <ul>
        <li>Windows targets increased.</li>
        <li>Linux staging remained stable.</li>
      </ul>
    </article>
  </body>
</html>
"""

markdown = html_to_markdown(
    html,
    DomdownOptions(base_url="https://example.com/research/campaign"),
)

print(markdown)

Output:

---
title: Credential theft campaign expands
source: "https://example.com/research/campaign"
description: A concise security article.
---
# Credential theft campaign expands

Researchers observed a new wave of phishing infrastructure.

![Campaign infrastructure chart](https://example.com/images/chart.png)

Campaign infrastructure by week.

- Windows targets increased.
- Linux staging remained stable.

What it preserves

domdown is optimized for article-style pages where useful structure should survive the conversion:

  • Titles and headings
  • Visible author and publication metadata
  • Canonical URLs and source references
  • Images and captions
  • Tables and code blocks
  • Inline links and emphasized text
  • Lists, quotes, and other document structure

Using domdown

Client usage

Use html_to_markdown() when you only need the final Markdown document as a string.

from domdown import DomdownOptions, html_to_markdown

markdown = html_to_markdown(
    html,
    DomdownOptions(
        base_url="https://example.com/post",
        emit_frontmatter=False,
    ),
)

When emit_frontmatter is True (the default), the returned string includes YAML frontmatter followed by the Markdown body.
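
If a downstream consumer receives the combined string and needs the two parts separately, the frontmatter can be split off by its --- delimiters. The helper below is a hypothetical sketch, not part of domdown; it assumes the frontmatter starts on the first line and is closed by a line containing only ---.

```python
def split_frontmatter(document):
    """Return (frontmatter, body); frontmatter is None when absent.

    Hypothetical helper, not part of the domdown API.
    """
    lines = document.splitlines(keepends=True)
    if not lines or lines[0].strip() != "---":
        return None, document
    for index in range(1, len(lines)):
        if lines[index].strip() == "---":
            frontmatter = "".join(lines[1:index])
            body = "".join(lines[index + 1:])
            return frontmatter, body
    # No closing delimiter: treat the whole string as plain Markdown.
    return None, document


doc = "---\ntitle: Example\n---\n# Heading\n\nBody.\n"
frontmatter, body = split_frontmatter(doc)
print(repr(frontmatter))  # 'title: Example\n'
print(repr(body))         # '# Heading\n\nBody.\n'
```

When structured access is needed from the start, HtmlToMarkdownPipeline avoids this parsing step because its result carries the frontmatter and the Markdown as separate fields.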

API usage

Use HtmlToMarkdownPipeline when you want structured output.

from domdown import DomdownOptions, HtmlToMarkdownPipeline

pipeline = HtmlToMarkdownPipeline(
    DomdownOptions(base_url="https://example.com/post")
)
result = pipeline.run(html)

print(result.document)
print(result.markdown)
print(result.cleaned_html)
print(result.frontmatter)
print(result.warnings)

HtmlToMarkdownResult exposes:

Field         Type                 Description
markdown      str                  Markdown rendered from the selected content.
cleaned_html  str | None           HTML after parsing, selection, cleaning, and preservation.
metadata      HtmlMetadata | None  Normalized metadata extracted from the source HTML.
frontmatter   str | None           YAML frontmatter when enabled.
document      str | None           Final document string, including frontmatter when enabled.
warnings      tuple[str, ...]      Non-fatal pipeline warnings.

HtmlMetadata exposes:

Field          Type
title          str | None
site_name      str | None
source         str | None
author         tuple[str, ...]
published      str | None
created        str | None
description    str | None
tags           tuple[str, ...]
language       str | None
canonical_url  str | None
image          str | None
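
Because most metadata fields may be None and the tuple fields may be empty, consumers usually want only the populated values. The sketch below reads the documented field names with getattr, so it works on any object that exposes them. The populated_fields helper and the SimpleNamespace stand-in are illustrative assumptions, not part of domdown.

```python
from types import SimpleNamespace

# Field names taken from the HtmlMetadata table above.
METADATA_FIELDS = (
    "title", "site_name", "source", "author", "published", "created",
    "description", "tags", "language", "canonical_url", "image",
)


def populated_fields(metadata):
    """Return a dict of the metadata fields that carry a value.

    Hypothetical helper: skips None, empty strings, and empty tuples.
    """
    values = {}
    for name in METADATA_FIELDS:
        value = getattr(metadata, name, None)
        if value:
            values[name] = value
    return values


# Stand-in object with the same attribute names as HtmlMetadata.
metadata = SimpleNamespace(
    title="Credential theft campaign expands",
    site_name=None,
    source="https://example.com/research/campaign",
    author=(),
    description="A concise security article.",
    tags=(),
)
print(populated_fields(metadata))
```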

Options

DomdownOptions controls parsing, cleanup, metadata extraction, and output shaping.

Option                Default    Behavior
base_url              None       Source URL used for metadata and relative URL resolution.
created               None       Creation date to include in metadata/frontmatter.
extract_metadata      True       Enables metadata extraction.
emit_frontmatter      True       Prepends YAML frontmatter to document.
prefer_article_body   True       Prefers article-like containers during selection.
author_priority       "visible"  Chooses visible author text before metadata unless set otherwise.
frontmatter_tags      ()         Extra tags to include in generated frontmatter.
preserve_images       True       Keeps images for Markdown rendering.
preserve_tables       True       Keeps tables for Markdown rendering.
preserve_code_blocks  True       Keeps code/preformatted blocks.
strip_hidden          True       Removes hidden or non-visible elements.
remove_selectors      ()         CSS selectors to remove.
keep_selectors        ()         CSS selectors to protect during cleaning.
unwrap_selectors      ()         CSS selectors whose wrapper is removed while children remain.

Example:

from domdown import DomdownOptions

options = DomdownOptions(
    base_url="https://example.com/article",
    emit_frontmatter=True,
    preserve_images=True,
    remove_selectors=(".share-widget", ".newsletter-signup"),
)

Real-world coverage

domdown includes curated real-world HTML/Markdown pairs under tests/real/ to protect the pipeline against regressions on live site shapes.

  • html/ stores the captured HTML for each case.
  • raw/ stores the expected Markdown output for the same case.
  • manifest.json declares the cases and their relative fixture paths.
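
A fixture layout like this is typically consumed by reading the manifest and pairing each captured HTML file with its expected Markdown. The sketch below shows one way to do that. The manifest schema (a "cases" list with "name", "html", and "markdown" keys) is an assumption for illustration; the real manifest.json may be organized differently. The demo builds a throwaway fixture tree so the snippet runs on its own.

```python
import json
import tempfile
from pathlib import Path


def load_cases(root):
    """Yield (name, html_text, expected_markdown) for each manifest case.

    Assumes a hypothetical manifest schema; adjust the keys to the real one.
    """
    manifest = json.loads((root / "manifest.json").read_text(encoding="utf-8"))
    for case in manifest["cases"]:
        html_text = (root / case["html"]).read_text(encoding="utf-8")
        expected = (root / case["markdown"]).read_text(encoding="utf-8")
        yield case["name"], html_text, expected


def build_demo_fixtures(root):
    """Create a one-case fixture tree that mirrors the html/ and raw/ layout."""
    (root / "html").mkdir()
    (root / "raw").mkdir()
    (root / "html" / "post.html").write_text(
        "<article><h1>Post</h1></article>", encoding="utf-8"
    )
    (root / "raw" / "post.md").write_text("# Post\n", encoding="utf-8")
    manifest = {
        "cases": [
            {"name": "post", "html": "html/post.html", "markdown": "raw/post.md"}
        ]
    }
    (root / "manifest.json").write_text(json.dumps(manifest), encoding="utf-8")


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    build_demo_fixtures(root)
    cases = list(load_cases(root))

print(cases)
```

In a regression test, each yielded pair would be converted with the pipeline and the output compared against the expected Markdown.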

To run the real-example suite:

pytest tests/real/test_real_examples.py -q

Public API

domdown exports these names from domdown.__init__:

from domdown import (
    DomdownOptions,
    HtmlMetadata,
    HtmlToMarkdownPipeline,
    HtmlToMarkdownResult,
    html_to_markdown,
)

Installation

Install from PyPI:

pip install domdown

Install locally for development:

git clone https://github.com/juanmcristobal/domdown.git
cd domdown
pip install -e ".[dev]"

Runtime dependencies:

  • beautifulsoup4
  • lxml
  • soupsieve
  • httpx

History

0.1.0 (2026-05-21)

  • First release.

0.2.0 (2026-06-03)

  • Fix release workflow checkout for PyPI publish.
  • Change installation instructions to use pip install domdown (breaking change).
