pydefuddle

Python implementation of Defuddle — extract and clean web content as Markdown.

Pass any HTML string (or a URL via the CLI) and get back clean, readable Markdown with rich metadata extracted from the page.

Features

  • Content extraction — finds the main article body, removes ads, navbars, sidebars, comments, cookie notices, paywalls, and other clutter
  • Metadata extraction — title, author, published date, description, image, favicon, domain, language, site name via OpenGraph, Twitter Cards, Schema.org, and DOM fallbacks
  • Markdown conversion — clean ATX-style Markdown with properly fenced code blocks (with language tags), tables, figures, and footnotes
  • Code block handling — detects syntax highlighter markup from Prism, Highlight.js, Shiki, and others; normalises indentation and strips UI chrome (copy buttons, toolbars)
  • Image processing — promotes lazy-loaded images, picks the highest-res srcset source, removes tracking pixels
  • CLI — fetch any URL and copy the Markdown to your clipboard in one command
  • Pure Python — only standard library + BeautifulSoup4, markdownify, click, httpx, rich, pyperclip
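
The metadata fallback order listed above (OpenGraph, then Twitter Cards, then the DOM) can be illustrated with a small standard-library sketch. This is not pydefuddle's actual implementation: `MetaCollector` and `extract_title` are hypothetical names, and only the title field is handled.

```python
from html.parser import HTMLParser


class MetaCollector(HTMLParser):
    """Collect <meta> tag values and the <title> text from an HTML document."""

    def __init__(self):
        super().__init__()
        self.meta = {}  # lower-cased property/name -> content
        self.title = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta":
            key = attrs.get("property") or attrs.get("name")
            content = attrs.get("content")
            if key and content:
                # Keep the first value seen for each key.
                self.meta.setdefault(key.lower(), content.strip())
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def extract_title(html):
    """Return the first title found: OpenGraph, then Twitter Card, then <title>."""
    parser = MetaCollector()
    parser.feed(html)
    for key in ("og:title", "twitter:title"):
        if key in parser.meta:
            return parser.meta[key]
    return parser.title.strip()


sample = """
<html><head>
  <title>How Python Works | Example Blog</title>
  <meta property="og:title" content="How Python Works">
  <meta name="twitter:title" content="How Python Works (Twitter)">
</head><body></body></html>
"""
print(extract_title(sample))  # prints: How Python Works
```

The same pattern extends to other fields (description, image, site name) by changing the list of keys that is tried in order.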

Installation

pip install pydefuddle

Or with uv:

uv add pydefuddle

Python API

from pydefuddle import defuddle

with open("page.html", encoding="utf-8") as f:
    html = f.read()

result = defuddle(html, url="https://example.com/article")

print(result.title)      # "How Python Works"
print(result.author)     # "Jane Smith"
print(result.published)  # "2024-03-15"
print(result.markdown)   # Clean Markdown string

defuddle(html, url="", **options) — convenience function

Option                  Type  Default  Description
markdown                bool  True     Convert to Markdown (set False for clean HTML only)
remove_low_scoring      bool  True     Remove low-signal blocks via content scoring
remove_small_images     bool  True     Remove tracking pixels and tiny images
remove_hidden_elements  bool  True     Remove elements hidden with CSS
content_selector        str   None     Override content discovery with a CSS selector
debug                   bool  False    Include removal debug info in result
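
When options come from a config file or user input, it may help to check them against the documented names before passing them on as keyword arguments. The following is a minimal sketch under that assumption: `DEFAULT_OPTIONS` simply mirrors the table above, and `build_options` is a hypothetical helper that is not part of pydefuddle.

```python
# Documented defaults, copied from the option table above.
DEFAULT_OPTIONS = {
    "markdown": True,
    "remove_low_scoring": True,
    "remove_small_images": True,
    "remove_hidden_elements": True,
    "content_selector": None,
    "debug": False,
}


def build_options(**overrides):
    """Merge overrides into the documented defaults, rejecting unknown names."""
    unknown = sorted(set(overrides) - set(DEFAULT_OPTIONS))
    if unknown:
        raise TypeError(f"unknown option(s): {', '.join(unknown)}")
    return {**DEFAULT_OPTIONS, **overrides}


options = build_options(markdown=False, content_selector="article")
print(options["markdown"], options["content_selector"])  # prints: False article
```

The resulting dict can then be unpacked into the convenience function, for example `defuddle(html, url=url, **options)`.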

Defuddle class

from pydefuddle import Defuddle, DefuddleOptions

opts = DefuddleOptions(markdown=True, debug=True, content_selector="article")
result = Defuddle(html, url="https://example.com", options=opts).parse()

for removal in result.debug:
    print(removal.name, removal.count, removal.selector)
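
For pages with many removals, a short summary can be easier to read than the full list. The sketch below assumes only the three attributes used in the loop above (`name`, `count`, `selector`). `FakeRemoval` is a stand-in so the example runs without pydefuddle, and the sample names and selectors are made up for illustration.

```python
from collections import namedtuple

# Stand-in for pydefuddle's DebugRemoval, for illustration only.
FakeRemoval = namedtuple("FakeRemoval", "name count selector")


def summarize_removals(removals, top=3):
    """Return (name, count) pairs for the steps that removed the most elements."""
    # result.debug may be None when debug=False, so fall back to an empty list.
    ranked = sorted(removals or [], key=lambda r: r.count, reverse=True)
    return [(r.name, r.count) for r in ranked[:top]]


# Made-up sample data; real names and selectors depend on the page.
sample_debug = [
    FakeRemoval("hidden", 4, "[hidden]"),
    FakeRemoval("navigation", 12, "nav"),
    FakeRemoval("small-images", 7, "img"),
    FakeRemoval("comments", 1, "#comments"),
]
for name, count in summarize_removals(sample_debug):
    print(f"{count:>4}  {name}")
```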

DefuddleResult fields

result.content       # str  — clean HTML
result.markdown      # str  — Markdown (empty if markdown=False)
result.title         # str
result.author        # str
result.published     # str  — ISO date / datetime string
result.description   # str
result.image         # str  — URL
result.favicon       # str  — URL
result.domain        # str
result.language      # str  — BCP 47 (e.g. "en", "fr")
result.site_title    # str
result.word_count    # int
result.parse_time    # float — milliseconds
result.debug         # list[DebugRemoval] | None
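
One way these fields might be used is to prepend a YAML front matter block to the Markdown before saving a note. The sketch below is an illustration, not a pydefuddle feature: `to_front_matter` and `FRONT_MATTER_FIELDS` are hypothetical names, and a `SimpleNamespace` stands in for a real result so the example runs on its own.

```python
import json
from types import SimpleNamespace

# Subset of the result fields listed above, chosen for this example.
FRONT_MATTER_FIELDS = ("title", "author", "published", "description", "domain", "language")


def to_front_matter(result):
    """Return the Markdown prefixed with a YAML front matter block."""
    lines = ["---"]
    for field in FRONT_MATTER_FIELDS:
        value = getattr(result, field, "")
        if value:
            # A JSON string literal is also a valid double-quoted YAML scalar.
            lines.append(f"{field}: {json.dumps(value, ensure_ascii=False)}")
    lines.append("---")
    return "\n".join(lines) + "\n\n" + result.markdown


# Stand-in object with the same attribute names as a DefuddleResult.
example = SimpleNamespace(
    title="How Python Works",
    author="Jane Smith",
    published="2024-03-15",
    description="",
    domain="example.com",
    language="en",
    markdown="# How Python Works\n",
)
print(to_front_matter(example))
```

Empty fields are skipped, so pages without an author or description produce a shorter header rather than blank entries.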

CLI

Fetch a URL → clipboard

pydefuddle fetch https://example.com/some-article

The Markdown is copied to your clipboard automatically.

Options

pydefuddle fetch <url> --no-clipboard   # print to stdout instead
pydefuddle fetch <url> --output out.md  # write to file
pydefuddle fetch <url> --preview        # render in terminal with rich
pydefuddle fetch <url> --debug          # show removal steps
pydefuddle fetch <url> --no-markdown    # return clean HTML instead

Parse a local file

pydefuddle parse page.html --no-clipboard
pydefuddle parse page.html --output article.md
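
To convert a whole folder of saved pages, the `parse` command shown above can be driven from a short script. The sketch below only builds the argument lists. `build_parse_commands` is a hypothetical helper, and the directory names are placeholders.

```python
from pathlib import Path


def build_parse_commands(html_dir, out_dir):
    """Return one `pydefuddle parse <file> --output <target>` argv list per .html file."""
    out_dir = Path(out_dir)
    commands = []
    for html_file in sorted(Path(html_dir).glob("*.html")):
        # page.html in html_dir becomes page.md in out_dir.
        target = out_dir / html_file.with_suffix(".md").name
        commands.append(["pydefuddle", "parse", str(html_file), "--output", str(target)])
    return commands


for command in build_parse_commands("pages", "markdown"):
    print(" ".join(command))
```

Each list can be handed to `subprocess.run(command, check=True)` once pydefuddle is installed and the output directory exists.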

Development

git clone https://github.com/phalt/pydefuddle
cd pydefuddle
make install   # install deps with uv
make test      # run tests with coverage
make format    # ruff format + lint

Credits

Based on Defuddle by Steph Ango (@kepano), the original JavaScript library that powers Obsidian Web Clipper.

License

MIT
