pydefuddle
Python implementation of Defuddle — extract and clean web content as Markdown.
Pass any HTML string (or a URL via the CLI) and get back clean, readable Markdown with rich metadata extracted from the page.
Features
- Content extraction — finds the main article body, removes ads, navbars, sidebars, comments, cookie notices, paywalls, and other clutter
- Metadata extraction — title, author, published date, description, image, favicon, domain, language, site name via OpenGraph, Twitter Cards, Schema.org, and DOM fallbacks
- Markdown conversion — clean ATX-style Markdown with properly fenced code blocks (with language tags), tables, figures, and footnotes
- Code block handling — detects syntax highlighter markup from Prism, Highlight.js, Shiki, and others; normalises indentation and strips UI chrome (copy buttons, toolbars)
- Image processing — promotes lazy-loaded images, picks the highest-res srcset source, removes tracking pixels
- CLI — fetch any URL and copy the Markdown to your clipboard in one command
- Pure Python: depends only on the standard library plus BeautifulSoup4, markdownify, click, httpx, rich, and pyperclip
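To make the "picks the highest-res srcset source" feature concrete, here is a minimal standard-library sketch of the idea. It is an illustration only, not pydefuddle's actual implementation: the function name `pick_highest_res` is hypothetical, and the parsing assumes simple candidates separated by commas, each with a `w` or `x` descriptor.

```python
from typing import Optional


def pick_highest_res(srcset: str) -> Optional[str]:
    """Return the URL of the largest candidate in a srcset attribute value.

    Hypothetical helper for illustration; assumes URLs contain no commas and
    that a single srcset does not mix "w" and "x" descriptors.
    """
    best_url, best_score = None, -1.0
    for candidate in srcset.split(","):
        parts = candidate.split()
        if not parts:
            continue  # skip empty fragments such as a trailing comma
        url = parts[0]
        # A candidate without a descriptor is treated as density 1x.
        descriptor = parts[1] if len(parts) > 1 else "1x"
        try:
            # "800w" is a width in pixels, "2x" a pixel density; drop the unit.
            score = float(descriptor[:-1])
        except ValueError:
            continue  # ignore descriptors that are not numeric
        if score > best_score:
            best_url, best_score = url, score
    return best_url


if __name__ == "__main__":
    print(pick_highest_res("small.jpg 480w, large.jpg 1280w, medium.jpg 800w"))
```

A real srcset parser has more edge cases (data URLs, commas inside URLs), so treat this as a sketch of the selection step only.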
Installation
pip install pydefuddle
Or with uv:
uv add pydefuddle
Python API
from pydefuddle import defuddle
with open("page.html") as f:
    html = f.read()
result = defuddle(html, url="https://example.com/article")
print(result.title) # "How Python Works"
print(result.author) # "Jane Smith"
print(result.published) # "2024-03-15"
print(result.markdown) # Clean Markdown string
defuddle(html, url="", **options) — convenience function
| Option | Type | Default | Description |
|---|---|---|---|
| markdown | bool | True | Convert to Markdown (set False for clean HTML only) |
| remove_low_scoring | bool | True | Remove low-signal blocks via content scoring |
| remove_small_images | bool | True | Remove tracking pixels and tiny images |
| remove_hidden_elements | bool | True | Remove elements hidden with CSS |
| content_selector | str | None | Override content discovery with a CSS selector |
| debug | bool | False | Include removal debug info in result |
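The options above are passed as keyword arguments to `defuddle`. The following sketch shows one possible combination: clean HTML only, with content discovery pinned to `<article>`. The helper name `extract_clean_html` and the `HTML_ONLY_OPTIONS` dict are hypothetical; only the option names and the `defuddle(html, url="", **options)` signature come from this README. The import is done inside the function so the module still loads where pydefuddle is not installed.

```python
# Keyword options taken from the table above; the particular combination
# chosen here is just an example.
HTML_ONLY_OPTIONS = {
    "markdown": False,              # skip Markdown conversion, keep clean HTML
    "content_selector": "article",  # override content discovery
    "remove_small_images": True,    # default value, shown for clarity
}


def extract_clean_html(html: str, url: str = "") -> str:
    """Return the cleaned HTML for a page (hypothetical convenience wrapper)."""
    # Imported lazily; requires `pip install pydefuddle`.
    from pydefuddle import defuddle

    result = defuddle(html, url=url, **HTML_ONLY_OPTIONS)
    # With markdown=False, result.markdown is empty and result.content
    # holds the clean HTML, per the DefuddleResult field list.
    return result.content
```

Calling `extract_clean_html(html, url="https://example.com/article")` would then return the cleaned HTML string, assuming the page contains an `<article>` element.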
Defuddle class
from pydefuddle import Defuddle, DefuddleOptions
opts = DefuddleOptions(markdown=True, debug=True, content_selector="article")
result = Defuddle(html, url="https://example.com", options=opts).parse()
for removal in result.debug:
    print(removal.name, removal.count, removal.selector)
DefuddleResult fields
result.content # str — clean HTML
result.markdown # str — Markdown (empty if markdown=False)
result.title # str
result.author # str
result.published # str — ISO date / datetime string
result.description # str
result.image # str — URL
result.favicon # str — URL
result.domain # str
result.language # str — BCP 47 (e.g. "en", "fr")
result.site_title # str
result.word_count # int
result.parse_time # float — milliseconds
result.debug # list[DebugRemoval] | None
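One common use of these fields is to prepend the metadata to the Markdown as YAML front matter. The sketch below assumes only the attribute names listed above; `to_front_matter` and `FRONT_MATTER_FIELDS` are hypothetical names, and a `SimpleNamespace` stands in for a real `DefuddleResult` so the example runs without fetching a page.

```python
import json
from types import SimpleNamespace

# Subset of the documented DefuddleResult fields to emit, in order.
FRONT_MATTER_FIELDS = ("title", "author", "published", "description", "domain", "language")


def to_front_matter(result) -> str:
    """Render selected metadata as YAML front matter followed by the Markdown."""
    lines = ["---"]
    for field in FRONT_MATTER_FIELDS:
        value = getattr(result, field, "")
        if value:  # skip fields the extractor left empty
            # A JSON string is also a valid double-quoted YAML scalar.
            lines.append(f"{field}: {json.dumps(value, ensure_ascii=False)}")
    lines.append("---")
    return "\n".join(lines) + "\n\n" + result.markdown


if __name__ == "__main__":
    # Stand-in object with the same attribute names as DefuddleResult.
    fake = SimpleNamespace(
        title="How Python Works",
        author="Jane Smith",
        published="2024-03-15",
        description="",
        domain="example.com",
        language="en",
        markdown="# How Python Works\n",
    )
    print(to_front_matter(fake))
```

Because the helper only reads attributes, the object returned by `defuddle(...)` can be passed in directly.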
CLI
Fetch a URL → clipboard
pydefuddle fetch https://example.com/some-article
The Markdown is copied to your clipboard automatically.
Options
pydefuddle fetch <url> --no-clipboard # print to stdout instead
pydefuddle fetch <url> --output out.md # write to file
pydefuddle fetch <url> --preview # render in terminal with rich
pydefuddle fetch <url> --debug # show removal steps
pydefuddle fetch <url> --no-markdown # return clean HTML instead
Parse a local file
pydefuddle parse page.html --no-clipboard
pydefuddle parse page.html --output article.md
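For converting a whole folder of saved pages, the `parse` command can be driven from a short script. This is a sketch under stated assumptions: it uses only the `pydefuddle parse <file> --output <file>` form shown above, the function names `build_parse_command` and `convert_directory` are hypothetical, and it assumes the `pydefuddle` executable is on `PATH`.

```python
import subprocess
from pathlib import Path
from typing import List


def build_parse_command(src: Path, out_dir: Path) -> List[str]:
    """Build the argv for `pydefuddle parse <src> --output <out_dir>/<name>.md`."""
    target = out_dir / src.with_suffix(".md").name
    return ["pydefuddle", "parse", str(src), "--output", str(target)]


def convert_directory(html_dir: Path, out_dir: Path) -> None:
    """Run `pydefuddle parse` once per .html file found in html_dir."""
    out_dir.mkdir(parents=True, exist_ok=True)
    for src in sorted(html_dir.glob("*.html")):
        # check=True raises CalledProcessError if the CLI exits non-zero.
        subprocess.run(build_parse_command(src, out_dir), check=True)


if __name__ == "__main__":
    convert_directory(Path("pages"), Path("markdown"))
```

Building the argument list separately from running it keeps the command easy to inspect or log before anything is executed.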
Development
git clone https://github.com/phalt/pydefuddle
cd pydefuddle
make install # install deps with uv
make test # run tests with coverage
make format # ruff format + lint
Credits
Based on Defuddle by Steph Ango (@kepano), the original JavaScript library that powers Obsidian Web Clipper.
License
MIT
Download files
File details
Details for the file pydefuddle-0.1.0.tar.gz.
File metadata
- Download URL: pydefuddle-0.1.0.tar.gz
- Upload date:
- Size: 57.6 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.11.2
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | b17234bf58a6a1036fd45d6ba40dc05a5fa311b24f89b093c9582a61edb78cdd |
| MD5 | 7c1560a941692940cc58bd6cbeba018f |
| BLAKE2b-256 | dbb4b5cab381af6932ca0b9c5e247d0cd430e80117dfa970755b494d320497db |
File details
Details for the file pydefuddle-0.1.0-py3-none-any.whl.
File metadata
- Download URL: pydefuddle-0.1.0-py3-none-any.whl
- Upload date:
- Size: 21.4 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.11.2
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | ce7832f6db3d8dfd8a185f67e7d4f42a9f9bcd185a98efdc2456b079cfb237fa |
| MD5 | 00cbcdd3aa6fecb40b5cceed001a11b0 |
| BLAKE2b-256 | 2464881f90cb864fdf3980e79914561bae56ccaadca52ca7428f732d465813a7 |