MarkItDown plugin: convert live URLs via Plasmate instead of BeautifulSoup (10-100x fewer tokens)
markitdown-plasmate
A MarkItDown plugin that converts live URLs via Plasmate instead of BeautifulSoup — returning 10-100x fewer tokens with no API key required.
Why?
MarkItDown's built-in HTML converter fetches a URL, strips <script> tags, and converts whatever remains with BeautifulSoup. For a typical news article that means ~60,000 tokens of navigation menus, cookie banners, sidebar widgets, and footer links wrapped around ~2,000 tokens of actual content.
Plasmate is an open-source Rust browser engine that renders the page properly and returns only the meaningful content as clean Markdown. The token difference is significant:
| Site | Raw HTML (BeautifulSoup) | Plasmate | Reduction |
|---|---|---|---|
| TechCrunch article | ~75,000 tokens | ~975 tokens | 77× |
| Average (45 sites) | ~45,000 tokens | ~2,500 tokens | 17.7× |
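The reduction column can be sanity-checked from the two token columns. The helper below is a minimal sketch for that arithmetic only; `reduction_factor` is a hypothetical name and not part of the plugin. The inputs are the approximate figures from the table, so the average row comes out slightly different from the published 17.7×, presumably because the table's token counts are rounded.

```python
def reduction_factor(raw_tokens: int, reduced_tokens: int) -> float:
    """Return how many times smaller the reduced output is than the raw one."""
    if reduced_tokens <= 0:
        raise ValueError("reduced_tokens must be positive")
    return raw_tokens / reduced_tokens


# Approximate figures copied from the table above.
print(f"{reduction_factor(75_000, 975):.1f}x")    # TechCrunch row -> 76.9x
print(f"{reduction_factor(45_000, 2_500):.1f}x")  # average row, rounded inputs -> 18.0x
```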
The plugin slots in specifically for http:// and https:// URL inputs — local files (PDF, Word, Excel, etc.) continue to use MarkItDown's native converters unchanged.
Installation
pip install markitdown-plasmate
pip install plasmate # the Rust browser engine
Or with cargo:
cargo install plasmate
Usage
CLI
markitdown --use-plugins https://techcrunch.com/2025/04/08/some-article/
Python
from markitdown import MarkItDown
md = MarkItDown(enable_plugins=True)
result = md.convert("https://blog.cloudflare.com/ai-crawler-traffic-by-purpose-and-industry/")
print(result.markdown)
# → clean article content, ~2,000 tokens instead of ~60,000
Options
Pass plugin options via MarkItDown kwargs:
md = MarkItDown(
    enable_plugins=True,
    plasmate_format="markdown",   # markdown | text | som | links
    plasmate_timeout=30,          # seconds
    plasmate_selector="article",  # CSS selector to scope extraction
)
Or use PlasmateConverter directly:
from markitdown_plasmate import PlasmateConverter
from markitdown import MarkItDown
md = MarkItDown()
md.register_converter(PlasmateConverter(output_format="markdown", selector="main"))
result = md.convert("https://example.com")
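To illustrate how the three `plasmate_*` keyword arguments could be gathered and validated before a conversion, here is a minimal sketch. It does not reproduce the plugin's internals: `PlasmateOptions` and `options_from_kwargs` are hypothetical names, and the 30-second default timeout is an assumption taken from the example above rather than a documented default.

```python
from dataclasses import dataclass
from typing import Any, Optional

# Format names as listed in the options example above.
VALID_FORMATS = ("markdown", "text", "som", "links")


@dataclass(frozen=True)
class PlasmateOptions:
    """Hypothetical container for the plugin options (illustration only)."""
    output_format: str = "markdown"
    timeout: int = 30               # seconds; assumed default
    selector: Optional[str] = None  # CSS selector, or None for the whole page


def options_from_kwargs(**kwargs: Any) -> PlasmateOptions:
    """Pick the plasmate_* entries out of arbitrary kwargs and validate them."""
    output_format = kwargs.get("plasmate_format", "markdown")
    if output_format not in VALID_FORMATS:
        raise ValueError(
            f"unknown plasmate_format {output_format!r}; "
            f"expected one of {', '.join(VALID_FORMATS)}"
        )
    timeout = int(kwargs.get("plasmate_timeout", 30))
    if timeout <= 0:
        raise ValueError("plasmate_timeout must be a positive number of seconds")
    return PlasmateOptions(
        output_format=output_format,
        timeout=timeout,
        selector=kwargs.get("plasmate_selector"),
    )


opts = options_from_kwargs(enable_plugins=True, plasmate_format="text")
print(opts)
```

Unrelated keyword arguments such as `enable_plugins` are ignored, so the same kwargs dictionary can be passed through unchanged.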
Output formats
| Format | Description |
|---|---|
| markdown | Clean Markdown (default) |
| text | Plain text, no markup |
| som | Structured Object Model: semantic JSON tree |
| links | Extracted hyperlinks only |
When it applies
The plugin only intercepts http:// and https:// URLs. All other MarkItDown input types (PDF, Word, Excel, images, audio, local HTML files) are unaffected.
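The routing rule described here amounts to a check on the URL scheme. The sketch below shows one way to express that check with the standard library; `is_live_url` is a hypothetical helper, and the plugin's actual test may differ.

```python
from urllib.parse import urlparse


def is_live_url(source: str) -> bool:
    """Return True only for http:// and https:// inputs."""
    scheme = urlparse(source).scheme.lower()
    return scheme in ("http", "https")


for candidate in ("https://example.com", "report.pdf", "file:///tmp/page.html"):
    print(candidate, is_live_url(candidate))
```

Local paths and `file://` URLs fall through, which matches the behaviour described above: those inputs stay with MarkItDown's native converters.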
Requirements
- Python 3.10+
- markitdown >= 0.1.0
- plasmate binary on PATH (pip install plasmate or cargo install plasmate)
The plugin is constructable without the binary — ImportError is raised on the first conversion attempt with clear install instructions.
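The deferred check described above can be sketched with `shutil.which`. This is an illustration of the pattern, not the plugin's code: `LazyBinaryConverter` is a hypothetical stand-in for `PlasmateConverter`, and the actual invocation of the binary is left out because its command-line arguments are not documented here.

```python
import shutil


class LazyBinaryConverter:
    """Hypothetical stand-in showing a PATH check deferred until first use."""

    def __init__(self, binary: str = "plasmate") -> None:
        # No PATH lookup here, so construction succeeds without the binary.
        self.binary = binary

    def _require_binary(self) -> str:
        """Return the binary's full path, or raise ImportError with install hints."""
        path = shutil.which(self.binary)
        if path is None:
            raise ImportError(
                f"'{self.binary}' was not found on PATH. Install it with "
                "'pip install plasmate' or 'cargo install plasmate'."
            )
        return path

    def convert(self, url: str) -> str:
        """Check for the binary first; the real conversion step is omitted."""
        self._require_binary()
        raise NotImplementedError("binary invocation is outside this sketch")


converter = LazyBinaryConverter()  # always succeeds, even without the binary
```

Deferring the lookup keeps `MarkItDown(enable_plugins=True)` usable for local files on machines where the binary is not installed.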
Related
- Plasmate — the open-source Rust browser engine
- somspec.org — Structured Object Model specification
- MarkItDown — the Python file-to-Markdown converter this plugin extends
File details
Details for the file markitdown_plasmate-0.1.0.tar.gz.
File metadata
- Download URL: markitdown_plasmate-0.1.0.tar.gz
- Upload date:
- Size: 4.7 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.14.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 7e7ade0b15b404f449748f2c5cfa2c3e4622f9039ae8938707f7318a83064901 |
| MD5 | 90d47ab72f4f472a94f4534cab86a3af |
| BLAKE2b-256 | e3bdcecbc7b4c16fa4f3ba9bd89450a9e88d59fe083171ec60170e46bbf5abc8 |
File details
Details for the file markitdown_plasmate-0.1.0-py3-none-any.whl.
File metadata
- Download URL: markitdown_plasmate-0.1.0-py3-none-any.whl
- Upload date:
- Size: 6.0 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.14.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | ad37b0ac8d3351269b078f4ec98524b82639d101e6f1ac7af47bed767dcb3722 |
| MD5 | 000f0c3340e3225bf5701dee4156275a |
| BLAKE2b-256 | a37b3af8fb08cc8fbae97561c97a64abd973ba85266d3d34578ba43fe8857d54 |