MarkItDown plugin: convert live URLs via Plasmate instead of BeautifulSoup (10-100x fewer tokens)

markitdown-plasmate

A MarkItDown plugin that converts live URLs via Plasmate instead of BeautifulSoup — returning 10-100x fewer tokens with no API key required.

Why?

MarkItDown's built-in HTML converter fetches a URL, strips <script> tags, and converts whatever remains with BeautifulSoup. For a typical news article that means ~60,000 tokens of navigation menus, cookie banners, sidebar widgets, and footer links wrapped around ~2,000 tokens of actual content.

Plasmate is an open-source Rust browser engine that renders the page properly and returns only the meaningful content as clean Markdown. The token difference is significant:

| Site | Raw HTML (BeautifulSoup) | Plasmate | Reduction |
|---|---|---|---|
| TechCrunch article | ~75,000 tokens | ~975 tokens | 77× |
| Average (45 sites) | ~45,000 tokens | ~2,500 tokens | 17.7× |
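
The Reduction column is the ratio of the two token counts. As a quick sanity check, the sketch below recomputes it from the approximate figures in the table; `reduction_factor` is an illustrative helper, not part of the plugin.

```python
def reduction_factor(raw_tokens: int, plasmate_tokens: int) -> float:
    """Return how many times smaller the Plasmate output is than the raw HTML."""
    if plasmate_tokens <= 0:
        raise ValueError("plasmate_tokens must be positive")
    return raw_tokens / plasmate_tokens


# Approximate token counts taken from the TechCrunch row of the table.
techcrunch = reduction_factor(75_000, 975)
print(f"TechCrunch: {techcrunch:.1f}x")  # prints "TechCrunch: 76.9x", i.e. about 77x
```

The counts in the table are rounded, so a ratio recomputed from them need not match the listed reduction exactly. The Average row, for example, gives 18.0 rather than 17.7.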

The plugin slots in specifically for http:// and https:// URL inputs — local files (PDF, Word, Excel, etc.) continue to use MarkItDown's native converters unchanged.

Installation

pip install markitdown-plasmate
pip install plasmate          # the Rust browser engine

Or install the Plasmate engine with cargo instead of pip:

cargo install plasmate

Usage

CLI

markitdown --use-plugins https://techcrunch.com/2025/04/08/some-article/

Python

from markitdown import MarkItDown

md = MarkItDown(enable_plugins=True)
result = md.convert("https://blog.cloudflare.com/ai-crawler-traffic-by-purpose-and-industry/")
print(result.markdown)
# → clean article content, ~2,000 tokens instead of ~60,000

Options

Pass plugin options via MarkItDown kwargs:

md = MarkItDown(
    enable_plugins=True,
    plasmate_format="markdown",   # markdown | text | som | links
    plasmate_timeout=30,          # seconds
    plasmate_selector="article",  # CSS selector to scope extraction
)

Or use PlasmateConverter directly:

from markitdown_plasmate import PlasmateConverter
from markitdown import MarkItDown

md = MarkItDown()
md.register_converter(PlasmateConverter(output_format="markdown", selector="main"))
result = md.convert("https://example.com")

Output formats

| Format | Description |
|---|---|
| markdown | Clean Markdown (default) |
| text | Plain text, no markup |
| som | Structured Object Model, a semantic JSON tree |
| links | Extracted hyperlinks only |

When it applies

The plugin only intercepts http:// and https:// URLs. All other MarkItDown input types (PDF, Word, Excel, images, audio, local HTML files) are unaffected.
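
The routing rule above amounts to a check on the URL scheme. The sketch below shows one way to express that check with the standard library. It is an assumption about the behaviour, not the plugin's actual code, and `is_live_url` is a hypothetical name.

```python
from urllib.parse import urlparse


def is_live_url(source: str) -> bool:
    """Return True only for http:// and https:// inputs."""
    # urlparse lowercases the scheme, so "HTTPS://..." is accepted too.
    return urlparse(source).scheme in {"http", "https"}


for candidate in (
    "https://example.com/article",
    "http://example.com",
    "report.pdf",
    "file:///tmp/page.html",
):
    handler = "Plasmate" if is_live_url(candidate) else "native MarkItDown converter"
    print(f"{candidate} -> {handler}")
```

Local paths and `file://` URLs fall through to the native converters under this check, which matches the behaviour described above.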

Requirements

  • Python 3.10+
  • markitdown >= 0.1.0
  • plasmate binary on PATH (pip install plasmate or cargo install plasmate)

The plugin can be constructed without the binary. If plasmate is missing, an ImportError with install instructions is raised on the first conversion attempt.
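
To fail early rather than on the first conversion, the binary can be looked up on PATH before any URL is converted. The sketch below uses `shutil.which` from the standard library. `find_plasmate` is a hypothetical helper, and the error message is illustrative rather than the plugin's own wording.

```python
import shutil


def find_plasmate(binary_name: str = "plasmate") -> str:
    """Return the full path to the binary, or raise ImportError if it is not on PATH."""
    path = shutil.which(binary_name)
    if path is None:
        raise ImportError(
            f"{binary_name!r} not found on PATH. "
            "Install it with `pip install plasmate` or `cargo install plasmate`."
        )
    return path


try:
    print("plasmate found at", find_plasmate())
except ImportError as exc:
    print("plasmate is missing:", exc)
```

Running this check at startup surfaces a missing install before the first URL is converted.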

Related

  • Plasmate — the open-source Rust browser engine
  • somspec.org — Structured Object Model specification
  • MarkItDown — the Python file-to-Markdown converter this plugin extends
