rs-trafilatura

Fast web content extraction, page type classification, HTML cleaning, and Markdown conversion for Python — powered by Rust.

rs-trafilatura is a Python package built with PyO3 that wraps four Rust crates into a single pip install. It extracts the main content from web pages, classifies page types, predicts extraction quality, cleans HTML, and converts HTML to Markdown — all at native Rust speed.

Why rs-trafilatura?

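Fast: Rust-native extraction at ~44 ms per page on commodity hardware, roughly 36x faster than MinerU-HTML (1,570 ms/page) and over 200x faster than ReaderLM-v2 (10,410 ms/page), the two neural baselines in the Benchmarks section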
Accurate: F1 0.859 across 7 page types on the Web Content Extraction Benchmark, outperforming Trafilatura (0.791), MinerU-HTML (0.827), and ReaderLM-v2 (0.741)
Page-type aware: XGBoost classifier detects articles, forums, products, collections, listings, documentation, and service pages — then applies type-specific extraction profiles
Quality scoring: ML-based confidence predictor (0.0–1.0) tells you when extraction might be unreliable, enabling hybrid pipelines with LLM fallback
Framework adapters: Drop-in integrations for crawl4ai, Scrapy, Firecrawl, and Crawlee

Install

pip install rs-trafilatura

Quick Start

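import rs_trafilatura

# Any HTML string works; a tiny inline page keeps the example self-contained
html = "<html><head><title>Example</title></head><body><article><p>Hello, world.</p></article></body></html>"

# Extract main content from HTML
result = rs_trafilatura.extract(html, url="https://example.com")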
print(result.title)                # Page title
print(result.main_content)         # Clean extracted text
print(result.page_type)            # article, forum, product, etc.
print(result.extraction_quality)   # 0.0–1.0 confidence score

API Reference

Content Extraction

# From a string
result = rs_trafilatura.extract(
    html,
    url="https://example.com",      # URL for page type classification
    page_type="product",             # Force a page type (bypasses classifier)
    favor_precision=True,            # Stricter filtering, less noise
    favor_recall=False,              # More inclusive extraction
    include_tables=True,             # Include table content
    include_images=True,             # Extract image metadata
    include_comments=False,          # Include comment sections
    output_markdown=True,            # Generate Markdown in content_markdown
)

# From raw bytes (auto-detects encoding)
result = rs_trafilatura.extract_bytes(
    response_bytes,
    url="https://example.com",
    output_markdown=True,
)

ExtractResult fields:

Field	Type	Description
`title`	`str | None`	Page title
`author`	`str | None`	Author name
`date`	`str | None`	Publication date (ISO 8601)
`main_content`	`str`	Extracted main content as plain text
`content_markdown`	`str | None`	Markdown output (when `output_markdown=True`)
`content_html`	`str | None`	Extracted content as HTML
`page_type`	`str | None`	Detected page type
`extraction_quality`	`float`	Confidence score (0.0–1.0)
`classification_confidence`	`float | None`	Page type classifier confidence
`language`	`str | None`	Detected language
`sitename`	`str | None`	Site name
`description`	`str | None`	Meta description
`images`	`list[ImageData]`	Extracted images with src, alt, caption
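
Because `content_markdown` is `None` unless `output_markdown=True` was passed, downstream code usually needs a fallback to `main_content`. The helper below is a minimal sketch of that choice; `best_text` is a hypothetical name, not part of the package, and the `SimpleNamespace` objects stand in for real `ExtractResult` values so the snippet runs without the package installed.

```python
from types import SimpleNamespace


def best_text(result):
    """Prefer Markdown when it was generated, else fall back to plain text."""
    markdown = getattr(result, "content_markdown", None)
    return markdown if markdown else result.main_content


# Stand-ins for ExtractResult objects (hypothetical values)
with_markdown = SimpleNamespace(main_content="Hello", content_markdown="# Hello")
without_markdown = SimpleNamespace(main_content="Hello", content_markdown=None)

print(best_text(with_markdown))
print(best_text(without_markdown))
```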

Page Type Classification

# Fast URL-based heuristic (no HTML needed)
page_type, confidence = rs_trafilatura.classify_url("https://docs.example.com/api")
# ("documentation", 0.9) — or ("article", None) when no pattern matches

# ML classifier with DOM features (higher accuracy)
page_type, confidence = rs_trafilatura.classify_page(
    numeric_features,   # 89 numeric features from the HTML DOM
    "page title text",  # Title + description for TF-IDF
)
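
To make the idea of a URL-only heuristic concrete, here is a small sketch of how such a rule table could work. It is not the implementation behind `classify_url`: the function name `guess_page_type`, the rules, and the confidence values are all assumptions chosen for illustration. The only behaviour taken from the text above is the return shape, a `(page_type, confidence)` pair that falls back to `("article", None)` when nothing matches.

```python
from urllib.parse import urlparse

# Hypothetical rules: (page_type, confidence, host prefixes, path segments)
HYPOTHETICAL_RULES = [
    ("documentation", 0.9, ("docs.",), {"docs", "api", "reference"}),
    ("forum", 0.8, ("forum.", "community."), {"forum", "thread", "topic"}),
    ("product", 0.8, ("shop.", "store."), {"product", "products", "item"}),
]


def guess_page_type(url):
    """Return (page_type, confidence) from URL patterns alone."""
    parsed = urlparse(url)
    host = parsed.netloc.lower()
    segments = {part for part in parsed.path.lower().split("/") if part}
    for page_type, confidence, prefixes, path_hints in HYPOTHETICAL_RULES:
        if host.startswith(prefixes) or segments & path_hints:
            return page_type, confidence
    # No pattern matched: default type with no confidence attached
    return "article", None


print(guess_page_type("https://docs.example.com/api"))
print(guess_page_type("https://example.com/2024/some-story"))
```

A heuristic like this costs almost nothing per URL, which is why it is useful as a first pass before the DOM-based classifier.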

Extraction Quality Prediction

# Predict how reliable an extraction is (for hybrid pipeline routing)
quality = rs_trafilatura.predict_quality(features)  # 27 post-extraction features
# Returns 0.0–1.0. Below 0.80 suggests routing to an LLM fallback.
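
The routing rule described above can be sketched as a small wrapper. This is an illustration under stated assumptions: `extract_with_fallback`, `fake_extract`, and `fake_llm` are hypothetical names, and the extractor is passed in as a callable so the example runs without the package or an LLM client. In real use, `extract` could be something like `lambda h: rs_trafilatura.extract(h)`, with the result's `extraction_quality` field driving the decision.

```python
from types import SimpleNamespace

QUALITY_THRESHOLD = 0.80  # cut-off suggested in the text above


def extract_with_fallback(html, extract, fallback, threshold=QUALITY_THRESHOLD):
    """Return (text, route), where route is "fast" or "fallback"."""
    result = extract(html)
    if result.extraction_quality < threshold:
        # Low confidence: hand the raw HTML to the slower fallback
        return fallback(html), "fallback"
    return result.main_content, "fast"


# Stand-ins so the sketch is runnable on its own (hypothetical behaviour)
def fake_extract(html):
    quality = 0.95 if "<article" in html else 0.40
    return SimpleNamespace(main_content="extracted text", extraction_quality=quality)


def fake_llm(html):
    return "text recovered by the fallback"


print(extract_with_fallback("<article>...</article>", fake_extract, fake_llm))
print(extract_with_fallback("<div>...</div>", fake_extract, fake_llm))
```

Keeping the threshold as a parameter makes it easy to trade cost against quality: a higher value sends more pages to the fallback.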

HTML Cleaning

# Remove scripts, styles, comments, SVGs, iframes — keep content
cleaned = rs_trafilatura.clean_html(raw_html)
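
For readers who want to see what this kind of cleaning involves, the following standard-library sketch drops scripts, styles, SVGs, iframes, and comments while re-emitting everything else. It is not the `html-cleaning` crate's implementation and makes no claim to match its output; `strip_noise` and `DROPPED_TAGS` are hypothetical names used only here.

```python
from html import escape
from html.parser import HTMLParser

DROPPED_TAGS = {"script", "style", "svg", "iframe"}


class _Cleaner(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0  # > 0 while inside a dropped element

    @staticmethod
    def _render(tag, attrs, closing=">"):
        parts = [tag]
        for name, value in attrs:
            parts.append(name if value is None else f'{name}="{escape(value, quote=True)}"')
        return "<" + " ".join(parts) + closing

    def handle_starttag(self, tag, attrs):
        if tag in DROPPED_TAGS:
            self.skip_depth += 1
        elif self.skip_depth == 0:
            self.out.append(self._render(tag, attrs))

    def handle_startendtag(self, tag, attrs):
        if tag not in DROPPED_TAGS and self.skip_depth == 0:
            self.out.append(self._render(tag, attrs, " />"))

    def handle_endtag(self, tag):
        if tag in DROPPED_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif self.skip_depth == 0:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.out.append(escape(data, quote=False))

    # handle_comment is left at its default no-op, so comments are dropped


def strip_noise(html):
    parser = _Cleaner()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


sample = '<div><script>alert(1)</script><p>Hi &amp; bye</p><!-- note --><svg><path d="M0"/></svg></div>'
print(strip_noise(sample))
```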

HTML to Markdown

# Convert HTML to GitHub Flavored Markdown
markdown = rs_trafilatura.html_to_markdown(html)

Framework Integrations

crawl4ai

import asyncio
import json

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from rs_trafilatura.crawl4ai import RsTrafilaturaStrategy

async def main():
    config = CrawlerRunConfig(extraction_strategy=RsTrafilaturaStrategy(output_markdown=True))
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://example.com", config=config)
        # extracted_content is a JSON string; the first entry holds the extraction
        data = json.loads(result.extracted_content)
        print(data[0]["main_content"])

asyncio.run(main())

Scrapy

# settings.py
ITEM_PIPELINES = {
    "rs_trafilatura.scrapy.RsTrafilaturaPipeline": 300,
}
RS_TRAFILATURA_MARKDOWN = True  # optional

# spider.py
def parse(self, response):
    yield {"url": response.url, "body": response.body}
    # Pipeline adds item["extraction"] with title, main_content, page_type, etc.

Firecrawl

from firecrawl import FirecrawlApp
from rs_trafilatura.firecrawl import extract_firecrawl_result

app = FirecrawlApp(api_key="...")
result = app.scrape("https://example.com", formats=["html"])
extracted = extract_firecrawl_result(result)
print(extracted.title, extracted.main_content, extracted.page_type)

Crawlee

from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler
from rs_trafilatura.crawlee import extract_crawlee_context

crawler = BeautifulSoupCrawler()

@crawler.router.default_handler
async def handler(context):
    extracted = extract_crawlee_context(context)
    print(extracted.title, extracted.main_content, extracted.page_type)

Benchmarks

Tested on the Web Content Extraction Benchmark (WCXB) — 1,497 pages across 7 page types:

System	F1	Time per page
rs-trafilatura	0.859	44 ms/page
MinerU-HTML (0.6B)	0.827	1,570 ms/page
Trafilatura (Python)	0.791	94 ms/page
ReaderLM-v2 (1.5B)	0.741	10,410 ms/page
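
The relative speeds implied by the table can be checked with a few lines of arithmetic. The figures below are copied from the table; the variable and function names are only for illustration.

```python
# Time per page in milliseconds, as listed in the table above
latency_ms = {
    "rs-trafilatura": 44,
    "MinerU-HTML (0.6B)": 1570,
    "Trafilatura (Python)": 94,
    "ReaderLM-v2 (1.5B)": 10410,
}


def relative_to(baseline, table):
    """How many times slower each system is than the baseline."""
    base = table[baseline]
    return {name: ms / base for name, ms in table.items()}


ratios = relative_to("rs-trafilatura", latency_ms)
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.1f}x the time per page of rs-trafilatura")

# Throughput of a single worker at 44 ms/page
pages_per_second = 1000 / latency_ms["rs-trafilatura"]
print(f"about {pages_per_second:.1f} pages per second")
```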

Per-page-type F1:

Page Type	F1
Article	0.932
Documentation	0.931
Service	0.843
Forum	0.792
Collection	0.713
Listing	0.704
Product	0.670
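
One thing worth noting when reading these numbers: the unweighted mean of the seven per-type scores is not the same as the overall 0.859. The snippet below computes that unweighted (macro) average from the table. Why the two differ is not stated here; one plausible explanation, offered as an assumption, is that the overall score is averaged over pages and the page types are not equally represented in the benchmark.

```python
# Per-page-type F1 scores copied from the table above
f1_by_type = {
    "Article": 0.932,
    "Documentation": 0.931,
    "Service": 0.843,
    "Forum": 0.792,
    "Collection": 0.713,
    "Listing": 0.704,
    "Product": 0.670,
}

macro_f1 = sum(f1_by_type.values()) / len(f1_by_type)
print(f"unweighted mean over {len(f1_by_type)} page types: {macro_f1:.3f}")
```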

What's Inside

This package bundles four Rust crates compiled into a single Python extension:

Crate	What it does
rs-trafilatura	Content extraction with page-type-aware profiles
web-page-classifier	XGBoost page type classification + quality prediction
html-cleaning	HTML sanitisation and tag removal
quick_html2md	HTML to GFM Markdown conversion

License

MIT OR Apache-2.0

Release history

0.1.1 (this version): Apr 4, 2026
0.1.0: Apr 3, 2026

Download files

Source Distribution

rs_trafilatura-0.1.1.tar.gz (942.2 kB), uploaded Apr 4, 2026

Built Distribution

rs_trafilatura-0.1.1-cp312-cp312-manylinux_2_34_x86_64.whl (3.4 MB), uploaded Apr 4, 2026; CPython 3.12, manylinux: glibc 2.34+ x86-64

File details

Details for the file rs_trafilatura-0.1.1.tar.gz.

File metadata

Download URL: rs_trafilatura-0.1.1.tar.gz
Upload date: Apr 4, 2026
Size: 942.2 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: maturin/1.12.6

File hashes

Hashes for rs_trafilatura-0.1.1.tar.gz
Algorithm	Hash digest
SHA256	`d149bb64482ea5e09909205f98fe701733fa4cf61a2b15c281b8a4f3bf6cea7d`
MD5	`8308e3d6fcc58cd6d5e6f30747d7b365`
BLAKE2b-256	`6643e68dc0e3cea78fe6defbb33924dfaa1572473500859da16f62c3419b73eb`

File details

Details for the file rs_trafilatura-0.1.1-cp312-cp312-manylinux_2_34_x86_64.whl.

File metadata

Download URL: rs_trafilatura-0.1.1-cp312-cp312-manylinux_2_34_x86_64.whl
Upload date: Apr 4, 2026
Size: 3.4 MB
Tags: CPython 3.12, manylinux: glibc 2.34+ x86-64
Uploaded using Trusted Publishing? No
Uploaded via: maturin/1.12.6

File hashes

Hashes for rs_trafilatura-0.1.1-cp312-cp312-manylinux_2_34_x86_64.whl
Algorithm	Hash digest
SHA256	`58eb83eb86b0629c7177c1c9a406ebbe75594d15689774cd0a6a41d4259ae08a`
MD5	`7162113c75afecb9f2948e8e422c3e1f`
BLAKE2b-256	`6f1e3bc4fe0b3c49f5d1b8f6097ff62839ffc763f036de1cb82ea805187329bd`
