rs-trafilatura
Fast web content extraction, page type classification, HTML cleaning, and Markdown conversion for Python — powered by Rust.
rs-trafilatura is a Python package built with PyO3 that wraps four Rust crates into a single pip install. It extracts the main content from web pages, classifies page types, predicts extraction quality, cleans HTML, and converts HTML to Markdown — all at native Rust speed.
Why rs-trafilatura?
- Fast: Rust-native extraction at ~44 ms per page on commodity hardware, roughly 36x faster than MinerU-HTML and over 200x faster than ReaderLM-v2 on the benchmark below
- Accurate: F1 0.859 across 7 page types on the Web Content Extraction Benchmark, outperforming Trafilatura (0.791), MinerU-HTML (0.827), and ReaderLM-v2 (0.741)
- Page-type aware: XGBoost classifier detects articles, forums, products, collections, listings, documentation, and service pages — then applies type-specific extraction profiles
- Quality scoring: ML-based confidence predictor (0.0–1.0) tells you when extraction might be unreliable, enabling hybrid pipelines with LLM fallback
- Framework adapters: Drop-in integrations for crawl4ai, Scrapy, Firecrawl, and Crawlee
Install
pip install rs-trafilatura
Quick Start
import rs_trafilatura

# Placeholder input: any HTML document as a str
html = "<html><head><title>Example</title></head><body><article><p>Hello, world.</p></article></body></html>"

# Extract main content from HTML
result = rs_trafilatura.extract(html, url="https://example.com")
print(result.title)               # Page title
print(result.main_content)        # Clean extracted text
print(result.page_type)           # article, forum, product, etc.
print(result.extraction_quality)  # 0.0–1.0 confidence score
API Reference
Content Extraction
# From a string
result = rs_trafilatura.extract(
    html,
    url="https://example.com",  # URL for page type classification
    page_type="product",        # Force a page type (bypasses classifier)
    favor_precision=True,       # Stricter filtering, less noise
    favor_recall=False,         # More inclusive extraction
    include_tables=True,        # Include table content
    include_images=True,        # Extract image metadata
    include_comments=False,     # Include comment sections
    output_markdown=True,       # Generate Markdown in content_markdown
)

# From raw bytes (auto-detects encoding)
result = rs_trafilatura.extract_bytes(
    response_bytes,
    url="https://example.com",
    output_markdown=True,
)
ExtractResult fields:
| Field | Type | Description |
|---|---|---|
| title | str \| None | Page title |
| author | str \| None | Author name |
| date | str \| None | Publication date (ISO 8601) |
| main_content | str | Extracted main content as plain text |
| content_markdown | str \| None | Markdown output (when output_markdown=True) |
| content_html | str \| None | Extracted content as HTML |
| page_type | str \| None | Detected page type |
| extraction_quality | float | Confidence score (0.0–1.0) |
| classification_confidence | float \| None | Page type classifier confidence |
| language | str \| None | Detected language |
| sitename | str \| None | Site name |
| description | str \| None | Meta description |
| images | list[ImageData] | Extracted images with src, alt, caption |
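The fields above map naturally onto a flat record for storage or logging. The sketch below is one way to do that, assuming only the attribute names listed in the table. `result_to_record` is a hypothetical helper, not part of rs-trafilatura, and a `SimpleNamespace` stands in for a real `ExtractResult` so the snippet runs without the package installed.

```python
import json
from types import SimpleNamespace

# Scalar attribute names taken from the ExtractResult table above.
SCALAR_FIELDS = (
    "title", "author", "date", "main_content", "content_markdown",
    "content_html", "page_type", "extraction_quality",
    "classification_confidence", "language", "sitename", "description",
)


def result_to_record(result):
    """Copy the documented fields of an extraction result into a plain dict."""
    # Hypothetical helper: missing attributes become None instead of raising.
    record = {name: getattr(result, name, None) for name in SCALAR_FIELDS}
    # Images are documented as objects exposing src, alt and caption.
    record["images"] = [
        {"src": img.src, "alt": img.alt, "caption": img.caption}
        for img in (getattr(result, "images", None) or [])
    ]
    return record


# Stand-in object for demonstration only; real code would pass the value
# returned by rs_trafilatura.extract(...).
fake = SimpleNamespace(
    title="Example",
    main_content="Hello",
    extraction_quality=0.93,
    images=[SimpleNamespace(src="a.png", alt="A", caption=None)],
)
line = json.dumps(result_to_record(fake))  # one JSON line per page
```

Writing one JSON line per page keeps the optional fields explicit as `null`, which makes it easier to spot pages where, say, no author or date was found.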
Page Type Classification
# Fast URL-based heuristic (no HTML needed)
page_type, confidence = rs_trafilatura.classify_url("https://docs.example.com/api")
# ("documentation", 0.9) — or ("article", None) when no pattern matches
# ML classifier with DOM features (higher accuracy)
page_type, confidence = rs_trafilatura.classify_page(
    numeric_features,   # 89 numeric features from the HTML DOM
    "page title text",  # Title + description for TF-IDF
)
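The two classifiers can be chained: try the cheap URL heuristic first and only run the ML model when the URL gives no confident answer. The sketch below shows that cascade under some assumptions: `classify_cascade` and its `min_confidence` default are hypothetical, the classifiers are passed in as callables, and the two `fake_*` functions are stand-ins that only mimic the return shapes shown in the comments above.

```python
def classify_cascade(url, classify_url, classify_with_model, min_confidence=0.8):
    """Return (page_type, confidence, source), trying the URL heuristic first."""
    page_type, confidence = classify_url(url)
    # Confidence may be None when no URL pattern matched.
    if confidence is not None and confidence >= min_confidence:
        return page_type, confidence, "url"
    # No confident URL match, so fall through to the DOM-based model.
    page_type, confidence = classify_with_model()
    return page_type, confidence, "model"


# Stand-ins for demonstration only (not the library's behaviour).
def fake_classify_url(url):
    return ("documentation", 0.9) if "docs." in url else ("article", None)


def fake_model():
    return ("product", 0.72)


from_url = classify_cascade("https://docs.example.com/api", fake_classify_url, fake_model)
from_model = classify_cascade("https://example.com/item/42", fake_classify_url, fake_model)
```

In real use, `classify_url` would be `rs_trafilatura.classify_url` and `classify_with_model` a small closure around `rs_trafilatura.classify_page(numeric_features, title_text)`.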
Extraction Quality Prediction
# Predict how reliable an extraction is (for hybrid pipeline routing)
quality = rs_trafilatura.predict_quality(features) # 27 post-extraction features
# Returns 0.0–1.0. Below 0.80 suggests routing to an LLM fallback.
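The 0.80 cut-off above translates into a simple routing step for a hybrid pipeline. This is a minimal sketch, assuming quality scores have already been computed per page; `split_by_quality` and the sample URLs and scores are illustrative, not part of the package or real measurements.

```python
def split_by_quality(scored_pages, threshold=0.80):
    """Partition (url, quality) pairs into accepted and fallback lists."""
    accepted, fallback = [], []
    for url, quality in scored_pages:
        # Scores below the threshold are routed to the slower fallback.
        (accepted if quality >= threshold else fallback).append(url)
    return accepted, fallback


# Made-up scores for illustration.
pages = [
    ("https://example.com/a", 0.95),
    ("https://example.com/b", 0.62),
    ("https://example.com/c", 0.80),
]
accepted, fallback = split_by_quality(pages)
```

Only the pages in `fallback` would be sent to an LLM-based extractor, so the expensive path is paid for a fraction of the crawl rather than every page.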
HTML Cleaning
# Remove scripts, styles, comments, SVGs, iframes — keep content
cleaned = rs_trafilatura.clean_html(raw_html)
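To make concrete what "remove scripts, styles, comments, SVGs, iframes" means, here is a toy cleaner built on Python's standard `html.parser`. It is not how the bundled html-cleaning crate works (that is Rust, and its rules are not described here); `ToyCleaner` and `toy_clean` are hypothetical names, and the sketch ignores edge cases such as unclosed tags.

```python
from html.parser import HTMLParser

# Elements whose whole subtree is discarded in this toy version.
DROPPED = {"script", "style", "svg", "iframe"}


class ToyCleaner(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=False)  # keep entities as written
        self.out = []
        self.skip = 0  # > 0 while inside a dropped element

    def handle_starttag(self, tag, attrs):
        if tag in DROPPED:
            self.skip += 1
        elif not self.skip:
            self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags never open a subtree, so skip is left unchanged.
        if tag not in DROPPED and not self.skip:
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag in DROPPED:
            self.skip = max(0, self.skip - 1)
        elif not self.skip:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skip:
            self.out.append(data)

    def handle_entityref(self, name):
        if not self.skip:
            self.out.append(f"&{name};")

    def handle_charref(self, name):
        if not self.skip:
            self.out.append(f"&#{name};")

    # handle_comment is deliberately not overridden, so comments vanish.


def toy_clean(html):
    parser = ToyCleaner()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


sample = '<p>Hi<script>x()</script><!-- c --> &amp; bye</p><svg><path d="M0"/></svg>'
cleaned_sample = toy_clean(sample)
```

The content-bearing markup survives while the non-content elements and the comment are dropped, which is the same contract `clean_html` describes.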
HTML to Markdown
# Convert HTML to GitHub Flavored Markdown
markdown = rs_trafilatura.html_to_markdown(html)
Framework Integrations
crawl4ai
import asyncio
import json

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from rs_trafilatura.crawl4ai import RsTrafilaturaStrategy

config = CrawlerRunConfig(extraction_strategy=RsTrafilaturaStrategy(output_markdown=True))

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://example.com", config=config)
        data = json.loads(result.extracted_content)
        print(data[0]["main_content"])

asyncio.run(main())
Scrapy
# settings.py
ITEM_PIPELINES = {
    "rs_trafilatura.scrapy.RsTrafilaturaPipeline": 300,
}
RS_TRAFILATURA_MARKDOWN = True  # optional

# spider.py
def parse(self, response):
    yield {"url": response.url, "body": response.body}
    # Pipeline adds item["extraction"] with title, main_content, page_type, etc.
Firecrawl
from firecrawl import FirecrawlApp
from rs_trafilatura.firecrawl import extract_firecrawl_result
app = FirecrawlApp(api_key="...")
result = app.scrape("https://example.com", formats=["html"])
extracted = extract_firecrawl_result(result)
print(extracted.title, extracted.main_content, extracted.page_type)
Crawlee
from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler
from rs_trafilatura.crawlee import extract_crawlee_context

crawler = BeautifulSoupCrawler()

@crawler.router.default_handler
async def handler(context):
    extracted = extract_crawlee_context(context)
    print(extracted.title, extracted.main_content, extracted.page_type)
Benchmarks
Tested on the Web Content Extraction Benchmark (WCXB) — 1,497 pages across 7 page types:
| System | F1 | Speed |
|---|---|---|
| rs-trafilatura | 0.859 | 44 ms/page |
| MinerU-HTML (0.6B) | 0.827 | 1,570 ms/page |
| Trafilatura (Python) | 0.791 | 94 ms/page |
| ReaderLM-v2 (1.5B) | 0.741 | 10,410 ms/page |
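The relative speeds follow directly from the Speed column. The short calculation below derives them from the table's own numbers; the pages-per-second figure assumes strictly sequential processing on one worker, which is an assumption of this sketch rather than something the benchmark states.

```python
# ms/page values copied from the benchmark table above.
ms_per_page = {
    "rs-trafilatura": 44,
    "MinerU-HTML (0.6B)": 1570,
    "Trafilatura (Python)": 94,
    "ReaderLM-v2 (1.5B)": 10410,
}

baseline = ms_per_page["rs-trafilatura"]

# How many times slower each system is than rs-trafilatura.
slowdown = {name: round(ms / baseline, 1) for name, ms in ms_per_page.items()}

# Sequential throughput implied by 44 ms/page (single worker assumed).
pages_per_second = round(1000 / baseline, 1)

for name, factor in slowdown.items():
    print(f"{name}: {factor}x")
```

This gives roughly 2.1x for Trafilatura (Python), 35.7x for MinerU-HTML, and 236.6x for ReaderLM-v2, with about 22.7 pages per second for rs-trafilatura on a single worker.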
Per-page-type F1:
| Page Type | F1 |
|---|---|
| Article | 0.932 |
| Documentation | 0.931 |
| Service | 0.843 |
| Forum | 0.792 |
| Collection | 0.713 |
| Listing | 0.704 |
| Product | 0.670 |
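A quick check on how these per-type scores relate to the overall 0.859: their unweighted mean is noticeably lower, so the headline figure is evidently not a plain average over page types. One plausible reading is that it is computed over all pages, with the better-scoring types contributing more pages, but the aggregation is not stated here, so treat that as an assumption.

```python
# Per-page-type F1 values copied from the table above.
f1_by_type = {
    "Article": 0.932,
    "Documentation": 0.931,
    "Service": 0.843,
    "Forum": 0.792,
    "Collection": 0.713,
    "Listing": 0.704,
    "Product": 0.670,
}

# Unweighted (macro) mean across the seven page types.
macro_f1 = round(sum(f1_by_type.values()) / len(f1_by_type), 3)

# Gap between the reported overall F1 and the unweighted mean.
gap = round(0.859 - macro_f1, 3)

print(macro_f1, gap)
```

The unweighted mean comes out at about 0.798, roughly 0.061 below the overall score, which is worth keeping in mind when a crawl is dominated by product or listing pages.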
What's Inside
This package bundles four Rust crates compiled into a single Python extension:
| Crate | What it does |
|---|---|
| rs-trafilatura | Content extraction with page-type-aware profiles |
| web-page-classifier | XGBoost page type classification + quality prediction |
| html-cleaning | HTML sanitisation and tag removal |
| quick_html2md | HTML to GFM Markdown conversion |
Links
- Website: webcontentextraction.org
- Benchmark: GitHub
- Rust crate: crates.io/crates/rs-trafilatura
- Author: Murrough Foley · LinkedIn · ORCID
License
MIT OR Apache-2.0
File details
Details for the file rs_trafilatura-0.1.1.tar.gz.
File metadata
- Download URL: rs_trafilatura-0.1.1.tar.gz
- Upload date:
- Size: 942.2 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: maturin/1.12.6
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | d149bb64482ea5e09909205f98fe701733fa4cf61a2b15c281b8a4f3bf6cea7d |
| MD5 | 8308e3d6fcc58cd6d5e6f30747d7b365 |
| BLAKE2b-256 | 6643e68dc0e3cea78fe6defbb33924dfaa1572473500859da16f62c3419b73eb |
File details
Details for the file rs_trafilatura-0.1.1-cp312-cp312-manylinux_2_34_x86_64.whl.
File metadata
- Download URL: rs_trafilatura-0.1.1-cp312-cp312-manylinux_2_34_x86_64.whl
- Upload date:
- Size: 3.4 MB
- Tags: CPython 3.12, manylinux: glibc 2.34+ x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: maturin/1.12.6
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 58eb83eb86b0629c7177c1c9a406ebbe75594d15689774cd0a6a41d4259ae08a |
| MD5 | 7162113c75afecb9f2948e8e422c3e1f |
| BLAKE2b-256 | 6f1e3bc4fe0b3c49f5d1b8f6097ff62839ffc763f036de1cb82ea805187329bd |