rs-trafilatura

Fast web content extraction, page type classification, HTML cleaning, and Markdown conversion for Python — powered by Rust.

rs-trafilatura is a Python package built with PyO3 that wraps four Rust crates into a single pip install. It extracts the main content from web pages, classifies page types, predicts extraction quality, cleans HTML, and converts HTML to Markdown — all at native Rust speed.

Why rs-trafilatura?

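  • Fast: Rust-native extraction at ~44 ms per page on commodity hardware, about 36x faster than MinerU-HTML (1,570 ms/page) and about 237x faster than ReaderLM-v2 (10,410 ms/page) on the benchmark reported under Benchmarks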
  • Accurate: F1 0.859 across 7 page types on the Web Content Extraction Benchmark, outperforming Trafilatura (0.791), MinerU-HTML (0.827), and ReaderLM-v2 (0.741)
  • Page-type aware: XGBoost classifier detects articles, forums, products, collections, listings, documentation, and service pages — then applies type-specific extraction profiles
  • Quality scoring: ML-based confidence predictor (0.0–1.0) tells you when extraction might be unreliable, enabling hybrid pipelines with LLM fallback
  • Framework adapters: Drop-in integrations for crawl4ai, Scrapy, Firecrawl, and Crawlee

Install

pip install rs-trafilatura

Quick Start

import rs_trafilatura

# Extract main content from HTML
result = rs_trafilatura.extract(html, url="https://example.com")
print(result.title)                # Page title
print(result.main_content)         # Clean extracted text
print(result.page_type)            # article, forum, product, etc.
print(result.extraction_quality)   # 0.0–1.0 confidence score

API Reference

Content Extraction

# From a string
result = rs_trafilatura.extract(
    html,
    url="https://example.com",      # URL for page type classification
    page_type="product",             # Force a page type (bypasses classifier)
    favor_precision=True,            # Stricter filtering, less noise
    favor_recall=False,              # More inclusive extraction
    include_tables=True,             # Include table content
    include_images=True,             # Extract image metadata
    include_comments=False,          # Include comment sections
    output_markdown=True,            # Generate Markdown in content_markdown
)

# From raw bytes (auto-detects encoding)
result = rs_trafilatura.extract_bytes(
    response_bytes,
    url="https://example.com",
    output_markdown=True,
)
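
Since `extract` takes a decoded string and `extract_bytes` takes raw bytes, a crawler that receives both kinds of payload may want a single entry point. The following is a minimal sketch, not part of the package: `extract_any` and its `extractor` parameter are hypothetical names, and the only assumption about the package is that it exposes the two functions shown above.

```python
from typing import Any, Union


def extract_any(extractor: Any, payload: Union[str, bytes, bytearray], **options: Any) -> Any:
    """Dispatch to extract() for text and extract_bytes() for raw bytes.

    `extractor` is any object exposing extract(html, **kw) and
    extract_bytes(data, **kw); in real use that would be the
    rs_trafilatura module itself (hypothetical helper, not library API).
    """
    if isinstance(payload, (bytes, bytearray)):
        # Raw bytes: leave encoding detection to extract_bytes
        return extractor.extract_bytes(bytes(payload), **options)
    if isinstance(payload, str):
        # Already-decoded text
        return extractor.extract(payload, **options)
    raise TypeError(f"expected str or bytes, got {type(payload).__name__}")
```

With the package installed, a call might look like `extract_any(rs_trafilatura, response_bytes, url="https://example.com")`.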

ExtractResult fields:

Field                      Type             Description
-------------------------  ---------------  ------------------------------------------
title                      str | None       Page title
author                     str | None       Author name
date                       str | None       Publication date (ISO 8601)
main_content               str              Extracted main content as plain text
content_markdown           str | None       Markdown output (when output_markdown=True)
content_html               str | None       Extracted content as HTML
page_type                  str | None       Detected page type
extraction_quality         float            Confidence score (0.0–1.0)
classification_confidence  float | None     Page type classifier confidence
language                   str | None       Detected language
sitename                   str | None       Site name
description                str | None       Meta description
images                     list[ImageData]  Extracted images with src, alt, caption
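
To store results as JSON lines, the fields listed above can be copied into a plain dict. This is a sketch under assumptions: `result_to_record` is a hypothetical helper, the attribute names are taken from the table, and a `SimpleNamespace` stands in for a real `ExtractResult` so the snippet runs without the package installed.

```python
import json
from types import SimpleNamespace
from typing import Any

# Scalar field names copied from the ExtractResult table.
SCALAR_FIELDS = (
    "title", "author", "date", "main_content", "content_markdown",
    "content_html", "page_type", "extraction_quality",
    "classification_confidence", "language", "sitename", "description",
)


def result_to_record(result: Any) -> dict:
    """Copy the documented fields of a result-like object into a dict."""
    record = {name: getattr(result, name, None) for name in SCALAR_FIELDS}
    # Images are flattened to dicts with the three documented attributes.
    record["images"] = [
        {key: getattr(image, key, None) for key in ("src", "alt", "caption")}
        for image in (getattr(result, "images", None) or [])
    ]
    return record


# Stand-in object for illustration only; a real ExtractResult would be used instead.
fake = SimpleNamespace(
    title="Hello",
    main_content="Body text",
    page_type="article",
    extraction_quality=0.93,
    images=[SimpleNamespace(src="a.png", alt="A", caption=None)],
)
line = json.dumps(result_to_record(fake))
```

Fields missing on the object come out as `None`, which mirrors the `| None` types in the table.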

Page Type Classification

# Fast URL-based heuristic (no HTML needed)
page_type, confidence = rs_trafilatura.classify_url("https://docs.example.com/api")
# ("documentation", 0.9) — or ("article", None) when no pattern matches

# ML classifier with DOM features (higher accuracy)
page_type, confidence = rs_trafilatura.classify_page(
    numeric_features,   # 89 numeric features from the HTML DOM
    "page title text",  # Title + description for TF-IDF
)
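
Because `classify_url` can return `None` as its confidence, a caller typically needs a rule for when to trust the URL heuristic and when to run the ML classifier. One possible rule is sketched below. `pick_page_type`, the `ml_classify` callable and the 0.8 cut-off are all assumptions made for illustration, not part of the package.

```python
from typing import Callable, Optional, Tuple

Guess = Tuple[str, Optional[float]]


def pick_page_type(
    url_guess: Guess,
    ml_classify: Callable[[], Guess],
    min_confidence: float = 0.8,  # assumed cut-off, tune for your data
) -> Guess:
    """Keep the URL heuristic when it is confident, otherwise ask the ML classifier.

    `url_guess` is the (page_type, confidence) pair returned by the URL
    heuristic; `ml_classify` is a zero-argument callable that runs the
    slower DOM-based classifier only when needed.
    """
    page_type, confidence = url_guess
    if confidence is not None and confidence >= min_confidence:
        return page_type, confidence
    # No pattern matched (confidence is None) or the match was weak.
    return ml_classify()
```

In real use, `ml_classify` could be something like `lambda: rs_trafilatura.classify_page(numeric_features, title_text)`, so the DOM features are only needed on the fallback path.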

Extraction Quality Prediction

# Predict how reliable an extraction is (for hybrid pipeline routing)
quality = rs_trafilatura.predict_quality(features)  # 27 post-extraction features
# Returns 0.0–1.0. Below 0.80 suggests routing to an LLM fallback.
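
The 0.80 cut-off mentioned above translates into a small routing rule. The sketch below is illustrative only: `route` and the two labels are hypothetical names, and the threshold is kept as a parameter because the right value depends on the workload.

```python
LLM_FALLBACK_THRESHOLD = 0.80  # cut-off suggested in the comment above


def route(quality: float, threshold: float = LLM_FALLBACK_THRESHOLD) -> str:
    """Return "accept" for reliable extractions and "llm_fallback" otherwise."""
    if not 0.0 <= quality <= 1.0:
        raise ValueError(f"quality must be within [0.0, 1.0], got {quality}")
    # Scores strictly below the threshold are sent to the slower fallback.
    return "accept" if quality >= threshold else "llm_fallback"


def split_by_quality(scores, threshold: float = LLM_FALLBACK_THRESHOLD):
    """Partition an iterable of scores into (accepted, fallback) lists."""
    accepted, fallback = [], []
    for score in scores:
        (accepted if route(score, threshold) == "accept" else fallback).append(score)
    return accepted, fallback
```

The same rule applies to the `extraction_quality` field of an extraction result, for example `route(result.extraction_quality)`.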

HTML Cleaning

# Remove scripts, styles, comments, SVGs, iframes — keep content
cleaned = rs_trafilatura.clean_html(raw_html)

HTML to Markdown

# Convert HTML to GitHub Flavored Markdown
markdown = rs_trafilatura.html_to_markdown(html)

Framework Integrations

crawl4ai

import asyncio
import json

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from rs_trafilatura.crawl4ai import RsTrafilaturaStrategy

config = CrawlerRunConfig(extraction_strategy=RsTrafilaturaStrategy(output_markdown=True))

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://example.com", config=config)
        # extracted_content is a JSON string; the first entry holds the extraction
        data = json.loads(result.extracted_content)
        print(data[0]["main_content"])

asyncio.run(main())

Scrapy

# settings.py
ITEM_PIPELINES = {
    "rs_trafilatura.scrapy.RsTrafilaturaPipeline": 300,
}
RS_TRAFILATURA_MARKDOWN = True  # optional

# spider.py
def parse(self, response):
    yield {"url": response.url, "body": response.body}
    # Pipeline adds item["extraction"] with title, main_content, page_type, etc.

Firecrawl

from firecrawl import FirecrawlApp
from rs_trafilatura.firecrawl import extract_firecrawl_result

app = FirecrawlApp(api_key="...")
result = app.scrape("https://example.com", formats=["html"])
extracted = extract_firecrawl_result(result)
print(extracted.title, extracted.main_content, extracted.page_type)

Crawlee

from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler
from rs_trafilatura.crawlee import extract_crawlee_context

crawler = BeautifulSoupCrawler()

@crawler.router.default_handler
async def handler(context):
    extracted = extract_crawlee_context(context)
    print(extracted.title, extracted.main_content, extracted.page_type)

Benchmarks

Tested on the Web Content Extraction Benchmark (WCXB) — 1,497 pages across 7 page types:

System                F1     Speed
--------------------  -----  --------------
rs-trafilatura        0.859  44 ms/page
MinerU-HTML (0.6B)    0.827  1,570 ms/page
Trafilatura (Python)  0.791  94 ms/page
ReaderLM-v2 (1.5B)    0.741  10,410 ms/page
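
The relative speeds follow directly from the per-page latencies in the table. The short calculation below reproduces them; the only assumption is that pages are processed one at a time, so throughput is simply the inverse of latency.

```python
# Per-page latency in milliseconds, copied from the benchmark table.
MS_PER_PAGE = {
    "rs-trafilatura": 44,
    "MinerU-HTML (0.6B)": 1570,
    "Trafilatura (Python)": 94,
    "ReaderLM-v2 (1.5B)": 10410,
}

baseline = MS_PER_PAGE["rs-trafilatura"]

# Latency ratio of each other system relative to rs-trafilatura.
speedups = {
    name: ms / baseline
    for name, ms in MS_PER_PAGE.items()
    if name != "rs-trafilatura"
}

# Sequential throughput implied by the baseline latency.
pages_per_second = 1000 / baseline

for name, ratio in speedups.items():
    print(f"{name}: {ratio:.1f}x the latency of rs-trafilatura")
print(f"rs-trafilatura: about {pages_per_second:.1f} pages/s when run sequentially")
```

The ratio against MinerU-HTML rounds to 36, which is where the "36x" figure comes from; against ReaderLM-v2 it is roughly 237, and against the Python Trafilatura about 2.1.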

Per-page-type F1:

Page Type      F1
-------------  -----
Article        0.932
Documentation  0.931
Service        0.843
Forum          0.792
Collection     0.713
Listing        0.704
Product        0.670

What's Inside

This package bundles four Rust crates compiled into a single Python extension:

Crate                What it does
-------------------  -----------------------------------------------------
rs-trafilatura       Content extraction with page-type-aware profiles
web-page-classifier  XGBoost page type classification + quality prediction
html-cleaning        HTML sanitisation and tag removal
quick_html2md        HTML to GFM Markdown conversion

License

MIT OR Apache-2.0
