
A comprehensive benchmark for web main content extraction


WebMainBench


WebMainBench is a specialized benchmark tool for end-to-end evaluation of web main content extraction quality.

Features

🎯 Core Features

  • Multiple Extractor Support: Supports various extraction tools such as trafilatura, resiliparse, and more
  • Comprehensive Evaluation Metrics: Includes multi-dimensional metrics such as text edit distance, table structure similarity (TEDS), formula extraction quality, etc.
  • Manual Annotation Support: 100% manually annotated evaluation dataset

Metric Details

| Metric Name | Calculation Method | Value Range | Description |
|---|---|---|---|
| overall | Average of all successful metrics | 0.0-1.0 | Comprehensive quality score, higher is better |
| text_edit | 1 - (edit distance / max text length) | 0.0-1.0 | Plain text similarity, higher is better |
| code_edit | 1 - (edit distance / max code length) | 0.0-1.0 | Code content similarity, higher is better |
| table_TEDS | 1 - (tree edit distance / max nodes) | 0.0-1.0 | Table structure similarity, higher is better |
| table_edit | 1 - (edit distance / max table length) | 0.0-1.0 | Table content similarity, higher is better |
| formula_edit | 1 - (edit distance / max formula length) | 0.0-1.0 | Formula content similarity, higher is better |
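The `*_edit` formulas above can be sketched in plain Python. This is an illustrative reimplementation of `1 - (edit distance / max length)`, not WebMainBench's internal code; `levenshtein` and `edit_similarity` are names chosen here for the example.

```python
def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance between two strings.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def edit_similarity(predicted: str, groundtruth: str) -> float:
    # 1 - (edit distance / max text length); two empty strings count as identical.
    if not predicted and not groundtruth:
        return 1.0
    dist = levenshtein(predicted, groundtruth)
    return 1.0 - dist / max(len(predicted), len(groundtruth))
```

The same normalization applies to code, table, and formula spans; only the input text differs per metric.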

🏗️ System Architecture

WebMainBench Architecture

🔧 Core Modules

  1. data module: Read/write management of evaluation sets and results
  2. extractors module: Unified interface for various extraction tools
  3. metrics module: Implementation of evaluation metrics calculation
  4. evaluator module: Execution and result output of evaluation tasks

Quick Start

Installation

# Basic installation
pip install webmainbench

# Install with all optional dependencies
pip install webmainbench[all]

# Development environment installation
pip install webmainbench[dev]

Basic Usage

from webmainbench import DataLoader, Evaluator, ExtractorFactory

# 1. Load evaluation dataset
dataset = DataLoader.load_jsonl("your_dataset.jsonl")

# 2. Create extractor
extractor = ExtractorFactory.create("trafilatura")

# 3. Run evaluation
evaluator = Evaluator()
result = evaluator.evaluate(dataset, extractor)

# 4. View results
print(f"Overall Score: {result.overall_metrics['overall']:.4f}")

Data Format

Evaluation datasets should contain the following fields:

{
  "track_id": "0b7f2636-d35f-40bf-9b7f-94be4bcbb396",
  "html": "<html><body><h1 cc-select=\"true\">This is a title</h1></body></html>",   # Manually annotated with cc-select="true" attribute
  "url": "https://orderyourbooks.com/product-category/college-books-p-u/?products-per-page=all",
  "main_html": "<h1 cc-select=\"true\">This is a title</h1>",  # Main content HTML pruned from html
  "convert_main_content": "# This is a title",  # Converted from main_html + html2text
  "groundtruth_content": "# This is a title",  # Manually calibrated markdown (partially provided)
  "meta": {
    "language": "en",  # Web page language
    "style": "article",  # Web page style
    "table": [],  # [], ["layout"], ["data"], ["layout", "data"]
    "equation": [],  # [], ["inline"], ["interline"], ["inline", "interline"]
    "code": [],  # [], ["inline"], ["interline"], ["inline", "interline"]
    "level": "mid"  # simple, mid, hard
  }
}
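Before running an evaluation, records like the one above can be sanity-checked. This is a minimal stand-alone sketch using only the standard library; the field names come from the example (note `groundtruth_content` is only partially provided, so it is not required here), and `validate_record` is a name invented for this illustration.

```python
import json

# Fields every record in the example format is expected to carry.
REQUIRED_FIELDS = {"track_id", "html", "url", "main_html", "meta"}

def validate_record(line: str) -> dict:
    """Parse one JSONL line and check the fields shown in the example above."""
    record = json.loads(line)
    missing = REQUIRED_FIELDS - record.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    if record["meta"].get("level") not in {"simple", "mid", "hard"}:
        raise ValueError(f"unexpected difficulty level: {record['meta'].get('level')!r}")
    return record
```

Running this over each line of the dataset file catches malformed records before they reach the evaluator.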

Supported Extractors

  • trafilatura: trafilatura extractor
  • resiliparse: resiliparse extractor
  • mineru-html: mineru-html extractor
  • magic-html: magic-html extractor
  • Custom extractors: Implement by inheriting from BaseExtractor

Evaluation Leaderboard

| extractor | extractor_version | dataset | total_samples | overall (macro avg) | code_edit | formula_edit | table_TEDS | table_edit | text_edit |
|---|---|---|---|---|---|---|---|---|---|
| mineru-html | 4.1.1 | WebMainBench1.0 | 545 | 0.8256 | 0.9093 | 0.9399 | 0.7388 | 0.678 | 0.8621 |
| magic-html | 0.1.5 | WebMainBench1.0 | 545 | 0.5141 | 0.4117 | 0.7204 | 0.3984 | 0.2611 | 0.7791 |
| trafilatura_md | 2.0.0 | WebMainBench1.0 | 545 | 0.3858 | 0.1305 | 0.6242 | 0.3203 | 0.1653 | 0.6887 |
| trafilatura_txt | 2.0.0 | WebMainBench1.0 | 545 | 0.2657 | 0 | 0.6162 | 0 | 0 | 0.7126 |
| resiliparse | 0.14.5 | WebMainBench1.0 | 545 | 0.2954 | 0.0641 | 0.6747 | 0 | 0 | 0.7381 |

Advanced Features

Multi-Extractor Comparison

# Compare multiple extractors
extractors = ["trafilatura", "resiliparse"]
results = evaluator.compare_extractors(dataset, extractors)

for name, result in results.items():
    print(f"{name}: {result.overall_metrics['overall']:.4f}")

Detailed Example

python examples/multi_extractor_compare.py

This example demonstrates how to:

  1. Load test dataset: Use sample data containing multiple content types such as code, formulas, tables, text, etc.
  2. Create multiple extractors:
    • magic-html: Extractor based on magic-html library
    • trafilatura: Extractor based on trafilatura library
    • resiliparse: Extractor based on resiliparse library
  3. Batch evaluation comparison: Use evaluator.compare_extractors() to evaluate all extractors simultaneously
  4. Generate comparison report: Automatically save evaluation results in multiple formats

Output File Description

After evaluation is complete, three important files will be generated in the results/ directory:

| File Name | Format | Content Description |
|---|---|---|
| leaderboard.csv | CSV | Leaderboard file: contains overall rankings and sub-metric comparisons for each extractor, for quick performance comparison |
| evaluation_results.json | JSON | Detailed evaluation results: contains complete evaluation data, metric details, and metadata for each extractor |
| dataset_with_results.jsonl | JSONL | Enhanced dataset: original test data plus extraction results from all extractors, for manual inspection and analysis |

leaderboard.csv content example:

extractor,dataset,total_samples,success_rate,overall,code_edit,formula_edit,table_TEDS,table_edit,text_edit
magic-html,sample_dataset,4,1.0,0.1526,0.1007,0.0,0.0,0.0,0.6624
resiliparse,sample_dataset,4,1.0,0.1379,0.0,0.0,0.0,0.0,0.6897
trafilatura,sample_dataset,4,1.0,0.1151,0.1007,0.0,0.0,0.0,0.4746
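A leaderboard file in the format above can be post-processed with the standard `csv` module. This is a sketch assuming only the column names shown in the example; `rank_extractors` is a name invented here, not a WebMainBench API.

```python
import csv
import io

def rank_extractors(csv_text: str) -> list[dict]:
    """Parse leaderboard CSV text and sort rows by the 'overall' column, best first."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    return sorted(rows, key=lambda row: float(row["overall"]), reverse=True)
```

For a file on disk, read `results/leaderboard.csv` into a string first, then inspect the returned rows (each is a column-name-to-value dict).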

Custom Metrics

from webmainbench.metrics import BaseMetric, MetricResult

class CustomMetric(BaseMetric):
    def _setup(self):
        pass
    
    def _calculate_score(self, predicted, groundtruth, **kwargs):
        # Implement custom evaluation logic
        score = your_calculation(predicted, groundtruth)
        return MetricResult(
            metric_name=self.name,
            score=score,
            details={"custom_info": "value"}
        )

# Add to evaluator
evaluator.metric_calculator.add_metric("custom", CustomMetric("custom"))

Custom Extractors

from webmainbench.extractors import BaseExtractor, ExtractionResult

class MyExtractor(BaseExtractor):
    def _setup(self):
        # Initialize extractor
        pass
    
    def _extract_content(self, html, url=None):
        # Implement extraction logic
        content = your_extraction_logic(html)
        
        return ExtractionResult(
            content=content,
            content_list=[...],
            success=True
        )

# Register custom extractor
ExtractorFactory.register("my-extractor", MyExtractor)

Project Architecture

webmainbench/
├── data/           # Data processing module
│   ├── dataset.py  # Dataset class
│   ├── loader.py   # Data loader
│   └── saver.py    # Data saver
├── extractors/     # Extractor module
│   ├── base.py     # Base interface
│   ├── factory.py  # Factory pattern
│   └── ...         # Specific implementations
├── metrics/        # Metrics module
│   ├── base.py     # Base interface
│   ├── text_metrics.py    # Text metrics
│   ├── table_metrics.py   # Table metrics
│   └── calculator.py      # Metric calculator
├── evaluator/      # Evaluator module
│   └── evaluator.py       # Main evaluator
└── utils/          # Utility module
    └── helpers.py          # Helper functions

License

This project is licensed under the MIT License - see the LICENSE file for details.



Download files

Download the file for your platform.

Source Distribution

webmainbench-0.1.0.tar.gz (462.0 kB)

Uploaded Source

Built Distribution


webmainbench-0.1.0-py3-none-any.whl (85.5 kB)

Uploaded Python 3

File details

Details for the file webmainbench-0.1.0.tar.gz.

File metadata

  • Download URL: webmainbench-0.1.0.tar.gz
  • Upload date:
  • Size: 462.0 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.0.1 CPython/3.12.7

File hashes

Hashes for webmainbench-0.1.0.tar.gz
| Algorithm | Hash digest |
|---|---|
| SHA256 | 41a93a57c6185c4380b2b1e1e4c0436a6f51867d6b8d0f1b675751d9fe0377d1 |
| MD5 | 81db908e187056bcee7ce5929103d9c7 |
| BLAKE2b-256 | ff16c21b8a16ebb668e26b4f684f1d67b592aa6cdc6fc71e2464485f54468335 |


File details

Details for the file webmainbench-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: webmainbench-0.1.0-py3-none-any.whl
  • Upload date:
  • Size: 85.5 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.0.1 CPython/3.12.7

File hashes

Hashes for webmainbench-0.1.0-py3-none-any.whl
| Algorithm | Hash digest |
|---|---|
| SHA256 | 3f2e1534257f90e544bdb595d1334502ffc61d00df406feef451feae22b95dd2 |
| MD5 | bab5d131445f8d8e05390bb32ef6b63e |
| BLAKE2b-256 | f2a6284d01c0492031d9239d6127fcc87952b08b18b0a33bc0ebcb62476e5bae |

