Fast content extraction from HTML using encoder models.
Pareto-optimal models for cleaning the web. Extract main content from HTML at one twentieth the cost.
Install • Usage • Models • How it works • Benchmarks • Blog
Pulpie extracts the main content from raw HTML, stripping navigation, ads, sidebars, and footers. It uses small encoder models that label every block in a single forward pass, approaching state-of-the-art extraction quality while running up to 20x faster and 20x cheaper than autoregressive extractors on an L4 GPU.
- Fast. An encoder labels every block in one forward pass (13.7 pages/sec on an L4).
- Accurate. Matches state-of-the-art quality: 0.862 to 0.873 ROUGE-5 F1 on WebMainBench.
- Small. The recommended model is 210M parameters and fits on any GPU.
- Cheap. Clean 1 billion pages for ~$7,900, versus ~$159,000 for the leading decoder.
- Simple. Run pip install pulpie, then Extractor().extract(html).
- Batched. An overlapped CPU and GPU pipeline scales across multiple GPUs.
Installation
pip install pulpie
For Markdown output, install the markdown extra:
pip install "pulpie[markdown]"
Or with uv:
uv pip install "pulpie[markdown]"
Usage
Basic
from pulpie import Extractor
extractor = Extractor() # defaults to pulpie-orange-small (210M)
result = extractor.extract(html) # html: a str holding the raw page HTML
print(result.markdown) # clean Markdown
print(result.html) # clean HTML
print(result.n_main, result.n_other) # blocks kept vs dropped
The model downloads from Hugging Face on first use.
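The two block counts on the result lend themselves to a quick sanity check, for example flagging pages where nothing was kept. A minimal sketch, assuming only the n_main and n_other attributes shown above; keep_ratio and is_empty are hypothetical helper names, not part of Pulpie:

```python
def keep_ratio(result) -> float:
    """Fraction of blocks labeled as main content (0.0 when the page has no blocks)."""
    total = result.n_main + result.n_other
    return result.n_main / total if total else 0.0


def is_empty(result) -> bool:
    """True when the extractor kept no blocks at all."""
    return result.n_main == 0
```

With result = extractor.extract(html), a very low keep_ratio(result) or a true is_empty(result) is a cheap signal that a page deserves a manual look.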
Choosing a model
extractor = Extractor(model="orange-large") # "orange-small" (default), "orange-base", "orange-large"
extractor = Extractor(model="path/to/model") # or a custom checkpoint
extractor = Extractor(device="cpu") # force CPU
Batch processing
For bulk extraction, Pipeline overlaps CPU preprocessing with GPU inference and self-balances across one or more GPUs:
from pulpie import Pipeline, PageInput
pipeline = Pipeline(model="orange-small")
# pages: an iterable of raw HTML strings
results = pipeline.extract_batch(
    [PageInput(html=h, page_id=i) for i, h in enumerate(pages)]
)
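To illustrate what overlapping CPU preprocessing with GPU inference means in general, here is a generic producer/consumer sketch. It is not Pulpie's implementation: run_overlapped, preprocess, and infer are hypothetical names, and the two callables stand in for the CPU and GPU stages.

```python
import queue
import threading


def run_overlapped(pages, preprocess, infer, max_pending=8):
    """Run `preprocess` on a worker thread while `infer` consumes its output."""
    # Bounded queue: the CPU stage blocks once it is max_pending items ahead.
    pending = queue.Queue(maxsize=max_pending)
    done = object()  # sentinel marking the end of the stream

    def producer():
        try:
            for page in pages:
                pending.put(preprocess(page))
        finally:
            # Always signal completion so the consumer loop can exit.
            pending.put(done)

    worker = threading.Thread(target=producer, daemon=True)
    worker.start()

    results = []
    while True:
        item = pending.get()
        if item is done:
            break
        results.append(infer(item))
    worker.join()
    return results
```

In this sketch the overlap only pays off when infer releases the interpreter lock while it waits, as GPU calls typically do. Results come back in input order because there is a single producer and a single consumer.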
Models
All three models are built on EuroBERT, share a tokenizer, and use the same <|sep|> block-marker architecture. Large is the teacher; Base and Small are distilled from it.
| Model | Hugging Face | Params | ROUGE-5 F1 | Notes |
|---|---|---|---|---|
| Orange Small | feyninc/pulpie-orange-small | 210M | 0.862 | Recommended, best size-to-quality ratio |
| Orange Base | feyninc/pulpie-orange-base | 610M | 0.863 | Distilled from Large |
| Orange Large | feyninc/pulpie-orange-large | 2.1B | 0.873 | Teacher (highest quality) |
orange-small is the default. At roughly a third the size of Dripper (the leading extractor, 0.6B parameters), it comes within 0.002 ROUGE-5 F1 of Dripper's quality (0.862 vs 0.864) while running 20x faster on an L4.
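One plausible reading of the <|sep|> block-marker design mentioned above is that each block ends with a marker token and the label predicted at that marker stands for the whole block. The toy sketch below illustrates that idea only. It is an assumption, not the documented architecture: mark_blocks and labels_at_markers are hypothetical helpers, and a whitespace split stands in for the real tokenizer.

```python
SEP = "<|sep|>"


def mark_blocks(blocks):
    """Flatten blocks into one token sequence, ending each block with a marker."""
    tokens = []
    for block in blocks:
        tokens.extend(block.split())  # whitespace split stands in for a real tokenizer
        tokens.append(SEP)
    return tokens


def labels_at_markers(tokens, token_labels):
    """Keep only the labels predicted at marker positions: one label per block."""
    return [label for tok, label in zip(tokens, token_labels) if tok == SEP]
```

Given per-token predictions from a classifier, labels_at_markers reduces them to one decision per block, which is what makes a single forward pass sufficient for the whole page.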
How it works
Pulpie keeps the "read the page" approach of model-based extractors but moves the bottleneck from memory bandwidth to compute by using an encoder instead of a decoder. The pipeline runs in four stages:
- Simplify. Remove scripts, styles, and formatting noise; tag each content block with a unique ID.
- Chunk. Split, tokenize, and pack blocks into chunks of up to 8,192 tokens (≈80% of pages fit in one chunk).
- Classify. A single encoder forward pass labels every block as content or boilerplate.
- Reconstruct. Return the kept blocks as HTML, or convert them to Markdown.
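The Chunk stage can be pictured as greedy packing of consecutive blocks under a token budget. A minimal sketch, assuming per-block token counts are already known; pack_blocks is a hypothetical helper and Pulpie's actual packing strategy may differ:

```python
def pack_blocks(block_lengths, max_tokens=8192):
    """Greedily pack consecutive blocks into chunks of at most max_tokens tokens.

    Returns a list of chunks, each a list of block indices in document order.
    """
    chunks, current, used = [], [], 0
    for i, n in enumerate(block_lengths):
        if n > max_tokens:
            # Oversized blocks need splitting first; this sketch does not do that.
            raise ValueError(f"block {i} has {n} tokens; split it before packing")
        if current and used + n > max_tokens:
            chunks.append(current)  # budget exceeded: close the current chunk
            current, used = [], 0
        current.append(i)
        used += n
    if current:
        chunks.append(current)
    return chunks
```

A page whose blocks total no more than 8,192 tokens yields a single chunk, which is the common case described above.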
A decoder emits labels one token at a time, re-reading the full model from GPU memory each step. An encoder runs one dense forward pass over the whole input, so the gap widens on bandwidth-limited GPUs (7x faster than Dripper on A100, 20x on L4).
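A back-of-the-envelope model shows why the gap depends on the hardware. The sketch below treats decoding as purely bandwidth-bound and encoding as purely compute-bound. The parameter counts (0.6B and 210M) come from this README. Everything else is an illustrative assumption: 2 bytes per parameter, 200 output tokens per page, and the two made-up GPU profiles in the usage note. The model ignores decoder prefill, attention cost, and batching, so it does not reproduce the measured 7x and 20x figures.

```python
def decoder_seconds(n_params, bytes_per_param, n_output_tokens, mem_bandwidth):
    """Bandwidth-bound estimate: every generated token re-reads all weights."""
    return n_output_tokens * n_params * bytes_per_param / mem_bandwidth


def encoder_seconds(n_params, n_input_tokens, peak_flops):
    """Compute-bound estimate: about 2 FLOPs per parameter per input token."""
    return 2 * n_params * n_input_tokens / peak_flops


def speedup(mem_bandwidth, peak_flops, n_output_tokens=200, n_input_tokens=8192):
    """Ratio of decoder time to encoder time on one (hypothetical) GPU."""
    dec = decoder_seconds(0.6e9, 2, n_output_tokens, mem_bandwidth)
    enc = encoder_seconds(210e6, n_input_tokens, peak_flops)
    return dec / enc
```

Comparing a hypothetical high-bandwidth GPU, speedup(2e12, 3e14), with a hypothetical low-bandwidth one, speedup(3e11, 1.2e14), gives a larger ratio for the second: the decoder slows down in proportion to the lost bandwidth, while the encoder only slows down in proportion to the lost compute.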
Benchmarks
Quality on the English subset of WebMainBench (6,647 pages), ROUGE-5 F1:
| Method | Params | ROUGE-5 F1 | Empty pages |
|---|---|---|---|
| Pulpie Orange Large | 2.1B | 0.873 | 21 |
| Dripper | 0.6B | 0.864 | 135 |
| Pulpie Orange Base | 610M | 0.863 | 36 |
| Pulpie Orange Small | 210M | 0.862 | 45 |
| magic-html | - | 0.700 | 384 |
| Trafilatura | - | 0.619 | 16 |
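For readers unfamiliar with the metric, ROUGE-N F1 is the harmonic mean of n-gram precision and recall between extracted text and reference text, with N = 5 here. A minimal sketch, assuming whitespace tokenization; WebMainBench's own tokenization and scoring script may differ, so this will not reproduce the numbers in the table exactly:

```python
from collections import Counter


def ngrams(tokens, n):
    """Multiset of all contiguous n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_f1(candidate, reference, n=5):
    """ROUGE-N F1 between two strings, using clipped n-gram counts."""
    cand = ngrams(candidate.split(), n)
    ref = ngrams(reference.split(), n)
    overlap = sum((cand & ref).values())  # each n-gram counts at most as often as in both
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```

Because 5-grams are long, leftover boilerplate or a dropped paragraph breaks many of them at once, which makes the score sensitive to both over-extraction and under-extraction.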
Speed and cost (Pulpie Orange Small vs Dripper, 1 billion pages):
| | Pulpie Orange Small | Dripper |
|---|---|---|
| Throughput (L4) | 13.7 pages/sec | 0.68 pages/sec |
| Cost / 1B pages (L4) | ~$7,900 | ~$159,000 |
Pulpie Orange Small matches Dripper's quality at 20x the throughput and 20x lower cost on an L4. See BENCHMARKS.md for the full comparison, per-difficulty breakdown, and reproduction command.
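The 20x figures follow directly from the table. The short calculation below re-derives them and also backs out the GPU price the cost estimates imply. It assumes cost scales linearly with GPU-hours; the hourly rate is inferred from the numbers above, not stated in this README.

```python
PAGES = 1_000_000_000


def gpu_hours(pages_per_sec, pages=PAGES):
    """GPU-hours needed to process `pages` at a steady throughput."""
    return pages / pages_per_sec / 3600


pulpie_hours = gpu_hours(13.7)   # Pulpie Orange Small on an L4
dripper_hours = gpu_hours(0.68)  # Dripper on an L4

speed_ratio = 13.7 / 0.68        # throughput advantage
cost_ratio = 159_000 / 7_900     # cost advantage

# Hourly L4 price implied by each cost estimate.
pulpie_rate = 7_900 / pulpie_hours
dripper_rate = 159_000 / dripper_hours

print(f"GPU-hours: {pulpie_hours:,.0f} vs {dripper_hours:,.0f}")
print(f"speed ratio {speed_ratio:.1f}x, cost ratio {cost_ratio:.1f}x")
print(f"implied $/GPU-hour: {pulpie_rate:.2f} vs {dripper_rate:.2f}")
```

Both cost estimates imply roughly the same price of about $0.39 per L4-hour, so the cost gap is entirely a throughput gap.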
Acknowledgements
Pulpie builds directly on the work of the MinerU-HTML and Dripper team (Ma et al., 2025). Their simplify_html preprocessing, block-level annotation scheme, and the WebMainBench benchmark are foundational to this work. We also use their Dripper 0.6B model to cross-validate our training labels. We're grateful they released their tools and data.
Citation
If you use Pulpie in your research, please cite:
@misc{pulpie2026,
title = {Pulpie: Pareto-Optimal Models for Cleaning the Web},
author = {Minhas, Bhavnick and Nigam, Shreyash and {Feyn Research}},
year = {2026},
howpublished = {Feyn Field Notes}
}
Download files
File details
Details for the file pulpie-0.0.2.tar.gz.
File metadata
- Download URL: pulpie-0.0.2.tar.gz
- Upload date:
- Size: 29.9 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: uv/0.11.26
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 55b6790f5d0d17dffda120485c5e87ee833fde7cda95eb0747d61410e4241be1 |
| MD5 | fbe7dee8a2d78dbc6eca04ae0954163a |
| BLAKE2b-256 | b077f191b95ae3b584fd7b65399a119c1613472c162bbcb1e78184ba1eb790a6 |
File details
Details for the file pulpie-0.0.2-py3-none-any.whl.
File metadata
- Download URL: pulpie-0.0.2-py3-none-any.whl
- Upload date:
- Size: 29.6 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: uv/0.11.26
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 9a54d01b462b7a51b635dc9ffbd3b0ecd2dc703834940ac9ce6354a590d12130 |
| MD5 | c2d9ffb19d665054862a49e10c0d741f |
| BLAKE2b-256 | f0b3bcf0e29e5dd95758687a3a03396841746599619fef91d4a55d694605291a |