pulpie

Fast content extraction from HTML using encoder models.

Maintainers

bhavnicksm

Project description

Pareto-optimal models for cleaning the web. Extract main content from HTML at roughly one-twentieth the cost of the leading autoregressive extractor.

Pulpie extracts the main content from raw HTML, stripping navigation, ads, sidebars, and footers. It uses small encoder models that label every block in a single forward pass. The recommended model comes within 0.002 ROUGE-5 F1 of the leading autoregressive extractor while running up to 20x faster, at about one-twentieth the cost, on an L4 GPU.

Fast. An encoder labels every block in one forward pass (13.7 pages/sec on an L4).
Accurate. Matches state-of-the-art quality: 0.862 to 0.873 ROUGE-5 F1 on WebMainBench.
Small. The recommended model is 210M parameters and fits on any GPU.
Cheap. Clean 1 billion pages for ~$7,900 on L4 GPUs, versus ~$159,000 for Dripper, the leading decoder-based extractor.
Simple. Run pip install pulpie, then Extractor().extract(html).
Batched. An overlapped CPU and GPU pipeline scales across multiple GPUs.

Installation

pip install pulpie

For Markdown output, install the markdown extra:

pip install "pulpie[markdown]"

Or with uv:

uv pip install "pulpie[markdown]"

Usage

Basic

from pulpie import Extractor

html = "<html><body><article><p>Hello, world.</p></article></body></html>"  # any raw HTML string

extractor = Extractor()                # defaults to pulpie-orange-small (210M)
result = extractor.extract(html)

print(result.markdown)                 # clean Markdown (needs the markdown extra)
print(result.html)                     # clean HTML
print(result.n_main, result.n_other)   # blocks kept vs dropped

The model downloads from Hugging Face on first use.
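When running over many pages it can help to check how much of each page was kept. The sketch below works on any object exposing the `n_main` and `n_other` counters shown above. `kept_fraction` and `looks_empty` are hypothetical helper names, not part of pulpie, and a stand-in object replaces the real result so the snippet runs without downloading a model.

```python
from types import SimpleNamespace


def kept_fraction(result) -> float:
    """Share of blocks labelled as main content (0.0 when the page has no blocks)."""
    total = result.n_main + result.n_other
    return result.n_main / total if total else 0.0


def looks_empty(result) -> bool:
    """True when no block was kept, i.e. the extraction came back empty."""
    return result.n_main == 0


# Stand-in for the object returned by Extractor().extract(html).
fake_result = SimpleNamespace(n_main=12, n_other=36)
print(kept_fraction(fake_result))  # 0.25
print(looks_empty(fake_result))    # False
```

A very low kept fraction, or an empty result, is a cheap signal that a page deserves a manual look.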

Choosing a model

extractor = Extractor(model="orange-large")   # "orange-small" (default), "orange-base", "orange-large"
extractor = Extractor(model="path/to/model")  # or a custom checkpoint
extractor = Extractor(device="cpu")           # force CPU

Batch processing

For bulk extraction, Pipeline overlaps CPU preprocessing with GPU inference and self-balances across one or more GPUs:

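from pulpie import Pipeline, PageInput

# pages: raw HTML strings, one per page
pages = [
    "<html><body><p>One</p></body></html>",
    "<html><body><p>Two</p></body></html>",
]

pipeline = Pipeline(model="orange-small")
results = pipeline.extract_batch(
    [PageInput(html=h, page_id=i) for i, h in enumerate(pages)]
)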
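The overlap idea can be illustrated without pulpie: one thread prepares inputs while another consumes them, connected by a bounded queue. This is a generic producer-consumer sketch, not Pipeline's actual implementation. `preprocess`, `infer`, and `run_overlapped` are hypothetical stand-ins for the CPU stage, the GPU stage, and the driver.

```python
import queue
import threading

_DONE = object()  # sentinel that tells the consumer to stop


def preprocess(html: str) -> str:
    """Hypothetical CPU stage: here just whitespace normalisation."""
    return " ".join(html.split())


def infer(batch: list[str]) -> list[int]:
    """Hypothetical GPU stage: here just the length of each prepared input."""
    return [len(item) for item in batch]


def run_overlapped(pages: list[str], batch_size: int = 2) -> list[int]:
    """Preprocess in a background thread while the calling thread runs inference."""
    batches: queue.Queue = queue.Queue(maxsize=4)  # bounded, so the producer cannot run far ahead

    def producer() -> None:
        for start in range(0, len(pages), batch_size):
            batches.put([preprocess(p) for p in pages[start:start + batch_size]])
        batches.put(_DONE)

    worker = threading.Thread(target=producer, daemon=True)
    worker.start()

    results: list[int] = []
    while (batch := batches.get()) is not _DONE:
        results.extend(infer(batch))  # order is preserved: one producer, FIFO queue
    worker.join()
    return results


print(run_overlapped(["<p>a  b</p>", "<p>c</p>", "<div>d</div>"]))  # [10, 8, 12]
```

The bounded queue is the design choice that matters: it lets the slower stage apply back-pressure instead of letting prepared batches pile up in memory.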

Models

All three models are built on EuroBERT, share a tokenizer, and use the same <|sep|> block-marker architecture. Large is the teacher; Base and Small are distilled from it.

Model	Hugging Face	Params	ROUGE-5 F1	Notes
Orange Small	`feyninc/pulpie-orange-small`	210M	0.862	Recommended, best size-to-quality ratio
Orange Base	`feyninc/pulpie-orange-base`	610M	0.863	Distilled from Large
Orange Large	`feyninc/pulpie-orange-large`	2.1B	0.873	Teacher (highest quality)

orange-small is the default. At about a third the size of Dripper, the leading extractor (210M vs 0.6B parameters), it comes within 0.002 ROUGE-5 F1 of Dripper's quality (0.862 vs 0.864) while running about 20x faster on an L4.

How it works

Pulpie keeps the "read the page" approach of model-based extractors but moves the bottleneck from memory bandwidth to compute by using an encoder instead of a decoder. The pipeline runs in four stages:

Simplify. Remove scripts, styles, and formatting noise; tag each content block with a unique ID.
Chunk. Split, tokenize, and pack blocks into chunks of up to 8,192 tokens (≈80% of pages fit in one chunk).
Classify. A single encoder forward pass labels every block as content or boilerplate.
Reconstruct. Return the kept blocks as HTML, or convert them to Markdown.
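The Chunk stage can be pictured as greedy packing. The sketch below is an illustration under stated assumptions, not pulpie's code: `pack_blocks` is a hypothetical name, token counts are passed in directly instead of coming from the real tokenizer, and a block longer than the budget is placed alone rather than split.

```python
def pack_blocks(token_counts: list[int], max_tokens: int = 8192) -> list[list[int]]:
    """Greedily pack block indices, in document order, into chunks of at most max_tokens."""
    chunks: list[list[int]] = []
    current: list[int] = []
    used = 0
    for index, count in enumerate(token_counts):
        # Start a new chunk when the next block would overflow the budget.
        if current and used + count > max_tokens:
            chunks.append(current)
            current, used = [], 0
        current.append(index)
        used += count
    if current:
        chunks.append(current)
    return chunks


print(pack_blocks([3000, 4000, 2000, 500]))  # [[0, 1], [2, 3]]
```

Keeping blocks in document order means each chunk is a contiguous slice of the page, so per-block labels can be mapped back to the page without extra bookkeeping.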

A decoder emits labels one token at a time, re-reading the model weights from GPU memory at every step, so its speed is bounded by memory bandwidth. An encoder runs one dense forward pass over the whole input and is bounded by compute instead. The gap therefore widens on bandwidth-limited GPUs: Pulpie is 7x faster than Dripper on an A100 and 20x faster on an L4.
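A back-of-the-envelope bound shows why decoding is limited by bandwidth: each generated token needs at least one full read of the weights, so a step cannot be faster than the weight size divided by memory bandwidth. The numbers below are assumptions for illustration only (16-bit weights, and approximate vendor figures of about 300 GB/s for an L4 and about 2,000 GB/s for an 80 GB A100); they are not measurements from the Pulpie benchmarks, and the bound ignores everything except weight reads.

```python
def min_decode_step_ms(params: float, bandwidth_gb_s: float, bytes_per_param: int = 2) -> float:
    """Lower bound on one decode step: time to stream all weights from GPU memory once."""
    weight_bytes = params * bytes_per_param
    return weight_bytes / (bandwidth_gb_s * 1e9) * 1e3


# Assumed, approximate memory bandwidths in GB/s (see the note above).
GPUS = {"L4": 300.0, "A100-80GB": 2000.0}

for name, bandwidth in GPUS.items():
    step = min_decode_step_ms(0.6e9, bandwidth)  # a 0.6B-parameter decoder
    print(f"{name}: at least {step:.1f} ms per generated token")
```

Under these assumptions the per-token floor is several times higher on the L4 than on the A100, which is in line with the decoder falling further behind on the L4.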

Benchmarks

Quality on the English subset of WebMainBench (6,647 pages), ROUGE-5 F1:

Method	Params	ROUGE-5 F1	Empty pages
Pulpie Orange Large	2.1B	0.873	21
Dripper	0.6B	0.864	135
Pulpie Orange Base	610M	0.863	36
Pulpie Orange Small	210M	0.862	45
magic-html	-	0.700	384
Trafilatura	-	0.619	16
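For readers unfamiliar with the metric, the sketch below computes an n-gram overlap F1 in the style of ROUGE-N with n = 5. It assumes that ROUGE-5 here means 5-gram ROUGE-N, and it uses plain whitespace tokenization; the benchmark's own tokenizer and scoring script may differ, so treat this as an illustration rather than a way to reproduce the table.

```python
from collections import Counter


def ngrams(tokens: list[str], n: int) -> Counter:
    """Multiset of contiguous n-grams."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_f1(candidate: str, reference: str, n: int = 5) -> float:
    """F1 of clipped n-gram overlap between candidate and reference text."""
    cand = ngrams(candidate.split(), n)
    ref = ngrams(reference.split(), n)
    overlap = sum((cand & ref).values())  # Counter intersection clips repeated n-grams
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


reference = "the quick brown fox jumps over the lazy dog"
candidate = "menu home the quick brown fox jumps over the lazy dog"
print(round(rouge_n_f1(candidate, reference), 3))  # 0.833
```

In the example, two leftover navigation words cost precision (5 of 7 candidate 5-grams match) while recall stays at 1.0, which is how residual boilerplate shows up in the score.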

Speed and cost (Pulpie Orange Small vs Dripper, 1 billion pages):

	Pulpie Orange Small	Dripper
Throughput (L4)	13.7 pages/sec	0.68 pages/sec
Cost / 1B pages (L4)	~$7,900	~$159,000

Pulpie Orange Small comes within 0.002 ROUGE-5 F1 of Dripper at about 20x the throughput and one-twentieth the cost on an L4. See BENCHMARKS.md for the full comparison, per-difficulty breakdown, and reproduction command.
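The cost figures follow from throughput. The sketch below converts pages per second into GPU-hours for 1 billion pages and backs out the hourly price implied by the quoted costs. The implied rate is derived here from the table, not stated in the source, and `gpu_hours` is a hypothetical helper name.

```python
def gpu_hours(pages: float, pages_per_sec: float) -> float:
    """GPU-hours needed to process `pages` at a steady `pages_per_sec` on one GPU."""
    return pages / pages_per_sec / 3600


PAGES = 1e9
pulpie_hours = gpu_hours(PAGES, 13.7)   # Pulpie Orange Small on an L4
dripper_hours = gpu_hours(PAGES, 0.68)  # Dripper on an L4

print(f"Pulpie:  {pulpie_hours:,.0f} GPU-hours")
print(f"Dripper: {dripper_hours:,.0f} GPU-hours")
print(f"Speedup: {dripper_hours / pulpie_hours:.1f}x")

# Hourly L4 price implied by the quoted ~$7,900 (a derived figure, not from the source).
implied_rate = 7_900 / pulpie_hours
print(f"Implied rate: ${implied_rate:.2f}/GPU-hour")
print(f"Dripper at that rate: ${implied_rate * dripper_hours:,.0f}")
```

Both quoted costs are consistent with the same hourly price, so the cost gap is simply the throughput gap.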

Acknowledgements

Pulpie builds directly on the work of the MinerU-HTML and Dripper team (Ma et al., 2025). Their simplify_html preprocessing, block-level annotation scheme, and the WebMainBench benchmark are foundational to this work. We also use their Dripper 0.6B model to cross-validate our training labels. We're grateful they released their tools and data.

Citation

If you use Pulpie in your research, please cite:

@misc{pulpie2026,
  title        = {Pulpie: Pareto-Optimal Models for Cleaning the Web},
  author       = {Minhas, Bhavnick and Nigam, Shreyash and {Feyn Research}},
  year         = {2026},
  howpublished = {Feyn Field Notes}
}

Built by Feyn.

Release history

0.0.2 (Jul 1, 2026, this version)
0.0.1 (Jun 27, 2026)

Download files

Source Distribution

pulpie-0.0.2.tar.gz (29.9 kB)

Uploaded Jul 1, 2026 Source

Built Distribution

pulpie-0.0.2-py3-none-any.whl (29.6 kB)

Uploaded Jul 1, 2026 Python 3

File details

Details for the file pulpie-0.0.2.tar.gz.

File metadata

Download URL: pulpie-0.0.2.tar.gz
Upload date: Jul 1, 2026
Size: 29.9 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: uv/0.11.26 (uv publish, CI on Ubuntu 24.04)

File hashes

Hashes for pulpie-0.0.2.tar.gz
Algorithm	Hash digest
SHA256	`55b6790f5d0d17dffda120485c5e87ee833fde7cda95eb0747d61410e4241be1`
MD5	`fbe7dee8a2d78dbc6eca04ae0954163a`
BLAKE2b-256	`b077f191b95ae3b584fd7b65399a119c1613472c162bbcb1e78184ba1eb790a6`

File details

Details for the file pulpie-0.0.2-py3-none-any.whl.

File metadata

Download URL: pulpie-0.0.2-py3-none-any.whl
Upload date: Jul 1, 2026
Size: 29.6 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: uv/0.11.26 (uv publish, CI on Ubuntu 24.04)

File hashes

Hashes for pulpie-0.0.2-py3-none-any.whl
Algorithm	Hash digest
SHA256	`9a54d01b462b7a51b635dc9ffbd3b0ecd2dc703834940ac9ce6354a590d12130`
MD5	`c2d9ffb19d665054862a49e10c0d741f`
BLAKE2b-256	`f0b3bcf0e29e5dd95758687a3a03396841746599619fef91d4a55d694605291a`
