Fast content extraction from HTML using encoder models.

pulpie

Pareto-optimal models for cleaning the web. Extract main content from HTML at one twentieth the cost.

Pulpie extracts the main content from raw HTML, stripping navigation, ads, sidebars, and footers. It uses small encoder models that label every block in a single forward pass, approaching state-of-the-art extraction quality while running up to 20x faster and 20x cheaper than autoregressive extractors on an L4 GPU.

  • Fast. An encoder labels every block in one forward pass (13.7 pages/sec on an L4).
  • Accurate. Matches state-of-the-art quality: 0.862 to 0.873 ROUGE-5 F1 on WebMainBench.
  • Small. The recommended model is 210M parameters and fits on any GPU.
  • Cheap. Clean 1 billion pages for ~$7,900, versus ~$159,000 for the leading decoder.
  • Simple. Run pip install pulpie, then Extractor().extract(html).
  • Batched. An overlapped CPU and GPU pipeline scales across multiple GPUs.

Installation

pip install pulpie

For Markdown output, install the markdown extra:

pip install "pulpie[markdown]"

Or with uv:

uv pip install "pulpie[markdown]"

Usage

Basic

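from pulpie import Extractor

# Any raw HTML string works here; this one is a placeholder.
html = "<html><body><nav>Home</nav><article><p>Hello, world.</p></article></body></html>"

extractor = Extractor()                # defaults to pulpie-orange-small (210M)
result = extractor.extract(html)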

print(result.markdown)                 # clean Markdown
print(result.html)                     # clean HTML
print(result.n_main, result.n_other)   # blocks kept vs dropped

The model downloads from Hugging Face on first use.

Choosing a model

extractor = Extractor(model="orange-large")   # "orange-small" (default), "orange-base", "orange-large"
extractor = Extractor(model="path/to/model")  # or a custom checkpoint
extractor = Extractor(device="cpu")           # force CPU

Batch processing

For bulk extraction, Pipeline overlaps CPU preprocessing with GPU inference and self-balances across one or more GPUs:

from pulpie import Pipeline, PageInput

# `pages` is a list of raw HTML strings; page_id ties each result back to its input.
pipeline = Pipeline(model="orange-small")
results = pipeline.extract_batch(
    [PageInput(html=h, page_id=i) for i, h in enumerate(pages)]
)
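The overlap idea can be illustrated with a plain producer/consumer queue. This is a minimal sketch of the general technique, not Pulpie's implementation: `preprocess`, `infer`, and `run_overlapped` are hypothetical stand-ins, and the "GPU" step here only counts blocks.

```python
import queue
import threading


def preprocess(html: str) -> list[str]:
    """Stand-in for CPU work (simplify and tokenize): keep non-empty lines."""
    return [line.strip() for line in html.splitlines() if line.strip()]


def infer(batch: list[list[str]]) -> list[int]:
    """Stand-in for GPU inference: count the blocks on each page."""
    return [len(blocks) for blocks in batch]


def run_overlapped(pages: list[str], batch_size: int = 2) -> list[int]:
    # A bounded queue lets preprocessing run ahead of inference
    # without letting memory grow without limit.
    ready: queue.Queue = queue.Queue(maxsize=4)
    sentinel = object()

    def producer() -> None:
        for start in range(0, len(pages), batch_size):
            ready.put([preprocess(p) for p in pages[start:start + batch_size]])
        ready.put(sentinel)  # signal that no more batches are coming

    worker = threading.Thread(target=producer, daemon=True)
    worker.start()

    results: list[int] = []
    while True:
        batch = ready.get()
        if batch is sentinel:
            break
        results.extend(infer(batch))  # runs while the producer prepares the next batch
    worker.join()
    return results


pages = ["<p>a</p>\n<p>b</p>", "<p>c</p>", "\n<p>d</p>\n<p>e</p>\n<p>f</p>"]
block_counts = run_overlapped(pages)
```

With a single producer and a single consumer, results come back in input order. A multi-GPU variant would need one consumer per device and some way to restore order, which is presumably why `PageInput` carries a `page_id`.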

Models

All three models are built on EuroBERT, share a tokenizer, and use the same <|sep|> block-marker architecture. Large is the teacher; Base and Small are distilled from it.
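One plausible reading of a block-marker design is sketched below: a marker token follows each block, and the score at each marker position becomes that block's label. The marker placement, the `build_sequence` and `labels_from_scores` helpers, the threshold, and the scores are all assumptions for illustration; the models' actual input layout is not described here.

```python
SEP = "<|sep|>"


def build_sequence(blocks: list[list[str]]) -> tuple[list[str], list[int]]:
    """Concatenate tokenized blocks, appending one marker after each block.

    Returns the flat token list and the index of every marker.
    """
    tokens: list[str] = []
    marker_positions: list[int] = []
    for block in blocks:
        tokens.extend(block)
        marker_positions.append(len(tokens))  # index the marker is about to occupy
        tokens.append(SEP)
    return tokens, marker_positions


def labels_from_scores(
    scores: list[float], marker_positions: list[int], threshold: float = 0.5
) -> list[bool]:
    """Read one score per block at its marker; True means 'main content'."""
    return [scores[pos] >= threshold for pos in marker_positions]


blocks = [["Home", "About"], ["Pulpie", "extracts", "content"], ["©", "2026"]]
tokens, marker_positions = build_sequence(blocks)

# Hypothetical per-token scores, standing in for a classification head's output.
scores = [0.0] * len(tokens)
scores[marker_positions[1]] = 0.9
keep = labels_from_scores(scores, marker_positions)
```

Because every marker is scored in the same forward pass, the number of blocks does not change the number of model calls.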

| Model | Hugging Face | Params | ROUGE-5 F1 | Notes |
| --- | --- | --- | --- | --- |
| Orange Small | feyninc/pulpie-orange-small | 210M | 0.862 | Recommended, best size-to-quality ratio |
| Orange Base | feyninc/pulpie-orange-base | 610M | 0.863 | Distilled from Large |
| Orange Large | feyninc/pulpie-orange-large | 2.1B | 0.873 | Teacher (highest quality) |

orange-small is the default. At roughly a third the size of Dripper (the leading extractor, 0.6B parameters), it nearly matches Dripper's quality (0.862 vs 0.864 ROUGE-5 F1) while running 20x faster on an L4.

How it works

Pulpie keeps the "read the page" approach of model-based extractors but moves the bottleneck from memory bandwidth to compute by using an encoder instead of a decoder. The pipeline runs in four stages:

  1. Simplify. Remove scripts, styles, and formatting noise; tag each content block with a unique ID.
  2. Chunk. Split, tokenize, and pack blocks into chunks of up to 8,192 tokens (≈80% of pages fit in one chunk).
  3. Classify. A single encoder forward pass labels every block as content or boilerplate.
  4. Reconstruct. Return the kept blocks as HTML, or convert them to Markdown.
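Stages 2 and 4 above can be sketched in a few lines. This is a minimal illustration that assumes greedy, order-preserving packing. The source does not state which packing strategy Pulpie uses, and `pack_blocks` and `reconstruct` are hypothetical names.

```python
def pack_blocks(block_lengths: list[int], max_tokens: int = 8192) -> list[list[int]]:
    """Greedily pack block indices into chunks of at most max_tokens tokens.

    A block longer than max_tokens ends up alone in its own chunk here;
    a real pipeline would have to split it first.
    """
    chunks: list[list[int]] = []
    current: list[int] = []
    used = 0
    for idx, length in enumerate(block_lengths):
        if current and used + length > max_tokens:
            chunks.append(current)  # close the chunk that would overflow
            current, used = [], 0
        current.append(idx)
        used += length
    if current:
        chunks.append(current)
    return chunks


def reconstruct(blocks: list[str], keep: list[bool]) -> str:
    """Join the blocks labelled as content, preserving document order."""
    return "\n".join(block for block, kept in zip(blocks, keep) if kept)


chunks = pack_blocks([3000, 4000, 2000, 9000, 100])
cleaned = reconstruct(["nav", "Article text", "footer"], [False, True, False])
```

Keeping blocks in document order means each chunk is a contiguous slice of the page, so the classifier sees neighbouring blocks as context.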

A decoder emits labels one token at a time, re-reading the full model weights from GPU memory at every step, so its speed is bound by memory bandwidth. An encoder runs one dense forward pass over the whole input and is bound by compute instead. The gap therefore widens on bandwidth-limited GPUs (7x faster than Dripper on an A100, 20x on an L4).
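A back-of-envelope bound shows why decoding is sensitive to bandwidth: each generated token must stream every weight from memory at least once. The figures below are assumptions, not measurements: 2 bytes per parameter (bf16), about 300 GB/s for an L4, and about 1,555 GB/s for a 40 GB A100, taken from public spec sheets.

```python
def min_decode_step_ms(params: float, bytes_per_param: int, bandwidth_gb_s: float) -> float:
    """Lower bound on one decode step: time to read all weights once, in milliseconds."""
    weight_bytes = params * bytes_per_param
    return weight_bytes / (bandwidth_gb_s * 1e9) * 1e3


# A 0.6B-parameter decoder, the size listed for Dripper.
l4_ms = min_decode_step_ms(0.6e9, 2, 300.0)      # about 4 ms per token
a100_ms = min_decode_step_ms(0.6e9, 2, 1555.0)   # under 1 ms per token
floor_ratio = l4_ms / a100_ms                    # about 5x
```

Under these assumptions the per-token floor is roughly five times higher on the L4. This bound does not reproduce the 7x and 20x figures, which also depend on compute throughput and batching, but it points in the same direction.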

Benchmarks

Quality on the English subset of WebMainBench (6,647 pages), ROUGE-5 F1:

| Method | Params | ROUGE-5 F1 | Empty pages |
| --- | --- | --- | --- |
| Pulpie Orange Large | 2.1B | 0.873 | 21 |
| Dripper | 0.6B | 0.864 | 135 |
| Pulpie Orange Base | 610M | 0.863 | 36 |
| Pulpie Orange Small | 210M | 0.862 | 45 |
| magic-html | - | 0.700 | 384 |
| Trafilatura | - | 0.619 | 16 |
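For readers unfamiliar with the metric, ROUGE-N F1 is the harmonic mean of n-gram precision and recall between the extracted text and a reference. The sketch below uses 5-grams over whitespace tokens. WebMainBench's exact tokenization and normalization are not given here, so treat the tokenizer as an assumption and the `rouge_n_f1` helper as illustrative.

```python
from collections import Counter


def ngrams(tokens: list[str], n: int) -> Counter:
    """Multiset of contiguous n-grams; empty if the text is shorter than n tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_f1(prediction: str, reference: str, n: int = 5) -> float:
    pred = ngrams(prediction.split(), n)
    ref = ngrams(reference.split(), n)
    overlap = sum((pred & ref).values())  # clipped n-gram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


reference = "pulpie extracts the main content from pages"
prediction = "pulpie extracts the main content from"
score = rouge_n_f1(prediction, reference)  # precision 1.0, recall 2/3
```

An extraction that returns nothing has no n-grams and scores 0.0, so the "Empty pages" column also counts toward each method's average.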

Speed and cost (Pulpie Orange Small vs Dripper, 1 billion pages):

| Metric | Pulpie Orange Small | Dripper |
| --- | --- | --- |
| Throughput (L4) | 13.7 pages/sec | 0.68 pages/sec |
| Cost / 1B pages (L4) | ~$7,900 | ~$159,000 |

Pulpie Orange Small matches Dripper's quality at 20x the throughput and 20x lower cost on an L4. See BENCHMARKS.md for the full comparison, per-difficulty breakdown, and reproduction command.
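The throughput and cost figures can be cross-checked with simple arithmetic. The sketch below converts pages per second into GPU-hours for 1 billion pages and derives the hourly L4 price each cost implies. That price is inferred from the numbers above, not stated by the authors.

```python
PAGES = 1_000_000_000


def gpu_hours(pages_per_sec: float, pages: int = PAGES) -> float:
    """GPU-hours needed to process `pages` at a steady throughput."""
    return pages / pages_per_sec / 3600


def implied_hourly_rate(total_cost_usd: float, pages_per_sec: float) -> float:
    """Hourly GPU price implied by a total cost and a throughput."""
    return total_cost_usd / gpu_hours(pages_per_sec)


pulpie_hours = gpu_hours(13.7)    # roughly 20,000 GPU-hours
dripper_hours = gpu_hours(0.68)   # roughly 408,000 GPU-hours

pulpie_rate = implied_hourly_rate(7_900, 13.7)
dripper_rate = implied_hourly_rate(159_000, 0.68)
speedup = 13.7 / 0.68
```

Both costs imply about the same hourly rate (close to $0.39 per L4-hour), so the 20x cost gap follows directly from the 20x throughput gap.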

Acknowledgements

Pulpie builds directly on the work of the MinerU-HTML and Dripper team (Ma et al., 2025). Their simplify_html preprocessing, block-level annotation scheme, and the WebMainBench benchmark are foundational to this work. We also use their Dripper 0.6B model to cross-validate our training labels. We're grateful they released their tools and data.

Citation

If you use Pulpie in your research, please cite:

@misc{pulpie2026,
  title        = {Pulpie: Pareto-Optimal Models for Cleaning the Web},
  author       = {Minhas, Bhavnick and Nigam, Shreyash and {Feyn Research}},
  year         = {2026},
  howpublished = {Feyn Field Notes}
}

Built by Feyn.
