Pulpie
Fast content extraction from HTML using encoder models. 16x faster than autoregressive approaches at the same quality.
Install
pip install pulpie
For markdown output:
pip install "pulpie[markdown]"
Usage
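from pulpie import Extractor
html = "<html><body><article><p>Hello, world.</p></article></body></html>"  # any HTML document as a string
extractor = Extractor()  # downloads pulpie-orange-small (210M) on first use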
result = extractor.extract(html)
print(result.markdown) # clean markdown
print(result.html) # clean HTML
print(result.n_main) # number of content blocks
print(result.n_other) # number of boilerplate blocks
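The two block counts give a quick signal of how much of a page survived extraction. The following is a minimal sketch that assumes only the `n_main` and `n_other` attributes shown above; the helper name `main_fraction` and the stand-in result object are hypothetical and not part of pulpie.

```python
from types import SimpleNamespace


def main_fraction(result) -> float:
    """Share of blocks classified as main content (0.0 if the page had no blocks)."""
    total = result.n_main + result.n_other
    return result.n_main / total if total else 0.0


# Hypothetical stand-in for an extraction result, so the sketch runs
# without downloading a model.
fake_result = SimpleNamespace(n_main=12, n_other=36)
print(f"{main_fraction(fake_result):.0%} of blocks kept")  # 25% of blocks kept
```

A very low fraction may simply mean a boilerplate-heavy page, so treat it as a hint for manual review rather than as an error signal.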
Models
| Model | Size | ROUGE-5 | Speed (L4) |
|---|---|---|---|
| orange-small | 210M | 0.864 | 15 pps |
| orange-base | 610M | 0.849 | ~6 pps |
| orange-large | 2.1B | 0.862 | ~2 pps |
orange-small is the default and recommended model: it matches the 2.1B teacher (0.864 vs. 0.862 ROUGE-5) at one tenth the size.
# Use a specific model
extractor = Extractor(model="orange-large")
# Use a custom model path
extractor = Extractor(model="path/to/your/model")
# Force CPU
extractor = Extractor(device="cpu")
How it works
Pulpie classifies each HTML block as "main content" or "boilerplate" using a bidirectional encoder. The pipeline:
- Simplify: strip scripts and styles and normalize the HTML (via MinerU-HTML)
- Chunk: pack blocks into sequences separated by <|sep|> tokens
- Classify: a single encoder forward pass classifies all blocks simultaneously
- Reconstruct: extract the content blocks and convert them to markdown
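The chunk, classify, and reconstruct steps above can be sketched in a few lines. This is an illustration under stated assumptions, not pulpie's implementation: token counts are approximated by whitespace splitting, the `max_tokens` budget is made up, and `toy_classifier` is a hypothetical word-count rule standing in for the encoder.

```python
SEP = "<|sep|>"


def pack_blocks(blocks, max_tokens=512):
    """Greedily pack text blocks into sequences that fit a token budget.

    Tokens are approximated by whitespace-separated words (an assumption);
    each block also pays one token for its separator.
    """
    sequences, current, used = [], [], 0
    for text in blocks:
        cost = len(text.split()) + 1
        if current and used + cost > max_tokens:
            sequences.append(current)
            current, used = [], 0
        current.append(text)
        used += cost
    if current:
        sequences.append(current)
    return sequences


def render(sequence):
    """Join one packed sequence into the string an encoder would see."""
    return SEP.join(sequence)


def toy_classifier(sequence):
    """Hypothetical stand-in for the encoder: one label per block.

    Blocks with eight or more words count as main content.
    """
    return [len(text.split()) >= 8 for text in sequence]


def extract_main(blocks, max_tokens=512):
    """Chunk, classify, and reconstruct: return main-content blocks in order."""
    kept = []
    for sequence in pack_blocks(blocks, max_tokens):
        labels = toy_classifier(sequence)
        kept.extend(text for text, is_main in zip(sequence, labels) if is_main)
    return kept


blocks = [
    "Home | About | Contact",
    "Encoder models label every block of a page in a single forward pass.",
    "Share this article",
    "Because all blocks are scored together, no text is generated token by token.",
]
main = extract_main(blocks, max_tokens=32)
print("\n\n".join(main))
```

The point of the design is that every block in a packed sequence receives its label from one pass over the input, which is where the speed advantage over token-by-token generation comes from.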
Performance
On 500 real Common Crawl pages (NVIDIA L4 GPU):
- 15.1 pages/sec (single GPU, 210M model)
- $6,500 to clean 1 billion pages
- 16.4x faster than Dripper (autoregressive) on the same hardware
- 433 MB VRAM — fits on any GPU
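The throughput and cost figures above can be cross-checked with simple arithmetic. The sketch below uses only the numbers quoted in this section; the hourly GPU price it derives is an inference from those numbers and is not stated by the project.

```python
PAGES = 1_000_000_000
PAGES_PER_SEC = 15.1     # single L4 GPU, 210M model (from the list above)
TOTAL_COST_USD = 6_500   # quoted cost for 1 billion pages
SPEEDUP_VS_DRIPPER = 16.4

gpu_hours = PAGES / PAGES_PER_SEC / 3600
implied_rate = TOTAL_COST_USD / gpu_hours          # inferred, not quoted
dripper_pps = PAGES_PER_SEC / SPEEDUP_VS_DRIPPER   # inferred, not quoted

print(f"GPU-hours for 1B pages: {gpu_hours:,.0f}")
print(f"Implied price: ${implied_rate:.2f} per GPU-hour")
print(f"Implied Dripper throughput: {dripper_pps:.2f} pages/sec")
```

At roughly 18,400 GPU-hours, the job would take about two years on a single GPU, so in practice the work would be spread across many GPUs; the total GPU-hours, and therefore the cost, stay the same.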
License
Apache 2.0