Pulpie

Fast content extraction from HTML using encoder models. About 16x faster than autoregressive approaches at the same quality (16.4x faster than Dripper on an NVIDIA L4).

Install

pip install pulpie

For markdown output:

pip install "pulpie[markdown]"

Usage

from pulpie import Extractor

# Placeholder input; in practice this is the raw HTML of a fetched page
html = "<html><body><article><p>Hello, world.</p></article></body></html>"

extractor = Extractor()  # downloads pulpie-orange-small (210M) on first use

result = extractor.extract(html)
print(result.markdown)   # clean markdown
print(result.html)       # clean HTML
print(result.n_main)     # number of content blocks
print(result.n_other)    # number of boilerplate blocks
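
For more than one page, the same call can be wrapped in a small helper. The sketch below relies only on the attributes shown above (`markdown`, `n_main`, `n_other`). The `summarize` function and `PageSummary` type are hypothetical names, not part of pulpie, and the extractor is passed in as an argument so the helper assumes nothing else about the library.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass
class PageSummary:
    name: str
    markdown: str
    content_ratio: float  # share of blocks classified as main content


def summarize(extractor, pages: Iterable[Tuple[str, str]]) -> List[PageSummary]:
    """Run extractor.extract() on (name, html) pairs and summarize each result."""
    summaries = []
    for name, html in pages:
        result = extractor.extract(html)
        total = result.n_main + result.n_other
        # Guard against pages where no blocks were found at all.
        ratio = result.n_main / total if total else 0.0
        summaries.append(PageSummary(name, result.markdown, ratio))
    return summaries
```

With pulpie installed, this would be called as `summarize(Extractor(), [("a.html", html_a), ("b.html", html_b)])`. A very low `content_ratio` may flag pages that are mostly navigation or ads, though any threshold would need to be chosen empirically.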

Models

Model          Size    ROUGE-5   Speed on NVIDIA L4 (pages/sec)
orange-small   210M    0.864     15
orange-base    610M    0.849     ~6
orange-large   2.1B    0.862     ~2

orange-small is the default and recommended model: it matches the 2.1B teacher's ROUGE-5 (0.864 vs. 0.862) at one tenth the size.

# Use a specific model
extractor = Extractor(model="orange-large")

# Use a custom model path
extractor = Extractor(model="path/to/your/model")

# Force CPU
extractor = Extractor(device="cpu")

How it works

Pulpie classifies each HTML block as "main content" or "boilerplate" using a bidirectional encoder. The pipeline:

  1. Simplify: strip scripts and styles, then normalize the HTML (via MinerU-HTML)
  2. Chunk: pack blocks into sequences separated by <|sep|> tokens
  3. Classify: a single encoder forward pass classifies all blocks in a sequence simultaneously
  4. Reconstruct: extract the content blocks and convert them to markdown
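
The Chunk step can be pictured as greedy packing under a length budget. The sketch below is an illustration, not pulpie's code: `pack_blocks`, the `max_tokens` budget, and counting tokens by whitespace splitting are all assumptions made for the example (a real pipeline would use the model's tokenizer). Only the `<|sep|>` separator comes from the description above.

```python
from typing import List

SEP = "<|sep|>"  # separator token named in the pipeline description


def count_tokens(text: str) -> int:
    # Assumption: whitespace-separated words stand in for real tokenizer output.
    return len(text.split())


def pack_blocks(blocks: List[str], max_tokens: int = 512) -> List[str]:
    """Greedily pack blocks into sequences joined by SEP, each within max_tokens.

    Each separator is counted as one token. A block that exceeds the budget
    on its own is placed in a sequence by itself rather than being split.
    """
    sequences: List[str] = []
    current: List[str] = []
    used = 0
    for block in blocks:
        cost = count_tokens(block) + (1 if current else 0)  # +1 for SEP
        if current and used + cost > max_tokens:
            # Budget exceeded: close the current sequence and start a new one.
            sequences.append(SEP.join(current))
            current, used = [], 0
            cost = count_tokens(block)
        current.append(block)
        used += cost
    if current:
        sequences.append(SEP.join(current))
    return sequences


print(pack_blocks(["a b", "c d", "e f g"], max_tokens=5))
# prints ['a b<|sep|>c d', 'e f g']
```

Because each packed sequence holds many blocks, one forward pass can label all of them at once, which is the property the Classify step depends on.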

Performance

On 500 real Common Crawl pages (NVIDIA L4 GPU):

  • 15.1 pages/sec (single GPU, 210M model)
  • $6,500 to clean 1 billion pages
  • 16.4x faster than Dripper (autoregressive) on the same hardware
  • 433 MB of VRAM, which fits on virtually any modern GPU
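
The cost figure can be sanity-checked against the throughput. No GPU price is stated here, so the sketch below works backwards: it computes the GPU-hours needed for 1 billion pages at 15.1 pages/sec and the hourly price that the $6,500 figure would imply. The function names are illustrative.

```python
def gpu_hours(pages: float, pages_per_sec: float) -> float:
    """GPU-hours needed to process `pages` at a steady throughput."""
    return pages / pages_per_sec / 3600


def implied_hourly_price(total_cost: float, hours: float) -> float:
    """Hourly GPU price implied by a total cost and a GPU-hour count."""
    return total_cost / hours


hours = gpu_hours(1e9, 15.1)
price = implied_hourly_price(6500, hours)
print(f"{hours:,.0f} GPU-hours, about ${price:.2f} per GPU-hour")
# prints: 18,396 GPU-hours, about $0.35 per GPU-hour
```

On a single GPU that is about two years of continuous running, so a job of this size would in practice be sharded across many GPUs.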

License

Apache 2.0
