
A god-level algorithmic multi-file parser (PDF/PPTX/XLSX/CSV) that scores its own confidence and only reaches for AI when it is genuinely stuck.



terbium

A god-level algorithmic multi-file parser that knows when it is stuck. It reconstructs a document's structure from geometry, scores its own confidence, and only reaches for an AI model when the algorithm cannot be sure.



A vendor document carries most of its content as text but almost none of its structure. A furniture catalogue page is a 2-D matrix: rows are sizes, columns are finishes, and the cells are article numbers. Flatten it to text and the grid is gone: the columns collapse into a single line and the numbers lose their meaning. terbium rebuilds that structure from the raw position of every word and reports how confident it is in the result.
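To make the idea concrete, here is a minimal sketch of rebuilding a grid from word positions. It is not terbium's actual algorithm: the `Word` tuple, the tolerance values and the function names are all assumptions chosen for illustration.

```python
from typing import NamedTuple


class Word(NamedTuple):
    # Hypothetical positioned word; terbium's real type may differ.
    text: str
    x: float  # left edge of the word
    y: float  # vertical position, increasing down the page


def cluster(values, tol):
    """Group sorted coordinates; a new group starts when the gap to the
    previous value exceeds tol. Returns the mean of each group."""
    groups = []
    for v in sorted(values):
        if groups and v - groups[-1][-1] <= tol:
            groups[-1].append(v)
        else:
            groups.append([v])
    return [sum(g) / len(g) for g in groups]


def nearest(centers, v):
    """Index of the cluster center closest to v."""
    return min(range(len(centers)), key=lambda i: abs(centers[i] - v))


def rebuild_grid(words, x_tol=5.0, y_tol=3.0):
    """Assign every word to a (row, column) cell using only its position."""
    cols = cluster([w.x for w in words], x_tol)
    rows = cluster([w.y for w in words], y_tol)
    grid = [["" for _ in cols] for _ in rows]
    for w in words:
        r, c = nearest(rows, w.y), nearest(cols, w.x)
        grid[r][c] = (grid[r][c] + " " + w.text).strip()
    return grid


words = [
    Word("Size", 10, 100), Word("Oak", 80, 100), Word("Walnut", 150, 101),
    Word("120cm", 10, 120), Word("A-101", 81, 120), Word("A-102", 150, 119),
    Word("140cm", 11, 140), Word("A-201", 80, 140),
]
grid = rebuild_grid(words)
for row in grid:
    print(row)
```

The last row comes back with an empty third cell. A gap like that is the kind of signal a confidence score can pick up later.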

Most parsers do one of two things: they fail silently on the hard pages, or they throw the whole document at an LLM and bill you for the easy pages too. terbium does neither. It solves what it can algorithmically, scores every record, and when a page is genuinely ambiguous it either routes just that page to the right model tier, or, if you gave it no key, tells you so in plain words.

The loop

FILE  ->  ADAPT  ->  RECONSTRUCT  ->  SCORE  ->  [ESCALATE]
             |            |             |            |
       pdf/pptx/     columns, rows,  confidence   hard pages only:
       xlsx/csv      matrices from   per record   AI if key, else
                     geometry                      "add a key" message

Phase        What happens
Adapt        One adapter per format normalizes bytes into positioned words + images
Reconstruct  Strip repeated headers, split two-page spreads, rebuild columns/rows/matrices from word geometry
Score        Every table gets a 0-1 confidence from grid regularity, header presence, and fill
Escalate     Below threshold: route the page to Haiku/Sonnet/Opus, or announce that a key would resolve it
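
As a rough illustration of the Score phase, the sketch below combines the three signals named above into one 0-1 number. The equal weighting and the function name `table_confidence` are assumptions; terbium's real scoring may weigh these signals differently.

```python
def table_confidence(grid, has_header):
    """Return a 0-1 score from grid regularity, header presence and fill.

    Equal weights are an arbitrary choice for this sketch.
    """
    if not grid:
        return 0.0
    # Regularity: share of rows that have the most common row width.
    widths = [len(row) for row in grid]
    most_common = max(set(widths), key=widths.count)
    regularity = widths.count(most_common) / len(widths)
    # Fill: share of cells that hold non-blank content.
    cells = [cell for row in grid for cell in row]
    fill = sum(1 for c in cells if str(c).strip()) / len(cells) if cells else 0.0
    header = 1.0 if has_header else 0.0
    return round((regularity + header + fill) / 3, 3)


solid = [["Size", "Oak", "Walnut"], ["120cm", "A-101", "A-102"]]
sparse = [["", "A-101", ""], ["", "", "A-102"], ["A-201"]]
print(table_confidence(solid, has_header=True))
print(table_confidence(sparse, has_header=False))
```

A regular, fully filled table with a header scores 1.0. The ragged, sparse, headerless one scores well below 0.5, so a threshold would mark it for escalation.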

Quickstart

pip install terbium-parse

import terbium

doc = terbium.parse("Furniture Catalogue.pdf")     # algorithmic only, no key needed
print(doc.stats)                                    # Stats(total=725, confident=712, ambiguous=13)

for r in doc.records:
    print(r.sku, r.fields)

# opt into AI only for the pages the engine could not resolve
doc = terbium.parse("Furniture Catalogue.pdf",
                    schema="furniture",
                    ai=terbium.AI(anthropic_key="sk-..."))

Run it from the shell:

terbium "Furniture Catalogue.pdf" --schema furniture
terbium report.xlsx --json out.json

What it parses

Format  Engine               How
PDF     word-level geometry  rebuild columns/rows/matrices from the position of every word
PPTX    python-pptx          native slides, tables and images, straight from the deck structure
XLSX    openpyxl             cells, merged ranges propagated, wide/long layouts
CSV     stdlib               delimiter, encoding and type inference

PDF gets the full geometry engine because a PDF throws its structure away. PPTX, XLSX and CSV already carry native structure, so terbium leans on it and parses them cleanly and cheaply.
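
"Merged ranges propagated" can be pictured with a small sketch. Plain lists stand in for a worksheet here, and the `(first_row, first_col, last_row, last_col)` tuple format is an assumption made for this example, not terbium's or openpyxl's representation.

```python
def propagate_merged(grid, merged_ranges):
    """Copy each merged range's top-left value into every cell it covers.

    merged_ranges holds (first_row, first_col, last_row, last_col) tuples,
    zero-based and inclusive. The input grid is left untouched.
    """
    out = [list(row) for row in grid]
    for r0, c0, r1, c1 in merged_ranges:
        value = grid[r0][c0]
        for r in range(r0, r1 + 1):
            for c in range(c0, c1 + 1):
                out[r][c] = value
    return out


# "Oak" spans two columns in the header row; the second cell reads as empty.
sheet = [
    ["Oak", None, "Walnut"],
    ["120cm", "140cm", "120cm"],
]
filled = propagate_merged(sheet, [(0, 0, 0, 1)])
print(filled[0])
```

After propagation every column carries its own header value, so each cell below can be labeled without looking sideways.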

Confidence and escalation

terbium never pretends a shaky parse is solid. When it cannot be sure and no key is set, it prints exactly what it could not do:

terbium: 712/725 records parsed confidently.
3 table(s) on page(s) 15, 26, 30 are ambiguous (no product title found above
the table; sparse matrix: 5/9 cells filled; 2 row(s) do not line up).
-> set ANTHROPIC_API_KEY or pass ai=terbium.AI(...)   recommended tier: Sonnet

Every record exposes its own confidence and the reasons behind it, so you can filter, sort, or route on it yourself.
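
A minimal sketch of filtering and routing on that confidence follows. The `Record` class is a stand-in, and the attribute names `confidence` and `reasons` as well as the 0.8 threshold are assumptions; check the actual record object for the real names.

```python
from dataclasses import dataclass, field


@dataclass
class Record:
    # Stand-in for a parsed record, not terbium's own class.
    sku: str
    confidence: float
    reasons: list = field(default_factory=list)


def split_by_confidence(records, threshold=0.8):
    """Return (keep, review): confident records, and the rest sorted worst first."""
    keep = [r for r in records if r.confidence >= threshold]
    review = sorted(
        (r for r in records if r.confidence < threshold),
        key=lambda r: r.confidence,
    )
    return keep, review


records = [
    Record("A-101", 0.97),
    Record("A-201", 0.41, ["sparse matrix: 5/9 cells filled"]),
    Record("A-305", 0.62, ["no product title found above the table"]),
]
keep, review = split_by_confidence(records)
for r in review:
    print(r.sku, r.confidence, "; ".join(r.reasons))
```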

The AI lane

The AI lane is opt-in and only ever sees the hard pages.

  • Routing. Difficulty scales the tier: trivial to Haiku, moderate to Sonnet, hard or low-confidence to Opus. Pin a tier with terbium.AI(force_tier="opus").
  • Arrange. A hard table is handed to the routed model with the page's raw text and, for PDFs, a rendered image, and rebuilt into a clean matrix.
  • Vision. Material icons (FSC, oiled, varnished) and finish swatches live only in the pixels; terbium.read_images(path, page, ai) reads them with a vision model. Note: Nano Banana (Gemini image) is for generation, not reading, so it is not on the parse path.
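
The routing rule in the first bullet can be sketched as a small function. The difficulty labels as strings, the 0.5 low-confidence cutoff and the name `pick_tier` are assumptions for illustration; only the tier mapping and the `force_tier` override come from the description above.

```python
def pick_tier(difficulty, confidence, force_tier=None, low_confidence=0.5):
    """Map a page's difficulty and confidence to a model tier.

    A pinned tier always wins. Otherwise hard or low-confidence pages go
    to opus, moderate pages to sonnet, and everything else to haiku.
    """
    if force_tier is not None:
        return force_tier
    if difficulty == "hard" or confidence < low_confidence:
        return "opus"
    if difficulty == "moderate":
        return "sonnet"
    return "haiku"


print(pick_tier("trivial", 0.9))
print(pick_tier("moderate", 0.7))
print(pick_tier("moderate", 0.3))
print(pick_tier("trivial", 0.9, force_tier="opus"))
```

Checking the pinned tier first keeps the override unconditional, which matches the intent of pinning.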

Keys come from terbium.AI(...) or the ANTHROPIC_API_KEY / GEMINI_API_KEY environment variables.

Schemas

A schema turns reconstructed tables into typed records. terbium ships with two:

  • generic (default): one record per row for grids, one per cell for matrices.
  • furniture: product, size, finish, and metric + imperial dimensions per SKU.
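
The difference between "one record per row" and "one record per cell" is easiest to see side by side. This is a sketch of the idea only: the function names and the `row`/`column`/`value` keys are hypothetical, not the generic schema's actual output format.

```python
def grid_to_records(grid):
    """Grid: the first row is the header, each later row becomes one record."""
    header, *rows = grid
    return [dict(zip(header, row)) for row in rows]


def matrix_to_records(grid):
    """Matrix: each non-empty cell becomes one record, labeled by its
    row header (first column) and column header (first row)."""
    col_labels = grid[0][1:]
    records = []
    for row in grid[1:]:
        row_label, cells = row[0], row[1:]
        for col_label, cell in zip(col_labels, cells):
            if cell:
                records.append({"row": row_label, "column": col_label, "value": cell})
    return records


price_list = [["sku", "name"], ["A-101", "Table"], ["A-102", "Bench"]]
size_by_finish = [
    ["Size", "Oak", "Walnut"],
    ["120cm", "A-101", "A-102"],
    ["140cm", "A-201", ""],
]
print(grid_to_records(price_list))
print(matrix_to_records(size_by_finish))
```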

Add your own by subclassing terbium.schema.Schema and registering it.

Install from source

git clone https://github.com/anishfyi/terbium.git
cd terbium
pip install -e .

License

MIT. Built by anishfyi.

terbium · Tb · 65

