A god-level algorithmic multi-file parser (PDF/PPTX/XLSX/CSV) that scores its own confidence and only reaches for AI when it is genuinely stuck.

terbium

A god-level algorithmic multi-file parser that knows when it is stuck. It reconstructs a document's structure from geometry, scores its own confidence, and only reaches for an AI model when the algorithm cannot be sure.

Website · Trove

The one thing terbium does

Point it at any vendor catalogue and get back a table of products, each with its name, its SKU, its materials or ingredients, and its image. Then a CSV.

import terbium

rows = terbium.build_catalog("vendor_catalogue.pdf", images_dir="images/")
terbium.to_catalog_csv(rows, "catalogue.csv")
# each row is a dict, for example:
#   {"sku": "RG-1001", "name": "Anatolia Kilim",
#    "materials": "wool", "image": "Anatolia_Kilim.jpeg", "page": 12}

Each product photo anchors a row: terbium extracts the image, names it after the label beneath it, and mines the nearby text for the SKU and the materials. Clean catalogues come out complete without an API key. For a visual brochure that buries the name in a title and the material in a paragraph, pass ai=terbium.AI() and a vision model reads each photo plus the page text to fill the blanks. See docs/catalog.md.
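To make the "photo anchors a row" idea concrete, here is a minimal sketch of how a label and a SKU could be found near an image. It is not terbium's implementation: the Box type, label_below, find_sku, the SKU pattern and the pixel tolerances are all assumptions made up for illustration.

```python
import re
from dataclasses import dataclass


@dataclass
class Box:
    """A positioned item on a page; y grows downward (hypothetical type)."""
    text: str
    x0: float
    y0: float
    x1: float
    y1: float


# Assumed SKU shape: 2-4 capital letters, a dash, 3-6 digits (e.g. RG-1001).
SKU_RE = re.compile(r"\b[A-Z]{2,4}-\d{3,6}\b")


def label_below(image: Box, words: list[Box], max_gap: float = 40.0) -> str:
    """Join the words on the nearest text line under the image."""
    below = [
        w for w in words
        if 0 <= w.y0 - image.y1 <= max_gap        # starts just under the photo
        and w.x0 < image.x1 and w.x1 > image.x0   # overlaps it horizontally
    ]
    if not below:
        return ""
    top = min(w.y0 for w in below)
    line = sorted((w for w in below if abs(w.y0 - top) < 3.0), key=lambda w: w.x0)
    return " ".join(w.text for w in line)


def find_sku(words: list[Box]):
    """Return the first token that looks like a SKU, or None."""
    for w in words:
        match = SKU_RE.search(w.text)
        if match:
            return match.group(0)
    return None


photo = Box("", 100, 100, 200, 200)
nearby = [
    Box("Anatolia", 105, 210, 150, 222),
    Box("Kilim", 155, 210, 190, 222),
    Box("RG-1001", 105, 230, 160, 242),
    Box("Other", 300, 210, 340, 222),   # under a different photo
]
print(label_below(photo, nearby), find_sku(nearby))
```

When no label or SKU is found, a blank field is the signal that a page may be worth sending to the AI lane.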

Underneath: structure from geometry

A document carries most of its content as text but almost none of its structure. A table in a PDF (a financial grid, a spec sheet, a schedule, a furniture catalogue's size x finish matrix) is laid out for the eye; flatten it to text and the columns collapse into a single line and the grid is gone. terbium rebuilds that structure from the raw position of every word, on any column-aligned table, and it is honest about how sure it is. The default detector is content-agnostic; furniture is the worked example, not the limit.
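As an illustration of rebuilding a grid from word positions, the sketch below clusters left edges into columns and top edges into rows. It is a simplified stand-in, not terbium's engine: it assumes one token per cell, and the names column_starts, rebuild_grid and both tolerances are hypothetical. A production detector would also need right edges and whitespace gutters to handle multi-word cells.

```python
from bisect import bisect_right

Word = tuple[str, float, float]  # (text, left x, top y)


def column_starts(words: list[Word], gap: float = 15.0) -> list[float]:
    """A new column begins wherever sorted left edges jump by more than `gap`."""
    starts: list[float] = []
    prev = None
    for x in sorted(w[1] for w in words):
        if prev is None or x - prev > gap:
            starts.append(x)
        prev = x
    return starts


def rebuild_grid(words: list[Word], gap: float = 15.0, row_tol: float = 3.0):
    """Place every word into a (row, column) cell using only its position."""
    starts = column_starts(words, gap)
    rows: list[tuple[float, list[str]]] = []
    for text, x, y in sorted(words, key=lambda w: (w[2], w[1])):
        if not rows or y - rows[-1][0] > row_tol:
            rows.append((y, [""] * len(starts)))   # start a new row
        col = bisect_right(starts, x) - 1          # rightmost start <= x
        cells = rows[-1][1]
        cells[col] = (cells[col] + " " + text).strip()
    return [cells for _, cells in rows]


words = [
    ("Size", 50, 10), ("Oak", 120, 10), ("Walnut", 200, 10),
    ("Small", 50, 30), ("RG-1", 121, 30.5), ("RG-2", 199, 30),
    ("Large", 50, 50), ("RG-4", 201, 50),
]
for row in rebuild_grid(words):
    print(row)
```

The empty middle cell in the last row survives as an empty string, which is exactly the information that plain text extraction loses.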

Most parsers do one of two things: they fail silently on the hard pages, or they throw the whole document at an LLM and bill you for the easy pages too. terbium does neither. It solves what it can algorithmically, scores every record, and when a page is genuinely ambiguous it either routes just that page to the right model tier, or, if you gave it no key, tells you so in plain words.

The loop

FILE  ->  ADAPT  ->  RECONSTRUCT  ->  SCORE  ->  [ESCALATE]
             |            |             |            |
       pdf/pptx/     columns, rows,  confidence   hard pages only:
       xlsx/csv      matrices from   per record   AI if key, else
                     geometry                      "add a key" message

Phase	What happens
Adapt	One adapter per format normalizes bytes into positioned words + images
Reconstruct	Strip repeated headers, split two-page spreads, rebuild columns/rows/matrices from word geometry
Score	Every table gets a 0-1 confidence from grid regularity, header presence, and fill
Escalate	Below threshold: route the page to Haiku/Sonnet/Opus, or announce that a key would resolve it
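The Score phase names three signals: grid regularity, header presence, and fill. One way to combine them into a 0-1 number is sketched below. The weights (0.4 / 0.2 / 0.4), the 0.7 threshold and the function name table_confidence are assumptions chosen for illustration; terbium's actual formula is not documented here.

```python
def table_confidence(grid: list[list[str]], has_header: bool) -> float:
    """Blend grid regularity, header presence and fill into a 0-1 score."""
    if not grid:
        return 0.0
    width = max(len(row) for row in grid)
    if width == 0:
        return 0.0
    # Regularity: share of rows that have the full column count.
    regularity = sum(1 for row in grid if len(row) == width) / len(grid)
    # Fill: share of cells in the full rectangle that hold any text.
    filled = sum(1 for row in grid for cell in row if cell.strip())
    fill = filled / (len(grid) * width)
    header = 1.0 if has_header else 0.0
    return round(0.4 * regularity + 0.2 * header + 0.4 * fill, 3)


THRESHOLD = 0.7  # assumed cut-off below which a table would be escalated

sparse = [["A", "", "C"], ["", "E", ""], ["G", "", "I"]]   # 5/9 cells filled
score = table_confidence(sparse, has_header=False)
print(score, "escalate" if score < THRESHOLD else "keep")
```

A sparse 5/9 matrix with no header lands below the assumed threshold, while a full, regular grid with a header scores 1.0.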

Quickstart

pip install terbium-parse

import terbium

doc = terbium.parse("Furniture Catalogue.pdf")     # algorithmic only, no key needed
print(doc.stats)                                    # Stats(total=725, confident=712, ambiguous=13)

for r in doc.records:
    print(r.sku, r.fields)

# opt into AI only for the pages the engine could not resolve
doc = terbium.parse("Furniture Catalogue.pdf",
                    schema="furniture",
                    ai=terbium.AI(anthropic_key="sk-..."))

Pull out the product images, each named after the product it sits under:

manifest = terbium.export_images("lookbook.pdf", "out/")
# out/Kyoto_Bedside_Table.jpeg, out/Meadow_Bedside_Table.jpeg, ...
# manifest rows carry: product, collection, page, pixel size, colorspace,
# format, dpi, dominant_color, bbox

Run it from the shell:

terbium "Furniture Catalogue.pdf" --schema furniture
terbium report.xlsx --json out.json
terbium lookbook.pdf --images out/          # extract product photos + manifest.csv

What it parses

Format	Engine	How
PDF	word-level geometry	rebuild columns/rows/matrices from the position of every word
PPTX	python-pptx	native slides, tables and images, straight from the deck structure
XLSX	openpyxl	cells, merged ranges propagated, wide/long layouts
CSV	stdlib	delimiter, encoding and type inference

PDF gets the full geometry engine because a PDF throws its structure away. PPTX, XLSX and CSV already carry native structure, so terbium leans on it and parses them cleanly and cheaply.
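For CSV, the standard library already covers most of the work. The sketch below shows delimiter detection with csv.Sniffer plus a naive type inference pass. It only illustrates the idea: infer_type and read_csv_text are hypothetical helpers, not terbium functions, and encoding detection is left out.

```python
import csv
import io


def infer_type(value):
    """Try int, then float; fall back to the original value."""
    for cast in (int, float):
        try:
            return cast(value)
        except (ValueError, TypeError):   # TypeError covers missing cells (None)
            pass
    return value


def read_csv_text(text: str) -> list[dict]:
    """Sniff the delimiter from a sample, then parse rows into typed dicts."""
    dialect = csv.Sniffer().sniff(text[:4096], delimiters=",;\t|")
    reader = csv.DictReader(io.StringIO(text), dialect=dialect)
    return [{key: infer_type(val) for key, val in row.items()} for row in reader]


sample = (
    "sku;name;price\n"
    "RG-1001;Anatolia Kilim;120\n"
    "RG-1002;Bergama;95.5\n"
)
for row in read_csv_text(sample):
    print(row)
```

Note that csv.Sniffer raises csv.Error when it cannot settle on a delimiter, so real code would want a fallback such as a plain comma.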

Not every PDF is a matrix. When a document is a lookbook (a grid of product photos with a name under each), terbium reconstructs the label grid instead: one record per product, grouped under its collection title. And when a page is image-only, with no text layer at all, terbium does not return an empty result: it reports exactly which pages need the vision lane.
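The image-only check can be pictured as a simple filter over per-page counts. This is a sketch under assumed inputs: the (page number, word count, image count) tuples and the name pages_needing_vision are invented for illustration and are not part of terbium's API.

```python
def pages_needing_vision(pages: list[tuple[int, int, int]]) -> list[int]:
    """Return page numbers that carry images but no extractable words.

    Each entry is (page_number, word_count, image_count).
    """
    return [number for number, words, images in pages if words == 0 and images > 0]


summary = [(1, 240, 2), (2, 0, 1), (3, 0, 0), (4, 0, 3)]
print(pages_needing_vision(summary))
```

Page 3 in the example is blank rather than image-only, so it is not reported.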

Confidence and escalation

terbium never pretends a shaky parse is solid. When it cannot be sure and no key is set, it prints exactly what it could not do:

terbium: 712/725 records parsed confidently.
3 table(s) on page(s) 15, 26, 30 are ambiguous (no product title found above
the table; sparse matrix: 5/9 cells filled; 2 row(s) do not line up).
-> set ANTHROPIC_API_KEY or pass ai=terbium.AI(...)   recommended tier: Sonnet

Every record exposes its own confidence and the reasons behind it, so you can filter, sort, or route on it yourself.
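Filtering on confidence might look like the sketch below. The Record class here is a stand-in written for this example, and the attribute names confidence and reasons are assumptions; check the real record objects for their actual field names.

```python
from dataclasses import dataclass, field


@dataclass
class Record:
    """Stand-in for a parsed record (not terbium's own class)."""
    sku: str
    confidence: float
    reasons: list[str] = field(default_factory=list)


def split_by_confidence(records: list[Record], threshold: float = 0.8):
    """Return (confident, needs_review), with the shakiest records first."""
    confident = [r for r in records if r.confidence >= threshold]
    review = sorted(
        (r for r in records if r.confidence < threshold),
        key=lambda r: r.confidence,
    )
    return confident, review


records = [
    Record("RG-1001", 0.97),
    Record("RG-1002", 0.55, ["sparse matrix: 5/9 cells filled"]),
    Record("RG-1003", 0.30, ["no product title found above the table"]),
]
confident, review = split_by_confidence(records)
for r in review:
    print(r.sku, r.confidence, "; ".join(r.reasons))
```

Sorting the review list by ascending confidence puts the records most likely to need a human or a model at the top.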

The AI lane

The AI lane is opt-in and only ever sees the hard pages.

Routing. Difficulty scales the tier: trivial to Haiku, moderate to Sonnet, hard or low-confidence to Opus. Pin a tier with terbium.AI(force_tier="opus").
Arrange. A hard table is handed to the routed model with the page's raw text and, for PDFs, a rendered image, and rebuilt into a clean matrix.
Vision. Material icons (FSC, oiled, varnished) and finish swatches live only in the pixels; terbium.read_images(path, page, ai) reads them with a vision model. Note: Nano Banana (Gemini image) is for generation, not reading, so it is not on the parse path.
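The routing rule above can be written as a small decision function. This is a sketch of the stated policy only: the difficulty labels as strings, the 0.4 low-confidence cut-off and the name pick_tier are assumptions, and how terbium measures difficulty is not shown here.

```python
def pick_tier(difficulty: str, confidence: float,
              force_tier=None, low_confidence: float = 0.4) -> str:
    """Map a page's difficulty and confidence to a model tier."""
    if force_tier:                       # a pinned tier always wins
        return force_tier
    if difficulty == "hard" or confidence < low_confidence:
        return "opus"
    if difficulty == "moderate":
        return "sonnet"
    return "haiku"                       # trivial pages take the cheapest tier


print(pick_tier("trivial", 0.9))
print(pick_tier("moderate", 0.6))
print(pick_tier("moderate", 0.2))
print(pick_tier("trivial", 0.9, force_tier="opus"))
```

Low confidence overrides a mild difficulty label, which matches the "hard or low-confidence to Opus" rule.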

Keys come from terbium.AI(...) or the ANTHROPIC_API_KEY / GEMINI_API_KEY environment variables.
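A common way to resolve a key from those two sources is sketched below. The precedence (explicit argument first, environment second) is an assumption, and resolve_key is a hypothetical helper rather than something terbium exports.

```python
import os


def resolve_key(explicit=None, env_var: str = "ANTHROPIC_API_KEY", environ=None):
    """Return the explicit key if given, else the environment value, else None."""
    environ = os.environ if environ is None else environ
    return explicit or environ.get(env_var) or None


key = resolve_key()
print("AI lane available" if key else "algorithmic only: no key found")
```

Passing environ explicitly keeps the helper easy to test without touching the real process environment.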

Schemas

A schema turns reconstructed tables into typed records. terbium ships with two:

generic (default): one record per row for grids, one per cell for matrices.
furniture: product, size, finish, and metric + imperial dimensions per SKU.

Add your own by subclassing terbium.schema.Schema and registering it.
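The shape of a custom schema might resemble the sketch below. Everything in it is a stand-in: the local Schema base class, the to_records method, the register decorator and the REGISTRY dict are invented so the example runs on its own, and the real terbium.schema.Schema interface may differ.

```python
class Schema:
    """Stand-in base class (NOT terbium.schema.Schema)."""
    name = "generic"

    def to_records(self, table: list[list[str]]) -> list[dict]:
        """One record per row, keyed by the header row."""
        header, *rows = table
        return [dict(zip(header, row)) for row in rows]


REGISTRY: dict[str, type] = {}   # hypothetical registry keyed by schema name


def register(cls):
    """Class decorator that records a schema under its name."""
    REGISTRY[cls.name] = cls
    return cls


@register
class RugSchema(Schema):
    """Example schema that types the width and adds an imperial value."""
    name = "rugs"

    def to_records(self, table):
        records = super().to_records(table)
        for record in records:
            record["width_cm"] = float(record.pop("Width (cm)"))
            record["width_in"] = round(record["width_cm"] / 2.54, 1)
        return records


table = [["SKU", "Width (cm)"], ["RG-1001", "254"]]
print(REGISTRY["rugs"]().to_records(table))
```

The subclass reuses the generic row-to-record step and only adds typing on top, which keeps a custom schema small.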

Install from source

git clone https://github.com/anishfyi/terbium.git
cd terbium
pip install -e .

License

MIT. Built by anishfyi.

terbium · Tb · 65

Release history

Version	Released
0.9.2	Jul 3, 2026
0.9.1	Jul 3, 2026 (this version)
0.3.0	Jul 3, 2026
0.2.0	Jul 3, 2026
0.1.0	Jul 3, 2026
