Neutral Internal Representation and Codec registry for image-based table-recognition datasets
tablecodec
One lossless Internal Representation (IR) for image-based table-recognition datasets, plus a registry of codecs that translate between the IR and the fragmented public formats — PubTabNet, FinTabNet, OTSL, TableFormer, DocTags-tables, PubTables-1M, TableBank.
Read any of them into one neutral shape, validate it, convert between formats,
and get a static, data-free loss report for any conversion before you run it.
The core has zero third-party runtime dependencies: `import tablecodec`
works on a bare Python 3.11+; heavier features (TEDS, CLI, HF streaming) are
opt-in extras.
docs/spec.md is the source of truth. The 0.x line makes no
API-stability promises; the public surface freezes at 1.0 (SPEC §14).
Install
```shell
pip install tablecodec            # stdlib-only core
pip install "tablecodec[cli]"     # + command-line interface (click)
pip install "tablecodec[teds]"    # + TEDS similarity metric (apted, lxml)
```
Quick start
```python
import tablecodec
from tablecodec import codecs, validate, profiles, analyze_loss
from tablecodec.codecs.pubtabnet import PubTabNet20Codec

# Register a codec (the CLI self-registers the built-ins; in library use you
# register the ones you need).
codecs.register(PubTabNet20Codec())

# Stream-read a dataset into the neutral IR (constant memory).
with open("pubtabnet_val.jsonl", encoding="utf-8") as f:
    for sample in codecs.get("pubtabnet-2.0.0").read(f):
        errors = validate(sample, profile=profiles.DEFAULT)
        if errors:
            print(sample.filename, errors)

# Static, data-free loss analysis between two formats.
report = analyze_loss(source="pubtabnet-2.0.0", target="otsl-1.0.0")
print(report.round_trip_classification)  # "structure-preserving"
```
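To make the register/get pattern above more concrete, here is a minimal, self-contained sketch of a codec registry keyed by `name-version` ids, with a reader that yields samples lazily. Everything in it (`Registry`, `Sample`, `JsonlEchoCodec`, the `demo-1.0.0` id, the toy JSONL shape) is hypothetical and only illustrates the idea; it is not tablecodec's implementation or API.

```python
import io
import json
from dataclasses import dataclass
from typing import IO, Iterator, Protocol


@dataclass(frozen=True)
class Sample:
    """Hypothetical stand-in for one table sample in a neutral shape."""
    filename: str
    n_cells: int


class Codec(Protocol):
    """Hypothetical codec interface: an id plus a streaming reader."""
    codec_id: str

    def read(self, stream: IO[str]) -> Iterator[Sample]: ...


class JsonlEchoCodec:
    """Toy codec: one JSON object per line, yielded lazily."""
    codec_id = "demo-1.0.0"

    def read(self, stream: IO[str]) -> Iterator[Sample]:
        for line in stream:  # one line at a time, so memory stays flat
            if line.strip():
                row = json.loads(line)
                yield Sample(row["filename"], len(row["cells"]))


class Registry:
    """Maps codec ids to codec instances and rejects duplicate ids."""

    def __init__(self) -> None:
        self._codecs: dict[str, Codec] = {}

    def register(self, codec: Codec) -> None:
        if codec.codec_id in self._codecs:
            raise ValueError(f"codec already registered: {codec.codec_id}")
        self._codecs[codec.codec_id] = codec

    def get(self, codec_id: str) -> Codec:
        try:
            return self._codecs[codec_id]
        except KeyError:
            raise KeyError(f"unknown codec: {codec_id}") from None


registry = Registry()
registry.register(JsonlEchoCodec())

# An in-memory stream stands in for a dataset file.
data = io.StringIO(
    '{"filename": "a.png", "cells": [1, 2, 3]}\n'
    '{"filename": "b.png", "cells": []}\n'
)
samples = list(registry.get("demo-1.0.0").read(data))
print(samples)
```

Keying the registry by a versioned id keeps two revisions of the same format (such as the `pubtabnet-1.0.0` and `pubtabnet-2.0.0` ids listed in this README) addressable side by side.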
Supported
Verified in CI (see .github/workflows/ci.yaml).
| Component | Supported | Notes |
|---|---|---|
| Python | 3.11 – 3.14 | core is stdlib-only (zero runtime deps, SPEC §13) |
| Codecs | 9 built-in | pubtabnet-1.0.0/2.0.0, otsl-1.0.0, fintabnet, fintabnet-otsl, tableformer, tablebank, pubtables-1m, doctags-tables |
| Extras | [cli] [teds] [hf] | click · apted+lxml · datasets (occasional/local e2e) |
| Bridge | docling-tables | a separate tablecodec-docling package (packages/, own version) |
Auto-generated capability tables: format support · loss matrix. Dependency bumps within these ranges are tracked by Dependabot.
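
One way to picture a static, data-free loss analysis is as set arithmetic over declared format capabilities: whatever the source can express and the target cannot is what a conversion may lose. The sketch below assumes that model. The feature names and the two made-up formats are invented for illustration and do not describe tablecodec's loss matrix or any real format.

```python
from dataclasses import dataclass

# Hypothetical capability declarations (illustrative only).
CAPABILITIES: dict[str, frozenset[str]] = {
    "format-a": frozenset({"grid", "spans", "cell_text", "cell_bbox"}),
    "format-b": frozenset({"grid", "spans", "cell_bbox"}),
}


@dataclass(frozen=True)
class LossReport:
    source: str
    target: str
    lost: frozenset[str]

    @property
    def lossless(self) -> bool:
        return not self.lost


def analyze(source: str, target: str) -> LossReport:
    """Compare declared capabilities only; no dataset is read."""
    lost = CAPABILITIES[source] - CAPABILITIES[target]
    return LossReport(source, target, lost)


def loss_matrix() -> dict[tuple[str, str], list[str]]:
    """Build the report for every ordered pair of distinct formats."""
    names = sorted(CAPABILITIES)
    return {
        (s, t): sorted(analyze(s, t).lost)
        for s in names
        for t in names
        if s != t
    }


matrix = loss_matrix()
for (s, t), lost in matrix.items():
    print(f"{s} -> {t}: {', '.join(lost) or 'nothing'}")
```

Because the comparison never touches data, it can run before any conversion, which is the property the loss report in this README advertises.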
TEDS similarity ([teds] extra)
A Tree-Edit-Distance-based Similarity score between two samples. It lives
outside the core (it imports apted/lxml), so import it from its submodule:
```python
from tablecodec.teds import teds

score = teds(pred_sample, true_sample)                        # 0.0 .. 1.0
struct = teds(pred_sample, true_sample, structure_only=True)  # ignore cell text
```
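
For orientation, TEDS as defined in the PubTabNet paper normalizes the tree edit distance between two table trees by the node count of the larger tree: `1 - EditDist(Ta, Tb) / max(|Ta|, |Tb|)`. The helper below shows only that normalization step. Its name and the empty-tree convention are assumptions of this sketch, and computing the edit distance itself (for which the `[teds]` extra pulls in apted and lxml) is out of scope here.

```python
def teds_from_edit_distance(edit_distance: float, n_nodes_a: int, n_nodes_b: int) -> float:
    """Normalize a tree edit distance into a similarity score.

    Hypothetical helper: it takes an already-computed edit distance and the
    node counts of the two trees, and applies 1 - dist / max(|Ta|, |Tb|).
    """
    largest = max(n_nodes_a, n_nodes_b)
    if largest == 0:
        return 1.0  # convention for this sketch: two empty trees are identical
    return 1.0 - edit_distance / largest


identical = teds_from_edit_distance(0, 10, 10)
half = teds_from_edit_distance(5, 10, 8)
print(identical, half)
```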
CLI ([cli] extra)
```shell
tablecodec codecs list
tablecodec analyze-loss --from pubtabnet-2.0.0 --to otsl-1.0.0
tablecodec validate path/to/dataset.jsonl --codec pubtabnet-2.0.0 --profile DEFAULT
tablecodec stats path/to/dataset.jsonl --codec pubtabnet-2.0.0 --json
tablecodec convert in.jsonl out.jsonl --from pubtabnet-2.0.0 --to otsl-1.0.0
tablecodec convert in.jsonl /dev/null --from pubtabnet-2.0.0 --to otsl-1.0.0 --dry-run
tablecodec diff a.jsonl b.jsonl --codec pubtabnet-2.0.0
```
All commands stream their input; exit codes are non-zero on validation failures or diffs (suitable for CI / data pipelines).
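
Since failures surface as non-zero exit codes, a pipeline step can gate on the return code alone. The following is a minimal sketch of such a gate from Python; `gate` is a hypothetical helper, and the commands actually run here are stand-ins that simply exit with a chosen status so the example works without tablecodec installed.

```python
import subprocess
import sys
from collections.abc import Sequence


def gate(cmd: Sequence[str]) -> bool:
    """Run a command and report whether it exited with status 0."""
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode != 0:
        print(f"step failed ({result.returncode}): {' '.join(cmd)}", file=sys.stderr)
    return result.returncode == 0


# Stand-in commands. In a real pipeline the argument list could be one of the
# invocations listed above, for example:
#   ["tablecodec", "validate", "path/to/dataset.jsonl", "--codec", "pubtabnet-2.0.0"]
ok = gate([sys.executable, "-c", "raise SystemExit(0)"])
failed = gate([sys.executable, "-c", "raise SystemExit(1)"])
print(ok, failed)
```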
End-to-end check against real datasets
scripts/e2e_hf_check.py streams real datasets through the codecs and validates
the resulting IR. It is occasional / local-only (network + multi-GB
datasets), not part of CI. Every shipped codec gets at least one official-corpus
check, from three sources:
- the Docling OTSL family (`docling-project/{PubTabNet,FinTabNet,PubTables-1M,SynthTabNet}_OTSL`), a uniform converted schema that feeds all nine codecs;
- the native first-published PubTabNet annotation (`apoidea/pubtabnet-html`) fed unmodified to the `pubtabnet` codecs;
- the native PubTables-1M PASCAL VOC structure annotation (`bsmock/pubtables-1m`, download-only) with the logical grid reconstructed for the `pubtables-1m` codec.
```shell
just e2e-selftest            # network-free adapter smoke test
just e2e 200                 # 200 randomly-sampled rows per check (needs [hf] extra)
just e2e-fetch-pubtables1m   # download native PubTables-1M VOC (~30MB) into input/
```
Rows are sampled randomly and each run prints its --seed, so repeated runs
progressively cover the corpora and any finding is reproducible. Failures are
appended to output/e2e_findings/ (gitignored) with a replayable payload. See
ADR 0003 and ADR 0004 for the data-source decisions and the
canonical-vs-real-shape caveats.
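
The reproducibility claim rests on seeding: the same seed over the same stream selects the same rows. As an illustration only, here is one common way to sample a fixed number of rows from a stream of unknown length in a single pass, reservoir sampling (Algorithm R), driven by a seeded generator. The function and the fake stream are hypothetical; this is not a description of how `scripts/e2e_hf_check.py` samples.

```python
import random
from collections.abc import Iterable, Iterator
from typing import TypeVar

T = TypeVar("T")


def reservoir_sample(rows: Iterable[T], k: int, seed: int) -> list[T]:
    """Uniformly sample up to k rows from a stream in one pass (Algorithm R)."""
    rng = random.Random(seed)  # private generator, so the seed fully determines the picks
    reservoir: list[T] = []
    for i, row in enumerate(rows):
        if i < k:
            reservoir.append(row)  # fill the reservoir with the first k rows
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = row  # replace with probability k / (i + 1)
    return reservoir


def fake_stream(n: int) -> Iterator[int]:
    """Stand-in for a streamed dataset: yields row indices lazily."""
    yield from range(n)


first = reservoir_sample(fake_stream(10_000), k=5, seed=1234)
again = reservoir_sample(fake_stream(10_000), k=5, seed=1234)
print(first == again)  # same seed and same stream give the same rows
```

Printing the seed with each run, as the script does, is what lets a later run replay the exact sample that produced a finding.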
Documentation
- `docs/spec.md`: Specification (the single source of truth).
- `docs/glossary.md`: Precise vocabulary, covering terms tablecodec defines vs. borrows (e.g. "loss" vs a "degenerate" bbox).
- `docs/intent.md`: Implementation brief and roadmap (milestones, quality bar, §8 future work).
- `docs/adr/`: The decisions and their reasoning (the "Why").
- `CHANGELOG.md`: Keep a Changelog format.
Development
```shell
just install   # editable install with dev + cli + teds extras
just ci        # lint + pyright (strict) + pytest + semgrep + docs-check
just docs      # regenerate the codec/loss tables (docs-check enforces freshness)
just ci-all    # core + the in-repo tablecodec-docling bridge
```
Releases are published from GitHub Actions via PyPI OIDC Trusted Publishing (no long-lived token) and carry PEP 740 attestations and SLSA build provenance (ADR 0014).
License
MIT. See LICENSE. The OTSL grid-reconstruction logic and the TEDS metric are adapted (with attribution) from upstream MIT / Apache-2.0 sources — see THIRD_PARTY_NOTICES.md.