Neutral Internal Representation and Codec registry for image-based table-recognition datasets

tablecodec

One lossless Internal Representation (IR) for image-based table-recognition datasets, plus a registry of codecs that translate between the IR and the fragmented public formats — PubTabNet, FinTabNet, OTSL, TableFormer, DocTags-tables, PubTables-1M, TableBank.

Read any of them into one neutral shape, validate it, convert between formats, and get a static, data-free loss report for any conversion before you run it. The core has zero third-party runtime dependencies: import tablecodec works on a bare Python 3.11+ install. Heavier features (TEDS, CLI, HF streaming) are opt-in extras.

docs/spec.md is the source of truth. The 0.x line makes no API-stability promises; the public surface freezes at 1.0 (SPEC §14).

Install

pip install tablecodec            # stdlib-only core
pip install "tablecodec[cli]"     # + command-line interface (click)
pip install "tablecodec[teds]"    # + TEDS similarity metric (apted, lxml)
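
Because the extras are opt-in, code that relies on them may want to check what is importable before using it. The following is a minimal sketch of such a check using only the standard library. The mapping of extras to modules is taken from the extras named in this README (click, apted, lxml, datasets); the helper names `extra_available` and `available_extras` are hypothetical and not part of tablecodec.

```python
from importlib.util import find_spec

# Top-level modules each extra is described as pulling in. This mapping is
# illustrative: it is written by hand, not read from the package metadata.
EXTRA_MODULES = {
    "cli": ["click"],
    "teds": ["apted", "lxml"],
    "hf": ["datasets"],
}


def extra_available(modules):
    """Return True if every named top-level module can be located."""
    return all(find_spec(name) is not None for name in modules)


def available_extras(extra_modules=EXTRA_MODULES):
    """Map each extra name to whether all of its modules are importable."""
    return {extra: extra_available(mods) for extra, mods in extra_modules.items()}
```

Calling `available_extras()` returns a dict such as `{"cli": ..., "teds": ..., "hf": ...}` whose values depend on the environment, which can be used to skip TEDS scoring or HF streaming gracefully when the extra is not installed.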

Quick start

import tablecodec
from tablecodec import codecs, validate, profiles, analyze_loss
from tablecodec.codecs.pubtabnet import PubTabNet20Codec

# Register a codec (the CLI self-registers the built-ins; in library use you
# register the ones you need).
codecs.register(PubTabNet20Codec())

# Stream-read a dataset into the neutral IR (constant memory).
with open("pubtabnet_val.jsonl", encoding="utf-8") as f:
    for sample in codecs.get("pubtabnet-2.0.0").read(f):
        errors = validate(sample, profile=profiles.DEFAULT)
        if errors:
            print(sample.filename, errors)

# Static, data-free loss analysis between two formats.
report = analyze_loss(source="pubtabnet-2.0.0", target="otsl-1.0.0")
print(report.round_trip_classification)  # "structure-preserving"
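
To illustrate the register / get / streaming-read pattern used above without requiring any dataset, here is a stand-alone toy sketch. It is not tablecodec's implementation: `Registry`, `ToyJsonlCodec`, and the codec name `toy-jsonl-1.0.0` are hypothetical, and the "samples" are plain dicts rather than the real IR.

```python
import io
import json


class ToyJsonlCodec:
    """Hypothetical codec: one JSON object per line, yielded lazily."""

    name = "toy-jsonl-1.0.0"

    def read(self, stream):
        # A generator keeps memory constant: only one line is held at a time.
        for line in stream:
            line = line.strip()
            if line:
                yield json.loads(line)


class Registry:
    """Hypothetical name -> codec lookup, mirroring register()/get()."""

    def __init__(self):
        self._codecs = {}

    def register(self, codec):
        if codec.name in self._codecs:
            raise ValueError(f"codec already registered: {codec.name}")
        self._codecs[codec.name] = codec

    def get(self, name):
        try:
            return self._codecs[name]
        except KeyError:
            raise KeyError(f"unknown codec: {name}") from None


registry = Registry()
registry.register(ToyJsonlCodec())

# An in-memory stream stands in for an opened .jsonl file.
stream = io.StringIO('{"filename": "a.png"}\n\n{"filename": "b.png"}\n')
filenames = [s["filename"] for s in registry.get("toy-jsonl-1.0.0").read(stream)]
```

Keying codecs by a versioned name keeps the call site independent of the concrete codec class, which is the property the quick start relies on when it looks up "pubtabnet-2.0.0".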

Supported

Verified in CI (see .github/workflows/ci.yaml).

Component | Supported | Notes
Python | 3.11 – 3.14 | core is stdlib-only (zero runtime deps, SPEC §13)
Codecs | 9 built-in | pubtabnet-1.0.0/2.0.0, otsl-1.0.0, fintabnet, fintabnet-otsl, tableformer, tablebank, pubtables-1m, doctags-tables
Extras | [cli] [teds] [hf] | click · apted+lxml · datasets (occasional/local e2e)
Bridge | docling-tables | a separate tablecodec-docling package (packages/, own version)

Auto-generated capability tables: format support · loss matrix. Dependency bumps within these ranges are tracked by Dependabot.

TEDS similarity ([teds] extra)

A Tree-Edit-Distance-based Similarity score between two samples. It lives outside the core (it imports apted/lxml), so import it from its submodule:

from tablecodec.teds import teds

score = teds(pred_sample, true_sample)                        # 0.0 .. 1.0
struct = teds(pred_sample, true_sample, structure_only=True)  # ignore cell text
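
For orientation, TEDS as defined in the PubTabNet paper normalizes a tree edit distance by the size of the larger tree: TEDS = 1 - EditDist / max(|T_pred|, |T_true|). The sketch below shows only that final normalization step, assuming the edit distance and node counts have already been computed. The function name `teds_from_edit_distance` is hypothetical, and tablecodec's own implementation may differ in details such as edit costs.

```python
def teds_from_edit_distance(edit_distance: float, n_pred: int, n_true: int) -> float:
    """Normalize a tree edit distance into a similarity score.

    n_pred and n_true are the node counts of the predicted and true trees.
    """
    largest = max(n_pred, n_true)
    if largest == 0:
        # Assumption for this sketch: two empty trees count as identical.
        return 1.0
    return 1.0 - edit_distance / largest
```

With a distance of 1 between trees of 4 and 2 nodes, the score is 1 - 1/4 = 0.75; identical trees (distance 0) score 1.0.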

CLI ([cli] extra)

tablecodec codecs list
tablecodec analyze-loss --from pubtabnet-2.0.0 --to otsl-1.0.0
tablecodec validate path/to/dataset.jsonl --codec pubtabnet-2.0.0 --profile DEFAULT
tablecodec stats path/to/dataset.jsonl --codec pubtabnet-2.0.0 --json
tablecodec convert in.jsonl out.jsonl --from pubtabnet-2.0.0 --to otsl-1.0.0
tablecodec convert in.jsonl /dev/null --from pubtabnet-2.0.0 --to otsl-1.0.0 --dry-run
tablecodec diff a.jsonl b.jsonl --codec pubtabnet-2.0.0

All commands stream their input; exit codes are non-zero on validation failures or diffs (suitable for CI / data pipelines).
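
Since the commands signal failure through their exit status, a pipeline step can gate on them from Python. The following is a minimal sketch; the helper name `passes` is hypothetical, and a stand-in command built from the current interpreter is used so the example runs without tablecodec installed. The specific non-zero values are not documented here, so the sketch treats any non-zero status as failure.

```python
import subprocess
import sys


def passes(cmd):
    """Run cmd and report whether it exited with status 0."""
    return subprocess.run(cmd, check=False).returncode == 0


# Stand-in commands: a child interpreter that exits with a chosen status.
ok = passes([sys.executable, "-c", "raise SystemExit(0)"])
failed = not passes([sys.executable, "-c", "raise SystemExit(1)"])
```

In a real pipeline the stand-in would be replaced by one of the commands above, for example `["tablecodec", "validate", "path/to/dataset.jsonl", "--codec", "pubtabnet-2.0.0", "--profile", "DEFAULT"]`.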

End-to-end check against real datasets

scripts/e2e_hf_check.py streams real datasets through the codecs and validates the resulting IR. It is occasional / local-only (network + multi-GB datasets), not part of CI. Every shipped codec gets at least one official-corpus check, from three sources:

  • the Docling OTSL family (docling-project/{PubTabNet,FinTabNet,PubTables-1M,SynthTabNet}_OTSL) — a uniform converted schema that feeds all nine codecs;
  • the native first-published PubTabNet annotation (apoidea/pubtabnet-html) fed unmodified to the pubtabnet codecs;
  • the native PubTables-1M PASCAL VOC structure annotation (bsmock/pubtables-1m, download-only) with the logical grid reconstructed for the pubtables-1m codec.

just e2e-selftest              # network-free adapter smoke test
just e2e 200                   # 200 randomly-sampled rows per check (needs [hf] extra)
just e2e-fetch-pubtables1m     # download native PubTables-1M VOC (~30MB) into input/

Rows are sampled randomly and each run prints its --seed, so repeated runs progressively cover the corpora and any finding is reproducible. Failures are appended to output/e2e_findings/ (gitignored) with a replayable payload. See ADR 0003 and ADR 0004 for the data-source decisions and the canonical-vs-real-shape caveats.
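
The reproducibility claim rests on seeding the random source. How scripts/e2e_hf_check.py samples rows is not described here, so the following is only one common technique for streamed data, reservoir sampling (Algorithm R) with an explicit seed. The function name `reservoir_sample` is hypothetical.

```python
import random


def reservoir_sample(rows, k, seed):
    """Pick up to k rows uniformly at random from an iterable of unknown length.

    The same seed and the same input order always yield the same sample.
    """
    rng = random.Random(seed)  # private generator; global state is untouched
    reservoir = []
    for i, row in enumerate(rows):
        if i < k:
            reservoir.append(row)
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = row
    return reservoir
```

Rerunning with a printed seed reproduces the exact rows that triggered a finding, while a fresh seed on each run gradually covers more of the corpus.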

Documentation

  • docs/spec.md — Specification (the single source of truth).
  • docs/glossary.md — Precise vocabulary: terms tablecodec defines vs. borrows (e.g. "loss" vs a "degenerate" bbox).
  • docs/intent.md — Implementation brief and roadmap (milestones, quality bar, §8 future work).
  • docs/adr/ — the decisions and their reasoning (the "Why").
  • CHANGELOG.md — Keep a Changelog format.

Development

just install      # editable install with dev + cli + teds extras
just ci           # lint + pyright (strict) + pytest + semgrep + docs-check
just docs         # regenerate the codec/loss tables (docs-check enforces freshness)
just ci-all       # core + the in-repo tablecodec-docling bridge

Releases are published from GitHub Actions via PyPI OIDC Trusted Publishing (no long-lived token), carrying PEP 740 attestations and a SLSA build provenance (ADR 0014).

License

MIT. See LICENSE. The OTSL grid-reconstruction logic and the TEDS metric are adapted (with attribution) from upstream MIT / Apache-2.0 sources — see THIRD_PARTY_NOTICES.md.
