Lightweight, performant, deep table extraction

gmft

There are many pdfs out there, and many of those pdfs have tables. But despite a plethora of table extraction options, there is still no definitive extraction method.

About

gmft converts pdf tables to many formats. It is lightweight, modular, and performant. Batteries included: it just works, offering strong performance with the default settings.

It relies on Microsoft's Table Transformer, qualitatively the most performant and reliable of the many alternatives.

Install: pip install gmft

Quickstarts: demo notebook, bulk extract, readthedocs.

Documentation: readthedocs

Why should I use gmft?

Fast, lightweight, and performant, gmft is a great choice for extracting tables from pdfs.

The extraction quality is superb: check out the bulk extract notebook for approximate quality. When testing the same tables across many table extraction options, gmft fares extremely well.

Many Formats

We support the following export options:

- Pandas dataframe
  - By extension: markdown, latex, html, csv, json, etc.
- List of text + positions
- Cropped image of table
- Table caption

Cropped images can be passed into a vision recognizer, like:

- GPT-4 vision
- Mathpix/Adobe/Google/Amazon/Azure/etc.
- Or saved to disk for human evaluation

Lightweight

gmft is very lightweight. It runs on CPU; no GPU is necessary.

High throughput

A benchmark on Colab's CPU indicates ~1.381 s/page; converting to a dataframe takes ~1.168 s/table. This makes gmft about 10x faster on CPU than alternatives like unstructured, nougat, and open-parse/unitable.
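
As a rough back-of-the-envelope illustration of these figures, the sketch below estimates the CPU time for a document. The function name `estimate_seconds` and the example page and table counts are hypothetical; only the two per-unit timings come from the benchmark above. It assumes the per-page figure covers detection and the per-table figure covers dataframe conversion; real timings will vary with hardware and document content.

```python
# Per-unit timings quoted from the Colab CPU benchmark above.
SECONDS_PER_PAGE = 1.381    # assumed: table detection, per page
SECONDS_PER_TABLE = 1.168   # assumed: conversion to a dataframe, per table


def estimate_seconds(num_pages: int, num_tables: int) -> float:
    """Rough estimate: per-page cost plus per-table cost."""
    return num_pages * SECONDS_PER_PAGE + num_tables * SECONDS_PER_TABLE


# Hypothetical workload: a 20-page paper containing 5 tables.
total = estimate_seconds(20, 5)
print(f"~{total:.1f} s")
```

For this hypothetical workload the estimate comes to roughly half a minute, most of it spent on detection.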

- The base model, Smock et al.'s Table Transformer, is very efficient.
- gmft focuses on table extraction, so figures, titles, sections, etc. are not extracted.
- In most cases, OCR is not necessary; pdfs already contain text positional data. Using this existing data drastically speeds up inference. For images or scanned pdfs, bboxes can be exported for further processing.
- PyPDFium2 is chosen for its high throughput and permissive license.
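
To illustrate why existing text positions can stand in for OCR, here is a minimal sketch of selecting the words that fall inside a detected table region. It does not use gmft's API: the `Word` type, the `words_in_bbox` helper, and the sample coordinates are all hypothetical, and gmft's own implementation may differ.

```python
from typing import NamedTuple


class Word(NamedTuple):
    """A word with its bounding box, as a PDF text layer might provide it."""
    text: str
    x0: float
    y0: float
    x1: float
    y1: float


def words_in_bbox(words: list[Word], bbox: tuple[float, float, float, float]) -> list[Word]:
    """Keep the words whose center point lies inside bbox = (x0, y0, x1, y1)."""
    bx0, by0, bx1, by1 = bbox
    selected = []
    for w in words:
        cx = (w.x0 + w.x1) / 2
        cy = (w.y0 + w.y1) / 2
        if bx0 <= cx <= bx1 and by0 <= cy <= by1:
            selected.append(w)
    return selected


# Hypothetical page: a caption word above the table, then two words inside it.
page_words = [
    Word("Table", 50, 80, 90, 92),
    Word("Year", 60, 110, 95, 122),
    Word("2024", 60, 130, 95, 142),
]
table_bbox = (40, 100, 300, 200)  # region a detector might return
print([w.text for w in words_in_bbox(page_words, table_bbox)])
```

Because the text and coordinates are read directly from the PDF, no image recognition step is involved in this selection.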

Few dependencies

gmft does not require any external dependencies (detectron2, poppler, paddleocr, tesseract, etc.).

To install gmft, first install transformers and pytorch with the necessary GPU/CPU options. gmft also relies on pypdfium2.

Dependable

The base model is Microsoft's Table Transformer (TATR), pretrained on PubTables-1M, which works best with scientific papers. TATR handles implicit table structure very well. Current failure modes include OCR issues, merged cells, and false positives. Even so, the text is highly usable, and the alignment of a value to its row/column header remains very accurate because of the underlying procedural algorithm.
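
The idea of aligning each value to its row and column can be sketched as follows. This is a simplified illustration rather than gmft's actual algorithm: it assumes the model has already produced row and column intervals, and the names `find_band` and `build_grid` and all coordinates are hypothetical.

```python
from typing import Optional


def find_band(center: float, bands: list[tuple[float, float]]) -> Optional[int]:
    """Return the index of the (start, end) band containing center, if any."""
    for i, (start, end) in enumerate(bands):
        if start <= center < end:
            return i
    return None


def build_grid(words, row_bands, col_bands):
    """Place each (text, x_center, y_center) word into a rows x cols grid."""
    grid = [["" for _ in col_bands] for _ in row_bands]
    for text, cx, cy in words:
        r = find_band(cy, row_bands)
        c = find_band(cx, col_bands)
        if r is None or c is None:
            continue  # word lies outside every predicted row or column
        # Join multiple words that land in the same cell.
        grid[r][c] = (grid[r][c] + " " + text).strip()
    return grid


# Hypothetical predictions: two rows and two columns, in page coordinates.
row_bands = [(100, 120), (120, 140)]
col_bands = [(50, 150), (150, 250)]
words = [
    ("Year", 80, 110), ("Revenue", 190, 110),
    ("2024", 80, 130), ("1.2", 180, 130), ("M", 200, 130),
]
for row in build_grid(words, row_bands, col_bands):
    print(row)
```

Since each word is placed purely by its coordinates, a value cannot drift to a different header than the one whose column interval contains it, which is the property the paragraph above describes.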

We invite you to explore the comparison notebooks to survey use cases and compare results.

As of gmft v0.3, the library supports multiple-column headers (TATRFormatConfig.enable_multi_header = True), spanning cells (TATRFormatConfig.semantic_spanning_cells = True), and rotated tables.

Why should I not use gmft?

gmft focuses on tables and aims to maximize performance on tables alone. If you need to extract other document features, like figures or tables of contents, you may want a different tool. Consider instead (in no particular order): marker, nougat, open-parse, docling, unstructured, surya, deepdoctection, DocTR. For table detection, img2table is excellent for tables with explicit (solid) cell boundaries.

Current limitations include: false positives (references, indexes, and large columnar text), false negatives, and no OCR support.

Quickstart

See the docs and the config guide for more information. The demo notebook and bulk extract contain more comprehensive code examples.

# new in v0.3: gmft.auto
from gmft.auto import CroppedTable, TableDetector, AutoTableFormatter, AutoTableDetector
from gmft.pdf_bindings import PyPDFium2Document

detector = AutoTableDetector()
formatter = AutoTableFormatter()

def ingest_pdf(pdf_path):
    """Detect tables on every page; returns (list[CroppedTable], doc)."""
    doc = PyPDFium2Document(pdf_path)
    tables = []
    for page in doc:
        tables += detector.extract(page)  # cropped tables found on this page
    return tables, doc

tables, doc = ingest_pdf("path/to/pdf.pdf")
# format each table into a pandas dataframe while the document is still open
dfs = [formatter.extract(table).df() for table in tables]
doc.close() # once you're done with the document

Configuration

See the config guide for discussion on gmft settings.

Development

git clone https://github.com/conjuncts/gmft
cd gmft
pip install -e .
pip install pytest

Run tests:

The tests are in the ./test directory and can be run with pytest.

Build docs:

cd docs
make html

What does gmft stand for?

give me formatted tables!

Acknowledgements

I gratefully acknowledge the support of Vanderbilt Data Science Institute and the Zhongyue Yang Lab at Vanderbilt.
The library builds upon work by:
- Smock, Brandon, Rohith Pesala, and Robin Abraham. "PubTables-1M: Towards comprehensive table extraction from unstructured documents." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022.
- Niels Rogge from huggingface.

License

gmft is released under the MIT license.

PyMuPDF support is available in a separate repository in observance of pymupdf's AGPL 3.0 license.

Release history

- 0.4.3 (this version): Feb 22, 2026
- 0.4.2: Jun 30, 2025
- 0.4.1: Mar 23, 2025
- 0.4.1rc1 (pre-release): Mar 19, 2025
- 0.4.0: Oct 30, 2024
- 0.4.0rc1 (pre-release): Oct 21, 2024
- 0.3.2: Oct 12, 2024
- 0.3.1: Sep 23, 2024
- 0.3.0 (yanked, reason: broken build): Sep 21, 2024
- 0.2.2: Aug 29, 2024
- 0.2.1: Jul 11, 2024
- 0.2.0: Jul 11, 2024
- 0.2.0rc1 (pre-release): Jul 8, 2024
- 0.2.0rc0 (pre-release): Jul 6, 2024
- 0.1.1: Jul 1, 2024
- 0.1.0: Jun 29, 2024
- 0.0.4: Jun 17, 2024
- 0.0.3: Jun 11, 2024
- 0.0.2: Jun 10, 2024
- 0.0.1: Jun 8, 2024

Download files

- Source distribution: gmft-0.4.3.tar.gz (62.5 kB), uploaded Feb 22, 2026
- Built distribution: gmft-0.4.3-py3-none-any.whl (75.3 kB), uploaded Feb 22, 2026, Python 3

File details

Details for the file gmft-0.4.3.tar.gz.

File metadata

Download URL: gmft-0.4.3.tar.gz
Upload date: Feb 22, 2026
Size: 62.5 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: python-requests/2.32.3

File hashes

Hashes for gmft-0.4.3.tar.gz
Algorithm	Hash digest
SHA256	`f5bb9810c0fca16bc4f2c125f5e644407a6de725024904c509ed6f8a7e2fdbf4`
MD5	`d7c4819200cbe740c15019ea8bb7acc8`
BLAKE2b-256	`660fe11d9c24f7291825cd37a727fb41214052bc428380f9c8fa19399cb4d726`

File details

Details for the file gmft-0.4.3-py3-none-any.whl.

File metadata

Download URL: gmft-0.4.3-py3-none-any.whl
Upload date: Feb 22, 2026
Size: 75.3 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: python-requests/2.32.3

File hashes

Hashes for gmft-0.4.3-py3-none-any.whl
Algorithm	Hash digest
SHA256	`483134df3d82857c939172631fce91aee80d73b121c5bef323a512878e0c8f4b`
MD5	`f7ff3b0783ad743d0b5966bffbc8047e`
BLAKE2b-256	`90c1ef65d7a6c585aabe3e2bad11ca0683361285ad7c02faf56673517e8a498c`
