Lightweight, performant, configurable, deep table extraction
gmft
give me the formatted tables!
There are many pdfs out there, and many of those pdfs have tables. But despite a plethora of table extraction options, there is still no consensus on a definitive extraction method.
About gmft
gmft is a high-throughput, configurable, and out-of-the-box toolkit for converting pdf tables to many formats, including cropped image, text + positions, plaintext, csv, and pandas dataframes.
gmft aims to work "out-of-the-box", offering strong performance with the default settings.
gmft relies on Microsoft's Table Transformer (TATR), which was qualitatively the most reliable of the alternatives tested.
Why use gmft?
Reliable
gmft uses Microsoft's TATR, which is trained on a diverse dataset, PubTables-1M. Of all the alternatives tested, TATR was qualitatively the most reliable by far.
Many Formats
gmft supports the following export options:
- Pandas dataframe (!)
- By extension: csv, html, json, etc.
- List of text + positions
- Cropped image of table
Cropped images are useful for directly feeding into a vision recognizer, like:
- GPT4 vision
- Mathpix/Adobe/Google/Amazon/Azure/etc.
Cropped images are also excellent for verifying correctness of output.
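To illustrate the "by extension" export option listed above: once a table is available as a pandas dataframe, the other formats follow from pandas' own writers. The sketch below uses a small hand-built dataframe as a stand-in for an extracted table (the column names and values are made up for illustration); it does not call gmft itself.

```python
import pandas as pd

# Hypothetical stand-in for a dataframe produced by table extraction.
df = pd.DataFrame({"Item": ["Widget", "Gadget"], "Price": [9.99, 24.5]})

# Each export is a standard pandas writer; with no path given, a string is returned.
csv_text = df.to_csv(index=False)
html_text = df.to_html(index=False)
json_text = df.to_json(orient="records")

print(csv_text)
```

Passing a file path as the first argument to any of these writers saves the result to disk instead of returning a string.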
High throughput
In most cases, OCR is not necessary; pdfs already contain text positional data. Using this existing data drastically speeds up inference. With that being said, gmft can still extract tables from images and scanned pdfs through the image output.
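As a rough illustration of why existing positional data makes OCR unnecessary, the sketch below takes words that already carry bounding boxes, keeps those inside a detected table region, and groups them into rows. This is a simplified stand-in, not gmft's actual method (gmft uses TATR for table structure); the `Word` type, the helper names, the tolerance, and the sample coordinates are all assumptions, with the origin taken to be the top-left corner and y increasing downward.

```python
from typing import List, NamedTuple, Tuple


class Word(NamedTuple):
    """A word with its bounding box (hypothetical type for this sketch)."""
    text: str
    x0: float
    y0: float
    x1: float
    y1: float


def words_in_box(words: List[Word], box: Tuple[float, float, float, float]) -> List[Word]:
    """Keep words whose center point lies inside box = (x0, y0, x1, y1)."""
    bx0, by0, bx1, by1 = box
    kept = []
    for w in words:
        cx = (w.x0 + w.x1) / 2
        cy = (w.y0 + w.y1) / 2
        if bx0 <= cx <= bx1 and by0 <= cy <= by1:
            kept.append(w)
    return kept


def group_into_rows(words: List[Word], tolerance: float = 3.0) -> List[List[str]]:
    """Group words whose top edges are within `tolerance` of the row's first word."""
    rows: List[List[Word]] = []
    for w in sorted(words, key=lambda w: (w.y0, w.x0)):
        if rows and abs(rows[-1][0].y0 - w.y0) <= tolerance:
            rows[-1].append(w)
        else:
            rows.append([w])
    # Order each row left to right and keep only the text.
    return [[w.text for w in sorted(row, key=lambda w: w.x0)] for row in rows]


# Made-up sample data: one heading word above a two-row table.
page_words = [
    Word("Report", 50, 20, 100, 30),
    Word("Item", 50, 100, 80, 110),
    Word("Price", 150, 100, 185, 110),
    Word("Widget", 50, 120, 95, 130),
    Word("9.99", 150, 121, 180, 131),
]
table_box = (40, 90, 200, 140)

rows = group_into_rows(words_in_box(page_words, table_box))
print(rows)
```

No pixels are rendered or recognized at any point, which is where the speed-up over OCR comes from.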
gmft focuses on table extraction, so figures, titles, sections, etc. are not extracted.
PyPDFium2 is chosen for its high throughput and permissive license.
Few dependencies
Many pdf extractors require detectron2, poppler, paddleocr, tesseract, etc., many of which need a separate external installation. Detectron2 requires Mac or Linux/WSL. OCR models may require tesseract or paddleocr.
gmft aims for a convenient installation process. First, install transformers. Then, run
pip install gmft
gmft mostly relies on pypdfium2 and transformers. On the first run, gmft downloads Microsoft's TATR from huggingface, which requires ~270 MB in total and is saved to ~/.cache/huggingface/hub/models--microsoft--table-transformer-{detection, structure-recognition} and ~/.cache/huggingface/hub/models--timm--resnet18.a1_in1k.
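To check whether the models have already been downloaded, the cache folder names can be derived from the huggingface repository ids. The sketch below assumes the default cache location given above (it can differ if the huggingface cache has been relocated), and the repository ids are inferred from the folder names in the text; `hub_cache_dir` is a hypothetical helper, not part of gmft or huggingface.

```python
import os
from pathlib import Path

# Repository ids inferred from the cache folder names mentioned above.
REPO_IDS = [
    "microsoft/table-transformer-detection",
    "microsoft/table-transformer-structure-recognition",
    "timm/resnet18.a1_in1k",
]


def hub_cache_dir(repo_id: str, hub_root: Path) -> Path:
    """Map 'org/name' to '<hub_root>/models--org--name' (hypothetical helper)."""
    return hub_root / ("models--" + repo_id.replace("/", "--"))


# Assumed default cache root; adjust if your cache lives elsewhere.
hub_root = Path(os.path.expanduser("~/.cache/huggingface/hub"))

for repo_id in REPO_IDS:
    path = hub_cache_dir(repo_id, hub_root)
    status = "cached" if path.is_dir() else "not downloaded yet"
    print(f"{repo_id}: {status} ({path})")
```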
No GPU necessary
Because of the relatively few dependencies and high throughput, gmft is very lightweight. This allows gmft to run on CPU.
Modular
As models are loaded from huggingface hub, you can homebrew a model and use it simply by specifying your own huggingface model.
By subclassing the BasePDFDocument and BasePage classes, you are also able to support other PDF extraction methods (like PyMuPDF, PyPDF, pdfplumber etc.) if you so desire.
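The subclassing approach can be pictured as an adapter pattern: a small abstract interface for pages and documents, with one concrete implementation per PDF backend. The actual method signatures of BasePDFDocument and BasePage are not reproduced here, so `SketchPage`, `SketchDocument`, and their methods below are hypothetical stand-ins rather than gmft's API, and the in-memory backend only takes the place of a real library such as PyMuPDF or pdfplumber.

```python
from abc import ABC, abstractmethod
from typing import Iterator, List, Tuple

# (x0, y0, x1, y1, text) for one word; an assumed shape for this sketch.
WordBox = Tuple[float, float, float, float, str]


class SketchPage(ABC):
    """Hypothetical page interface (NOT gmft's BasePage)."""

    @abstractmethod
    def words(self) -> Iterator[WordBox]:
        """Yield one WordBox per word on the page."""


class SketchDocument(ABC):
    """Hypothetical document interface (NOT gmft's BasePDFDocument)."""

    @abstractmethod
    def __len__(self) -> int:
        """Return the number of pages."""

    @abstractmethod
    def get_page(self, n: int) -> SketchPage:
        """Return the n-th page (0-indexed)."""


class ListPage(SketchPage):
    """Toy backend: a page backed by an in-memory list of word boxes."""

    def __init__(self, word_boxes: List[WordBox]):
        self._word_boxes = list(word_boxes)

    def words(self) -> Iterator[WordBox]:
        yield from self._word_boxes


class ListDocument(SketchDocument):
    """Toy backend: a document backed by a list of pages."""

    def __init__(self, pages: List[List[WordBox]]):
        self._pages = [ListPage(p) for p in pages]

    def __len__(self) -> int:
        return len(self._pages)

    def get_page(self, n: int) -> SketchPage:
        return self._pages[n]


def page_text(page: SketchPage) -> str:
    """Code written against the interface works with any backend."""
    return " ".join(text for *_, text in page.words())


doc = ListDocument([[(0, 0, 10, 5, "Item"), (20, 0, 35, 5, "Price")]])
first_page_text = page_text(doc.get_page(0))
print(first_page_text)
```

Swapping in another extraction library would then mean writing one more pair of subclasses, while downstream code such as `page_text` stays unchanged.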
Acknowledgements
A tremendous thank you to the TATR authors: Brandon Smock, Rohith Pesala, and Robin Abraham.
Thank you to Niels Rogge for porting TATR to huggingface and writing the visualization code.
gmft is released under MIT.