
Detect and recognize tables in PDFs and images.

Tabled

Tabled is a small library for detecting and extracting tables. It uses surya to find all the tables in a PDF, identifies the rows/columns, and formats cells into markdown, csv, or html.

Example

Population by year, and change from 2016 to 2060:

| Characteristic | 2016 | 2020 | 2030 | 2040 | 2050 | 2060 | Change (number) | Change (percent) |
|---|---|---|---|---|---|---|---|---|
| Total population | 323.1 | 332.6 | 355.1 | 373.5 | 388.9 | 404.5 | 81.4 | 25.2 |
| Under 18 years | 73.6 | 74.0 | 75.7 | 77.1 | 78.2 | 80.1 | 6.5 | 8.8 |
| 18 to 44 years | 116.0 | 119.2 | 125.0 | 126.4 | 129.6 | 132.7 | 16.7 | 14.4 |
| 45 to 64 years | 84.3 | 83.4 | 81.3 | 89.1 | 95.4 | 97.0 | 12.7 | 15.1 |
| 65 years and over | 49.2 | 56.1 | 73.1 | 80.8 | 85.7 | 94.7 | 45.4 | 92.3 |
| 85 years and over | 6.4 | 6.7 | 9.1 | 14.4 | 18.6 | 19.0 | 12.6 | 198.1 |
| 100 years and over | 0.1 | 0.1 | 0.1 | 0.2 | 0.4 | 0.6 | 0.5 | 618.3 |

Community

Discord is where we discuss future development.

Hosted API

There is a hosted API for tabled available here:

  • Works with PDFs, images, Word documents, and PowerPoint files
  • Consistent speed, with no latency spikes
  • High reliability and uptime

Commercial usage

I want tabled to be as widely accessible as possible, while still funding my development/training costs. Research and personal usage is always okay, but there are some restrictions on commercial usage.

The weights for the models are licensed cc-by-nc-sa-4.0, but I will waive that for any organization under $5M USD in gross revenue in the most recent 12-month period AND under $5M in lifetime VC/angel funding raised. You also must not be competitive with the Datalab API. If you want to remove the GPL license requirements (dual-license) and/or use the weights commercially over the revenue limit, check out the options here.

Installation

You'll need Python 3.10+ and PyTorch. If you're not using a Mac or a GPU machine, you may need to install the CPU version of torch first. See here for more details.

Install with:

pip install tabled-pdf

Post-install:

  • Inspect the settings in tabled/settings.py. You can override any settings with environment variables.
  • Your torch device will be automatically detected, but you can override this. For example, TORCH_DEVICE=cuda.
  • Model weights will automatically download the first time you run tabled.
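The environment-variable overrides described above can also be applied from Python. The following is a minimal sketch, not part of tabled: the helper `apply_overrides` is hypothetical, and it assumes that tabled reads its settings from the environment when it is first imported. Only the `TORCH_DEVICE` name comes from the text above.

```python
import os
from typing import Dict, Mapping, MutableMapping


def apply_overrides(
    overrides: Mapping[str, str],
    environ: MutableMapping[str, str] = os.environ,
) -> Dict[str, str]:
    """Set environment variables without clobbering values already exported.

    Hypothetical helper, not part of tabled. Returns the values now in effect.
    """
    effective = {}
    for name, value in overrides.items():
        # setdefault keeps an existing value and only fills in missing ones.
        effective[name] = environ.setdefault(name, value)
    return effective


if __name__ == "__main__":
    # Assumption: settings are read on import, so call this before
    # `import tabled` for the override to take effect.
    print(apply_overrides({"TORCH_DEVICE": "cuda"}))
```

Using `setdefault` means a value exported in the shell still wins over the value set in code, which is usually what you want when the same script runs on different machines.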

Usage

tabled DATA_PATH
  • DATA_PATH can be an image, pdf, or folder of images/pdfs
  • --format specifies output format for each table (markdown, html, or csv)
  • --save_json saves additional row and column information in a json file
  • --save_debug_images saves images showing the detected rows and columns
  • --skip_detection means that the images you pass in are all cropped tables and don't need any table detection.
  • --detect_cell_boxes detects cells with a detection model. By default, tabled attempts to pull cell information out of the pdf instead; you usually only need this flag for pdfs that have bad embedded text.
  • --save_images saves images of the detected rows/columns and cells.
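If you call the CLI from a script, the flags above can be assembled programmatically. This is a sketch under stated assumptions: `build_tabled_command` is a hypothetical helper, and it assumes `--format` takes its value as a separate argument (`--format csv`). The flag names themselves are the ones listed above.

```python
from typing import List

# Output formats named in the usage section.
VALID_FORMATS = ("markdown", "html", "csv")


def build_tabled_command(
    data_path: str,
    output_format: str = "markdown",
    save_json: bool = False,
    save_debug_images: bool = False,
    skip_detection: bool = False,
    detect_cell_boxes: bool = False,
    save_images: bool = False,
) -> List[str]:
    """Build an argument list for the tabled CLI (hypothetical helper)."""
    if output_format not in VALID_FORMATS:
        raise ValueError(f"format must be one of {VALID_FORMATS}")

    # Assumption: --format accepts its value as the next argument.
    command = ["tabled", data_path, "--format", output_format]

    # Boolean switches are appended only when enabled.
    switches = {
        "--save_json": save_json,
        "--save_debug_images": save_debug_images,
        "--skip_detection": skip_detection,
        "--detect_cell_boxes": detect_cell_boxes,
        "--save_images": save_images,
    }
    command += [flag for flag, enabled in switches.items() if enabled]
    return command
```

The resulting list can be passed to `subprocess.run(command, check=True)`. Building a list rather than a shell string avoids quoting problems with paths that contain spaces.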

The results.json file will contain a json dictionary where the keys are the input filenames without extensions. Each value will be a list of dictionaries, one per page of the input document. Each page dictionary contains:

  • text_lines - the detected text and bounding boxes for each line
    • text - the text in the line
    • confidence - the confidence of the model in the detected text (0-1)
    • polygon - the polygon for the text line in (x1, y1), (x2, y2), (x3, y3), (x4, y4) format. The points are in clockwise order from the top left.
    • bbox - the axis-aligned rectangle for the text line in (x1, y1, x2, y2) format. (x1, y1) is the top left corner, and (x2, y2) is the bottom right corner.
  • languages - the languages specified for the page
  • page - the page number in the file
  • image_bbox - the bbox for the image in (x1, y1, x2, y2) format. (x1, y1) is the top left corner, and (x2, y2) is the bottom right corner. All line bboxes will be contained within this bbox.
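The structure described above can be walked with the standard library. The following sketch assumes only the keys listed above (`text_lines`, `confidence`, `polygon`, `page`); the function names and the location of `results.json` are hypothetical.

```python
import json
from pathlib import Path
from typing import Any, Dict, List, Sequence


def load_results(path: str) -> Dict[str, List[Dict[str, Any]]]:
    """Read a results.json file into a dictionary keyed by input filename."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def polygon_to_bbox(polygon: Sequence[Sequence[float]]) -> List[float]:
    """Return the axis-aligned (x1, y1, x2, y2) box enclosing a 4-point polygon."""
    xs = [point[0] for point in polygon]
    ys = [point[1] for point in polygon]
    return [min(xs), min(ys), max(xs), max(ys)]


def summarize_results(results: Dict[str, List[Dict[str, Any]]]) -> List[Dict[str, Any]]:
    """Produce one summary row per page: line count and mean confidence."""
    summary = []
    for name, pages in results.items():
        for page in pages:
            lines = page.get("text_lines", [])
            confidences = [line["confidence"] for line in lines]
            summary.append({
                "file": name,
                "page": page.get("page"),
                "num_lines": len(lines),
                # None when the page has no detected lines.
                "mean_confidence": (
                    sum(confidences) / len(confidences) if confidences else None
                ),
            })
    return summary
```

A typical use would be `summarize_results(load_results("results.json"))`, for example to flag pages whose mean confidence is low before trusting the extracted text.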

Interactive App

I've included a streamlit app that lets you interactively try tabled on images or PDF files. Run it with:

pip install streamlit
tabled_gui

From python

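from tabled.extract import extract_tables
from tabled.fileinput import load_pdfs_images
from tabled.inference.models import load_detection_models, load_recognition_models

# Placeholder input path; replace with your own file
IN_PATH = "path/to/input.pdf"

# Load the models once and reuse them across calls
det_models, rec_models = load_detection_models(), load_recognition_models()
# Load the page images, file names, and text lines for the input
images, highres_images, names, text_lines = load_pdfs_images(IN_PATH)

# Extract tables from the loaded pages
page_results = extract_tables(images, highres_images, text_lines, det_models, rec_models)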

Benchmarks

| Avg score | Time per table (s) | Total tables |
|---|---|---|
| 0.847 | 0.029 | 688 |

Quality

Getting good ground truth data for tables is hard, since you're either constrained to simple layouts that can be heuristically parsed and rendered, or you need to use LLMs, which make mistakes. I chose to use GPT-4 table predictions as a pseudo-ground-truth.

Tabled gets a 0.847 alignment score when compared to GPT-4. The score measures how well the text in table rows/cells aligns between the two outputs. Some of the misalignments are due to GPT-4 mistakes, or to small inconsistencies in what GPT-4 considered the borders of the table. In general, extraction quality is quite high.

Performance

Running on an A10G with 10GB of VRAM usage and batch size 64, tabled takes 0.029 seconds per table.
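To put the per-table figure in context, the arithmetic below converts it to throughput and to an estimated inference time for the 688-table benchmark set. This is a rough sketch: it assumes the total is simply the per-table time multiplied by the table count, and it ignores model loading and file I/O. The function names are illustrative.

```python
# Figures reported in the benchmark table.
TIME_PER_TABLE_S = 0.029
TOTAL_TABLES = 688


def tables_per_second(time_per_table_s: float) -> float:
    """Throughput implied by a per-table time."""
    return 1.0 / time_per_table_s


def estimated_total_seconds(num_tables: int, time_per_table_s: float) -> float:
    """Estimated inference time, excluding model loading and I/O."""
    return num_tables * time_per_table_s


if __name__ == "__main__":
    print(f"{tables_per_second(TIME_PER_TABLE_S):.1f} tables/s")
    print(f"{estimated_total_seconds(TOTAL_TABLES, TIME_PER_TABLE_S):.1f} s total")
```

That works out to roughly 34 tables per second, or about 20 seconds of inference for the full benchmark set on the hardware described above.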

Running the benchmark

Run the benchmark with:

python benchmarks/benchmark.py out.json

Acknowledgements

  • Thank you to Peter Jansen for the benchmarking dataset, and for discussion about table parsing.
  • Hugging Face for inference code and model hosting
  • PyTorch for training/inference

