Extract tables from images/PDFs into Markdown for RAG pipelines

RagTable

Extract tables from images/PDFs to Markdown — fast, accurate, RAG-ready

RagTable converts table images into structured Markdown using a segmentation model and per-cell OCR. Built for borderless and complex tables (academic papers, reports, scanned documents), with output that plugs directly into RAG pipelines or LLM contexts.


Features

  • Table structure segmentation via EfficientUNet (~19M params) — detects row, col, col_header, row_header, and span masks
  • Span detection — merged cells are identified and rendered correctly in Markdown
  • Per-cell OCR via PaddleOCR — each cell is cropped and read individually to minimize noise
  • Smart header handling — header rows get a dedicated OCR pass with spatial mapping for better accuracy
  • Automatic post-processing — removes phantom edge columns, empty rows, and footer artifacts
  • Fully offline — no external API calls, suitable for sensitive or air-gapped environments
  • Multi-format input: PNG, JPG, TIFF
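
The post-processing step listed above can be pictured with a small sketch. This is not RagTable's actual implementation: the grid representation (a list of rows, each a list of cell strings) and the function names `drop_empty_rows`, `drop_phantom_edge_columns`, and `clean_grid` are assumptions made for illustration, and footer removal is left out.

```python
def drop_empty_rows(grid):
    """Keep only rows that contain at least one non-blank cell."""
    return [row for row in grid if any(cell.strip() for cell in row)]


def drop_phantom_edge_columns(grid):
    """Trim fully blank columns from the left and right edges only.

    Interior blank columns are kept, since they may be real (empty) columns.
    Assumes every row has the same number of cells.
    """
    if not grid:
        return grid
    width = len(grid[0])

    def column_is_blank(j):
        return all(not row[j].strip() for row in grid)

    left = 0
    while left < width and column_is_blank(left):
        left += 1
    right = width
    while right > left and column_is_blank(right - 1):
        right -= 1
    return [row[left:right] for row in grid]


def clean_grid(grid):
    """Hypothetical helper: drop empty rows first, then phantom edge columns."""
    return drop_phantom_edge_columns(drop_empty_rows(grid))
```

For example, `clean_grid([["", "Item", "Price", ""], ["", "", "", ""], ["", "iPhone 15", "999", ""]])` returns `[["Item", "Price"], ["iPhone 15", "999"]]`.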

Installation

pip install ragtab

Requirements:

  • Python ≥ 3.10
  • PyTorch (install separately if not already present)
  • PaddleOCR (installed automatically)

PaddleOCR requires paddlepaddle — install separately based on your platform:

pip install paddlepaddle

Install from source

git clone https://github.com/tai03102004/rag-table
cd rag-table
pip install -e .
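
Because PyTorch and paddlepaddle are installed separately, a quick check that everything is importable can save a confusing first run. The sketch below uses only the standard library. The helper name `missing_modules` is an assumption for illustration, and the import names (`torch`, `paddle`, `paddleocr`, `ragtab`) differ from some of the pip package names.

```python
from importlib.util import find_spec

# Import names, not pip names: PyTorch -> torch, paddlepaddle -> paddle.
REQUIRED_MODULES = ("torch", "paddle", "paddleocr", "ragtab")


def missing_modules(names=REQUIRED_MODULES):
    """Return the top-level import names that cannot be found in this environment."""
    return [name for name in names if find_spec(name) is None]


missing = missing_modules()
print("Missing: " + ", ".join(missing) if missing else "All dependencies found.")
```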

Quickstart

from ragtab.pipeline import extract_table

markdown, cells = extract_table(
    "table.png",
    model_path="checkpoints/unet_best.pt",
    ocr_engine="paddleocr"
)

print(markdown)

Output:

| Item        | Price | Qty |
| ----------- | ----- | --- |
| iPhone 15   | 999   | 12  |
| Samsung S24 | 899   | 8   |

How It Works

Input image (resized to 384×384)
       │
       ▼
[1] EfficientUNet → 5 segmentation masks
       │
       ▼
[2] Projection analysis → row/column separator positions
       │
       ▼
[3] Span detection → connected components on span mask
       │
       ▼
[4] Grid construction → per-cell bounding boxes
       │
       ▼
[5] OCR — header rows: spatial mapping pass
        — body cells: per-cell crop + PaddleOCR
       │
       ▼
[6] Post-processing → drop phantom columns, empty rows, footers
       │
       ▼
[7] Markdown export

Each stage is independently accessible so you can customize or swap components.
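
To make stages [2] and [4] more concrete, here is a minimal pure-Python sketch of projection analysis followed by grid construction. It is an illustration under stated assumptions, not the code in detection.py: the masks are assumed to be binary 2D lists in which 1 marks pixels inside a row or column region, and the function names and threshold are hypothetical.

```python
def projection_profile(mask, axis):
    """Sum a 2D 0/1 mask: axis=0 gives one total per column, axis=1 one per row."""
    if axis == 0:
        return [sum(column) for column in zip(*mask)]
    return [sum(row) for row in mask]


def runs_above(profile, threshold):
    """Return half-open (start, end) index ranges where profile > threshold."""
    runs, start = [], None
    for i, value in enumerate(profile):
        if value > threshold and start is None:
            start = i
        elif value <= threshold and start is not None:
            runs.append((start, i))
            start = None
    if start is not None:
        runs.append((start, len(profile)))
    return runs


def separators(runs):
    """Place one separator at the midpoint of each gap between consecutive runs."""
    return [(prev_end + next_start) // 2
            for (_, prev_end), (next_start, _) in zip(runs, runs[1:])]


def cell_boxes(row_runs, col_runs):
    """Cross row and column ranges into a grid of (x0, y0, x1, y1) boxes."""
    return [[(x0, y0, x1, y1) for (x0, x1) in col_runs] for (y0, y1) in row_runs]


# Toy 5x7 masks: two columns split at x=3, two rows split at y=2.
col_mask = [[1, 1, 1, 0, 1, 1, 1] for _ in range(5)]
row_mask = [[1] * 7, [1] * 7, [0] * 7, [1] * 7, [1] * 7]

col_runs = runs_above(projection_profile(col_mask, axis=0), threshold=2)
row_runs = runs_above(projection_profile(row_mask, axis=1), threshold=3)
print(separators(col_runs), separators(row_runs))  # [3] [2]
print(cell_boxes(row_runs, col_runs)[0])           # [(0, 0, 3, 2), (4, 0, 7, 2)]
```

On real 384×384 masks the same idea would normally be written with NumPy sums along an axis. Plain lists are used here only to keep the sketch dependency-free.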


Model & Checkpoints

  • Train from scratch using notebooks/02_table-recognition.ipynb
  • Download pretrained checkpoint:

Place the checkpoint at checkpoints/unet_best.pt and pass it via model_path.


Project Structure

RagTable/
├── python/
│   └── ragtab/
│       ├── __init__.py
│       ├── detection.py     # Mask → grid cells
│       ├── model.py         # EfficientUNet definition
│       ├── ocr.py           # PaddleOCR wrapper + text cleaning
│       ├── pipeline.py      # End-to-end extract_table()
│       └── utils.py
├── checkpoints/
├── notebooks/
│   └── 02_table-recognition.ipynb
└── README.md

License

MIT — free to use, including for commercial purposes.


Author

Dinh Duc Tai (dinhductai2004@gmail.com)

If you find this useful, consider giving it a ⭐️ on GitHub!
