Extract tables from images/PDFs into Markdown for RAG pipelines
RagTable
Extract tables from images/PDFs to Markdown — fast, accurate, RAG-ready
RagTable converts table images into structured Markdown using a segmentation model and per-cell OCR. Built for borderless and complex tables (academic papers, reports, scanned documents), with output that plugs directly into RAG pipelines or LLM contexts.
Features
- Table structure segmentation via EfficientUNet (~19M params): detects row, col, col_header, row_header, and span masks
- Span detection: merged cells are identified and rendered correctly in Markdown
- Per-cell OCR via PaddleOCR — each cell is cropped and read individually to minimize noise
- Smart header handling — header rows get a dedicated OCR pass with spatial mapping for better accuracy
- Automatic post-processing — removes phantom edge columns, empty rows, and footer artifacts
- Fully offline — no external API calls, suitable for sensitive or air-gapped environments
- Multi-format input: PNG, JPG, TIFF
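The post-processing and Markdown rendering described in the list above can be illustrated with a minimal sketch. This is not RagTable's implementation: the function names (`drop_empty_rows`, `drop_phantom_edge_columns`, `to_markdown`) are hypothetical, and the sketch assumes OCR has already produced a rectangular grid of strings. Merged cells and footer detection are not covered.

```python
from typing import List

Grid = List[List[str]]  # rectangular grid of OCR'd cell texts (assumed shape)


def drop_empty_rows(grid: Grid) -> Grid:
    """Remove rows in which every cell is blank."""
    return [row for row in grid if any(cell.strip() for cell in row)]


def drop_phantom_edge_columns(grid: Grid) -> Grid:
    """Trim leading and trailing columns that are blank in every row."""
    if not grid:
        return grid
    left, right = 0, len(grid[0])
    while left < right and all(not row[left].strip() for row in grid):
        left += 1
    while right > left and all(not row[right - 1].strip() for row in grid):
        right -= 1
    return [row[left:right] for row in grid]


def to_markdown(grid: Grid) -> str:
    """Render the grid as a pipe table, treating the first row as the header."""
    if not grid or not grid[0]:
        return ""
    # Pad each column to its widest cell (at least 3 for the '---' rule).
    widths = [max(3, *(len(row[c]) for row in grid)) for c in range(len(grid[0]))]

    def fmt(row: List[str]) -> str:
        return "| " + " | ".join(cell.ljust(w) for cell, w in zip(row, widths)) + " |"

    rule = "| " + " | ".join("-" * w for w in widths) + " |"
    return "\n".join([fmt(grid[0]), rule] + [fmt(row) for row in grid[1:]])


raw = [
    ["", "Item", "Price", "Qty", ""],
    ["", "iPhone 15", "999", "12", ""],
    ["", "", "", "", ""],
    ["", "Samsung S24", "899", "8", ""],
]
clean = drop_phantom_edge_columns(drop_empty_rows(raw))
print(to_markdown(clean))
```

Trimming only the outer columns is a deliberate choice in this sketch: an interior column that happens to be empty may still carry meaning in the source table.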
Installation
pip install ragtab
Requirements:
- Python ≥ 3.10
- PyTorch (install separately if not already present)
- PaddleOCR (installed automatically)
PaddleOCR requires paddlepaddle — install separately based on your platform:
pip install paddlepaddle
Install from source
git clone https://github.com/tai03102004/rag-table
cd rag-table
pip install -e .
Quickstart
from ragtab.pipeline import extract_table
markdown, cells = extract_table(
    "table.png",
    model_path="checkpoints/unet_best.pt",
    ocr_engine="paddleocr"
)
print(markdown)
Output:
| Item | Price | Qty |
| ----------- | ----- | --- |
| iPhone 15 | 999 | 12 |
| Samsung S24 | 899 | 8 |
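To feed this output into a RAG pipeline, one option is to turn each body row into a small self-describing text chunk. The sketch below works only on the Markdown string, since the structure of `cells` is not documented here. The helper names `markdown_table_to_records` and `records_to_chunks` are hypothetical and not part of ragtab, and the parser assumes a simple pipe table without escaped `|` characters.

```python
from typing import Dict, List


def markdown_table_to_records(markdown: str) -> List[Dict[str, str]]:
    """Parse a simple pipe table (header, separator, body) into row dicts."""
    rows = []
    for line in markdown.strip().splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # skip anything that is not a table row
        rows.append([cell.strip() for cell in line.strip("|").split("|")])
    if len(rows) < 2:
        return []
    header, body = rows[0], rows[2:]  # rows[1] is the |---| separator line
    return [dict(zip(header, row)) for row in body]


def records_to_chunks(records: List[Dict[str, str]]) -> List[str]:
    """Turn each row into a 'column: value' string suitable for embedding."""
    return ["; ".join(f"{key}: {value}" for key, value in rec.items()) for rec in records]


sample = """
| Item        | Price | Qty |
| ----------- | ----- | --- |
| iPhone 15   | 999   | 12  |
| Samsung S24 | 899   | 8   |
"""
records = markdown_table_to_records(sample)
chunks = records_to_chunks(records)
for chunk in chunks:
    print(chunk)
```

Repeating the column names in every chunk keeps each row interpretable when it is retrieved in isolation.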
How It Works
Input image (resized to 384×384)
│
▼
[1] EfficientUNet → 5 segmentation masks
│
▼
[2] Projection analysis → row/column separator positions
│
▼
[3] Span detection → connected components on span mask
│
▼
[4] Grid construction → per-cell bounding boxes
│
▼
[5] OCR
    header rows: spatial mapping pass
    body cells: per-cell crop + PaddleOCR
│
▼
[6] Post-processing → drop phantom columns, empty rows, footers
│
▼
[7] Markdown export
Each stage is independently accessible so you can customize or swap components.
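Stages [2] and [4] can be sketched as follows. This is an illustration under stated assumptions, not the code in detection.py: it assumes the row and column masks mark separator pixels as 1, uses plain nested lists instead of arrays to stay dependency-free, and the names `project`, `separator_positions`, and `build_grid` are hypothetical.

```python
from typing import List, Tuple

Mask = List[List[int]]          # binary mask, mask[y][x] in {0, 1}
Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1)


def project(mask: Mask, axis: str) -> List[float]:
    """Fraction of set pixels per image row ('rows') or per column ('cols')."""
    height, width = len(mask), len(mask[0])
    if axis == "rows":
        return [sum(row) / width for row in mask]
    return [sum(mask[y][x] for y in range(height)) / height for x in range(width)]


def separator_positions(profile: List[float], threshold: float) -> List[int]:
    """Return the centre index of every run where profile >= threshold."""
    centres, start = [], None
    for i, value in enumerate(profile):
        if value >= threshold and start is None:
            start = i                      # a run begins
        elif value < threshold and start is not None:
            centres.append((start + i - 1) // 2)  # run covered start..i-1
            start = None
    if start is not None:                  # run touching the far edge
        centres.append((start + len(profile) - 1) // 2)
    return centres


def build_grid(row_seps: List[int], col_seps: List[int],
               height: int, width: int) -> List[List[Box]]:
    """Cut the image into cell boxes using separator positions plus the borders."""
    ys = [0] + row_seps + [height]
    xs = [0] + col_seps + [width]
    return [
        [(xs[c], ys[r], xs[c + 1], ys[r + 1]) for c in range(len(xs) - 1)]
        for r in range(len(ys) - 1)
    ]


# Toy 6x8 example: one horizontal separator (rows 2-3), one vertical (column 4).
row_mask = [[1] * 8 if y in (2, 3) else [0] * 8 for y in range(6)]
col_mask = [[1 if x == 4 else 0 for x in range(8)] for _ in range(6)]
row_seps = separator_positions(project(row_mask, "rows"), threshold=0.5)
col_seps = separator_positions(project(col_mask, "cols"), threshold=0.5)
grid = build_grid(row_seps, col_seps, height=6, width=8)
```

With real model output, the threshold would need tuning, and boxes would have to be scaled from the 384×384 mask back to the original image size before cropping cells for OCR.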
Model & Checkpoints
- Train from scratch using notebooks/02_table-recognition.ipynb
- Download pretrained checkpoint:
Place the checkpoint at checkpoints/unet_best.pt and pass it via model_path.
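A missing checkpoint is easier to diagnose if it is checked before the pipeline starts. The helper below is a hypothetical convenience, not part of ragtab. It only verifies that the file exists at the expected location.

```python
from pathlib import Path


def resolve_checkpoint(path: str = "checkpoints/unet_best.pt") -> Path:
    """Return the checkpoint path, or raise a clear error if it is missing."""
    checkpoint = Path(path).expanduser()
    if not checkpoint.is_file():
        raise FileNotFoundError(
            f"Checkpoint not found at {checkpoint}; train one with the notebook "
            "or download the pretrained file and place it there."
        )
    return checkpoint
```

The returned path can then be passed on as `model_path=str(resolve_checkpoint())` when calling `extract_table`.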
Project Structure
RagTable/
├── python/
│   └── ragtab/
│       ├── __init__.py
│       ├── detection.py   # Mask → grid cells
│       ├── model.py       # EfficientUNet definition
│       ├── ocr.py         # PaddleOCR wrapper + text cleaning
│       ├── pipeline.py    # End-to-end extract_table()
│       └── utils.py
├── checkpoints/
├── notebooks/
│   └── 02_table-recognition.ipynb
└── README.md
License
MIT — free to use, including for commercial purposes.
Author
Dinh Duc Tai — dinhductai2004@gmail.com
If you find this useful, consider giving it a ⭐️ on GitHub!