PDF table extraction for RAG and LLM pipelines: converts PDF tables to clean HTML. Fast, local, no GPU. Handles merged cells and line-wrapped text without running the contents of separate cells together.

ragtable-extract

PDF parsing for RAG: extract tables precisely and convert them to HTML. Fast, local, no GPU.

A lightweight Python library that extracts tables from PDFs and converts them to clean HTML, designed for RAG pipelines and LLM retrieval. Runs entirely on CPU with no external APIs or GPU dependencies.

Features

Precise extraction — Character-level coordinate extraction for accurate cell boundaries
≥/≤ symbols — Correctly repositioned by coordinates (avoids pdfplumber's line-end placement bug)
Merged cells — Proper rowspan / colspan output for multi-column tables
Line-wrapped text — Auto-segments and concatenates text across line breaks within cells (no symbol/text serialization)
Fangzheng font — Handles full-width character ordering and decimal point encoding issues
Adaptive config — Per-page tuning based on character metrics
Fast & local — Pure Python, pdfplumber-based, no GPU required
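
To make the rowspan / colspan output concrete, here is a minimal sketch of rendering a grid with merged cells to HTML. The `Cell` type and `render_table` function are hypothetical names invented for this illustration; they are not part of the ragtable-extract API, and the library's actual markup may differ.

```python
from html import escape
from typing import List, NamedTuple


class Cell(NamedTuple):
    """Hypothetical cell model: text plus how many rows/columns it spans."""
    text: str
    rowspan: int = 1
    colspan: int = 1


def render_table(rows: List[List[Cell]]) -> str:
    """Render rows of cells to a <table>, emitting span attributes only when > 1."""
    parts = ["<table>"]
    for row in rows:
        parts.append("<tr>")
        for cell in row:
            attrs = ""
            if cell.rowspan > 1:
                attrs += f' rowspan="{cell.rowspan}"'
            if cell.colspan > 1:
                attrs += f' colspan="{cell.colspan}"'
            # Escape cell text so characters like < and & stay valid HTML.
            parts.append(f"<td{attrs}>{escape(cell.text)}</td>")
        parts.append("</tr>")
    parts.append("</table>")
    return "".join(parts)


# A header cell spanning two rows next to one spanning two columns.
rows = [
    [Cell("Region", rowspan=2), Cell("Population", colspan=2)],
    [Cell("Urban"), Cell("Rural")],
]
html_out = render_table(rows)
```

The second row lists only two cells because the "Region" cell from the first row still occupies its first column.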

Requirements

Python 3.8+
pdfplumber >= 0.10.0

Installation

pip install ragtable-extract

Or from source:

git clone https://github.com/ZhuJiaxin2/ragtable-extract.git
cd ragtable-extract
pip install -e .

Quick Start

import ragtable_extract

# Convert PDF tables to HTML file
ragtable_extract.convert(
    input_path="document.pdf",
    output_path="tables.html",
)

# Or extract as structured data
tables = ragtable_extract.extract(input_path="document.pdf")
for t in tables:
    print(f"Page {t['page']}: {t['html'][:80]}...")
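
For a RAG pipeline, the list returned by extract() can be turned into retrieval documents. The sketch below assumes only the `page` and `html` keys listed in the API table; the `to_chunks` helper, the id scheme, and the metadata layout are hypothetical choices for illustration. Stand-in dicts are used so the example runs without a PDF.

```python
from typing import Dict, List


def to_chunks(tables: List[Dict], source: str) -> List[Dict]:
    """Wrap each extracted table in a chunk with an id and page metadata."""
    chunks = []
    for index, table in enumerate(tables):
        chunks.append({
            # Hypothetical id scheme: source file, page number, table index.
            "id": f"{source}#p{table['page']}-t{index}",
            "text": table["html"],
            "metadata": {"source": source, "page": table["page"]},
        })
    return chunks


# Stand-in for the output of ragtable_extract.extract(input_path="document.pdf").
tables = [
    {"page": 1, "html": "<table><tr><td>A</td></tr></table>"},
    {"page": 3, "html": "<table><tr><td>B</td></tr></table>"},
]
chunks = to_chunks(tables, source="document.pdf")
```

Keeping each table as one chunk preserves its row and column structure, at the cost of larger chunks for long tables.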

CLI

python -m ragtable_extract document.pdf output.html

Web Quick Test (app.py)

Run the Flask web app to upload PDFs and preview extraction results in the browser:

pip install flask
python app.py

Then open http://localhost:1965 to upload a PDF and view extracted tables.

Test Results

Run python test.py to generate extraction results. Output files:

Source PDF	Extraction Result
test/example/zhejiang.pdf	test/result/test_adaptive_zhejiang.html
test/example/changsha.pdf	test/result/test_adaptive_changsha.html
test/example/shaanxi.pdf	test/result/test_adaptive_shaanxi.html
test/example/tongbao.pdf	test/result/test_adaptive_tongbao.html

API

Function	Description
`convert(input_path, output_path, pages?, config?, use_adaptive_config=True)`	Convert PDF tables to HTML file
`extract(input_path, pages?, config?, use_adaptive_config=True)`	Extract tables as list of dicts with `page`, `html`, `bbox`, `raw`
`build_full_html(pdf_filename, tables)`	Build full HTML document from extracted tables
`Config`	Dataclass for tuning extraction (multiline thresholds, font tolerance, etc.)

Configuration

import ragtable_extract

# Custom config
config = ragtable_extract.Config(
    multiline_cell_top_range=25,
    multiline_y_tolerance=4,
)
tables = ragtable_extract.extract("doc.pdf", config=config)

# Adaptive config (default) — infers parameters from page character metrics
tables = ragtable_extract.extract("doc.pdf")  # use_adaptive_config=True by default
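
How the adaptive config derives its parameters is not described here. As a rough illustration of the idea of inferring a threshold from character metrics, the sketch below scales a vertical tolerance with the median character height. The `infer_y_tolerance` function, the 0.3 ratio, and the fallback value are assumptions made for this example, not the library's logic. The dicts use `top` and `bottom` keys, as pdfplumber does for entries in `page.chars`.

```python
from statistics import median
from typing import Dict, List


def infer_y_tolerance(chars: List[Dict], ratio: float = 0.3, fallback: float = 3.0) -> float:
    """Return a y tolerance proportional to the median character height."""
    heights = [c["bottom"] - c["top"] for c in chars if c["bottom"] > c["top"]]
    if not heights:
        # No usable characters on the page: fall back to a fixed value.
        return fallback
    return median(heights) * ratio


# Stand-in character boxes; real ones would come from a pdfplumber page.
page_chars = [
    {"top": 100.0, "bottom": 110.0},
    {"top": 100.0, "bottom": 112.0},
    {"top": 120.0, "bottom": 130.0},
]
tolerance = infer_y_tolerance(page_chars)
```

A value computed this way could then be passed to Config, for example as multiline_y_tolerance, instead of a hard-coded number.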

Project Structure

ragtable-extract/
├── ragtable_extract/     # Core library
│   ├── __init__.py       # convert(), extract()
│   ├── _core.py          # Table extraction logic
│   ├── _config.py        # Config & adaptive metrics
│   ├── _font.py          # Special font handling
│   └── _html.py          # HTML template
├── pyproject.toml
├── demo.py               # CLI demo
└── app.py                # Optional Flask web API

How It Works

PDF → pdfplumber.find_tables()
  → Filter chars by bbox, cluster by top (y)
  → Reorder ≥/≤ symbols, fix Fangzheng font
  → Output <table> HTML
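
The middle steps of this pipeline can be sketched on plain character dicts. This is an illustrative approximation, not the library's code: `chars_in_bbox` and `cluster_by_top` are hypothetical helpers, and the tolerance of 3.0 is an arbitrary choice. The dict keys (`text`, `x0`, `x1`, `top`, `bottom`) and the bbox order (x0, top, x1, bottom) follow pdfplumber's conventions, so real input would be `page.chars` and the `bbox` of a table from `page.find_tables()`.

```python
from typing import Dict, List, Tuple

BBox = Tuple[float, float, float, float]  # (x0, top, x1, bottom)


def chars_in_bbox(chars: List[Dict], bbox: BBox) -> List[Dict]:
    """Keep characters whose center point lies inside the bounding box."""
    x0, top, x1, bottom = bbox
    kept = []
    for c in chars:
        cx = (c["x0"] + c["x1"]) / 2
        cy = (c["top"] + c["bottom"]) / 2
        if x0 <= cx <= x1 and top <= cy <= bottom:
            kept.append(c)
    return kept


def cluster_by_top(chars: List[Dict], y_tolerance: float = 3.0) -> List[str]:
    """Group characters into text lines by `top`, then order each line by x0."""
    lines: List[List[Dict]] = []
    for c in sorted(chars, key=lambda ch: ch["top"]):
        # Compare against the first character of the current line.
        if lines and abs(c["top"] - lines[-1][0]["top"]) <= y_tolerance:
            lines[-1].append(c)
        else:
            lines.append([c])
    # Sorting by x0 places a symbol such as "≥" where it sits on the page,
    # regardless of where it appeared in the character stream.
    return ["".join(ch["text"] for ch in sorted(line, key=lambda ch: ch["x0"]))
            for line in lines]


# "≥" arrives after "5" in the stream but sits to its left on the page.
chars = [
    {"text": "5", "x0": 20, "x1": 25, "top": 100, "bottom": 110},
    {"text": "≥", "x0": 10, "x1": 18, "top": 101, "bottom": 111},
    {"text": "k", "x0": 10, "x1": 15, "top": 115, "bottom": 125},
    {"text": "g", "x0": 16, "x1": 21, "top": 115, "bottom": 125},
    {"text": "Z", "x0": 300, "x1": 305, "top": 100, "bottom": 110},  # outside the cell
]
cell_lines = cluster_by_top(chars_in_bbox(chars, (0, 90, 50, 130)))
```

Joining the resulting lines would give the text of one cell; how wrapped lines are concatenated is a separate decision that this sketch leaves out.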

Comparison Report

We compare ragtable-extract with opendataloader-pdf on real government PDF tables. Our extraction:

Multi-column tables — Correctly recognizes complex layouts with merged cells
Line-wrapped text — Automatically segments and concatenates text across line breaks within cells
No serialization — Symbols and text stay in correct cells (e.g. no １２ or 万人％ wrongly merged)

Run python test_comparison.py to generate the report, then open comparison_report.html for side-by-side results.

License

MIT

Release history

0.1.1 (this version): Feb 26, 2026
0.1.0: Feb 26, 2026

Download files

Source Distribution

ragtable_extract-0.1.1.tar.gz (13.1 kB)

Uploaded Feb 26, 2026 Source

Built Distribution

ragtable_extract-0.1.1-py3-none-any.whl (12.7 kB)

Uploaded Feb 26, 2026 Python 3

File details

Details for the file ragtable_extract-0.1.1.tar.gz.

File metadata

Download URL: ragtable_extract-0.1.1.tar.gz
Upload date: Feb 26, 2026
Size: 13.1 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.10.19

File hashes

Hashes for ragtable_extract-0.1.1.tar.gz
Algorithm	Hash digest
SHA256	`b4244f4485c17b58a01b5e23cc1aea197c780fd94900c19c8a0a50ca2355740f`
MD5	`dfbdb95e10071d294ca05febe1582ff3`
BLAKE2b-256	`7dcb3ba8a2143ff68abc054e76e017ed48aac10bee46cbdfbc1b86190893641a`

File details

Details for the file ragtable_extract-0.1.1-py3-none-any.whl.

File metadata

Download URL: ragtable_extract-0.1.1-py3-none-any.whl
Upload date: Feb 26, 2026
Size: 12.7 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.10.19

File hashes

Hashes for ragtable_extract-0.1.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`a687d6eceaf8a0539fd5caab9dc9b1b230126097ded9082817e22a3dcc6a6a07`
MD5	`d0d17e5231613cc575b25993353e5aba`
BLAKE2b-256	`55b0356e7c43d343bda8c9f50dbfc15260caaf4e825f58674648362787698b73`
