PDF Parsing for RAG — extracting tables precisely. Convert to HTML. Fast, local, no GPU.

ragtable-extract

Python 3.8+ License: MIT

A lightweight Python library that extracts tables from PDFs and converts them to clean HTML, designed for RAG pipelines and LLM retrieval. Runs entirely on CPU with no external APIs or GPU dependencies.

Features

  • Precise extraction — Character-level coordinate extraction for accurate cell boundaries
  • ≥/≤ symbols — Repositioned by character coordinates so they appear where they sit on the page (works around pdfplumber placing them at the end of the line)
  • Merged cells — Proper rowspan / colspan output for multi-column tables
  • Line-wrapped text — Segments text that wraps across lines within a cell and joins it back in reading order, so symbols and text are not serialized into the wrong cell
  • Fangzheng (方正, Founder) fonts — Corrects full-width character ordering and decimal-point encoding issues
  • Adaptive config — Per-page tuning based on character metrics
  • Fast & local — Pure Python, pdfplumber-based, no GPU required
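
To make the merged-cell feature concrete, here is a minimal sketch of how cells carrying span information can be rendered as rowspan / colspan HTML. It is not the library's internal code: the cell dict layout (`text`, `rowspan`, `colspan`) and the helper names `render_cell` and `render_table` are assumptions made for this illustration.

```python
from html import escape
from typing import Dict, List


def render_cell(cell: Dict) -> str:
    """Render one cell; span attributes are emitted only when greater than 1."""
    attrs = ""
    if cell.get("rowspan", 1) > 1:
        attrs += f' rowspan="{cell["rowspan"]}"'
    if cell.get("colspan", 1) > 1:
        attrs += f' colspan="{cell["colspan"]}"'
    return f"<td{attrs}>{escape(cell['text'])}</td>"


def render_table(rows: List[List[Dict]]) -> str:
    """Render rows of cells; slots covered by a span are simply left out of `rows`."""
    body = "".join(
        "<tr>" + "".join(render_cell(cell) for cell in row) + "</tr>"
        for row in rows
    )
    return f"<table>{body}</table>"


# Hypothetical two-row header: "Region" spans two rows, "Population" two columns.
header = [
    [{"text": "Region", "rowspan": 2}, {"text": "Population", "colspan": 2}],
    [{"text": "Urban"}, {"text": "Rural"}],
]
html_out = render_table(header)
```

Keeping merged cells as a single `<td>` with span attributes, rather than repeating the text in every covered slot, preserves the header hierarchy for downstream retrieval.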

Requirements

  • Python 3.8+
  • pdfplumber >= 0.10.0

Installation

pip install ragtable-extract

Or from source:

git clone https://github.com/ZhuJiaxin2/ragtable-extract.git
cd ragtable-extract
pip install -e .

Quick Start

import ragtable_extract

# Convert PDF tables to HTML file
ragtable_extract.convert(
    input_path="document.pdf",
    output_path="tables.html",
)

# Or extract as structured data
tables = ragtable_extract.extract(input_path="document.pdf")
for t in tables:
    print(f"Page {t['page']}: {t['html'][:80]}...")
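
For a RAG pipeline, the extracted tables usually need to become retrieval records. The sketch below assumes only what the Quick Start shows, namely that each item returned by `extract()` is a dict with `page` and `html` keys. The record fields (`id`, `text`) and the helpers `table_text` and `to_records` are hypothetical names for this example, not part of the library.

```python
from html.parser import HTMLParser
from typing import Dict, List


class _RowCollector(HTMLParser):
    """Collects the text of each <td>/<th>, grouped by <tr>."""

    def __init__(self) -> None:
        super().__init__()
        self.rows: List[List[str]] = []
        self._cell = None  # list of text fragments while inside a cell

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th") and self.rows:
            self._cell = []

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self.rows[-1].append("".join(self._cell).strip())
            self._cell = None

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)


def table_text(table_html: str) -> str:
    """Flatten table HTML to text: one line per row, cells joined by ' | '."""
    parser = _RowCollector()
    parser.feed(table_html)
    parser.close()
    return "\n".join(" | ".join(row) for row in parser.rows)


def to_records(tables: List[Dict], source: str) -> List[Dict]:
    """Build one retrieval record per table, keeping the HTML for display."""
    return [
        {
            "id": f"{source}#p{t['page']}-t{i}",
            "page": t["page"],
            "html": t["html"],
            "text": table_text(t["html"]),
        }
        for i, t in enumerate(tables)
    ]


# Stand-in for the output of extract(); only the 'page' and 'html' keys are used.
sample = [{
    "page": 3,
    "html": "<table><tr><th>Indicator</th><th>Target</th></tr>"
            "<tr><td>Coverage</td><td>≥90%</td></tr></table>",
}]
records = to_records(sample, "document.pdf")
```

With the library installed, `to_records(ragtable_extract.extract(input_path="document.pdf"), "document.pdf")` would produce the same kind of records from a real file. Indexing the flattened `text` while storing the `html` lets the retriever match on cell contents and still hand the LLM the full table structure.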

CLI

python -m ragtable_extract document.pdf output.html

Web Quick Test (app.py)

Run the Flask web app to upload PDFs and preview extraction results in the browser:

pip install flask
python app.py

Then open http://localhost:1965 to upload a PDF and view extracted tables.

Test Results

Run python test.py to generate extraction results. Output files:

| Source PDF | Extraction Result |
| --- | --- |
| test/example/zhejiang.pdf | test/result/test_adaptive_zhejiang.html |
| test/example/changsha.pdf | test/result/test_adaptive_changsha.html |
| test/example/shaanxi.pdf | test/result/test_adaptive_shaanxi.html |
| test/example/tongbao.pdf | test/result/test_adaptive_tongbao.html |

API

| Function | Description |
| --- | --- |
| convert(input_path, output_path, pages?, config?, use_adaptive_config=True) | Convert PDF tables to an HTML file |
| extract(input_path, pages?, config?, use_adaptive_config=True) | Extract tables as a list of dicts with keys page, html, bbox, raw |
| build_full_html(pdf_filename, tables) | Build a full HTML document from extracted tables |
| Config | Dataclass for tuning extraction (multiline thresholds, font tolerance, etc.) |

Configuration

import ragtable_extract

# Custom config
config = ragtable_extract.Config(
    multiline_cell_top_range=25,
    multiline_y_tolerance=4,
)
tables = ragtable_extract.extract("doc.pdf", config=config)

# Adaptive config (default) — infers parameters from page character metrics
tables = ragtable_extract.extract("doc.pdf")  # use_adaptive_config=True by default
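
To give a feel for what "infers parameters from page character metrics" can mean, here is a hedged sketch that derives the two thresholds shown above from the median character height on a page. This is not the library's actual formula: the function name `infer_thresholds` and the scaling factors `y_factor` and `range_factor` are assumptions chosen for illustration. The input is a list of pdfplumber-style char dicts with `top` and `bottom` keys.

```python
from statistics import median
from typing import Dict, List, Optional


def infer_thresholds(
    chars: List[Dict],
    y_factor: float = 0.4,       # assumed: tolerance as a fraction of glyph height
    range_factor: float = 2.5,   # assumed: multiline range as a multiple of glyph height
) -> Optional[Dict[str, float]]:
    """Scale the thresholds with the median character height of the page."""
    heights = [c["bottom"] - c["top"] for c in chars if c["bottom"] > c["top"]]
    if not heights:
        return None  # no usable characters; the caller keeps its defaults
    h = median(heights)
    return {
        "multiline_y_tolerance": round(h * y_factor, 2),
        "multiline_cell_top_range": round(h * range_factor, 2),
    }


# Three synthetic characters with heights 10, 10 and 12 (median 10).
page_chars = [
    {"top": 100, "bottom": 110},
    {"top": 100, "bottom": 110},
    {"top": 120, "bottom": 132},
]
thresholds = infer_thresholds(page_chars)
# → {"multiline_y_tolerance": 4.0, "multiline_cell_top_range": 25.0}
```

Using the median rather than the mean keeps a few oversized heading glyphs from inflating the thresholds. If the `Config` fields accept these values, the dict could be passed as `ragtable_extract.Config(**thresholds)`.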

Project Structure

ragtable-extract/
├── ragtable_extract/     # Core library
│   ├── __init__.py       # convert(), extract()
│   ├── _core.py          # Table extraction logic
│   ├── _config.py        # Config & adaptive metrics
│   ├── _font.py          # Special font handling
│   └── _html.py          # HTML template
├── pyproject.toml
├── demo.py               # CLI demo
└── app.py                # Optional Flask web API

How It Works

PDF → pdfplumber.find_tables()
  → Filter chars by bbox, cluster by top (y)
  → Reorder ≥/≤ symbols, fix Fangzheng font
  → Output <table> HTML
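
The middle steps of this pipeline can be sketched in a few lines. The code below is an illustration under stated assumptions, not the contents of `_core.py`: the helper names `chars_in_bbox`, `cluster_lines`, `cell_text`, and `iter_cell_texts` and the default `y_tolerance` are made up for this example. It relies only on pdfplumber features that exist today: char dicts with `text`, `x0`, `x1`, `top`, `bottom`, plus `page.find_tables()`, `table.cells`, and `page.chars`.

```python
from typing import Dict, Iterator, List, Tuple

BBox = Tuple[float, float, float, float]  # (x0, top, x1, bottom), as in pdfplumber


def chars_in_bbox(chars: List[Dict], bbox: BBox) -> List[Dict]:
    """Keep the characters whose center point lies inside bbox."""
    x0, top, x1, bottom = bbox
    kept = []
    for ch in chars:
        cx = (ch["x0"] + ch["x1"]) / 2
        cy = (ch["top"] + ch["bottom"]) / 2
        if x0 <= cx <= x1 and top <= cy <= bottom:
            kept.append(ch)
    return kept


def cluster_lines(chars: List[Dict], y_tolerance: float = 3.0) -> List[List[Dict]]:
    """Group characters into visual lines by 'top', then sort each line by x0."""
    lines: List[List[Dict]] = []
    for ch in sorted(chars, key=lambda c: c["top"]):
        # Compare against the first (topmost) character of the current line.
        if lines and abs(ch["top"] - lines[-1][0]["top"]) <= y_tolerance:
            lines[-1].append(ch)
        else:
            lines.append([ch])
    return [sorted(line, key=lambda c: c["x0"]) for line in lines]


def cell_text(chars: List[Dict], bbox: BBox, y_tolerance: float = 3.0) -> str:
    """Concatenate a cell's wrapped lines in reading order."""
    lines = cluster_lines(chars_in_bbox(chars, bbox), y_tolerance)
    return "".join(ch["text"] for line in lines for ch in line)


def iter_cell_texts(pdf_path: str) -> Iterator[Tuple[int, str]]:
    """Yield (page_number, text) for every detected table cell in a PDF."""
    import pdfplumber  # imported lazily so the helpers above need no dependency

    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            for table in page.find_tables():
                for cell_bbox in table.cells:
                    yield page.page_number, cell_text(page.chars, cell_bbox)


# Synthetic cell: "≥" comes last in stream order but sits leftmost on the page,
# and "%" has wrapped onto a second line.
chars = [
    {"text": "9", "x0": 20, "x1": 26, "top": 100.0, "bottom": 110.0},
    {"text": "0", "x0": 26, "x1": 32, "top": 100.5, "bottom": 110.5},
    {"text": "≥", "x0": 10, "x1": 18, "top": 101.0, "bottom": 111.0},
    {"text": "%", "x0": 10, "x1": 18, "top": 115.0, "bottom": 125.0},
]
result = cell_text(chars, (5, 95, 60, 130))  # → "≥90%"
```

Sorting each clustered line by `x0` is what moves the ≥ back to the front, and joining the lines handles the wrapped text. Font-specific fixes and merged-cell detection are not covered by this sketch.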

Comparison Report

We compare ragtable-extract with opendataloader-pdf on real government PDF tables. In these comparisons, ragtable-extract handles:

  • Multi-column tables — Correctly recognizes complex layouts with merged cells
  • Line-wrapped text — Automatically segments and concatenates text across line breaks within cells
  • No serialization — Symbols and text stay in their own cells (e.g. 1 and 2, or 万人 ("ten thousand people") and %, are not wrongly merged)

Run python test_comparison.py to generate the report, then open comparison_report.html for side-by-side results.

License

MIT
