Excel to structured JSON (tables, shapes, charts) for LLM/RAG pipelines

ExStruct — Excel Structured Extraction Engine

ExStruct reads Excel workbooks and outputs structured data (tables, shapes, charts) as JSON by default, with optional YAML/TOON formats. It targets both COM/Excel environments (rich extraction) and non-COM environments (cells + table candidates), with tunable detection heuristics and multiple output modes to fit LLM/RAG pipelines.

Features

  • Excel → Structured JSON: cells, shapes, charts, and table candidates per sheet.
  • Output modes: light (cells + table candidates only), standard (shapes that contain text, arrows, and charts), verbose (all shapes, with width/height).
  • Formats: JSON (compact by default, --pretty available), YAML, TOON (optional dependencies).
  • Table detection tuning: adjust heuristics at runtime via API.
  • CLI rendering (Excel required): optional PDF and per-sheet PNGs.
  • Graceful fallback: if Excel COM is unavailable, extraction falls back to cells + table candidates without crashing.

Installation

pip install exstruct

Optional extras:

  • YAML: pip install pyyaml
  • TOON: pip install python-toon
  • Rendering (PDF/PNG): Excel + pip install pypdfium2
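
Because the YAML extra is optional, a caller may want to check for it before requesting that format. The following is a minimal sketch, not part of ExStruct: choose_format is a hypothetical helper, and it only knows about PyYAML, whose importable module name is yaml.

```python
from importlib.util import find_spec


def choose_format(preferred: str = "yaml") -> str:
    """Return `preferred` if its optional dependency is importable, else "json".

    Hypothetical helper for illustration; it is not provided by ExStruct.
    """
    # Output format -> module that must be importable for it to work.
    # Only YAML is listed; formats without an entry are passed through unchecked.
    required_module = {"yaml": "yaml"}
    module = required_module.get(preferred)
    if module is None or find_spec(module) is not None:
        return preferred
    return "json"  # JSON needs no extra dependency


print(choose_format("yaml"))
```

The result can then be passed to the CLI's --format option.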

Quick Start (CLI)

exstruct input.xlsx                # compact JSON (default)
exstruct input.xlsx --pretty       # pretty-printed JSON
exstruct input.xlsx --format yaml  # YAML (needs pyyaml)
exstruct input.xlsx --format toon  # TOON (needs python-toon)
exstruct input.xlsx --mode light   # cells + table candidates only
exstruct input.xlsx --pdf --image  # PDF and PNGs (Excel required)

Quick Start (Python)

from pathlib import Path
from exstruct import extract, export, set_table_detection_params

# Tune table detection (optional)
set_table_detection_params(table_score_threshold=0.3, density_min=0.04)

# Extract with modes: "light", "standard", "verbose"
wb = extract("input.xlsx", mode="standard")
export(wb, Path("out.json"), pretty=False)  # compact JSON
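
The pretty flag trades output size for readability. As a stdlib-only illustration of that trade-off (it does not call ExStruct, and the sample dictionary is invented rather than real ExStruct output), compare the two serializations:

```python
import json

# Invented sample data; the real ExStruct schema may differ.
sample = {"sheet": "Sheet1", "rows": [{"r": 1, "c": {"0": "Name", "1": "Qty"}}]}

# Compact: no indentation and no spaces after separators.
compact = json.dumps(sample, separators=(",", ":"))
# Pretty: indented for human readers.
pretty = json.dumps(sample, indent=2)

print(len(compact), len(pretty))
```

Both strings decode to the same data; only the whitespace differs, which is why compact output costs fewer tokens in an LLM prompt.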

Table Detection Tuning

from exstruct import set_table_detection_params

set_table_detection_params(
    table_score_threshold=0.35,  # increase to be stricter
    density_min=0.05,
    coverage_min=0.2,
    min_nonempty_cells=3,
)

Use higher thresholds to reduce false positives; lower them if true tables are missed.
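
To build intuition for what density-style thresholds do, here is a toy sketch. It is not ExStruct's detection algorithm: the functions are hypothetical, the parameter names are borrowed from the call above only for familiarity, and their exact meaning inside ExStruct is an assumption.

```python
def count_nonempty(grid):
    """Count cells that are neither None nor an empty string."""
    return sum(1 for row in grid for value in row if value not in (None, ""))


def region_density(grid):
    """Fraction of non-empty cells in a rectangular region."""
    total = sum(len(row) for row in grid)
    return count_nonempty(grid) / total if total else 0.0


def looks_like_table(grid, density_min=0.05, min_nonempty_cells=3):
    """Toy filter: enough filled cells, and filled densely enough."""
    return (
        count_nonempty(grid) >= min_nonempty_cells
        and region_density(grid) >= density_min
    )


dense = [["Name", "Qty"], [None, 3]]
sparse = [["note", None, None, None]]
print(looks_like_table(dense), looks_like_table(sparse))
```

Raising density_min or min_nonempty_cells in this toy filter rejects more regions, which mirrors the advice above about stricter thresholds reducing false positives.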

Output Modes

  • light: cells + table candidates (no COM needed).
  • standard: shapes that contain text, arrows, and charts (via COM if available), plus table candidates.
  • verbose: all shapes (with width/height), charts, table candidates.

Error Handling / Fallbacks

  • Excel COM unavailable → falls back to cells + table candidates; shapes/charts empty.
  • Shape extraction failure → logs warning, still returns cells + table candidates.
  • The CLI prints errors to stdout/stderr and exits with a non-zero status code on failure.
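
Since failures are signalled through the exit status, a calling script can check it. A minimal sketch, assuming the exstruct executable is on PATH; run_cli is a hypothetical wrapper around the standard library, not an ExStruct API.

```python
import subprocess


def run_cli(cmd):
    """Run a command and return (ok, stdout, stderr).

    ok is False when the process exits non-zero or the executable is missing.
    """
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True)
    except FileNotFoundError as exc:
        # The executable itself could not be found.
        return False, "", str(exc)
    return proc.returncode == 0, proc.stdout, proc.stderr
```

For example, ok, out, err = run_cli(["exstruct", "input.xlsx"]) leaves the JSON in out when ok is true, and any error text in out or err otherwise.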

Optional Rendering

Requires Excel and pypdfium2.

exstruct input.xlsx --pdf --image --dpi 144

Creates <output>.pdf and an <output>_images/ directory containing per-sheet PNGs.

Notes

  • Default JSON is compact to reduce tokens; use --pretty or pretty=True when readability matters.
  • The table_candidates field replaces the earlier tables field; update downstream consumers accordingly.
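
A downstream consumer that must read both older and newer output could bridge the field rename as sketched below. This assumes each sheet is a JSON object carrying the field at its top level, which this README does not spell out; get_table_candidates is a hypothetical helper.

```python
def get_table_candidates(sheet: dict) -> list:
    """Read table candidates from a sheet object, tolerating the old field name."""
    if "table_candidates" in sheet:
        return sheet["table_candidates"]
    # Fall back to the earlier field name; default to an empty list.
    return sheet.get("tables", [])


new_style = {"table_candidates": ["A1:C5"]}
old_style = {"tables": ["B2:D9"]}
print(get_table_candidates(new_style), get_table_candidates(old_style))
```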

License

BSD-3-Clause. See LICENSE for details.
