Excel to structured JSON (tables, shapes, charts) for LLM/RAG pipelines
ExStruct — Excel Structured Extraction Engine
ExStruct reads Excel workbooks and outputs structured data (cells, table candidates, shapes, charts, print areas/views, auto page-break areas, hyperlinks) as JSON by default, with optional YAML/TOON formats. It targets both COM/Excel environments (rich extraction) and non-COM environments (cells + table candidates + print areas), with tunable detection heuristics and multiple output modes to fit LLM/RAG pipelines.
Features
- Excel → Structured JSON: cells, shapes, charts, table candidates, print areas/views, and auto page-break areas per sheet.
- Output modes: light (cells + table candidates + print areas; no COM, shapes/charts empty), standard (shapes with text + arrows, charts, print areas), verbose (all shapes with width/height, charts with size, print areas). Verbose also emits cell hyperlinks. Size output is flag-controlled.
- Auto page-break export (COM only): capture Excel-computed auto page breaks and write per-area JSON/YAML/TOON when requested (CLI option appears only when COM is available).
- Formats: JSON (compact by default, --pretty available), YAML, TOON (optional dependencies).
- Table detection tuning: adjust heuristics at runtime via API.
- CLI rendering (Excel required): optional PDF and per-sheet PNGs.
- Graceful fallback: if Excel COM is unavailable, extraction falls back to cells + table candidates without crashing.
Installation
pip install exstruct
Optional extras:
- YAML: pip install pyyaml
- TOON: pip install python-toon
- Rendering (PDF/PNG): Excel + pip install pypdfium2 pillow
- All extras at once: pip install exstruct[yaml,toon,render]
Platform note:
- Full extraction (shapes/charts) targets Windows + Excel (COM via xlwings). On other platforms, use mode=light to get cells + table_candidates.
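The platform note can be turned into a small guard in calling code. The sketch below is only an illustration: choose_mode is a hypothetical helper, not part of ExStruct, and running on Windows does not prove that Excel is installed (ExStruct falls back on its own in that case).

```python
import sys


def choose_mode(platform: str = sys.platform, want_shapes: bool = True) -> str:
    """Pick an ExStruct mode string from the platform (hypothetical helper)."""
    # Shapes and charts rely on Excel COM, which is only available on Windows.
    if platform == "win32" and want_shapes:
        return "standard"
    # Everywhere else, "light" yields cells + table candidates without COM.
    return "light"


print(choose_mode())
```

The returned string could then be passed as the mode argument of extract(...).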
Quick Start (CLI)
exstruct input.xlsx > output.json # compact JSON to stdout (default)
exstruct input.xlsx -o out.json --pretty # pretty JSON to a file
exstruct input.xlsx --format yaml # YAML (needs pyyaml)
exstruct input.xlsx --format toon # TOON (needs python-toon)
exstruct input.xlsx --sheets-dir sheets/ # split per sheet in chosen format
exstruct input.xlsx --print-areas-dir areas/ # split per print area (if any)
exstruct input.xlsx --auto-page-breaks-dir auto_areas/ # COM only; option appears when available
exstruct input.xlsx --mode light # cells + table candidates only
exstruct input.xlsx --pdf --image # PDF and PNGs (Excel required)
Auto page-break exports are available via API and CLI when Excel/COM is available; the CLI exposes --auto-page-breaks-dir only in COM-capable environments.
Quick Start (Python)
from pathlib import Path
from exstruct import extract, export, set_table_detection_params
# Tune table detection (optional)
set_table_detection_params(table_score_threshold=0.3, density_min=0.04)
# Extract with modes: "light", "standard", "verbose"
wb = extract("input.xlsx", mode="standard")
export(wb, Path("out.json"), pretty=False) # compact JSON
# Model helpers: iterate, index, and serialize directly
first_sheet = wb["Sheet1"] # __getitem__ access
for name, sheet in wb:  # __iter__ yields (name, SheetData)
    print(name, len(sheet.rows))
wb.save("out.json", pretty=True) # WorkbookData → file (by extension)
first_sheet.save("sheet.json") # SheetData → file (by extension)
print(first_sheet.to_yaml()) # YAML text (requires pyyaml)
# ExStructEngine: per-instance options (nested configs)
from exstruct import (
    DestinationOptions,
    ExStructEngine,
    FilterOptions,
    FormatOptions,
    OutputOptions,
    StructOptions,
    export_auto_page_breaks,
)
engine = ExStructEngine(
    options=StructOptions(mode="verbose"),  # verbose includes hyperlinks by default
    output=OutputOptions(
        format=FormatOptions(pretty=True),
        filters=FilterOptions(include_shapes=False),  # drop shapes in output
        destinations=DestinationOptions(sheets_dir=Path("out_sheets")),  # also write per-sheet files
    ),
)
wb2 = engine.extract("input.xlsx")
engine.export(wb2, Path("out_filtered.json")) # drops shapes via filters
# Enable hyperlinks in other modes
engine_links = ExStructEngine(options=StructOptions(mode="standard", include_cell_links=True))
with_links = engine_links.extract("input.xlsx")
# Export per print area (if print areas exist)
from exstruct import export_print_areas_as
export_print_areas_as(wb, "areas", fmt="json", pretty=True)
# Auto page-break extraction/output (COM only; raises if no auto breaks exist)
engine_auto = ExStructEngine(
    output=OutputOptions(
        destinations=DestinationOptions(auto_page_breaks_dir=Path("auto_areas"))
    )
)
wb_auto = engine_auto.extract("input.xlsx") # includes SheetData.auto_print_areas
engine_auto.export(wb_auto, Path("out_with_auto.json")) # also writes auto_areas/*
export_auto_page_breaks(wb_auto, "auto_areas", fmt="json", pretty=True) # manual writer
Note (non-COM environments): If Excel COM is unavailable, extraction still runs and returns cells + table_candidates; shapes/charts will be empty.
Table Detection Tuning
from exstruct import set_table_detection_params
set_table_detection_params(
    table_score_threshold=0.35,  # increase to be stricter
    density_min=0.05,
    coverage_min=0.2,
    min_nonempty_cells=3,
)
Use higher thresholds to reduce false positives; lower them if true tables are missed.
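To build intuition for why raising a threshold removes false positives, here is a toy filter. It is not ExStruct's actual scoring, which this README does not describe. It only borrows the parameter names density_min and min_nonempty_cells and assumes that density means non-empty cells divided by the bounding-box area.

```python
def keep_candidate(cells, density_min=0.05, min_nonempty_cells=3):
    """Toy filter: cells is a set of (row, col) positions that hold a value."""
    if len(cells) < min_nonempty_cells:
        return False
    rows = [r for r, _ in cells]
    cols = [c for _, c in cells]
    # Bounding box of the candidate region (inclusive on both ends).
    area = (max(rows) - min(rows) + 1) * (max(cols) - min(cols) + 1)
    return len(cells) / area >= density_min


dense = {(r, c) for r in range(1, 4) for c in range(1, 4)}  # filled 3x3 block
sparse = {(1, 1), (1, 50), (50, 1)}                         # 3 cells in a 50x50 box

print(keep_candidate(dense))                      # kept: density is 1.0
print(keep_candidate(sparse))                     # dropped: density is 0.0012
print(keep_candidate(sparse, density_min=0.001))  # kept once the bar is lowered
```

Under these assumptions, the scattered cells pass only when density_min is lowered, which mirrors the trade-off described above.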
Output Modes
- light: cells + table candidates (no COM needed).
- standard: shapes with text + arrows, charts (COM if available), table candidates. Hyperlinks are off unless include_cell_links=True.
- verbose: all shapes (with width/height), charts, table candidates, and cell hyperlinks.
Error Handling / Fallbacks
- Excel COM unavailable → falls back to cells + table candidates; shapes/charts empty.
- Shape extraction failure → logs warning, still returns cells + table candidates.
- CLI prints errors to stdout/stderr and returns non-zero on failures.
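Because the fallback is silent apart from log warnings, a downstream consumer may want a quick check on the exported JSON. The sketch below assumes the output shape used in this README (sheets, then sheet name, then shapes and charts). rich_data_missing is a hypothetical helper, and an empty result is only a hint, since a sheet may simply contain no shapes or charts.

```python
def rich_data_missing(workbook: dict, requested_mode: str) -> list[str]:
    """List sheets whose shapes and charts are both empty in a non-light run."""
    if requested_mode == "light":
        return []  # light mode never populates shapes/charts
    return [
        name
        for name, sheet in workbook.get("sheets", {}).items()
        if not sheet.get("shapes") and not sheet.get("charts")
    ]


wb_json = {
    "sheets": {
        "Sheet1": {"rows": [], "shapes": [], "charts": []},
        "Sheet2": {"rows": [], "shapes": [{"text": "Start"}], "charts": []},
    }
}
print(rich_data_missing(wb_json, "standard"))
```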
Optional Rendering
Requires Excel and pypdfium2.
exstruct input.xlsx --pdf --image --dpi 144
Creates <output>.pdf and <output>_images/ PNGs per sheet.
Benchmark: Excel Structuring Demo
To show how well exstruct can structure Excel, we parse a workbook that combines three elements on one sheet and share an AI reasoning benchmark that uses the JSON output.
- Table (sales data)
- Line chart
- Flowchart built only with shapes
(Screenshot below is the actual sample Excel sheet)
Sample workbook: sample/sample.xlsx
1. Input: Excel Sheet Overview
This sample Excel contains:
① Table (Sales Data)
| Month | Product A | Product B | Product C |
|---|---|---|---|
| Jan-25 | 120 | 80 | 60 |
| Feb-25 | 135 | 90 | 64 |
| Mar-25 | 150 | 100 | 70 |
| Apr-25 | 170 | 110 | 72 |
| May-25 | 160 | 120 | 75 |
| Jun-25 | 180 | 130 | 80 |
② Chart (Line Chart)
- Title: Sales Data
- Series: Product A / Product B / Product C (six months)
- Y axis: 0–200
③ Flowchart built with shapes
The sheet includes this flow:
- Start / End
- Format check
- Loop (items remaining?)
- Error handling
- Yes/No decision for sending email
2. Output: Structured JSON produced by exstruct (excerpt)
Below is a shortened JSON output example from parsing this Excel workbook.
{
"book_name": "sample.xlsx",
"sheets": {
"Sheet1": {
"rows": [
{
"r": 3,
"c": {
"1": "月",
"2": "製品A",
"3": "製品B",
"4": "製品C"
}
},
...
],
"shapes": [
{
"text": "開始",
"l": 148,
"t": 220,
"type": "AutoShape-FlowchartProcess"
},
{
"text": "入力データ読み込み",
"l": 132,
"t": 282,
"type": "AutoShape-FlowchartProcess"
},
{
"l": 193,
"t": 246,
"type": "AutoShape-Mixed",
"begin_arrow_style": 1,
"end_arrow_style": 2,
"direction": "N"
},
...
],
"charts": [
{
"name": "Chart 1",
"chart_type": "Line",
"title": "売上データ",
"y_axis_range": [
0.0,
200.0
],
"series": [
{
"name": "製品A",
"name_range": "Sheet1!$C$3",
"x_range": "Sheet1!$B$4:$B$9",
"y_range": "Sheet1!$C$4:$C$9"
},
...
],
"l": 377,
"t": 25
}
],
"table_candidates": [
"B3:E9"
]
}
}
}
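A consumer can combine rows with table_candidates to recover the table as a grid. The sketch below works on plain dicts with the shape of the excerpt above. It assumes row numbers are one-based and column keys are zero-based, which is inferred from the excerpt (製品A sits under key "2" while its name_range is $C$3). The data rows and the shortened range B3:E5 are illustrative, and slice_table, parse_range, and col_to_index are hypothetical helpers.

```python
import re


def col_to_index(letters: str) -> int:
    """Convert column letters to a zero-based index: A -> 0, B -> 1, AA -> 26."""
    n = 0
    for ch in letters.upper():
        n = n * 26 + (ord(ch) - ord("A") + 1)
    return n - 1


def parse_range(a1: str):
    """Parse an A1 range such as 'B3:E9' into (row1, col1, row2, col2)."""
    m = re.fullmatch(r"\$?([A-Za-z]+)\$?(\d+):\$?([A-Za-z]+)\$?(\d+)", a1)
    if m is None:
        raise ValueError(f"not an A1 range: {a1!r}")
    c1, r1, c2, r2 = m.groups()
    return int(r1), col_to_index(c1), int(r2), col_to_index(c2)


def slice_table(sheet: dict, a1: str):
    """Return the cells inside the range as a list of rows (None for blanks)."""
    r1, c1, r2, c2 = parse_range(a1)
    by_row = {row["r"]: row["c"] for row in sheet["rows"]}
    return [
        [by_row.get(r, {}).get(str(c)) for c in range(c1, c2 + 1)]
        for r in range(r1, r2 + 1)
    ]


sheet = {
    "rows": [
        {"r": 3, "c": {"1": "月", "2": "製品A", "3": "製品B", "4": "製品C"}},
        {"r": 4, "c": {"1": "2025-01-01", "2": 120, "3": 80, "4": 60}},
        {"r": 5, "c": {"1": "2025-02-01", "2": 135, "3": 90, "4": 64}},
    ],
    "table_candidates": ["B3:E5"],
}
table = slice_table(sheet, sheet["table_candidates"][0])
for line in table:
    print(line)
```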
3. How AI (Copilot / LLM) interprets the JSON
Below is the Markdown reconstruction of the Excel workbook. The table, chart, and flowchart are all represented.
---
## 📊 Sales Data Table
| Month | Product A | Product B | Product C |
| ---------- | --------- | --------- | --------- |
| 2025-01-01 | 120 | 80 | 60 |
| 2025-02-01 | 135 | 90 | 64 |
| 2025-03-01 | 150 | 100 | 70 |
| 2025-04-01 | 170 | 110 | 72 |
| 2025-05-01 | 160 | 120 | 75 |
| 2025-06-01 | 180 | 130 | 80 |
---
## 📈 Sales Data (Line Chart)
- Chart title: **売上データ (Sales Data)**
- Chart type: Line
- Y-axis range: 0 to 200
- Data series:
- Product A: 120 → 135 → 150 → 170 → 160 → 180
- Product B: 80 → 90 → 100 → 110 → 120 → 130
- Product C: 60 → 64 → 70 → 72 → 75 → 80
---
## 🔄 Process Flow (Mermaid Flowchart)
```mermaid
flowchart TD
A[Start]
B[Load input data]
C{Is format valid?}
D[Show error]
E[Process one item]
F{Items remaining?}
G[Generate output]
H{Send email?}
I[Send email]
J[Finish]
A --> B
B --> C
C -- no --> D
C -- yes --> E
E --> F
F -- yes --> E
F -- no --> G
G --> H
H -- yes --> I
H -- no --> J
I --> J
```
From this we can see:
exstruct's JSON is already in a format that AI can read and reason over directly.
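If some pre-processing is wanted before prompting, the shapes list can be split into flowchart nodes and connectors. This sketch relies only on what the excerpt shows: node shapes carry text, and connector shapes carry begin_arrow_style / end_arrow_style and a direction. Whether that holds for every workbook is an assumption, and split_shapes is a hypothetical helper.

```python
def split_shapes(shapes: list[dict]):
    """Separate text-bearing nodes (sorted top to bottom) from connectors."""
    nodes = sorted(
        (s for s in shapes if s.get("text")),
        key=lambda s: (s["t"], s["l"]),  # top coordinate first, then left
    )
    connectors = [
        s for s in shapes
        if "begin_arrow_style" in s or "end_arrow_style" in s
    ]
    return nodes, connectors


shapes = [
    {"text": "入力データ読み込み", "l": 132, "t": 282, "type": "AutoShape-FlowchartProcess"},
    {"l": 193, "t": 246, "type": "AutoShape-Mixed",
     "begin_arrow_style": 1, "end_arrow_style": 2, "direction": "N"},
    {"text": "開始", "l": 148, "t": 220, "type": "AutoShape-FlowchartProcess"},
]
nodes, connectors = split_shapes(shapes)
print([n["text"] for n in nodes])
print([c.get("direction") for c in connectors])
```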
4. Summary
This benchmark confirms exstruct can:
- Parse tables, charts, and shapes (flowcharts) simultaneously
- Convert the semantic structure of Excel into JSON
- Let AI/LLMs read that JSON directly and reconstruct the workbook contents
In short, exstruct = “an engine that converts Excel into a format AI can understand.”
Notes
- Default JSON is compact to reduce tokens; use --pretty or pretty=True when readability matters.
- Field table_candidates replaces tables; adjust downstream consumers accordingly.
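The size difference between compact and pretty output is easy to see with the standard library alone. The snippet below does not call ExStruct, and the separators it uses are an assumption about what "compact" means here. It only shows that both forms decode to the same data.

```python
import json

data = {
    "book_name": "sample.xlsx",
    "sheets": {
        "Sheet1": {
            "rows": [{"r": 3, "c": {"1": "月", "2": "製品A"}}],
            "table_candidates": ["B3:E9"],
        }
    },
}

# Compact: no indentation and no spaces after separators.
compact = json.dumps(data, ensure_ascii=False, separators=(",", ":"))
# Pretty: two-space indentation for human readers.
pretty = json.dumps(data, ensure_ascii=False, indent=2)

print(len(compact), len(pretty))
```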
Enterprise Use
ExStruct is used primarily as a library, not a service.
- No official support or SLA is provided
- Long-term stability is prioritized over rapid feature growth
- Forking and internal modification are expected in enterprise use
This project is suitable for teams that:
- need transparency over black-box tools
- are comfortable maintaining internal forks if necessary
Print Areas and Auto Page Breaks (PrintArea / PrintAreaView)
- SheetData.print_areas holds print areas (cell coordinates) in light/standard/verbose.
- SheetData.auto_print_areas holds Excel COM-computed auto page-break areas when auto page-break extraction is enabled (COM only).
- Use export_print_areas_as(...) or CLI --print-areas-dir to write one file per print area (nothing is written if none exist).
- Use CLI --auto-page-breaks-dir (COM only), DestinationOptions.auto_page_breaks_dir (preferred), or export_auto_page_breaks(...) to write per-auto-page-break files; the API raises ValueError if no auto page breaks exist.
- PrintAreaView includes rows and table candidates inside the area, plus shapes/charts that overlap the area (size-less shapes are treated as points).
- normalize=True rebases row/col indices to the area origin.
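Rebasing to the area origin amounts to subtracting the area's top-left coordinates from every row number and column key. The sketch below illustrates the idea on plain dicts. rebase_rows is a hypothetical helper, and the choice of (0, 0) as the new origin is an assumption, since this README does not state which base normalize=True uses.

```python
def rebase_rows(rows, top_row, left_col):
    """Shift row numbers and column keys so the area's top-left becomes (0, 0)."""
    rebased = []
    for row in rows:
        rebased.append({
            "r": row["r"] - top_row,
            # Column keys are strings in the JSON output, so convert and re-stringify.
            "c": {str(int(k) - left_col): v for k, v in row["c"].items()},
        })
    return rebased


area_rows = [
    {"r": 3, "c": {"1": "Month", "2": "Product A"}},
    {"r": 4, "c": {"1": "2025-01-01", "2": 120}},
]
normalized = rebase_rows(area_rows, top_row=3, left_col=1)
print(normalized)
```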
License
BSD-3-Clause. See LICENSE for details.
Documentation
- API Reference (GitHub Pages): https://harumiweb.github.io/exstruct/
- JSON Schemas: see schemas/ (one file per model); regenerate via python scripts/gen_json_schema.py.