Skip to main content

Comprehensive Excel extraction SDK — structured chunks for AI/LLM/RAG pipelines

Project description

omnidoc-excel-extractor

Comprehensive Excel extraction SDK — 57 structured chunk types for AI, LLM, and RAG pipelines.

Supports .xlsx · .xlsm · .xlsb · .xls


Table of Contents


Overview

omnidoc-excel-extractor converts any Excel file into a structured list of chunks — typed, JSON-serialisable Pydantic objects that capture every layer of meaning in the workbook: raw data, schema, formulas, KPIs, charts, pivot tables, VBA macros, cross-sheet relationships, Power Query, and more.

Each chunk carries:

Field Description
chunk_id UUID — globally unique identifier
chunk_type String name of the chunk class
layer structural / semantic / analytical / validation / visual / vba / cross_ref / connectivity / operational
priority must / should / nice — extraction importance tier
(type-specific fields) All fields defined per chunk type

Why chunks? LLM context windows are finite. Chunking lets you embed, retrieve, and inject only the Excel knowledge relevant to each query — instead of dumping raw cell values into a prompt.


Architecture

omnidoc_excel_extractor/
│
├── __init__.py                  ← public API (ExcelExtractor, ExtractionResult)
│
├── chunks/
│   └── models.py                ← 57 Pydantic v2 chunk models + CHUNK_REGISTRY
│
├── core/
│   ├── extractor.py             ← orchestrator — runs every builder in order
│   ├── workbook_parser.py       ← workbook metadata, named ranges, external links
│   ├── sheet_parser.py          ← rows, merged cells, groups, validations, comments
│   ├── formula_extractor.py     ← formula detection, classification, lookup/agg detection
│   └── vba_extractor.py         ← VBA module extraction via oletools
│
├── builder/
│   ├── structural.py            ← chunks 1-9
│   ├── semantic.py              ← chunks 10-15
│   ├── analytical.py            ← chunks 16-23
│   ├── validation.py            ← chunks 24-28
│   ├── visual.py                ← chunks 29-41
│   ├── vba.py                   ← chunks 42-46
│   ├── crossref.py              ← chunks 47-50
│   ├── connectivity.py          ← chunks 51-54
│   └── operational.py           ← chunks 55-57
│
├── utils/
│   ├── cell_utils.py            ← cell address parsing, range iteration
│   ├── type_inference.py        ← dtype detection, semantic role, normalisation
│   └── stats.py                 ← aggregations, outlier detection, trend, frequency
│
└── _adapters/
    ├── xlrd_adapter.py          ← openpyxl-compatible wrapper for .xls (xlrd)
    └── xlsb_adapter.py          ← openpyxl-compatible wrapper for .xlsb (pyxlsb)

Extraction pipeline:

Excel file
    │
    ▼
_load_workbook()  ──► openpyxl (xlsx/xlsm)
                  ──► xlrd adapter  (xls)
                  ──► pyxlsb adapter (xlsb)
    │
    ▼
builder/* ──► 57 chunk builders run in priority order
    │
    ▼
ExtractionResult (chunks: list[BaseChunk], index: ChunkIndexChunk)

Installation

pip install omnidoc-excel-extractor

Optional — VBA/macro extraction (.xlsm, .xlsb, .xls):

pip install oletools

Full install with all optional extras:

pip install omnidoc-excel-extractor oletools networkx

Python version: 3.9+


Quick Start

from omnidoc_excel_extractor import ExcelExtractor

extractor = ExcelExtractor("report.xlsx")
result = extractor.extract()

# High-level summary
print(result.summary())
# {
#   'total_chunks': 412,
#   'type_counts': {
#     'WorkbookChunk': 1, 'SheetChunk': 4, 'TableChunk': 2,
#     'SchemaChunk': 4, 'RowChunk': 1800, 'FormulaDefinitionChunk': 37,
#     'KPIChunk': 5, 'ChartChunk': 3, ...
#   },
#   'workbook': 'report',
#   'created_at': '2026-05-16T10:00:00'
# }

# All chunks as JSON-serialisable dicts
dicts = result.to_dicts()

# Filter by type
schemas  = result.by_type("SchemaChunk")
formulas = result.by_type("FormulaDefinitionChunk")

# Filter by sheet
sales_chunks = result.by_sheet("Sales")

# Access the master manifest
index = result.index   # ChunkIndexChunk
print(index.type_counts)
print(index.all_chunk_ids)

Extraction Options

result = extractor.extract(
    # Selectively include only specific chunk types (list of class name strings)
    chunk_types=["WorkbookChunk", "SchemaChunk", "FormulaDefinitionChunk"],

    # Include RowChunks — can produce thousands of chunks for large sheets
    include_rows=True,

    # Include analytical chunks: AggregationChunk, TrendChunk, OutlierChunk,
    # KPIChunk, TemporalChunk, FormulaDefinitionChunk, FormulaResultChunk
    include_analytics=True,

    # Include visual chunks: ChartChunk, PivotTableChunk, SparklineChunk,
    # SlicerChunk, TimelineChunk, ShapeChunk, ImageChunk, FormControlChunk
    include_visual=True,

    # Include VBA chunks (only activates for .xlsm / .xlsb / .xls files
    # and requires oletools to be installed)
    include_vba=True,

    # Number of data rows per RowChunk batch
    row_batch_size=500,

    # Filter by extraction priority tier:
    #   "must"   — structural core + key analytics + validation
    #   "should" — adds trends, outliers, pivots, slicers, VBA events
    #   "nice"   — adds frozen panes, shapes, images, ribbon XML
    #   None     — include everything (default)
    priority=None,
)

Priority tiers at a glance

Priority Count Includes
must 27 types WorkbookChunk, SheetChunk, SchemaChunk, RowChunk, SummaryChunk, KPIChunk, FormulaDefinitionChunk, AggregationChunk, ValidationChunk, ErrorChunk, ChartChunk, PivotTableChunk, MacroChunk, VBAModuleChunk, NamedRangeChunk, RelationshipChunk, PowerQueryChunk, ChunkIndexChunk, and more
should 22 types TrendChunk, OutlierChunk, EntityChunk, CellAnnotationChunk, TemporalChunk, LookupMapChunk, DataQualityChunk, ProtectionChunk, PivotFieldChunk, SlicerChunk, TimelineChunk, SparklineChunk, FormControlChunk, ActiveXControlChunk, VBAEventChunk, CustomFunctionChunk, ExternalLinkChunk, DependencyGraphChunk, DataModelChunk, PowerPivotMeasureChunk, ConnectionChunk, PivotCacheChunk
nice 8 types FrozenPaneChunk, ShapeChunk, ImageChunk, RibbonCustomizationChunk, PrintSettingsChunk, ChangeLogChunk, ChartAnnotationChunk, GroupChunk

Convenience Methods

# Workbook overview — just metadata and sheet info
result = extractor.extract_sheets()
# Produces: WorkbookChunk, SheetChunk, SummaryChunk

# Schema discovery — column types and header normalisation
result = extractor.extract_schema()
# Produces: SchemaChunk, TableChunk, ColumnSemanticChunk, HeaderAliasChunk

# Formula audit — every formula, its type, and cross-sheet dependencies
result = extractor.extract_formulas()
# Produces: FormulaDefinitionChunk, FormulaResultChunk, LookupMapChunk

# VBA/macro inventory
result = extractor.extract_vba_only()
# Produces: MacroChunk, VBAModuleChunk, VBAEventChunk, CustomFunctionChunk

ExtractionResult API

result = extractor.extract()

result.chunks                         # list[BaseChunk] — all extracted chunks
result.index                          # ChunkIndexChunk — master manifest
result.summary()                      # dict with total_chunks, type_counts, workbook, created_at

result.to_dicts()                     # list[dict] — all chunks as plain dicts (JSON-safe)
result.by_type("SchemaChunk")         # list[BaseChunk] — filter by chunk_type string
result.by_sheet("Revenue")            # list[BaseChunk] — filter by sheet_name attribute

# Each chunk also has .to_dict()
chunk = result.chunks[0]
chunk.chunk_id                        # UUID string
chunk.chunk_type                      # "WorkbookChunk"
chunk.layer                           # "structural"
chunk.priority                        # "must"
chunk.to_dict()                       # dict

All 57 Chunk Types


Structural — priority: must

Capture the physical layout and raw data of the workbook.

1. WorkbookChunk

One per file. Top-level workbook metadata.

Field Type Description
name str File stem (without extension)
path str Absolute resolved file path
sheet_names list[str] All sheet names in order
author str | None Creator property
created_at datetime | None File creation timestamp
modified_at datetime | None Last modified timestamp
file_size_kb float | None File size in kilobytes
app_version str | None Excel application version
has_macros bool True for .xlsm/.xlsb/.xls
is_shared bool Shared workbook flag

2. SheetChunk

One per worksheet/chartsheet.

Field Type Description
sheet_name str Sheet tab name
sheet_index int 0-based position
sheet_type str worksheet / chart / dialog
is_visible bool Sheet visibility state
used_range str | None e.g. "A1:F100"
tab_color str | None RGB or theme colour
is_protected bool Sheet protection enabled
zoom_level int | None Zoom percentage

3. TableChunk

One per Excel ListObject (Table).

Field Type Description
table_name str Table display name
sheet_name str Host sheet
address str Range reference e.g. "A1:F50"
style str | None Table style name
has_total_row bool Totals row present
has_header_row bool Header row present
col_count int Number of columns
row_count int Number of data rows

4. SchemaChunk

One per sheet — inferred column schema.

Field Type Description
sheet_name str Source sheet
table_name str | None Source table if applicable
columns list[dict] [{name, data_type, null_rate, unique_rate, sample_values}]
header_row_addr str | None e.g. "A1:F1"
inferred_pk str | None Column with 100% unique non-null values

Column data_type values: numeric / date / boolean / text / mixed

5. RowChunk

One per data row (batched for large sheets).

Field Type Description
row_index int 1-based Excel row number
values dict[str, Any] {column_name: cell_value}
parent_table str | None Table name if row belongs to a table
parent_sheet str Sheet name
batch_start int | None First row index in this batch
batch_end int | None Last row index in this batch
is_total_row bool Flagged as a totals/summary row

6. GroupChunk

One per row/column outline group level.

Field Type Description
sheet_name str Host sheet
direction str row or col
level int Outline level (1 = outermost)
start_index int First row/column index
end_index int Last row/column index
is_collapsed bool Group collapsed state
label str | None Optional group label

7. ParentContextChunk

Breadcrumb injected per sheet — anchors RowChunks in their hierarchy.

Field Type Description
workbook str Workbook name
sheet str Sheet name
table str | None Table name
group_path list[str] Outline group path
row_range str | None Row range of this context

8. MergedCellChunk

One per merged cell region.

Field Type Description
sheet_name str Host sheet
merge_range str e.g. "B2:D4"
merged_value Any Value from top-left cell
row_span int Number of rows spanned
col_span int Number of columns spanned

9. FrozenPaneChunk (priority: nice)

One per sheet with frozen panes.

Field Type Description
sheet_name str Host sheet
freeze_row int | None Number of rows frozen from top
freeze_col int | None Number of columns frozen from left
split_horizontal float | None Horizontal split position
split_vertical float | None Vertical split position

Semantic — priority: must/should

Add meaning and context on top of raw data.

10. SummaryChunk

Statistical overview of each sheet/table.

Field Type Description
sheet_name str Source sheet
table_name str | None Source table
row_count int Data row count
col_count int Column count
numeric_cols list[str] Columns detected as numeric
date_cols list[str] Columns detected as date/time
text_cols list[str] Columns detected as text
null_rate float Fraction of null/blank values (0–1)
dupe_rate float Fraction of duplicate values (0–1)
size_kb float | None Approximate data size

11. SemanticNarrativeChunk

Auto-generated plain-English description of a sheet or table.

Field Type Description
scope str sheet or table
scope_name str Sheet/table name
narrative str Human-readable description
subject str | None Main topic inferred
time_scope str | None Detected time range
grain str | None Row grain (e.g. "daily transaction")
confidence float 0–1 confidence in narrative
generated_by str Generator identifier

12. ColumnSemanticChunk

Semantic role for each column.

Field Type Description
sheet_name str Host sheet
col_name str Column header
role str id / measure / dimension / date / flag / freetext
unit str | None Detected unit (e.g. %, kg, MB)
currency str | None Detected currency code (e.g. USD, EUR)
is_pk bool Inferred primary key
is_fk bool Inferred foreign key

13. HeaderAliasChunk

Header normalisation and alias variants.

Field Type Description
sheet_name str Host sheet
original_header str Raw header text
aliases list[str] Common alternative names
abbreviations list[str] Short-form variants
normalized_name str Snake-case normalised name

14. EntityChunk (priority: should)

Detected named entities across text columns.

Field Type Description
workbook str Workbook name
entity_type str person / org / product / location / date
value str Entity string
frequency int Occurrence count
source_columns list[str] Columns where entity appears
canonical str | None Canonical / normalised form

15. CellAnnotationChunk (priority: should)

Cell comments, notes, and tooltips.

Field Type Description
sheet_name str Host sheet
cell_address str e.g. "B3"
annotation_type str comment / note / tooltip
text str Annotation text content
author str | None Comment author
timestamp datetime | None When comment was added
resolved bool Whether comment is resolved

Analytical — priority: must/should

Deep analytics derived from the data.

16. FormulaDefinitionChunk

Every formula cell — source of truth for formula logic.

Field Type Description
sheet_name str Host sheet
cell_address str Cell reference e.g. "B5"
formula_string str Full formula text e.g. "=SUM(A1:A10)"
formula_type str scalar / array / dynamic / lambda
named_refs list[str] Function names used
precedents list[str] All cell/range references in formula

17. FormulaResultChunk

Cached computed value paired with its definition.

Field Type Description
sheet_name str Host sheet
cell_address str Cell reference
computed_value Any | None Last saved computed value
value_type str | None Python type name of value
has_error bool True if cell contains an error
error_type str | None #REF / #DIV0 / #NA / #VALUE / #NAME / #NULL / #NUM
linked_definition_id str | None chunk_id of matching FormulaDefinitionChunk

18. KPIChunk

Key performance indicators detected from label+value cell pairs.

Field Type Description
sheet_name str Host sheet
kpi_name str KPI label (e.g. "Total Revenue")
value Any | None Numeric value
unit str | None Unit of measure
target Any | None Target / budget value
variance float | None Actual minus target
variance_pct float | None Percentage variance
source_cell str | None Cell address of value
period str | None Period label if detected

19. AggregationChunk

Pre-computed aggregations for every numeric column.

Field Type Description
sheet_name str Host sheet
col_name str Column
agg_type str SUM / AVG / MIN / MAX / COUNT / MEDIAN / STDEV
value Any | None Computed result
source_range str | None Cell range used
filter_conditions list[str] Any filter conditions applied

20. TemporalChunk

Time series profile for date columns.

Field Type Description
sheet_name str Host sheet
date_col str Date column name
frequency str | None daily / weekly / monthly / quarterly / yearly
start_date str | None First date
end_date str | None Last date
gap_count int Missing period count
is_sorted bool Whether dates are ascending
fiscal_year_start int | None Month number for fiscal year start

21. TrendChunk (priority: should)

Statistical trend over a numeric column.

Field Type Description
sheet_name str Host sheet
col_name str Column
direction str | None up / down / flat
delta_abs float | None Absolute change (last − first)
delta_pct float | None Percentage change
periods_compared int | None Number of non-null values
regression_slope float | None Linear regression slope

22. OutlierChunk (priority: should)

Statistical outlier in a numeric column.

Field Type Description
sheet_name str Host sheet
col_name str Column
outlier_type str zscore / iqr / blank_spike / dupe
cell_addr str | None Cell address of outlier
value Any | None Outlier value
z_score float | None Z-score (zscore type only)
severity str | None low / medium / high

23. LookupMapChunk (priority: should)

Lookup formula detected in a cell.

Field Type Description
sheet_name str Host sheet
cell_address str Cell containing lookup
lookup_type str VLOOKUP / HLOOKUP / INDEX-MATCH / XLOOKUP
key_col str | None Lookup key column
return_col str | None Return column
lookup_range str | None Source lookup range
match_type str | None Exact / approximate match

Validation / QA — priority: must/should

Data quality and constraint enforcement.

24. ValidationChunk

Data validation rules applied to cell ranges.

Field Type Description
sheet_name str Host sheet
range_addr str Validated range
validation_type str list / decimal / date / whole / custom
allowed_values list[Any] Dropdown values (list type)
formula str | None Custom formula
error_msg str | None User-facing error message

25. ErrorChunk

Excel error values found in cells.

Field Type Description
sheet_name str Host sheet
cell_addr str Cell address
error_type str #REF / #DIV0 / #NA / #VALUE / #NAME / #NULL / #NUM
formula str | None Formula that produced the error
likely_cause str | None Human-readable explanation

26. ConditionalFormatChunk

Conditional formatting rules.

Field Type Description
sheet_name str Host sheet
range_addr str Formatted range
rule_type str | None Rule type (e.g. colorScale, dataBar)
condition_formula str | None Rule formula
threshold Any | None Threshold value
format_applied str | None Format description
business_meaning str | None Inferred business meaning

27. ProtectionChunk (priority: should)

Sheet or range protection settings.

Field Type Description
sheet_name str Host sheet
scope str sheet or range
is_password_protected bool Password set
locked_ranges list[str] Protected ranges
allowed_edit_ranges list[str] Ranges exempt from protection

28. DataQualityChunk (priority: should)

Per-column data quality metrics.

Field Type Description
sheet_name str Host sheet
col_name str Column
blank_count int Blank/null cell count
dupe_count int Duplicate value count
type_mismatch_count int Cells with unexpected type
out_of_range_count int Values outside expected bounds
flagged_cells list[str] Cell addresses with issues

Visual — Charts — priority: must/should

29. ChartChunk

One per embedded chart.

Field Type Description
sheet_name str Host sheet
chart_name str Chart title or generated name
chart_type str bar / line / pie / scatter / area / combo / waterfall / funnel / treemap / map / stock
title str | None Chart title text
x_axis str | None X-axis label
y_axis str | None Y-axis label
series list[dict] [{name, range}] for each data series
sheet_anchor str | None Top-left cell anchor
source_table str | None Source table if identifiable

30. ChartSeriesChunk

One per data series in a chart.

Field Type Description
chart_id str chunk_id of parent ChartChunk
series_name str | None Series legend name
source_range str | None Data range reference
series_type str | None Override type for combo charts
color str | None Series colour
trendline_type str | None linear / exponential / polynomial / etc.
trendline_formula str | None Trendline equation

31. ChartAnnotationChunk (priority: should)

Text boxes, callouts, and data labels on charts.

Field Type Description
chart_id str Parent chart chunk_id
annotation_type str data_label / callout / textbox
text str | None Annotation text
cell_ref str | None Linked cell
position dict | None Position metadata

Visual — Pivot Tables — priority: must/should

32. PivotTableChunk

One per pivot table.

Field Type Description
sheet_name str Host sheet
pivot_name str Pivot table name
source_range str | None Source data range
source_sheet str | None Source data sheet
row_fields list[str] Fields in row area
col_fields list[str] Fields in column area
value_fields list[dict] [{field, agg}] — value fields with aggregation
filter_fields list[str] Report filter fields
report_filter_values dict Active filter values

33. PivotFieldChunk

One per field in a pivot table.

Field Type Description
pivot_id str Parent PivotTableChunk chunk_id
field_name str Field name
field_type str row / col / value / filter
agg_function str | None SUM / COUNT / AVERAGE / etc.
sort_order str | None asc / desc
subtotal_enabled bool Subtotals shown
grouped_by str | None Grouping interval

34. PivotCacheChunk (priority: should)

Pivot table cache metadata.

Field Type Description
pivot_id str Parent PivotTableChunk chunk_id
cache_range str | None Cached data range
last_refreshed datetime | None Last refresh timestamp
record_count int | None Number of cached records
unique_items_per_field dict[str, int] Unique value counts per field

35. SlicerChunk (priority: should)

Slicer objects connected to pivot tables or tables.

Field Type Description
sheet_name str Host sheet
slicer_name str Slicer name
field_name str | None Filtered field
connected_objects list[str] Connected pivot/table names
active_filters list[str] Currently selected filter values
style str | None Slicer style
sheet_anchor str | None Position anchor

36. TimelineChunk (priority: should)

Date timeline slicers.

Field Type Description
sheet_name str Host sheet
timeline_name str Timeline name
date_field str | None Date field being filtered
connected_pivots list[str] Connected pivot table names
active_period str | None Currently selected period
granularity str | None year / quarter / month / day

Visual — Other Objects — priority: should/nice

37. SparklineChunk (priority: should)

Mini in-cell charts.

Field Type Description
sheet_name str Host sheet
sparkline_type str line / column / win_loss
host_cell str Cell containing the sparkline
source_range str Data range
color str | None Sparkline colour
markers bool Markers shown
axis_min float | None Axis minimum
axis_max float | None Axis maximum

38. ShapeChunk (priority: nice)

Drawing shapes and text boxes.

Field Type Description
sheet_name str Host sheet
shape_type str | None Shape type name
text_content str | None Text inside the shape
cell_anchor str | None Top-left cell anchor
position dict {left, top, width, height}
linked_cell str | None Cell linked to shape
shape_name str | None Shape object name

39. ImageChunk (priority: nice)

Embedded images.

Field Type Description
sheet_name str Host sheet
image_type str | None png / jpeg / gif / etc.
anchor_cell str | None Anchor cell
alt_text str | None Accessibility alt text
width_px int | None Width in pixels
height_px int | None Height in pixels
is_linked bool External linked image
source_url str | None Source URL (linked images)

40. FormControlChunk (priority: should)

Legacy form controls (button, checkbox, listbox, etc.).

Field Type Description
sheet_name str Host sheet
control_type str button / checkbox / listbox / combobox / spinner / scrollbar
linked_cell str | None Linked output cell
value Any | None Current value
range_source str | None Input range for listbox/combobox
macro_assigned str | None Assigned macro name

41. ActiveXControlChunk (priority: should)

ActiveX controls.

Field Type Description
sheet_name str Host sheet
control_type str | None ActiveX class identifier
name str | None Control object name
linked_cell str | None Linked cell
properties dict Raw property bag
event_macro str | None Attached event macro
sheet_anchor str | None Position anchor

VBA / Macro — priority: must/should

Requires oletools and a macro-enabled file (.xlsm, .xlsb, .xls).

42. MacroChunk

One per VBA procedure (Sub / Function).

Field Type Description
workbook str Workbook name
macro_name str Procedure name
module_name str | None Containing module
trigger_type str | None button / event / auto / ribbon
trigger_event str | None e.g. Workbook_Open
description str | None Docstring or comment
line_count int Number of lines
scope str workbook or sheet

43. VBAModuleChunk

One per VBA module.

Field Type Description
workbook str Workbook name
module_name str Module name
module_type str standard / class / sheet / workbook
procedure_names list[str] All Subs/Functions in this module
line_count int Total lines of code
references list[str] Library references declared
source_code str | None Full VBA source text

44. VBAEventChunk (priority: should)

Event handler procedures.

Field Type Description
module str Module containing the event
event_name str e.g. Workbook_Open, Worksheet_Change
trigger_condition str | None When the event fires
affected_range str | None Range affected by event
summary str | None What the event handler does

45. CustomFunctionChunk (priority: should)

User-defined functions (UDFs) callable from cells.

Field Type Description
module str Containing module
function_name str UDF name
parameters list[str] Parameter names
return_type str | None Return type if annotated
description str | None Function description
usage_cells list[str] Cells calling this UDF
is_udf bool Always True

46. RibbonCustomizationChunk (priority: nice)

Custom Ribbon UI (customUI XML).

Field Type Description
workbook str Workbook name
tab_name str | None Custom tab label
group_name str | None Group within tab
button_label str | None Button display label
macro_assigned str | None OnAction macro
icon str | None Button icon reference

Cross-Reference / Linkage — priority: must/should

47. RelationshipChunk

Cross-sheet formula dependencies.

Field Type Description
source_sheet str Sheet containing the formula
source_cell str Cell with the cross-reference
target_cell str Referenced cell
target_sheet str Referenced sheet
rel_type str vlookup / index_match / direct_ref / power_query_feed
formula str | None Source formula

48. NamedRangeChunk

Workbook or sheet-scoped named ranges.

Field Type Description
range_name str Defined name
refers_to str Formula or range reference
scope str workbook or sheet name
usage_count int Number of formulas referencing this name
used_in list[str] Cell addresses using this name

49. ExternalLinkChunk (priority: should)

References to other workbooks.

Field Type Description
workbook str Source workbook
source_cell str | None Cell containing the link
target_file str Referenced file path
target_range str | None Referenced range
last_updated datetime | None Last update timestamp
is_broken bool Link cannot be resolved
update_mode str | None Automatic / manual / on-open

50. DependencyGraphChunk (priority: should)

Formula precedent/dependent graph per sheet.

Field Type Description
sheet_name str Host sheet
nodes list[str] All cell addresses in the graph
edges list[dict] [{from, to, type}] directed edges
max_depth int Longest dependency chain
has_circular_ref bool Circular reference detected

Connectivity / Power Features — priority: must/should

51. PowerQueryChunk

Power Query (Get & Transform) queries.

Field Type Description
workbook str Workbook name
query_name str Query name
source_type str | None file / db / api / web / sharepoint
source_path str | None Connection source path
transformation_steps list[str] Applied step names
output_table str | None Output table name
last_refreshed datetime | None Last refresh timestamp

52. DataModelChunk (priority: should)

PowerPivot data model tables and relationships.

Field Type Description
workbook str Workbook name
tables list[str] Table names in the model
relationships list[dict] [{from_table, to_table, on_col}]
dax_measures list[str] DAX measure names
dax_columns list[str] DAX calculated column names

53. PowerPivotMeasureChunk (priority: should)

Individual DAX measure definitions.

Field Type Description
workbook str Workbook name
measure_name str Measure name
dax_expression str | None Full DAX expression
format_string str | None Display format
source_table str | None Host table
used_in_pivots list[str] Pivot tables using this measure

54. ConnectionChunk (priority: should)

External data connection definitions.

Field Type Description
workbook str Workbook name
connection_name str Connection name
connection_type str | None ODBC / OLE / web / sharepoint / odata
connection_string_sanitized str | None Redacted connection string
refresh_schedule str | None Auto-refresh schedule

Operational / Metadata — priority: must/nice

55. PrintSettingsChunk (priority: nice)

Print configuration per sheet.

Field Type Description
sheet_name str Host sheet
print_area str | None Print area range
page_break_rows list[int] Manual page break row indices
header_text str | None Page header text
footer_text str | None Page footer text
orientation str | None portrait or landscape
fit_to_pages bool | None Fit-to-page enabled

56. ChangeLogChunk (priority: nice)

Tracked changes from shared workbook revision history.

Field Type Description
sheet_name str Host sheet
cell_addr str Changed cell
old_value Any | None Previous value
new_value Any | None New value
author str | None Who made the change
changed_at datetime | None When the change was made
change_type str | None edit / insert / delete

57. ChunkIndexChunk (priority: must)

Master manifest — always the last chunk produced.

Field Type Description
workbook_ref str Workbook name
all_chunk_ids list[str] Every chunk_id in extraction order
type_counts dict[str, int] Count of each chunk type produced
source_address_map dict[str, str] {chunk_id: "SheetName!A1"} for cell-level chunks
created_at datetime UTC timestamp of extraction

Format Support Details

Format Parser Tables Charts Pivot VBA Power Query Notes
.xlsx openpyxl Full support
.xlsm openpyxl Macro-enabled
.xlsb pyxlsb + adapter Binary format; limited visual support
.xls xlrd + adapter Legacy format (Excel 97–2003)

VBA extraction requires oletools: pip install oletools

Dependency graphs require networkx: pip install networkx


RAG / LLM Integration

Feeding chunks into a vector store

from omnidoc_excel_extractor import ExcelExtractor

extractor = ExcelExtractor("financials.xlsx")
result = extractor.extract(include_rows=True, priority="must")

documents = []
for chunk in result.chunks:
    d = chunk.to_dict()

    # Build a text representation for embedding
    text_parts = [f"type: {d['chunk_type']}", f"layer: {d['layer']}"]
    for k, v in d.items():
        if k not in {"chunk_id", "chunk_type", "layer", "priority"} and v is not None:
            text_parts.append(f"{k}: {v}")
    text = "\n".join(text_parts)

    documents.append({
        "id":       d["chunk_id"],
        "text":     text,
        "metadata": {
            "chunk_type": d["chunk_type"],
            "layer":      d["layer"],
            "priority":   d["priority"],
            "sheet":      d.get("sheet_name"),
        },
    })

# e.g. with LangChain, LlamaIndex, Pinecone, Qdrant, Weaviate, etc.
# vector_store.add_documents(documents)

Selective extraction for query routing

# Fast schema-only retrieval for "what columns does this file have?"
schema_result = extractor.extract_schema()
schema_dicts = schema_result.to_dicts()

# KPI retrieval for "what is the total revenue?"
kpi_result = extractor.extract(
    chunk_types=["KPIChunk", "AggregationChunk"],
    include_analytics=True,
)

# Formula audit for "explain this formula"
formula_result = extractor.extract_formulas()

Serialising to JSON

import json

result = extractor.extract()
with open("chunks.json", "w") as f:
    json.dump(result.to_dicts(), f, default=str, indent=2)

Advanced Usage

Batch processing multiple files

from pathlib import Path
from omnidoc_excel_extractor import ExcelExtractor

results = {}
for path in Path("./reports").glob("*.xlsx"):
    extractor = ExcelExtractor(str(path))
    result = extractor.extract(include_rows=False)
    results[path.name] = result.summary()

print(results)

Accessing CHUNK_REGISTRY

from omnidoc_excel_extractor import CHUNK_REGISTRY

print(list(CHUNK_REGISTRY.keys()))  # all 57 chunk type names
ChunkClass = CHUNK_REGISTRY["SchemaChunk"]

Constructing a chunk manually

from omnidoc_excel_extractor.chunks.models import KPIChunk

kpi = KPIChunk(
    sheet_name="Dashboard",
    kpi_name="Total Revenue",
    value=1_250_000.0,
    unit="USD",
    target=1_500_000.0,
    variance=-250_000.0,
    variance_pct=-16.67,
    source_cell="B2",
)
print(kpi.to_dict())

Development Setup

# Clone the repository
git clone https://github.com/ganeshkinkargiri/omnidoc-excel-extractor-sdk.git
cd omnidoc-excel-extractor-sdk

# Create a virtual environment
python3 -m venv .venv
source .venv/bin/activate          # Windows: .venv\Scripts\activate

# Install in editable mode with dev dependencies
pip install -e ".[dev]"

# Optional extras
pip install oletools networkx

# Run tests
pytest

# Run tests with coverage
pytest --cov=omnidoc_excel_extractor --cov-report=term-missing

Project structure for contributors

omnidoc_excel_extractor/
├── chunks/models.py      ← Add new chunk class here first
├── builder/<category>.py ← Add builder function here
├── builder/__init__.py   ← Export the builder function
└── core/extractor.py     ← Call the builder inside extract()
tests/
├── conftest.py           ← Shared fixtures (programmatic .xlsx)
└── test_extractor.py     ← Tests

Adding a new chunk type:

  1. Define the Pydantic class in chunks/models.py and add it to CHUNK_REGISTRY.
  2. Write a build_<name>_chunks() function in the appropriate builder file.
  3. Export it from builder/__init__.py.
  4. Call it inside extract() in core/extractor.py.
  5. Add a test in tests/test_extractor.py.

Publishing to PyPI

One-time Setup

1. Create accounts

Registry URL Purpose
TestPyPI https://test.pypi.org/account/register/ Test uploads before going live
PyPI https://pypi.org/account/register/ Production — public installs

2. Generate API tokens

TestPyPI token:

  1. Log in to https://test.pypi.org
  2. Account Settings → API tokens → Add API token
  3. Scope: "Entire account" for first upload, or per-project after
  4. Copy the token (shown only once) — starts with pypi-

PyPI token:

  1. Log in to https://pypi.org
  2. Account Settings → API tokens → Add API token
  3. Copy the token

3. Configure ~/.pypirc

Create or edit ~/.pypirc:

[distutils]
index-servers =
    pypi
    testpypi

[pypi]
repository = https://upload.pypi.org/legacy/
username = __token__
password = pypi-YOUR_PYPI_TOKEN_HERE

[testpypi]
repository = https://test.pypi.org/legacy/
username = __token__
password = pypi-YOUR_TESTPYPI_TOKEN_HERE
chmod 600 ~/.pypirc    # keep token private

4. Install publishing tools

pip install build twine

Build the Package

# Clean previous builds first
rm -rf dist/ build/ *.egg-info

# Build both sdist (.tar.gz) and wheel (.whl)
python -m build

Expected output:

Successfully built omnidoc_excel_extractor-0.1.0.tar.gz
            and omnidoc_excel_extractor-0.1.0-py3-none-any.whl

Validate the build before uploading:

twine check dist/*

You should see PASSED for both files. Fix any warnings before uploading.


Publish to TestPyPI

Always publish to TestPyPI first to catch packaging issues.

twine upload --repository testpypi dist/*

You will be prompted for credentials if ~/.pypirc is not configured.

Verify the TestPyPI install:

# Use a fresh virtual environment
python3 -m venv /tmp/test-install-venv
source /tmp/test-install-venv/bin/activate

# Install from TestPyPI (note: dependencies come from real PyPI)
pip install --index-url https://test.pypi.org/simple/ \
            --extra-index-url https://pypi.org/simple/ \
            omnidoc-excel-extractor

# Quick smoke test
python -c "
from omnidoc_excel_extractor import ExcelExtractor, CHUNK_REGISTRY
print(f'Package imported OK. {len(CHUNK_REGISTRY)} chunk types registered.')
"

Expected: Package imported OK. 57 chunk types registered.


Publish to PyPI (Production)

Once TestPyPI install is confirmed working:

twine upload dist/*

Or explicitly specifying the repository:

twine upload --repository pypi dist/*

Verify the production install:

pip install omnidoc-excel-extractor
python -c "from omnidoc_excel_extractor import ExcelExtractor; print('OK')"

Your package will be live at:

https://pypi.org/project/omnidoc-excel-extractor/

Versioning

Version is declared in pyproject.toml:

[project]
version = "0.1.0"

Follow Semantic Versioning:

Change Version bump Example
Bug fix, no API change Patch 0.1.00.1.1
New chunk type or feature, backward-compatible Minor 0.1.00.2.0
Breaking change to chunk fields or API Major 0.1.01.0.0

Release workflow:

# 1. Bump version in pyproject.toml
#    version = "0.2.0"

# 2. Update CHANGELOG (optional but recommended)

# 3. Commit and tag
git add pyproject.toml
git commit -m "chore: bump version to 0.2.0"
git tag v0.2.0
git push origin master --tags

# 4. Build
rm -rf dist/
python -m build

# 5. Validate
twine check dist/*

# 6. TestPyPI first
twine upload --repository testpypi dist/*

# 7. Verify, then ship
twine upload dist/*

License

MIT — see LICENSE


Built with openpyxl, xlrd, pyxlsb, oletools, pydantic, networkx, and pandas.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

omnidoc_excel_extractor-0.3.1.tar.gz (91.4 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

omnidoc_excel_extractor-0.3.1-py3-none-any.whl (68.8 kB view details)

Uploaded Python 3

File details

Details for the file omnidoc_excel_extractor-0.3.1.tar.gz.

File metadata

  • Download URL: omnidoc_excel_extractor-0.3.1.tar.gz
  • Upload date:
  • Size: 91.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.14.3

File hashes

Hashes for omnidoc_excel_extractor-0.3.1.tar.gz
Algorithm Hash digest
SHA256 d578c5330d42138091e84c82fbdd4b511f030ff0a2c3fedb6d5c993442f59d09
MD5 f962c0c0dfecc56ca773122dd9fbb84f
BLAKE2b-256 9f6342901d69bc04382f724fe6901d8df8126c2cf7dba24ee6d212c33c04427a

See more details on using hashes here.

File details

Details for the file omnidoc_excel_extractor-0.3.1-py3-none-any.whl.

File metadata

File hashes

Hashes for omnidoc_excel_extractor-0.3.1-py3-none-any.whl
Algorithm Hash digest
SHA256 fb9d9251fefe701cc4b8fee4205e6b4d573c285458c6b07caf8d2b45474db44d
MD5 9fbc5afe867b1be170d3cc8597dfbfe7
BLAKE2b-256 f0003879698ba7e90b7f8e6002aad4e4c839a499cc9fd880df27341560290b45

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page