Comprehensive Excel extraction SDK — structured chunks for AI/LLM/RAG pipelines

These details have not been verified by PyPI

Project links

Project description

omnidoc-excel-extractor

Comprehensive Excel extraction SDK — 57 structured chunk types for AI, LLM, and RAG pipelines.

Supports .xlsx · .xlsm · .xlsb · .xls

Overview
Architecture
Installation
Quick Start
Extraction Options
Convenience Methods
ExtractionResult API
All 57 Chunk Types
Format Support Details
RAG / LLM Integration
Advanced Usage
Development Setup
Publishing to PyPI
License

Overview

omnidoc-excel-extractor converts any Excel file into a structured list of chunks — typed, JSON-serialisable Pydantic objects that capture every layer of meaning in the workbook: raw data, schema, formulas, KPIs, charts, pivot tables, VBA macros, cross-sheet relationships, Power Query, and more.

Each chunk carries:

Field	Description
`chunk_id`	UUID — globally unique identifier
`chunk_type`	String name of the chunk class
`layer`	`structural` / `semantic` / `analytical` / `validation` / `visual` / `vba` / `cross_ref` / `connectivity` / `operational`
`priority`	`must` / `should` / `nice` — extraction importance tier
(type-specific fields)	All fields defined per chunk type

Why chunks? LLM context windows are finite. Chunking lets you embed, retrieve, and inject only the Excel knowledge relevant to each query — instead of dumping raw cell values into a prompt.

Architecture

omnidoc_excel_extractor/
│
├── __init__.py                  ← public API (ExcelExtractor, ExtractionResult)
│
├── chunks/
│   └── models.py                ← 57 Pydantic v2 chunk models + CHUNK_REGISTRY
│
├── core/
│   ├── extractor.py             ← orchestrator — runs every builder in order
│   ├── workbook_parser.py       ← workbook metadata, named ranges, external links
│   ├── sheet_parser.py          ← rows, merged cells, groups, validations, comments
│   ├── formula_extractor.py     ← formula detection, classification, lookup/agg detection
│   └── vba_extractor.py         ← VBA module extraction via oletools
│
├── builder/
│   ├── structural.py            ← chunks 1-9
│   ├── semantic.py              ← chunks 10-15
│   ├── analytical.py            ← chunks 16-23
│   ├── validation.py            ← chunks 24-28
│   ├── visual.py                ← chunks 29-41
│   ├── vba.py                   ← chunks 42-46
│   ├── crossref.py              ← chunks 47-50
│   ├── connectivity.py          ← chunks 51-54
│   └── operational.py           ← chunks 55-57
│
├── utils/
│   ├── cell_utils.py            ← cell address parsing, range iteration
│   ├── type_inference.py        ← dtype detection, semantic role, normalisation
│   └── stats.py                 ← aggregations, outlier detection, trend, frequency
│
└── _adapters/
    ├── xlrd_adapter.py          ← openpyxl-compatible wrapper for .xls (xlrd)
    └── xlsb_adapter.py          ← openpyxl-compatible wrapper for .xlsb (pyxlsb)

Extraction pipeline:

Excel file
    │
    ▼
_load_workbook()  ──► openpyxl (xlsx/xlsm)
                  ──► xlrd adapter  (xls)
                  ──► pyxlsb adapter (xlsb)
    │
    ▼
builder/* ──► 57 chunk builders run in priority order
    │
    ▼
ExtractionResult (chunks: list[BaseChunk], index: ChunkIndexChunk)

Installation

pip install omnidoc-excel-extractor

Optional — VBA/macro extraction (.xlsm, .xlsb, .xls):

pip install oletools

Full install with all optional extras:

pip install omnidoc-excel-extractor oletools networkx

Python version: 3.9+

Quick Start

from omnidoc_excel_extractor import ExcelExtractor

extractor = ExcelExtractor("report.xlsx")
result = extractor.extract()

# High-level summary
print(result.summary())
# {
#   'total_chunks': 412,
#   'type_counts': {
#     'WorkbookChunk': 1, 'SheetChunk': 4, 'TableChunk': 2,
#     'SchemaChunk': 4, 'RowChunk': 1800, 'FormulaDefinitionChunk': 37,
#     'KPIChunk': 5, 'ChartChunk': 3, ...
#   },
#   'workbook': 'report',
#   'created_at': '2026-05-16T10:00:00'
# }

# All chunks as JSON-serialisable dicts
dicts = result.to_dicts()

# Filter by type
schemas  = result.by_type("SchemaChunk")
formulas = result.by_type("FormulaDefinitionChunk")

# Filter by sheet
sales_chunks = result.by_sheet("Sales")

# Access the master manifest
index = result.index   # ChunkIndexChunk
print(index.type_counts)
print(index.all_chunk_ids)

Extraction Options

result = extractor.extract(
    # Selectively include only specific chunk types (list of class name strings)
    chunk_types=["WorkbookChunk", "SchemaChunk", "FormulaDefinitionChunk"],

    # Include RowChunks — can produce thousands of chunks for large sheets
    include_rows=True,

    # Include analytical chunks: AggregationChunk, TrendChunk, OutlierChunk,
    # KPIChunk, TemporalChunk, FormulaDefinitionChunk, FormulaResultChunk
    include_analytics=True,

    # Include visual chunks: ChartChunk, PivotTableChunk, SparklineChunk,
    # SlicerChunk, TimelineChunk, ShapeChunk, ImageChunk, FormControlChunk
    include_visual=True,

    # Include VBA chunks (only activates for .xlsm / .xlsb / .xls files
    # and requires oletools to be installed)
    include_vba=True,

    # Number of data rows per RowChunk batch
    row_batch_size=500,

    # Filter by extraction priority tier:
    #   "must"   — structural core + key analytics + validation
    #   "should" — adds trends, outliers, pivots, slicers, VBA events
    #   "nice"   — adds frozen panes, shapes, images, ribbon XML
    #   None     — include everything (default)
    priority=None,
)

Priority tiers at a glance

Priority	Count	Includes
`must`	27 types	WorkbookChunk, SheetChunk, SchemaChunk, RowChunk, SummaryChunk, KPIChunk, FormulaDefinitionChunk, AggregationChunk, ValidationChunk, ErrorChunk, ChartChunk, PivotTableChunk, MacroChunk, VBAModuleChunk, NamedRangeChunk, RelationshipChunk, PowerQueryChunk, ChunkIndexChunk, and more
`should`	22 types	TrendChunk, OutlierChunk, EntityChunk, CellAnnotationChunk, TemporalChunk, LookupMapChunk, DataQualityChunk, ProtectionChunk, PivotFieldChunk, SlicerChunk, TimelineChunk, SparklineChunk, FormControlChunk, ActiveXControlChunk, VBAEventChunk, CustomFunctionChunk, ExternalLinkChunk, DependencyGraphChunk, DataModelChunk, PowerPivotMeasureChunk, ConnectionChunk, PivotCacheChunk
`nice`	8 types	FrozenPaneChunk, ShapeChunk, ImageChunk, RibbonCustomizationChunk, PrintSettingsChunk, ChangeLogChunk, ChartAnnotationChunk, GroupChunk

Convenience Methods

# Workbook overview — just metadata and sheet info
result = extractor.extract_sheets()
# Produces: WorkbookChunk, SheetChunk, SummaryChunk

# Schema discovery — column types and header normalisation
result = extractor.extract_schema()
# Produces: SchemaChunk, TableChunk, ColumnSemanticChunk, HeaderAliasChunk

# Formula audit — every formula, its type, and cross-sheet dependencies
result = extractor.extract_formulas()
# Produces: FormulaDefinitionChunk, FormulaResultChunk, LookupMapChunk

# VBA/macro inventory
result = extractor.extract_vba_only()
# Produces: MacroChunk, VBAModuleChunk, VBAEventChunk, CustomFunctionChunk

ExtractionResult API

result = extractor.extract()

result.chunks                         # list[BaseChunk] — all extracted chunks
result.index                          # ChunkIndexChunk — master manifest
result.summary()                      # dict with total_chunks, type_counts, workbook, created_at

result.to_dicts()                     # list[dict] — all chunks as plain dicts (JSON-safe)
result.by_type("SchemaChunk")         # list[BaseChunk] — filter by chunk_type string
result.by_sheet("Revenue")            # list[BaseChunk] — filter by sheet_name attribute

# Each chunk also has .to_dict()
chunk = result.chunks[0]
chunk.chunk_id                        # UUID string
chunk.chunk_type                      # "WorkbookChunk"
chunk.layer                           # "structural"
chunk.priority                        # "must"
chunk.to_dict()                       # dict

All 57 Chunk Types

Structural — `priority: must`

Capture the physical layout and raw data of the workbook.

1. `WorkbookChunk`

One per file. Top-level workbook metadata.

Field	Type	Description
`name`	str	File stem (without extension)
`path`	str	Absolute resolved file path
`sheet_names`	list[str]	All sheet names in order
`author`	str \| None	Creator property
`created_at`	datetime \| None	File creation timestamp
`modified_at`	datetime \| None	Last modified timestamp
`file_size_kb`	float \| None	File size in kilobytes
`app_version`	str \| None	Excel application version
`has_macros`	bool	True for .xlsm/.xlsb/.xls
`is_shared`	bool	Shared workbook flag

2. `SheetChunk`

One per worksheet/chartsheet.

Field	Type	Description
`sheet_name`	str	Sheet tab name
`sheet_index`	int	0-based position
`sheet_type`	str	`worksheet` / `chart` / `dialog`
`is_visible`	bool	Sheet visibility state
`used_range`	str \| None	e.g. `"A1:F100"`
`tab_color`	str \| None	RGB or theme colour
`is_protected`	bool	Sheet protection enabled
`zoom_level`	int \| None	Zoom percentage

3. `TableChunk`

One per Excel ListObject (Table).

Field	Type	Description
`table_name`	str	Table display name
`sheet_name`	str	Host sheet
`address`	str	Range reference e.g. `"A1:F50"`
`style`	str \| None	Table style name
`has_total_row`	bool	Totals row present
`has_header_row`	bool	Header row present
`col_count`	int	Number of columns
`row_count`	int	Number of data rows

4. `SchemaChunk`

One per sheet — inferred column schema.

Field	Type	Description
`sheet_name`	str	Source sheet
`table_name`	str \| None	Source table if applicable
`columns`	list[dict]	`[{name, data_type, null_rate, unique_rate, sample_values}]`
`header_row_addr`	str \| None	e.g. `"A1:F1"`
`inferred_pk`	str \| None	Column with 100% unique non-null values

Column data_type values: numeric / date / boolean / text / mixed

5. `RowChunk`

One per data row (batched for large sheets).

Field	Type	Description
`row_index`	int	1-based Excel row number
`values`	dict[str, Any]	`{column_name: cell_value}`
`parent_table`	str \| None	Table name if row belongs to a table
`parent_sheet`	str	Sheet name
`batch_start`	int \| None	First row index in this batch
`batch_end`	int \| None	Last row index in this batch
`is_total_row`	bool	Flagged as a totals/summary row

6. `GroupChunk`

One per row/column outline group level.

Field	Type	Description
`sheet_name`	str	Host sheet
`direction`	str	`row` or `col`
`level`	int	Outline level (1 = outermost)
`start_index`	int	First row/column index
`end_index`	int	Last row/column index
`is_collapsed`	bool	Group collapsed state
`label`	str \| None	Optional group label

7. `ParentContextChunk`

Breadcrumb injected per sheet — anchors RowChunks in their hierarchy.

Field	Type	Description
`workbook`	str	Workbook name
`sheet`	str	Sheet name
`table`	str \| None	Table name
`group_path`	list[str]	Outline group path
`row_range`	str \| None	Row range of this context

8. `MergedCellChunk`

One per merged cell region.

Field	Type	Description
`sheet_name`	str	Host sheet
`merge_range`	str	e.g. `"B2:D4"`
`merged_value`	Any	Value from top-left cell
`row_span`	int	Number of rows spanned
`col_span`	int	Number of columns spanned

9. `FrozenPaneChunk` (priority: nice)

One per sheet with frozen panes.

Field	Type	Description
`sheet_name`	str	Host sheet
`freeze_row`	int \| None	Number of rows frozen from top
`freeze_col`	int \| None	Number of columns frozen from left
`split_horizontal`	float \| None	Horizontal split position
`split_vertical`	float \| None	Vertical split position

Semantic — `priority: must/should`

Add meaning and context on top of raw data.

10. `SummaryChunk`

Statistical overview of each sheet/table.

Field	Type	Description
`sheet_name`	str	Source sheet
`table_name`	str \| None	Source table
`row_count`	int	Data row count
`col_count`	int	Column count
`numeric_cols`	list[str]	Columns detected as numeric
`date_cols`	list[str]	Columns detected as date/time
`text_cols`	list[str]	Columns detected as text
`null_rate`	float	Fraction of null/blank values (0–1)
`dupe_rate`	float	Fraction of duplicate values (0–1)
`size_kb`	float \| None	Approximate data size

11. `SemanticNarrativeChunk`

Auto-generated plain-English description of a sheet or table.

Field	Type	Description
`scope`	str	`sheet` or `table`
`scope_name`	str	Sheet/table name
`narrative`	str	Human-readable description
`subject`	str \| None	Main topic inferred
`time_scope`	str \| None	Detected time range
`grain`	str \| None	Row grain (e.g. "daily transaction")
`confidence`	float	0–1 confidence in narrative
`generated_by`	str	Generator identifier

12. `ColumnSemanticChunk`

Semantic role for each column.

Field	Type	Description
`sheet_name`	str	Host sheet
`col_name`	str	Column header
`role`	str	`id` / `measure` / `dimension` / `date` / `flag` / `freetext`
`unit`	str \| None	Detected unit (e.g. `%`, `kg`, `MB`)
`currency`	str \| None	Detected currency code (e.g. `USD`, `EUR`)
`is_pk`	bool	Inferred primary key
`is_fk`	bool	Inferred foreign key

13. `HeaderAliasChunk`

Header normalisation and alias variants.

Field	Type	Description
`sheet_name`	str	Host sheet
`original_header`	str	Raw header text
`aliases`	list[str]	Common alternative names
`abbreviations`	list[str]	Short-form variants
`normalized_name`	str	Snake-case normalised name

14. `EntityChunk` (priority: should)

Detected named entities across text columns.

Field	Type	Description
`workbook`	str	Workbook name
`entity_type`	str	`person` / `org` / `product` / `location` / `date`
`value`	str	Entity string
`frequency`	int	Occurrence count
`source_columns`	list[str]	Columns where entity appears
`canonical`	str \| None	Canonical / normalised form

15. `CellAnnotationChunk` (priority: should)

Cell comments, notes, and tooltips.

Field	Type	Description
`sheet_name`	str	Host sheet
`cell_address`	str	e.g. `"B3"`
`annotation_type`	str	`comment` / `note` / `tooltip`
`text`	str	Annotation text content
`author`	str \| None	Comment author
`timestamp`	datetime \| None	When comment was added
`resolved`	bool	Whether comment is resolved

Analytical — `priority: must/should`

Deep analytics derived from the data.

16. `FormulaDefinitionChunk`

Every formula cell — source of truth for formula logic.

Field	Type	Description
`sheet_name`	str	Host sheet
`cell_address`	str	Cell reference e.g. `"B5"`
`formula_string`	str	Full formula text e.g. `"=SUM(A1:A10)"`
`formula_type`	str	`scalar` / `array` / `dynamic` / `lambda`
`named_refs`	list[str]	Function names used
`precedents`	list[str]	All cell/range references in formula

17. `FormulaResultChunk`

Cached computed value paired with its definition.

Field	Type	Description
`sheet_name`	str	Host sheet
`cell_address`	str	Cell reference
`computed_value`	Any \| None	Last saved computed value
`value_type`	str \| None	Python type name of value
`has_error`	bool	True if cell contains an error
`error_type`	str \| None	`#REF` / `#DIV0` / `#NA` / `#VALUE` / `#NAME` / `#NULL` / `#NUM`
`linked_definition_id`	str \| None	`chunk_id` of matching FormulaDefinitionChunk

18. `KPIChunk`

Key performance indicators detected from label+value cell pairs.

Field	Type	Description
`sheet_name`	str	Host sheet
`kpi_name`	str	KPI label (e.g. "Total Revenue")
`value`	Any \| None	Numeric value
`unit`	str \| None	Unit of measure
`target`	Any \| None	Target / budget value
`variance`	float \| None	Actual minus target
`variance_pct`	float \| None	Percentage variance
`source_cell`	str \| None	Cell address of value
`period`	str \| None	Period label if detected

19. `AggregationChunk`

Pre-computed aggregations for every numeric column.

Field	Type	Description
`sheet_name`	str	Host sheet
`col_name`	str	Column
`agg_type`	str	`SUM` / `AVG` / `MIN` / `MAX` / `COUNT` / `MEDIAN` / `STDEV`
`value`	Any \| None	Computed result
`source_range`	str \| None	Cell range used
`filter_conditions`	list[str]	Any filter conditions applied

20. `TemporalChunk`

Time series profile for date columns.

Field	Type	Description
`sheet_name`	str	Host sheet
`date_col`	str	Date column name
`frequency`	str \| None	`daily` / `weekly` / `monthly` / `quarterly` / `yearly`
`start_date`	str \| None	First date
`end_date`	str \| None	Last date
`gap_count`	int	Missing period count
`is_sorted`	bool	Whether dates are ascending
`fiscal_year_start`	int \| None	Month number for fiscal year start

21. `TrendChunk` (priority: should)

Statistical trend over a numeric column.

Field	Type	Description
`sheet_name`	str	Host sheet
`col_name`	str	Column
`direction`	str \| None	`up` / `down` / `flat`
`delta_abs`	float \| None	Absolute change (last − first)
`delta_pct`	float \| None	Percentage change
`periods_compared`	int \| None	Number of non-null values
`regression_slope`	float \| None	Linear regression slope

22. `OutlierChunk` (priority: should)

Statistical outlier in a numeric column.

Field	Type	Description
`sheet_name`	str	Host sheet
`col_name`	str	Column
`outlier_type`	str	`zscore` / `iqr` / `blank_spike` / `dupe`
`cell_addr`	str \| None	Cell address of outlier
`value`	Any \| None	Outlier value
`z_score`	float \| None	Z-score (zscore type only)
`severity`	str \| None	`low` / `medium` / `high`

23. `LookupMapChunk` (priority: should)

Lookup formula detected in a cell.

Field	Type	Description
`sheet_name`	str	Host sheet
`cell_address`	str	Cell containing lookup
`lookup_type`	str	`VLOOKUP` / `HLOOKUP` / `INDEX-MATCH` / `XLOOKUP`
`key_col`	str \| None	Lookup key column
`return_col`	str \| None	Return column
`lookup_range`	str \| None	Source lookup range
`match_type`	str \| None	Exact / approximate match

Validation / QA — `priority: must/should`

Data quality and constraint enforcement.

24. `ValidationChunk`

Data validation rules applied to cell ranges.

Field	Type	Description
`sheet_name`	str	Host sheet
`range_addr`	str	Validated range
`validation_type`	str	`list` / `decimal` / `date` / `whole` / `custom`
`allowed_values`	list[Any]	Dropdown values (list type)
`formula`	str \| None	Custom formula
`error_msg`	str \| None	User-facing error message

25. `ErrorChunk`

Excel error values found in cells.

Field	Type	Description
`sheet_name`	str	Host sheet
`cell_addr`	str	Cell address
`error_type`	str	`#REF` / `#DIV0` / `#NA` / `#VALUE` / `#NAME` / `#NULL` / `#NUM`
`formula`	str \| None	Formula that produced the error
`likely_cause`	str \| None	Human-readable explanation

26. `ConditionalFormatChunk`

Conditional formatting rules.

Field	Type	Description
`sheet_name`	str	Host sheet
`range_addr`	str	Formatted range
`rule_type`	str \| None	Rule type (e.g. `colorScale`, `dataBar`)
`condition_formula`	str \| None	Rule formula
`threshold`	Any \| None	Threshold value
`format_applied`	str \| None	Format description
`business_meaning`	str \| None	Inferred business meaning

27. `ProtectionChunk` (priority: should)

Sheet or range protection settings.

Field	Type	Description
`sheet_name`	str	Host sheet
`scope`	str	`sheet` or `range`
`is_password_protected`	bool	Password set
`locked_ranges`	list[str]	Protected ranges
`allowed_edit_ranges`	list[str]	Ranges exempt from protection

28. `DataQualityChunk` (priority: should)

Per-column data quality metrics.

Field	Type	Description
`sheet_name`	str	Host sheet
`col_name`	str	Column
`blank_count`	int	Blank/null cell count
`dupe_count`	int	Duplicate value count
`type_mismatch_count`	int	Cells with unexpected type
`out_of_range_count`	int	Values outside expected bounds
`flagged_cells`	list[str]	Cell addresses with issues

Visual — Charts — `priority: must/should`

29. `ChartChunk`

One per embedded chart.

Field	Type	Description
`sheet_name`	str	Host sheet
`chart_name`	str	Chart title or generated name
`chart_type`	str	`bar` / `line` / `pie` / `scatter` / `area` / `combo` / `waterfall` / `funnel` / `treemap` / `map` / `stock`
`title`	str \| None	Chart title text
`x_axis`	str \| None	X-axis label
`y_axis`	str \| None	Y-axis label
`series`	list[dict]	`[{name, range}]` for each data series
`sheet_anchor`	str \| None	Top-left cell anchor
`source_table`	str \| None	Source table if identifiable

30. `ChartSeriesChunk`

One per data series in a chart.

Field	Type	Description
`chart_id`	str	`chunk_id` of parent ChartChunk
`series_name`	str \| None	Series legend name
`source_range`	str \| None	Data range reference
`series_type`	str \| None	Override type for combo charts
`color`	str \| None	Series colour
`trendline_type`	str \| None	`linear` / `exponential` / `polynomial` / etc.
`trendline_formula`	str \| None	Trendline equation

31. `ChartAnnotationChunk` (priority: should)

Text boxes, callouts, and data labels on charts.

Field	Type	Description
`chart_id`	str	Parent chart `chunk_id`
`annotation_type`	str	`data_label` / `callout` / `textbox`
`text`	str \| None	Annotation text
`cell_ref`	str \| None	Linked cell
`position`	dict \| None	Position metadata

Visual — Pivot Tables — `priority: must/should`

32. `PivotTableChunk`

One per pivot table.

Field	Type	Description
`sheet_name`	str	Host sheet
`pivot_name`	str	Pivot table name
`source_range`	str \| None	Source data range
`source_sheet`	str \| None	Source data sheet
`row_fields`	list[str]	Fields in row area
`col_fields`	list[str]	Fields in column area
`value_fields`	list[dict]	`[{field, agg}]` — value fields with aggregation
`filter_fields`	list[str]	Report filter fields
`report_filter_values`	dict	Active filter values

33. `PivotFieldChunk`

One per field in a pivot table.

Field	Type	Description
`pivot_id`	str	Parent PivotTableChunk `chunk_id`
`field_name`	str	Field name
`field_type`	str	`row` / `col` / `value` / `filter`
`agg_function`	str \| None	`SUM` / `COUNT` / `AVERAGE` / etc.
`sort_order`	str \| None	`asc` / `desc`
`subtotal_enabled`	bool	Subtotals shown
`grouped_by`	str \| None	Grouping interval

34. `PivotCacheChunk` (priority: should)

Pivot table cache metadata.

Field	Type	Description
`pivot_id`	str	Parent PivotTableChunk `chunk_id`
`cache_range`	str \| None	Cached data range
`last_refreshed`	datetime \| None	Last refresh timestamp
`record_count`	int \| None	Number of cached records
`unique_items_per_field`	dict[str, int]	Unique value counts per field

35. `SlicerChunk` (priority: should)

Slicer objects connected to pivot tables or tables.

Field	Type	Description
`sheet_name`	str	Host sheet
`slicer_name`	str	Slicer name
`field_name`	str \| None	Filtered field
`connected_objects`	list[str]	Connected pivot/table names
`active_filters`	list[str]	Currently selected filter values
`style`	str \| None	Slicer style
`sheet_anchor`	str \| None	Position anchor

36. `TimelineChunk` (priority: should)

Date timeline slicers.

Field	Type	Description
`sheet_name`	str	Host sheet
`timeline_name`	str	Timeline name
`date_field`	str \| None	Date field being filtered
`connected_pivots`	list[str]	Connected pivot table names
`active_period`	str \| None	Currently selected period
`granularity`	str \| None	`year` / `quarter` / `month` / `day`

Visual — Other Objects — `priority: should/nice`

37. `SparklineChunk` (priority: should)

Mini in-cell charts.

Field	Type	Description
`sheet_name`	str	Host sheet
`sparkline_type`	str	`line` / `column` / `win_loss`
`host_cell`	str	Cell containing the sparkline
`source_range`	str	Data range
`color`	str \| None	Sparkline colour
`markers`	bool	Markers shown
`axis_min`	float \| None	Axis minimum
`axis_max`	float \| None	Axis maximum

38. `ShapeChunk` (priority: nice)

Drawing shapes and text boxes.

Field	Type	Description
`sheet_name`	str	Host sheet
`shape_type`	str \| None	Shape type name
`text_content`	str \| None	Text inside the shape
`cell_anchor`	str \| None	Top-left cell anchor
`position`	dict	`{left, top, width, height}`
`linked_cell`	str \| None	Cell linked to shape
`shape_name`	str \| None	Shape object name

39. `ImageChunk` (priority: nice)

Embedded images.

Field	Type	Description
`sheet_name`	str	Host sheet
`image_type`	str \| None	`png` / `jpeg` / `gif` / etc.
`anchor_cell`	str \| None	Anchor cell
`alt_text`	str \| None	Accessibility alt text
`width_px`	int \| None	Width in pixels
`height_px`	int \| None	Height in pixels
`is_linked`	bool	External linked image
`source_url`	str \| None	Source URL (linked images)

40. `FormControlChunk` (priority: should)

Legacy form controls (button, checkbox, listbox, etc.).

Field	Type	Description
`sheet_name`	str	Host sheet
`control_type`	str	`button` / `checkbox` / `listbox` / `combobox` / `spinner` / `scrollbar`
`linked_cell`	str \| None	Linked output cell
`value`	Any \| None	Current value
`range_source`	str \| None	Input range for listbox/combobox
`macro_assigned`	str \| None	Assigned macro name

41. `ActiveXControlChunk` (priority: should)

ActiveX controls.

Field	Type	Description
`sheet_name`	str	Host sheet
`control_type`	str \| None	ActiveX class identifier
`name`	str \| None	Control object name
`linked_cell`	str \| None	Linked cell
`properties`	dict	Raw property bag
`event_macro`	str \| None	Attached event macro
`sheet_anchor`	str \| None	Position anchor

VBA / Macro — `priority: must/should`

Requires oletools and a macro-enabled file (.xlsm, .xlsb, .xls).

42. `MacroChunk`

One per VBA procedure (Sub / Function).

Field	Type	Description
`workbook`	str	Workbook name
`macro_name`	str	Procedure name
`module_name`	str \| None	Containing module
`trigger_type`	str \| None	`button` / `event` / `auto` / `ribbon`
`trigger_event`	str \| None	e.g. `Workbook_Open`
`description`	str \| None	Docstring or comment
`line_count`	int	Number of lines
`scope`	str	`workbook` or `sheet`

43. `VBAModuleChunk`

One per VBA module.

Field	Type	Description
`workbook`	str	Workbook name
`module_name`	str	Module name
`module_type`	str	`standard` / `class` / `sheet` / `workbook`
`procedure_names`	list[str]	All Subs/Functions in this module
`line_count`	int	Total lines of code
`references`	list[str]	Library references declared
`source_code`	str \| None	Full VBA source text

44. `VBAEventChunk` (priority: should)

Event handler procedures.

Field	Type	Description
`module`	str	Module containing the event
`event_name`	str	e.g. `Workbook_Open`, `Worksheet_Change`
`trigger_condition`	str \| None	When the event fires
`affected_range`	str \| None	Range affected by event
`summary`	str \| None	What the event handler does

45. `CustomFunctionChunk` (priority: should)

User-defined functions (UDFs) callable from cells.

Field	Type	Description
`module`	str	Containing module
`function_name`	str	UDF name
`parameters`	list[str]	Parameter names
`return_type`	str \| None	Return type if annotated
`description`	str \| None	Function description
`usage_cells`	list[str]	Cells calling this UDF
`is_udf`	bool	Always True

46. `RibbonCustomizationChunk` (priority: nice)

Custom Ribbon UI (customUI XML).

Field	Type	Description
`workbook`	str	Workbook name
`tab_name`	str \| None	Custom tab label
`group_name`	str \| None	Group within tab
`button_label`	str \| None	Button display label
`macro_assigned`	str \| None	OnAction macro
`icon`	str \| None	Button icon reference

Cross-Reference / Linkage — `priority: must/should`

47. `RelationshipChunk`

Cross-sheet formula dependencies.

Field	Type	Description
`source_sheet`	str	Sheet containing the formula
`source_cell`	str	Cell with the cross-reference
`target_cell`	str	Referenced cell
`target_sheet`	str	Referenced sheet
`rel_type`	str	`vlookup` / `index_match` / `direct_ref` / `power_query_feed`
`formula`	str \| None	Source formula

48. `NamedRangeChunk`

Workbook or sheet-scoped named ranges.

Field	Type	Description
`range_name`	str	Defined name
`refers_to`	str	Formula or range reference
`scope`	str	`workbook` or sheet name
`usage_count`	int	Number of formulas referencing this name
`used_in`	list[str]	Cell addresses using this name

49. `ExternalLinkChunk` (priority: should)

References to other workbooks.

Field	Type	Description
`workbook`	str	Source workbook
`source_cell`	str \| None	Cell containing the link
`target_file`	str	Referenced file path
`target_range`	str \| None	Referenced range
`last_updated`	datetime \| None	Last update timestamp
`is_broken`	bool	Link cannot be resolved
`update_mode`	str \| None	Automatic / manual / on-open

50. `DependencyGraphChunk` (priority: should)

Formula precedent/dependent graph per sheet.

Field	Type	Description
`sheet_name`	str	Host sheet
`nodes`	list[str]	All cell addresses in the graph
`edges`	list[dict]	`[{from, to, type}]` directed edges
`max_depth`	int	Longest dependency chain
`has_circular_ref`	bool	Circular reference detected

Connectivity / Power Features — `priority: must/should`

51. `PowerQueryChunk`

Power Query (Get & Transform) queries.

Field	Type	Description
`workbook`	str	Workbook name
`query_name`	str	Query name
`source_type`	str \| None	`file` / `db` / `api` / `web` / `sharepoint`
`source_path`	str \| None	Connection source path
`transformation_steps`	list[str]	Applied step names
`output_table`	str \| None	Output table name
`last_refreshed`	datetime \| None	Last refresh timestamp

52. `DataModelChunk` (priority: should)

PowerPivot data model tables and relationships.

Field	Type	Description
`workbook`	str	Workbook name
`tables`	list[str]	Table names in the model
`relationships`	list[dict]	`[{from_table, to_table, on_col}]`
`dax_measures`	list[str]	DAX measure names
`dax_columns`	list[str]	DAX calculated column names

53. `PowerPivotMeasureChunk` (priority: should)

Individual DAX measure definitions.

Field	Type	Description
`workbook`	str	Workbook name
`measure_name`	str	Measure name
`dax_expression`	str \| None	Full DAX expression
`format_string`	str \| None	Display format
`source_table`	str \| None	Host table
`used_in_pivots`	list[str]	Pivot tables using this measure

54. `ConnectionChunk` (priority: should)

External data connection definitions.

Field	Type	Description
`workbook`	str	Workbook name
`connection_name`	str	Connection name
`connection_type`	str \| None	`ODBC` / `OLE` / `web` / `sharepoint` / `odata`
`connection_string_sanitized`	str \| None	Redacted connection string
`refresh_schedule`	str \| None	Auto-refresh schedule

Operational / Metadata — `priority: must/nice`

55. `PrintSettingsChunk` (priority: nice)

Print configuration per sheet.

Field	Type	Description
`sheet_name`	str	Host sheet
`print_area`	str \| None	Print area range
`page_break_rows`	list[int]	Manual page break row indices
`header_text`	str \| None	Page header text
`footer_text`	str \| None	Page footer text
`orientation`	str \| None	`portrait` or `landscape`
`fit_to_pages`	bool \| None	Fit-to-page enabled

56. `ChangeLogChunk` (priority: nice)

Tracked changes from shared workbook revision history.

Field	Type	Description
`sheet_name`	str	Host sheet
`cell_addr`	str	Changed cell
`old_value`	Any \| None	Previous value
`new_value`	Any \| None	New value
`author`	str \| None	Who made the change
`changed_at`	datetime \| None	When the change was made
`change_type`	str \| None	`edit` / `insert` / `delete`

57. `ChunkIndexChunk` (priority: must)

Master manifest — always the last chunk produced.

Field	Type	Description
`workbook_ref`	str	Workbook name
`all_chunk_ids`	list[str]	Every `chunk_id` in extraction order
`type_counts`	dict[str, int]	Count of each chunk type produced
`source_address_map`	dict[str, str]	`{chunk_id: "SheetName!A1"}` for cell-level chunks
`created_at`	datetime	UTC timestamp of extraction

Format Support Details

Format	Parser	Tables	Charts	Pivot	VBA	Power Query	Notes
`.xlsx`	openpyxl	✓	✓	✓	—	✓	Full support
`.xlsm`	openpyxl	✓	✓	✓	✓	✓	Macro-enabled
`.xlsb`	pyxlsb + adapter	✓	—	—	✓	—	Binary format; limited visual support
`.xls`	xlrd + adapter	✓	—	—	✓	—	Legacy format (Excel 97–2003)

VBA extraction requires oletools: pip install oletools

Dependency graphs require networkx: pip install networkx

RAG / LLM Integration

Feeding chunks into a vector store

from omnidoc_excel_extractor import ExcelExtractor

extractor = ExcelExtractor("financials.xlsx")
result = extractor.extract(include_rows=True, priority="must")

documents = []
for chunk in result.chunks:
    d = chunk.to_dict()

    # Build a text representation for embedding
    text_parts = [f"type: {d['chunk_type']}", f"layer: {d['layer']}"]
    for k, v in d.items():
        if k not in {"chunk_id", "chunk_type", "layer", "priority"} and v is not None:
            text_parts.append(f"{k}: {v}")
    text = "\n".join(text_parts)

    documents.append({
        "id":       d["chunk_id"],
        "text":     text,
        "metadata": {
            "chunk_type": d["chunk_type"],
            "layer":      d["layer"],
            "priority":   d["priority"],
            "sheet":      d.get("sheet_name"),
        },
    })

# e.g. with LangChain, LlamaIndex, Pinecone, Qdrant, Weaviate, etc.
# vector_store.add_documents(documents)

Selective extraction for query routing

# Fast schema-only retrieval for "what columns does this file have?"
schema_result = extractor.extract_schema()
schema_dicts = schema_result.to_dicts()

# KPI retrieval for "what is the total revenue?"
kpi_result = extractor.extract(
    chunk_types=["KPIChunk", "AggregationChunk"],
    include_analytics=True,
)

# Formula audit for "explain this formula"
formula_result = extractor.extract_formulas()

Serialising to JSON

import json

result = extractor.extract()
with open("chunks.json", "w") as f:
    json.dump(result.to_dicts(), f, default=str, indent=2)

Advanced Usage

Batch processing multiple files

from pathlib import Path
from omnidoc_excel_extractor import ExcelExtractor

results = {}
for path in Path("./reports").glob("*.xlsx"):
    extractor = ExcelExtractor(str(path))
    result = extractor.extract(include_rows=False)
    results[path.name] = result.summary()

print(results)

Accessing CHUNK_REGISTRY

from omnidoc_excel_extractor import CHUNK_REGISTRY

print(list(CHUNK_REGISTRY.keys()))  # all 57 chunk type names
ChunkClass = CHUNK_REGISTRY["SchemaChunk"]

Constructing a chunk manually

from omnidoc_excel_extractor.chunks.models import KPIChunk

kpi = KPIChunk(
    sheet_name="Dashboard",
    kpi_name="Total Revenue",
    value=1_250_000.0,
    unit="USD",
    target=1_500_000.0,
    variance=-250_000.0,
    variance_pct=-16.67,
    source_cell="B2",
)
print(kpi.to_dict())

Development Setup

# Clone the repository
git clone https://github.com/ganeshkinkargiri/omnidoc-excel-extractor-sdk.git
cd omnidoc-excel-extractor-sdk

# Create a virtual environment
python3 -m venv .venv
source .venv/bin/activate          # Windows: .venv\Scripts\activate

# Install in editable mode with dev dependencies
pip install -e ".[dev]"

# Optional extras
pip install oletools networkx

# Run tests
pytest

# Run tests with coverage
pytest --cov=omnidoc_excel_extractor --cov-report=term-missing

Project structure for contributors

omnidoc_excel_extractor/
├── chunks/models.py      ← Add new chunk class here first
├── builder/<category>.py ← Add builder function here
├── builder/__init__.py   ← Export the builder function
└── core/extractor.py     ← Call the builder inside extract()
tests/
├── conftest.py           ← Shared fixtures (programmatic .xlsx)
└── test_extractor.py     ← Tests

Adding a new chunk type:

Define the Pydantic class in chunks/models.py and add it to CHUNK_REGISTRY.
Write a build_<name>_chunks() function in the appropriate builder file.
Export it from builder/__init__.py.
Call it inside extract() in core/extractor.py.
Add a test in tests/test_extractor.py.

Publishing to PyPI

One-time Setup

1. Create accounts

Registry	URL	Purpose
TestPyPI	https://test.pypi.org/account/register/	Test uploads before going live
PyPI	https://pypi.org/account/register/	Production — public installs

2. Generate API tokens

TestPyPI token:

Log in to https://test.pypi.org
Account Settings → API tokens → Add API token
Scope: "Entire account" for first upload, or per-project after
Copy the token (shown only once) — starts with pypi-

PyPI token:

Log in to https://pypi.org
Account Settings → API tokens → Add API token
Copy the token

3. Configure `~/.pypirc`

Create or edit ~/.pypirc:

[distutils]
index-servers =
    pypi
    testpypi

[pypi]
repository = https://upload.pypi.org/legacy/
username = __token__
password = pypi-YOUR_PYPI_TOKEN_HERE

[testpypi]
repository = https://test.pypi.org/legacy/
username = __token__
password = pypi-YOUR_TESTPYPI_TOKEN_HERE

chmod 600 ~/.pypirc    # keep token private

4. Install publishing tools

pip install build twine

Build the Package

# Clean previous builds first
rm -rf dist/ build/ *.egg-info

# Build both sdist (.tar.gz) and wheel (.whl)
python -m build

Expected output:

Successfully built omnidoc_excel_extractor-0.1.0.tar.gz
            and omnidoc_excel_extractor-0.1.0-py3-none-any.whl

Validate the build before uploading:

twine check dist/*

You should see PASSED for both files. Fix any warnings before uploading.

Publish to TestPyPI

Always publish to TestPyPI first to catch packaging issues.

twine upload --repository testpypi dist/*

You will be prompted for credentials if ~/.pypirc is not configured.

Verify the TestPyPI install:

# Use a fresh virtual environment
python3 -m venv /tmp/test-install-venv
source /tmp/test-install-venv/bin/activate

# Install from TestPyPI (note: dependencies come from real PyPI)
pip install --index-url https://test.pypi.org/simple/ \
            --extra-index-url https://pypi.org/simple/ \
            omnidoc-excel-extractor

# Quick smoke test
python -c "
from omnidoc_excel_extractor import ExcelExtractor, CHUNK_REGISTRY
print(f'Package imported OK. {len(CHUNK_REGISTRY)} chunk types registered.')
"

Expected: Package imported OK. 57 chunk types registered.

Publish to PyPI (Production)

Once TestPyPI install is confirmed working:

twine upload dist/*

Or explicitly specifying the repository:

twine upload --repository pypi dist/*

Verify the production install:

pip install omnidoc-excel-extractor
python -c "from omnidoc_excel_extractor import ExcelExtractor; print('OK')"

Your package will be live at:

https://pypi.org/project/omnidoc-excel-extractor/

Versioning

Version is declared in pyproject.toml:

[project]
version = "0.1.0"

Follow Semantic Versioning:

Change	Version bump	Example
Bug fix, no API change	Patch	`0.1.0` → `0.1.1`
New chunk type or feature, backward-compatible	Minor	`0.1.0` → `0.2.0`
Breaking change to chunk fields or API	Major	`0.1.0` → `1.0.0`

Release workflow:

# 1. Bump version in pyproject.toml
#    version = "0.2.0"

# 2. Update CHANGELOG (optional but recommended)

# 3. Commit and tag
git add pyproject.toml
git commit -m "chore: bump version to 0.2.0"
git tag v0.2.0
git push origin master --tags

# 4. Build
rm -rf dist/
python -m build

# 5. Validate
twine check dist/*

# 6. TestPyPI first
twine upload --repository testpypi dist/*

# 7. Verify, then ship
twine upload dist/*

License

MIT — see LICENSE

Built with openpyxl, xlrd, pyxlsb, oletools, pydantic, networkx, and pandas.

Project details

These details have not been verified by PyPI

Project links

Release history Release notifications | RSS feed

0.3.2

May 19, 2026

0.3.1

May 19, 2026

0.3.0

May 19, 2026

0.2.0

May 16, 2026

This version

0.1.0

May 16, 2026

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

omnidoc_excel_extractor-0.1.0.tar.gz (68.5 kB view details)

Uploaded May 16, 2026 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

omnidoc_excel_extractor-0.1.0-py3-none-any.whl (64.8 kB view details)

Uploaded May 16, 2026 Python 3

File details

Details for the file omnidoc_excel_extractor-0.1.0.tar.gz.

File metadata

Download URL: omnidoc_excel_extractor-0.1.0.tar.gz
Upload date: May 16, 2026
Size: 68.5 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.8

File hashes

Hashes for omnidoc_excel_extractor-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`cb7752a5d876dfc9972491281123ebce2f03d58f990646503cfa3257ad46b5b9`
MD5	`ac216bd91c6ca7e0dff08f95c70d4ecb`
BLAKE2b-256	`24f73c1288c73047051b619cab007fa42d1eae8d077e3859944464b960bd966c`

See more details on using hashes here.

File details

Details for the file omnidoc_excel_extractor-0.1.0-py3-none-any.whl.

File metadata

Download URL: omnidoc_excel_extractor-0.1.0-py3-none-any.whl
Upload date: May 16, 2026
Size: 64.8 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.8

File hashes

Hashes for omnidoc_excel_extractor-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`eee00d6c01f03156e93ac9269e933b56f9473ba556fd82e62804c8ad02027a4f`
MD5	`1cc6a46c5f217a5b0a374583539f691d`
BLAKE2b-256	`4bac5528bc10c61e38af9e50a60fa02ff67f555ed9624753eb3392eba5cdbdc7`

See more details on using hashes here.

omnidoc-excel-extractor 0.1.0

Navigation

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Project description

omnidoc-excel-extractor

Table of Contents

Overview

Architecture

Installation

Quick Start

Extraction Options

Priority tiers at a glance

Convenience Methods

ExtractionResult API

All 57 Chunk Types

Structural — priority: must

1. WorkbookChunk

2. SheetChunk

3. TableChunk

4. SchemaChunk

5. RowChunk

6. GroupChunk

7. ParentContextChunk

8. MergedCellChunk

9. FrozenPaneChunk (priority: nice)

Semantic — priority: must/should

10. SummaryChunk

11. SemanticNarrativeChunk

12. ColumnSemanticChunk

13. HeaderAliasChunk

14. EntityChunk (priority: should)

15. CellAnnotationChunk (priority: should)

Analytical — priority: must/should

16. FormulaDefinitionChunk

17. FormulaResultChunk

18. KPIChunk

19. AggregationChunk

20. TemporalChunk

21. TrendChunk (priority: should)

22. OutlierChunk (priority: should)

23. LookupMapChunk (priority: should)

Validation / QA — priority: must/should

24. ValidationChunk

25. ErrorChunk

26. ConditionalFormatChunk

27. ProtectionChunk (priority: should)

28. DataQualityChunk (priority: should)

Visual — Charts — priority: must/should

29. ChartChunk

30. ChartSeriesChunk

31. ChartAnnotationChunk (priority: should)

Visual — Pivot Tables — priority: must/should

32. PivotTableChunk

33. PivotFieldChunk

34. PivotCacheChunk (priority: should)

35. SlicerChunk (priority: should)

36. TimelineChunk (priority: should)

Visual — Other Objects — priority: should/nice

37. SparklineChunk (priority: should)

38. ShapeChunk (priority: nice)

39. ImageChunk (priority: nice)

40. FormControlChunk (priority: should)

41. ActiveXControlChunk (priority: should)

VBA / Macro — priority: must/should

42. MacroChunk

43. VBAModuleChunk

44. VBAEventChunk (priority: should)

45. CustomFunctionChunk (priority: should)

46. RibbonCustomizationChunk (priority: nice)

Cross-Reference / Linkage — priority: must/should

47. RelationshipChunk

48. NamedRangeChunk

49. ExternalLinkChunk (priority: should)

50. DependencyGraphChunk (priority: should)

Connectivity / Power Features — priority: must/should

Structural — `priority: must`

1. `WorkbookChunk`

2. `SheetChunk`

3. `TableChunk`

4. `SchemaChunk`

5. `RowChunk`

6. `GroupChunk`

7. `ParentContextChunk`

8. `MergedCellChunk`

9. `FrozenPaneChunk` (priority: nice)

Semantic — `priority: must/should`

10. `SummaryChunk`

11. `SemanticNarrativeChunk`

12. `ColumnSemanticChunk`

13. `HeaderAliasChunk`

14. `EntityChunk` (priority: should)

15. `CellAnnotationChunk` (priority: should)

Analytical — `priority: must/should`

16. `FormulaDefinitionChunk`

17. `FormulaResultChunk`

18. `KPIChunk`

19. `AggregationChunk`

20. `TemporalChunk`

21. `TrendChunk` (priority: should)

22. `OutlierChunk` (priority: should)

23. `LookupMapChunk` (priority: should)

Validation / QA — `priority: must/should`

24. `ValidationChunk`

25. `ErrorChunk`

26. `ConditionalFormatChunk`

27. `ProtectionChunk` (priority: should)

28. `DataQualityChunk` (priority: should)

Visual — Charts — `priority: must/should`

29. `ChartChunk`

30. `ChartSeriesChunk`

31. `ChartAnnotationChunk` (priority: should)

Visual — Pivot Tables — `priority: must/should`

32. `PivotTableChunk`

33. `PivotFieldChunk`

34. `PivotCacheChunk` (priority: should)

35. `SlicerChunk` (priority: should)

36. `TimelineChunk` (priority: should)

Visual — Other Objects — `priority: should/nice`

37. `SparklineChunk` (priority: should)

38. `ShapeChunk` (priority: nice)

39. `ImageChunk` (priority: nice)

40. `FormControlChunk` (priority: should)

41. `ActiveXControlChunk` (priority: should)

VBA / Macro — `priority: must/should`

42. `MacroChunk`

43. `VBAModuleChunk`

44. `VBAEventChunk` (priority: should)

45. `CustomFunctionChunk` (priority: should)

46. `RibbonCustomizationChunk` (priority: nice)

Cross-Reference / Linkage — `priority: must/should`

47. `RelationshipChunk`

48. `NamedRangeChunk`

49. `ExternalLinkChunk` (priority: should)

50. `DependencyGraphChunk` (priority: should)

Connectivity / Power Features — `priority: must/should`

51. `PowerQueryChunk`

52. `DataModelChunk` (priority: should)

53. `PowerPivotMeasureChunk` (priority: should)

54. `ConnectionChunk` (priority: should)

Operational / Metadata — `priority: must/nice`

55. `PrintSettingsChunk` (priority: nice)

56. `ChangeLogChunk` (priority: nice)

57. `ChunkIndexChunk` (priority: must)

3. Configure `~/.pypirc`