Extract and normalize Excel workbook artifacts (sheets, connections, formulas) into a lightweight graph.
excelminer
excelminer extracts Excel workbook artifacts into a small, normalized in-memory graph (nodes + edges) that you can serialize to deterministic JSON.
It is designed for inventory, analysis, and reproducible diffs (stable ordering), not for “opening Excel” or evaluating formulas.
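Because the output is meant to be diffed, it helps to serialize it the same way every time. The following is a minimal sketch, assuming the analysis result is a plain JSON-compatible dict. The helpers `to_stable_json` and `diff_json` and the sample dicts are hypothetical and are not part of excelminer:

```python
import difflib
import json


def to_stable_json(result: dict) -> str:
    """Serialize with sorted keys and fixed indentation so equal data gives equal text."""
    return json.dumps(result, sort_keys=True, indent=2, ensure_ascii=False)


def diff_json(old: dict, new: dict) -> list[str]:
    """Return unified-diff lines between two serialized results."""
    return list(
        difflib.unified_diff(
            to_stable_json(old).splitlines(),
            to_stable_json(new).splitlines(),
            fromfile="old",
            tofile="new",
            lineterm="",
        )
    )


# Hypothetical stand-ins for analysis results (a and b hold the same data in a different key order).
a = {"graph": {"stats": {"sheet": 2, "connection": 1}}, "path": "workbook.xlsx"}
b = {"path": "workbook.xlsx", "graph": {"stats": {"connection": 1, "sheet": 2}}}
c = {"path": "workbook.xlsx", "graph": {"stats": {"connection": 1, "sheet": 3}}}

print(diff_json(a, b))             # no diff lines: key order does not matter
print("\n".join(diff_json(a, c)))  # shows the changed sheet count
```

Sorting keys only stabilizes dict ordering; list ordering (for example nodes and edges) still has to be stable in the data itself.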
What you can extract
From OOXML files (.xlsx/.xlsm/.xltx/.xltm) without Excel installed:
- sheets
- defined names
- connections + basic source inference
- Power Query queries (when stored as xl/queries/*.xml)
- Power Query mashup-container detection (best-effort, metadata-only)
- pivot tables + pivot caches (best-effort)
- VBA project metadata + module text for macro-enabled OOXML (.xlsm/.xltm/.xlam)
- formula text + basic dependencies (via openpyxl, when enabled)
Optional enrichment:
- used-range “value blocks” via calamine (fast scanning)
- Windows Excel COM automation (for legacy formats like .xls/.xlsb and opt-in enrichment for modern OOXML)
This package focuses on inventory and reproducible diffs, not evaluation:
- formulas are stored as text (not evaluated)
- macros are not executed
- many artifacts are “best-effort” depending on how a workbook was authored
Install
Base install:
pip install excelminer
Optional extras:
pip install "excelminer[calamine]" # pandas + python-calamine
pip install "excelminer[com]" # Windows + Microsoft Excel required
Core API
- analyze_workbook(path, *, options=..., backends=...) -> (graph, reports, ctx)
- analyze_to_dict(path, *, options=..., backends=...) -> dict
reports is a per-backend list of stats/issues; ctx.issues includes top-level warnings.
Quickstart
JSON output
from excelminer import AnalysisOptions, analyze_to_dict
result = analyze_to_dict(
    "workbook.xlsx",
    options=AnalysisOptions(include_formulas=True),
)
print(result["graph"]["stats"]) # counts by node kind
print(result["reports"][0]["backend"]) # per-backend reports
Graph output
from excelminer import AnalysisOptions, analyze_workbook
graph, reports, ctx = analyze_workbook(
    "workbook.xlsx",
    options=AnalysisOptions(include_formulas=True),
)
print(graph.stats())
print([r.backend for r in reports])
print(ctx.issues)
Common usage patterns
1) Fast structural inventory (default)
from excelminer import analyze_to_dict
result = analyze_to_dict("workbook.xlsx")
print(result["graph"]["stats"]) # counts by node kind
2) Formula inventory (no Excel required)
from excelminer import AnalysisOptions, analyze_to_dict
result = analyze_to_dict(
    "workbook.xlsx",
    options=AnalysisOptions(include_formulas=True),
)
3) Used-range “value blocks” (optional)
Requires excelminer[calamine].
from excelminer import AnalysisOptions, analyze_to_dict
result = analyze_to_dict(
    "workbook.xlsx",
    options=AnalysisOptions(include_cells=True, max_cells_per_sheet=50_000),
)
4) Post-analysis distillation (optional)
If you want a condensed view for large graphs:
from excelminer import AnalysisOptions, analyze_workbook
graph, reports, ctx = analyze_workbook(
    "workbook.xlsx",
    options=AnalysisOptions(include_formulas=True, post_analysis_distillation=True),
)
5) COM enrichment (Windows + Excel required)
COM is opt-in for modern OOXML files (.xlsx/.xlsm/...).
from excelminer import AnalysisOptions, analyze_to_dict
result = analyze_to_dict(
    "workbook.xlsx",
    options=AnalysisOptions(include_com=True, include_connections=True),
)
Output shape (high level)
analyze_to_dict() returns:
- path, options, issues
- reports: per-backend stats/issues
- graph: { nodes: [...], edges: [...], stats: {...} }
Common node kinds include: sheet, connection, source, powerquery, pivot_table, pivot_cache, vba_project, formula_cell, cell_block.
Optional post-processing can be enabled via AnalysisOptions(post_analysis_distillation=True) to add condensed artifacts like formula_group and to prune unused artifacts (best-effort).
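To get a feel for this shape, the node list can be summarized directly. This is a minimal sketch, assuming only the `{ id, kind, key, attrs }` node layout described in this README. The sample data and the helpers `count_kinds` and `keys_of_kind` are hypothetical and not part of excelminer:

```python
from collections import Counter

# Hypothetical result fragment; real ids and keys will look different.
result = {
    "graph": {
        "nodes": [
            {"id": "n1", "kind": "sheet", "key": "Sheet1", "attrs": {}},
            {"id": "n2", "kind": "sheet", "key": "Sheet2", "attrs": {}},
            {"id": "n3", "kind": "connection", "key": "Query - Sales", "attrs": {}},
        ],
        "edges": [],
    }
}


def count_kinds(graph: dict) -> dict[str, int]:
    """Count nodes per kind, returned in sorted key order."""
    counts = Counter(node["kind"] for node in graph["nodes"])
    return dict(sorted(counts.items()))


def keys_of_kind(graph: dict, kind: str) -> list[str]:
    """Return the sorted keys of all nodes of one kind."""
    return sorted(node["key"] for node in graph["nodes"] if node["kind"] == kind)


print(count_kinds(result["graph"]))
print(keys_of_kind(result["graph"], "sheet"))
```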
Nodes and edges
- Node: { id, kind, key, attrs }
- Edge: { src, dst, kind, attrs }
Common edge kinds:
- contains (e.g. sheet -> formula_cell)
- uses_source (e.g. connection -> source)
- uses_connection (e.g. powerquery -> connection)
- uses_cache (e.g. pivot_table -> pivot_cache)
- scoped_to (e.g. defined_name -> sheet)
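Edge kinds like these can be chained to answer lineage questions, such as which sources a Power Query ultimately depends on. The sketch below works on a plain dict in the documented node/edge shape. The ids, keys, and helper names (`build_index`, `sources_for_query`) are made up for illustration:

```python
from collections import defaultdict

# Hypothetical graph fragment: powerquery -> connection -> source.
graph = {
    "nodes": [
        {"id": "n1", "kind": "powerquery", "key": "Sales", "attrs": {}},
        {"id": "n2", "kind": "connection", "key": "Query - Sales", "attrs": {}},
        {"id": "n3", "kind": "source", "key": "sql:server01/salesdb", "attrs": {}},
    ],
    "edges": [
        {"src": "n1", "dst": "n2", "kind": "uses_connection", "attrs": {}},
        {"src": "n2", "dst": "n3", "kind": "uses_source", "attrs": {}},
    ],
}


def build_index(graph: dict) -> dict:
    """Map (src id, edge kind) to the list of dst ids."""
    index = defaultdict(list)
    for edge in graph["edges"]:
        index[(edge["src"], edge["kind"])].append(edge["dst"])
    return index


def sources_for_query(graph: dict, query_id: str) -> list[str]:
    """Follow uses_connection, then uses_source, and return the source keys."""
    index = build_index(graph)
    key_by_id = {node["id"]: node["key"] for node in graph["nodes"]}
    found = []
    for conn_id in index[(query_id, "uses_connection")]:
        for source_id in index[(conn_id, "uses_source")]:
            found.append(key_by_id[source_id])
    return sorted(found)


print(sources_for_query(graph, "n1"))
```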
Default backend pipeline
By default, backends run in this order:
- OOXML zip parsing (structure)
- VBA projects (macro detection for .xlsm/.xltm/.xlam)
- Power Query (queries XML + mashup-container detection)
- Pivot tables (pivots + caches)
- Calamine (used-range/value blocks; optional)
- openpyxl (formula text)
- Excel COM (Windows-only enrichment; opt-in for modern OOXML)
You can override the pipeline via the backends= argument.
Options (most important)
The main tuning surface is excelminer.AnalysisOptions.
Feature flags:
- include_connections (default True): workbook connections and inferred source nodes
- include_powerquery (default True): Power Query queries when stored as xl/queries/*.xml
- include_pivots (default True): pivot tables + caches (best-effort)
- include_defined_names (default True): defined names
- include_vba (default True): extract VBA project metadata and module text
- include_formulas (default False): formula text inventory via openpyxl
- include_cells (default False): used-range/value blocks via calamine (if installed)
- include_com (default False): enable Excel COM automation (Windows + Excel required)
- post_analysis_distillation (default False): optional graph distillation (best-effort)
Limits (for huge workbooks):
- max_sheets, max_cells_per_sheet
- sample_rows_per_block, sample_cols_per_block (calamine sampling)
Data source discovery (connections / sources)
excelminer tries to normalize upstream data dependencies into source nodes.
Sources can be discovered via:
- OOXML connections (xl/connections.xml): OLEDB/ODBC connection strings (sanitized KV stored in connection_kv)
- OOXML external links (xl/externalLinks/*): external workbook/file links (best-effort)
- Power Query M scanning: regex-based inference of SQL/file/web/sharepoint sources
- COM connections (when enabled): additional connection metadata and file/web hints when available
If you suspect sources are missing, see the “QA helpers” below.
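To illustrate what regex-based inference over Power Query M can look like, here is a rough sketch. The patterns, the `infer_sources` helper, and the sample M text are assumptions made for this example. They are not excelminer's actual patterns, which may cover more or fewer cases:

```python
import re

# Illustrative patterns keyed by source type (assumed, not the library's own).
PATTERNS = {
    "sql": re.compile(r'Sql\.Database\(\s*"([^"]+)"\s*,\s*"([^"]+)"'),
    "web": re.compile(r'Web\.Contents\(\s*"([^"]+)"'),
    "file": re.compile(r'File\.Contents\(\s*"([^"]+)"'),
    "sharepoint": re.compile(r'SharePoint\.(?:Files|Contents|Tables)\(\s*"([^"]+)"'),
}

# Raw string so the Windows path backslashes are kept literally.
M_SAMPLE = r'''
let
    Source = Sql.Database("server01", "salesdb"),
    Rates = Web.Contents("https://example.com/rates.csv"),
    Book = Excel.Workbook(File.Contents("C:\data\budget.xlsx"))
in
    Source
'''


def infer_sources(m_code: str) -> list[tuple[str, str]]:
    """Return sorted, de-duplicated (kind, location) pairs found in M text."""
    found = set()
    for kind, pattern in PATTERNS.items():
        for match in pattern.finditer(m_code):
            # Join captured groups, e.g. server and database for SQL sources.
            found.add((kind, "/".join(match.groups())))
    return sorted(found)


for kind, location in infer_sources(M_SAMPLE):
    print(kind, location)
```

Regex scanning of this kind only sees literal string arguments; sources built from parameters or concatenated strings would be missed, which is one reason detection is best-effort.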
QA helpers (recommended)
For quick inspection of whether sources and connections were detected:
from excelminer import analyze_workbook, summarize_connections, summarize_sources
graph, reports, ctx = analyze_workbook("workbook.xlsx")
print(summarize_sources(graph)["counts"])
print(summarize_connections(graph)["counts"])
The connection summary also includes a uses_source mapping so you can see which connections did (or did not) map to sources.
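A similar check can be run by hand on the dict form of the graph. This is a minimal sketch, assuming the documented node/edge shape. The `unmapped_connections` helper and the sample data are hypothetical and are not how `summarize_connections` is implemented:

```python
# Hypothetical graph fragment: one mapped and one unmapped connection.
graph = {
    "nodes": [
        {"id": "c1", "kind": "connection", "key": "SalesDB", "attrs": {}},
        {"id": "c2", "kind": "connection", "key": "LegacyFeed", "attrs": {}},
        {"id": "s1", "kind": "source", "key": "sql:server01/salesdb", "attrs": {}},
    ],
    "edges": [
        {"src": "c1", "dst": "s1", "kind": "uses_source", "attrs": {}},
    ],
}


def unmapped_connections(graph: dict) -> list[str]:
    """Return keys of connection nodes that have no outgoing uses_source edge."""
    mapped_ids = {
        edge["src"] for edge in graph["edges"] if edge["kind"] == "uses_source"
    }
    return sorted(
        node["key"]
        for node in graph["nodes"]
        if node["kind"] == "connection" and node["id"] not in mapped_ids
    )


print(unmapped_connections(graph))
```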
Security & privacy notes
- Connection parsing produces a sanitized key/value view (password, user id, etc. masked) in connection_kv.
- The raw connection string may also be stored in connection.raw.
Treat the output JSON as potentially sensitive. If you don’t need connections, use AnalysisOptions(include_connections=False).
Additional notes:
- Sanitization covers only a small set of common keys; connection strings and Power Query M can contain sensitive data in many forms.
- If you enable COM automation, Excel is started in the background; behavior can vary due to enterprise policies/add-ins.
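The limits of key-based masking are easier to see in a small example. This is a sketch of the general technique only: the key list, the mask value, and the helper names are assumptions, and excelminer's own sanitizer may behave differently:

```python
# Assumed set of sensitive keys, compared case-insensitively.
SENSITIVE_KEYS = {"password", "pwd", "user id", "uid"}

RAW = (
    "Provider=SQLOLEDB;Data Source=server01;User ID=report;"
    "Password=s3cret;Initial Catalog=salesdb"
)


def parse_connection_string(raw: str) -> dict[str, str]:
    """Split 'key=value;key=value' text into a dict.

    Naive on purpose: quoted values that contain ';' are not handled.
    """
    pairs = {}
    for part in raw.split(";"):
        if "=" not in part:
            continue
        key, value = part.split("=", 1)
        pairs[key.strip()] = value.strip()
    return pairs


def sanitize(pairs: dict[str, str]) -> dict[str, str]:
    """Mask values whose key is in the sensitive set; pass everything else through."""
    return {
        key: ("***" if key.lower() in SENSITIVE_KEYS else value)
        for key, value in pairs.items()
    }


print(sanitize(parse_connection_string(RAW)))
```

Anything outside the key list passes through unchanged, so a secret embedded in a URL, a file path, or an unusual key would not be masked by this approach.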
Troubleshooting
- If you see openpyxl import failed, install/upgrade openpyxl.
- If you enable include_cells=True and see a calamine/pandas error, install excelminer[calamine].
- If you enable include_com=True and see pywin32 not available, install excelminer[com] (Windows only).
- Some Power Query workbooks store queries in binary mashup parts; excelminer reports presence/metadata but does not decode those binaries.
Development notes
COM integration tests are opt-in because some environments can crash the Python process when Excel COM is invoked.
PowerShell:
$env:EXCELMINER_RUN_COM_TESTS='1'
pytest -m integration