Extract and normalize Excel workbook artifacts (sheets, connections, formulas) into a lightweight graph.

excelminer

excelminer extracts Excel workbook artifacts into a small, normalized in-memory graph (nodes + edges) that you can serialize to deterministic JSON.

It is designed for inventory, analysis, and reproducible diffs (stable ordering), not for “opening Excel” or evaluating formulas.
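
To illustrate why stable ordering matters for diffs, here is a minimal sketch that uses only the standard library. It is not excelminer's serializer: the helper to_stable_json, the sort keys, and the sample node ids are assumptions made for this example.

```python
import difflib
import json


def to_stable_json(graph: dict) -> str:
    """Serialize a {nodes, edges} graph with a fixed ordering (hypothetical helper)."""
    ordered = {
        "nodes": sorted(graph["nodes"], key=lambda n: n["id"]),
        "edges": sorted(graph["edges"], key=lambda e: (e["src"], e["dst"], e["kind"])),
    }
    # sort_keys fixes the key order inside every JSON object as well.
    return json.dumps(ordered, indent=2, sort_keys=True)


# Made-up sample graphs; the ids are illustrative only.
old = {
    "nodes": [
        {"id": "sheet:Data", "kind": "sheet"},
        {"id": "sheet:Summary", "kind": "sheet"},
    ],
    "edges": [],
}
# Same nodes as `old`, listed in a different order, plus one new connection node.
new = {
    "nodes": [
        {"id": "sheet:Summary", "kind": "sheet"},
        {"id": "conn:Sales", "kind": "connection"},
        {"id": "sheet:Data", "kind": "sheet"},
    ],
    "edges": [],
}

diff = list(
    difflib.unified_diff(
        to_stable_json(old).splitlines(),
        to_stable_json(new).splitlines(),
        lineterm="",
        n=0,
    )
)
# Keep only added lines, skipping the "+++" file header.
added = [line for line in diff if line.startswith("+") and not line.startswith("+++")]
print("\n".join(added))
```

Because both sides are sorted before serialization, the reordering of existing nodes produces no diff noise; only the new node shows up.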

Requires Python 3.12+. License: MIT.

What you can extract

From OOXML files (.xlsx/.xlsm/.xltx/.xltm) without Excel installed:

  • sheets
  • defined names
  • connections + basic source inference
  • Power Query queries (when stored as xl/queries/*.xml)
  • Power Query mashup-container detection (best-effort, metadata-only)
  • pivot tables + pivot caches (best-effort)
  • VBA project metadata + module text for macro-enabled OOXML (.xlsm/.xltm/.xlam)
  • formula text + basic dependencies (via openpyxl, when enabled)
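
Reading these artifacts without Excel is possible because an OOXML workbook is a zip package. The sketch below shows that idea with the standard zipfile module. It is not excelminer's parser: inventory_parts is a hypothetical helper, and the part names it checks are the conventional ones (a real workbook resolves them through relationship files).

```python
import io
import zipfile


def inventory_parts(package: zipfile.ZipFile) -> dict:
    """Report which well-known parts exist in an OOXML package (hypothetical helper)."""
    names = package.namelist()
    return {
        "sheets": sorted(
            n for n in names if n.startswith("xl/worksheets/") and n.endswith(".xml")
        ),
        "queries": sorted(
            n for n in names if n.startswith("xl/queries/") and n.endswith(".xml")
        ),
        "has_connections": "xl/connections.xml" in names,
        "has_vba": "xl/vbaProject.bin" in names,
    }


# Build a tiny stand-in package in memory so the example runs without a real workbook.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("xl/workbook.xml", "<workbook/>")
    zf.writestr("xl/worksheets/sheet1.xml", "<worksheet/>")
    zf.writestr("xl/connections.xml", "<connections/>")
    zf.writestr("xl/vbaProject.bin", b"\x00")

with zipfile.ZipFile(buffer) as zf:
    parts = inventory_parts(zf)

print(parts)
```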

Optional enrichment:

  • used-range “value blocks” via calamine (fast scanning)
  • Windows Excel COM automation (for legacy formats like .xls/.xlsb and opt-in enrichment for modern OOXML)
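
As a rough pre-check before analysis, a caller could sort files by extension into those readable as OOXML and those that need COM. This is a sketch based only on the extension lists above; classify_workbook is a hypothetical helper, not part of excelminer, and it does not inspect file contents.

```python
from pathlib import Path

# Extension groups taken from the lists above.
OOXML_EXTENSIONS = {".xlsx", ".xlsm", ".xltx", ".xltm", ".xlam"}
LEGACY_EXTENSIONS = {".xls", ".xlsb"}


def classify_workbook(path: str) -> str:
    """Return 'ooxml', 'legacy', or 'unknown' based on the file extension only."""
    suffix = Path(path).suffix.lower()
    if suffix in OOXML_EXTENSIONS:
        return "ooxml"
    if suffix in LEGACY_EXTENSIONS:
        return "legacy"
    return "unknown"


for name in ["report.xlsx", "Budget.XLSM", "archive.xls", "notes.txt"]:
    print(name, "->", classify_workbook(name))
```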

This package focuses on inventory and reproducible diffs, not evaluation:

  • formulas are stored as text (not evaluated)
  • macros are not executed
  • many artifacts are “best-effort” depending on how a workbook was authored

Install

Base install:

pip install excelminer

Optional extras:

pip install "excelminer[calamine]"  # pandas + python-calamine
pip install "excelminer[com]"       # Windows + Microsoft Excel required

Documentation (bundled)

Documentation ships inside the package so it is available offline:

python -m excelminer.docs --list
python -m excelminer.docs --show USAGE
python -m excelminer.docs --write-dir .\excelminer-docs
python -m excelminer.docs --write-site .\excelminer-site

After writing the site files, open excelminer-site\index.html locally.

Core API

  • analyze_workbook(path, *, options=..., backends=...) -> (graph, reports, ctx)
  • analyze_to_dict(path, *, options=..., backends=...) -> dict

reports is a per-backend list of stats/issues; ctx.issues includes top-level warnings.

Quickstart

JSON output

from excelminer import AnalysisOptions, analyze_to_dict

result = analyze_to_dict(
    "workbook.xlsx",
    options=AnalysisOptions(include_formulas=True),
)

print(result["graph"]["stats"])          # counts by node kind
print(result["reports"][0]["backend"])    # backend name from the first report

Graph output

from excelminer import AnalysisOptions, analyze_workbook

graph, reports, ctx = analyze_workbook(
    "workbook.xlsx",
    options=AnalysisOptions(include_formulas=True),
)

print(graph.stats())
print([r.backend for r in reports])
print(ctx.issues)

Common usage patterns

1) Fast structural inventory (default)

from excelminer import analyze_to_dict

result = analyze_to_dict("workbook.xlsx")
print(result["graph"]["stats"])  # counts by node kind

2) Formula inventory (no Excel required)

from excelminer import AnalysisOptions, analyze_to_dict

result = analyze_to_dict(
    "workbook.xlsx",
    options=AnalysisOptions(include_formulas=True),
)

3) Used-range “value blocks” (optional)

Requires excelminer[calamine].

from excelminer import AnalysisOptions, analyze_to_dict

result = analyze_to_dict(
    "workbook.xlsx",
    options=AnalysisOptions(include_cells=True, max_cells_per_sheet=50_000),
)

4) Post-analysis distillation (optional)

If you want a condensed view for large graphs:

from excelminer import AnalysisOptions, analyze_workbook

graph, reports, ctx = analyze_workbook(
    "workbook.xlsx",
    options=AnalysisOptions(include_formulas=True, post_analysis_distillation=True),
)

5) COM enrichment (Windows + Excel required)

COM is opt-in for modern OOXML files (.xlsx/.xlsm/...).

from excelminer import AnalysisOptions, analyze_to_dict

result = analyze_to_dict(
    "workbook.xlsx",
    options=AnalysisOptions(include_com=True, include_connections=True),
)

Output shape (high level)

analyze_to_dict() returns:

  • path, options, issues
  • reports: per-backend stats/issues
  • graph: { nodes: [...], edges: [...], stats: {...} }

Common node kinds include: sheet, connection, source, powerquery, pivot_table, pivot_cache, vba_project, formula_cell, cell_block.
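
Since the output is plain dicts and lists, it can be summarized with the standard library. The sketch below works on a hand-written sample in the shape described above; the ids, keys, and attrs are made up, and real output will contain more fields.

```python
from collections import Counter

# Hand-written sample in the documented shape; ids, keys, and attrs are made up.
result = {
    "graph": {
        "nodes": [
            {"id": "n1", "kind": "sheet", "key": "Data", "attrs": {}},
            {"id": "n2", "kind": "sheet", "key": "Summary", "attrs": {}},
            {"id": "n3", "kind": "connection", "key": "Sales", "attrs": {}},
            {"id": "n4", "kind": "source", "key": "sqlprod01/Sales", "attrs": {}},
        ],
        "edges": [
            {"src": "n3", "dst": "n4", "kind": "uses_source", "attrs": {}},
        ],
    }
}

nodes = result["graph"]["nodes"]

# Count nodes per kind, then list the keys of one kind of interest.
counts = Counter(node["kind"] for node in nodes)
sheet_keys = sorted(node["key"] for node in nodes if node["kind"] == "sheet")

for kind, count in sorted(counts.items()):
    print(f"{kind}: {count}")
print("sheets:", sheet_keys)
```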

Optional post-processing can be enabled via AnalysisOptions(post_analysis_distillation=True) to add condensed artifacts like formula_group and to prune unused artifacts (best-effort).

Nodes and edges

  • Node: { id, kind, key, attrs }
  • Edge: { src, dst, kind, attrs }

Common edge kinds:

  • contains (e.g. sheet -> formula_cell)
  • uses_source (e.g. connection -> source)
  • uses_connection (e.g. powerquery -> connection)
  • uses_cache (e.g. pivot_table -> pivot_cache)
  • scoped_to (e.g. defined_name -> sheet)
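
The node and edge shapes above are enough to group relationships by kind and to sanity-check a serialized graph. The following is a minimal sketch on a made-up sample; edges_by_kind and dangling_edges are hypothetical helpers, not excelminer functions.

```python
from collections import defaultdict


def edges_by_kind(graph: dict) -> dict:
    """Group (src, dst) pairs under their edge kind."""
    grouped = defaultdict(list)
    for edge in graph["edges"]:
        grouped[edge["kind"]].append((edge["src"], edge["dst"]))
    return {kind: sorted(pairs) for kind, pairs in grouped.items()}


def dangling_edges(graph: dict) -> list:
    """Return edges whose src or dst does not match any node id."""
    known = {node["id"] for node in graph["nodes"]}
    return [e for e in graph["edges"] if e["src"] not in known or e["dst"] not in known]


# Made-up sample graph; the last edge points at a node that does not exist.
graph = {
    "nodes": [
        {"id": "sheet:Data", "kind": "sheet", "key": "Data", "attrs": {}},
        {"id": "cell:Data!B2", "kind": "formula_cell", "key": "Data!B2", "attrs": {}},
        {"id": "pivot:1", "kind": "pivot_table", "key": "PivotTable1", "attrs": {}},
        {"id": "cache:1", "kind": "pivot_cache", "key": "1", "attrs": {}},
    ],
    "edges": [
        {"src": "sheet:Data", "dst": "cell:Data!B2", "kind": "contains", "attrs": {}},
        {"src": "pivot:1", "dst": "cache:1", "kind": "uses_cache", "attrs": {}},
        {"src": "pivot:1", "dst": "cache:2", "kind": "uses_cache", "attrs": {}},
    ],
}

grouped = edges_by_kind(graph)
broken = dangling_edges(graph)
print(grouped)
print(broken)
```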

Default backend pipeline

By default, backends run in this order:

  1. OOXML zip parsing (structure)
  2. VBA projects (macro detection for .xlsm/.xltm/.xlam)
  3. Power Query (queries XML + mashup-container detection)
  4. Pivot tables (pivots + caches)
  5. Calamine (used-range/value blocks; optional)
  6. openpyxl (formula text)
  7. Excel COM (Windows-only enrichment; opt-in for modern OOXML)

You can override the pipeline via the backends= argument.

Options (most important)

The main tuning surface is excelminer.AnalysisOptions.

Feature flags:

  • include_connections (default True): workbook connections and inferred source nodes
  • include_powerquery (default True): Power Query queries when stored as xl/queries/*.xml
  • include_pivots (default True): pivot tables + caches (best-effort)
  • include_defined_names (default True): defined names
  • include_vba (default True): extract VBA project metadata and module text
  • include_formulas (default False): formula text inventory via openpyxl
  • include_cells (default False): used-range/value blocks via calamine (if installed)
  • include_com (default False): enable Excel COM automation (Windows + Excel required)
  • post_analysis_distillation (default False): optional graph distillation (best-effort)

Limits (for huge workbooks):

  • max_sheets, max_cells_per_sheet
  • sample_rows_per_block, sample_cols_per_block (calamine sampling)

Data source discovery (connections / sources)

excelminer tries to normalize upstream data dependencies into source nodes.

Sources can be discovered via:

  • OOXML connections (xl/connections.xml): OLEDB/ODBC connection strings (sanitized KV stored in connection_kv)
  • OOXML external links (xl/externalLinks/*): external workbook/file links (best-effort)
  • Power Query M scanning: regex-based inference of SQL/file/web/sharepoint sources
  • COM connections (when enabled): additional connection metadata and file/web hints when available
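
To give a feel for regex-based inference over Power Query M, here is a small sketch. The patterns are illustrative assumptions, not the ones excelminer uses; they match a few standard M data-access functions (Sql.Database, File.Contents, Web.Contents, SharePoint.Files/Contents/Tables) when called with string literals. infer_sources is a hypothetical helper.

```python
import re

# Illustrative patterns only: each maps an M data-access function to a source type.
SOURCE_PATTERNS = {
    "sql": re.compile(r'Sql\.Database\(\s*"([^"]+)"\s*,\s*"([^"]+)"'),
    "file": re.compile(r'File\.Contents\(\s*"([^"]+)"'),
    "web": re.compile(r'Web\.Contents\(\s*"([^"]+)"'),
    "sharepoint": re.compile(r'SharePoint\.(?:Files|Contents|Tables)\(\s*"([^"]+)"'),
}


def infer_sources(m_code: str) -> list:
    """Return sorted (source_type, location) pairs found in M code."""
    found = []
    for source_type, pattern in SOURCE_PATTERNS.items():
        for match in pattern.finditer(m_code):
            # Join all captured groups, e.g. server and database for SQL.
            found.append((source_type, " / ".join(match.groups())))
    return sorted(found)


# Made-up query text for demonstration.
sample_m = r'''
let
    Source = Sql.Database("sqlprod01", "Sales"),
    Rates = Excel.Workbook(File.Contents("C:\data\rates.xlsx")),
    Page = Web.Contents("https://example.com/api/rates")
in
    Source
'''

for source_type, location in infer_sources(sample_m):
    print(source_type, location)
```

Regex scanning of this kind misses sources whose arguments are built from variables or parameters, which is one reason such inference stays best-effort.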

If you suspect sources are missing, see the “QA helpers” below.

QA helpers (recommended)

For quick inspection of whether sources and connections were detected:

from excelminer import analyze_workbook, summarize_connections, summarize_sources

graph, reports, ctx = analyze_workbook("workbook.xlsx")

print(summarize_sources(graph)["counts"])
print(summarize_connections(graph)["counts"])

The connection summary also includes a uses_source mapping so you can see which connections did (or did not) map to sources.
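
The same kind of check can be done by hand on the serialized graph. The sketch below builds a connection-to-source mapping from uses_source edges on a made-up sample; map_connections_to_sources is a hypothetical helper and does not reproduce the output of summarize_connections.

```python
def map_connections_to_sources(graph: dict) -> dict:
    """Map each connection node id to the source ids it points at."""
    mapping = {n["id"]: [] for n in graph["nodes"] if n["kind"] == "connection"}
    for edge in graph["edges"]:
        if edge["kind"] == "uses_source" and edge["src"] in mapping:
            mapping[edge["src"]].append(edge["dst"])
    return mapping


# Made-up sample: one connection resolved to a source, one did not.
graph = {
    "nodes": [
        {"id": "conn:Sales", "kind": "connection", "key": "Sales", "attrs": {}},
        {"id": "conn:Legacy", "kind": "connection", "key": "Legacy", "attrs": {}},
        {"id": "src:sqlprod01", "kind": "source", "key": "sqlprod01", "attrs": {}},
    ],
    "edges": [
        {"src": "conn:Sales", "dst": "src:sqlprod01", "kind": "uses_source", "attrs": {}},
    ],
}

mapping = map_connections_to_sources(graph)
unmapped = sorted(conn_id for conn_id, sources in mapping.items() if not sources)

print(mapping)
print("connections without a source:", unmapped)
```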

Security & privacy notes

  • Connection parsing produces a sanitized key/value view in connection_kv, with password, user id, and similar keys masked.
  • The raw connection string may also be stored in connection.raw.

Treat the output JSON as potentially sensitive. If you don’t need connections, use AnalysisOptions(include_connections=False).
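
To make the idea of a sanitized key/value view concrete, here is a minimal sketch. It is not excelminer's sanitizer: the key list, the "***" mask, and sanitize_connection_string are assumptions for illustration, and the naive split on ";" does not handle quoted values that contain semicolons.

```python
# Assumed set of sensitive keys; a real sanitizer may cover a different set.
SENSITIVE_KEYS = {"password", "pwd", "user id", "uid"}


def sanitize_connection_string(raw: str) -> dict:
    """Split 'Key=Value;...' pairs and mask values of sensitive keys."""
    result = {}
    for part in raw.split(";"):
        if "=" not in part:
            continue  # skip empty or malformed segments
        key, value = part.split("=", 1)
        key = key.strip()
        result[key] = "***" if key.lower() in SENSITIVE_KEYS else value.strip()
    return result


# Made-up connection string for demonstration.
raw = "Provider=SQLOLEDB;Data Source=sqlprod01;User ID=report;Password=s3cret;"
print(sanitize_connection_string(raw))
```

Masking by key name cannot catch credentials embedded elsewhere, such as in a URL or in M code, so the output should still be handled as sensitive.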

Additional notes:

  • Sanitization covers only a small set of common keys; connection strings and Power Query M can contain sensitive data in many forms.
  • If you enable COM automation, Excel is started in the background; behavior can vary due to enterprise policies/add-ins.

Troubleshooting

  • If you see “openpyxl import failed”, install/upgrade openpyxl.
  • If you enable include_cells=True and see a calamine/pandas error, install excelminer[calamine].
  • If you enable include_com=True and see “pywin32 not available”, install excelminer[com] (Windows only).
  • Some Power Query workbooks store queries in binary mashup parts; excelminer reports presence/metadata but does not decode those binaries.
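
Before enabling optional features, it can help to check which optional dependencies are importable. The sketch below uses importlib.util.find_spec from the standard library. The mapping from package to import name (python_calamine for python-calamine, win32com for pywin32) reflects how those packages are normally imported; check_optional_dependencies is a hypothetical helper, not an excelminer function.

```python
import importlib.util

# Label -> top-level import name to probe.
OPTIONAL_MODULES = {
    "openpyxl": "openpyxl",
    "calamine": "python_calamine",
    "pandas": "pandas",
    "pywin32": "win32com",
}


def check_optional_dependencies() -> dict:
    """Return {label: True/False} for whether each module can be located."""
    return {
        label: importlib.util.find_spec(module) is not None
        for label, module in OPTIONAL_MODULES.items()
    }


status = check_optional_dependencies()
for label, available in status.items():
    print(f"{label}: {'available' if available else 'missing'}")
```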

Development notes

COM integration tests are opt-in because some environments can crash the Python process when Excel COM is invoked.

PowerShell:

$env:EXCELMINER_RUN_COM_TESTS='1'
pytest -m integration
