Extract and normalize Excel workbook artifacts (sheets, connections, formulas) into a lightweight graph.

excelminer

excelminer extracts Excel workbook artifacts into a small, normalized in-memory graph (nodes + edges) that you can serialize to deterministic JSON.

It is designed for inventory, analysis, and reproducible diffs (stable ordering), not for “opening Excel” or evaluating formulas.
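To illustrate why stable ordering matters for diffs, here is a minimal sketch using only the standard library. It does not call excelminer; the two dictionaries are hand-written stand-ins for analysis results, and the helper names `to_stable_json` and `diff_results` are hypothetical.

```python
import difflib
import json


def to_stable_json(data: dict) -> str:
    """Serialize with sorted keys and fixed indentation so equal data yields equal text."""
    return json.dumps(data, sort_keys=True, indent=2, ensure_ascii=False)


def diff_results(old: dict, new: dict) -> list[str]:
    """Return a unified diff between two result dictionaries."""
    old_lines = to_stable_json(old).splitlines()
    new_lines = to_stable_json(new).splitlines()
    return list(difflib.unified_diff(old_lines, new_lines, "old", "new", lineterm=""))


# Hand-written stand-ins for two analysis results (not real excelminer output).
before = {"graph": {"stats": {"sheet": 2, "connection": 1}}}
after = {"graph": {"stats": {"connection": 1, "sheet": 3}}}

diff_lines = diff_results(before, after)
print("\n".join(diff_lines))
```

Because keys are sorted before comparison, only the changed sheet count shows up in the diff, regardless of the order in which the keys were originally inserted.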

What you can extract

From OOXML files (.xlsx/.xlsm/.xltx/.xltm) without Excel installed:

  • sheets
  • defined names
  • connections + basic source inference
  • Power Query queries (when stored as xl/queries/*.xml)
  • Power Query mashup-container detection (best-effort, metadata-only)
  • pivot tables + pivot caches (best-effort)
  • VBA project presence for macro-enabled OOXML (.xlsm/.xltm/.xlam) (metadata-only)
  • formula text + basic dependencies (via openpyxl, when enabled)

Optional enrichment:

  • used-range “value blocks” via calamine (fast scanning)
  • Windows Excel COM automation (for legacy formats like .xls/.xlsb and opt-in enrichment for modern OOXML)

Install

Base install:

pip install excelminer

Optional extras:

pip install "excelminer[calamine]"  # pandas + python-calamine
pip install "excelminer[com]"       # Windows + Microsoft Excel required

Quickstart

JSON output

from excelminer import AnalysisOptions, analyze_to_dict

result = analyze_to_dict(
    "workbook.xlsx",
    options=AnalysisOptions(include_formulas=True),
)

print(result["graph"]["stats"])          # counts by node kind
print(result["reports"][0]["backend"])    # backend name from the first per-backend report

Graph output

from excelminer import AnalysisOptions, analyze_workbook

graph, reports, ctx = analyze_workbook(
    "workbook.xlsx",
    options=AnalysisOptions(include_formulas=True),
)

print(graph.stats())
print([r.backend for r in reports])
print(ctx.issues)

Output shape (high level)

analyze_to_dict() returns:

  • path, options, issues
  • reports: per-backend stats/issues
  • graph: { nodes: [...], edges: [...], stats: {...} }

Common node kinds include: sheet, connection, source, powerquery, pivot_table, pivot_cache, vba_project, formula_cell, cell_block.
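As a rough sketch of working with the graph portion of this output, the snippet below tallies nodes by kind. It assumes each node is a dictionary with a "kind" key; the README does not spell out the node schema (see docs/OUTPUT.md), so treat the "id" and "kind" field names, the sample data, and the helper names `count_by_kind` and `nodes_of_kind` as assumptions.

```python
from collections import Counter


def count_by_kind(graph: dict) -> dict:
    """Tally nodes by their kind, sorted by kind name for stable output."""
    # Assumption: each node is a dict carrying a "kind" key.
    counts = Counter(node.get("kind", "unknown") for node in graph.get("nodes", []))
    return dict(sorted(counts.items()))


def nodes_of_kind(graph: dict, kind: str) -> list:
    """Return every node whose kind matches."""
    return [node for node in graph.get("nodes", []) if node.get("kind") == kind]


# Hand-written sample in the assumed shape (not real excelminer output).
sample_graph = {
    "nodes": [
        {"id": "sheet:Data", "kind": "sheet"},
        {"id": "sheet:Summary", "kind": "sheet"},
        {"id": "conn:1", "kind": "connection"},
    ],
    "edges": [],
    "stats": {},
}

kind_counts = count_by_kind(sample_graph)
print(kind_counts)
```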

Default backend pipeline

By default, backends run in this order:

  1. OOXML zip parsing (structure)
  2. VBA projects (macro detection for .xlsm/.xltm/.xlam)
  3. Power Query (queries XML + mashup-container detection)
  4. Pivot tables (pivots + caches)
  5. Calamine (used-range/value blocks; optional)
  6. openpyxl (formula text)
  7. Excel COM (Windows-only enrichment; opt-in for modern OOXML)

You can override the pipeline via the backends= argument.

Security & privacy notes

  • Connection parsing produces a sanitized key/value view in connection_kv, with passwords, user IDs, and similar fields masked.
  • The raw connection string may also be stored in connection.raw.

Treat the output JSON as potentially sensitive. If you don’t need connections, use AnalysisOptions(include_connections=False).
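If you need connection metadata but want to avoid persisting raw connection strings, one option is to scrub the result dictionary before writing it out. The sketch below is an illustration, not part of excelminer: it assumes connection nodes carry kind == "connection" and hold the raw string under a key named "raw" somewhere inside the node, and the "attrs" container and the helper names are hypothetical.

```python
import copy


def _remove_key(value, key: str) -> None:
    """Recursively delete `key` from nested dicts and lists, in place."""
    if isinstance(value, dict):
        value.pop(key, None)
        for child in value.values():
            _remove_key(child, key)
    elif isinstance(value, list):
        for child in value:
            _remove_key(child, key)


def drop_raw_connection_strings(result: dict) -> dict:
    """Return a deep copy of `result` with "raw" removed from connection nodes."""
    cleaned = copy.deepcopy(result)
    for node in cleaned.get("graph", {}).get("nodes", []):
        # Assumption: connection nodes are tagged with kind == "connection".
        if node.get("kind") == "connection":
            _remove_key(node, "raw")
    return cleaned


# Hand-written sample in the assumed shape (not real excelminer output).
sample_result = {
    "graph": {
        "nodes": [
            {
                "kind": "connection",
                "attrs": {
                    "raw": "Server=db;Password=secret",
                    "connection_kv": {"Server": "db", "Password": "***"},
                },
            },
            {"kind": "sheet", "attrs": {"name": "Data"}},
        ]
    }
}

safe_result = drop_raw_connection_strings(sample_result)
print(safe_result["graph"]["nodes"][0]["attrs"])
```

The function works on a deep copy, so the original result is left untouched and can still be inspected in memory.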

Documentation (in this repo)

  • docs/README.md: documentation index
  • docs/USAGE.md: usage patterns + backend ordering
  • docs/OPTIONS.md: AnalysisOptions flags and limits
  • docs/BACKENDS.md: backend behavior and requirements
  • docs/OUTPUT.md: output schema and common node/edge kinds
  • docs/SECURITY.md: security & privacy notes
  • docs/DEVELOPMENT.md: tests, COM opt-in, coverage profiles

Development notes

COM integration tests are opt-in because invoking Excel COM can crash the Python process in some environments.

PowerShell:

$env:EXCELMINER_RUN_COM_TESTS='1'
pytest -m integration
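For context, an opt-in gate of this kind usually comes down to reading the environment variable before any COM call is attempted. The helper below is a hypothetical sketch of such a check; the project's actual test configuration is not shown here, and only the variable name EXCELMINER_RUN_COM_TESTS comes from the instructions above.

```python
import os


def com_tests_enabled(environ=None) -> bool:
    """Return True only when the opt-in variable is set to exactly '1'."""
    env = os.environ if environ is None else environ
    return env.get("EXCELMINER_RUN_COM_TESTS") == "1"


# Pass an explicit mapping to check the logic without touching the real environment.
print(com_tests_enabled({"EXCELMINER_RUN_COM_TESTS": "1"}))
print(com_tests_enabled({}))
```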
