Production-grade Excel Workflow Parser for RAG + auditability systems

📊 Make XLSX LLM Ready 🤖

ks-xlsx-parser — the open-source Python library that parses Excel (.xlsx) files into citation-ready JSON for LLMs, RAG pipelines, and AI agents (LangChain, LangGraph, CrewAI, OpenAI Agents SDK, Claude, MCP).

[!TIP] .xlsx → structured, typed, citation-ready JSON that an LLM can actually reason about. Cells, formulas, merged regions, tables, charts, conditional formatting, dependency graphs, and RAG-ready chunks — deterministic, fully tested, MIT.

Figure: raw workbook on the left (financial_model.xlsx) → parser output on the right: 4 typed, citation-linked chunks, each tied back to an exact sheet!range, ready to cite in an LLM response.

Spreadsheets are still the #1 unstructured data source in the enterprise. Feeding a .xlsx directly to an LLM loses structure (rows, formulas, merges), loses provenance (which cell said what), and blows through context windows. ks-xlsx-parser turns an Excel workbook into a token-counted, source-addressable graph that drops straight into LangChain, LangGraph, CrewAI, the OpenAI Agents SDK, or any MCP-aware client (Claude Desktop, Cursor, Windsurf, Zed, …).

🏁 Benchmark — ks-xlsx-parser vs Docling on SpreadsheetBench

Apples-to-apples on SpreadsheetBench v0.1: 912 real-world task instances curated from ExcelHome / Mr.Excel / r/excel. For each instance we parse the input .xlsx, embed every chunk with BAAI/bge-small-en-v1.5, then check whether the chunk containing the ground-truth answer is in the top-k by similarity to the question.
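The retrieval check described above can be sketched in plain Python. This is a minimal illustration rather than the benchmark's own code: `cosine`, `hit_at_k`, `recall_at_k`, and the toy vectors are hypothetical, and a real run would use BAAI/bge-small-en-v1.5 embeddings instead of hand-written vectors.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def hit_at_k(question_vec, chunk_vecs, answer_idx, k):
    """True if the chunk holding the answer ranks in the top k by similarity."""
    ranked = sorted(
        range(len(chunk_vecs)),
        key=lambda i: cosine(question_vec, chunk_vecs[i]),
        reverse=True,
    )
    return answer_idx in ranked[:k]

def recall_at_k(instances, k):
    """Fraction of instances whose answer chunk is retrieved in the top k."""
    hits = sum(hit_at_k(q, chunks, idx, k) for q, chunks, idx in instances)
    return hits / len(instances)

# Toy data: (question vector, chunk vectors, index of the answer chunk).
instances = [
    ([1.0, 0.0], [[0.9, 0.1], [0.0, 1.0]], 0),              # answer ranks 1st
    ([1.0, 0.0], [[0.9, 0.1], [0.5, 0.5], [0.0, 1.0]], 1),  # answer ranks 2nd
]
print(recall_at_k(instances, 1))  # 0.5
print(recall_at_k(instances, 3))  # 1.0
```

Recall@k can only grow as k increases, which is why the table reports k = 1, 3, and 5 side by side.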

Metric	🟢 ks-xlsx-parser	⚪ Docling 2.93	Δ
📊 Parse success (5,458-file corpus)	5,455 ok · 3 timeouts · 0 errors	not run at scale	n/a
🎯 Recall@1 (text-match)			tie
🎯 Recall@3 (text-match)			+2.7 pp
🎯 Recall@5 (text-match)			+1.8 pp
📍 Geometric Recall@5 (chunk's sheet!A1:Z99 overlaps the ground-truth range)	0.369	0.000	+0.369
⚡ Mean parse time (per file)			
🧱 Parser errors (across 912 instances)			n/a

💡 What the numbers mean

ks-xlsx-parser ties at recall@1 and wins recall@3 (+2.7 pp) and recall@5 (+1.8 pp). Text-match recall is parser-agnostic — it asks whether any parser surfaced a chunk containing the answer string, after normalising commas, percent signs, ISO dates, and booleans on both sides.
ks-xlsx-parser wins citation-grade (geometric) recall outright (0.369 vs 0.000). Docling produces markdown without per-chunk sheet!range anchors, so it can't render a citation that points at the exact source cells. This is the difference between "the answer is somewhere in the workbook" and "the answer is in Revenue!C7."
Marker is excluded by design. Its xlsx → HTML → PDF → layout-recognition pipeline takes more than 30 min per workbook on CPU. The benchmark framework supports adding a Marker adapter when a GPU is available; see tests/benchmarks/adapters/docling_adapter.py as a template.
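The geometric check in the bullets above amounts to a rectangle-overlap test on cell ranges. The sketch below is one way to express it, assuming references shaped like `Sheet!A1:Z99`; `col_to_index`, `parse_range`, and `ranges_overlap` are hypothetical helpers, not functions exported by the library or the benchmark.

```python
import re

_CELL = re.compile(r"^\$?([A-Za-z]{1,3})\$?(\d+)$")

def col_to_index(letters):
    """Convert column letters to a 1-based index (A -> 1, Z -> 26, AA -> 27)."""
    index = 0
    for ch in letters.upper():
        index = index * 26 + (ord(ch) - ord("A") + 1)
    return index

def parse_range(ref):
    """Parse 'Sheet!A1:B10' into (sheet, min_col, min_row, max_col, max_row)."""
    sheet, _, cells = ref.rpartition("!")
    start, _, end = cells.partition(":")
    corners = []
    for part in (start, end or start):  # a single cell is a 1x1 range
        match = _CELL.match(part)
        if match is None:
            raise ValueError(f"unsupported cell reference: {part!r}")
        corners.append((col_to_index(match.group(1)), int(match.group(2))))
    (c1, r1), (c2, r2) = corners
    return sheet.strip("'"), min(c1, c2), min(r1, r2), max(c1, c2), max(r1, r2)

def ranges_overlap(chunk_ref, truth_ref):
    """True if both ranges are on the same sheet and share at least one cell."""
    s1, c1a, r1a, c1b, r1b = parse_range(chunk_ref)
    s2, c2a, r2a, c2b, r2b = parse_range(truth_ref)
    if s1 != s2:
        return False
    return c1a <= c2b and c2a <= c1b and r1a <= r2b and r2a <= r1b

print(ranges_overlap("Revenue!A1:F18", "Revenue!C7"))  # True
print(ranges_overlap("Revenue!A1:F18", "Costs!C7"))    # False
```

A parser that emits no per-chunk range has nothing to feed into a test like this, which is consistent with the 0.000 score reported for Docling.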

🔁 Reproduce

```bash
make corpus-download   # one-time, ~100 MB; gitignored under data/corpora/
make bench             # robustness + retrieval, ~50 min on M-series CPU
open tests/benchmarks/reports/COMPARISON.md
```

Full methodology, capability matrix, error breakdown, and caveats live in tests/benchmarks/reports/COMPARISON.md. Adapter design notes in tests/benchmarks/README.md.

✨ What you get, at a glance

🧾 Typed cell graph: values, formulas, styles, coords	🧭 Citation URIs: file.xlsx#Sheet!A1:F18	🧮 Dependency graph: upstream · downstream · cycles	🧩 RAG-ready chunks: HTML + text + token count
📊 All 7 chart types: bar · line · pie · scatter · area · radar · bubble	🎨 Conditional formatting: every Excel rule type	📋 Tables & merges: ListObjects + master/slave	🔐 Safe by default: no macros · no external links · ZIP-bomb guard
⚡ Fast: 1,054 workbooks in 70 s in CI	🧬 Deterministic: xxhash64 content addressing	🧰 Framework-agnostic: LangChain · LangGraph · CrewAI · MCP	📜 MIT licensed: use it, fork it, ship it

⭐ If this helps you

This project is free, open source (MIT), and part of the Knowledge Stack ecosystem — document intelligence for agents. Stars, contributions, and honest feedback are all first-class ways to keep the lights on.

Jump into the community:

💬 Discord — real-time help, roadmap conversations, show off what you're building. Drop in, say hi.
🗣 GitHub Discussions — async Q&A, RFCs, and long-form ideas.
🐞 Issues — report a bug, request a feature, or file a parser edge case.
🎯 Show & Tell — tell us about your production use.
🔐 Security — private vulnerability disclosure.
🙌 Contribute — every PR is reviewed; good-first-issue labels live on Issues.
🧰 Knowledge Stack org — see the rest of the ecosystem (ks-cookbook, ks-xlsx-parser, more on the way).

Not sure where to start? Run make bench-robust on SpreadsheetBench, find a file that breaks, and open a Parser edge case issue. That's the fastest path to a merged PR.

🚀 30-second demo

```bash
pip install ks-xlsx-parser
```

```python
import json

from ks_xlsx_parser import parse_workbook

result = parse_workbook(path="q4_forecast.xlsx")

# LLM-ready chunks with citation URIs
for chunk in result.chunks:
    print(chunk.source_uri)          # q4_forecast.xlsx#Revenue!A1:F18
    print(chunk.token_count)         # 412
    print(chunk.render_text[:200])   # Pipe-delimited Markdown-ish text
    print(chunk.render_html[:200])   # HTML with proper colspan/rowspan

# Or dump the whole workbook graph; the context manager closes the file
with open("workbook.json", "w", encoding="utf-8") as fh:
    json.dump(result.to_json(), fh, default=str)
```

That's it. Every chunk has:

source_uri — cite back to exact cells
render_text / render_html — LLM-consumable bodies
token_count — cap your context window properly
dependency_summary — upstream/downstream formulas
content hash — dedupe across versions
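As a rough illustration of how `token_count` can cap a prompt, the sketch below greedily packs chunks into a fixed token budget. The `Chunk` dataclass is a hypothetical stand-in for the library's chunk objects (only `source_uri` and `token_count` are taken from the list above), `pack_chunks` is not part of the library, and the second and third URIs and token counts are invented for the example.

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    """Hypothetical stand-in exposing two of the chunk fields listed above."""
    source_uri: str
    token_count: int

def pack_chunks(chunks, budget):
    """Greedily keep chunks, in the given order, while they fit the budget."""
    selected, used = [], 0
    for chunk in chunks:
        if used + chunk.token_count > budget:
            continue  # skip this one; a later, smaller chunk may still fit
        selected.append(chunk)
        used += chunk.token_count
    return selected, used

chunks = [
    Chunk("q4_forecast.xlsx#Revenue!A1:F18", 412),
    Chunk("q4_forecast.xlsx#Costs!A1:H40", 900),     # invented example value
    Chunk("q4_forecast.xlsx#Summary!A1:C10", 150),   # invented example value
]
selected, used = pack_chunks(chunks, budget=600)
print([c.source_uri for c in selected], used)
```

In a retrieval setting the input list would normally be sorted by relevance first, so the budget is spent on the best-matching chunks.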

🗺️ Table of Contents

🏁 Benchmark — vs Docling on SpreadsheetBench
🤔 Why a dedicated XLSX parser for LLMs?
🏗️ Architecture
📦 Installation
📚 Documentation
⚔️ How it compares
🎯 Who this is for
📊 Benchmarks
🚧 Limitations
🧰 Knowledge Stack ecosystem
📡 Stay in touch
🙌 Contributing
❓ FAQ
📜 License

🤔 Why a dedicated XLSX parser for LLMs?

Most Excel libraries answer one of two questions well: "read a rectangle of values" (pandas, openpyxl) or "run Excel headless" (xlwings, LibreOffice). ks-xlsx-parser answers a third one: "give me a structured, inspectable, loss-minimising graph that an LLM or auditor can reason about."

Output	Why an LLM cares
Typed cell graph (values, formulas, styles, coordinates)	Round-trips to JSON/DB/vector store without losing formulas or data types
Formula AST + directed dependency graph	Answer "what drives Q4 revenue?" via upstream traversal
Detected tables, merged regions, layout blocks	Multi-table sheets no longer collapse into one giant CSV
Chart extractions (bar / line / pie / scatter / area / radar / bubble)	Text summaries the model can read
Token-counted render chunks (HTML + pipe-text)	Plug straight into an embedding pipeline without blowing context
Citation-ready source URIs (`sheet!A1:B10`)	The LLM can cite the exact cell it's talking about
Deterministic content hashes (xxhash64)	Dedupe across versions, detect change between uploads

Everything is deterministic, everything is tested on a 1054-workbook stress corpus, and everything is open source.
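To make the "what drives Q4 revenue?" row concrete, here is a minimal sketch of an upstream traversal over a dependency graph. The graph is modelled as a plain dict mapping each cell to the cells its formula reads; that shape, the `upstream` helper, and the cell addresses are assumptions for illustration, not the library's actual graph API.

```python
from collections import deque

def upstream(cell, precedents):
    """Return every cell that `cell` depends on, directly or transitively."""
    seen, queue = set(), deque([cell])
    while queue:
        current = queue.popleft()
        for parent in precedents.get(current, ()):
            if parent not in seen:  # the seen-set also stops circular references
                seen.add(parent)
                queue.append(parent)
    return seen

# Hypothetical workbook: a summary cell fed by two sheet totals.
precedents = {
    "Summary!B2": ["Revenue!F18", "Costs!F18"],
    "Revenue!F18": ["Revenue!B18", "Revenue!C18"],
    "Costs!F18": ["Costs!B18"],
}
print(sorted(upstream("Summary!B2", precedents)))
```

Reversing the edges of the same dict gives the downstream view, that is, everything that would change if a given input cell changed.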

🏗️ Architecture

The pipeline runs 8 deterministic stages: parse → analyse → annotate → segment → render → serialise → verify → compare/export. Full diagram, stage-by-stage breakdown, and module map in docs/wiki/Architecture.md. Stage internals in Pipeline Internals.

[!NOTE] The importable module is xlsx_parser; ks_xlsx_parser is a re-export matching the PyPI package name. The package is fully type-annotated (py.typed is shipped).

📦 Installation

Requires Python 3.10+.

```bash
pip install ks-xlsx-parser                 # core library
pip install "ks-xlsx-parser[api]"          # + FastAPI web server
pip install "ks-xlsx-parser[dev]"          # + test tooling
```

From source:

```bash
git clone https://github.com/knowledgestack/ks-xlsx-parser.git
cd ks-xlsx-parser
make install           # pip install -e ".[dev,api]"
make test              # default suite
make corpus-download   # fetch SpreadsheetBench (5,458 real-world xlsx)
make bench-robust      # parse-success + structural counts vs Docling
make bench-retrieval   # retrieval recall@k vs Docling
```

Runtime deps: openpyxl, pydantic, lxml, xxhash, tiktoken.

📚 Documentation

All implementation detail lives under docs/wiki/ (mirrored to the GitHub Wiki on each release) so this README stays scannable:

🚀 Quick Start — parse, iterate chunks, walk the dep graph, serialise, parse from bytes. Five short snippets, ~90 % of real usage.
📖 API Reference — full signatures for parse_workbook, compare_workbooks, export_importer, StageVerifier.
🌐 Web API — the bundled FastAPI server, Python + TypeScript clients, deployment notes.
📦 Data Models — every Pydantic DTO field by field.
🛠 Pipeline Internals — where to hook in if you want to extend the parser.
📜 Workbook Graph Spec — canonical schema for the output.
🐛 Known Issues — documented edge cases.
📝 CHANGELOG — release history.

⚔️ How it compares

This is the structural capability matrix. For head-to-head retrieval numbers (recall@k, geometric, latency) on a 912-instance real-world corpus, see 🏁 Benchmark — ks-xlsx-parser vs Docling on SpreadsheetBench up top.

Capability	pandas / openpyxl	Docling	`ks-xlsx-parser`
Reads values	✅	✅	✅
Keeps formulas	⚠️ raw string	❌	✅ parsed + dependency graph
Preserves merges	⚠️ coords only	⚠️ partial	✅ master/slave with colspan/rowspan
Extracts charts	❌	❌	✅ all 7 chart types + text summary
Conditional formatting	❌	❌	✅ cell/color-scale/icon/data-bar/formula
Data validation (dropdowns)	❌	❌	✅ all types incl. cross-sheet lists
Multi-table sheet layout	❌	⚠️	✅ adaptive-gap segmentation
Per-chunk source URI (citation)	❌	⚠️	✅ `file.xlsx#Sheet!A1:F18`
Token counts per chunk	❌	❌	✅ via `tiktoken`
Dependency graph traversal	❌	❌	✅ upstream / downstream, cycle detection
Deterministic content hashes	❌	❌	✅ xxhash64 per cell / block / chunk
Streaming `.xlsx` > 100 MB	⚠️	❌	✅ (chunked parse)

Most tools give you a dataframe. ks-xlsx-parser gives you a graph an LLM can cite.
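A citation in the `file.xlsx#Sheet!A1:F18` form shown in the table can be split back into its parts with a few lines of standard-library Python. This is a sketch under the assumption that the file name contains no `#`; `Citation` and `parse_source_uri` are hypothetical names, not part of the library.

```python
from typing import NamedTuple

class Citation(NamedTuple):
    file: str
    sheet: str
    cell_range: str

def parse_source_uri(uri):
    """Split 'file.xlsx#Sheet!A1:F18' into file, sheet, and cell range."""
    file, sep, anchor = uri.partition("#")
    if not sep or "!" not in anchor:
        raise ValueError(f"not a citation URI: {uri!r}")
    # rpartition keeps any '!' inside the sheet name on the sheet side
    sheet, _, cell_range = anchor.rpartition("!")
    return Citation(file, sheet, cell_range)

citation = parse_source_uri("q4_forecast.xlsx#Revenue!A1:F18")
print(citation.file, citation.sheet, citation.cell_range)
```

A UI could use the three parts to open the right file, switch to the sheet, and highlight the cited cells.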

Looking for a tiny, edge-runtime I/O library with write support? See hucre by @productdevbook. For an unbiased head-to-head on the SpreadsheetBench corpus — perf numbers, extraction-count parity, where each side wins — see the wiki: ks-xlsx-parser vs hucre.

🎯 Who this is for

Teams shipping agents, RAG pipelines, or auditing tools that ingest Excel.

🏦 Banking & Finance: KPI extraction, formula lineage, regulator-ready citations
⚖️ Legal & Contracts: schedules, fee tables, covenant matrices without flattening merges
🏥 Healthcare & Insurance: normalise claims, pricing, and actuarial sheets into auditable JSON
🏗️ Real Estate & Construction: quantity takeoffs and cost models that still live in XLSX
📈 Sales Ops / HR / Engineering: "source of truth is a spreadsheet" → structured events, in minutes

[!IMPORTANT] Not a fit if you need to execute Excel (recalculate, run VBA, pivot-refresh). Use xlwings or a headless Excel for that. ks-xlsx-parser reads; it doesn't run.

📊 Benchmarks

We benchmark against SpreadsheetBench v0.1 — 912 instruction × xlsx tasks (5,458 unique workbooks) covering financial models, project trackers, HR records, scientific data, and a long tail of small business spreadsheets.

Benchmark	What it measures	Cost
`make bench-robust`	Parse-success rate + structural counts vs Docling	~20 min
`make bench-retrieval`	Top-k retrieval recall + table fragmentation rate vs Docling	~40 min

Headline numbers and methodology live in tests/benchmarks/reports/COMPARISON.md. The corpus is downloaded on demand (make corpus-download) and gitignored — nothing is committed to the repo.

🚧 Limitations

.xls not supported — only .xlsx and .xlsm (OOXML). Convert legacy files externally.
Pivot tables — detected but not fully parsed.
Sparklines — not extracted.
VBA macros — flagged but never executed or analysed.
External links — recorded but not resolved.
Threaded comments — only legacy comments are supported (openpyxl limitation).
Embedded OLE objects — detected but not extracted.
Locale-dependent number formats — not interpreted.

Full list in docs/PARSER_KNOWN_ISSUES.md.
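Because legacy .xls files are rejected, an upload handler may want to tell the two container formats apart before calling the parser. The sketch below does that from the leading bytes using only the standard library; `classify_workbook` is a hypothetical helper, not something the library ships. Two caveats: the OOXML check also matches .docx and .pptx packages, and an OLE2 container can be a password-protected OOXML file rather than a legacy .xls.

```python
import io
import zipfile

# Signature of an OLE2 / Compound File Binary container (used by legacy .xls).
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")

def classify_workbook(data):
    """Return 'ooxml-package', 'ole2', or 'unknown' for raw file bytes."""
    if data.startswith(OLE2_MAGIC):
        return "ole2"
    if data.startswith(b"PK") and zipfile.is_zipfile(io.BytesIO(data)):
        with zipfile.ZipFile(io.BytesIO(data)) as archive:
            # Every OOXML package carries this part at the archive root.
            if "[Content_Types].xml" in archive.namelist():
                return "ooxml-package"
    return "unknown"

# Build a tiny in-memory archive that only mimics the OOXML marker file.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("[Content_Types].xml", "<Types/>")

print(classify_workbook(buffer.getvalue()))          # ooxml-package
print(classify_workbook(OLE2_MAGIC + b"\x00" * 16))  # ole2
print(classify_workbook(b"id,amount\n1,2\n"))        # unknown
```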

🧰 Knowledge Stack ecosystem

ks-xlsx-parser is one piece of the Knowledge Stack open-source family — document intelligence for agents, built so that engineering teams can focus on agents and we handle the messy parts of enterprise data.

Repo	What it does
ks-cookbook	32 production-style flagship agents + recipes for LangChain, LangGraph, CrewAI, Temporal, the OpenAI Agents SDK, and any MCP client.
ks-xlsx-parser (this repo)	Turn `.xlsx` into LLM-ready JSON with citations and dependency graphs.
@knowledgestack	Follow the org for upcoming repos — parsers, extractors, and MCP servers for PDF, DOCX, PPTX, HTML, and more.

Building on top of the stack? Tell us about it in Show & Tell or the #showcase channel on Discord.

📡 Stay in touch

💬 Join the Discord — our main real-time channel. Roadmap, help, job postings, show-and-tell, and the occasional meme.
🐙 Follow @knowledgestack on GitHub for new releases across the ecosystem.
📣 Watch this repo (→ Releases only) to get pinged when ks-xlsx-parser ships an update.

If you'd rather just peek first — run the benchmark suite against the public SpreadsheetBench corpus (make corpus-download && make bench-robust) and file an issue if your Excel does something weirder than ours.

🙌 Contributing

We love contributions. Three paths, in order of speed-to-merge:

Report a benchmark failure — run make bench-robust on SpreadsheetBench, find a file that breaks, attach it to a Parser edge case issue.
Submit an adversarial workbook — open a Parser edge case issue with the file attached; we'll fold it into the suite.
Fix a flagged issue — see docs/PARSER_KNOWN_ISSUES.md.

Full dev loop, PR checklist, and code style in CONTRIBUTING.md. See the Code of Conduct and Security policy before posting.

If you don't have time to contribute but the project helped you, please star the repo. That's the main signal that keeps this maintained.

❓ FAQ

What is the best Python library to parse Excel (.xlsx) for LLMs?

ks-xlsx-parser is purpose-built for it. Unlike pandas or openpyxl, it preserves formulas with a directed dependency graph, merged regions, tables, charts, and conditional formatting, and emits token-counted chunks with source_uri citations an LLM can quote. Install it with pip install ks-xlsx-parser.

How do I parse Excel for a LangChain or LangGraph agent?

Call parse_workbook(path=...), then expose result.chunks as a LangChain @tool or a LangGraph ToolNode. Each chunk carries source_uri, render_text, token_count, and a dependency_summary — everything the agent needs to cite and reason.

How do I use Excel in a CrewAI or OpenAI-Agents-SDK agent?

Same pattern — wrap parse_workbook in whatever tool abstraction your framework provides (@tool in CrewAI, @function_tool in the OpenAI Agents SDK). The parser's output is framework-agnostic.
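The wrapping pattern can be sketched without committing to any one framework. In the example below the parse function is injected, so the block runs without the library installed; in real use you would pass `parse_workbook` and then register the returned function with your framework's tool decorator. `make_excel_tool`, `read_excel`, the output keys, and `fake_parse` are all hypothetical; only `source_uri`, `token_count`, and `render_text` come from the chunk fields documented above.

```python
from itertools import islice
from types import SimpleNamespace

def make_excel_tool(parse_fn):
    """Build a plain function that an agent framework can register as a tool."""
    def read_excel(path: str, max_chunks: int = 5) -> list[dict]:
        """Return the first chunks of a workbook together with their citations."""
        result = parse_fn(path=path)
        return [
            {
                "citation": chunk.source_uri,
                "tokens": chunk.token_count,
                "text": chunk.render_text,
            }
            for chunk in islice(result.chunks, max_chunks)
        ]
    return read_excel

def fake_parse(path):
    """Stub that mimics the documented result shape, for demonstration only."""
    chunk = SimpleNamespace(
        source_uri=f"{path}#Revenue!A1:F18",
        token_count=412,
        render_text="Region | Q4",
    )
    return SimpleNamespace(chunks=[chunk])

tool = make_excel_tool(fake_parse)
print(tool("q4_forecast.xlsx"))
```

Returning plain dicts keeps the tool output JSON-serialisable, which most agent frameworks expect.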

Can Claude Desktop, Cursor, Windsurf, or another MCP client read Excel files?

Yes, over HTTP for now: run the bundled FastAPI server (pip install "ks-xlsx-parser[api]", then xlsx-parser-api) and call POST /parse. A native MCP server is on the Knowledge Stack roadmap.

How do I build a RAG pipeline over Excel spreadsheets?

Three steps: pip install ks-xlsx-parser, call parse_workbook() on each file, then result.serializer.to_vector_store_entries() to get id + text + metadata triples ready for Qdrant, pgvector, Weaviate, or Pinecone. Every entry has a content_hash for dedup and a source_uri the LLM cites in its answer.
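The dedup step can be illustrated with plain dicts. The sketch assumes each entry is a dict with `id`, `text`, and `metadata` keys and that `content_hash` and `source_uri` sit inside `metadata`; that layout is a guess for illustration, so check the Data Models page for the real field locations. `dedupe_entries` is a hypothetical helper, and the hash values are made up.

```python
def dedupe_entries(entries, already_indexed):
    """Keep only entries whose content_hash has not been indexed yet."""
    seen = set(already_indexed)
    fresh = []
    for entry in entries:
        digest = entry["metadata"]["content_hash"]
        if digest in seen:
            continue  # unchanged chunk from an earlier upload, or a repeat
        seen.add(digest)
        fresh.append(entry)
    return fresh

entries = [
    {"id": "c1", "text": "Region | Q4", "metadata": {
        "content_hash": "aaa111", "source_uri": "q4.xlsx#Revenue!A1:F18"}},
    {"id": "c2", "text": "Cost | Q4", "metadata": {
        "content_hash": "bbb222", "source_uri": "q4.xlsx#Costs!A1:F18"}},
    {"id": "c3", "text": "Region | Q4", "metadata": {
        "content_hash": "aaa111", "source_uri": "q4_v2.xlsx#Revenue!A1:F18"}},
]
fresh = dedupe_entries(entries, already_indexed={"bbb222"})
print([entry["id"] for entry in fresh])
```

Only the entries in `fresh` then need to be embedded and upserted, which saves embedding cost when a workbook is re-uploaded with small changes.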

How is ks-xlsx-parser different from openpyxl or pandas?

openpyxl and pandas give you a rectangle of values. ks-xlsx-parser gives you the full workbook graph: parsed formulas with dependency edges, merged regions, Excel ListObjects, all 7 chart types, every conditional-formatting rule type, and LLM chunks with citation URIs + token counts. It wraps openpyxl and uses lxml for the bits openpyxl loses.

Does ks-xlsx-parser run Excel formulas or macros?

No. The library reads .xlsx files; it never executes them. VBA macros are flagged but never run. External links are recorded but never resolved. ZIP-bomb and cell-count limits make it safe for untrusted uploads.
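For readers who want to see what a ZIP-bomb guard can look like, the sketch below rejects archives whose declared uncompressed size or compression ratio is implausible. It is not the library's implementation: `looks_like_zip_bomb` and both thresholds are assumptions chosen for the example, and because declared sizes in ZIP headers can be forged, a production guard would also cap the bytes actually read.

```python
import io
import zipfile

def looks_like_zip_bomb(data, max_total=200 * 1024 * 1024, max_ratio=100):
    """Flag archives whose declared size or compression ratio is suspicious."""
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        infos = archive.infolist()
    total = sum(info.file_size for info in infos)                 # uncompressed
    compressed = sum(info.compress_size for info in infos) or 1   # avoid /0
    return total > max_total or total / compressed > max_ratio

# One megabyte of a single repeated byte compresses extremely well.
suspicious = io.BytesIO()
with zipfile.ZipFile(suspicious, "w", zipfile.ZIP_DEFLATED) as archive:
    archive.writestr("xl/worksheets/sheet1.xml", b"0" * 1_000_000)

# A tiny stored (uncompressed) member has a ratio of 1.
harmless = io.BytesIO()
with zipfile.ZipFile(harmless, "w") as archive:
    archive.writestr("a.txt", "hello")

print(looks_like_zip_bomb(suspicious.getvalue()))
print(looks_like_zip_bomb(harmless.getvalue()))
```

Real worksheets can compress very well too, so the ratio threshold needs tuning against legitimate files before it is used to reject uploads.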

How fast is it?

SpreadsheetBench's full 5,458-workbook corpus parses end-to-end in roughly 20 minutes on a single machine, with a median (P50) parse time in the low double-digit milliseconds. A real 21k-cell, 13-sheet financial model parses in ~4.6 s (down from 307 s before 0.1.1, after a circular-reference caching fix). Sparse workbooks with extreme addresses parse in under 200 ms.

🔎 Also known as

Search queries this library answers: Python Excel parser for LLMs, XLSX to JSON for LangChain, Excel ingestion for LangGraph, spreadsheet reader for CrewAI, Excel tool for OpenAI Agents SDK, Excel for Claude Desktop, Excel for Cursor, Excel MCP server, openpyxl alternative for RAG, Excel dependency graph extractor, XLSX OOXML parser for AI, how to parse Excel for an LLM agent, how to feed a spreadsheet to ChatGPT, how to cite Excel cells in an LLM answer, best library to turn Excel into JSON, Python library for parsing formulas, Excel formula dependency traversal, document intelligence for spreadsheets, RAG over Excel files, Excel chunker with token counts, parse .xlsx for Qdrant / pgvector / Weaviate / Pinecone.

📜 License

MIT. Use it, fork it, ship it. Attribution appreciated but not required.

If you ship something built on top of ks-xlsx-parser, we'd love a Show & Tell post or a shoutout on Discord.

Release history

0.2.1 (this version): May 19, 2026
0.2.0: May 11, 2026
0.1.1: Mar 7, 2026

Download files

Source Distribution

ks_xlsx_parser-0.2.1.tar.gz (151.2 kB)

Uploaded May 19, 2026 Source

Built Distribution

ks_xlsx_parser-0.2.1-py3-none-any.whl (134.2 kB)

Uploaded May 19, 2026 Python 3

File details

Details for the file ks_xlsx_parser-0.2.1.tar.gz.

File metadata

Download URL: ks_xlsx_parser-0.2.1.tar.gz
Upload date: May 19, 2026
Size: 151.2 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for ks_xlsx_parser-0.2.1.tar.gz
Algorithm	Hash digest
SHA256	`f98399a86d4f1f48b82efd093d30caaeb18e734056173158e3462fd9899d45ff`
MD5	`b2eedaca41a9e2d96a434f1619360278`
BLAKE2b-256	`35336fa08bd2af59a80e0c45d508039d2b2c0ba26b537eee8ce3b3077706a06e`

Provenance

The following attestation bundles were made for ks_xlsx_parser-0.2.1.tar.gz:

Publisher: release.yml on knowledgestack/excel-parser

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: ks_xlsx_parser-0.2.1.tar.gz
- Subject digest: f98399a86d4f1f48b82efd093d30caaeb18e734056173158e3462fd9899d45ff
- Sigstore transparency entry: 1575501323
- Sigstore integration time: May 19, 2026
Source repository:
- Permalink: knowledgestack/excel-parser@d8fd418e1e35ecefe485ab0ffc09482e5338b61c
- Branch / Tag: refs/tags/v0.2.1
- Owner: https://github.com/knowledgestack
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@d8fd418e1e35ecefe485ab0ffc09482e5338b61c
- Trigger Event: push

File details

Details for the file ks_xlsx_parser-0.2.1-py3-none-any.whl.

File metadata

Download URL: ks_xlsx_parser-0.2.1-py3-none-any.whl
Upload date: May 19, 2026
Size: 134.2 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for ks_xlsx_parser-0.2.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`012f0774e78b34800c85d2be88357e20629592a0f40ed12547ffb5df01e4e620`
MD5	`3c82786564ab06d3209449430a84ff2e`
BLAKE2b-256	`7c5cf1e9a48fbb9da6896830e0ff4cddde5ec9c41e455af768c9df589dbb1703`

Provenance

The following attestation bundles were made for ks_xlsx_parser-0.2.1-py3-none-any.whl:

Publisher: release.yml on knowledgestack/excel-parser

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: ks_xlsx_parser-0.2.1-py3-none-any.whl
- Subject digest: 012f0774e78b34800c85d2be88357e20629592a0f40ed12547ffb5df01e4e620
- Sigstore transparency entry: 1575501329
- Sigstore integration time: May 19, 2026
Source repository:
- Permalink: knowledgestack/excel-parser@d8fd418e1e35ecefe485ab0ffc09482e5338b61c
- Branch / Tag: refs/tags/v0.2.1
- Owner: https://github.com/knowledgestack
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@d8fd418e1e35ecefe485ab0ffc09482e5338b61c
- Trigger Event: push
