
Production-grade Excel Workflow Parser for RAG + auditability systems

Project description

ks-xlsx-parser


Knowledge Stack

Make XLSX LLM Ready

ks-xlsx-parser is the open-source Python library that parses Excel (.xlsx) files into citation-ready JSON for LLMs, RAG pipelines, and AI agents (LangChain, LangGraph, CrewAI, OpenAI Agents SDK, Claude, MCP).


[!TIP] .xlsx → structured, typed, citation-ready JSON that an LLM can actually reason about. Cells, formulas, merged regions, tables, charts, conditional formatting, dependency graphs, and RAG-ready chunks: deterministic, fully tested, MIT.

ks-xlsx-parser highlighting a financial model on the left and emitting typed, citation-linked chunks on the right
Raw workbook on the left (financial_model.xlsx) → parser output on the right: 4 chunks, each tied back to an exact sheet!range, ready to cite in an LLM response.

Spreadsheets are still the #1 unstructured data source in the enterprise. Feeding a .xlsx directly to an LLM loses structure (rows, formulas, merges), loses provenance (which cell said what), and blows through context windows. ks-xlsx-parser turns an Excel workbook into a token-counted, source-addressable graph that drops straight into LangChain, LangGraph, CrewAI, the OpenAI Agents SDK, or any MCP-aware client (Claude Desktop, Cursor, Windsurf, Zed, …).



Benchmark: ks-xlsx-parser vs Docling on SpreadsheetBench


Apples-to-apples on SpreadsheetBench v0.1: 912 real-world task instances curated from ExcelHome / Mr.Excel / r/excel. For each instance we parse the input .xlsx, embed every chunk with BAAI/bge-small-en-v1.5, then check whether the chunk containing the ground-truth answer is in the top-k by similarity to the question.
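To make the metric concrete, here is a minimal sketch of how recall@k can be computed once each question's chunks have been ranked by similarity. The function name `recall_at_k` and the boolean-list input shape are illustrative assumptions, not the benchmark's actual code, and the embedding step itself is left out.

```python
from typing import Sequence


def recall_at_k(ranked_hits: Sequence[Sequence[bool]], k: int) -> float:
    """Fraction of questions whose top-k ranked chunks include a correct one.

    ranked_hits[i][j] is True when the j-th most similar chunk for
    question i contains the ground-truth answer (hypothetical input shape).
    """
    if not ranked_hits:
        return 0.0
    found = sum(1 for hits in ranked_hits if any(hits[:k]))
    return found / len(ranked_hits)


# Toy data: three questions, chunks already sorted by similarity.
toy = [
    [True, False, False],   # answer in the top-ranked chunk
    [False, False, True],   # answer only at rank 3
    [False, False, False],  # answer never retrieved
]
r1 = recall_at_k(toy, 1)  # one of three questions hit at rank 1
r3 = recall_at_k(toy, 3)  # two of three questions hit within rank 3
```

Because a question counts as a hit as soon as any chunk in the top k matches, recall@k can only stay flat or rise as k grows, which is the pattern in the table below.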

| Metric | ks-xlsx-parser | Docling 2.93 | Δ |
| --- | --- | --- | --- |
| Parse success (5,458-file corpus) | 99.945% (5,461 ok · 3 timeouts · 0 errors) | not run at scale | - |
| Recall@1 (text-match) | 0.580 | 0.579 | tied |
| Recall@3 (text-match) | 0.697 | 0.670 | +2.7 pp |
| Recall@5 (text-match) | 0.704 | 0.686 | +1.8 pp |
| Geometric Recall@5 (chunk's sheet!A1:Z99 overlaps the ground-truth range) | 0.369 | 0.000 | citation-grade only |
| Mean parse time (per file) | 251 ms | 265 ms | ~5% faster |
| Parser errors (across 912 instances) | 0 | 0 | - |

What the numbers mean

  • ks-xlsx-parser ties at recall@1 and wins recall@3 (+2.7 pp) and recall@5 (+1.8 pp). Text-match recall is parser-agnostic: it asks whether any parser surfaced a chunk containing the answer string, after normalising commas, percent signs, ISO dates, and booleans on both sides.
  • ks-xlsx-parser wins citation-grade (geometric) recall outright (0.369 vs 0.000). Docling produces markdown without per-chunk sheet!range anchors, so it can't render a citation that points at the exact source cells. This is the difference between "the answer is somewhere in the workbook" and "the answer is in Revenue!C7."
  • Marker is excluded by design. Its xlsx → HTML → PDF → layout-recognition pipeline clocks >30 min per workbook on CPU. The benchmark framework supports adding a Marker adapter when GPU is available; see tests/benchmarks/adapters/docling_adapter.py as a template.
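The geometric check in the second bullet comes down to a rectangle-overlap test on A1 ranges. The sketch below is one way to write that test; `parse_range` and `ranges_overlap` are hypothetical helper names, not functions from the benchmark or the library, and sheet names are compared as plain strings (so quoted names must be quoted the same way on both sides).

```python
import re

_CELL = re.compile(r"^\$?([A-Za-z]{1,3})\$?(\d+)$")


def _col_to_index(letters: str) -> int:
    """Convert column letters to a 1-based index (A -> 1, Z -> 26, AA -> 27)."""
    n = 0
    for ch in letters.upper():
        n = n * 26 + (ord(ch) - ord("A") + 1)
    return n


def parse_range(a1: str) -> tuple[str, int, int, int, int]:
    """Parse 'Sheet!A1:B2' into (sheet, min_col, min_row, max_col, max_row)."""
    sheet, _, cells = a1.rpartition("!")
    start, _, end = cells.partition(":")
    end = end or start  # a single cell is a 1x1 range
    coords = []
    for ref in (start, end):
        m = _CELL.match(ref)
        if m is None:
            raise ValueError(f"not an A1 reference: {ref!r}")
        coords.append((_col_to_index(m.group(1)), int(m.group(2))))
    (c1, r1), (c2, r2) = coords
    return sheet, min(c1, c2), min(r1, r2), max(c1, c2), max(r1, r2)


def ranges_overlap(a: str, b: str) -> bool:
    """True when both ranges are on the same sheet and share at least one cell."""
    sa, ac1, ar1, ac2, ar2 = parse_range(a)
    sb, bc1, br1, bc2, br2 = parse_range(b)
    return sa == sb and ac1 <= bc2 and bc1 <= ac2 and ar1 <= br2 and br1 <= ar2


hit = ranges_overlap("Revenue!A1:F18", "Revenue!C7")
miss = ranges_overlap("Revenue!A1:B2", "Revenue!C3:D4")
```

A parser that emits no sheet!range anchor has nothing to feed into a test like this, which is why its geometric score is zero regardless of text quality.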

Reproduce

make corpus-download   # one-time, ~100 MB; gitignored under data/corpora/
make bench             # robustness + retrieval, ~50 min on M-series CPU
open tests/benchmarks/reports/COMPARISON.md

Full methodology, capability matrix, error breakdown, and caveats live in tests/benchmarks/reports/COMPARISON.md. Adapter design notes in tests/benchmarks/README.md.


What you get, at a glance

  • Typed cell graph: values, formulas, styles, coords
  • Citation URIs: file.xlsx#Sheet!A1:F18
  • Dependency graph: upstream · downstream · cycles
  • RAG-ready chunks: HTML + text + token count
  • All 7 chart types: bar · line · pie · scatter · area · radar · bubble
  • Conditional formatting: every Excel rule type
  • Tables & merges: ListObjects + master/slave
  • Safe by default: no macros · no external links · ZIP-bomb guard
  • Fast: 1054 workbooks / 70s in CI
  • Deterministic: xxhash64 content addressing
  • Framework-agnostic: LangChain · LangGraph · CrewAI · MCP
  • MIT licensed: use it, fork it, ship it

If this helps you

This project is free, open source (MIT), and part of the Knowledge Stack ecosystem: document intelligence for agents. Stars, contributions, and honest feedback are all first-class ways to keep the lights on.

Jump into the community:

  • Discord: real-time help, roadmap conversations, show off what you're building. Drop in, say hi.
  • GitHub Discussions: async Q&A, RFCs, and long-form ideas.
  • Issues: report a bug, request a feature, or file a parser edge case.
  • Show & Tell: tell us about your production use.
  • Security: private vulnerability disclosure.
  • Contribute: every PR is reviewed; good-first-issue labels live on Issues.
  • Knowledge Stack org: see the rest of the ecosystem (ks-cookbook, ks-xlsx-parser, more on the way).

Not sure where to start? Run make bench-robust on SpreadsheetBench, find a file that breaks, open a Parser edge case. That's the fastest path to a merged PR.


30-second demo

pip install ks-xlsx-parser

import json

from ks_xlsx_parser import parse_workbook

result = parse_workbook(path="q4_forecast.xlsx")

# LLM-ready chunks with citation URIs
for chunk in result.chunks:
    print(chunk.source_uri)          # q4_forecast.xlsx#Revenue!A1:F18
    print(chunk.token_count)         # 412
    print(chunk.render_text[:200])   # Pipe-delimited Markdown-ish text
    print(chunk.render_html[:200])   # HTML with proper colspan/rowspan

# Or dump the whole workbook graph; the context manager closes the file
with open("workbook.json", "w") as fh:
    json.dump(result.to_json(), fh, default=str)

That's it. Every chunk has:

  • source_uri: cite back to exact cells
  • render_text / render_html: LLM-consumable bodies
  • token_count: cap your context window properly
  • dependency_summary: upstream/downstream formulas
  • content hash: dedupe across versions
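As an illustration of how token_count and source_uri can work together, the sketch below packs already-ranked chunks into a fixed token budget and labels each one with its citation. The `Chunk` dataclass is a hypothetical stand-in that mirrors only three of the fields listed above, so the example runs without the library; `pack_chunks` and `build_context` are likewise illustrative names.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    """Hypothetical stand-in exposing the three chunk fields used below."""
    source_uri: str
    token_count: int
    render_text: str


def pack_chunks(chunks: list[Chunk], budget: int) -> list[Chunk]:
    """Greedily keep chunks, in the order given, while they fit the budget."""
    picked: list[Chunk] = []
    used = 0
    for chunk in chunks:
        if used + chunk.token_count > budget:
            continue  # too big for the remaining budget; a later one may fit
        picked.append(chunk)
        used += chunk.token_count
    return picked


def build_context(chunks: list[Chunk]) -> str:
    """Prefix each chunk body with its citation so the model can quote it."""
    return "\n\n".join(f"[{c.source_uri}]\n{c.render_text}" for c in chunks)


ranked = [
    Chunk("q4_forecast.xlsx#Revenue!A1:F18", 412, "Region | Q1 | Q2"),
    Chunk("q4_forecast.xlsx#Costs!A1:H40", 900, "Item | Cost"),
    Chunk("q4_forecast.xlsx#Notes!A1:B5", 60, "Assumption | Value"),
]
picked = pack_chunks(ranked, budget=500)
context = build_context(picked)
```

Greedy packing is the simplest policy; it keeps the ranking order and never splits a chunk, so a citation always covers the whole text it labels.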



Why a dedicated XLSX parser for LLMs?

Most Excel libraries answer one of two questions well: "read a rectangle of values" (pandas, openpyxl) or "run Excel headless" (xlwings, LibreOffice). ks-xlsx-parser answers a third one: "give me a structured, inspectable, loss-minimising graph that an LLM or auditor can reason about."

| Output | Why an LLM cares |
| --- | --- |
| Typed cell graph (values, formulas, styles, coordinates) | Round-trips to JSON/DB/vector store without losing formulas or data types |
| Formula AST + directed dependency graph | Answer "what drives Q4 revenue?" via upstream traversal |
| Detected tables, merged regions, layout blocks | Multi-table sheets no longer collapse into one giant CSV |
| Chart extractions (bar / line / pie / scatter / area / radar / bubble) | Text summaries the model can read |
| Token-counted render chunks (HTML + pipe-text) | Plug straight into an embedding pipeline without blowing context |
| Citation-ready source URIs (sheet!A1:B10) | The LLM can cite the exact cell it's talking about |
| Deterministic content hashes (xxhash64) | Dedupe across versions, detect change between uploads |

Everything is deterministic, everything is tested on a 1054-workbook stress corpus, and everything is open source.
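The "what drives Q4 revenue?" question in the table is an upstream walk over the dependency graph. Here is a minimal sketch of that idea on a plain dictionary; the dict-of-lists graph shape and the `upstream` function are assumptions made for illustration and do not reflect the library's own graph objects or method names.

```python
from collections import deque


def upstream(precedents: dict[str, list[str]], target: str) -> list[str]:
    """Return every cell `target` depends on, directly or transitively.

    Breadth-first; the `seen` set makes the walk terminate on circular refs.
    """
    seen = {target}
    order: list[str] = []
    queue = deque([target])
    while queue:
        cell = queue.popleft()
        for dep in precedents.get(cell, []):
            if dep not in seen:
                seen.add(dep)
                order.append(dep)
                queue.append(dep)
    return order


# Toy graph: each key maps to the cells its formula reads.
graph = {
    "Summary!B2": ["Revenue!F18", "Costs!H40"],
    "Revenue!F18": ["Revenue!F2", "Revenue!F3"],
    "Costs!H40": ["Summary!B2"],  # deliberate circular reference
}
drivers = upstream(graph, "Summary!B2")
```

Downstream traversal ("what breaks if I change this input?") is the same walk over the reversed edges.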


Architecture

The pipeline runs 8 deterministic stages: parse → analyse → annotate → segment → render → serialise → verify → compare/export. Full diagram, stage-by-stage breakdown, and module map in docs/wiki/Architecture.md. Stage internals in Pipeline Internals.

[!NOTE] The importable module is xlsx_parser; ks_xlsx_parser is a re-export matching the PyPI package name. The package is fully type-annotated (py.typed is shipped).


Installation

Requires Python 3.10+.

pip install ks-xlsx-parser                 # core library
pip install "ks-xlsx-parser[api]"          # + FastAPI web server
pip install "ks-xlsx-parser[dev]"          # + test tooling

From source:

git clone https://github.com/knowledgestack/ks-xlsx-parser.git
cd ks-xlsx-parser
make install           # pip install -e ".[dev,api]"
make test              # default suite
make corpus-download   # fetch SpreadsheetBench (5,458 real-world xlsx)
make bench-robust      # parse-success + structural counts vs Docling
make bench-retrieval   # retrieval recall@k vs Docling

Runtime deps: openpyxl, pydantic, lxml, xxhash, tiktoken.


Documentation

All implementation detail lives under docs/wiki/ (mirrored to the GitHub Wiki on each release) so this README stays scannable:

  • Quick Start: parse, iterate chunks, walk the dep graph, serialise, parse from bytes. Five short snippets, ~90% of real usage.
  • API Reference: full signatures for parse_workbook, compare_workbooks, export_importer, StageVerifier.
  • Web API: the bundled FastAPI server, Python + TypeScript clients, deployment notes.
  • Data Models: every Pydantic DTO field by field.
  • Pipeline Internals: where to hook in if you want to extend the parser.
  • Workbook Graph Spec: canonical schema for the output.
  • Known Issues: documented edge cases.
  • CHANGELOG: release history.

How it compares

This is the structural capability matrix. For head-to-head retrieval numbers (recall@k, geometric, latency) on a 912-instance real-world corpus, see the section "Benchmark: ks-xlsx-parser vs Docling on SpreadsheetBench" up top.

| Capability | pandas / openpyxl | Docling | ks-xlsx-parser |
| --- | --- | --- | --- |
| Reads values | ✅ | ✅ | ✅ |
| Keeps formulas | ⚠️ raw string | ❌ | ✅ parsed + dependency graph |
| Preserves merges | ⚠️ coords only | ⚠️ partial | ✅ master/slave with colspan/rowspan |
| Extracts charts | ❌ | ❌ | ✅ all 7 chart types + text summary |
| Conditional formatting | ❌ | ❌ | ✅ cell/color-scale/icon/data-bar/formula |
| Data validation (dropdowns) | ❌ | ❌ | ✅ all types incl. cross-sheet lists |
| Multi-table sheet layout | ❌ | ⚠️ | ✅ adaptive-gap segmentation |
| Per-chunk source URI (citation) | ❌ | ⚠️ | ✅ file.xlsx#Sheet!A1:F18 |
| Token counts per chunk | ❌ | ❌ | ✅ via tiktoken |
| Dependency graph traversal | ❌ | ❌ | ✅ upstream / downstream, cycle detection |
| Deterministic content hashes | ❌ | ❌ | ✅ xxhash64 per cell / block / chunk |
| Streaming .xlsx > 100 MB | ⚠️ | ❌ | ✅ (chunked parse) |

Most tools give you a dataframe. ks-xlsx-parser gives you a graph an LLM can cite.
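To show what a citation of the form file.xlsx#Sheet!A1:F18 carries, here is a small sketch that splits one into its parts for display. `Citation`, `parse_source_uri`, and `format_citation` are hypothetical helpers, not part of the library, and the split assumes the file name contains no '#' character.

```python
from typing import NamedTuple


class Citation(NamedTuple):
    file: str
    sheet: str
    cell_range: str


def parse_source_uri(uri: str) -> Citation:
    """Split 'file.xlsx#Sheet!A1:F18' into file, sheet, and cell range."""
    file, sep, fragment = uri.partition("#")
    if not sep:
        raise ValueError(f"missing '#' in {uri!r}")
    # rpartition: the cell range never contains '!', a sheet name might.
    sheet, sep, cell_range = fragment.rpartition("!")
    if not sep:
        raise ValueError(f"missing '!' in {uri!r}")
    return Citation(file, sheet, cell_range)


def format_citation(uri: str) -> str:
    """Render a human-readable footnote from a source URI."""
    c = parse_source_uri(uri)
    return f"{c.file}, sheet {c.sheet}, cells {c.cell_range}"


footnote = format_citation("q4_forecast.xlsx#Revenue!A1:F18")
```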


Looking for a tiny, edge-runtime I/O library with write support? See hucre by @productdevbook. For an unbiased head-to-head on the SpreadsheetBench corpus (perf numbers, extraction-count parity, where each side wins), see the wiki: ks-xlsx-parser vs hucre.


Who this is for

Teams shipping agents, RAG pipelines, or auditing tools that ingest Excel.

  • Banking & Finance: KPI extraction, formula lineage, regulator-ready citations
  • Legal & Contracts: schedules, fee tables, covenant matrices without flattening merges
  • Healthcare & Insurance: normalise claims, pricing, and actuarial sheets into auditable JSON
  • Real Estate & Construction: quantity takeoffs and cost models that still live in XLSX
  • Sales Ops / HR / Engineering: "source of truth is a spreadsheet" → structured events, in minutes

[!IMPORTANT] Not a fit if you need to execute Excel (recalculate, run VBA, pivot-refresh). Use xlwings or a headless Excel for that. ks-xlsx-parser reads; it doesn't run.


Benchmarks

We benchmark against SpreadsheetBench v0.1: 912 instruction × xlsx tasks (5,458 unique workbooks) covering financial models, project trackers, HR records, scientific data, and a long tail of small business spreadsheets.

| Benchmark | What it measures | Cost |
| --- | --- | --- |
| make bench-robust | Parse-success rate + structural counts vs Docling | ~20 min |
| make bench-retrieval | Top-k retrieval recall + table fragmentation rate vs Docling | ~40 min |

Headline numbers and methodology live in tests/benchmarks/reports/COMPARISON.md. The corpus is downloaded on demand (make corpus-download) and gitignored; nothing is committed to the repo.


Limitations

  • .xls not supported: only .xlsx and .xlsm (OOXML). Convert legacy files externally.
  • Pivot tables: detected but not fully parsed.
  • Sparklines: not extracted.
  • VBA macros: flagged but never executed or analysed.
  • External links: recorded but not resolved.
  • Threaded comments: only legacy comments are supported (openpyxl limitation).
  • Embedded OLE objects: detected but not extracted.
  • Locale-dependent number formats: not interpreted.

Full list in docs/PARSER_KNOWN_ISSUES.md.
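Since legacy .xls files are out of scope, it can help to detect them before calling the parser. The sketch below sniffs the container format from the first bytes using only the standard library; `sniff_spreadsheet` is a hypothetical helper, not a library function, and it only tells an OOXML package from an OLE2 container (it does not distinguish .xlsx from .docx, and password-protected OOXML files are also stored as OLE2).

```python
import io
import zipfile

# Signature of the OLE2 compound-file container used by legacy .xls files.
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")


def sniff_spreadsheet(data: bytes) -> str:
    """Classify raw bytes as 'ooxml', 'legacy-ole2', or 'unknown'."""
    if data.startswith(OLE2_MAGIC):
        return "legacy-ole2"
    if zipfile.is_zipfile(io.BytesIO(data)):
        with zipfile.ZipFile(io.BytesIO(data)) as zf:
            # Every OOXML package carries this part at the archive root.
            if "[Content_Types].xml" in zf.namelist():
                return "ooxml"
    return "unknown"
```

A caller might route "legacy-ole2" uploads to an external converter and reject "unknown" outright, rather than relying on the file extension.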


Knowledge Stack ecosystem

ks-xlsx-parser is one piece of the Knowledge Stack open-source family: document intelligence for agents, built so that engineering teams can focus on agents and we handle the messy parts of enterprise data.

| Repo | What it does |
| --- | --- |
| ks-cookbook | 32 production-style flagship agents + recipes for LangChain, LangGraph, CrewAI, Temporal, the OpenAI Agents SDK, and any MCP client. |
| ks-xlsx-parser (this repo) | Turn .xlsx into LLM-ready JSON with citations and dependency graphs. |
| @knowledgestack | Follow the org for upcoming repos: parsers, extractors, and MCP servers for PDF, DOCX, PPTX, HTML, and more. |

Building on top of the stack? Tell us about it in Show & Tell or the #showcase channel on Discord.


Stay in touch


  • Join the Discord: our main real-time channel. Roadmap, help, job postings, show-and-tell, and the occasional meme.
  • Follow @knowledgestack on GitHub for new releases across the ecosystem.
  • Watch this repo (→ Releases only) to get pinged when ks-xlsx-parser ships an update.

If you'd rather just peek first, run the benchmark suite against the public SpreadsheetBench corpus (make corpus-download && make bench-robust) and file an issue if your Excel does something weirder than ours.


Contributing

We love contributions. Three paths, in order of speed-to-merge:

  1. Report a benchmark failure: run make bench-robust on SpreadsheetBench, find a file that breaks, attach it to a Parser edge case issue.
  2. Submit an adversarial workbook: open a Parser edge case issue with the file attached; we'll fold it into the suite.
  3. Fix a flagged issue: see docs/PARSER_KNOWN_ISSUES.md.

Full dev loop, PR checklist, and code style in CONTRIBUTING.md. See the Code of Conduct and Security policy before posting.

If you don't have time to contribute but the project helped you, please star the repo. That's the main signal that keeps this maintained.


FAQ

What is the best Python library to parse Excel (.xlsx) for LLMs?

ks-xlsx-parser is purpose-built for it. Unlike pandas or openpyxl, it preserves formulas with a directed dependency graph, merged regions, tables, charts, and conditional formatting, and emits token-counted chunks with source_uri citations an LLM can quote. pip install ks-xlsx-parser.

How do I parse Excel for a LangChain or LangGraph agent?

Call parse_workbook(path=...), then expose result.chunks as a LangChain @tool or a LangGraph ToolNode. Each chunk carries source_uri, render_text, token_count, and a dependency_summary: everything the agent needs to cite and reason.
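A framework-neutral sketch of that pattern follows. It relies only on the parse_workbook(path=...) call and the chunk attributes named in this README; `chunk_to_payload` and `read_workbook` are hypothetical names, and registering the function with a given framework's tool decorator is left out because that part depends on the framework version in use.

```python
from typing import Any


def chunk_to_payload(chunk: Any) -> dict:
    """Reduce a chunk to the JSON-friendly fields an agent needs to cite it."""
    return {
        "source_uri": chunk.source_uri,
        "token_count": chunk.token_count,
        "text": chunk.render_text,
    }


def read_workbook(path: str) -> list[dict]:
    """Tool body: parse a workbook and return one citeable payload per chunk."""
    # Imported lazily so this module loads even where the package is absent.
    from ks_xlsx_parser import parse_workbook

    result = parse_workbook(path=path)
    return [chunk_to_payload(chunk) for chunk in result.chunks]
```

Returning plain dictionaries keeps the tool output serialisable, so the same function body can sit behind whichever tool abstraction the agent framework provides.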

How do I use Excel in a CrewAI or OpenAI-Agents-SDK agent?

Same pattern: wrap parse_workbook in whatever tool abstraction your framework provides (@tool in CrewAI, @function_tool in the OpenAI Agents SDK). The parser's output is framework-agnostic.

Can Claude Desktop, Cursor, Windsurf, or another MCP client read Excel files?

Yes: run the bundled FastAPI server (pip install "ks-xlsx-parser[api]", then xlsx-parser-api) and call POST /parse. A native MCP server is on the Knowledge Stack roadmap.

How do I build a RAG pipeline over Excel spreadsheets?

Three steps: pip install ks-xlsx-parser, call parse_workbook() on each file, then result.serializer.to_vector_store_entries() to get id + text + metadata triples ready for Qdrant, pgvector, Weaviate, or Pinecone. Every entry has a content_hash for dedup and a source_uri the LLM cites in its answer.
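The dedup step can be sketched as a filter that runs before the upsert. The entry shape used here (a dict with "id", "text", and a "metadata" dict holding content_hash and source_uri) is an assumption made for illustration; check the Data Models page for the real fields. `new_entries` is a hypothetical helper.

```python
def new_entries(entries: list[dict], seen_hashes: set[str]) -> list[dict]:
    """Return entries whose content_hash is not indexed yet.

    `seen_hashes` is updated in place so repeated calls stay consistent.
    """
    fresh = []
    for entry in entries:
        digest = entry["metadata"]["content_hash"]
        if digest in seen_hashes:
            continue  # identical content already embedded; skip re-embedding
        seen_hashes.add(digest)
        fresh.append(entry)
    return fresh


# Toy data: the second upload changes one block and keeps the other.
upload_v1 = [
    {"id": "a", "text": "Revenue table", "metadata": {"content_hash": "h1", "source_uri": "m.xlsx#Revenue!A1:F18"}},
    {"id": "b", "text": "Cost table", "metadata": {"content_hash": "h2", "source_uri": "m.xlsx#Costs!A1:H40"}},
]
upload_v2 = [
    {"id": "c", "text": "Revenue table", "metadata": {"content_hash": "h1", "source_uri": "m.xlsx#Revenue!A1:F18"}},
    {"id": "d", "text": "Cost table v2", "metadata": {"content_hash": "h3", "source_uri": "m.xlsx#Costs!A1:H40"}},
]
indexed: set[str] = set()
first = new_entries(upload_v1, indexed)
second = new_entries(upload_v2, indexed)
```

Only the changed block from the second upload needs embedding, which is the practical payoff of deterministic hashes.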

How is ks-xlsx-parser different from openpyxl or pandas?

openpyxl and pandas give you a rectangle of values. ks-xlsx-parser gives you the full workbook graph: parsed formulas with dependency edges, merged regions, Excel ListObjects, all 7 chart types, every conditional-formatting rule type, and LLM chunks with citation URIs + token counts. It wraps openpyxl and uses lxml for the bits openpyxl loses.

Does ks-xlsx-parser run Excel formulas or macros?

No. The library reads .xlsx files; it never executes them. VBA macros are flagged but never run. External links are recorded but never resolved. ZIP-bomb and cell-count limits make it safe for untrusted uploads.
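For readers curious what a ZIP-bomb guard looks like in general, here is a standard-library sketch that inspects an archive's declared sizes before anything is decompressed. It is not the library's implementation: `check_zip_limits` and both thresholds are illustrative assumptions, and because declared sizes in the archive directory can be forged, a production guard would also cap the bytes actually read.

```python
import zipfile


def check_zip_limits(
    source,
    max_total_bytes: int = 500 * 1024 * 1024,  # illustrative limit
    max_ratio: float = 200.0,                  # illustrative limit
) -> None:
    """Raise ValueError if the archive declares a suspicious uncompressed size.

    `source` is a path or a binary file-like object.
    """
    with zipfile.ZipFile(source) as zf:
        total = 0
        for info in zf.infolist():
            total += info.file_size
            if info.compress_size > 0 and info.file_size / info.compress_size > max_ratio:
                raise ValueError(f"{info.filename}: compression ratio too high")
        if total > max_total_bytes:
            raise ValueError(f"declared uncompressed size {total} exceeds limit")
```

Calling a check like this on an upload before handing it to any parser keeps the cost of rejecting a hostile file close to zero.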

How fast is it?

SpreadsheetBench's full 5,458-workbook corpus parses end-to-end in roughly 20 minutes on a single machine (P50 parse time low double-digit ms). A real 21k-cell, 13-sheet financial model parses in ~4.6 s (down from 307 s pre-0.1.1 after a circular-ref caching fix). Sparse workbooks with extreme addresses parse in under 200 ms.


Also known as

Search queries this library answers: Python Excel parser for LLMs, XLSX to JSON for LangChain, Excel ingestion for LangGraph, spreadsheet reader for CrewAI, Excel tool for OpenAI Agents SDK, Excel for Claude Desktop, Excel for Cursor, Excel MCP server, openpyxl alternative for RAG, Excel dependency graph extractor, XLSX OOXML parser for AI, how to parse Excel for an LLM agent, how to feed a spreadsheet to ChatGPT, how to cite Excel cells in an LLM answer, best library to turn Excel into JSON, Python library for parsing formulas, Excel formula dependency traversal, document intelligence for spreadsheets, RAG over Excel files, Excel chunker with token counts, parse .xlsx for Qdrant / pgvector / Weaviate / Pinecone.


License

MIT. Use it, fork it, ship it. Attribution appreciated but not required.

If you ship something built on top of ks-xlsx-parser, we'd love a Show & Tell post or a shoutout on Discord.
