carloforte
Extract structured data from Excel files with minimal token usage.
carloforte uses an island-detection algorithm to convert Excel sheets into a compact intermediate representation (CSV, Markdown, or JSON), making it efficient to pass spreadsheet data to LLMs.
Installation
```shell
uv add carloforte
```
Usage
```python
import carloforte

# Extract all sheets as CSV (default)
text = carloforte.extract("data.xlsx")

# Extract specific sheets as Markdown
text = carloforte.extract("data.xlsx", sheets=["Revenue", "Costs"], fmt="markdown")

# Extract as JSON
text = carloforte.extract("data.xlsx", fmt="json")
```
Formats
| Format | Best for |
|---|---|
| `csv` | Compact, low token count |
| `markdown` | Readable, good for LLM prompts |
| `json` | Structured output, programmatic use |
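As a rough illustration of programmatic use, the sketch below counts the tables found in each sheet of a JSON result. It assumes only the top-level shape given in the Architecture section, `{"sheets": {"name": {"tables": [...]}}}`. The contents of each table are not documented here, so the sample uses empty placeholders, and `tables_per_sheet` is a hypothetical helper, not part of carloforte.

```python
import json


def tables_per_sheet(json_text):
    """Map each sheet name to the number of tables listed under it."""
    doc = json.loads(json_text)
    sheets = doc.get("sheets", {})
    return {name: len(sheet.get("tables", [])) for name, sheet in sheets.items()}


# Hand-written sample with the documented top-level shape.
# In practice the string would come from carloforte.extract(path, fmt="json").
sample = json.dumps(
    {"sheets": {"Revenue": {"tables": [[], []]}, "Costs": {"tables": [[]]}}}
)
print(tables_per_sheet(sample))  # {'Revenue': 2, 'Costs': 1}
```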
CLI
```shell
carloforte data.xlsx --fmt markdown
carloforte data.xlsx --sheets Revenue Costs --fmt json
```
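For readers wondering how flags like these are typically parsed, here is a minimal `argparse` sketch that accepts the same command lines. It is an assumption about how such a CLI could be wired, not the actual contents of `_cli.py`. The `csv` default mirrors the Python API and is assumed for the CLI.

```python
import argparse


def build_parser():
    """Build a parser that accepts the command lines shown above (illustrative only)."""
    parser = argparse.ArgumentParser(prog="carloforte")
    parser.add_argument("path", help="Excel workbook to extract")
    # nargs="+" lets several sheet names follow a single --sheets flag.
    parser.add_argument("--sheets", nargs="+", default=None,
                        help="sheet names to extract (default: all sheets)")
    parser.add_argument("--fmt", choices=["csv", "markdown", "json"], default="csv",
                        help="output format")
    return parser


args = build_parser().parse_args(
    ["data.xlsx", "--sheets", "Revenue", "Costs", "--fmt", "json"]
)
print(args.path, args.sheets, args.fmt)  # data.xlsx ['Revenue', 'Costs'] json
```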
How it works
Excel sheets often contain multiple disconnected tables, empty rows, and metadata scattered around. carloforte detects each contiguous block of data ("island") independently and serialises only what matters — reducing token usage by 60–75% compared to passing raw Excel content to an LLM.
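The idea can be sketched in a few lines of Python. This is a minimal illustration, not the code in `_islands.py`: it treats the sheet as a list of rows, assumes cells count as connected when they touch horizontally, vertically, or diagonally, and returns only bounding boxes. The name `find_islands_sketch` and the connectivity rule are assumptions.

```python
from collections import deque


def find_islands_sketch(grid):
    """Return (top, left, bottom, right) boxes for each group of connected non-empty cells."""
    rows = len(grid)
    cols = max((len(row) for row in grid), default=0)

    def filled(r, c):
        # Short rows are padded implicitly; None and "" count as empty.
        return c < len(grid[r]) and grid[r][c] not in (None, "")

    seen = set()
    boxes = []
    for r in range(rows):
        for c in range(cols):
            if (r, c) in seen or not filled(r, c):
                continue
            # Breadth-first search from an unvisited non-empty cell.
            queue = deque([(r, c)])
            seen.add((r, c))
            top, left, bottom, right = r, c, r, c
            while queue:
                cr, cc = queue.popleft()
                top, bottom = min(top, cr), max(bottom, cr)
                left, right = min(left, cc), max(right, cc)
                for dr in (-1, 0, 1):
                    for dc in (-1, 0, 1):
                        nr, nc = cr + dr, cc + dc
                        if (0 <= nr < rows and 0 <= nc < cols
                                and (nr, nc) not in seen and filled(nr, nc)):
                            seen.add((nr, nc))
                            queue.append((nr, nc))
            boxes.append((top, left, bottom, right))
    return boxes


sheet = [
    ["Name", "Qty", None, None, None],
    ["A", 1, None, None, None],
    [None, None, None, None, None],
    [None, None, None, "Total", 5],
]
print(find_islands_sketch(sheet))  # [(0, 0, 1, 1), (3, 3, 3, 4)]
```

Only the cells inside each box need to be serialised, so the empty row and the empty columns between the two blocks never reach the output.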
Performance
Benchmarked against a raw CSV export (the worst-case baseline). Arrows mark a decrease (↓) or increase (↑) relative to the Raw CSV column. Run examples/scripts/benchmark_tokens.py to reproduce.
Characters:
| File | Raw CSV | Clean CSV | CSV islands | Markdown | JSON |
|---|---|---|---|---|---|
| large_sparse (1 sheet, 5 islands spread over 165 rows) | 4,470 | ↓ 1,480 (-67%) | ↓ 960 (-79%) | ↓ 1,453 (-68%) | ↓ 1,337 (-70%) |
| fragmented (4 sheets, 4% fill, tiny tables far apart) | 10,664 | ↓ 5,334 (-50%) | ↓ 4,811 (-55%) | ↓ 6,690 (-37%) | ↓ 6,221 (-42%) |
| stray_cells (1 sheet, 7 islands + scattered stray cells) | 2,143 | ↓ 1,668 (-22%) | ↓ 1,458 (-32%) | ↑ 2,175 (+2%) | ↓ 1,994 (-7%) |
| multisheet (4 sheets, mixed structure) | 1,304 | ↓ 1,084 (-17%) | ↓ 1,156 (-11%) | ↑ 1,879 (+44%) | ↑ 1,755 (+35%) |
| invoice (single invoice sheet) | 4,180 | ↓ 3,745 (-10%) | ↓ 3,855 (-8%) | ↑ 5,097 (+22%) | ↑ 5,020 (+20%) |
| minimal (1 sheet, 3 small islands) | 248 | ↓ 216 (-13%) | ↓ 244 (-2%) | ↑ 417 (+68%) | ↑ 380 (+53%) |
| enterprise (9 sheets, dense real-world complexity) | 23,983 | ↓ 22,533 (-6%) | ↓ 22,639 (-6%) | ↑ 29,357 (+22%) | ↑ 28,741 (+20%) |
Tokens (cl100k_base):
| File | Raw CSV | Clean CSV | CSV islands | Markdown | JSON |
|---|---|---|---|---|---|
| large_sparse | 1,033 | ↓ 513 (-50%) | ↓ 443 (-57%) | ↓ 616 (-40%) | ↓ 605 (-41%) |
| fragmented | 2,556 | ↓ 1,504 (-41%) | ↓ 1,696 (-34%) | ↓ 2,375 (-7%) | ↓ 2,283 (-11%) |
| stray_cells | 630 | ↓ 555 (-12%) | ↓ 575 (-9%) | ↑ 824 (+31%) | ↑ 811 (+29%) |
| multisheet | 499 | ↓ 430 (-14%) | ↓ 484 (-3%) | ↑ 737 (+48%) | ↑ 721 (+45%) |
| invoice | 1,317 | ↓ 1,230 (-7%) | ↑ 1,323 (+0%) | ↑ 1,934 (+47%) | ↑ 1,839 (+40%) |
| minimal | 83 | ↓ 71 (-15%) | ↑ 89 (+7%) | ↑ 143 (+72%) | ↑ 139 (+67%) |
| enterprise | 9,606 | ↓ 9,311 (-3%) | ↑ 9,627 (+0%) | ↑ 11,723 (+22%) | ↑ 12,386 (+29%) |
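To read the cells above, each percentage is the change relative to the Raw CSV column. The sketch below shows that arithmetic on two token-table entries, assuming plain rounding to the nearest percent. `pct_change` is a hypothetical helper, and the benchmark script may compute or format its figures differently.

```python
def pct_change(baseline, value):
    """Format a value as an arrow, the value, and its percentage change from baseline."""
    change = (value - baseline) / baseline * 100
    arrow = "↓" if value < baseline else "↑"
    return f"{arrow} {value:,} ({change:+.0f}%)"


print(pct_change(1033, 443))   # large_sparse, CSV islands: ↓ 443 (-57%)
print(pct_change(1317, 1323))  # invoice, CSV islands: ↑ 1,323 (+0%)
```

The second line shows why an entry can carry an up arrow alongside `+0%`: the count rose by 6 tokens, which is under half a percent.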
Optional Rust backend
carloforte ships a Rust implementation of the island-detection algorithm via PyO3. It is not enabled by default — the pure-Python implementation is used unless you build and install the extension manually.
The Rust backend (carloforte-rs) re-implements find_islands in Rust and exposes it as a native Python extension (_islands_rs). The BFS logic is identical to the Python version; the speedup comes from Rust's memory layout and the absence of Python object overhead on large grids.
Build from source:
```shell
cd carloforte-rs
maturin develop --release
```
Once built, the .so / .pyd extension will be importable as carloforte._islands_rs. Integration into the default pipeline is planned for a future release.
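Until that integration lands, a caller who wants to know whether the extension is available can probe for it and fall back otherwise. The following is a minimal sketch of the usual optional-import pattern. `load_native_islands` is a hypothetical helper, and nothing here reflects how carloforte itself will select a backend.

```python
import importlib


def load_native_islands(module_name="carloforte._islands_rs"):
    """Return the native extension module if it can be imported, else None."""
    try:
        return importlib.import_module(module_name)
    except ImportError:
        # Extension (or the package itself) is not installed or not built.
        return None


native = load_native_islands()
backend = "rust" if native is not None else "python"
print(f"island detection backend: {backend}")
```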
Architecture
```
carloforte/
├── __init__.py      public API: re-exports extract()
├── _extract.py      extract(): orchestrates the pipeline
├── _cli.py          main(): CLI entry point
├── _reader.py       load_workbook_sheets(): openpyxl → Grid
├── _islands.py      find_islands(): BFS island detection
└── _serialiser.py   serialise(): Grid → csv / markdown / json
```
```
carloforte.extract(path, sheets=None, fmt="csv")
│
│  1. _reader.load_workbook_sheets(path, sheets)
│     openpyxl → dict[sheet_name, Grid]
│
│  2. _islands.find_islands(grid)   ← per sheet
│     BFS over non-empty cells → list[Island]
│     each Island: bounding box + header row + data rows
│
│  3. _serialiser.serialise(sheet_islands, fmt)
│     "csv"      → one block per island, blank-line separated
│     "markdown" → ## heading per sheet, fenced table per island
│     "json"     → {"sheets": {"name": {"tables": [...]}}}
│
└─ returns str
```
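To make the `"csv"` branch of step 3 concrete, here is a minimal sketch of writing one CSV block per island, separated by blank lines. It stands in for the real `Island` type with a plain list of rows (header row first), which is an assumption, and `islands_to_csv` is a hypothetical name rather than the function in `_serialiser.py`.

```python
import csv
import io


def islands_to_csv(islands):
    """Serialise each island (a list of rows) as a CSV block; join blocks with a blank line."""
    blocks = []
    for rows in islands:
        buf = io.StringIO()
        csv.writer(buf, lineterminator="\n").writerows(rows)
        blocks.append(buf.getvalue().rstrip("\n"))
    return "\n\n".join(blocks)


islands = [
    [["Name", "Qty"], ["A", 1]],  # header row followed by data rows
    [["Total", 5]],
]
print(islands_to_csv(islands))
# Name,Qty
# A,1
#
# Total,5
```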
License
MIT