Skip to main content

Extract structured data from Excel files with minimal token usage

Project description

carloforte

CI

Extract structured data from Excel files with minimal token usage.

carloforte uses an island-detection algorithm to convert Excel sheets into a compact intermediate representation (CSV, Markdown, or JSON), making it efficient to pass spreadsheet data to LLMs.

Installation

uv add carloforte

Usage

import carloforte

# Extract all sheets as CSV (default)
text = carloforte.extract("data.xlsx")

# Extract specific sheets as Markdown
text = carloforte.extract("data.xlsx", sheets=["Revenue", "Costs"], fmt="markdown")

# Extract as JSON
text = carloforte.extract("data.xlsx", fmt="json")

Formats

Format Best for
csv Compact, low token count
markdown Readable, good for LLM prompts
json Structured output, programmatic use

CLI

carloforte data.xlsx --fmt markdown
carloforte data.xlsx --sheets Revenue Costs --fmt json

How it works

Excel sheets often contain multiple disconnected tables, empty rows, and metadata scattered around. carloforte detects each contiguous block of data ("island") independently and serialises only what matters — reducing token usage by 60–75% compared to passing raw Excel content to an LLM.

Performance

Benchmarked against raw CSV export (worst case baseline). Run examples/scripts/benchmark_tokens.py to reproduce.

Characters:

File Raw CSV Clean CSV CSV islands Markdown JSON
large_sparse — 1 sheet, 5 islands spread over 165 rows 4,470 ↓ 1,480 (-67%) ↓ 960 (-79%) ↓ 1,453 (-68%) ↓ 1,337 (-70%)
fragmented — 4 sheets, 4% fill, tiny tables far apart 10,664 ↓ 5,334 (-50%) ↓ 4,811 (-55%) ↓ 6,690 (-37%) ↓ 6,221 (-42%)
stray_cells — 1 sheet, 7 islands + scattered stray cells 2,143 ↓ 1,668 (-22%) ↓ 1,458 (-32%) ↑ 2,175 (+2%) ↓ 1,994 (-7%)
multisheet — 4 sheets, mixed structure 1,304 ↓ 1,084 (-17%) ↓ 1,156 (-11%) ↑ 1,879 (+44%) ↑ 1,755 (+35%)
invoice — single invoice sheet 4,180 ↓ 3,745 (-10%) ↓ 3,855 (-8%) ↑ 5,097 (+22%) ↑ 5,020 (+20%)
minimal — 1 sheet, 3 small islands 248 ↓ 216 (-13%) ↓ 244 (-2%) ↑ 417 (+68%) ↑ 380 (+53%)
enterprise — 9 sheets, dense real-world complexity 23,983 ↓ 22,533 (-6%) ↓ 22,639 (-6%) ↑ 29,357 (+22%) ↑ 28,741 (+20%)

Tokens (cl100k_base):

File Raw CSV Clean CSV CSV islands Markdown JSON
large_sparse 1,033 ↓ 513 (-50%) ↓ 443 (-57%) ↓ 616 (-40%) ↓ 605 (-41%)
fragmented 2,556 ↓ 1,504 (-41%) ↓ 1,696 (-34%) ↓ 2,375 (-7%) ↓ 2,283 (-11%)
stray_cells 630 ↓ 555 (-12%) ↓ 575 (-9%) ↑ 824 (+31%) ↑ 811 (+29%)
multisheet 499 ↓ 430 (-14%) ↓ 484 (-3%) ↑ 737 (+48%) ↑ 721 (+45%)
invoice 1,317 ↓ 1,230 (-7%) ↑ 1,323 (+0%) ↑ 1,934 (+47%) ↑ 1,839 (+40%)
minimal 83 ↓ 71 (-15%) ↑ 89 (+7%) ↑ 143 (+72%) ↑ 139 (+67%)
enterprise 9,606 ↓ 9,311 (-3%) ↑ 9,627 (+0%) ↑ 11,723 (+22%) ↑ 12,386 (+29%)

Optional Rust backend

carloforte ships a Rust implementation of the island-detection algorithm via PyO3. It is not enabled by default — the pure-Python implementation is used unless you build and install the extension manually.

The Rust backend (carloforte-rs) re-implements find_islands in Rust and exposes it as a native Python extension (_islands_rs). The BFS logic is identical to the Python version; the speedup comes from Rust's memory layout and the absence of Python object overhead on large grids.

Build from source:

cd carloforte-rs
maturin develop --release

Once built, the .so / .pyd extension will be importable as carloforte._islands_rs. Integration into the default pipeline is planned for a future release.

Architecture

carloforte/
├── __init__.py        public API: re-exports extract()
├── _extract.py        extract() — orchestrates the pipeline
├── _cli.py            main() — CLI entry point
├── _reader.py         load_workbook_sheets() — openpyxl → Grid
├── _islands.py        find_islands() — BFS island detection
└── _serialiser.py     serialise() — Grid → csv / markdown / json
carloforte.extract(path, sheets=None, fmt="csv")
│
│  1. _reader.load_workbook_sheets(path, sheets)
│     openpyxl → dict[sheet_name, Grid]
│
│  2. _islands.find_islands(grid)   ← per sheet
│     BFS over non-empty cells → list[Island]
│     each Island: bounding box + header row + data rows
│
│  3. _serialiser.serialise(sheet_islands, fmt)
│     "csv"      → one block per island, blank-line separated
│     "markdown" → ## heading per sheet, fenced table per island
│     "json"     → {"sheets": {"name": {"tables": [...]}}}
│
└─ returns str

License

MIT

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

carloforte-0.2.1.tar.gz (6.0 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

carloforte-0.2.1-py3-none-any.whl (8.1 kB view details)

Uploaded Python 3

File details

Details for the file carloforte-0.2.1.tar.gz.

File metadata

  • Download URL: carloforte-0.2.1.tar.gz
  • Upload date:
  • Size: 6.0 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: uv/0.11.25 {"installer":{"name":"uv","version":"0.11.25","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"Ubuntu","version":"24.04","id":"noble","libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":true}

File hashes

Hashes for carloforte-0.2.1.tar.gz
Algorithm Hash digest
SHA256 5ff61a8d6176ae607a2425af5462c6356277b8c261c8be67389a21a88e55968d
MD5 421c9a66f6cfedd867e2a430ab80f115
BLAKE2b-256 55c4cf0869b0acb5900b42394af4cd8823496232b54ed177f8552f12b29d4d75

See more details on using hashes here.

File details

Details for the file carloforte-0.2.1-py3-none-any.whl.

File metadata

  • Download URL: carloforte-0.2.1-py3-none-any.whl
  • Upload date:
  • Size: 8.1 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: uv/0.11.25 {"installer":{"name":"uv","version":"0.11.25","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"Ubuntu","version":"24.04","id":"noble","libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":true}

File hashes

Hashes for carloforte-0.2.1-py3-none-any.whl
Algorithm Hash digest
SHA256 408c4f8382f72be67109221a07875ee6bcaf390593a436a55c5625fcf9799f39
MD5 cac12c03f806be9b6169b04fc5cb1af5
BLAKE2b-256 69342560a36c63a7c7ceb7f2afeb676f819ddefebd361bc9d98904ce1ffc4d90

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page