Extract structured data from Excel files with minimal token usage

carloforte

carloforte uses an island-detection algorithm to convert Excel sheets into a compact intermediate representation (CSV, Markdown, or JSON), making it efficient to pass spreadsheet data to LLMs.

Installation

uv add carloforte

Usage

import carloforte

# Extract all sheets as CSV (default)
text = carloforte.extract("data.xlsx")

# Extract specific sheets as Markdown
text = carloforte.extract("data.xlsx", sheets=["Revenue", "Costs"], fmt="markdown")

# Extract as JSON
text = carloforte.extract("data.xlsx", fmt="json")

Formats

| Format | Best for |
| --- | --- |
| csv | Compact, low token count |
| markdown | Readable, good for LLM prompts |
| json | Structured output, programmatic use |

CLI

carloforte data.xlsx --fmt markdown
carloforte data.xlsx --sheets Revenue Costs --fmt json
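
The flags used above could be parsed with argparse roughly as follows. This is a sketch only, not carloforte's actual _cli.py: the help strings, the csv default on the command line, and the restriction to three choices are assumptions drawn from the examples in this README.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the CLI examples in this README."""
    parser = argparse.ArgumentParser(prog="carloforte")
    parser.add_argument("path", help="Excel workbook to read")
    # nargs="+" lets several sheet names follow a single --sheets flag.
    parser.add_argument("--sheets", nargs="+", default=None,
                        help="sheet names to extract (default: all sheets)")
    # Assumption: the CLI defaults to csv, as extract() does.
    parser.add_argument("--fmt", choices=["csv", "markdown", "json"],
                        default="csv", help="output format")
    return parser


args = build_parser().parse_args(
    ["data.xlsx", "--sheets", "Revenue", "Costs", "--fmt", "json"]
)
print(args.path, args.sheets, args.fmt)
```

With nargs="+", every word after --sheets is collected into one list until the next flag, which is why the sheet names do not need to be quoted or comma-separated.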

How it works

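Excel sheets often contain multiple disconnected tables, empty rows, and metadata scattered around. carloforte detects each contiguous block of data (an "island") independently and serialises only those blocks, dropping the empty space between them. The project reports a 60–75% reduction in token usage compared to passing raw Excel content to an LLM; the actual saving depends on how sparse the workbook is, and the Performance tables below show that dense sheets gain little or can even grow in the Markdown and JSON formats.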
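
To make the idea of an island concrete, here is a minimal sketch of breadth-first island detection on a grid of cell values. It is an illustration, not carloforte's own find_islands: the 8-neighbour connectivity, the definition of an empty cell, and the bounding-box return type are all assumptions, and header detection is left out.

```python
from collections import deque

Grid = list[list[object]]  # rows of cell values (assumed representation)


def is_empty(value: object) -> bool:
    """Assumption: None and whitespace-only strings count as empty."""
    return value is None or (isinstance(value, str) and value.strip() == "")


def find_islands(grid: Grid) -> list[tuple[int, int, int, int]]:
    """Return one (top, left, bottom, right) bounding box per island."""
    seen: set[tuple[int, int]] = set()
    boxes = []
    for r, row in enumerate(grid):
        for c, value in enumerate(row):
            if is_empty(value) or (r, c) in seen:
                continue
            # Breadth-first search from this unvisited non-empty cell.
            queue = deque([(r, c)])
            seen.add((r, c))
            top, left, bottom, right = r, c, r, c
            while queue:
                y, x = queue.popleft()
                top, bottom = min(top, y), max(bottom, y)
                left, right = min(left, x), max(right, x)
                # Visit the 8 surrounding cells (assumed connectivity).
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < len(grid)
                                and 0 <= nx < len(grid[ny])
                                and (ny, nx) not in seen
                                and not is_empty(grid[ny][nx])):
                            seen.add((ny, nx))
                            queue.append((ny, nx))
            boxes.append((top, left, bottom, right))
    return boxes


grid = [
    ["Region", "Sales", None, None, None],
    ["North", 120, None, None, None],
    [None, None, None, None, None],
    [None, None, None, "Note", None],
    [None, None, None, None, None],
    [None, None, None, "Item", "Cost"],
    [None, None, None, "Rent", 900],
]
print(find_islands(grid))
```

The sample grid holds two small tables and one stray cell, so the sketch reports three bounding boxes. Only the cells inside those boxes would need to be serialised.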

Performance

Benchmarked against a raw CSV export (the worst-case baseline). Percentages are relative to the Raw CSV column; ↓ marks a smaller output and ↑ a larger one. Run examples/scripts/benchmark_tokens.py to reproduce.

Characters:

| File | Raw CSV | Clean CSV | CSV islands | Markdown | JSON |
| --- | --- | --- | --- | --- | --- |
| large_sparse (1 sheet, 5 islands spread over 165 rows) | 4,470 | ↓ 1,480 (-67%) | ↓ 960 (-79%) | ↓ 1,453 (-68%) | ↓ 1,337 (-70%) |
| fragmented (4 sheets, 4% fill, tiny tables far apart) | 10,664 | ↓ 5,334 (-50%) | ↓ 4,811 (-55%) | ↓ 6,690 (-37%) | ↓ 6,221 (-42%) |
| stray_cells (1 sheet, 7 islands + scattered stray cells) | 2,143 | ↓ 1,668 (-22%) | ↓ 1,458 (-32%) | ↑ 2,175 (+2%) | ↓ 1,994 (-7%) |
| multisheet (4 sheets, mixed structure) | 1,304 | ↓ 1,084 (-17%) | ↓ 1,156 (-11%) | ↑ 1,879 (+44%) | ↑ 1,755 (+35%) |
| invoice (single invoice sheet) | 4,180 | ↓ 3,745 (-10%) | ↓ 3,855 (-8%) | ↑ 5,097 (+22%) | ↑ 5,020 (+20%) |
| minimal (1 sheet, 3 small islands) | 248 | ↓ 216 (-13%) | ↓ 244 (-2%) | ↑ 417 (+68%) | ↑ 380 (+53%) |
| enterprise (9 sheets, dense real-world complexity) | 23,983 | ↓ 22,533 (-6%) | ↓ 22,639 (-6%) | ↑ 29,357 (+22%) | ↑ 28,741 (+20%) |

Tokens (cl100k_base):

| File | Raw CSV | Clean CSV | CSV islands | Markdown | JSON |
| --- | --- | --- | --- | --- | --- |
| large_sparse | 1,033 | ↓ 513 (-50%) | ↓ 443 (-57%) | ↓ 616 (-40%) | ↓ 605 (-41%) |
| fragmented | 2,556 | ↓ 1,504 (-41%) | ↓ 1,696 (-34%) | ↓ 2,375 (-7%) | ↓ 2,283 (-11%) |
| stray_cells | 630 | ↓ 555 (-12%) | ↓ 575 (-9%) | ↑ 824 (+31%) | ↑ 811 (+29%) |
| multisheet | 499 | ↓ 430 (-14%) | ↓ 484 (-3%) | ↑ 737 (+48%) | ↑ 721 (+45%) |
| invoice | 1,317 | ↓ 1,230 (-7%) | ↑ 1,323 (+0%) | ↑ 1,934 (+47%) | ↑ 1,839 (+40%) |
| minimal | 83 | ↓ 71 (-15%) | ↑ 89 (+7%) | ↑ 143 (+72%) | ↑ 139 (+67%) |
| enterprise | 9,606 | ↓ 9,311 (-3%) | ↑ 9,627 (+0%) | ↑ 11,723 (+22%) | ↑ 12,386 (+29%) |
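
For readers checking the tables, each cell appears to be the size of the output plus its change relative to the Raw CSV column. The helper below is a hypothetical reconstruction of that arithmetic, not code taken from benchmark_tokens.py; the script's exact rounding is not shown here, so an individual cell may differ by a point.

```python
def format_cell(baseline: int, value: int) -> str:
    """Format a size and its percentage change against the raw CSV baseline."""
    # Relative change in percent, rounded to the nearest whole number.
    change = round((value - baseline) / baseline * 100)
    # Any value not smaller than the baseline gets the "larger" arrow.
    arrow = "↓" if value < baseline else "↑"
    return f"{arrow} {value:,} ({change:+d}%)"


# Token counts from the table above: large_sparse, invoice, enterprise.
print(format_cell(1033, 443))
print(format_cell(1317, 1323))
print(format_cell(9606, 12386))
```

The invoice case shows why a cell can read ↑ with +0%: 1,323 tokens is slightly more than the 1,317 baseline, but the difference rounds to zero.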

Optional Rust backend

carloforte ships a Rust implementation of the island-detection algorithm via PyO3. It is not enabled by default — the pure-Python implementation is used unless you build and install the extension manually.

The Rust backend (carloforte-rs) re-implements find_islands in Rust and exposes it as a native Python extension (_islands_rs). The BFS logic is identical to the Python version; the speedup comes from Rust's memory layout and the absence of Python object overhead on large grids.

Build from source:

cd carloforte-rs
maturin develop --release

Once built, the .so / .pyd extension will be importable as carloforte._islands_rs. Integration into the default pipeline is planned for a future release.
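
Until that integration lands, calling code could pick a backend itself. The sketch below shows one generic way to prefer a native module and fall back to a pure-Python one. The helper name load_first_available is hypothetical, and the commented usage assumes that both carloforte._islands_rs and carloforte._islands expose a find_islands function, which this README states only for the Python module.

```python
import importlib
from types import ModuleType


def load_first_available(module_names: list[str]) -> ModuleType:
    """Import and return the first module in the list that can be imported."""
    for name in module_names:
        try:
            return importlib.import_module(name)
        except ImportError:
            # Covers ModuleNotFoundError too; try the next candidate.
            continue
    raise ImportError(f"none of {module_names} could be imported")


# Hypothetical usage once the extension has been built:
#   backend = load_first_available(
#       ["carloforte._islands_rs", "carloforte._islands"]
#   )
#   islands = backend.find_islands(grid)

# Runnable demonstration with a missing module and a stdlib fallback.
backend = load_first_available(["_no_such_native_module", "json"])
print(backend.__name__)
```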

Architecture

carloforte/
├── __init__.py        public API: re-exports extract()
├── _extract.py        extract() — orchestrates the pipeline
├── _cli.py            main() — CLI entry point
├── _reader.py         load_workbook_sheets() — openpyxl → Grid
├── _islands.py        find_islands() — BFS island detection
└── _serialiser.py     serialise() — Grid → csv / markdown / json

carloforte.extract(path, sheets=None, fmt="csv")
│
│  1. _reader.load_workbook_sheets(path, sheets)
│     openpyxl → dict[sheet_name, Grid]
│
│  2. _islands.find_islands(grid)   ← per sheet
│     BFS over non-empty cells → list[Island]
│     each Island: bounding box + header row + data rows
│
│  3. _serialiser.serialise(sheet_islands, fmt)
│     "csv"      → one block per island, blank-line separated
│     "markdown" → ## heading per sheet, fenced table per island
│     "json"     → {"sheets": {"name": {"tables": [...]}}}
│
└─ returns str
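
Step 3 of the pipeline above can be sketched as follows for the csv and json formats (markdown is omitted to keep the example short). The Island dataclass, its header and rows fields, and the per-table JSON keys are assumptions made for illustration; only the outer {"sheets": {"name": {"tables": [...]}}} shape and the blank-line separation between CSV blocks come from this README.

```python
import csv
import io
import json
from dataclasses import dataclass


@dataclass
class Island:
    """Hypothetical stand-in for one detected table."""
    header: list[str]
    rows: list[list[object]]


def island_to_csv(island: Island) -> str:
    """Render a single island as a CSV block without a trailing newline."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(island.header)
    writer.writerows(island.rows)
    return buffer.getvalue().rstrip("\n")


def serialise(sheet_islands: dict[str, list[Island]], fmt: str = "csv") -> str:
    if fmt == "csv":
        # One block per island, separated by a blank line.
        blocks = [island_to_csv(island)
                  for islands in sheet_islands.values()
                  for island in islands]
        return "\n\n".join(blocks)
    if fmt == "json":
        # Assumed table layout: {"header": [...], "rows": [[...], ...]}.
        payload = {"sheets": {
            name: {"tables": [{"header": i.header, "rows": i.rows}
                              for i in islands]}
            for name, islands in sheet_islands.items()
        }}
        return json.dumps(payload)
    raise ValueError(f"unsupported format in this sketch: {fmt}")


sheets = {"Revenue": [
    Island(["Region", "Sales"], [["North", 120]]),
    Island(["Item", "Cost"], [["Rent", 900]]),
]}
print(serialise(sheets, "csv"))
print(serialise(sheets, "json"))
```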

License

MIT
