Untabulate grid data for LLM-friendly embeddings or similar analysis

Untabulate

Requires Python 3.12+. License: MIT.

A Cython-accelerated library for associating tabular data points with their governing row and column headers. While it includes helpers for HTML and Excel, the core logic is source-agnostic, making it ideal for LLM embeddings and RAG pipelines where semantic context is crucial.

"Improving LLM accuracy since 2036"

Installation

pip install untabulate

To include HTML parsing support:

pip install "untabulate[lxml]"

To include Excel parsing support:

pip install "untabulate[openpyxl]"

To include both:

pip install "untabulate[lxml,openpyxl]"

The Problem

When you extract data from a table like this:

                 Q1     Q2
Revenue          100    120
North America    40     50
Europe           60     70

Traditional parsers give you value=40 at position (3, 2). But for LLM embeddings, you need:

Revenue → North America → Q1: 40

Quick Start

from untabulate import untabulate_html

html = """
<table>
    <tr><th></th><th>Q1</th><th>Q2</th></tr>
    <tr><th>Revenue</th><td>100</td><td>120</td></tr>
    <tr><th>Costs</th><td>60</td><td>70</td></tr>
</table>
"""

# Get all data with semantic context in one call
for item in untabulate_html(html, format="strings"):
    print(item)

# Output:
# Revenue → Q1: 100
# Revenue → Q2: 120
# Costs → Q1: 60
# Costs → Q2: 70

Output Formats

Choose the format that fits your use case:

from untabulate import untabulate_html

html = "<table><tr><th></th><th>Q1</th></tr><tr><th>Revenue</th><td>100</td></tr></table>"

# Strings - ready for embeddings
untabulate_html(html, format="strings")
# → ["Revenue → Q1: 100"]

# Dicts - structured data with metadata
untabulate_html(html, format="dict")
# → [{"path": ["Revenue", "Q1"], "value": "100", "context": "Revenue → Q1: 100"}]

# Tuples - lightweight path/value pairs
untabulate_html(html, format="tuples")
# → [(["Revenue", "Q1"], "100")]
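
The three formats carry the same path and value in different shapes. A minimal sketch of how they relate, assuming the context string is the path joined by the separator, followed by a colon and the value (as the outputs above suggest). The render helper is hypothetical and not part of the library:

```python
from typing import Any


def render(path: list[str], value: str, fmt: str = "dict", separator: str = " → ") -> Any:
    """Render one (path, value) pair in one of the three documented shapes.

    Hypothetical helper for illustration; not the library's code.
    """
    context = f"{separator.join(path)}: {value}"
    if fmt == "strings":
        return context
    if fmt == "tuples":
        return (path, value)
    if fmt == "dict":
        return {"path": path, "value": value, "context": context}
    raise ValueError(f"unknown format: {fmt!r}")


print(render(["Revenue", "Q1"], "100", "strings"))
# prints: Revenue → Q1: 100
```

Strings suit direct embedding, while dicts keep the path available as metadata for filtering.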

Excel Files

from untabulate import untabulate_xlsx

results = untabulate_xlsx("financial_report.xlsx", format="strings")
for line in results:
    print(line)

Command Line

Install with CLI support:

pip install "untabulate[cli]"

Then use from the command line:

# Fetch and process a URL
untabulate html https://example.com/report.html

# Process a local HTML file
untabulate html ./report.html

# Target a specific table by ID
untabulate html page.html --id quarterly-results

# Process Excel files
untabulate xlsx data.xlsx --sheet "Q1 Results"

# Different output formats
untabulate html report.html --format json   # Default: structured JSON
untabulate html report.html --format text   # One line per value
untabulate html report.html --format jsonl  # JSON Lines (for streaming)
untabulate html report.html --format csv    # CSV format

# Read from stdin
curl https://example.com | untabulate -

# Custom separator
untabulate html report.html --format text --separator " | "

Custom Separator

untabulate_html(html, format="strings", separator=" | ")
# → ["Revenue | Q1: 100"]

Working with Custom Data Sources

Use untabulate() with any data source - dicts, tuples, or objects:

from untabulate import untabulate

# From database rows or API responses
data = [
    {"is_header": True, "row": 1, "col": 2, "value": "Q1"},
    {"is_header": True, "row": 2, "col": 1, "value": "Revenue"},
    {"is_header": False, "row": 2, "col": 2, "value": "100"},
]

results = untabulate(data, format="strings")
# → ["Revenue → Q1: 100"]
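
If your source is a plain list of rows (for example from a CSV reader), records of this shape can be built with a small adapter. The sketch below assumes that the first row and first column hold headers and that empty cells can be skipped. The grid_to_records helper is hypothetical and not part of the library:

```python
def grid_to_records(grid, header_rows=1, header_cols=1):
    """Convert a list-of-lists grid into 1-based cell records.

    Cells in the first `header_rows` rows or the first `header_cols`
    columns are marked as headers. Empty cells are skipped.
    Hypothetical helper for illustration.
    """
    records = []
    for r, row in enumerate(grid, start=1):
        for c, value in enumerate(row, start=1):
            if value is None or value == "":
                continue
            records.append({
                "is_header": r <= header_rows or c <= header_cols,
                "row": r,
                "col": c,
                "value": str(value),
            })
    return records


grid = [
    ["", "Q1"],
    ["Revenue", 100],
]
records = grid_to_records(grid)
for record in records:
    print(record)
```

The resulting records have the same keys as the data list above, so they can be passed to untabulate(records, format="strings") in the same way.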

Algorithm: Semantic Header Scoping

The ProjectionGrid uses a simple but effective scoping rule:

  1. Row headers (left of data) apply to the rows they span (via rowspan)
  2. Column headers (above data) apply to the columns they span (via colspan)

This captures hierarchical relationships naturally:

Row 2: "Revenue" (rowspan=3, col 1)      → applies to rows 2, 3, 4
Row 2: "North America" (rowspan=1, col 2) → applies to row 2 only
Row 3: "Europe" (rowspan=1, col 2)        → applies to row 3 only

When you query get_path(row=3, col=3), you get all headers that govern that cell: ["Revenue", "Europe", "Q1"]
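
The scoping rule can be sketched in plain Python. This is an illustration only, not the library's ProjectionGrid implementation: the Cell class and this get_path function are hypothetical stand-ins, and the sketch rescans every header on each query, which is simple but slower than a precomputed projection.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cell:
    """Hypothetical stand-in for a table cell (1-based positions)."""
    is_header: bool
    row: int
    col: int
    value: str
    rowspan: int = 1
    colspan: int = 1


def get_path(cells, row, col):
    """Return the header values governing the data cell at (row, col)."""
    headers = [c for c in cells if c.is_header]
    # Row headers: span the queried row and end to the left of the cell.
    row_headers = [
        h for h in headers
        if h.row <= row < h.row + h.rowspan and h.col + h.colspan - 1 < col
    ]
    # Column headers: span the queried column and end above the cell.
    col_headers = [
        h for h in headers
        if h.col <= col < h.col + h.colspan and h.row + h.rowspan - 1 < row
    ]
    row_headers.sort(key=lambda h: h.col)  # outermost row header first
    col_headers.sort(key=lambda h: h.row)  # topmost column header first
    return [h.value for h in row_headers + col_headers]


cells = [
    Cell(True, 1, 3, "Q1"),
    Cell(True, 1, 4, "Q2"),
    Cell(True, 2, 1, "Revenue", rowspan=3),
    Cell(True, 2, 2, "North America"),
    Cell(True, 3, 2, "Europe"),
]
print(get_path(cells, row=3, col=3))
# prints: ['Revenue', 'Europe', 'Q1']
```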

API Reference

High-Level Functions

untabulate_html(html, *, format="dict", separator=" → ", span_as_label=False, all_tables=False)

Parse HTML and extract data with semantic paths in one step.

  • html: HTML string containing table(s)
  • format: "dict", "strings", or "tuples"
  • separator: Path separator for context strings
  • span_as_label: Treat cells with rowspan/colspan > 1 as headers
  • all_tables: Parse all tables (returns list of lists)
  • Returns: List of results in the specified format
  • Raises: TableNotFoundError if no table found

untabulate_xlsx(filepath, *, sheet_name=None, format="dict", separator=" → ")

Parse Excel and extract data with semantic paths in one step.

  • filepath: Path to .xlsx file
  • sheet_name: Worksheet name (default: active sheet)
  • format: "dict", "strings", or "tuples"
  • separator: Path separator for context strings
  • Returns: List of results in the specified format

untabulate(data, *, format="dict", separator=" → ")

Extract semantic paths from any data source.

  • data: List of dicts, tuples, objects, or GridElement instances
  • format: "dict", "strings", or "tuples"
  • separator: Path separator for context strings
  • Returns: List of results in the specified format

Low-Level API

For advanced use cases, you can use the lower-level components directly:

parse_html_table(html_string, span_as_label=False, all_tables=False)

Parse HTML table(s) into GridElement instances.

parse_xlsx_worksheet(filepath, sheet_name=None)

Parse an Excel worksheet into GridElement instances.

ProjectionGrid(elements)

Build a semantic header projection from elements.

ProjectionGrid.get_path(data_row, data_col)

Get headers governing a cell position.

GridElement(is_header, row, col, rowspan, colspan, value)

Lightweight element for table cells.

  • is_header: True if this cell is a header, False for data cells
  • row/col: 1-based position
  • rowspan/colspan: Cell span
  • value: Text content of the cell

Performance

~1M cells/second on typical hardware. The Cython implementation provides ~30% speedup over pure Python, but the main win is the O(n) algorithm vs O(n²) naive approaches.
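
One way to avoid rescanning every header for every data cell is to project each header across its span once, then look cells up by row and column. The sketch below illustrates that idea only. It is not taken from the library's Cython source, and build_index and lookup are hypothetical names:

```python
from collections import defaultdict


def build_index(headers):
    """Project headers across their spans in a single pass.

    headers: iterable of (row, col, rowspan, colspan, value) tuples.
    """
    by_row = defaultdict(list)  # row -> headers whose span covers that row
    by_col = defaultdict(list)  # col -> headers whose span covers that column
    for header in headers:
        row, col, rowspan, colspan, _value = header
        for r in range(row, row + rowspan):
            by_row[r].append(header)
        for c in range(col, col + colspan):
            by_col[c].append(header)
    return by_row, by_col


def lookup(index, row, col):
    """Return header values for (row, col), touching only nearby headers."""
    by_row, by_col = index
    # Headers covering this row that end to the left of the cell.
    left = sorted(
        (h for h in by_row.get(row, []) if h[1] + h[3] - 1 < col),
        key=lambda h: h[1],
    )
    # Headers covering this column that end above the cell.
    above = sorted(
        (h for h in by_col.get(col, []) if h[0] + h[2] - 1 < row),
        key=lambda h: h[0],
    )
    return [h[4] for h in left + above]


headers = [
    (1, 3, 1, 1, "Q1"),
    (1, 4, 1, 1, "Q2"),
    (2, 1, 3, 1, "Revenue"),
    (2, 2, 1, 1, "North America"),
    (3, 2, 1, 1, "Europe"),
]
index = build_index(headers)
print(lookup(index, 3, 3))
# prints: ['Revenue', 'Europe', 'Q1']
```

Each lookup inspects only the headers that cover the cell's row or column, rather than every header in the table.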

Why This Matters for LLMs

Embedding models need semantic context, not coordinates. When chunking documents for RAG:

"40" - meaningless without context
"cell (3,2): 40" - coordinates don't help
"Revenue → North America → Q1: 40" - full semantic path

This enables:

  • Better vector similarity for table-based questions
  • Accurate retrieval of specific data points
  • Natural language grounding for structured data
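
As a rough sketch of how this fits a RAG pipeline, dict-format results can be turned into chunk records that keep the context string as the text to embed and the path as metadata. The record layout and the to_chunks helper are assumptions for illustration; adapt them to whatever your vector store expects:

```python
import hashlib


def to_chunks(results, source):
    """Turn dict-format results into records for a vector store.

    results: items shaped like {"path": [...], "value": ..., "context": ...}.
    Hypothetical helper; the output layout is an assumption.
    """
    chunks = []
    for item in results:
        text = item["context"]
        # Stable ID derived from the source name and the context string.
        digest = hashlib.sha256(f"{source}|{text}".encode("utf-8")).hexdigest()
        chunks.append({
            "id": digest[:16],
            "text": text,
            "metadata": {
                "source": source,
                "path": item["path"],
                "value": item["value"],
            },
        })
    return chunks


results = [
    {"path": ["Revenue", "Q1"], "value": "100", "context": "Revenue → Q1: 100"},
]
chunks = to_chunks(results, source="report.html")
print(chunks[0]["text"])
# prints: Revenue → Q1: 100
```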

Development

# Clone and install in development mode
git clone https://github.com/patrickcd/untabulate.git
cd untabulate
pip install -e ".[dev]"

# Run tests
pytest

# Build distribution
python -m build

License

MIT License - see LICENSE for details.


