Untabulate grid data for LLM-friendly embeddings or similar analysis

Untabulate

Requires Python 3.12+ · License: MIT

A Cython-accelerated library for associating tabular data points with their governing row and column headers. While it includes helpers for HTML and Excel, the core logic is source-agnostic, making it ideal for LLM embeddings and RAG pipelines where semantic context is crucial.

Installation

pip install untabulate

To include HTML parsing support:

pip install "untabulate[lxml]"

To include Excel parsing support:

pip install "untabulate[openpyxl]"

To include both:

pip install "untabulate[lxml,openpyxl]"

The Problem

When you extract data from a table like this:

                     Q1    Q2
Revenue             100   120
  ↳ North America    40    50
  ↳ Europe           60    70

Traditional parsers give you value=40 at position (3, 2). But for LLM embeddings, you need:

Revenue → North America → Q1: 40

This library addresses that problem.

Quick Start

from untabulate import untabulate_html

html = """
<table>
    <tr><th></th><th>Q1</th><th>Q2</th></tr>
    <tr><th>Revenue</th><td>100</td><td>120</td></tr>
    <tr><th>Costs</th><td>60</td><td>70</td></tr>
</table>
"""

# Get all data with semantic context in one call
for item in untabulate_html(html, format="strings"):
    print(item)

# Output:
# Revenue → Q1: 100
# Revenue → Q2: 120
# Costs → Q1: 60
# Costs → Q2: 70

Output Formats

Choose the format that fits your use case:

from untabulate import untabulate_html

html = "<table><tr><th></th><th>Q1</th></tr><tr><th>Revenue</th><td>100</td></tr></table>"

# Strings - ready for embeddings
untabulate_html(html, format="strings")
# → ["Revenue → Q1: 100"]

# Dicts - structured data with metadata
untabulate_html(html, format="dict")
# → [{"path": ["Revenue", "Q1"], "value": "100", "context": "Revenue → Q1: 100"}]

# Tuples - lightweight path/value pairs
untabulate_html(html, format="tuples")
# → [(["Revenue", "Q1"], "100")]
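The three formats carry the same information in different shapes. The sketch below shows that relationship as inferred from the sample outputs above only; `render_result` is a hypothetical helper written for illustration, not a function exported by untabulate.

```python
def render_result(path, value, fmt="dict", separator=" → "):
    """Shape one (path, value) pair like the sample outputs (hypothetical helper)."""
    # The context string joins the header path, then appends the value.
    context = f"{separator.join(path)}: {value}"
    if fmt == "strings":
        return context
    if fmt == "tuples":
        return (list(path), value)
    if fmt == "dict":
        return {"path": list(path), "value": value, "context": context}
    raise ValueError(f"unknown format: {fmt!r}")


print(render_result(["Revenue", "Q1"], "100", fmt="strings"))
# prints: Revenue → Q1: 100
```

The "strings" form is the most compact input for an embedding model, while the "dict" form keeps the path available as metadata for filtering.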

Excel Files

from untabulate import untabulate_xlsx

results = untabulate_xlsx("financial_report.xlsx", format="strings")
for line in results:
    print(line)

Custom Separator

# `html` is the table string from the Output Formats example
untabulate_html(html, format="strings", separator=" | ")
# → ["Revenue | Q1: 100"]

Working with Custom Data Sources

Use untabulate() with any data source - dicts, tuples, or objects:

from untabulate import untabulate

# From database rows or API responses
data = [
    {"el_type": "LB", "row": 1, "col": 2, "label": "Q1"},
    {"el_type": "LB", "row": 2, "col": 1, "label": "Revenue"},
    {"el_type": "DT", "row": 2, "col": 2, "label": "100"},
]

results = untabulate(data, format="strings")
# → ["Revenue → Q1: 100"]
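Records in this shape can be built from any row-oriented source. As a minimal sketch, the hypothetical helper `grid_to_records` below converts a list-of-lists table into the dicts used above. It assumes that the first row and the first column hold headers and that empty cells can be skipped; neither assumption comes from the library.

```python
def grid_to_records(grid):
    """Convert a list-of-lists table into el_type/row/col/label dicts.

    Assumption for this sketch: row 1 and column 1 are headers ("LB"),
    everything else is data ("DT"). Positions are 1-based.
    """
    records = []
    for r, row in enumerate(grid, start=1):
        for c, value in enumerate(row, start=1):
            if value is None or value == "":
                continue  # skip empty cells such as the top-left corner
            el_type = "LB" if r == 1 or c == 1 else "DT"
            records.append({"el_type": el_type, "row": r, "col": c, "label": str(value)})
    return records


records = grid_to_records([["", "Q1"], ["Revenue", 100]])
for record in records:
    print(record)
```

The resulting list matches the `data` literal in the example above and could be passed to untabulate() in the same way.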

Algorithm: Semantic Header Scoping

The ProjectionGrid applies two simple scoping rules:

  1. Row headers (column 1) propagate downward to all rows below them
  2. Column headers apply to the columns they span

This captures hierarchical relationships naturally:

Row 2: "Revenue" in col 1       → applies to rows 2, 3, 4...
Row 3: "North America" in col 1 → applies to rows 3, 4...
Row 4: "Europe" in col 1        → applies to rows 4...

With the column headers "Q1" and "Q2" in row 1, querying get_path(row=3, col=2) returns every header that governs that cell: ["Revenue", "North America", "Q1"]
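The two rules can be sketched in plain Python. This is an illustration of the rules as stated, not the library's ProjectionGrid: `Cell` and `sketch_get_path` are hypothetical names, and the grid is the table from The Problem section with "Q1" and "Q2" placed in row 1.

```python
from typing import NamedTuple


class Cell(NamedTuple):
    """Stand-in for a grid element; field names follow GridElement in the API reference."""
    el_type: str  # "LB" (label/header) or "DT" (data)
    row: int      # 1-based
    col: int      # 1-based
    label: str
    rowspan: int = 1
    colspan: int = 1


def sketch_get_path(cells, data_row, data_col):
    """Return the header labels governing (data_row, data_col) under the two rules."""
    labels = [c for c in cells if c.el_type == "LB"]
    # Rule 1: a column-1 header governs its own row and every row below it.
    row_headers = sorted(
        (c for c in labels if c.col == 1 and c.row <= data_row),
        key=lambda c: c.row,
    )
    # Rule 2: a header outside column 1 governs the columns it spans.
    col_headers = sorted(
        (c for c in labels
         if c.col > 1 and c.row <= data_row and c.col <= data_col < c.col + c.colspan),
        key=lambda c: c.row,
    )
    return [c.label for c in row_headers] + [c.label for c in col_headers]


cells = [
    Cell("LB", 1, 2, "Q1"), Cell("LB", 1, 3, "Q2"),
    Cell("LB", 2, 1, "Revenue"), Cell("DT", 2, 2, "100"), Cell("DT", 2, 3, "120"),
    Cell("LB", 3, 1, "North America"), Cell("DT", 3, 2, "40"), Cell("DT", 3, 3, "50"),
    Cell("LB", 4, 1, "Europe"), Cell("DT", 4, 2, "60"), Cell("DT", 4, 3, "70"),
]
print(sketch_get_path(cells, 3, 2))
# prints: ['Revenue', 'North America', 'Q1']
```

The sketch applies the rules literally, so every column-1 header above a cell is included in its path. How the library decides where a header's scope ends is not described here and may differ.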

API Reference

High-Level Functions

untabulate_html(html, *, format="dict", separator=" → ", span_as_label=False, all_tables=False)

Parse HTML and extract data with semantic paths in one step.

  • html: HTML string containing table(s)
  • format: "dict", "strings", or "tuples"
  • separator: Path separator for context strings
  • span_as_label: Treat cells with rowspan/colspan > 1 as headers
  • all_tables: Parse all tables (returns list of lists)
  • Returns: List of results in the specified format
  • Raises: TableNotFoundError if no table found

untabulate_xlsx(filepath, *, sheet_name=None, format="dict", separator=" → ")

Parse Excel and extract data with semantic paths in one step.

  • filepath: Path to .xlsx file
  • sheet_name: Worksheet name (default: active sheet)
  • format: "dict", "strings", or "tuples"
  • separator: Path separator for context strings
  • Returns: List of results in the specified format

untabulate(data, *, format="dict", separator=" → ")

Extract semantic paths from any data source.

  • data: List of dicts, tuples, objects, or GridElement instances
  • format: "dict", "strings", or "tuples"
  • separator: Path separator for context strings
  • Returns: List of results in the specified format

Low-Level API

For advanced use cases, you can use the lower-level components directly:

parse_html_table(html_string, span_as_label=False, all_tables=False)

Parse HTML table(s) into GridElement instances.
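To make the element model concrete, here is a rough standard-library sketch of turning `<th>`/`<td>` cells into 1-based positions while honouring rowspan and colspan. It is not how parse_html_table works (the library's HTML support is installed through the lxml extra). `parse_cells` and `_CellCollector` are hypothetical names, and the sketch assumes a single, non-nested table with explicit closing tags.

```python
from html.parser import HTMLParser


class _CellCollector(HTMLParser):
    """Collect (el_type, row, col, rowspan, colspan, label) tuples from one table."""

    def __init__(self):
        super().__init__()
        self.cells = []
        self.row = 0
        self.col = 0
        self.occupied = set()  # (row, col) slots already covered by a span
        self.current = None    # [el_type, rowspan, colspan, text parts] while inside a cell

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.row += 1
            self.col = 0
        elif tag in ("th", "td"):
            a = dict(attrs)
            el_type = "LB" if tag == "th" else "DT"
            self.current = [el_type, int(a.get("rowspan", 1)), int(a.get("colspan", 1)), []]

    def handle_data(self, data):
        if self.current is not None:
            self.current[3].append(data)

    def handle_endtag(self, tag):
        if tag not in ("th", "td") or self.current is None:
            return
        el_type, rowspan, colspan, parts = self.current
        self.current = None
        # Advance to the next free column, skipping slots taken by earlier spans.
        self.col += 1
        while (self.row, self.col) in self.occupied:
            self.col += 1
        start_col = self.col
        for r in range(self.row, self.row + rowspan):
            for c in range(start_col, start_col + colspan):
                self.occupied.add((r, c))
        self.col = start_col + colspan - 1
        label = "".join(parts).strip()
        if label:  # empty cells (e.g. the top-left corner) are dropped
            self.cells.append((el_type, self.row, start_col, rowspan, colspan, label))


def parse_cells(html):
    collector = _CellCollector()
    collector.feed(html)
    collector.close()
    return collector.cells


html = "<table><tr><th></th><th>Q1</th></tr><tr><th>Revenue</th><td>100</td></tr></table>"
for cell in parse_cells(html):
    print(cell)
```

Tracking occupied slots is what keeps a cell in the right column when a header above it spans several rows.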

parse_xlsx_worksheet(filepath, sheet_name=None)

Parse an Excel worksheet into GridElement instances.

ProjectionGrid(elements)

Build a semantic header projection from elements.

ProjectionGrid.get_path(data_row, data_col)

Get headers governing a cell position.

GridElement(el_type, row, col, rowspan, colspan, label)

Lightweight element for table cells.

  • el_type: "LB" (label/header) or "DT" (data)
  • row/col: 1-based position
  • rowspan/colspan: Cell span
  • label: Text content

Performance

Throughput is roughly 1M cells/second on typical hardware. The Cython implementation is about 30% faster than pure Python, but the main gain comes from the O(n) algorithm, compared with naive O(n²) approaches.
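One way to avoid rescanning every header for every data cell is to sort the cells once and sweep them in reading order, carrying the headers seen so far. The sketch below illustrates that idea only; it is not the library's Cython implementation. `sweep_paths` and the tuple layout are assumptions, and spans are ignored.

```python
def sweep_paths(cells):
    """Yield (path, value) pairs in one pass over (el_type, row, col, label) tuples."""
    ordered = sorted(cells, key=lambda c: (c[1], c[2]))  # reading order: row, then column
    row_headers = []   # column-1 labels seen so far, top to bottom
    col_headers = {}   # column number -> labels seen above in that column
    results = []
    for el_type, row, col, label in ordered:
        if el_type == "LB" and col == 1:
            row_headers.append(label)
        elif el_type == "LB":
            col_headers.setdefault(col, []).append(label)
        else:
            # Each data cell reads the running state instead of searching all headers.
            results.append((row_headers + col_headers.get(col, []), label))
    return results


cells = [
    ("LB", 1, 2, "Q1"), ("LB", 1, 3, "Q2"),
    ("LB", 2, 1, "Revenue"), ("DT", 2, 2, "100"), ("DT", 2, 3, "120"),
    ("LB", 3, 1, "North America"), ("DT", 3, 2, "40"), ("DT", 3, 3, "50"),
]
for path, value in sweep_paths(cells):
    print(" → ".join(path) + ": " + value)
```

After the initial sort, each cell is visited exactly once. A naive version would instead scan the full header list for every data cell.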

Why This Matters for LLMs

Embedding models need semantic context, not coordinates. When chunking documents for RAG:

"40" - meaningless without context
"cell (3,2): 40" - coordinates don't help
"Revenue → North America → Q1: 40" - full semantic path

This enables:

  • Better vector similarity for table-based questions
  • Accurate retrieval of specific data points
  • Natural language grounding for structured data

Development

# Clone and install in development mode
git clone https://github.com/patrick/untabulate.git
cd untabulate
pip install -e ".[dev]"

# Run tests
pytest

# Build distribution
python -m build

License

MIT License - see LICENSE for details.
