Untabulate grid data for LLM-friendly embeddings or similar analysis

Untabulate

PyPI version Python 3.12+ License: MIT

Extract table cell values with their row and column headers in Python.

Untabulate maps every data cell in a table to the row headers and column headers that govern it, producing semantic paths like Revenue → North America → Q1: 40. It handles hierarchical headers, merged cells (rowspan/colspan), and works with HTML tables, Excel spreadsheets, or any custom data source.

Built for LLM embeddings, RAG pipelines, and any workflow where a bare cell value is meaningless without its header context.

Use Cases

  • LLM & RAG pipelines — convert table cells into semantic strings for vector embeddings
  • HTML table scraping — associate each scraped value with its row and column headers
  • Excel data extraction — flatten spreadsheets with merged/hierarchical headers
  • Data flattening — turn any 2D table with multi-level headers into flat key-value pairs

Installation

pip install untabulate

To include HTML parsing support:

pip install "untabulate[lxml]"

To include Excel parsing support:

pip install "untabulate[openpyxl]"

To include both:

pip install "untabulate[lxml,openpyxl]"

The Problem: Table Cells Without Header Context

When you extract data from a table like this:

                 Q1    Q2
Revenue          100   120
North America    40    50
Europe           60    70

Traditional parsers give you value=40 at position (3, 3). But for LLM embeddings, semantic search, or readable output, you need the value associated with its headers:

Revenue → North America → Q1: 40

Untabulate solves this by projecting row and column headers onto every data cell automatically, even when headers span multiple rows or columns.

Quick Start

from untabulate import untabulate_html

html = """
<table>
    <tr><th></th><th>Q1</th><th>Q2</th></tr>
    <tr><th>Revenue</th><td>100</td><td>120</td></tr>
    <tr><th>Costs</th><td>60</td><td>70</td></tr>
</table>
"""

# Get all data with semantic context in one call
for item in untabulate_html(html, format="strings"):
    print(item)

# Output:
# Revenue → Q1: 100
# Revenue → Q2: 120
# Costs → Q1: 60
# Costs → Q2: 70

Output Formats

Choose the format that fits your use case:

from untabulate import untabulate_html

html = "<table><tr><th></th><th>Q1</th></tr><tr><th>Revenue</th><td>100</td></tr></table>"

# Strings - ready for embeddings
untabulate_html(html, format="strings")
# → ["Revenue → Q1: 100"]

# Dicts - structured data with metadata
untabulate_html(html, format="dict")
# → [{"path": ["Revenue", "Q1"], "value": "100", "context": "Revenue → Q1: 100"}]

# Tuples - lightweight path/value pairs
untabulate_html(html, format="tuples")
# → [(["Revenue", "Q1"], "100")]
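The three shapes above carry the same information. As a minimal sketch of how they relate (the helper name `render` is hypothetical and not part of the library), one path/value pair can be rendered in each shape like this:

```python
from typing import Sequence


def render(path: Sequence[str], value: str, fmt: str = "dict", separator: str = " → "):
    """Render one (path, value) pair in one of the three documented shapes.

    Hypothetical helper for illustration only; the library's own
    implementation may differ.
    """
    # The context string joins the header path and appends the cell value.
    context = f"{separator.join(path)}: {value}"
    if fmt == "strings":
        return context
    if fmt == "tuples":
        return (list(path), value)
    if fmt == "dict":
        return {"path": list(path), "value": value, "context": context}
    raise ValueError(f"unknown format: {fmt!r}")


print(render(["Revenue", "Q1"], "100", "strings"))
print(render(["Revenue", "Q1"], "100", "tuples"))
print(render(["Revenue", "Q1"], "100", "dict"))
```

The "strings" shape is the "context" field of the "dict" shape, so the dict form is a reasonable default when both the embedding text and the structured path are needed.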

Excel Files

from untabulate import untabulate_xlsx

results = untabulate_xlsx("financial_report.xlsx", format="strings")
for line in results:
    print(line)

Command Line

Install with CLI support:

pip install "untabulate[cli]"

Then use from the command line:

# Fetch and process a URL
untabulate html https://example.com/report.html

# Process a local HTML file
untabulate html ./report.html

# Target a specific table by ID
untabulate html page.html --id quarterly-results

# Process Excel files
untabulate xlsx data.xlsx --sheet "Q1 Results"

# Different output formats
untabulate html report.html --format json   # Default: structured JSON
untabulate html report.html --format text   # One line per value
untabulate html report.html --format jsonl  # JSON Lines (for streaming)
untabulate html report.html --format csv    # CSV format

# Read from stdin
curl https://example.com | untabulate -

# Custom separator
untabulate html report.html --format text --separator " | "

Custom Separator

untabulate_html(html, format="strings", separator=" | ")
# → ["Revenue | Q1: 100"]

Working with Any Data Source

Use untabulate() with any data source - dicts, tuples, or objects:

from untabulate import untabulate

# From database rows or API responses
data = [
    {"is_header": True, "row": 1, "col": 2, "value": "Q1"},
    {"is_header": True, "row": 2, "col": 1, "value": "Revenue"},
    {"is_header": False, "row": 2, "col": 2, "value": "100"},
]

results = untabulate(data, format="strings")
# → ["Revenue → Q1: 100"]
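For sources the built-in parsers do not cover, the main work is producing records like the ones above, with correct 1-based positions even when cells are merged. The following is a rough sketch using only the standard library's `html.parser`; it is not the library's parser. The class name `CellCollector` is hypothetical, and whether `untabulate()` reads `rowspan` and `colspan` keys from dicts is an assumption based on the `GridElement` fields.

```python
from html.parser import HTMLParser


class CellCollector(HTMLParser):
    """Collect table cells as dicts with 1-based grid positions.

    Illustrative only: handles a single, non-nested table.
    """

    def __init__(self):
        super().__init__()
        self.cells = []          # finished cell records
        self._occupied = set()   # (row, col) slots already claimed by a span
        self._row = 0
        self._col = 1
        self._current = None     # cell currently being read, if any

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row += 1
            self._col = 1
        elif tag in ("th", "td") and self._row > 0:
            attr = dict(attrs)
            rowspan = int(attr.get("rowspan") or 1)
            colspan = int(attr.get("colspan") or 1)
            # Skip slots covered by a rowspan from an earlier row.
            while (self._row, self._col) in self._occupied:
                self._col += 1
            # Claim every slot this cell spans.
            for r in range(self._row, self._row + rowspan):
                for c in range(self._col, self._col + colspan):
                    self._occupied.add((r, c))
            self._current = {
                "is_header": tag == "th",
                "row": self._row, "col": self._col,
                "rowspan": rowspan, "colspan": colspan,
                "value": "",
            }
            self._col += colspan

    def handle_data(self, data):
        if self._current is not None:
            self._current["value"] += data

    def handle_endtag(self, tag):
        if tag in ("th", "td") and self._current is not None:
            self._current["value"] = self._current["value"].strip()
            # Empty cells (such as the top-left corner) are dropped.
            if self._current["value"]:
                self.cells.append(self._current)
            self._current = None


SAMPLE = """
<table>
  <tr><th></th><th></th><th>Q1</th><th>Q2</th></tr>
  <tr><th rowspan="2">Revenue</th><th>North America</th><td>40</td><td>50</td></tr>
  <tr><th>Europe</th><td>60</td><td>70</td></tr>
</table>
"""

collector = CellCollector()
collector.feed(SAMPLE)
for cell in collector.cells:
    print(cell)
```

Note how "Europe" lands in column 2 rather than column 1: the slot at row 3, column 1 is already claimed by the "Revenue" cell's rowspan.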

How It Works: Semantic Header Projection Algorithm

The ProjectionGrid uses a simple but effective scoping rule:

  1. Row headers (left of data) apply to the rows they span (via rowspan)
  2. Column headers (above data) apply to the columns they span (via colspan)

This captures hierarchical and merged header relationships naturally:

Row 2: "Revenue" (rowspan=3, col 1)      → applies to rows 2, 3, 4
Row 2: "North America" (rowspan=1, col 2) → applies to row 2 only
Row 3: "Europe" (rowspan=1, col 2)        → applies to row 3 only

When you query get_path(row=3, col=3), you get all headers that govern that cell: ["Revenue", "Europe", "Q1"]
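The scoping rule can be sketched in a few lines of plain Python. This is an illustration of the rule as described here, not the library's `ProjectionGrid`; the `Cell` type and the standalone `get_path` function are hypothetical stand-ins, and placing "Q1" and "Q2" in columns 3 and 4 of row 1 is an assumption about the example layout.

```python
from typing import NamedTuple


class Cell(NamedTuple):
    """Stand-in for a grid element (hypothetical, not the library class)."""
    is_header: bool
    row: int
    col: int
    value: str
    rowspan: int = 1
    colspan: int = 1


def get_path(cells, data_row, data_col):
    """Return the values of all headers that govern (data_row, data_col)."""
    row_headers, col_headers = [], []
    for cell in cells:
        if not cell.is_header:
            continue
        covers_row = cell.row <= data_row < cell.row + cell.rowspan
        covers_col = cell.col <= data_col < cell.col + cell.colspan
        if covers_row and cell.col < data_col:
            row_headers.append(cell)   # left of the data cell, spans its row
        elif covers_col and cell.row < data_row:
            col_headers.append(cell)   # above the data cell, spans its column
    row_headers.sort(key=lambda c: c.col)  # outermost (leftmost) first
    col_headers.sort(key=lambda c: c.row)  # outermost (topmost) first
    return [c.value for c in row_headers + col_headers]


cells = [
    Cell(True, 1, 3, "Q1"),
    Cell(True, 1, 4, "Q2"),
    Cell(True, 2, 1, "Revenue", rowspan=3),
    Cell(True, 2, 2, "North America"),
    Cell(True, 3, 2, "Europe"),
]

print(get_path(cells, 3, 3))
```

This version rescans every header for each query, which keeps the rule easy to read but is not how you would process a large table.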

API Reference

High-Level Functions

untabulate_html(html, *, format="dict", separator=" → ", span_as_label=False, all_tables=False)

Parse HTML and extract data with semantic paths in one step.

  • html: HTML string containing table(s)
  • format: "dict", "strings", or "tuples"
  • separator: Path separator for context strings
  • span_as_label: Treat cells with rowspan/colspan > 1 as headers
  • all_tables: Parse all tables (returns list of lists)
  • Returns: List of results in the specified format
  • Raises: TableNotFoundError if no table found

untabulate_xlsx(filepath, *, sheet_name=None, format="dict", separator=" → ")

Parse Excel and extract data with semantic paths in one step.

  • filepath: Path to .xlsx file
  • sheet_name: Worksheet name (default: active sheet)
  • format: "dict", "strings", or "tuples"
  • separator: Path separator for context strings
  • Returns: List of results in the specified format

untabulate(data, *, format="dict", separator=" → ")

Extract semantic paths from any data source.

  • data: List of dicts, tuples, objects, or GridElement instances
  • format: "dict", "strings", or "tuples"
  • separator: Path separator for context strings
  • Returns: List of results in the specified format

Low-Level API

For advanced use cases, you can use the lower-level components directly:

parse_html_table(html_string, span_as_label=False, all_tables=False)

Parse HTML table(s) into GridElement instances.

parse_xlsx_worksheet(filepath, sheet_name=None)

Parse an Excel worksheet into GridElement instances.

ProjectionGrid(elements)

Build a semantic header projection from elements.

ProjectionGrid.get_path(data_row, data_col)

Get headers governing a cell position.

GridElement(is_header, row, col, rowspan, colspan, value)

Lightweight element for table cells.

  • is_header: True if this cell is a header, False for data cells
  • row/col: 1-based position
  • rowspan/colspan: Cell span
  • value: Text content of the cell

Performance

Throughput is roughly 1M cells/second on typical hardware. The Cython implementation is about 30% faster than pure Python, but the main gain comes from the O(n) algorithm, compared with O(n²) for naive approaches.
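To make the complexity contrast concrete, here is a sketch of one way to avoid rescanning every header for every data cell: index the headers by the rows and columns they span, then look each data cell up directly. It is not the library's Cython implementation. The function name `project_all` is hypothetical, and the sketch assumes a simple layout in which column headers sit above the first data row and row headers sit left of the first data column.

```python
from collections import defaultdict


def project_all(cells):
    """Return (path, value) pairs for every data cell via prebuilt indexes.

    Assumes column headers lie above the first data row and row headers
    lie left of the first data column (an illustrative simplification).
    """
    data = [c for c in cells if not c["is_header"]]
    if not data:
        return []
    first_row = min(c["row"] for c in data)
    first_col = min(c["col"] for c in data)
    headers = [c for c in cells if c["is_header"]]

    # Column headers, topmost first, indexed by every column they span.
    by_col = defaultdict(list)
    col_headers = [h for h in headers if h["row"] < first_row]
    for h in sorted(col_headers, key=lambda h: h["row"]):
        for col in range(h["col"], h["col"] + h.get("colspan", 1)):
            by_col[col].append(h["value"])

    # Row headers, leftmost first, indexed by every row they span.
    by_row = defaultdict(list)
    row_headers = [h for h in headers if h["row"] >= first_row and h["col"] < first_col]
    for h in sorted(row_headers, key=lambda h: h["col"]):
        for row in range(h["row"], h["row"] + h.get("rowspan", 1)):
            by_row[row].append(h["value"])

    # Each data cell now costs two dictionary lookups instead of a full scan.
    return [(by_row[c["row"]] + by_col[c["col"]], c["value"]) for c in data]


cells = [
    {"is_header": True, "row": 1, "col": 3, "value": "Q1"},
    {"is_header": True, "row": 1, "col": 4, "value": "Q2"},
    {"is_header": True, "row": 2, "col": 1, "rowspan": 2, "value": "Revenue"},
    {"is_header": True, "row": 2, "col": 2, "value": "North America"},
    {"is_header": True, "row": 3, "col": 2, "value": "Europe"},
    {"is_header": False, "row": 2, "col": 3, "value": "40"},
    {"is_header": False, "row": 2, "col": 4, "value": "50"},
    {"is_header": False, "row": 3, "col": 3, "value": "60"},
    {"is_header": False, "row": 3, "col": 4, "value": "70"},
]

results = project_all(cells)
for path, value in results:
    print(" → ".join(path) + f": {value}")
```

Apart from sorting the headers, the work grows with the number of cells plus the total extent of the header spans, whereas checking every header against every data cell grows with their product.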

Why Untabulate Helps with LLM Embeddings and RAG

Embedding models need semantic context, not coordinates. When chunking documents for retrieval-augmented generation:

  ❌ "40" - meaningless without context
  ❌ "cell (3,2): 40" - coordinates don't help similarity search
  ✅ "Revenue → North America → Q1: 40" - full semantic path with headers

This enables:

  • Better vector similarity for table-based questions
  • Accurate retrieval of specific data points from tables
  • Natural language grounding for structured and tabular data
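As a toy illustration of the retrieval point, the sketch below scores the three candidate chunks against a question using bag-of-words cosine similarity. This is a crude stand-in for a real embedding model, chosen only because it runs with the standard library; the helper names `bag` and `cosine` and the sample query are assumptions made for the example.

```python
import math
import re
from collections import Counter


def bag(text):
    """Lowercased word/number counts; a crude stand-in for an embedding."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


query = "What was North America revenue in Q1?"
candidates = [
    "40",
    "cell (3,2): 40",
    "Revenue → North America → Q1: 40",
]

scores = {text: cosine(bag(query), bag(text)) for text in candidates}
best = max(scores, key=scores.get)

for text, score in scores.items():
    print(f"{score:.3f}  {text}")
print("best match:", best)
```

Only the full semantic path shares any vocabulary with the question, so it is the only chunk with a non-zero score. Real embedding models capture far more than token overlap, but the same intuition applies.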

Development

# Clone and install in development mode
git clone https://github.com/patrick/untabulate.git
cd untabulate
pip install -e ".[dev]"

# Run tests
pytest

# Build distribution
python -m build

License

MIT License - see LICENSE for details.
