Untabulate grid data for LLM-friendly embeddings or similar analysis

Project description

Untabulate

Requires Python 3.12+. License: MIT.

Extract table cell values with their row and column headers in Python.

Untabulate maps every data cell in a table to the row headers and column headers that govern it, producing semantic paths like Revenue → North America → Q1: 40. It handles hierarchical headers, merged cells (rowspan/colspan), and works with HTML tables, Excel spreadsheets, or any custom data source.

Built for LLM embeddings, RAG pipelines, and any workflow where a bare cell value is meaningless without its header context.

Use Cases

  • LLM & RAG pipelines — convert table cells into semantic strings for vector embeddings
  • HTML table scraping — associate each scraped value with its row and column headers
  • Excel data extraction — flatten spreadsheets with merged/hierarchical headers
  • Data flattening — turn any 2D table with multi-level headers into flat key-value pairs

Installation

pip install untabulate

To include HTML parsing support:

pip install "untabulate[lxml]"

To include Excel parsing support:

pip install "untabulate[openpyxl]"

To include both:

pip install "untabulate[lxml,openpyxl]"

The Problem: Table Cells Without Header Context

When you extract data from a table like this:

                            Q1     Q2
Revenue                    100    120
    North America           40     50
    Europe                  60     70

Traditional parsers give you value=40 at position (3, 3). But for LLM embeddings, semantic search, or readable output, you need the value associated with its headers:

Revenue → North America → Q1: 40

Untabulate solves this by projecting row and column headers onto every data cell automatically, even when headers span multiple rows or columns.

Quick Start

from untabulate import untabulate_html

html = """
<table>
    <tr><th></th><th>Q1</th><th>Q2</th></tr>
    <tr><th>Revenue</th><td>100</td><td>120</td></tr>
    <tr><th>Costs</th><td>60</td><td>70</td></tr>
</table>
"""

# Get all data with semantic context in one call
for item in untabulate_html(html, format="strings"):
    print(item)

# Output:
# Revenue → Q1: 100
# Revenue → Q2: 120
# Costs → Q1: 60
# Costs → Q2: 70

Output Formats

Choose the format that fits your use case:

from untabulate import untabulate_html

html = "<table><tr><th></th><th>Q1</th></tr><tr><th>Revenue</th><td>100</td></tr></table>"

# Strings - ready for embeddings
untabulate_html(html, format="strings")
# → ["Revenue → Q1: 100"]

# Dicts - structured data with metadata
untabulate_html(html, format="dict")
# → [{"path": ["Revenue", "Q1"], "value": "100", "context": "Revenue → Q1: 100"}]

# Tuples - lightweight path/value pairs
untabulate_html(html, format="tuples")
# → [(["Revenue", "Q1"], "100")]
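
If nested lookups are more convenient than flat pairs, the tuples format can be folded into a dictionary. The following is a minimal sketch and not part of the library: nest_pairs is a hypothetical helper, and the pairs list is hand-written in the shape shown above rather than produced by a real call.

```python
def nest_pairs(pairs):
    """Fold (path, value) pairs into a nested dict keyed by header.

    Assumes no path is a strict prefix of another path; that case
    would need a different container and is not handled here.
    """
    tree = {}
    for path, value in pairs:
        if not path:
            continue  # a value with no headers has nowhere to go
        node = tree
        for key in path[:-1]:
            node = node.setdefault(key, {})
        node[path[-1]] = value
    return tree


# Hand-written sample in the documented tuples shape (an assumption,
# not captured library output).
pairs = [
    (["Revenue", "Q1"], "100"),
    (["Revenue", "Q2"], "120"),
    (["Costs", "Q1"], "60"),
]

tree = nest_pairs(pairs)
print(tree["Revenue"]["Q2"])
```

This keeps the header hierarchy available for direct lookups such as tree["Revenue"]["Q2"] while the flat formats remain better suited to embeddings.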

Excel Files

from untabulate import untabulate_xlsx

results = untabulate_xlsx("financial_report.xlsx", format="strings")
for line in results:
    print(line)

Command Line

Install with CLI support:

pip install "untabulate[cli]"

Then use from the command line:

# Fetch and process a URL
untabulate html https://example.com/report.html

# Process a local HTML file
untabulate html ./report.html

# Target a specific table by ID
untabulate html page.html --id quarterly-results

# Process Excel files
untabulate xlsx data.xlsx --sheet "Q1 Results"

# Different output formats
untabulate html report.html --format json   # Default: structured JSON
untabulate html report.html --format text   # One line per value
untabulate html report.html --format jsonl  # JSON Lines (for streaming)
untabulate html report.html --format csv    # CSV format

# Read from stdin
curl https://example.com | untabulate -

# Custom separator
untabulate html report.html --format text --separator " | "

Custom Separator

untabulate_html(html, format="strings", separator=" | ")
# → ["Revenue | Q1: 100"]

Working with Any Data Source

Use untabulate() with any data source - dicts, tuples, or objects:

from untabulate import untabulate

# From database rows or API responses
data = [
    {"is_header": True, "row": 1, "col": 2, "value": "Q1"},
    {"is_header": True, "row": 2, "col": 1, "value": "Revenue"},
    {"is_header": False, "row": 2, "col": 2, "value": "100"},
]

results = untabulate(data, format="strings")
# → ["Revenue → Q1: 100"]
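
Many sources deliver a plain list of rows rather than records with positions. A rough sketch of converting such a grid into the dict records used above follows. grid_to_records, header_rows, and header_cols are hypothetical names introduced for illustration, and deciding headers by position alone is an assumption that will not suit every table.

```python
def grid_to_records(grid, header_rows=1, header_cols=1):
    """Turn a list of rows into records with 1-based row/col positions.

    Cells in the first `header_rows` rows or first `header_cols`
    columns are marked as headers. Empty cells are skipped.
    """
    records = []
    for r, row in enumerate(grid, start=1):
        for c, value in enumerate(row, start=1):
            if value is None or value == "":
                continue
            records.append(
                {
                    "is_header": r <= header_rows or c <= header_cols,
                    "row": r,
                    "col": c,
                    "value": str(value),
                }
            )
    return records


grid = [
    ["", "Q1", "Q2"],
    ["Revenue", 100, 120],
    ["Costs", 60, 70],
]

records = grid_to_records(grid)
for record in records:
    print(record)
```

The resulting list has the same keys as the data example above, so it should be usable as input to untabulate(records, format="strings"), though that call is not exercised here.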

How It Works: Semantic Header Projection Algorithm

The ProjectionGrid uses a simple but effective scoping rule:

  1. Row headers (left of data) apply to the rows they span (via rowspan)
  2. Column headers (above data) apply to the columns they span (via colspan)

This captures hierarchical and merged header relationships naturally:

Row 2: "Revenue" (rowspan=3, col 1)      → applies to rows 2, 3, 4
Row 2: "North America" (rowspan=1, col 2) → applies to row 2 only
Row 3: "Europe" (rowspan=1, col 2)        → applies to row 3 only

When you query get_path(row=3, col=3), you get all headers that govern that cell: ["Revenue", "Europe", "Q1"]
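
To make the scoping rule concrete, here is a naive pure-Python sketch of it. This is not the library's ProjectionGrid and makes no attempt at its O(n) behavior: Cell and sketch_get_path are hypothetical names, and the ordering of the path (row headers left to right, then column headers top to bottom) is an assumption chosen to match the example above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cell:
    is_header: bool
    row: int  # 1-based
    col: int  # 1-based
    value: str
    rowspan: int = 1
    colspan: int = 1


def sketch_get_path(cells, data_row, data_col):
    """Collect the headers whose span covers the given data cell."""
    headers = [c for c in cells if c.is_header and c.value]

    # Row headers: left of the cell, and their rowspan covers its row.
    row_headers = sorted(
        (
            h
            for h in headers
            if h.col < data_col and h.row <= data_row < h.row + h.rowspan
        ),
        key=lambda h: h.col,
    )

    # Column headers: above the cell, and their colspan covers its column.
    col_headers = sorted(
        (
            h
            for h in headers
            if h.row < data_row and h.col <= data_col < h.col + h.colspan
        ),
        key=lambda h: h.row,
    )

    return [h.value for h in row_headers + col_headers]


cells = [
    Cell(True, 1, 3, "Q1"),
    Cell(True, 1, 4, "Q2"),
    Cell(True, 2, 1, "Revenue", rowspan=3),
    Cell(True, 2, 2, "North America"),
    Cell(True, 3, 2, "Europe"),
    Cell(False, 2, 3, "40"),
    Cell(False, 3, 3, "60"),
]

print(sketch_get_path(cells, 3, 3))  # ['Revenue', 'Europe', 'Q1']
```

Each lookup here scans every header, which is fine for illustration but is exactly the kind of repeated work a real implementation would avoid.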

API Reference

High-Level Functions

untabulate_html(html, *, format="dict", separator=" → ", span_as_label=False, all_tables=False)

Parse HTML and extract data with semantic paths in one step.

  • html: HTML string containing table(s)
  • format: "dict", "strings", or "tuples"
  • separator: Path separator for context strings
  • span_as_label: Treat cells with rowspan/colspan > 1 as headers
  • all_tables: Parse all tables (returns list of lists)
  • Returns: List of results in the specified format
  • Raises: TableNotFoundError if no table found

untabulate_xlsx(filepath, *, sheet_name=None, format="dict", separator=" → ")

Parse Excel and extract data with semantic paths in one step.

  • filepath: Path to .xlsx file
  • sheet_name: Worksheet name (default: active sheet)
  • format: "dict", "strings", or "tuples"
  • separator: Path separator for context strings
  • Returns: List of results in the specified format

untabulate(data, *, format="dict", separator=" → ")

Extract semantic paths from any data source.

  • data: List of dicts, tuples, objects, or GridElement instances
  • format: "dict", "strings", or "tuples"
  • separator: Path separator for context strings
  • Returns: List of results in the specified format

Low-Level API

For advanced use cases, you can use the lower-level components directly:

parse_html_table(html_string, span_as_label=False, all_tables=False)

Parse HTML table(s) into GridElement instances.

parse_xlsx_worksheet(filepath, sheet_name=None)

Parse an Excel worksheet into GridElement instances.

ProjectionGrid(elements)

Build a semantic header projection from elements.

ProjectionGrid.get_path(data_row, data_col)

Get headers governing a cell position.

GridElement(is_header, row, col, rowspan, colspan, value)

Lightweight element for table cells.

  • is_header: True if this cell is a header, False for data cells
  • row/col: 1-based position
  • rowspan/colspan: Cell span
  • value: Text content of the cell

Performance

~1M cells/second on typical hardware. The Cython implementation provides ~30% speedup over pure Python, but the main win is the O(n) algorithm vs O(n²) naive approaches.

Why Untabulate Helps with LLM Embeddings and RAG

Embedding models need semantic context, not coordinates. When chunking documents for retrieval-augmented generation:

  ❌ "40" - meaningless without context
  ❌ "cell (3, 3): 40" - coordinates don't help similarity search
  ✅ "Revenue → North America → Q1: 40" - full semantic path with headers

This enables:

  • Better vector similarity for table-based questions
  • Accurate retrieval of specific data points from tables
  • Natural language grounding for structured and tabular data
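
As an illustration of how results might feed a retrieval index, the sketch below wraps each item in a small document record. It assumes results in the "dict" format described under Output Formats. The sample list is hand-written in that shape, and to_documents plus the id and metadata layout are hypothetical choices, not anything the library or a particular vector store prescribes.

```python
def to_documents(results, source):
    """Wrap dict-format results as text-plus-metadata records."""
    documents = []
    for index, item in enumerate(results):
        documents.append(
            {
                "id": f"{source}#{index}",
                "text": item["context"],  # the string to embed
                "metadata": {
                    "path": item["path"],
                    "value": item["value"],
                    "source": source,
                },
            }
        )
    return documents


# Hand-written sample in the documented dict shape (an assumption,
# not captured library output).
results = [
    {"path": ["Revenue", "Q1"], "value": "100", "context": "Revenue → Q1: 100"},
    {"path": ["Revenue", "Q2"], "value": "120", "context": "Revenue → Q2: 120"},
]

docs = to_documents(results, source="report.html")
for doc in docs:
    print(doc["id"], doc["text"])
```

Keeping the path and raw value in metadata lets a retriever filter on headers or return the bare number after matching on the full context string.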

Development

# Clone and install in development mode
git clone https://github.com/patrickcd/untabulate.git
cd untabulate
pip install -e ".[dev]"

# Run tests
pytest

# Build distribution
python -m build

License

MIT License - see LICENSE for details.

Download files

Source Distribution

  • untabulate-0.3.0.tar.gz (263.1 kB): Source

Built Distributions

  • untabulate-0.3.0-cp313-cp313-win_amd64.whl (349.2 kB): CPython 3.13, Windows x86-64
  • untabulate-0.3.0-cp313-cp313-win32.whl (340.2 kB): CPython 3.13, Windows x86
  • untabulate-0.3.0-cp313-cp313-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl (892.9 kB): CPython 3.13, manylinux glibc 2.17+ x86-64, manylinux glibc 2.5+ x86-64
  • untabulate-0.3.0-cp313-cp313-manylinux_2_5_i686.manylinux1_i686.manylinux_2_17_i686.manylinux2014_i686.whl (862.3 kB): CPython 3.13, manylinux glibc 2.17+ i686, manylinux glibc 2.5+ i686
  • untabulate-0.3.0-cp313-cp313-macosx_11_0_arm64.whl (357.4 kB): CPython 3.13, macOS 11.0+ ARM64
  • untabulate-0.3.0-cp312-cp312-win_amd64.whl (350.0 kB): CPython 3.12, Windows x86-64
  • untabulate-0.3.0-cp312-cp312-win32.whl (340.6 kB): CPython 3.12, Windows x86
  • untabulate-0.3.0-cp312-cp312-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl (899.3 kB): CPython 3.12, manylinux glibc 2.17+ x86-64, manylinux glibc 2.5+ x86-64
  • untabulate-0.3.0-cp312-cp312-manylinux_2_5_i686.manylinux1_i686.manylinux_2_17_i686.manylinux2014_i686.whl (871.5 kB): CPython 3.12, manylinux glibc 2.17+ i686, manylinux glibc 2.5+ i686
  • untabulate-0.3.0-cp312-cp312-macosx_11_0_arm64.whl (358.8 kB): CPython 3.12, macOS 11.0+ ARM64

