Untabulate grid data for LLM-friendly embeddings or similar analysis

Untabulate

PyPI version Python 3.12+ License: MIT

A Cython-accelerated library for associating tabular data points with their governing row and column headers. While it includes helpers for HTML and Excel, the core logic is source-agnostic, making it ideal for LLM embeddings and RAG pipelines where semantic context is crucial.

"Improving LLM accuracy since 2036"

Installation

pip install untabulate

To include HTML parsing support:

pip install "untabulate[lxml]"

To include Excel parsing support:

pip install "untabulate[openpyxl]"

To include both:

pip install "untabulate[lxml,openpyxl]"

The Problem

When you extract data from a table like this:

                Q1    Q2
Revenue         100   120
North America   40    50
Europe          60    70

Traditional parsers give you value=40 at position (3, 2). But for LLM embeddings, you need:

Revenue → North America → Q1: 40

Quick Start

from untabulate import untabulate_html

html = """
<table>
    <tr><th></th><th>Q1</th><th>Q2</th></tr>
    <tr><th>Revenue</th><td>100</td><td>120</td></tr>
    <tr><th>Costs</th><td>60</td><td>70</td></tr>
</table>
"""

# Get all data with semantic context in one call
for item in untabulate_html(html, format="strings"):
    print(item)

# Output:
# Revenue → Q1: 100
# Revenue → Q2: 120
# Costs → Q1: 60
# Costs → Q2: 70

Output Formats

Choose the format that fits your use case:

from untabulate import untabulate_html

html = "<table><tr><th></th><th>Q1</th></tr><tr><th>Revenue</th><td>100</td></tr></table>"

# Strings - ready for embeddings
untabulate_html(html, format="strings")
# → ["Revenue → Q1: 100"]

# Dicts - structured data with metadata
untabulate_html(html, format="dict")
# → [{"path": ["Revenue", "Q1"], "value": "100", "context": "Revenue → Q1: 100"}]

# Tuples - lightweight path/value pairs
untabulate_html(html, format="tuples")
# → [(["Revenue", "Q1"], "100")]
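To make the relationship between the three formats concrete, here is a minimal pure-Python sketch that renders one (path, value) pair in each shape. The render helper is hypothetical and not part of untabulate; it only mirrors the outputs shown above and does not describe how the library builds them.

```python
def render(path, value, fmt="dict", separator=" → "):
    """Render one (path, value) pair in one of the three shapes (illustrative helper)."""
    # The context string joins the header path, then appends the value.
    context = f"{separator.join(path)}: {value}"
    if fmt == "strings":
        return context
    if fmt == "tuples":
        return (list(path), value)
    if fmt == "dict":
        return {"path": list(path), "value": value, "context": context}
    raise ValueError(f"unknown format: {fmt!r}")


print(render(["Revenue", "Q1"], "100", fmt="strings"))
# Revenue → Q1: 100
```

The "dict" shape carries the other two inside it: the path and value from "tuples", and the context string from "strings".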

Excel Files

from untabulate import untabulate_xlsx

results = untabulate_xlsx("financial_report.xlsx", format="strings")
for line in results:
    print(line)

Command Line

Install with CLI support:

pip install "untabulate[cli]"

Then use from the command line:

# Fetch and process a URL
untabulate html https://example.com/report.html

# Process a local HTML file
untabulate html ./report.html

# Target a specific table by ID
untabulate html page.html --id quarterly-results

# Process Excel files
untabulate xlsx data.xlsx --sheet "Q1 Results"

# Different output formats
untabulate html report.html --format json   # Default: structured JSON
untabulate html report.html --format text   # One line per value
untabulate html report.html --format jsonl  # JSON Lines (for streaming)
untabulate html report.html --format csv    # CSV format

# Read from stdin
curl https://example.com | untabulate -

# Custom separator
untabulate html report.html --format text --separator " | "

Custom Separator

untabulate_html(html, format="strings", separator=" | ")
# → ["Revenue | Q1: 100"]

Working with Custom Data Sources

Use untabulate() with any data source - dicts, tuples, or objects:

from untabulate import untabulate

# From database rows or API responses
data = [
    {"is_header": True, "row": 1, "col": 2, "value": "Q1"},
    {"is_header": True, "row": 2, "col": 1, "value": "Revenue"},
    {"is_header": False, "row": 2, "col": 2, "value": "100"},
]

results = untabulate(data, format="strings")
# → ["Revenue → Q1: 100"]
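Records in this shape can be produced from a plain list of rows, for example the output of a CSV reader. The sketch below is one way to do it; grid_to_records, header_rows, and header_cols are hypothetical names, and it assumes that headers occupy the first row and first column and that empty cells should be skipped.

```python
def grid_to_records(rows, header_rows=1, header_cols=1):
    """Turn a list-of-lists grid into cell records with 1-based positions.

    Assumption: the first `header_rows` rows and the first `header_cols`
    columns hold headers; empty cells are skipped.
    """
    records = []
    for r, row in enumerate(rows, start=1):
        for c, value in enumerate(row, start=1):
            if value is None or value == "":
                continue
            records.append({
                "is_header": r <= header_rows or c <= header_cols,
                "row": r,
                "col": c,
                "value": str(value),
            })
    return records


grid = [
    ["", "Q1"],
    ["Revenue", 100],
]
records = grid_to_records(grid)
```

The resulting list has the same keys and values as the data example above, so it should be usable wherever that list is.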

Algorithm: Semantic Header Scoping

The ProjectionGrid uses a simple but effective scoping rule:

  1. Row headers (column 1) propagate downward to all rows below them
  2. Column headers apply to the columns they span

This captures hierarchical relationships naturally. For the table in The Problem, where row 1 holds the column headers:

Row 2: "Revenue" in col 1       → applies to rows 2, 3, 4...
Row 3: "North America" in col 1 → applies to rows 3, 4...
Row 4: "Europe" in col 1        → applies to rows 4...

When you query get_path(row=3, col=2), you get all headers that govern that cell: ["Revenue", "North America", "Q1"]
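The scoping rule can be sketched in a few lines of plain Python. This is an illustration of the rule as stated, not the ProjectionGrid implementation: scoped_path is a hypothetical standalone function, cells are plain (is_header, row, col, value) tuples, spans are ignored, and the column headers are assumed to sit in row 1.

```python
def scoped_path(elements, row, col):
    """Collect the headers governing (row, col) under the simplified rule.

    elements: iterable of (is_header, row, col, value) tuples, 1-based.
    Sketch only: every cell is assumed to span one row and one column.
    """
    # Row headers live in column 1 and govern their own row and all rows below.
    row_headers = [
        (r, value) for is_header, r, c, value in elements
        if is_header and c == 1 and r <= row
    ]
    # Column headers sit above the cell, in the same column.
    col_headers = [
        (r, value) for is_header, r, c, value in elements
        if is_header and c == col and c != 1 and r < row
    ]
    # Outermost first: row headers top to bottom, then column headers.
    return [v for _, v in sorted(row_headers)] + [v for _, v in sorted(col_headers)]


table = [
    (True, 1, 2, "Q1"), (True, 1, 3, "Q2"),
    (True, 2, 1, "Revenue"), (False, 2, 2, "100"), (False, 2, 3, "120"),
    (True, 3, 1, "North America"), (False, 3, 2, "40"), (False, 3, 3, "50"),
    (True, 4, 1, "Europe"), (False, 4, 2, "60"), (False, 4, 3, "70"),
]
print(scoped_path(table, row=3, col=2))
# ['Revenue', 'North America', 'Q1']
```

This sketch rescans every element on each query and never decides when a header's scope ends, so it is only meant to show the rule, not to reproduce the library's output on arbitrary tables.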

API Reference

High-Level Functions

untabulate_html(html, *, format="dict", separator=" → ", span_as_label=False, all_tables=False)

Parse HTML and extract data with semantic paths in one step.

  • html: HTML string containing table(s)
  • format: "dict", "strings", or "tuples"
  • separator: Path separator for context strings
  • span_as_label: Treat cells with rowspan/colspan > 1 as headers
  • all_tables: Parse all tables (returns list of lists)
  • Returns: List of results in the specified format
  • Raises: TableNotFoundError if no table found

untabulate_xlsx(filepath, *, sheet_name=None, format="dict", separator=" → ")

Parse Excel and extract data with semantic paths in one step.

  • filepath: Path to .xlsx file
  • sheet_name: Worksheet name (default: active sheet)
  • format: "dict", "strings", or "tuples"
  • separator: Path separator for context strings
  • Returns: List of results in the specified format

untabulate(data, *, format="dict", separator=" → ")

Extract semantic paths from any data source.

  • data: List of dicts, tuples, objects, or GridElement instances
  • format: "dict", "strings", or "tuples"
  • separator: Path separator for context strings
  • Returns: List of results in the specified format

Low-Level API

For advanced use cases, you can use the lower-level components directly:

parse_html_table(html_string, span_as_label=False, all_tables=False)

Parse HTML table(s) into GridElement instances.

parse_xlsx_worksheet(filepath, sheet_name=None)

Parse an Excel worksheet into GridElement instances.

ProjectionGrid(elements)

Build a semantic header projection from elements.

ProjectionGrid.get_path(data_row, data_col)

Get headers governing a cell position.

GridElement(is_header, row, col, rowspan, colspan, value)

Lightweight element for table cells.

  • is_header: True if this cell is a header, False for data cells
  • row/col: 1-based position
  • rowspan/colspan: Cell span
  • value: Text content of the cell
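To illustrate how the position and span fields fit together, the sketch below uses a stand-in record with the same six fields and lists the grid positions a spanning cell covers. Cell and covered_positions are hypothetical names used only for this example; they are not the library's GridElement, and the sample values are made up.

```python
from typing import NamedTuple


class Cell(NamedTuple):
    """Stand-in with the same fields as GridElement (not the library class)."""
    is_header: bool
    row: int
    col: int
    rowspan: int
    colspan: int
    value: str


def covered_positions(cell):
    """Return every 1-based (row, col) position that the cell's span covers."""
    return [
        (r, c)
        for r in range(cell.row, cell.row + cell.rowspan)
        for c in range(cell.col, cell.col + cell.colspan)
    ]


# A header in row 1 that starts at column 2 and spans two columns.
header = Cell(is_header=True, row=1, col=2, rowspan=1, colspan=2, value="2024")
print(covered_positions(header))
# [(1, 2), (1, 3)]
```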

Performance

Throughput is roughly 1M cells/second on typical hardware. The Cython implementation is about 30% faster than pure Python, but the main gain comes from the O(n) algorithm, compared with the O(n²) cost of naive approaches.
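The difference between the two complexity classes can be seen in a small sketch. A naive approach rescans all headers for every data cell. A single pass can instead keep running header state, so each cell costs work proportional to its path length. The code below is an assumption about how such a pass might look, not the library's Cython code: project is a hypothetical name, cells are assumed to arrive in row-major order, and spans are ignored.

```python
def project(elements):
    """Yield (path, value) for each data cell in a single pass.

    elements: (is_header, row, col, value) tuples in row-major order, 1-based.
    Sketch only: spans are ignored and header scopes never end.
    """
    row_stack = []      # row headers seen so far, top to bottom
    col_headers = {}    # column index -> column headers seen above it
    for is_header, row, col, value in elements:
        if is_header and col == 1:
            row_stack.append(value)
        elif is_header:
            col_headers.setdefault(col, []).append(value)
        else:
            # No rescan of earlier cells: the state already holds the path.
            yield row_stack + col_headers.get(col, []), value


cells = [
    (True, 1, 2, "Q1"), (True, 1, 3, "Q2"),
    (True, 2, 1, "Revenue"), (False, 2, 2, "100"), (False, 2, 3, "120"),
    (True, 3, 1, "North America"), (False, 3, 2, "40"), (False, 3, 3, "50"),
]
for path, value in project(cells):
    print(" → ".join(path) + ": " + value)
# Revenue → Q1: 100
# Revenue → Q2: 120
# Revenue → North America → Q1: 40
# Revenue → North America → Q2: 50
```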

Why This Matters for LLMs

Embedding models need semantic context, not coordinates. When chunking documents for RAG:

"40" - meaningless without context
"cell (3,2): 40" - coordinates don't help
"Revenue → North America → Q1: 40" - full semantic path

This enables:

  • Better vector similarity for table-based questions
  • Accurate retrieval of specific data points
  • Natural language grounding for structured data
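As a rough illustration of the chunking step, the sketch below packs context strings into newline-joined chunks under a character budget before they are sent to an embedding model. chunk_lines and max_chars are hypothetical names, and the greedy strategy is only one simple option; real pipelines often budget by tokens instead.

```python
def chunk_lines(lines, max_chars=200):
    """Greedily pack context strings into newline-joined chunks of at most max_chars.

    A single line longer than max_chars becomes its own chunk.
    """
    chunks, current, size = [], [], 0
    for line in lines:
        extra = len(line) + (1 if current else 0)  # +1 for the joining newline
        if current and size + extra > max_chars:
            # Budget exceeded: close the current chunk and start a new one.
            chunks.append("\n".join(current))
            current, size = [], 0
            extra = len(line)
        current.append(line)
        size += extra
    if current:
        chunks.append("\n".join(current))
    return chunks


lines = ["Revenue → Q1: 100", "Revenue → Q2: 120", "Costs → Q1: 60"]
print(chunk_lines(lines, max_chars=40))
# ['Revenue → Q1: 100\nRevenue → Q2: 120', 'Costs → Q1: 60']
```

Because every line already carries its full header path, a chunk stays interpretable even when it is split away from the rest of the table.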

Development

# Clone and install in development mode
git clone https://github.com/patrick/untabulate.git
cd untabulate
pip install -e ".[dev]"

# Run tests
pytest

# Build distribution
python -m build

License

MIT License - see LICENSE for details.

Download files

Download the file for your platform.

Source Distribution

untabulate-0.2.3.tar.gz (260.6 kB) - Source

Built Distributions

untabulate-0.2.3-cp313-cp313-win_amd64.whl (353.4 kB) - CPython 3.13, Windows x86-64
untabulate-0.2.3-cp313-cp313-win32.whl (343.7 kB) - CPython 3.13, Windows x86
untabulate-0.2.3-cp313-cp313-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl (895.1 kB) - CPython 3.13, manylinux: glibc 2.17+ x86-64, manylinux: glibc 2.5+ x86-64
untabulate-0.2.3-cp313-cp313-manylinux_2_5_i686.manylinux1_i686.manylinux_2_17_i686.manylinux2014_i686.whl (864.6 kB) - CPython 3.13, manylinux: glibc 2.17+ i686, manylinux: glibc 2.5+ i686
untabulate-0.2.3-cp313-cp313-macosx_11_0_arm64.whl (360.1 kB) - CPython 3.13, macOS 11.0+ ARM64
untabulate-0.2.3-cp312-cp312-win_amd64.whl (354.2 kB) - CPython 3.12, Windows x86-64
untabulate-0.2.3-cp312-cp312-win32.whl (344.2 kB) - CPython 3.12, Windows x86
untabulate-0.2.3-cp312-cp312-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl (901.1 kB) - CPython 3.12, manylinux: glibc 2.17+ x86-64, manylinux: glibc 2.5+ x86-64
untabulate-0.2.3-cp312-cp312-manylinux_2_5_i686.manylinux1_i686.manylinux_2_17_i686.manylinux2014_i686.whl (873.3 kB) - CPython 3.12, manylinux: glibc 2.17+ i686, manylinux: glibc 2.5+ i686
untabulate-0.2.3-cp312-cp312-macosx_11_0_arm64.whl (361.4 kB) - CPython 3.12, macOS 11.0+ ARM64

File details

Details for the file untabulate-0.2.3.tar.gz.

File metadata

  • Download URL: untabulate-0.2.3.tar.gz
  • Size: 260.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for untabulate-0.2.3.tar.gz

Algorithm    Hash digest
SHA256       ffbd1a4e4e207edd8a3b9393dd2b0ae21f0141737cbe20503ace86f55f4932a0
MD5          110e38c5676533a2b2ba23a2bb2392bd
BLAKE2b-256  4e2f35020d17ffed494e6c257a95af4ca91010dec71930db1fb9c1758bc521e3

Provenance

The following attestation bundles were made for untabulate-0.2.3.tar.gz:

Publisher: publish.yml on patrickcd/untabulate

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
