Skip to main content

Turn messy real-world spreadsheets into clean, typed data — with an auditable report of every fix.

Project description

messy-table

pandas.read_excel assumes your spreadsheet is well-behaved. messy-table assumes it is not.

Turn messy real-world spreadsheets — Excel/CSV exported from ERPs, legacy systems, hand-made reports — into clean, typed data, and get back an auditable report of every fix that was applied.

from messy_table import clean

result = clean("relatorio_vendas.xlsx")

result.data        # list[dict] — clean, typed rows
result.columns     # per-column name / dtype / null summary
result.report      # every transformation that was applied
result.warnings    # low-confidence decisions
result.to_pandas() # DataFrame (optional extra)

Why

AI agents and data pipelines receive arbitrary spreadsheets from users and today re-implement, by hand and fragilely, the same heuristics: find where the table starts, fix the headers, convert Excel serial dates, interpret 1.234,56. messy-table is the canonical answer to that step — deterministic, dependency-light, and fully auditable.

Install

pip install messy-table            # core — depends only on openpyxl
pip install 'messy-table[pandas]'  # adds result.to_pandas()

Python ≥ 3.10. Ships py.typed — fully type-checked.

Before & after

A typical ERP export — a title banner, a blank row, pt-BR numbers, an N/A, and a trailing totals row:

Relatório de Vendas 2024
(gerado em 01/02/2024)

Produto      | Valor (R$) | Qtd | Ativo
Café         | 1.234,56   | 10  | sim
Chá          | 2.000,00   | 5   | nao
Açúcar       | -          | 3   | sim
TOTAL        | 3.234,56   | 18  |
>>> result = clean("vendas.xlsx")
>>> result.data
[{'produto': 'Café',   'valor_r': 1234.56, 'qtd': 10, 'ativo': True},
 {'produto': 'Chá',    'valor_r': 2000.0,  'qtd': 5,  'ativo': False},
 {'produto': 'Açúcar', 'valor_r': None,    'qtd': 3,  'ativo': True}]
>>> [(c.name, c.dtype) for c in result.columns]
[('produto', 'str'), ('valor_r', 'float'), ('qtd', 'int'), ('ativo', 'bool')]

The title, blank row and TOTAL line are gone; numbers are parsed in the column's inferred locale; - is a null; sim/nao became booleans — and the report says so.

Features (v0.1)

Feature What it does
Table-start detection Skips titles, logos, stray cells and metadata before the real header.
Header detection & normalisation Finds the header (or its absence); resolves duplicate/empty/multi-row headers; slugifies to snake_case.
Merged cells Propagates a merged value across its range (fill) or keeps it top-left only (first-only).
Excel serial dates Converts 45123 → a real date when the column has a date profile (both 1900 and 1904 epochs).
Localised numbers 1.234,56 (pt-BR/EU) vs 1,234.56 (en-US), inferred per column, never per cell.
Column type inference int/float/date/datetime/bool/str, with mixed-column handling.
Disguised nulls -, N/A, n/d, #REF!, #DIV/0!, blanks → None (extensible).
Trailing junk Removes totals, signatures and footnotes after the data ends.
Cleaning report Structured, JSON-serialisable record of every correction with location and confidence.
Input formats .xlsx, .csv (delimiter + encoding sniffing), .tsv.

The report

Nothing changes without a record. Per-cell fixes are aggregated per column with a count and sample locations, so the report stays small even on huge files:

>>> print(result.report.to_json())
{
  "summary": {"table_start_detected": 1, "table_end_trimmed": 1,
              "header_renamed": 1, "null_normalized": 1,
              "number_parsed": 4, "type_coerced": 2},
  "actions": [
    {"kind": "table_start_detected", "rule": "density", "confidence": 0.8,
     "detail": "skipped 3 leading row(s) (title/metadata/blank); table starts at row 3"},
    {"kind": "number_parsed", "rule": "locale:pt_BR", "column": "valor_r",
     "count": 4, "confidence": 1.0,
     "examples": [{"row": 0, "original": "1.234,56", "final": 1234.56}]},
    ...
  ]
}

Configuration

The 80% case needs no config. For the rest:

from messy_table import clean, Config

result = clean(
    "dados.csv",
    config=Config(
        locale="pt_BR",            # force number/date interpretation; default "auto"
        header="auto",             # "auto" | int (row index) | None (no header)
        sheet=0,                   # index or name of the worksheet
        merged_cells="fill",       # "fill" | "first-only"
        null_values_extra=["s/i"], # add to the built-in null list
        strict=False,              # True: raise AmbiguityError instead of warning
    ),
)

In strict=True, a low-confidence decision raises AmbiguityError — always with a copy-pasteable Config suggestion to resolve it.

Security

messy-table parses untrusted files, so it defends against it: .xlsx archives are checked for decompression-bomb shape (absolute size and ratio) before they are opened, and a hard cell ceiling bounds memory. See ARCHITECTURE.md.

Docs

License

MIT — see LICENSE.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

messy_table-0.1.0.tar.gz (45.2 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

messy_table-0.1.0-py3-none-any.whl (40.6 kB view details)

Uploaded Python 3

File details

Details for the file messy_table-0.1.0.tar.gz.

File metadata

  • Download URL: messy_table-0.1.0.tar.gz
  • Upload date:
  • Size: 45.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for messy_table-0.1.0.tar.gz
Algorithm Hash digest
SHA256 e8b2182834c5dd1578cc7ed40c577b97195037a59194dd771facafa05e74d3e6
MD5 683fa7c0d856f8ccdcbae9f1175928ca
BLAKE2b-256 47ca934fef27a0ffeb3b75781b0c7c68c51ef1a436770267138dce0a2c1d855e

See more details on using hashes here.

Provenance

The following attestation bundles were made for messy_table-0.1.0.tar.gz:

Publisher: release.yml on victorhurchella/messy-table

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file messy_table-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: messy_table-0.1.0-py3-none-any.whl
  • Upload date:
  • Size: 40.6 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for messy_table-0.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 1c6b44c1bb02a09cbc68832184ca0a1a6ca595bc073ab4cc35ee2e37a94d06c6
MD5 5031df5dce77094b9d94375d0d758e0e
BLAKE2b-256 a8bc5fa1a5a9e9862c2d41d2c977e1b2c119d6a6b1e1e4bb60013413c2ea8fa9

See more details on using hashes here.

Provenance

The following attestation bundles were made for messy_table-0.1.0-py3-none-any.whl:

Publisher: release.yml on victorhurchella/messy-table

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page