A Python library for parsing messy Excel files with intelligent structure detection and formula evaluation

These details have not been verified by PyPI

Project links

Project description

messy-xlsx

Python library for parsing Excel files with structure detection and normalization.

Installation

pip install messy-xlsx

# With formula evaluation
pip install messy-xlsx[formulas]

# With XLS support
pip install messy-xlsx[xls]

Quick Start

import messy_xlsx
from messy_xlsx import MessyWorkbook, SheetConfig
import io

# Basic usage
df = messy_xlsx.read_excel("data.xlsx")

# Multi-sheet workbook
wb = MessyWorkbook("data.xlsx")
df = wb.to_dataframe(sheet="Sheet1")
dfs = wb.to_dataframes()  # All sheets

# Read from BytesIO (cloud storage, S3, etc.)
with open("data.xlsx", "rb") as f:
    content = f.read()
buffer = io.BytesIO(content)
wb = MessyWorkbook(buffer, filename="data.xlsx")
df = wb.to_dataframe()

# Structure analysis
structure = wb.get_structure()
print(f"Header row: {structure.header_row}")
print(f"Tables detected: {structure.num_tables}")
print(f"Locale: {structure.detected_locale}")

Features

Structure Detection

Auto-detects header rows with confidence scoring
Identifies multiple tables per sheet
Detects merged cells, hidden rows/columns
Locale detection (US vs European number formats)

Format Support

XLSX/XLSM (Office Open XML)
XLS (legacy Excel)
CSV/TSV with delimiter detection

Normalization Pipeline

Locale-aware number parsing (1,234.56 vs 1.234,56)
Date normalization from multiple formats
Whitespace cleaning
Missing value standardization (NA, null, -, etc.)
Semantic type inference

Formula Evaluation (optional)

Integrates formulas/xlcalculator libraries
Fallback to cached values
Configurable evaluation modes

Configuration

from messy_xlsx import MessyWorkbook, SheetConfig

config = SheetConfig(
    skip_rows        = 2,
    header_rows      = 1,
    auto_detect      = True,
    merge_strategy   = "fill",
    locale           = "auto",
    evaluate_formulas = True,

    # Header detection (auto-enabled by default)
    header_detection_mode = "smart",  # "auto" | "manual" | "smart"
    header_confidence_threshold = 0.7,
    header_patterns  = [r".*name.*", r".*date.*"],  # Boost confidence
)

wb = MessyWorkbook("data.xlsx", sheet_config=config)
df = wb.to_dataframe()

Header Detection Modes

smart (default) - Uses detection unless user explicitly overrides

config = SheetConfig(auto_detect=True)  # Headers detected automatically

auto - Always trust detection if confidence >= threshold

config = SheetConfig(
    header_detection_mode="auto",
    header_confidence_threshold=0.8  # Only use if 80%+ confident
)

manual - Ignore detection, use explicit values

config = SheetConfig(
    skip_rows=5,
    header_rows=2,
    header_detection_mode="manual"
)

API Reference

MessyWorkbook

# From file path
wb = MessyWorkbook("data.xlsx", sheet_config=None, formula_config=None)

# From file-like object (BytesIO, S3 stream, etc.)
wb = MessyWorkbook(buffer, sheet_config=None, filename="data.xlsx")

# Properties and methods
wb.sheet_names                    # List of sheet names
wb.file_path                      # Path or None if from buffer
wb.source                         # The file path or buffer
wb.get_sheet(name)                # Get MessySheet object
wb.to_dataframe(sheet=None)       # Convert sheet to DataFrame
wb.to_dataframes()                # Convert all sheets
wb.get_structure(sheet=None)      # Get StructureInfo
wb.get_cell_by_ref("Sheet1!A1")   # Get cell with formula eval

SheetConfig

SheetConfig(
    skip_rows        = 0,
    header_rows      = 1,
    skip_footer      = 0,
    cell_range       = None,           # "A1:F100"
    auto_detect      = True,
    include_hidden   = False,
    merge_strategy   = "fill",         # "fill", "skip", "first_only"
    locale           = "auto",         # "en_US", "de_DE", "auto"
    evaluate_formulas = True,

    # Header detection
    header_detection_mode = "smart",   # "auto", "manual", "smart"
    header_confidence_threshold = 0.7, # 0.0-1.0
    header_fallback  = "first_row",    # "first_row", "none", "error"
    header_patterns  = None,           # List of regex patterns

    # Normalization controls
    normalize        = True,           # Master switch for all normalization
    normalize_dates  = True,           # Convert date columns to datetime
    normalize_numbers = True,          # Parse number strings to numeric
    normalize_whitespace = True,       # Clean whitespace in text columns
)

Disable Normalization

To get raw data without type conversion (useful for ETL pipelines where the destination handles schema):

config = SheetConfig(normalize=False)
wb = MessyWorkbook("data.xlsx", sheet_config=config)
df = wb.to_dataframe()  # All columns as object dtype

Or disable specific normalizations:

config = SheetConfig(
    normalize_dates=False,   # Keep dates as strings
    normalize_numbers=False, # Keep numbers as strings
)

StructureInfo

structure = wb.get_structure()

structure.header_row              # Detected header row index
structure.header_confidence       # 0.0-1.0 confidence score
structure.num_tables             # Number of tables detected
structure.detected_locale        # "en_US" or "de_DE"
structure.merged_ranges          # List of merged cell ranges
structure.has_formulas           # Boolean

Architecture

messy_xlsx/
├── detection/
│   ├── format_detector.py       # Binary signatures, ZIP analysis
│   ├── structure_analyzer.py    # Headers, tables, merged cells
│   └── locale_detector.py       # Number format detection
├── parsing/
│   ├── xlsx_handler.py          # XLSX/XLSM
│   ├── xls_handler.py           # XLS
│   ├── csv_handler.py           # CSV/TSV with dialect detection
│   └── handler_registry.py      # Format routing
├── normalization/
│   ├── pipeline.py              # Orchestration
│   ├── numbers.py               # Locale-aware number parsing
│   ├── dates.py                 # Date normalization
│   ├── whitespace.py            # Whitespace cleaning
│   ├── missing_values.py        # NA standardization
│   └── type_inference.py        # Semantic type detection
├── formulas/
│   ├── config.py                # Configuration
│   └── engine.py                # External library integration
├── workbook.py                  # MessyWorkbook class
├── sheet.py                     # MessySheet class
└── models.py                    # Data structures

Testing

Tested on 33 real-world Excel files including:

100,000 row spreadsheets (5MB)
Multi-table sheets
European number formats
Formula-heavy workbooks
Merged cells and complex layouts

100% success rate on test suite.

Dependencies

Required:

Python >= 3.10
openpyxl >= 3.1
pandas >= 2.0
numpy >= 1.24

Optional:

formulas >= 1.2 (formula evaluation)
xlcalculator >= 0.4 (lightweight formula eval)
xlrd >= 2.0 (XLS support)

License

MIT

Project details

These details have not been verified by PyPI

Project links

Release history Release notifications | RSS feed

0.9.0

Apr 5, 2026

0.8.0

Apr 4, 2026

0.7.2

Jan 26, 2026

0.7.1

Jan 25, 2026

0.7.0

Jan 25, 2026

0.6.0

Jan 16, 2026

0.5.0

Jan 14, 2026

This version

0.4.1

Jan 5, 2026

0.4.0

Jan 4, 2026

0.3.0

Jan 4, 2026

0.2.0

Jan 4, 2026

0.1.0

Jan 4, 2026

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

messy_xlsx-0.4.1.tar.gz (26.7 MB view details)

Uploaded Jan 5, 2026 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

messy_xlsx-0.4.1-py3-none-any.whl (45.1 kB view details)

Uploaded Jan 5, 2026 Python 3

File details

Details for the file messy_xlsx-0.4.1.tar.gz.

File metadata

Download URL: messy_xlsx-0.4.1.tar.gz
Upload date: Jan 5, 2026
Size: 26.7 MB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.14.2

File hashes

Hashes for messy_xlsx-0.4.1.tar.gz
Algorithm	Hash digest
SHA256	`553aae301518400dcdd0cfc48017d7ed6f42b775c87a01e2b1b8833bc612a532`
MD5	`130bf8f1f14aa2c3ba61567e6e04c7df`
BLAKE2b-256	`9bbd73a2b053332aeb9c3452e9a976b0b6986d3e2e7976ef45f301ae9264e4a1`

See more details on using hashes here.

File details

Details for the file messy_xlsx-0.4.1-py3-none-any.whl.

File metadata

Download URL: messy_xlsx-0.4.1-py3-none-any.whl
Upload date: Jan 5, 2026
Size: 45.1 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.14.2

File hashes

Hashes for messy_xlsx-0.4.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`af07c5cd71dc2b4141e4619c8e26ee33e355ff13f1114e6702e4d0023f9a58d1`
MD5	`4a5c755be1f4b38984bfd3d312c6de5e`
BLAKE2b-256	`ef21afa345fdc7c4c09ad86434728d188d7ab576ef411c627ab2437ad627b8a8`

See more details on using hashes here.

messy-xlsx 0.4.1

Navigation

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Project description

messy-xlsx

Installation

Quick Start

Features

Configuration

Header Detection Modes

API Reference

MessyWorkbook

SheetConfig

Disable Normalization

StructureInfo

Architecture

Testing

Dependencies

License

Project details

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes