Skip to main content

A Python library for parsing messy Excel files with intelligent structure detection and formula evaluation

Project description

messy-xlsx

Python library for parsing Excel files with structure detection and normalization.

Installation

pip install messy-xlsx

# With formula evaluation
pip install messy-xlsx[formulas]

# With XLS support
pip install messy-xlsx[xls]

Quick Start

import messy_xlsx

# Basic usage
df = messy_xlsx.read_excel("data.xlsx")

# Multi-sheet workbook
wb = messy_xlsx.MessyWorkbook("data.xlsx")
df = wb.to_dataframe(sheet="Sheet1")
dfs = wb.to_dataframes()  # All sheets

# Structure analysis
structure = wb.get_structure()
print(f"Header row: {structure.header_row}")
print(f"Tables detected: {structure.num_tables}")
print(f"Locale: {structure.detected_locale}")

Features

Structure Detection

  • Auto-detects header rows with confidence scoring
  • Identifies multiple tables per sheet
  • Detects merged cells, hidden rows/columns
  • Locale detection (US vs European number formats)

Format Support

  • XLSX/XLSM (Office Open XML)
  • XLS (legacy Excel)
  • CSV/TSV with delimiter detection

Normalization Pipeline

  • Locale-aware number parsing (1,234.56 vs 1.234,56)
  • Date normalization from multiple formats
  • Whitespace cleaning
  • Missing value standardization (NA, null, -, etc.)
  • Semantic type inference

Formula Evaluation (optional)

  • Integrates formulas/xlcalculator libraries
  • Fallback to cached values
  • Configurable evaluation modes

Configuration

from messy_xlsx import MessyWorkbook, SheetConfig

config = SheetConfig(
    skip_rows        = 2,
    header_rows      = 1,
    auto_detect      = True,
    merge_strategy   = "fill",
    locale           = "auto",
    evaluate_formulas = True,
)

wb = MessyWorkbook("data.xlsx", sheet_config=config)
df = wb.to_dataframe()

API Reference

MessyWorkbook

wb = MessyWorkbook(file_path, sheet_config=None, formula_config=None)

wb.sheet_names                    # List of sheet names
wb.get_sheet(name)                # Get MessySheet object
wb.to_dataframe(sheet=None)       # Convert sheet to DataFrame
wb.to_dataframes()                # Convert all sheets
wb.get_structure(sheet=None)      # Get StructureInfo
wb.get_cell_by_ref("Sheet1!A1")   # Get cell with formula eval

SheetConfig

SheetConfig(
    skip_rows        = 0,
    header_rows      = 1,
    skip_footer      = 0,
    cell_range       = None,           # "A1:F100"
    auto_detect      = True,
    include_hidden   = False,
    merge_strategy   = "fill",         # "fill", "skip", "first_only"
    locale           = "auto",         # "en_US", "de_DE", "auto"
    evaluate_formulas = True,
)

StructureInfo

structure = wb.get_structure()

structure.header_row              # Detected header row index
structure.header_confidence       # 0.0-1.0 confidence score
structure.num_tables             # Number of tables detected
structure.detected_locale        # "en_US" or "de_DE"
structure.merged_ranges          # List of merged cell ranges
structure.has_formulas           # Boolean

Architecture

messy_xlsx/
├── detection/
│   ├── format_detector.py       # Binary signatures, ZIP analysis
│   ├── structure_analyzer.py    # Headers, tables, merged cells
│   └── locale_detector.py       # Number format detection
├── parsing/
│   ├── xlsx_handler.py          # XLSX/XLSM
│   ├── xls_handler.py           # XLS
│   ├── csv_handler.py           # CSV/TSV with dialect detection
│   └── handler_registry.py      # Format routing
├── normalization/
│   ├── pipeline.py              # Orchestration
│   ├── numbers.py               # Locale-aware number parsing
│   ├── dates.py                 # Date normalization
│   ├── whitespace.py            # Whitespace cleaning
│   ├── missing_values.py        # NA standardization
│   └── type_inference.py        # Semantic type detection
├── formulas/
│   ├── config.py                # Configuration
│   └── engine.py                # External library integration
├── workbook.py                  # MessyWorkbook class
├── sheet.py                     # MessySheet class
└── models.py                    # Data structures

Testing

Tested on 33 real-world Excel files including:

  • 100,000 row spreadsheets (5MB)
  • Multi-table sheets
  • European number formats
  • Formula-heavy workbooks
  • Merged cells and complex layouts

100% success rate on test suite.

Dependencies

Required:

  • Python >= 3.10
  • openpyxl >= 3.1
  • pandas >= 2.0
  • numpy >= 1.24

Optional:

  • formulas >= 1.2 (formula evaluation)
  • xlcalculator >= 0.4 (lightweight formula eval)
  • xlrd >= 2.0 (XLS support)

License

MIT

Credits

Inspired by mcp-excel.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

messy_xlsx-0.1.0.tar.gz (8.8 MB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

messy_xlsx-0.1.0-py3-none-any.whl (40.0 kB view details)

Uploaded Python 3

File details

Details for the file messy_xlsx-0.1.0.tar.gz.

File metadata

  • Download URL: messy_xlsx-0.1.0.tar.gz
  • Upload date:
  • Size: 8.8 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.14.2

File hashes

Hashes for messy_xlsx-0.1.0.tar.gz
Algorithm Hash digest
SHA256 ebc082d1ea0deb2a3333cb95e2c3828318e6f4cffa1f59daaaf990e05f677ee6
MD5 070c86b5f8debab65205e76ac62c9d92
BLAKE2b-256 66b3d571b881b781324c41d52e90d8f2ab01379a0f465630a34a58ff18d2c44d

See more details on using hashes here.

File details

Details for the file messy_xlsx-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: messy_xlsx-0.1.0-py3-none-any.whl
  • Upload date:
  • Size: 40.0 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.14.2

File hashes

Hashes for messy_xlsx-0.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 136655fdf69eb07ee80e089ebfde26b7a62fe3be546671fdeab41d5ab872dcc6
MD5 751a28e0c50658e491fb8366708205d8
BLAKE2b-256 3961c2e852f30057fd30f3ce96560183a74c355c8d9a419070b62e117b9d5e26

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page