A Python library for parsing messy Excel files with intelligent structure detection and formula evaluation

messy-xlsx

Parse spreadsheet files (XLSX, XLS, CSV) into pandas DataFrames, with structure detection and normalization.

Install

pip install messy-xlsx

# Optional: formula evaluation (quotes keep shells such as zsh from globbing the brackets)
pip install "messy-xlsx[formulas]"

# Optional: legacy .xls support
pip install "messy-xlsx[xls]"

Usage

from messy_xlsx import MessyWorkbook, SheetConfig, read_excel

# Quick read
df = read_excel("data.xlsx")

# With options
df = read_excel("data.xlsx", sheet="Sheet1", skip_rows=2, normalize=False)

# Workbook API
with MessyWorkbook("data.xlsx") as wb:
    df = wb.to_dataframe(sheet="Sheet1")
    all_dfs = wb.to_dataframes()  # All sheets
    structure = wb.get_structure()

# From bytes (S3, cloud storage); `content` holds the raw file bytes
import io
wb = MessyWorkbook(io.BytesIO(content), filename="data.xlsx")

Configuration

config = SheetConfig(
    # Row handling
    skip_rows=0,
    header_rows=1,
    skip_footer=0,
    cell_range=None,              # "A1:F100"

    # Detection
    auto_detect=True,
    header_detection_mode="smart", # "smart", "auto", "manual"
    header_confidence_threshold=0.7,

    # Parsing
    merge_strategy="fill",        # "fill", "skip", "first_only"
    include_hidden=False,
    locale="auto",                # "auto", "en_US", "de_DE"

    # Normalization
    normalize=True,
    normalize_dates=True,
    normalize_numbers=True,
    normalize_whitespace=True,

    # Formulas
    evaluate_formulas=True,
)

wb = MessyWorkbook("data.xlsx", sheet_config=config)
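To give a sense of why the `locale` and `normalize_numbers` options exist, here is a minimal standalone sketch of locale-aware number parsing. It is not messy-xlsx's implementation: the helper `parse_localized_number` and its separator table are hypothetical, and it covers only the two explicit locales named in the comment above.

```python
# Illustrative only: NOT messy-xlsx internals. A hypothetical helper showing
# why a locale hint matters when numbers arrive as text.

# (thousands separator, decimal separator) per locale; assumed for this sketch
SEPARATORS = {
    "en_US": (",", "."),
    "de_DE": (".", ","),
}


def parse_localized_number(text: str, locale: str = "en_US") -> float:
    """Convert a number string such as '1.234,56' to a float."""
    thousands, decimal = SEPARATORS[locale]
    # Drop grouping separators first, then switch the decimal mark to "."
    cleaned = text.strip().replace(thousands, "").replace(decimal, ".")
    return float(cleaned)


print(parse_localized_number("1,234.56", "en_US"))
print(parse_localized_number("1.234,56", "de_DE"))
```

The same text can mean different values depending on the locale: under these rules "1.234" parses as 1.234 for `en_US` but as 1234.0 for `de_DE`, which is presumably why the option can be set explicitly instead of left on "auto".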

Multi-Sheet

from messy_xlsx import read_all_sheets, analyze_excel

# Read all sheets
results = read_all_sheets("data.xlsx")
for name, df in results.items():
    print(f"{name}: {len(df)} rows")

# Analyze without loading
info = analyze_excel("data.xlsx")
for sheet in info:
    print(f"{sheet.name}: {sheet.row_count} rows, {sheet.column_count} cols")

Output

Output is compatible with BigQuery/Arrow. Mixed-type columns are coerced to strings.
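The coercion rule can be pictured with a small plain-Python sketch. This is only an illustration under stated assumptions, not the library's code: `coerce_mixed_column` is a hypothetical name, and treating `None` as a missing value that does not count toward the column's type is an assumption of this sketch.

```python
# Illustrative only: a hypothetical stand-in for "mixed-type columns are
# coerced to strings". Not taken from messy-xlsx.

def coerce_mixed_column(values: list) -> list:
    """Keep a single-typed column as-is; stringify a mixed-type one."""
    # Collect the distinct Python types, ignoring missing values (assumption)
    kinds = {type(v) for v in values if v is not None}
    if len(kinds) <= 1:
        return list(values)
    # More than one type present: convert every non-missing value to str
    return [None if v is None else str(v) for v in values]


print(coerce_mixed_column([1, 2, None]))
print(coerce_mixed_column([1, "n/a", 2.5]))
```

An Arrow column has a single data type, so a column holding both numbers and text needs a common representation before export; strings are the one every value can be converted to without loss of the original text.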

Dependencies

  • Python >= 3.10
  • fastexcel >= 0.11
  • openpyxl >= 3.1
  • pandas >= 2.0
  • numpy >= 1.24

Optional:

  • formulas, xlcalculator (formula evaluation)
  • xlrd (XLS support)
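A quick way to see which optional backends are available in the current environment is to probe for them with the standard library. The sketch below assumes the import names match the package names listed above; the `OPTIONAL_MODULES` table and `missing_optional_modules` helper are hypothetical and not part of messy-xlsx, and whether the library needs both formula packages at once is not stated here.

```python
from importlib.util import find_spec

# Module names taken from the optional dependency list above.
# Assumption: each package is imported under the same name it is installed as.
OPTIONAL_MODULES = {
    "formula evaluation": ["formulas", "xlcalculator"],
    "XLS support": ["xlrd"],
}


def missing_optional_modules() -> dict[str, list[str]]:
    """Map each optional feature to the modules that cannot be found."""
    return {
        feature: [name for name in modules if find_spec(name) is None]
        for feature, modules in OPTIONAL_MODULES.items()
    }


for feature, missing in missing_optional_modules().items():
    status = "ok" if not missing else "missing " + ", ".join(missing)
    print(f"{feature}: {status}")
```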

License

MIT

Download files

Source Distribution

messy_xlsx-0.6.0.tar.gz (8.9 MB view details)

Uploaded Source

Built Distribution

messy_xlsx-0.6.0-py3-none-any.whl (53.8 kB view details)

Uploaded Python 3

File details

Details for the file messy_xlsx-0.6.0.tar.gz.

File metadata

  • Download URL: messy_xlsx-0.6.0.tar.gz
  • Size: 8.9 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for messy_xlsx-0.6.0.tar.gz
Algorithm    Hash digest
SHA256       c72b7f17b1097460f44c3837c97f44b16fc546bdeb167c04d58d05db0ecea8a7
MD5          bf33c8c5b855aed7b8651d8c57904eb5
BLAKE2b-256  0a1a82bad81581772291ce84dad3897ad4f689283866ed334ac6e5dae6671793

Provenance

The following attestation bundles were made for messy_xlsx-0.6.0.tar.gz:

Publisher: publish.yml on ivan-loh/messy-xlsx

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file messy_xlsx-0.6.0-py3-none-any.whl.

File metadata

  • Download URL: messy_xlsx-0.6.0-py3-none-any.whl
  • Size: 53.8 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for messy_xlsx-0.6.0-py3-none-any.whl
Algorithm    Hash digest
SHA256       e2649c596ad48c7b9d529eed05aaa4e978615a68fc87b7e3513775f00e5ce416
MD5          44365612c2a716f01c4e7ae55efd4031
BLAKE2b-256  a8536a129d2042ec9b767b2b0916f35252b13bf29d7311bfe5957c29de270da7

Provenance

The following attestation bundles were made for messy_xlsx-0.6.0-py3-none-any.whl:

Publisher: publish.yml on ivan-loh/messy-xlsx
