A Python library for parsing messy Excel files with intelligent structure detection and formula evaluation

messy-xlsx

Parse spreadsheet files (XLSX, XLS, CSV) into pandas DataFrames, with structure detection and normalization.

Install

pip install messy-xlsx

# Optional: formula evaluation (quotes keep shells such as zsh from expanding the brackets)
pip install "messy-xlsx[formulas]"

# Optional: legacy .xls support
pip install "messy-xlsx[xls]"

Usage

from messy_xlsx import MessyWorkbook, SheetConfig, read_excel

# Quick read
df = read_excel("data.xlsx")

# With options
df = read_excel("data.xlsx", sheet="Sheet1", skip_rows=2, normalize=False)

# Workbook API
with MessyWorkbook("data.xlsx") as wb:
    df = wb.to_dataframe(sheet="Sheet1")
    all_dfs = wb.to_dataframes()  # All sheets
    structure = wb.get_structure()

# From bytes (S3, cloud storage); `content` holds the raw file bytes
import io
wb = MessyWorkbook(io.BytesIO(content), filename="data.xlsx")
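A byte stream carries no file extension, which is presumably why the bytes example passes a filename alongside the buffer. When the original name is not available, one option is to guess it from the file signature. The sketch below is a hypothetical helper (guess_filename is not part of messy-xlsx) and it relies only on two well-known signatures: XLSX files are ZIP archives and legacy XLS files are OLE2 compound files. Any ZIP archive would match the first check, so treat the result as a hint only.

```python
import io

# Well-known file signatures (magic bytes).
ZIP_MAGIC = b"PK\x03\x04"                         # XLSX is a ZIP container
OLE2_MAGIC = b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"  # legacy XLS is an OLE2 file


def guess_filename(content: bytes, stem: str = "data") -> str:
    """Return a filename hint such as 'data.xlsx' for raw bytes (hypothetical helper)."""
    if content.startswith(ZIP_MAGIC):
        return f"{stem}.xlsx"
    if content.startswith(OLE2_MAGIC):
        return f"{stem}.xls"
    # Anything else is treated as delimited text in this sketch.
    return f"{stem}.csv"


# Stand-in bytes; real content would come from S3 or another object store.
content = ZIP_MAGIC + b"rest of the archive"
buffer = io.BytesIO(content)
hint = guess_filename(content)

# The buffer and hint could then be passed as in the example above:
# wb = MessyWorkbook(buffer, filename=hint)
```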

Configuration

config = SheetConfig(
    # Row handling
    skip_rows=0,
    header_rows=1,
    skip_footer=0,
    cell_range=None,              # "A1:F100"

    # Detection
    auto_detect=True,
    header_detection_mode="smart", # "smart", "auto", "manual"
    header_confidence_threshold=0.7,

    # Parsing
    merge_strategy="fill",        # "fill", "skip", "first_only"
    include_hidden=False,
    locale="auto",                # "auto", "en_US", "de_DE"

    # Normalization
    normalize=True,
    normalize_dates=True,
    normalize_numbers=True,
    normalize_whitespace=True,

    # Formulas
    evaluate_formulas=True,
)

wb = MessyWorkbook("data.xlsx", sheet_config=config)
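To make two of these options more concrete, here is a minimal sketch of what locale-aware number parsing and the "fill" merge strategy could mean. Both helpers (parse_number and fill_merged) are hypothetical illustrations written in plain Python. They are not the library's implementation, and the reading of "fill" as "copy the top-left value into every cell of the merged range" is an assumption.

```python
# Illustrative sketch only: these helpers are hypothetical and are not the
# messy-xlsx implementation.

# (thousands separator, decimal separator) per locale
SEPARATORS = {"en_US": (",", "."), "de_DE": (".", ",")}


def parse_number(text, locale="en_US"):
    """Parse a formatted number string using the locale's separators."""
    thousands, decimal = SEPARATORS[locale]
    cleaned = text.strip().replace(thousands, "").replace(decimal, ".")
    return float(cleaned)


def fill_merged(grid, merged_ranges):
    """Copy each merged range's top-left value into every cell it covers.

    merged_ranges holds (top, left, bottom, right), zero-based and inclusive.
    The input grid is left unchanged.
    """
    filled = [row[:] for row in grid]
    for top, left, bottom, right in merged_ranges:
        value = grid[top][left]
        for r in range(top, bottom + 1):
            for c in range(left, right + 1):
                filled[r][c] = value
    return filled


# The same text means different numbers under different locales.
us_value = parse_number("1.234", "en_US")
de_value = parse_number("1.234", "de_DE")

# "North" spans rows 0-1 of column 0; only the top-left cell holds the value.
grid = [["North", 10], [None, 20], ["South", 30]]
filled = fill_merged(grid, [(0, 0, 1, 0)])
```

The locale example shows why a wrong locale guess silently changes values by orders of magnitude, so it may be worth setting locale explicitly when the source of a file is known.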

Multi-Sheet

from messy_xlsx import read_all_sheets, analyze_excel

# Read all sheets
results = read_all_sheets("data.xlsx")
for name, df in results.items():
    print(f"{name}: {len(df)} rows")

# Analyze without loading
info = analyze_excel("data.xlsx")
for sheet in info:
    print(f"{sheet.name}: {sheet.row_count} rows, {sheet.column_count} cols")
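The analysis result can also drive which sheets get loaded at all. The sketch below assumes only the three attributes printed above (name, row_count, column_count). The helper sheets_worth_loading and its thresholds are hypothetical, and SimpleNamespace objects stand in for the real result so the snippet runs without a workbook.

```python
from types import SimpleNamespace


def sheets_worth_loading(info, min_rows=1, min_cols=1):
    """Return names of sheets whose reported size meets both thresholds."""
    return [
        sheet.name
        for sheet in info
        if sheet.row_count >= min_rows and sheet.column_count >= min_cols
    ]


# Stand-in objects with the attributes shown above; real code would use
# info = analyze_excel("data.xlsx")
info = [
    SimpleNamespace(name="Summary", row_count=12, column_count=4),
    SimpleNamespace(name="Notes", row_count=0, column_count=0),
    SimpleNamespace(name="Raw", row_count=5000, column_count=18),
]

names = sheets_worth_loading(info, min_rows=10)
print(names)  # ['Summary', 'Raw']
```

Each selected name could then be passed to read_excel via its sheet argument, as in the Usage section.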

Output

Output DataFrames are compatible with BigQuery and Arrow. Columns that contain values of mixed types are coerced to strings, so each column ends up with a single type.
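Columnar formats such as Arrow expect one type per column, which is why mixed columns need special handling. As a rough illustration of the coercion idea, the sketch below works on a plain dict of lists. The function coerce_mixed_columns is hypothetical and is not how messy-xlsx does it. In this simplified version, a column that mixes int and float also counts as mixed, which a real implementation would likely treat as numeric.

```python
def coerce_mixed_columns(columns):
    """Convert columns holding more than one non-null type to strings.

    Nulls (None) are preserved; single-type columns are copied unchanged.
    """
    result = {}
    for name, values in columns.items():
        kinds = {type(v) for v in values if v is not None}
        if len(kinds) > 1:
            result[name] = [None if v is None else str(v) for v in values]
        else:
            result[name] = list(values)
    return result


raw = {
    "id": [1, "A2", None],   # int and str mixed: becomes strings
    "qty": [3, 5, 8],        # single type: left as integers
}
clean = coerce_mixed_columns(raw)
print(clean)  # {'id': ['1', 'A2', None], 'qty': [3, 5, 8]}
```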

Dependencies

  • Python >= 3.10
  • fastexcel >= 0.11
  • openpyxl >= 3.1
  • pandas >= 2.0
  • numpy >= 1.24

Optional:

  • formulas, xlcalculator (formula evaluation)
  • xlrd (XLS support)
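Before enabling an optional feature, a script can check whether the matching modules are importable. This is a small sketch using only the standard library. The module names come from the list above, while the helper is_installed and the choice to treat every listed module as required for its feature are assumptions.

```python
from importlib.util import find_spec


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be found without importing it."""
    return find_spec(module_name) is not None


# Optional features and the modules listed for them above.
OPTIONAL = {
    "formula evaluation": ["formulas", "xlcalculator"],
    "XLS support": ["xlrd"],
}

for feature, modules in OPTIONAL.items():
    missing = [m for m in modules if not is_installed(m)]
    status = "available" if not missing else f"missing {', '.join(missing)}"
    print(f"{feature}: {status}")
```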

License

MIT


