A Python library for parsing messy Excel files with intelligent structure detection and formula evaluation
messy-xlsx
Parse messy Excel files (XLSX, XLS, CSV) to clean pandas DataFrames with intelligent structure detection, merged cell handling, and type normalization.
Install
pip install messy-xlsx
# Optional: formula evaluation
pip install messy-xlsx[formulas]
# Optional: legacy .xls support
pip install messy-xlsx[xls]
# Everything
pip install messy-xlsx[all]
Quick Start
from messy_xlsx import MessyWorkbook, SheetConfig, read_excel
# Quick read
df = read_excel("data.xlsx")
# With options
df = read_excel("data.xlsx", sheet="Sheet1", skip_rows=2, normalize=False)
# Workbook API
with MessyWorkbook("data.xlsx") as wb:
    df = wb.to_dataframe(sheet="Sheet1")
    all_dfs = wb.to_dataframes()  # All sheets
    structure = wb.get_structure()
# From bytes (e.g. downloaded from S3 or other cloud storage);
# `content` holds the raw file bytes
import io
wb = MessyWorkbook(io.BytesIO(content), filename="data.xlsx")
Configuration
from messy_xlsx import SheetConfig, MergeStrategy, HeaderDetectionMode
config = SheetConfig(
    # Row handling
    skip_rows=0,
    header_rows=1,
    skip_footer=0,
    cell_range=None,  # e.g. "A1:F100"
    # Detection
    auto_detect=True,
    header_detection_mode="smart",  # or HeaderDetectionMode.SMART
    header_confidence_threshold=0.7,
    # Parsing
    merge_strategy="fill",  # or MergeStrategy.FILL
    include_hidden=False,
    # Normalization
    normalize=True,
    normalize_dates=True,
    normalize_numbers=True,
    normalize_whitespace=True,
    sanitize_column_names=True,  # BigQuery-compatible names
    # Formulas
    evaluate_formulas=True,
)
wb = MessyWorkbook("data.xlsx", sheet_config=config)
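To give a feel for what merge_strategy="fill" plausibly means, here is a minimal sketch in plain Python. It assumes "fill" copies the top-left value of each merged range into every cell of that range; the README does not spell this out, and fill_merged, the grid layout, and the range tuples are hypothetical names for illustration, not part of the messy-xlsx API.

```python
def fill_merged(grid, merged_ranges):
    """Copy each merged range's top-left value into every cell of the range.

    grid: list of rows (each a list of cell values).
    merged_ranges: tuples of (top, left, bottom, right), 0-based and inclusive.
    Assumption: this mirrors a "fill" strategy; the library may differ.
    """
    filled = [row[:] for row in grid]  # leave the input grid untouched
    for top, left, bottom, right in merged_ranges:
        value = grid[top][left]
        for r in range(top, bottom + 1):
            for c in range(left, right + 1):
                filled[r][c] = value
    return filled


# "North" spans rows 1-2 of column 0, so only its first cell carries a value.
grid = [["Region", "Q1"], ["North", 10], [None, 20]]
filled = fill_merged(grid, [(1, 0, 2, 0)])
```

After filling, every data row carries its own region label, which is what a flat DataFrame needs.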
All string-based config values accept both raw strings and enum types:
from messy_xlsx import MergeStrategy
# These are equivalent:
SheetConfig(merge_strategy="fill")
SheetConfig(merge_strategy=MergeStrategy.FILL)
# Enums compare equal to strings:
assert MergeStrategy.FILL == "fill" # True
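One common way to get enums that compare equal to strings is to mix str into the Enum class. The sketch below uses only the standard library and a stand-in class, so it shows the general technique rather than how messy-xlsx actually defines MergeStrategy; coerce_strategy is a hypothetical helper.

```python
from enum import Enum


class MergeStrategy(str, Enum):
    """Stand-in enum for illustration; only FILL appears in this README."""
    FILL = "fill"


def coerce_strategy(value):
    """Accept a raw string or an enum member and return the enum member.

    Lookup by value raises ValueError for unknown strings such as "banana".
    """
    return MergeStrategy(value)


# The str mixin makes members compare equal to their string values.
assert MergeStrategy.FILL == "fill"
assert coerce_strategy("fill") is MergeStrategy.FILL
assert coerce_strategy(MergeStrategy.FILL) is MergeStrategy.FILL
```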
Invalid values raise ValueError at construction time:
SheetConfig(skip_rows=-1) # ValueError
SheetConfig(merge_strategy="banana") # ValueError
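Failing at construction time can be achieved with a dataclass that validates in __post_init__. The following is a reduced, hypothetical stand-in (SheetConfigSketch with two fields and an assumed set of valid strategies), not the library's own SheetConfig.

```python
from dataclasses import dataclass

# Assumption: only "fill" is shown in this README; the real set may be larger.
_VALID_MERGE_STRATEGIES = {"fill"}


@dataclass
class SheetConfigSketch:
    """Hypothetical two-field stand-in for SheetConfig."""
    skip_rows: int = 0
    merge_strategy: str = "fill"

    def __post_init__(self):
        # Validate eagerly so bad values fail when the config is built,
        # not later in the middle of parsing.
        if self.skip_rows < 0:
            raise ValueError(f"skip_rows must be >= 0, got {self.skip_rows}")
        if self.merge_strategy not in _VALID_MERGE_STRATEGIES:
            raise ValueError(f"unknown merge_strategy: {self.merge_strategy!r}")


ok = SheetConfigSketch(skip_rows=2)  # valid values construct normally
```

Validating in the constructor keeps every later code path free of defensive checks, since any config object that exists is already known to be valid.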
Multi-Sheet
from messy_xlsx import read_all_sheets, analyze_excel
# Read all sheets
results = read_all_sheets("data.xlsx")
for name, df in results.items():
    print(f"{name}: {len(df)} rows")
# Analyze without loading
info = analyze_excel("data.xlsx")
for sheet in info:
    print(f"{sheet.name}: {sheet.row_count} rows, {sheet.column_count} cols")
Output
Output is compatible with BigQuery/Arrow. Column names are sanitized by default and mixed-type columns are coerced to strings.
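As a rough illustration of these two steps, the sketch below sanitizes a column name to the character set BigQuery accepts (letters, digits, underscores, not starting with a digit) and stringifies a column that mixes value types. Both functions are hypothetical and the exact rules (lowercasing, collapsing underscores, how "mixed" is detected) are assumptions; messy-xlsx may apply different ones.

```python
import re


def sanitize_column_name(name: str) -> str:
    """Return a lowercase name made of letters, digits and underscores."""
    cleaned = re.sub(r"[^0-9A-Za-z_]", "_", name.strip())
    cleaned = re.sub(r"_+", "_", cleaned).strip("_")
    # Names must not be empty or start with a digit, so prefix an underscore.
    if not cleaned or cleaned[0].isdigit():
        cleaned = "_" + cleaned
    return cleaned.lower()


def coerce_mixed_column(values):
    """Stringify a column if its non-null values have more than one type."""
    types = {type(v) for v in values if v is not None}
    if len(types) > 1:
        return [None if v is None else str(v) for v in values]
    return list(values)


names = [sanitize_column_name(n) for n in ["Total Sales ($)", "2024 Revenue"]]
column = coerce_mixed_column([1, "a", None])
```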
Dependencies
- Python >= 3.10
- fastexcel >= 0.11
- openpyxl >= 3.1
- pandas >= 2.0
- numpy >= 1.24
Optional:
- formulas, xlcalculator (formula evaluation)
- xlrd (XLS support)
Development
# Install with dev dependencies
make install
# Run tests, lint, type check
make ci
# Run benchmarks
make benchmark
# Serve documentation locally
make docs
License
MIT