A Python library for parsing messy Excel files with intelligent structure detection and formula evaluation
messy-xlsx
Parse Excel and CSV files (XLSX, XLS, CSV) into pandas DataFrames, with structure detection and normalization.
Install
pip install messy-xlsx
# Optional: formula evaluation (quoted so shells such as zsh do not expand the brackets)
pip install "messy-xlsx[formulas]"
# Optional: legacy .xls support
pip install "messy-xlsx[xls]"
Usage
from messy_xlsx import MessyWorkbook, SheetConfig, read_excel
# Quick read
df = read_excel("data.xlsx")
# With options
df = read_excel("data.xlsx", sheet="Sheet1", skip_rows=2, normalize=False)
# Workbook API
with MessyWorkbook("data.xlsx") as wb:
    df = wb.to_dataframe(sheet="Sheet1")
    all_dfs = wb.to_dataframes()  # All sheets
    structure = wb.get_structure()
# From bytes (S3, cloud storage)
import io
wb = MessyWorkbook(io.BytesIO(content), filename="data.xlsx")
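A byte stream carries no file path, which is presumably why `filename` is passed alongside it. As an illustration only, and not necessarily how messy-xlsx decides the format, the sketch below tells XLSX from legacy XLS by the leading bytes of the content. The function name `sniff_format` and the `"unknown"` fallback are hypothetical choices made for this example.

```python
# Hypothetical helper for illustration; not part of the messy-xlsx API.

# XLSX files are ZIP containers and start with the ZIP local file header.
XLSX_MAGIC = b"PK\x03\x04"
# Legacy XLS files are OLE2 compound documents with this 8-byte signature.
XLS_MAGIC = b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"


def sniff_format(content: bytes) -> str:
    """Guess the spreadsheet format from the first bytes of the content."""
    if content.startswith(XLSX_MAGIC):
        return "xlsx"
    if content.startswith(XLS_MAGIC):
        return "xls"
    # CSV has no signature, so anything else is left undecided here.
    return "unknown"
```

A ZIP signature only shows that the content is a ZIP archive, not that it is a valid workbook, so a check like this is a hint rather than a guarantee.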
Configuration
config = SheetConfig(
    # Row handling
    skip_rows=0,
    header_rows=1,
    skip_footer=0,
    cell_range=None,  # "A1:F100"
    # Detection
    auto_detect=True,
    header_detection_mode="smart",  # "smart", "auto", "manual"
    header_confidence_threshold=0.7,
    # Parsing
    merge_strategy="fill",  # "fill", "skip", "first_only"
    include_hidden=False,
    locale="auto",  # "auto", "en_US", "de_DE"
    # Normalization
    normalize=True,
    normalize_dates=True,
    normalize_numbers=True,
    normalize_whitespace=True,
    # Formulas
    evaluate_formulas=True,
)
wb = MessyWorkbook("data.xlsx", sheet_config=config)
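To make the interplay of `locale` and `normalize_numbers` more concrete, here is a minimal sketch of locale-aware number parsing. It is an assumption about what such normalization could involve, not the library's implementation; `parse_localized_number` and its separator table are hypothetical names introduced for this example.

```python
# Hypothetical helper for illustration; not part of the messy-xlsx API.

# (thousands separator, decimal separator) for each supported locale
SEPARATORS = {
    "en_US": (",", "."),
    "de_DE": (".", ","),
}


def parse_localized_number(text: str, locale: str = "en_US") -> float:
    """Parse a number string written with the given locale's separators."""
    if locale not in SEPARATORS:
        raise ValueError(f"unsupported locale: {locale}")
    thousands, decimal = SEPARATORS[locale]
    # Drop grouping separators first, then switch the decimal mark to ".".
    cleaned = text.strip().replace(thousands, "").replace(decimal, ".")
    return float(cleaned)


print(parse_localized_number("1,234.56", "en_US"))
print(parse_localized_number("1.234,56", "de_DE"))
```

Both calls yield the same float, which is the point of normalizing: the same value written in two conventions ends up as one numeric representation.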
Multi-Sheet
from messy_xlsx import read_all_sheets, analyze_excel
# Read all sheets
results = read_all_sheets("data.xlsx")
for name, df in results.items():
    print(f"{name}: {len(df)} rows")
# Analyze without loading
info = analyze_excel("data.xlsx")
for sheet in info:
    print(f"{sheet.name}: {sheet.row_count} rows, {sheet.column_count} cols")
Output
Output is compatible with BigQuery/Arrow. Mixed-type columns are coerced to strings.
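Columnar formats such as Arrow expect a single type per column, which is why mixed columns need special handling. The sketch below shows one way the coercion rule could work on a plain list of cell values. It is an illustration under that assumption, not the library's code, and `coerce_mixed_column` is a hypothetical name.

```python
# Hypothetical helper for illustration; not part of the messy-xlsx API.

def coerce_mixed_column(values: list) -> list:
    """Return the values as strings if they mix types, else unchanged.

    None is treated as a missing value and is preserved either way.
    """
    kinds = {type(v) for v in values if v is not None}
    if len(kinds) <= 1:
        # Zero or one concrete type: the column is already uniform.
        return list(values)
    return [None if v is None else str(v) for v in values]


print(coerce_mixed_column([1, "n/a", None, 2.5]))
print(coerce_mixed_column([1, 2, None]))
```

The first column mixes integers, strings and floats, so every non-missing value becomes a string; the second holds only integers and is returned as-is.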
Dependencies
- Python >= 3.10
- fastexcel >= 0.11
- openpyxl >= 3.1
- pandas >= 2.0
- numpy >= 1.24
Optional:
- formulas, xlcalculator (formula evaluation)
- xlrd (XLS support)
License
MIT