Import Excel files into tidy format with cell-level data extraction
Project description
tidyxl
A Python package that imports Excel files (.xlsx, .xlsm) into a tidy format where each cell is represented as a single row with detailed metadata. This package is inspired by and closely mirrors the functionality of the R tidyxl package.
Installation
pip install tidyxl
Quick Start
from tidyxl import xlsx_cells, xlsx_sheet_names
# Get all sheet names
sheets = xlsx_sheet_names("data.xlsx")
print(f"Found sheets: {sheets}")
# Read all cells from all sheets
cells = xlsx_cells("data.xlsx")
print(f"Total cells: {len(cells)}")
# Read specific sheet only
employees = xlsx_cells("data.xlsx", sheets="Employees")
# Filter for cells with content
content_cells = cells[~cells['is_blank']]
print(content_cells[['sheet', 'address', 'data_type', 'character', 'numeric']].head())
Features
- Tidy Format: Each Excel cell becomes one row with comprehensive metadata
- Complete Cell Information: Extract values, formulas, formatting, comments, and more
- Multiple Worksheets: Process all sheets or specify particular ones
- Named Ranges: Extract and analyze Excel named ranges and formulas
- Data Validation: Discover data validation rules applied to cells
- Type Safety: Separate columns for different data types (numeric, character, logical, date, error)
- R tidyxl Compatible: Identical API and output structure to the R package
Core Functions
xlsx_cells()
Extract all cell data in tidy format:
from tidyxl import xlsx_cells
# Read all sheets
cells = xlsx_cells("file.xlsx")
# Read specific sheets
cells = xlsx_cells("file.xlsx", sheets=["Sheet1", "Sheet2"])
cells = xlsx_cells("file.xlsx", sheets="Data")
# Include/exclude blank cells
cells = xlsx_cells("file.xlsx", include_blank_cells=False)
xlsx_sheet_names()
List all worksheet names:
from tidyxl import xlsx_sheet_names
sheets = xlsx_sheet_names("file.xlsx")
# Returns: ['Sheet1', 'Data', 'Summary']
xlsx_names()
Extract named ranges and formulas:
from tidyxl import xlsx_names
names = xlsx_names("file.xlsx")
print(names[['name', 'formula', 'sheet', 'is_range']])
xlsx_validation()
Extract data validation rules:
from tidyxl import xlsx_validation
validation = xlsx_validation("file.xlsx")
print(validation[['sheet', 'ref', 'type', 'formula1']])
xlsx_formats()
Extract formatting information:
from tidyxl import xlsx_formats
formats = xlsx_formats("file.xlsx")
# Returns dict with keys: fonts, fills, borders, number_formats
Output Structure
The xlsx_cells() function returns a pandas DataFrame with 23 columns matching the R tidyxl package:
| Column | Type | Description |
|---|---|---|
sheet |
str | Worksheet name |
address |
str | Cell address (A1 notation) |
row |
int | Row number |
col |
int | Column number |
is_blank |
bool | Whether cell has a value |
content |
str | Raw cell value before type conversion |
data_type |
str | Cell type (character, numeric, logical, date, error, blank) |
error |
str | Error value if cell contains error |
logical |
bool | Boolean value if cell contains TRUE/FALSE |
numeric |
float | Numeric value if cell contains number |
date |
datetime | Date value if cell contains date |
character |
str | String value if cell contains text |
formula |
str | Formula if cell contains formula |
is_array |
bool | Whether formula is array formula |
formula_ref |
str | Range address for array/shared formulas |
formula_group |
int | Formula group identifier |
comment |
str | Cell comment text |
height |
float | Row height in Excel units |
width |
float | Column width in Excel units |
row_outline_level |
int | Row outline/grouping level |
col_outline_level |
int | Column outline/grouping level |
style_format |
str | Style format identifier |
local_format_id |
int | Local formatting identifier |
License
This project is licensed under the MIT License.
Acknowledgments
- Inspired by the excellent R tidyxl package by Duncan Garmonsway
- Built on top of openpyxl for Excel file processing
- Uses pandas for data structure management
Project details
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file tidyxl-0.1.0.tar.gz.
File metadata
- Download URL: tidyxl-0.1.0.tar.gz
- Upload date:
- Size: 21.7 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.12.2
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
c429a078a7e850b23972717d3424b53c7c2e7d5655f065d81c0f4f3c85729885
|
|
| MD5 |
f8438099303803d755496d72eb26cafd
|
|
| BLAKE2b-256 |
be4a153a140ad0f17455b754524f55d56b5e1d9baf1cbbc51256eff00c9e99e2
|
File details
Details for the file tidyxl-0.1.0-py3-none-any.whl.
File metadata
- Download URL: tidyxl-0.1.0-py3-none-any.whl
- Upload date:
- Size: 12.1 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.12.2
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
ee441083a43b67215fa44aa71a330568315b6b117b899b2c734f237a9b7df321
|
|
| MD5 |
85ce11a676cb8e232aaa1329436a6fc4
|
|
| BLAKE2b-256 |
4f65d56db59e62e5c95da5e72e5c1c2f06949a2cb7ece77db70aecf9d01338cd
|