Skip to main content

Import Excel files into tidy format with cell-level data extraction

Project description

tidyxl

A Python package that imports Excel files (.xlsx, .xlsm) into a tidy format where each cell is represented as a single row with detailed metadata. This package is inspired by and closely mirrors the functionality of the R tidyxl package.

Installation

pip install tidyxl

Quick Start

from tidyxl import xlsx_cells, xlsx_sheet_names

# Get all sheet names
sheets = xlsx_sheet_names("data.xlsx")
print(f"Found sheets: {sheets}")

# Read all cells from all sheets
cells = xlsx_cells("data.xlsx")
print(f"Total cells: {len(cells)}")

# Read specific sheet only
employees = xlsx_cells("data.xlsx", sheets="Employees")

# Filter for cells with content
content_cells = cells[~cells['is_blank']]
print(content_cells[['sheet', 'address', 'data_type', 'character', 'numeric']].head())

Features

  • Tidy Format: Each Excel cell becomes one row with comprehensive metadata
  • Complete Cell Information: Extract values, formulas, formatting, comments, and more
  • Multiple Worksheets: Process all sheets or specify particular ones
  • Named Ranges: Extract and analyze Excel named ranges and formulas
  • Data Validation: Discover data validation rules applied to cells
  • Type Safety: Separate columns for different data types (numeric, character, logical, date, error)
  • R tidyxl Compatible: Identical API and output structure to the R package

Core Functions

xlsx_cells()

Extract all cell data in tidy format:

from tidyxl import xlsx_cells

# Read all sheets
cells = xlsx_cells("file.xlsx")

# Read specific sheets
cells = xlsx_cells("file.xlsx", sheets=["Sheet1", "Sheet2"])
cells = xlsx_cells("file.xlsx", sheets="Data")

# Include/exclude blank cells
cells = xlsx_cells("file.xlsx", include_blank_cells=False)

xlsx_sheet_names()

List all worksheet names:

from tidyxl import xlsx_sheet_names

sheets = xlsx_sheet_names("file.xlsx")
# Returns: ['Sheet1', 'Data', 'Summary']

xlsx_names()

Extract named ranges and formulas:

from tidyxl import xlsx_names

names = xlsx_names("file.xlsx")
print(names[['name', 'formula', 'sheet', 'is_range']])

xlsx_validation()

Extract data validation rules:

from tidyxl import xlsx_validation

validation = xlsx_validation("file.xlsx")
print(validation[['sheet', 'ref', 'type', 'formula1']])

xlsx_formats()

Extract formatting information:

from tidyxl import xlsx_formats

formats = xlsx_formats("file.xlsx")
# Returns dict with keys: fonts, fills, borders, number_formats

Output Structure

The xlsx_cells() function returns a pandas DataFrame with 23 columns matching the R tidyxl package:

Column Type Description
sheet str Worksheet name
address str Cell address (A1 notation)
row int Row number
col int Column number
is_blank bool Whether cell has a value
content str Raw cell value before type conversion
data_type str Cell type (character, numeric, logical, date, error, blank)
error str Error value if cell contains error
logical bool Boolean value if cell contains TRUE/FALSE
numeric float Numeric value if cell contains number
date datetime Date value if cell contains date
character str String value if cell contains text
formula str Formula if cell contains formula
is_array bool Whether formula is array formula
formula_ref str Range address for array/shared formulas
formula_group int Formula group identifier
comment str Cell comment text
height float Row height in Excel units
width float Column width in Excel units
row_outline_level int Row outline/grouping level
col_outline_level int Column outline/grouping level
style_format str Style format identifier
local_format_id int Local formatting identifier

License

This project is licensed under the MIT License.

Acknowledgments

  • Inspired by the excellent R tidyxl package by Duncan Garmonsway
  • Built on top of openpyxl for Excel file processing
  • Uses pandas for data structure management

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

tidyxl-0.1.0.tar.gz (21.7 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

tidyxl-0.1.0-py3-none-any.whl (12.1 kB view details)

Uploaded Python 3

File details

Details for the file tidyxl-0.1.0.tar.gz.

File metadata

  • Download URL: tidyxl-0.1.0.tar.gz
  • Upload date:
  • Size: 21.7 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.12.2

File hashes

Hashes for tidyxl-0.1.0.tar.gz
Algorithm Hash digest
SHA256 c429a078a7e850b23972717d3424b53c7c2e7d5655f065d81c0f4f3c85729885
MD5 f8438099303803d755496d72eb26cafd
BLAKE2b-256 be4a153a140ad0f17455b754524f55d56b5e1d9baf1cbbc51256eff00c9e99e2

See more details on using hashes here.

File details

Details for the file tidyxl-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: tidyxl-0.1.0-py3-none-any.whl
  • Upload date:
  • Size: 12.1 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.12.2

File hashes

Hashes for tidyxl-0.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 ee441083a43b67215fa44aa71a330568315b6b117b899b2c734f237a9b7df321
MD5 85ce11a676cb8e232aaa1329436a6fc4
BLAKE2b-256 4f65d56db59e62e5c95da5e72e5c1c2f06949a2cb7ece77db70aecf9d01338cd

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page