Skip to main content

Import Excel files into tidy format with cell-level data extraction

Project description

tidyxl Logo

PyPI version PyPI downloads Python versions

A Python package that imports Excel files (.xlsx, .xlsm) into a tidy format where each cell is represented as a single row with detailed metadata. This package is inspired by and closely mirrors the functionality of the R tidyxl package.

Installation

pip install tidyxl

Quick Start

from tidyxl import xlsx_cells, xlsx_sheet_names

# Get all sheet names
sheets = xlsx_sheet_names("data.xlsx")
print(f"Found sheets: {sheets}")

# Read all cells from all sheets
cells = xlsx_cells("data.xlsx")
print(f"Total cells: {len(cells)}")

# Read specific sheet only
employees = xlsx_cells("data.xlsx", sheets="Employees")

# Filter for cells with content
content_cells = cells[~cells['is_blank']]
print(content_cells[['sheet', 'address', 'data_type', 'character', 'numeric']].head())

Features

  • Tidy Format: Each Excel cell becomes one row with comprehensive metadata
  • Complete Cell Information: Extract values, formulas, formatting, comments, and more
  • Multiple Worksheets: Process all sheets or specify particular ones
  • Named Ranges: Extract and analyze Excel named ranges and formulas
  • Data Validation: Discover data validation rules applied to cells
  • Type Safety: Separate columns for different data types (numeric, character, logical, date, error)
  • R tidyxl Compatible: Identical API and output structure to the R package

Core Functions

xlsx_cells()

Extract all cell data in tidy format:

from tidyxl import xlsx_cells

# Read all sheets
cells = xlsx_cells("file.xlsx")

# Read specific sheets
cells = xlsx_cells("file.xlsx", sheets=["Sheet1", "Sheet2"])
cells = xlsx_cells("file.xlsx", sheets="Data")

# Include/exclude blank cells
cells = xlsx_cells("file.xlsx", include_blank_cells=False)

xlsx_sheet_names()

List all worksheet names:

from tidyxl import xlsx_sheet_names

sheets = xlsx_sheet_names("file.xlsx")
# Returns: ['Sheet1', 'Data', 'Summary']

xlsx_names()

Extract named ranges and formulas:

from tidyxl import xlsx_names

names = xlsx_names("file.xlsx")
print(names[['name', 'formula', 'sheet', 'is_range']])

xlsx_validation()

Extract data validation rules:

from tidyxl import xlsx_validation

validation = xlsx_validation("file.xlsx")
print(validation[['sheet', 'ref', 'type', 'formula1']])

xlsx_formats()

Extract formatting information:

from tidyxl import xlsx_formats

formats = xlsx_formats("file.xlsx")
# Returns dict with keys: fonts, fills, borders, number_formats

Output Structure

The xlsx_cells() function returns a pandas DataFrame with 23 columns matching the R tidyxl package:

Column Type Description
sheet str Worksheet name
address str Cell address (A1 notation)
row int Row number
col int Column number
is_blank bool Whether cell has a value
content str Raw cell value before type conversion
data_type str Cell type (character, numeric, logical, date, error, blank)
error str Error value if cell contains error
logical bool Boolean value if cell contains TRUE/FALSE
numeric float Numeric value if cell contains number
date datetime Date value if cell contains date
character str String value if cell contains text
formula str Formula if cell contains formula
is_array bool Whether formula is array formula
formula_ref str Range address for array/shared formulas
formula_group int Formula group identifier
comment str Cell comment text
height float Row height in Excel units
width float Column width in Excel units
row_outline_level int Row outline/grouping level
col_outline_level int Column outline/grouping level
style_format str Style format identifier
local_format_id int Local formatting identifier

License

This project is licensed under the MIT License.

Acknowledgments

  • Inspired by the excellent R tidyxl package by Duncan Garmonsway
  • Built on top of openpyxl for Excel file processing
  • Uses pandas for data structure management

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

tidyxl-0.1.8.tar.gz (27.8 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

tidyxl-0.1.8-py3-none-any.whl (17.3 kB view details)

Uploaded Python 3

File details

Details for the file tidyxl-0.1.8.tar.gz.

File metadata

  • Download URL: tidyxl-0.1.8.tar.gz
  • Upload date:
  • Size: 27.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: uv/0.8.4

File hashes

Hashes for tidyxl-0.1.8.tar.gz
Algorithm Hash digest
SHA256 a6e5e1d70a8b111c0b5593add6b4c65f8d4058577bde7e1a3b9fb867f9b0e782
MD5 fcf65d5ee158134b8f85fb725982e4da
BLAKE2b-256 e2bcd590aeb3c9c4e8bb27c9f415d2e3d64864f62ff62d4a043aca8260dfcafd

See more details on using hashes here.

File details

Details for the file tidyxl-0.1.8-py3-none-any.whl.

File metadata

  • Download URL: tidyxl-0.1.8-py3-none-any.whl
  • Upload date:
  • Size: 17.3 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: uv/0.8.4

File hashes

Hashes for tidyxl-0.1.8-py3-none-any.whl
Algorithm Hash digest
SHA256 108290167e0e6f431e49b9a9ac7bc6125cf31b282f00e6a6042f1ffc3771973b
MD5 ed85aebb1950e92438b26fedb7816ac2
BLAKE2b-256 c965d8ab97601cab79adaf8f6badf98e01b06e4ba2bf82d9b157320f02636082

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page