Reading both XLSX and XLSB files, fast and memory-safe, into PyArrow.

These details have not been verified by PyPI

Project links

GitHub Statistics

View statistics for this project via Libraries.io, or by using our public dataset on Google BigQuery

Project description

RXLS

Shortcut from Read XLS[X|B]

Reading both XLSX and XLSB files, fast and memory-safe, with Python, into PyArrow.

Description

This module provides one function: read for reading both .xlsx and .xlsb files.

Simple usage showcase:

import polars as pl
import pandas as pd

from rxls.reader import read

polars_df = pl.from_arrow(read(some_file, header=True))
pandas_df = read(some_file, header=True).to_pandas()

Advanced usage examples.

Some file has no header:

from rxls.reader import read

table = read(some_file, header=0)  # produce `Unnamed: 0` ... `Unnamed: N` columns
table = read(some_file, header=['a', 'b', 'c'])  # produce table with columns `a`, `b`, `c`, and raise an exception, when column count != 3.

Header is placed in unknown row (but we know, it is in top-30 and one cell of it contains word timestamp)

from rxls.reader import read

table = read(some_file, header=True, lookup_head='^timestamp$')  # First row of header must contain cell with word `timestamp`

Some rows in column timestamp contains unwanted cells (with subscripts, totals, or something else), but column row_id doesn't:

from rxls.reader import read

table = read(some_file, header=True, row_filters='^row_id$')  # After `header`, will be only rows, that contains non-null value in column `row_id`

Some rows in column timestamp contains unwanted cells, column row_id doesn't, but we will ensure, that table has also non-null column client:

from rxls.reader import read

table = read(some_file, header=True, row_filters=['^row_id$', '^client$'])  # After `header`, will be only rows, that contains non-null value in columns `row_id` AND `client`

Some rows in a sheet contains unwanted cells, but we know, that row_id column cells are not null, or client column cells are not null either:

from rxls.reader import read

table = read(some_file, header=True, row_filters=['^row_id$', '^client$'], row_filters_strategy='or')  # After `header`, will be only rows, that contains non-null value in columns `row_id` OR `client`

For tracking rows events:

from rxls.reader import read
from tqdm import tqdm

with tqdm(desc='Reading some file', unit=' rows') as tq:
    table = read(some_file, header=True, row_filters=['^row_id$', '^client$'], row_filters_strategy='or', row_callback=tq.update)

Some algorithm explanations:

RXLS uses only TWO datatypes for cells of excel files: NUMERIC and STRING. It is because all other datatypes, as I discovered, are just styles, applied to them (named representation in RXLS).

So, when MS Excel show us a date (or time, or timestamp, or duration interval), it's actually a floating-point number "under the hood". This number, when is between [0.0, 1.0), presents part of day (0.5 = 12 hours, 0.125 = 3 hours and vice versa), which useful for time and duration datatypes, and, when bigger or equals to 1, presents days since 1900 with part of day. For date - decimals will be zero. All temporal cells (date/time, datetime, timedelta) will be converted to timestamp[ms] during prepare step. This behaviour may be overriden with dtypes argument - after all, pyarrow's timestamp[ms] datatype may be convert to date64 or time32['ms'] instantly and without any problems. duration datatype in Excel files, usually, not presented, and pyarrow cannot cast timestamp['ms'] to duration['ms']. For such conversion, you can extract timestamp column and convert it as:

import pyarrow as pa

column_name = 'duration_column'

table = read(some_file, header=True)

table.set_column(
    next(i for i, x in enumerate(table.column_names) if x == column_name),
    pa.field(column_name, pa.duration('ms')),
    table[column_name].combine_chunks().view(pa.duration('ms'))  # original dtype is pyarrow.timestamp['ms'], which is binary equivalent for pyarrow.duration['ms']
)

Strings most common representation is 4-byte little-endian integer (in xlsb) or some number (in xlsx). This number is an index of shared string in special part of Excel file (xl/sharedStrings.[xml|bin]). This special part will be scanned on each read call, before all other steps.

Keeping all this in mind, I've implemented class chunk for keeping original data as-is, until it should be accessed during final table creation. Chunks, actually, are column parts, and contains non-null values with same type and representation. Chunk may be of NULL type, but it contains just count of null values between not-null chunks.

Original data will be contained inside chunk as-is, intact until prepare step:

numpy.NDArray[numpy.float64] - simple numeric values (with optional temporal flag)
numpy.NDArray[numpy.uint32] - for XLSB RkNumber-formatted floating-poing or integer values.
pyarrow.LargeStringArray - for XLSX and XLSB strings, that are not in shared, but presented as-is.
pyarrow.UInt64Array - for XLSX and XLSB shared strings indices.
int - for null chunks: count of null values

RXLS can skip entire chunks (or their parts) before prepare step, when they should be skipped (see lookup_head and row_filters arguments descriptions) - they will never be accessed afterwards.

RXLS performs two main steps for a sheet of Excel file: reading and prepare.

On reading step, no computations will be performed - it's just filling xl_series objects by chunks, keeping original data.

On prepare step, unwanted chunks will be dropped, and others will be prepared as:

numpy.NDArray[numpy.uint32] -> numpy.NDArray[numpy.float64] - when numeric chunk has rknubmer flag.
numpy.NDArray[numpy.float64] -> numpy.NDArray[numpy.int64] - when numeric chunk has temporal flag. count of days with decimal part since 1900 (Windows) -> milliseconds since 1970-01-01 (UNIX timestamp)
pyarrow.UInt64Array -> pyarrow.LargeStringArray - shared strings indices to actual data from shared.
int -> pyarrow.NullArray or pyarrow.Array[T, pyarrow.Scalar[T]] if column has chunks of another common datatype T.

These operations were totally optimized (and may performs much faster with numba along with (or without) tbb - see Dependencies section).

Also, about lookup_head and header with height > 1.

If header has more than 1 row, every column will fill it's topmost empty cells by use corresponding non-empty cell values from left neighbour. All other cells will be leave as is. After this, non-empty cells will be concatenated from top to down with ', ' separator. These operations performs for each column from left to right.

So, RXLS features are:

Lazy and fast column-wise computations
Smart header searching and row filtering
Keeping datatypes and resolve conflicts.

Notes:

At the moment, RXLS just skip empty column (which has no header, nor any data).

Parameters:

Positional only:
- file: path to the file, or BytesIO with file.
- sheet: index or name of sheet for reading (default: 0)
Positional:
- header:
  - bool: True when header is presented in the sheet and have one row height
  - int: 0 when no header is presented in the sheet, N for header of N rows height.
  - list[str]: when no header is presented, but we are know, what it should be.
Keyword-only
- dtypes: Specify datatypes for columns.
  - Sequence[pyarrow.DataType]: when we know dtype for each non-empty column.
  - dict[str, pyarrow.DataType]: when we will override dtype for some columns.
  - pyarrow.DataType: when all columns must be of this dtype.
  - None: for original dtypes of columns.
- skip_cols: Skip some columns by their 0-based indices (A == 0). Performs on reading step.
- skip_rows: Skip some rows on the top of the file. Performs on reading step.
- skip_rows_after_header: Skip some rows after header. Performs on prepare step.
- take_rows: Stop reading after this row (0-based). Performs on reading step.
- take_rows_non_empty: Leave empty rows in resulting table. Performs on reading step.
- lookup_head: Regular expression for smart-search of the first row of header, or column index, where first non-empty cell is the top-level cell of header.
- lookup_size: Count of rows to perform lookup when searching for header. Note: RXLS will raise an exception, if lookup_head with this lookup_size is failed.
- row_filters: Regular expression(s) for columns, which content determines empty and non-empty rows.
- row_filters_strategy: Boolean operator(s) for situations with two or more columns in row_filters.
- float_precision. All numeric values in MS Excel are floating-point under the hood, so, when rounding whole float64 column to this precision gives equal result to just truncate decimals, this column will be converted to int64.
- datetime_formats: One or more formats, which may appears in columns with conflicts.
- conflict_resolve: When column contains two or more datatypes, this is a conflict. When conflict cannot be resolved, whole column will be convert to utf-8. Conflicts may be resolved as:
  - no: All parts in columns with conflicts will be convert to utf8.
  - temporal: Try to convert non-temporal parts of column with some temporal parts to temporal (float64 -> timestamp and utf8 -> timestamp (using default formats (ISO 8601), or as specified in datetime_formats)
  - numeric: Try to convert non-numeric parts of column with some numeric parts to numeric (utf8 -> float64).
  - all: Use both strategies to resolve conflicts. When some parts of column is temporal, try to convert all other parts to temporal (also enable two-step string converting: utf8 -> float64 -> timestamp)
- utf8_type_infers: (WIP) When resulting column is utf-8 and all non-null cells of it passes regular expression of numeric values, convert it to float64 (and, maybe, to int64 after).
- null_values: Advanced list of values, that should be skipped on reading step (or callable predicate for them).
- row_callback: Any callable, which may be called without arguments on each row event. Useful for progress tracking.

Dependencies

Required:

pyarrow>=14.0.2
numpy>=1.24.4
recordclass>=0.21.1
typing_extensions>=4.9.0

Optional:

numba>=0.58.1 (increase import time, but reading speed also increases x3/x4 and up)
tbb>=2021.11.0 (only for numba - additional performance gain)
polars>=0.20.4 (if needs to parse timestamps with milliseconds/microseconds/nanoseconds or AM/PM with timezone)
pandas>=2.0.3 (for pyarrow to_pandas functionality)
tqdm>=4.66.1 (fast progress tracking)

Project details

These details have not been verified by PyPI

Project links

GitHub Statistics

View statistics for this project via Libraries.io, or by using our public dataset on Google BigQuery

Release history Release notifications | RSS feed

This version

0.2.2

Feb 6, 2024

0.2.1

Jan 20, 2024

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

rxls-0.2.2.tar.gz (40.7 kB view hashes)

Uploaded Feb 6, 2024 Source

Built Distribution

rxls-0.2.2-py3-none-any.whl (41.7 kB view hashes)

Uploaded Feb 6, 2024 Python 3

Hashes for rxls-0.2.2.tar.gz

Hashes for rxls-0.2.2.tar.gz
Algorithm	Hash digest
SHA256	`ab5a2eb51b9d02aa182e6195fae6d6c0f1c59bcbd5ff4fb9ba70ec49556535fe`
MD5	`66569f233cf3d90803290d6e7d95f691`
BLAKE2b-256	`75f919ea1882edafc11d30b7dc7adfceacbe6928270a03b7657753d2722ef446`

Hashes for rxls-0.2.2-py3-none-any.whl

Hashes for rxls-0.2.2-py3-none-any.whl
Algorithm	Hash digest
SHA256	`1ecc47151b19a32757ef63b3b0ede5972a0a17f5045139f4a618afc3df1e91c3`
MD5	`caecf850733e961ba2f2740f93b00410`
BLAKE2b-256	`63956b111773fd0ae4b54d5bc9703b9a23bbff71f894110abe40dcae15b10eb6`