RXLS
Short for "Read XLS[X|B]".
Reading both XLSX and XLSB files, fast and memory-safe, with Python, into PyArrow.
Description
This module provides a single function, read, for reading both .xlsx and .xlsb files.
Simple usage showcase:
import polars as pl
import pandas as pd
from rxls.reader import read
polars_df = pl.from_arrow(read(some_file, header=True))
pandas_df = read(some_file, header=True).to_pandas()
Advanced usage examples.

The file has no header:

from rxls.reader import read
table = read(some_file, header=0) # produces `Unnamed: 0` ... `Unnamed: N` columns
table = read(some_file, header=['a', 'b', 'c']) # produces a table with columns `a`, `b`, `c`, and raises an exception when the column count != 3
The header is in an unknown row, but we know it is within the top 30 rows and that one of its cells contains the word `timestamp`:

from rxls.reader import read
table = read(some_file, header=True, lookup_head='^timestamp$') # the first row of the header must contain a cell matching `timestamp`
Some rows in the `timestamp` column contain unwanted cells (subscripts, totals, or something else), but the `row_id` column doesn't:

from rxls.reader import read
table = read(some_file, header=True, row_filters='^row_id$') # after `header`, only rows with a non-null value in column `row_id` remain
Some rows in the `timestamp` column contain unwanted cells and the `row_id` column doesn't, but we also want to ensure that the table has a non-null `client` column:

from rxls.reader import read
table = read(some_file, header=True, row_filters=['^row_id$', '^client$']) # after `header`, only rows with non-null values in columns `row_id` AND `client` remain
Some rows in the sheet contain unwanted cells, but we know that either the `row_id` column cells are not null, or the `client` column cells are not null:

from rxls.reader import read
table = read(some_file, header=True, row_filters=['^row_id$', '^client$'], row_filters_strategy='or') # after `header`, only rows with a non-null value in column `row_id` OR `client` remain
For tracking row events:

from rxls.reader import read
from tqdm import tqdm
with tqdm(desc='Reading some file', unit=' rows') as tq:
    table = read(some_file, header=True, row_filters=['^row_id$', '^client$'], row_filters_strategy='or', row_callback=tq.update)
Some algorithm explanations:

RXLS uses only TWO datatypes for the cells of Excel files: NUMERIC and STRING. This is because all other datatypes, as I discovered, are just styles applied to these two (called representation in RXLS).
So when MS Excel shows us a date (or time, or timestamp, or duration interval), it is actually a floating-point number under the hood. When this number is in [0.0, 1.0), it represents a fraction of a day (0.5 = 12 hours, 0.125 = 3 hours, and so on), which is useful for the time and duration datatypes; when it is greater than or equal to 1, it represents days since 1900 plus the fraction of a day. For dates, the decimal part is zero. All temporal cells (date/time, datetime, timedelta) are converted to timestamp[ms] during the prepare step. This behaviour may be overridden with the dtypes argument; after all, pyarrow's timestamp[ms] datatype can be cast to date64 or time32['ms'] instantly and without any problems. The duration datatype is usually not present in Excel files, and pyarrow cannot cast timestamp['ms'] to duration['ms']. For such a conversion, you can extract the timestamp column and convert it as:
import pyarrow as pa
column_name = 'duration_column'
table = read(some_file, header=True)
table = table.set_column(  # Table.set_column returns a new table, so reassign the result
    next(i for i, x in enumerate(table.column_names) if x == column_name),
    pa.field(column_name, pa.duration('ms')),
    table[column_name].combine_chunks().view(pa.duration('ms')),  # original dtype is pyarrow.timestamp['ms'], which is binary-equivalent to pyarrow.duration['ms']
)
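The temporal conversion described above (serial days since 1900 to milliseconds since the 1970 UNIX epoch) can be sketched as follows. This is an illustration only; the constant and function names are not part of RXLS:

```python
import numpy as np

# Excel serial day 25569 corresponds to 1970-01-01 (the UNIX epoch),
# counting from Excel's effective 1899-12-30 day zero.
EXCEL_EPOCH_OFFSET_DAYS = 25569
MS_PER_DAY = 86_400_000

def excel_serial_to_unix_ms(serial: np.ndarray) -> np.ndarray:
    """Convert Excel serial day numbers (float64) to UNIX milliseconds (int64)."""
    return np.round((serial - EXCEL_EPOCH_OFFSET_DAYS) * MS_PER_DAY).astype(np.int64)

# 25569.0 is midnight 1970-01-01; 25569.5 adds half a day (12 hours)
print(excel_serial_to_unix_ms(np.array([25569.0, 25569.5])))  # [0 43200000]
```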
The most common representation of a string is a 4-byte little-endian integer (in XLSB) or a number in text form (in XLSX). This number is an index into the shared strings stored in a special part of the Excel file (xl/sharedStrings.[xml|bin]). This part is scanned on each read call, before all other steps.
Keeping all this in mind, I've implemented the chunk class for keeping the original data as-is until it is accessed during final table creation. Chunks are effectively column parts, each containing non-null values of the same type and representation. A chunk may also be of NULL type, in which case it contains just the count of null values between non-null chunks.

The original data is kept inside a chunk as-is, intact until the prepare step:

- `numpy.NDArray[numpy.float64]`: simple numeric values (with an optional `temporal` flag)
- `numpy.NDArray[numpy.uint32]`: XLSB `RkNumber`-formatted floating-point or integer values
- `pyarrow.LargeStringArray`: XLSX and XLSB strings that are not in `shared` but are stored as-is
- `pyarrow.UInt64Array`: XLSX and XLSB shared string indices
- `int`: for `null` chunks, the count of null values
RXLS can skip entire chunks (or parts of them) before the prepare step when they should be skipped (see the lookup_head and row_filters argument descriptions); they will never be accessed afterwards.

RXLS performs two main steps for a sheet of an Excel file: reading and prepare.

On the reading step, no computations are performed; it just fills xl_series objects with chunks, keeping the original data.

On the prepare step, unwanted chunks are dropped, and the others are prepared as follows:
- `numpy.NDArray[numpy.uint32]` -> `numpy.NDArray[numpy.float64]`: when the numeric chunk has the `rknumber` flag.
- `numpy.NDArray[numpy.float64]` -> `numpy.NDArray[numpy.int64]`: when the numeric chunk has the `temporal` flag (count of days, with a fractional part, since 1900 (Windows) -> milliseconds since 1970-01-01 (UNIX timestamp)).
- `pyarrow.UInt64Array` -> `pyarrow.LargeStringArray`: shared string indices to actual data from `shared`.
- `int` -> `pyarrow.NullArray`, or `pyarrow.Array[T, pyarrow.Scalar[T]]` if the column has chunks of another common datatype `T`.
These operations are heavily optimized (and may perform much faster with numba, with or without tbb; see the Dependencies section).
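The `RkNumber` decoding mentioned above (the `uint32` -> `float64` conversion) can be sketched like this, following the publicly documented [MS-XLSB] RkNumber layout. This is an illustration, not the RXLS code:

```python
import struct

def decode_rk(rk: int) -> float:
    """Decode a 32-bit XLSB RkNumber (sketch of the [MS-XLSB] layout)."""
    f_x100 = rk & 0x01           # bit 0: divide the decoded value by 100
    f_int = rk & 0x02            # bit 1: payload is a signed integer, not a double
    payload = rk & 0xFFFFFFFC    # upper 30 bits carry the value
    if f_int:
        # interpret the upper 30 bits as a signed integer
        value = float(payload >> 2) if payload < 0x80000000 else float((payload >> 2) - (1 << 30))
    else:
        # payload forms the most significant 30 bits of an IEEE 754 double
        value = struct.unpack('<d', struct.pack('<Q', payload << 32))[0]
    return value / 100 if f_x100 else value

print(decode_rk((100 << 2) | 0x02))  # 100.0 (integer RkNumber)
```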
Also, a note about lookup_head and headers with height > 1.

If the header has more than one row, every column fills its topmost empty cells with the corresponding non-empty cell values from its left neighbour. All other cells are left as-is. After this, the non-empty cells are concatenated from top to bottom with a ', ' separator. These operations are performed for each column, from left to right.
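The multi-row header merging described above can be sketched as follows. This is an assumed reimplementation for illustration; `merge_header` is not an RXLS function:

```python
def merge_header(rows: list) -> list:
    """Merge a multi-row header into one row of column names.

    Topmost empty cells are filled from the left neighbour (left to right),
    then non-empty cells are joined top-to-bottom with ', '.
    """
    n_cols = len(rows[0])
    # transpose into columns for per-column processing
    cols = [[rows[r][c] for r in range(len(rows))] for c in range(n_cols)]
    for c in range(1, n_cols):
        for r in range(len(rows)):
            if cols[c][r]:
                break  # stop at the first non-empty cell
            cols[c][r] = cols[c - 1][r]  # fill from the (already filled) left column
    return [', '.join(cell for cell in col if cell) for col in cols]

header_rows = [
    ['Sales', '',     'Costs'],
    ['2023',  '2024', '2023'],
]
print(merge_header(header_rows))  # ['Sales, 2023', 'Sales, 2024', 'Costs, 2023']
```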
So, the RXLS features are:
- Lazy and fast column-wise computations
- Smart header searching and row filtering
- Keeping datatypes and resolving conflicts
Notes:
- At the moment, RXLS simply skips empty columns (those with no header and no data).
Parameters:
- Positional-only:
  - `file`: path to the file, or a `BytesIO` with the file contents.
  - `sheet`: index or name of the sheet to read (default: `0`).
- Positional:
  - `header`:
    - `bool`: `True` when the header is present in the sheet and is one row high.
    - `int`: `0` when no header is present in the sheet, `N` for a header `N` rows high.
    - `list[str]`: when no header is present, but we know what it should be.
- Keyword-only:
  - `dtypes`: specify datatypes for columns.
    - `Sequence[pyarrow.DataType]`: when we know the dtype of each non-empty column.
    - `dict[str, pyarrow.DataType]`: to override the dtypes of some columns.
    - `pyarrow.DataType`: when all columns must be of this dtype.
    - `None`: keep the original dtypes of columns.
  - `skip_cols`: skip some columns by their `0`-based indices (A == 0). Performed on the `reading` step.
  - `skip_rows`: skip some rows at the top of the file. Performed on the `reading` step.
  - `skip_rows_after_header`: skip some rows after the `header`. Performed on the `prepare` step.
  - `take_rows`: stop reading after this row (`0`-based). Performed on the `reading` step.
  - `take_rows_non_empty`: leave empty rows in the resulting table. Performed on the `reading` step.
  - `lookup_head`: regular expression for the smart search of the first row of the `header`, or a `column` index whose first non-empty cell is the top-level cell of the `header`.
  - `lookup_size`: count of rows to scan when searching for the `header`. Note: RXLS raises an exception if `lookup_head` with this `lookup_size` fails.
  - `row_filters`: regular expression(s) for the columns whose content determines empty and non-empty rows.
  - `row_filters_strategy`: boolean operator(s) for situations with two or more columns in `row_filters`.
  - `float_precision`: all numeric values in MS Excel are `floating-point` under the hood, so when rounding a whole `float64` column to this precision gives the same result as just truncating the decimals, the column is converted to `int64`.
  - `datetime_formats`: one or more formats that may appear in columns with `conflicts`.
  - `conflict_resolve`: when a column contains two or more datatypes, this is a `conflict`. When a conflict cannot be resolved, the whole column is converted to `utf-8`. Conflicts may be resolved as:
    - `no`: all parts of columns with `conflicts` are converted to `utf8`.
    - `temporal`: try to convert the non-temporal parts of a column with some temporal parts to temporal (`float64` -> `timestamp` and `utf8` -> `timestamp`, using the default formats (ISO 8601) or those specified in `datetime_formats`).
    - `numeric`: try to convert the non-numeric parts of a column with some numeric parts to numeric (`utf8` -> `float64`).
    - `all`: use both strategies to resolve conflicts. When some parts of a column are temporal, try to convert all other parts to temporal (this also enables two-step string conversion: `utf8` -> `float64` -> `timestamp`).
  - `utf8_type_infers` (WIP): when the resulting column is `utf-8` and all of its non-null cells match a regular expression for `numeric` values, convert it to `float64` (and, maybe, to `int64` afterwards).
  - `null_values`: advanced list of values that should be skipped on the `reading` step (or a `callable` predicate for them).
  - `row_callback`: any callable that can be called without arguments on each row event. Useful for progress tracking.
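The `float_precision` rule described above can be sketched as follows; this is a simplified illustration, not the actual RXLS implementation:

```python
import numpy as np

def is_integral(col: np.ndarray, precision: int = 6) -> bool:
    """True if rounding to `precision` decimals equals plain truncation,
    i.e. the float64 column can safely be converted to int64."""
    return bool(np.all(np.round(col, precision) == np.trunc(col)))

print(is_integral(np.array([1.0, 2.0, 3.0])))  # True  -> convert to int64
print(is_integral(np.array([1.5, 2.0])))       # False -> keep float64
```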
Dependencies
Required:
- pyarrow >= 14.0.2
- numpy >= 1.24.4
- recordclass >= 0.21.1
- typing_extensions >= 4.9.0
Optional:
- numba >= 0.58.1 (increases import time, but reading speed also increases x3/x4 and up)
- tbb >= 2021.11.0 (only for numba; additional performance gain)
- polars >= 0.20.4 (if you need to parse timestamps with milliseconds/microseconds/nanoseconds, or AM/PM with a timezone)
- pandas >= 2.0.3 (for pyarrow's to_pandas functionality)
- tqdm >= 4.66.1 (fast progress tracking)
File details
Details for the file rxls-0.2.2.tar.gz.
File metadata
- Download URL: rxls-0.2.2.tar.gz
- Upload date:
- Size: 40.7 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.9.18
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | ab5a2eb51b9d02aa182e6195fae6d6c0f1c59bcbd5ff4fb9ba70ec49556535fe |
| MD5 | 66569f233cf3d90803290d6e7d95f691 |
| BLAKE2b-256 | 75f919ea1882edafc11d30b7dc7adfceacbe6928270a03b7657753d2722ef446 |
File details
Details for the file rxls-0.2.2-py3-none-any.whl.
File metadata
- Download URL: rxls-0.2.2-py3-none-any.whl
- Upload date:
- Size: 41.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.9.18
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 1ecc47151b19a32757ef63b3b0ede5972a0a17f5045139f4a618afc3df1e91c3 |
| MD5 | caecf850733e961ba2f2740f93b00410 |
| BLAKE2b-256 | 63956b111773fd0ae4b54d5bc9703b9a23bbff71f894110abe40dcae15b10eb6 |