Useful data crunching tools for pyarrow
Pyarrow ops
Pyarrow ops is a Python library for data crunching operations directly on the pyarrow.Table class, implemented in numpy & Cython. For convenience, function naming and behavior try to replicate those of the Pandas API. Join / Groupby performance is slightly slower than that of pandas, especially on multi-column joins.
Current use cases:
- Data operations like joins, groupby (aggregations), filters & drop_duplicates
- (Very fast) reusable pre-processing for ML applications
Installation
Use the package manager pip to install pyarrow_ops.
pip install pyarrow_ops
Usage
See test_*.py for runnable test examples
Data operations:
import pyarrow as pa
from pyarrow_ops import join, filters, groupby, head, drop_duplicates
# Create pyarrow.Table
t = pa.Table.from_pydict({
    'Animal': ['Falcon', 'Falcon', 'Parrot', 'Parrot', 'Parrot'],
    'Max Speed': [380., 370., 24., 26., 24.]
})
head(t) # Use head to print, like df.head()
# Drop duplicates based on column values
d = drop_duplicates(t, on=['Animal'], keep='first')
# Groupby iterable
for key, value in groupby(t, ['Animal']):
    print(key)
    head(value)
# Group by aggregate functions
g = groupby(t, ['Animal']).sum()
g = groupby(t, ['Animal']).agg({'Max Speed': 'max'})
# Filter with predicates given as a list of (column, operation, value) tuples
f = filters(t, [('Animal', 'not in', ['Falcon', 'Duck']), ('Max Speed', '<', 25)])
# Join operations (currently performs inner join)
t2 = pa.Table.from_pydict({
    'Animal': ['Falcon', 'Parrot'],
    'Age': [10, 20]
})
j = join(t, t2, on=['Animal'])
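To make the predicate format above concrete, here is a minimal pure-Python sketch of how a list of (column, operation, value) tuples could be evaluated against columnar data. It is not how pyarrow_ops implements filters: the `filter_rows` helper and the `OPS` table are hypothetical names for this illustration, the set of supported operations is a guess, and I am assuming that all predicates are combined with AND.

```python
import operator

# Hypothetical operator table for this sketch; the real library may support a different set.
OPS = {
    "=": operator.eq,
    "!=": operator.ne,
    "<": operator.lt,
    "<=": operator.le,
    ">": operator.gt,
    ">=": operator.ge,
    "in": lambda cell, values: cell in values,
    "not in": lambda cell, values: cell not in values,
}


def filter_rows(columns, predicates):
    """Keep the rows for which every (column, operation, value) predicate holds.

    `columns` is a plain dict of column name -> list of values (a stand-in for a table).
    Predicates are combined with AND, which is an assumption of this sketch.
    """
    n_rows = len(next(iter(columns.values())))
    keep = [
        all(OPS[op](columns[col][i], value) for col, op, value in predicates)
        for i in range(n_rows)
    ]
    # Apply the same boolean mask to every column.
    return {
        name: [v for v, k in zip(values, keep) if k]
        for name, values in columns.items()
    }


table = {
    "Animal": ["Falcon", "Falcon", "Parrot", "Parrot", "Parrot"],
    "Max Speed": [380.0, 370.0, 24.0, 26.0, 24.0],
}
result = filter_rows(
    table,
    [("Animal", "not in", ["Falcon", "Duck"]), ("Max Speed", "<", 25)],
)
print(result)  # {'Animal': ['Parrot', 'Parrot'], 'Max Speed': [24.0, 24.0]}
```

Building one boolean mask and applying it to every column keeps the columns aligned, which is the same idea a columnar library would follow with vectorized masks instead of Python loops.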
ML Preprocessing (note: in my own tests this was roughly 5x faster than pandas on large datasets)
import pyarrow as pa
from pyarrow_ops import head, TableCleaner
# Training data
t1 = pa.Table.from_pydict({
    'Animal': ['Falcon', 'Falcon', 'Parrot', 'Parrot', 'Parrot'],
    'Max Speed': [380., 370., None, 26., 24.],
    'Value': [2000, 1500, 10, 30, 20],
})
# Create TableCleaner & register columns to be processed
cleaner = TableCleaner()
cleaner.register_numeric('Max Speed', impute='min', clip=True)
cleaner.register_label('Animal', categories=['Goose', 'Falcon'])
cleaner.register_one_hot('Animal')
# Clean table and split into train/test
X, y = cleaner.clean_table(t1, label='Value')
X_train, X_test, y_train, y_test = cleaner.split(X, y)
# Train a model + Save cleaner settings
cleaner_dict = cleaner.to_dict()
# Prediction data
t2 = pa.Table.from_pydict({
    'Animal': ['Falcon', 'Goose', 'Parrot', 'Parrot'],
    'Max Speed': [380., 10., None, 26.]
})
new_cleaner = TableCleaner().from_dict(cleaner_dict)
X_pred = new_cleaner.clean_table(t2)
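The point of saving the cleaner settings is that prediction data gets exactly the same treatment as the training data. The sketch below illustrates that idea for a single numeric column in plain Python. `NumericRule` is a hypothetical class, not part of pyarrow_ops, and I am assuming that impute='min' fills missing values with the training minimum and that clip=True limits values to the range seen during training; the real TableCleaner may behave differently.

```python
class NumericRule:
    """Hypothetical stand-in for one numeric column rule; not the TableCleaner API."""

    def __init__(self, lo=None, hi=None):
        self.lo = lo  # smallest value seen in training
        self.hi = hi  # largest value seen in training

    def fit(self, values):
        """Learn the value range from training data, ignoring missing entries."""
        seen = [v for v in values if v is not None]
        self.lo, self.hi = min(seen), max(seen)
        return self

    def transform(self, values):
        """Impute missing values with the training minimum, then clip to [lo, hi]."""
        out = []
        for v in values:
            if v is None:
                v = self.lo
            out.append(min(max(v, self.lo), self.hi))
        return out

    def to_dict(self):
        """Export the learned settings so they can be stored next to a model."""
        return {"lo": self.lo, "hi": self.hi}

    @classmethod
    def from_dict(cls, settings):
        """Rebuild a rule from stored settings without touching training data."""
        return cls(lo=settings["lo"], hi=settings["hi"])


# Fit on the training column, then save the settings.
rule = NumericRule().fit([380.0, 370.0, None, 26.0, 24.0])
settings = rule.to_dict()

# Later: restore the rule and apply it to prediction data.
restored = NumericRule.from_dict(settings)
cleaned = restored.transform([380.0, 10.0, None, 26.0])
print(cleaned)  # [380.0, 24.0, 24.0, 26.0]
```

Note that the value 10.0 is clipped up to 24.0 because it falls below anything seen in training; whether that is desirable depends on the model.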
To Do's
- Improve groupby speed by not creating copies of the table
- Add ML cleaning class
- Improve speed of groupby by avoiding for loops
- Improve join speed by moving code to C
- Add unit tests using pytest
- Add window functions on groupby
- Add more join options (left, right, outer, full, cross)
- Allow functions to be used as methods of pa.Table* (t.groupby(...))
*One of the main difficulties is that the pyarrow classes are written in C and do not have a __dict__ attribute, which hinders inheritance and adding methods.
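The limitation in the footnote can be illustrated with Python's built-in list, which is also implemented in C. I am assuming pyarrow.Table rejects new attributes in a similar way; this sketch does not import pyarrow to check. The wrapper class `Rows` and the `groupby` helper below are hypothetical and only show one possible workaround: wrap the object and delegate everything else to it.

```python
def groupby(rows, key):
    """Group a list of dict rows by the value stored under `key`."""
    groups = {}
    for row in rows:
        groups.setdefault(row[key], []).append(row)
    return groups


# Attaching a method to a C-implemented type fails with a TypeError.
try:
    list.groupby = groupby
    patched = True
except TypeError:
    patched = False
print(patched)  # False


class Rows:
    """Hypothetical thin wrapper that adds groupby and delegates the rest."""

    def __init__(self, data):
        self._data = data

    def groupby(self, key):
        return groupby(self._data, key)

    def __getattr__(self, name):
        # Only called when normal lookup fails: forward to the wrapped list.
        return getattr(self._data, name)


rows = Rows([
    {"Animal": "Falcon", "Max Speed": 380.0},
    {"Animal": "Parrot", "Max Speed": 24.0},
    {"Animal": "Parrot", "Max Speed": 26.0},
])
groups = rows.groupby("Animal")
print(sorted(groups))  # ['Falcon', 'Parrot']
print(rows.count({"Animal": "Falcon", "Max Speed": 380.0}))  # 1, delegated to list.count
```

The cost of this approach is that every function expecting the original type needs the wrapped object unpacked first, which is part of why plain functions such as groupby(t, ...) are the simpler choice for now.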
Relation to pyarrow
In the future, many of these functions might be made obsolete by enhancements in the pyarrow package, but for now this is a convenient alternative to switching back and forth between pyarrow and pandas.
Contributing
Pull requests are very welcome; however, I believe in getting 80% of the utility from 20% of the code. I personally get lost reading through the trenches of the pandas source code. If you would like to seriously improve this work, please let me know!