Useful data crunching tools for Apache Arrow in Python
Project description
Pyarrow ops
Pyarrow ops is a Python library for data crunching operations directly on the pyarrow.Table class, using only numpy. For convenience, function naming and behavior try to replicate those of the Pandas API. Performance is currently on par with pandas; however, it could be improved significantly by using pyarrow.compute functions or by improving the numpy algorithms.
Installation
Use the package manager pip to install pyarrow_ops.
pip install pyarrow_ops
Usage
See test_func.py for a full runnable example.
import pyarrow as pa
from pyarrow_ops import join, filters, groupby, head, drop_duplicates
# Create pyarrow.Table
t = pa.Table.from_pydict({
    'Animal': ['Falcon', 'Falcon', 'Parrot', 'Parrot', 'Parrot'],
    'Max Speed': [380., 370., 24., 26., 24.]
})
head(t) # Use head to print, like df.head()
# Drop duplicates based on column values
d = drop_duplicates(t, on=['Animal'], keep='first')
# Groupby iterable
for key, value in groupby(t, ['Animal']):
    print(key)
    head(value)
# Group by aggregate functions
g = groupby(t, ['Animal']).median()
g = groupby(t, ['Animal']).sum()
g = groupby(t, ['Animal']).min()
g = groupby(t, ['Animal']).agg({'Max Speed': 'max'})
# Group by window functions
# Use filter predicates using list of tuples (column, operation, value)
f = filters(t, ('Animal', '=', 'Falcon'))
f = filters(t, [('Animal', 'not in', ['Falcon', 'Duck']), ('Max Speed', '<', 25)])
# Join operations (currently performs inner join)
t2 = pa.Table.from_pydict({
    'Animal': ['Falcon', 'Parrot'],
    'Age': [10, 20]
})
j = join(t, t2, on=['Animal'])
Relation to pyarrow
In the future, many of these functions might be made obsolete by enhancements in the pyarrow package, but for now this library is a convenient alternative to switching back and forth between pyarrow and pandas.
Contributing
Pull requests are very welcome; however, I believe in getting 80% of the utility from 20% of the code. I personally get lost reading the tranches of the pandas source code.
Download files
Built Distributions
Hashes for pyarrow_ops-0.0.1-py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | bc20dfdf0da013fc1f2a346f41be8fac8a1a9efbd9ed093344f374eb06ad94cc
MD5 | daa1ce66f661c8d382321505b80ea867
BLAKE2b-256 | aa647f92e4047485caa3d2665558fffe9b98db87a829965fbe6ad7a3e8697cd2
Hashes for pyarrow_ops-0.0.1-1-py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | 09756b7f8a35411f1d0e02b97106f379c7ef4c05d1158381ba290d99574c451c
MD5 | 72b2b2b2bc93d2e039753d08dcd8da22
BLAKE2b-256 | 21bb567ee34da5b78e916ddf8a74b3c16bf7d1795e62864b2b3b0197d0f3bcc7