Useful data crunching tools for Apache Arrow in Python

Pyarrow ops

Pyarrow ops is a Python library for data crunching operations directly on the pyarrow.Table class, using only numpy. For convenience, function naming and behavior try to replicate those of the pandas API. Performance is currently on par with pandas, but it could be improved significantly by using pyarrow.compute functions or by improving the numpy algorithms.
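
To give a feel for what "using only numpy" can mean, here is a minimal sketch of a grouped sum on plain numpy arrays. It is not the actual pyarrow_ops implementation: `group_sum` is a hypothetical helper, and it assumes a single key column and a numeric value column that have already been converted to numpy arrays.

```python
import numpy as np

def group_sum(keys, values):
    """Sum `values` per distinct entry of `keys` (hypothetical helper)."""
    # np.unique returns the sorted distinct keys and, for every row,
    # the index of that row's key within the sorted array.
    uniques, inverse = np.unique(keys, return_inverse=True)
    # np.bincount adds up the weights that share the same group index.
    sums = np.bincount(inverse, weights=values, minlength=len(uniques))
    return uniques, sums

animals = np.array(['Falcon', 'Falcon', 'Parrot', 'Parrot', 'Parrot'])
speeds = np.array([380., 370., 24., 26., 24.])

uniques, sums = group_sum(animals, speeds)
print(dict(zip(uniques.tolist(), sums.tolist())))
# {'Falcon': 750.0, 'Parrot': 74.0}
```

The same pattern (factorize the keys, then reduce per group index) extends to other aggregations, although min, max and median need a different reduction than `np.bincount`.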

Installation

Use the package manager pip to install pyarrow_ops.

pip install pyarrow_ops

Usage

See test_func.py for a full, runnable test example.

import pyarrow as pa 
from pyarrow_ops import join, filters, groupby, head, drop_duplicates

# Create pyarrow.Table
t = pa.Table.from_pydict({
    'Animal': ['Falcon', 'Falcon', 'Parrot', 'Parrot', 'Parrot'],
    'Max Speed': [380., 370., 24., 26., 24.]
})
head(t) # Use head to print, like df.head()

# Drop duplicates based on column values
d = drop_duplicates(t, on=['Animal'], keep='first')

# Groupby iterable
for key, value in groupby(t, ['Animal']):
    print(key)
    head(value)

# Group by aggregate functions
g = groupby(t, ['Animal']).median()
g = groupby(t, ['Animal']).sum()
g = groupby(t, ['Animal']).min()
g = groupby(t, ['Animal']).agg({'Max Speed': 'max'})

# Group by window functions

# Filter with predicates given as a tuple or a list of tuples (column, operation, value)
f = filters(t, ('Animal', '=', 'Falcon'))
f = filters(t, [('Animal', 'not in', ['Falcon', 'Duck']), ('Max Speed', '<', 25)])

# Join operations (currently performs inner join)
t2 = pa.Table.from_pydict({
    'Animal': ['Falcon', 'Parrot'],
    'Age': [10, 20]
})
j = join(t, t2, on=['Animal'])
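
The filter predicates used above are (column, operation, value) tuples. The sketch below shows one way such predicates could be turned into a boolean row mask with numpy. It is an illustration only, not the library's code: `OPS` and `predicate_mask` are hypothetical names, the columns are passed as a plain dict of numpy arrays, and combining several predicates with logical AND is an assumption.

```python
import operator
import numpy as np

# Hypothetical mapping from operation strings to functions on numpy arrays.
OPS = {
    '=': operator.eq,
    '!=': operator.ne,
    '<': operator.lt,
    '<=': operator.le,
    '>': operator.gt,
    '>=': operator.ge,
    'in': lambda arr, values: np.isin(arr, values),
    'not in': lambda arr, values: ~np.isin(arr, values),
}

def predicate_mask(columns, predicates):
    """Return a boolean mask of the rows that satisfy every predicate."""
    # Accept a single tuple as shorthand for a one-element list.
    if isinstance(predicates, tuple):
        predicates = [predicates]
    n_rows = len(next(iter(columns.values())))
    mask = np.ones(n_rows, dtype=bool)
    for column, op, value in predicates:
        # Assumption: predicates are combined with logical AND.
        mask &= OPS[op](columns[column], value)
    return mask

columns = {
    'Animal': np.array(['Falcon', 'Falcon', 'Parrot', 'Parrot', 'Parrot']),
    'Max Speed': np.array([380., 370., 24., 26., 24.]),
}

mask = predicate_mask(columns, [('Animal', 'not in', ['Falcon', 'Duck']),
                                ('Max Speed', '<', 25)])
print(mask.tolist())
# [False, False, True, False, True]
```

A mask like this can then be used to select rows, for example with `pyarrow.Table.filter`.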

Relation to pyarrow

In the future, many of these functions might be made obsolete by enhancements in the pyarrow package, but for now this library is a convenient alternative to switching back and forth between pyarrow and pandas.

Contributing

Pull requests are very welcome; however, I believe in delivering 80% of the utility with 20% of the code. I personally get lost reading the tranches of the pandas source code.
