Useful data crunching tools for Apache Arrow in Python

Pyarrow ops

Pyarrow ops is a Python library for data crunching operations directly on the pyarrow.Table class, using only numpy. For convenience, function naming and behavior try to replicate those of the Pandas API. Performance is currently on par with pandas, but it could be improved significantly by using pyarrow.compute functions or by improving the numpy algorithms.

Installation

Use the package manager pip to install pyarrow_ops.

pip install pyarrow_ops

Usage

See test_func.py for a full runnable test example.

import pyarrow as pa 
from pyarrow_ops import join, filters, groupby, head, drop_duplicates

# Create pyarrow.Table
t = pa.Table.from_pydict({
    'Animal': ['Falcon', 'Falcon', 'Parrot', 'Parrot', 'Parrot'],
    'Max Speed': [380., 370., 24., 26., 24.]
})
head(t) # Use head to print, like df.head()

# Drop duplicates based on column values
d = drop_duplicates(t, on=['Animal'], keep='first')

# Groupby iterable
for key, value in groupby(t, ['Animal']):
    print(key)
    head(value)

# Group by aggregate functions
g = groupby(t, ['Animal']).median()
g = groupby(t, ['Animal']).sum()
g = groupby(t, ['Animal']).min()
g = groupby(t, ['Animal']).agg({'Max Speed': 'max'})

# Group by window functions

# Filter with predicates given as a tuple or a list of tuples (column, operation, value)
f = filters(t, ('Animal', '=', 'Falcon'))
f = filters(t, [('Animal', 'not in', ['Falcon', 'Duck']), ('Max Speed', '<', 25)])

# Join operations (currently performs inner join)
t2 = pa.Table.from_pydict({
    'Animal': ['Falcon', 'Parrot'],
    'Age': [10, 20]
})
j = join(t, t2, on=['Animal'])
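For readers unfamiliar with the term, an inner join keeps only the rows whose key appears in both tables. The sketch below shows that behavior on plain column dictionaries using the standard library only. `inner_join` is a hypothetical helper, not part of pyarrow_ops, and it assumes a single key column and no other column names shared between the two inputs.

```python
from collections import defaultdict


def inner_join(left, right, on):
    """Hypothetical helper: inner-join two column dicts on the key column `on`."""
    # Index the right-hand row positions by key value
    index = defaultdict(list)
    for i, k in enumerate(right[on]):
        index[k].append(i)

    out = {name: [] for name in list(left) + [c for c in right if c != on]}
    for li, k in enumerate(left[on]):
        # Left rows whose key has no match on the right are dropped
        for ri in index.get(k, []):
            for name in left:
                out[name].append(left[name][li])
            for name in right:
                if name != on:
                    out[name].append(right[name][ri])
    return out


left = {'Animal': ['Falcon', 'Falcon', 'Parrot', 'Duck'],
        'Max Speed': [380., 370., 24., 10.]}
right = {'Animal': ['Falcon', 'Parrot'], 'Age': [10, 20]}
joined = inner_join(left, right, 'Animal')
```

Here the 'Duck' row is absent from the result because it has no match in the right-hand table.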

Relation to pyarrow

In the future, many of these functions might be made obsolete by enhancements in the pyarrow package itself, but for now this library is a convenient alternative to switching back and forth between pyarrow and pandas.

Contributing

Pull requests are very welcome; however, I believe in getting 80% of the utility from 20% of the code. I personally get lost in the trenches of the pandas source code.
