
Useful data crunching tools for pyarrow

Project description

Pyarrow ops

Pyarrow ops is a Python library for data crunching operations directly on the pyarrow.Table class, implemented in numpy & Cython. For convenience, function naming and behavior try to replicate the pandas API. Join / groupby performance is slightly slower than that of pandas, especially on multi-column joins.

Current use cases:

  • Data operations like joins, groupby (aggregations), filters & drop_duplicates
  • (Very fast) reusable pre-processing for ML applications
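To illustrate the core grouping idea behind these operations, here is a pure-Python sketch that groups row indices of a dict-of-lists "table" by key columns. This is only a conceptual model; pyarrow_ops implements this in numpy & Cython over pyarrow buffers, and the function name here is made up for illustration.

```python
from collections import defaultdict

def groupby_indices(table, keys):
    """Group row indices of a dict-of-lists table by the values in `keys`.

    Conceptual sketch only: the real library operates on pyarrow.Table
    columns with numpy/Cython, not Python loops."""
    groups = defaultdict(list)
    n = len(next(iter(table.values())))  # number of rows
    for i in range(n):
        # The group key is the tuple of this row's values in the key columns
        groups[tuple(table[k][i] for k in keys)].append(i)
    return dict(groups)

t = {
    'Animal': ['Falcon', 'Falcon', 'Parrot'],
    'Max Speed': [380., 370., 24.],
}
result = groupby_indices(t, ['Animal'])
```

Once the row indices per group are known, aggregations (sum, max, ...) reduce to gathering each group's rows and reducing a column.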

Installation

Use the package manager pip to install pyarrow_ops.

pip install pyarrow_ops

Usage

See the test_*.py files for runnable test examples.

Data operations:

import pyarrow as pa 
from pyarrow_ops import join, filters, groupby, head, drop_duplicates

# Create pyarrow.Table
t = pa.Table.from_pydict({
    'Animal': ['Falcon', 'Falcon', 'Parrot', 'Parrot', 'Parrot'],
    'Max Speed': [380., 370., 24., 26., 24.]
})
head(t) # Use head to print, like df.head()

# Drop duplicates based on column values
d = drop_duplicates(t, on=['Animal'], keep='first')

# Groupby iterable
for key, value in groupby(t, ['Animal']):
    print(key)
    head(value)

# Group by aggregate functions
g = groupby(t, ['Animal']).sum()
g = groupby(t, ['Animal']).agg({'Max Speed': 'max'})

# Filter using a list of predicate tuples (column, operation, value)
f = filters(t, [('Animal', 'not in', ['Falcon', 'Duck']), ('Max Speed', '<', 25)])

# Join operations (currently performs inner join)
t2 = pa.Table.from_pydict({
    'Animal': ['Falcon', 'Parrot'],
    'Age': [10, 20]
})
j = join(t, t2, on=['Animal'])
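The inner join above can be understood as a classic hash join: build a hash index over the right table's key columns, then probe it with each left row. The sketch below is a pure-Python illustration of that idea under the assumption of dict-of-lists tables; it is not the library's numpy/Cython implementation, and the function name is invented here.

```python
from collections import defaultdict

def inner_join(left, right, on):
    """Hash-based inner join of two dict-of-lists tables on key columns `on`.

    Conceptual sketch of what join(t, t2, on=[...]) computes."""
    # Build phase: index right-table rows by their join key
    index = defaultdict(list)
    n_right = len(next(iter(right.values())))
    for j in range(n_right):
        index[tuple(right[k][j] for k in on)].append(j)

    # Probe phase: emit one output row per matching (left, right) pair
    out = {k: [] for k in list(left) + [c for c in right if c not in left]}
    n_left = len(next(iter(left.values())))
    for i in range(n_left):
        for j in index.get(tuple(left[k][i] for k in on), []):
            for k in left:
                out[k].append(left[k][i])
            for k in right:
                if k not in left:
                    out[k].append(right[k][j])
    return out

t = {
    'Animal': ['Falcon', 'Falcon', 'Parrot', 'Parrot', 'Parrot'],
    'Max Speed': [380., 370., 24., 26., 24.],
}
t2 = {'Animal': ['Falcon', 'Parrot'], 'Age': [10, 20]}
joined = inner_join(t, t2, on=['Animal'])
```

Multi-column joins are more expensive mainly because the key tuples are wider, which is one reason the multi-column case trails pandas.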

ML Preprocessing (note: personal tests showed a ~5x speed-up compared to pandas on large datasets)

import pyarrow as pa 
from pyarrow_ops import head, TableCleaner

# Training data
t1 = pa.Table.from_pydict({
    'Animal': ['Falcon', 'Falcon', 'Parrot', 'Parrot', 'Parrot'],
    'Max Speed': [380., 370., None, 26., 24.],
    'Value': [2000, 1500, 10, 30, 20],
})

# Create TableCleaner & register columns to be processed
cleaner = TableCleaner()
cleaner.register_numeric('Max Speed', impute='min', clip=True)
cleaner.register_label('Animal', categories=['Goose', 'Falcon'])
cleaner.register_one_hot('Animal')

# Clean table and split into train/test
X, y = cleaner.clean_table(t1, label='Value')
X_train, X_test, y_train, y_test = cleaner.split(X, y)

# Train a model + Save cleaner settings
cleaner_dict = cleaner.to_dict()

# Prediction data
t2 = pa.Table.from_pydict({
    'Animal': ['Falcon', 'Goose', 'Parrot', 'Parrot'],
    'Max Speed': [380., 10., None, 26.]
})
new_cleaner = TableCleaner().from_dict(cleaner_dict)
X_pred = new_cleaner.clean_table(t2)
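The fit-then-transform pattern used by TableCleaner (learn parameters on training data, serialize them, reapply on prediction data) can be sketched in pure Python for a single numeric column. The behavior assumed here for impute='min' and clip=True (fill missing values with the training minimum, clamp to the training range) is an interpretation for illustration, not the library's documented semantics, and the function names are invented.

```python
def fit_numeric(values):
    """Learn imputation and clipping bounds from training values,
    ignoring missing entries (None). Sketch of what a
    register_numeric('col', impute='min', clip=True) step might store."""
    present = [v for v in values if v is not None]
    return {'impute': min(present), 'lo': min(present), 'hi': max(present)}

def transform_numeric(values, params):
    """Apply the learned parameters: impute missing values, then clip
    to the training range."""
    filled = [params['impute'] if v is None else v for v in values]
    return [min(max(v, params['lo']), params['hi']) for v in filled]

# Fit on training data (mirrors the 'Max Speed' column above)
params = fit_numeric([380., 370., None, 26., 24.])

# Reapply on prediction data: 10. is clipped up to 24., None is imputed
cleaned = transform_numeric([380., 10., None, 26.], params)
```

Because the fitted parameters are a plain dict, they serialize naturally, which is what makes the to_dict() / from_dict() round trip possible.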

To Do's

  • Improve groupby speed by not creating copies of the table
  • Add ML cleaning class
  • Improve groupby speed by avoiding for loops
  • Improve join speed by moving code to C
  • Add unit tests using pytest
  • Add window functions on groupby
  • Add more join options (left, right, outer, full, cross)
  • Allow functions to be methods of pa.Table* (t.groupby(...))

*One of the main difficulties is that the pyarrow classes are implemented in C and do not expose a __dict__, which hinders inheritance and attaching new methods to them.

Relation to pyarrow

In the future many of these functions might be made obsolete by enhancements in the pyarrow package itself, but for now pyarrow_ops is a convenient alternative to switching back and forth between pyarrow and pandas.

Contributing

Pull requests are very welcome. However, I believe 80% of the utility lies in 20% of the code; I personally get lost reading the depths of the pandas source code. If you would like to seriously improve this work, please let me know!
