
Tafra: innards of a dataframe


The tafra began life as a thought experiment: how could we reduce the idea of a dataframe (as expressed in libraries like pandas or languages like R) to its useful essence, while carving away the cruft? The original proof of concept stopped at “group by”.

This library expands on the proof of concept to produce a practically useful tafra, which we hope you may find to be a helpful lightweight substitute for certain uses of pandas.

A tafra is, more-or-less, a set of named columns or dimensions. Each of these is a typed numpy array of consistent length, representing the values for each column by rows.
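As a rough mental model (a sketch only, not tafra's implementation), such a structure can be pictured as a dictionary of equal-length columns. The `ColumnStore` class below is hypothetical, and it uses plain Python lists in place of typed numpy arrays so that it runs without any dependencies:

```python
from typing import Any, Dict, Iterator, List, Tuple


class ColumnStore:
    """Hypothetical stand-in for a tafra: named columns of equal length."""

    def __init__(self, data: Dict[str, List[Any]]) -> None:
        # Every column must have the same number of rows.
        lengths = {len(col) for col in data.values()}
        if len(lengths) > 1:
            raise ValueError(f'columns have unequal lengths: {sorted(lengths)}')
        self.data = dict(data)
        self.rows = lengths.pop() if lengths else 0

    def to_records(self) -> Iterator[Tuple[Any, ...]]:
        """Yield one tuple per row by zipping the columns together."""
        return zip(*self.data.values())


store = ColumnStore({'x': [1, 2, 3, 4], 'y': ['one', 'two', 'one', 'two']})
print(store.rows)
print(list(store.to_records()))
```

Storing the data by column keeps each column contiguous and uniformly typed, while row-wise views such as records are derived on demand.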

The library provides lightweight syntax for manipulating rows and columns, support for managing data types, iterators for rows and sub-frames, pandas-like “transform” support and conversion from pandas DataFrames, and SQL-style “group by” and join operations.

Tafra

  • Aggregations: Union, GroupBy, Transform, IterateBy, InnerJoin, LeftJoin, CrossJoin
  • Aggregation Helpers: union, union_inplace, group_by, transform, iterate_by, inner_join, left_join, cross_join
  • Constructors: as_tafra, from_dataframe, from_series, from_records
  • SQL Readers: read_sql, read_sql_chunks
  • Destructors: to_records, to_list, to_tuple, to_array, to_pandas
  • Properties: rows, columns, data, dtypes, size, ndim, shape
  • Iter Methods: iterrows, itertuples, itercols
  • Functional Methods: row_map, tuple_map, col_map, pipe
  • Dict-like Methods: keys, values, items, get, update, update_inplace, update_dtypes, update_dtypes_inplace
  • Other Helper Methods: select, head, copy, rename, rename_inplace, coalesce, coalesce_inplace, _coalesce_dtypes, delete, delete_inplace
  • Printer Methods: pprint, pformat, to_html
  • Indexing Methods: _slice, _index, _ndindex

Getting Started

Install the library with pip:

pip install tafra

A short example

>>> import numpy as np

>>> from tafra import Tafra

>>> t = Tafra({
...    'x': np.array([1, 2, 3, 4]),
...    'y': np.array(['one', 'two', 'one', 'two'], dtype='object'),
... })

>>> print(t.pformat())
Tafra(data = {
 'x': array([1, 2, 3, 4]),
 'y': array(['one', 'two', 'one', 'two'])},
dtypes = {
 'x': 'int', 'y': 'object'},
rows = 4)

>>> print('List:', '\n', t.to_list())
List:
 [array([1, 2, 3, 4]), array(['one', 'two', 'one', 'two'], dtype=object)]

>>> print('Records:', '\n', tuple(t.to_records()))
Records:
 ((1, 'one'), (2, 'two'), (3, 'one'), (4, 'two'))

>>> gb = t.group_by(
...     ['y'], {'x': sum}
... )

>>> print('Group By:', '\n', gb.pformat())
Group By:
Tafra(data = {
 'x': array([4, 6]), 'y': array(['one', 'two'])},
dtypes = {
 'x': 'int', 'y': 'object'},
rows = 2)

Flexibility

Have some code that works with pandas, or just a way of doing things that you prefer? tafra is flexible:

>>> import pandas as pd

>>> df = pd.DataFrame(np.c_[
...     np.array([1, 2, 3, 4]),
...     np.array(['one', 'two', 'one', 'two'])
... ], columns=['x', 'y'])

>>> t = Tafra.from_dataframe(df)

And going back is just as simple:

>>> df = pd.DataFrame(t.data)

Timings

In this case, lightweight also means performant. Beyond any additional features added to the library, tafra should provide the necessary base for organizing data structures for numerical processing. One of the most important aspects is fast access to the data itself. By minimizing the abstraction around the underlying numpy arrays, tafra runs roughly an order of magnitude faster than pandas in the read, assignment, and grouping benchmarks below.

  • Important note: If you assign directly to the Tafra.data or Tafra._data attributes, you must call Tafra._coalesce_dtypes afterwards in order to ensure the typing is consistent.

Construct a Tafra and a DataFrame:

>>> tf = Tafra({
...     'x': np.array([1., 2., 3., 4., 5., 6.]),
...     'y': np.array(['one', 'two', 'one', 'two', 'one', 'two'], dtype='object'),
...     'z': np.array([0, 0, 0, 1, 1, 1])
... })

>>> df = pd.DataFrame(tf.data)

Read Operations

Direct access:

>>> %timeit x = tf._data['x']
55.3 ns ± 5.64 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

Indirect with some penalty to support Tafra slicing and numpy’s advanced indexing:

>>> %timeit x = tf['x']
219 ns ± 71.6 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

pandas timing:

>>> %timeit x = df['x']
1.55 µs ± 105 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In pandas, df['x'] is the fastest method for accessing the column's data among the alternatives of df.values, df.to_numpy(), and df.loc[].

Assignment Operations

Direct access is not recommended as it avoids the validation steps, but it does provide fast access to the data attribute:

>>> x = np.arange(6)

>>> %timeit tf._data['x'] = x
65 ns ± 5.55 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

Indirect access has a performance penalty due to the validation checks that ensure consistency of the tafra:

>>> %timeit tf['x'] = x
7.39 µs ± 950 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

Even so, there is considerable performance improvement over pandas.

pandas timing:

>>> %timeit df['x'] = x
47.8 µs ± 3.53 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
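The trade-off between the two tafra assignment paths can be illustrated with a small, hypothetical container. This is a sketch under stated assumptions, not tafra's actual code: `CheckedColumns` is an invented name, it holds plain lists rather than numpy arrays, and its only validation is a row-count check.

```python
import timeit
from typing import Any, Dict, List


class CheckedColumns:
    """Hypothetical container: item assignment validates, `_data` does not."""

    def __init__(self, data: Dict[str, List[Any]]) -> None:
        self._data = dict(data)
        self._rows = len(next(iter(data.values()))) if data else 0

    def __getitem__(self, name: str) -> List[Any]:
        return self._data[name]

    def __setitem__(self, name: str, column: List[Any]) -> None:
        # Validation step: reject a column whose length would break the frame.
        if len(column) != self._rows:
            raise ValueError(f'expected {self._rows} rows, got {len(column)}')
        self._data[name] = column


cc = CheckedColumns({'x': [1.0, 2.0, 3.0], 'z': [0, 0, 1]})
cc['x'] = [7, 8, 9]      # validated path: length is checked first
cc._data['z'] = [1, 1]   # direct path: no check, so the frame is now inconsistent

# Compare the cost of the two paths; absolute numbers depend on the machine.
new_x = [4, 5, 6]
direct = timeit.timeit(lambda: cc._data.__setitem__('x', new_x), number=100_000)
checked = timeit.timeit(lambda: cc.__setitem__('x', new_x), number=100_000)
print(f'direct: {direct:.4f} s, checked: {checked:.4f} s')
```

The direct write to `_data` silently leaves column `z` one row short, which is the kind of inconsistency the validated path exists to prevent.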

Grouping Operations

tafra also excels at aggregation methods. The primary ones are group_by, a SQL-like GROUP BY, and transform, the split-apply-combine equivalent of a SQL-like GROUP BY followed by a LEFT JOIN back to the original table.

>>> %timeit tf.group_by(['y', 'z'], {'x': sum})
138 µs ± 4.03 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

>>> %timeit tf.transform(['y', 'z'], {'sum_x': (sum, 'x')})
161 µs ± 2.31 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
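The difference between the two operations timed above can be sketched in plain Python. The helper names `group_by_sketch` and `transform_sketch` are hypothetical and this is not how tafra implements them; the sketch only illustrates the semantics, and it orders groups by first appearance, which may differ from tafra's output order.

```python
from collections import defaultdict
from typing import Any, Callable, Dict, List, Sequence, Tuple

Columns = Dict[str, List[Any]]


def _group_rows(data: Columns, by: Sequence[str]) -> Dict[Tuple[Any, ...], List[int]]:
    """Map each distinct key tuple to the row indices where it occurs."""
    groups: Dict[Tuple[Any, ...], List[int]] = defaultdict(list)
    for i, key in enumerate(zip(*(data[name] for name in by))):
        groups[key].append(i)
    return groups


def group_by_sketch(data: Columns, by: Sequence[str], col: str,
                    func: Callable[..., Any]) -> Columns:
    """One output row per group, like a SQL GROUP BY."""
    groups = _group_rows(data, by)
    out: Columns = {name: [key[j] for key in groups] for j, name in enumerate(by)}
    out[col] = [func(data[col][i] for i in rows) for rows in groups.values()]
    return out


def transform_sketch(data: Columns, by: Sequence[str], new_col: str,
                     func: Callable[..., Any], col: str) -> Columns:
    """Same row count as the input: each row receives its group's aggregate."""
    result: List[Any] = [None] * len(data[col])
    for rows in _group_rows(data, by).values():
        value = func(data[col][i] for i in rows)
        for i in rows:
            result[i] = value
    return {**data, new_col: result}


data = {
    'x': [1., 2., 3., 4., 5., 6.],
    'y': ['one', 'two', 'one', 'two', 'one', 'two'],
    'z': [0, 0, 0, 1, 1, 1],
}
grouped = group_by_sketch(data, ['y', 'z'], 'x', sum)
transformed = transform_sketch(data, ['y', 'z'], 'sum_x', sum, 'x')
print(grouped)
print(transformed['sum_x'])
```

The grouped result has one row per distinct (y, z) pair, while the transformed result keeps all six rows and adds the group total alongside each one.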

The equivalent pandas functions are given below. They require a chain of several object methods to perform the same role, and the transform requires a copy operation and assignment into the copied DataFrame in order to preserve immutability.

>>> %timeit df.groupby(['y','z']).agg({'x': 'sum'}).reset_index()
2.5 ms ± 177 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

>>> %%timeit
... tdf = df.copy()
... tdf['x'] = df.groupby(['y', 'z'])[['x']].transform(sum)
2.81 ms ± 143 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Version History

1.0.10

  • Add pipe and overload >> operator for Tafra objects

1.0.9

  • Add test files to build

1.0.8

  • Check rows in constructor to ensure equal data length

1.0.7

  • Handle missing or NULL values in read_csv().

  • Cast empty elements to None when updating dtypes to avoid failure of np.astype().

  • Update some typing, minor refactoring for performance

1.0.6

  • Additional validations in constructor, primarily to evaluate Iterables of values

  • Split col_map into col_map and key_map, as the original function’s return signature depended upon an argument.

  • Fix some documentation typos

1.0.5

  • Add tuple_map method

  • Refactor all iterators and ..._map functions to improve performance

  • Unpack np.ndarray if given as keys to constructor

  • Add validate=False in __post_init__ if inputs are known to be valid to improve performance

1.0.4

  • Add read_csv, to_csv

  • Various refactoring and improvement in data validation

  • Add typing_extensions to dependencies

  • Change method of dtype storage, extract str representation from np.dtype()

1.0.3

  • Add read_sql and read_sql_chunks

  • Add to_tuple and to_pandas

  • Cleanup constructor data validation

1.0.2

  • Add object_formatter to expose user formatting for dtype=object

  • Improvements to indexing and slicing

1.0.1

  • Add iter functions

  • Add map functions

  • Various constructor improvements

1.0.0

  • Initial Release
