dataclassframe

A dataclass container with multi-indexing and bulk operations. Provides the typed benefits and ergonomics of dataclasses while having the efficiency of Pandas dataframes.

The container is based on data-oriented design by optimising the memory layout of the stored data, providing fast bulk operations and a smaller memory footprint for large collections. Bulk operations are enabled using Pandas which has a rich set of vectorised methods for both numerical and string data types.
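The layout idea can be sketched with the standard library alone. This is only an illustration of column-oriented storage, not the library's implementation: the `Point` class is hypothetical, and `array.array` stands in for the Pandas-backed columns described above.

```python
from array import array
from dataclasses import dataclass


@dataclass
class Point:
    x: float
    y: float


# Row-oriented layout: one Python object per record.
rows = [Point(float(i), float(2 * i)) for i in range(1000)]

# Column-oriented layout: one contiguous typed buffer per field.
xs = array("d", (p.x for p in rows))
ys = array("d", (p.y for p in rows))

# A bulk operation reads a single column and never visits the other fields.
total_x = sum(xs)

# Each column holds raw doubles rather than references to float objects.
column_bytes = xs.itemsize * len(xs) + ys.itemsize * len(ys)
```

With the row-oriented list, the same sum has to visit every `Point` object and pull out one attribute at a time.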

Multi-indexing lets multiple fields act as keys for indexing the records. This suits use cases such as bidirectional and inverse dictionaries.
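To make the bidirectional use case concrete, here is a minimal sketch using plain dictionaries. The `Country` class and the `by_code` / `by_name` tables are hypothetical names chosen for illustration; they show the idea of several keys per record, not how dataclassframe implements it.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Country:
    code: str
    name: str


records = [Country("FR", "France"), Country("JP", "Japan")]

# One lookup table per key field, all pointing at the same record objects.
by_code: Dict[str, Country] = {r.code: r for r in records}
by_name: Dict[str, Country] = {r.name: r for r in records}

# Forward lookup (code -> record) and inverse lookup (name -> record)
# reach the same record.
same_record = by_code["JP"] is by_name["Japan"]
```

Keeping the tables in sync by hand is the chore that a multi-indexed container is meant to remove.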

A DataClassFrame provides good ergonomics for production code as columns are immutable and columns/data types are well defined by the dataclasses. This makes it easier for users to understand the "shape" of the data in large projects and refactor when necessary.

Installing

Get the latest version from PyPI using pip

pip install dataclassframe

Feature comparison

Container | Positional indexing | Key indexing | Multi-key indexing | Data-oriented design | Column-wise operations | Type hints | Use in prod
DataClassFrame
List
Dictionary
MIDict
Pandas DataFrame

Show by example

A container data-type for dataclasses...

from dataclasses import dataclass
from dataclassframe import DataClassFrame

@dataclass
class ExampleDC:
    field1: str
    field2: int

records = [
    ExampleDC('a', 1),
    ExampleDC('b', 2),
    ExampleDC('c', 3),
]

dcf = DataClassFrame(
    record_class=ExampleDC,
    data=records,
    index=['field1', 'field2'],
)

Which acts like an ordered dictionary with multi-indexing...

# Obtain record `ExampleDC('b', 2)`
row_idx = dcf.iat[1]    # Using positional index
row_f1 = dcf.at['b']    # Using index of `field1`
row_f2 = dcf.at[:, 2]   # Using index of `field2`
assert row_idx == row_f1 == row_f2

With bulk operations on the columns...

assert dcf.cols.field2.sum() == 6

Works nicely with Python 3 type hints...

dcf: DataClassFrame[ExampleDC]
dcf.iat[1]: ExampleDC

Design

It's no secret that, under the hood, DataClassFrames use Pandas DataFrames to store data. Where possible the data is converted to Pandas Series, which in turn use NumPy arrays. When the user accesses a record, the data is converted back into the dataclass provided at initialisation.
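The round trip described above can be sketched without Pandas. In this illustration plain lists stand in for Pandas Series, and `to_columns` and `record_at` are hypothetical helper names, not part of the dataclassframe API.

```python
from dataclasses import dataclass, fields
from typing import Any, Dict, List


@dataclass
class ExampleDC:
    field1: str
    field2: int


def to_columns(record_class: type, records: List[Any]) -> Dict[str, List[Any]]:
    """Split records into one list per dataclass field."""
    return {
        f.name: [getattr(r, f.name) for r in records]
        for f in fields(record_class)
    }


def record_at(record_class: type, columns: Dict[str, List[Any]], pos: int) -> Any:
    """Rebuild a single dataclass instance from the columns at position pos."""
    return record_class(**{name: col[pos] for name, col in columns.items()})


records = [ExampleDC("a", 1), ExampleDC("b", 2), ExampleDC("c", 3)]
columns = to_columns(ExampleDC, records)   # store column-wise
second = record_at(ExampleDC, columns, 1)  # convert back on access
```

Storage is column-wise, but callers only ever see `ExampleDC` instances, which is what keeps the type hints meaningful.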

Pandas provides many advantages over a simple list of dataclasses or similar, such as a smaller memory footprint and fast vectorised operations. However, the author and others consider using Pandas DataFrames directly in production code an anti-pattern. Specifically, DataFrames are column-wise mutable, so it is difficult to determine at code-time which columns a DataFrame contains, i.e. its shape. DataFrames also provide no type-hinting benefits.
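The "shape" concern can be illustrated with a small sketch. A plain dictionary of columns stands in for a mutable DataFrame here, and the `table` and `schema` names are hypothetical.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class ExampleDC:
    field1: str
    field2: int


# The column names and types are declared once, in the class definition.
schema = {f.name: f.type for f in fields(ExampleDC)}

# A mutable column mapping (standing in for a DataFrame) can gain columns
# anywhere in a program, so its shape is only known at run time.
table = {"field1": ["a", "b"], "field2": [1, 2]}
table["surprise"] = [None, None]

# Columns present in the table but absent from the declared schema.
unexpected = set(table) - set(schema)
```

Reading `ExampleDC` tells a maintainer which fields exist. Finding `surprise` means tracing every place that could have written to `table`.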

Todo

  • Slicing and dataclassframe views for accessing data and setting data
  • Append and inserts
  • Data-oriented design for NumPy fields

Changelog

All notable changes to this project will be documented here.

The format is based on Keep a Changelog and this project adheres to Semantic Versioning.

[0.1.0] - 2020-10-22

Added

  • Initial release of dataclassframe

License

© Josh Levy-Kramer 2020. dataclassframe is released under the MIT license.
