Lightweight labelled multidimensional arrays with NumPy arrays under the hood.
Project description
[k]array: labeled multi-dimensional arrays
Karray is a simple tool that intends to abstract the users from the complexity of working with labelled multi-dimensional arrays. Numpy is the tool’s core, with an extensive collection of high-level mathematical functions to operate on multi-dimensional arrays efficiently thanks to its well-optimized C code. With Karray, we put effort into generating lightweight objects expecting to reduce overheads and avoid large loops that cause bottlenecks and impact performance. Numpy is the only relevant dependency, while Polars, Pandas, sparse and Pyarrow are required to import, export and store the arrays. karray
is developed by the research group Transformation of the Energy Economy
at DIW Berlin (German Institute of Economic Research).
Links
- Documentation: https://diw-evu.gitlab.io/karray
- Source code: https://gitlab.com/diw-evu/karray
- PyPI releases: https://pypi.org/project/karray
Table of contents
Getting started
Quick installation
To install karray, you can use pip:
pip install karray
Importing karray
To start using karray, import the necessary classes and functions:
import karray as ka
# then you can use ka.Array, ka.Long, and ka.settings
The Array
class represents a labeled multidimensional array, while the Long
class represents a labeled one-dimensional array. The settings
object allows you to configure various options for karray.
Usage Examples
Creating an Array
You can create an Array
object in several ways:
- From a
Long
object and coordinates:
import pandas as pd
index = {'dim1': ['a', 'b'],
'dim2': [1, 2],
'dim3': pd.to_datetime(['2020-01-01', '2020-01-02'], utc=True)}
value = [10., 20.]
long = ka.Long(index=index, value=value)
arr1 = ka.Array(data=long)
arr1
[k]array
Long object size | 64 bytes |
Data object type | dense |
Data object size | 64 bytes |
Dimensions | ['dim1', 'dim2', 'dim3'] |
Shape | [2, 2, 2] |
Capacity | 8 |
Rows | 2 |
Coords
Dimension | Length | Type | Items |
---|---|---|---|
dim1 | 2 | object | ['a' 'b'] |
dim2 | 2 | int64 | [1 2] |
dim3 | 2 | datetime64[ns] | ['2020-01-01T00:00:00.000000000' '2020-01-02T00:00:00.000000000'] |
Data
dim1 | dim2 | dim3 | value | |
---|---|---|---|---|
0 | a | 1 | 2020-01-01T00:00:00.000000000 | 10.00 |
1 | b | 2 | 2020-01-02T00:00:00.000000000 | 20.00 |
- From a tuple of index and value, and coordinates:
index2 = {'dim1': ['a', 'b'], 'dim2': [1, 2]}
value2 = [10, 20]
coords2 = {'dim1': ['a', 'b'], 'dim2': [1, 2]}
arr2 = ka.Array(data=(index2, value2), coords=coords2)
arr2
[k]array
Long object size | 48 bytes |
Data object type | dense |
Data object size | 32 bytes |
Dimensions | ['dim1', 'dim2'] |
Shape | [2, 2] |
Capacity | 4 |
Rows | 2 |
Coords
Dimension | Length | Type | Items |
---|---|---|---|
dim1 | 2 | object | ['a' 'b'] |
dim2 | 2 | int64 | [1 2] |
Data
dim1 | dim2 | value | |
---|---|---|---|
0 | a | 1 | 10 |
1 | b | 2 | 20 |
- From a dense NumPy array and coordinates:
import numpy as np
dense = np.array([[10, 20], [30, 40]])
coords3 = {'dim1': ['a', 'b'], 'dim2': [1, 2]}
arr3 = ka.Array(data=dense, coords=coords3)
arr3
[k]array
Long object size | 96 bytes |
Data object type | dense |
Data object size | 32 bytes |
Dimensions | ['dim1', 'dim2'] |
Shape | [2, 2] |
Capacity | 4 |
Rows | 4 |
Coords
Dimension | Length | Type | Items |
---|---|---|---|
dim1 | 2 | object | ['a' 'b'] |
dim2 | 2 | int64 | [1 2] |
Data
dim1 | dim2 | value | |
---|---|---|---|
0 | a | 1 | 10.00 |
1 | a | 2 | 20.00 |
2 | b | 1 | 30.00 |
3 | b | 2 | 40.00 |
- From a sparse array (using the
sparse
library) and coordinates:
import sparse as sp
sparse_arr = sp.COO(data=[10, 20], coords=[[0, 1], [0, 1]], shape=(2, 2))
coords4 = {'dim1': ['a', 'b'], 'dim2': [1, 2]}
arr4 = ka.Array(data=sparse_arr, coords=coords4)
arr4
[k]array
Long object size | 48 bytes |
Data object type | dense |
Data object size | 32 bytes |
Dimensions | ['dim1', 'dim2'] |
Shape | [2, 2] |
Capacity | 4 |
Rows | 2 |
Coords
Dimension | Length | Type | Items |
---|---|---|---|
dim1 | 2 | object | ['a' 'b'] |
dim2 | 2 | int64 | [1 2] |
Data
dim1 | dim2 | value | |
---|---|---|---|
0 | a | 1 | 10 |
1 | b | 2 | 20 |
Accessing Array Elements
You can access elements of an Array
object using various methods:
- Using the
items()
method to iterate over the array elements:
for item in arr3.items():
print(item)
('dim1', array(['a', 'a', 'b', 'b'], dtype=object))
('dim2', array([1, 2, 1, 2]))
('value', array([10., 20., 30., 40.]))
- Using the
to_pandas()
method to convert the array to a pandas DataFrame:
df = arr1.to_pandas()
print(df)
dim1 dim2 dim3 value
0 a 1 2020-01-01 10.0
1 b 2 2020-01-02 20.0
- Using the
to_polars()
method to convert the array to a polars DataFrame:
df = arr1.to_polars()
print(df)
shape: (2, 4)
┌──────┬──────┬─────────────────────┬───────┐
│ dim1 ┆ dim2 ┆ dim3 ┆ value │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ datetime[ns] ┆ f64 │
╞══════╪══════╪═════════════════════╪═══════╡
│ a ┆ 1 ┆ 2020-01-01 00:00:00 ┆ 10.0 │
│ b ┆ 2 ┆ 2020-01-02 00:00:00 ┆ 20.0 │
└──────┴──────┴─────────────────────┴───────┘
Array Operations
karray provides various operations that can be performed on Array
objects:
- Arithmetic operations:
result = arr1 + arr2
result
[k]array
Long object size | 128 bytes |
Data object type | dense |
Data object size | 64 bytes |
Dimensions | ['dim1', 'dim2', 'dim3'] |
Shape | [2, 2, 2] |
Capacity | 8 |
Rows | 4 |
Coords
Dimension | Length | Type | Items |
---|---|---|---|
dim1 | 2 | object | ['a' 'b'] |
dim2 | 2 | int64 | [1 2] |
dim3 | 2 | datetime64[ns] | ['2020-01-01T00:00:00.000000000' '2020-01-02T00:00:00.000000000'] |
Data
dim1 | dim2 | dim3 | value | |
---|---|---|---|---|
0 | a | 1 | 2020-01-01T00:00:00.000000000 | 20.00 |
1 | a | 1 | 2020-01-02T00:00:00.000000000 | 10.00 |
2 | b | 2 | 2020-01-01T00:00:00.000000000 | 20.00 |
3 | b | 2 | 2020-01-02T00:00:00.000000000 | 40.00 |
result = arr3 * 2
result
[k]array
Long object size | 96 bytes |
Data object type | dense |
Data object size | 32 bytes |
Dimensions | ['dim1', 'dim2'] |
Shape | [2, 2] |
Capacity | 4 |
Rows | 4 |
Coords
Dimension | Length | Type | Items |
---|---|---|---|
dim1 | 2 | object | ['a' 'b'] |
dim2 | 2 | int64 | [1 2] |
Data
dim1 | dim2 | value | |
---|---|---|---|
0 | a | 1 | 20.00 |
1 | a | 2 | 40.00 |
2 | b | 1 | 60.00 |
3 | b | 2 | 80.00 |
result = arr4 - 1
result
[k]array
Long object size | 96 bytes |
Data object type | dense |
Data object size | 32 bytes |
Dimensions | ['dim1', 'dim2'] |
Shape | [2, 2] |
Capacity | 4 |
Rows | 4 |
Coords
Dimension | Length | Type | Items |
---|---|---|---|
dim1 | 2 | object | ['a' 'b'] |
dim2 | 2 | int64 | [1 2] |
Data
dim1 | dim2 | value | |
---|---|---|---|
0 | a | 1 | 9.00 |
1 | a | 2 | -1.00 |
2 | b | 1 | -1.00 |
3 | b | 2 | 19.00 |
- Comparison operations:
mask = arr2 > 10
mask = arr2 == 5
- Logical operations:
result = arr2 & arr4
result = arr2 | arr4
result = ~arr2
- Reduction operations:
result = arr1.reduce('dim1', aggfunc='sum')
result
[k]array
Long object size | 48 bytes |
Data object type | dense |
Data object size | 32 bytes |
Dimensions | ['dim2', 'dim3'] |
Shape | [2, 2] |
Capacity | 4 |
Rows | 2 |
Coords
Dimension | Length | Type | Items |
---|---|---|---|
dim2 | 2 | int64 | [1 2] |
dim3 | 2 | datetime64[ns] | ['2020-01-01T00:00:00.000000000' '2020-01-02T00:00:00.000000000'] |
Data
dim2 | dim3 | value | |
---|---|---|---|
0 | 1 | 2020-01-01T00:00:00.000000000 | 10.00 |
1 | 2 | 2020-01-02T00:00:00.000000000 | 20.00 |
result = arr1.reduce('dim2', aggfunc=np.mean)
result
[k]array
Long object size | 48 bytes |
Data object type | dense |
Data object size | 32 bytes |
Dimensions | ['dim1', 'dim3'] |
Shape | [2, 2] |
Capacity | 4 |
Rows | 2 |
Coords
Dimension | Length | Type | Items |
---|---|---|---|
dim1 | 2 | object | ['a' 'b'] |
dim3 | 2 | datetime64[ns] | ['2020-01-01T00:00:00.000000000' '2020-01-02T00:00:00.000000000'] |
Data
dim1 | dim3 | value | |
---|---|---|---|
0 | a | 2020-01-01T00:00:00.000000000 | 5.00 |
1 | b | 2020-01-02T00:00:00.000000000 | 10.00 |
- Shifting and rolling operations:
shifted = arr3.shift(dim1=1, dim2=-1, fill_value=0.)
shifted
[k]array
Long object size | 24 bytes |
Data object type | dense |
Data object size | 32 bytes |
Dimensions | ['dim1', 'dim2'] |
Shape | [2, 2] |
Capacity | 4 |
Rows | 1 |
Coords
Dimension | Length | Type | Items |
---|---|---|---|
dim1 | 2 | object | ['a' 'b'] |
dim2 | 2 | int64 | [1 2] |
Data
dim1 | dim2 | value | |
---|---|---|---|
0 | b | 1 | 20.00 |
rolled = arr3.roll(dim1=2)
rolled
[k]array
Long object size | 96 bytes |
Data object type | dense |
Data object size | 32 bytes |
Dimensions | ['dim1', 'dim2'] |
Shape | [2, 2] |
Capacity | 4 |
Rows | 4 |
Coords
Dimension | Length | Type | Items |
---|---|---|---|
dim1 | 2 | object | ['a' 'b'] |
dim2 | 2 | int64 | [1 2] |
Data
dim1 | dim2 | value | |
---|---|---|---|
0 | a | 1 | 10.00 |
1 | a | 2 | 20.00 |
2 | b | 1 | 30.00 |
3 | b | 2 | 40.00 |
- Inserting new dimensions:
# One dimension with one element
result = arr2.insert(dim3='x')
result
[k]array
Long object size | 64 bytes |
Data object type | dense |
Data object size | 32 bytes |
Dimensions | ['dim3', 'dim1', 'dim2'] |
Shape | [1, 2, 2] |
Capacity | 4 |
Rows | 2 |
Coords
Dimension | Length | Type | Items |
---|---|---|---|
dim3 | 1 | object | ['x'] |
dim1 | 2 | object | ['a' 'b'] |
dim2 | 2 | int64 | [1 2] |
Data
dim3 | dim1 | dim2 | value | |
---|---|---|---|---|
0 | x | a | 1 | 10 |
1 | x | b | 2 | 20 |
# One dimension with several elements related to an existing dimension using a dict
result = arr2.insert(dim3={'dim1': {'a': -1, 'b': -2}})
result
[k]array
Long object size | 64 bytes |
Data object type | dense |
Data object size | 64 bytes |
Dimensions | ['dim3', 'dim1', 'dim2'] |
Shape | [2, 2, 2] |
Capacity | 8 |
Rows | 2 |
Coords
Dimension | Length | Type | Items |
---|---|---|---|
dim3 | 2 | int64 | [-2 -1] |
dim1 | 2 | object | ['a' 'b'] |
dim2 | 2 | int64 | [1 2] |
Data
dim3 | dim1 | dim2 | value | |
---|---|---|---|---|
0 | -1 | a | 1 | 10 |
1 | -2 | b | 2 | 20 |
# One dimension with several elements related to an existing dimension using two lists
result = arr2.insert(dim3={'dim1': [['a', 'b'], [-1, -2]]})
result
[k]array
Long object size | 64 bytes |
Data object type | dense |
Data object size | 64 bytes |
Dimensions | ['dim3', 'dim1', 'dim2'] |
Shape | [2, 2, 2] |
Capacity | 8 |
Rows | 2 |
Coords
Dimension | Length | Type | Items |
---|---|---|---|
dim3 | 2 | int64 | [-1 -2] |
dim1 | 2 | object | ['a' 'b'] |
dim2 | 2 | int64 | [1 2] |
Data
dim3 | dim1 | dim2 | value | |
---|---|---|---|---|
0 | -1 | a | 1 | 10 |
1 | -2 | b | 2 | 20 |
- Drop a dimension:
result = arr1.drop('dim3')
result
[k]array
Long object size | 48 bytes |
Data object type | dense |
Data object size | 32 bytes |
Dimensions | ['dim1', 'dim2'] |
Shape | [2, 2] |
Capacity | 4 |
Rows | 2 |
Coords
Dimension | Length | Type | Items |
---|---|---|---|
dim1 | 2 | object | ['a' 'b'] |
dim2 | 2 | int64 | [1 2] |
Data
dim1 | dim2 | value | |
---|---|---|---|
0 | a | 1 | 10.00 |
1 | b | 2 | 20.00 |
!Note
Dropping a dimension will work only if the resulting array still has unique coordinates. If dropping a dimension leads to an array with duplicate coordinates, as a results of the removed dimension, karray will raise an error.
# Assertion error due to duplicate coords
try:
arr3.drop('dim2')
except AssertionError as e:
print(e)
Index items per row must be unique. By removing ['dim2'] leads the existence of repeated indexes
e.g.:
('dim1',) value
0 ('a',) 10.0
1 ('a',) 20.0
Intead, you can use obj.reduce('dim2')
With an aggfunc: sum() by default
- Expanding a dimension (Broadcasting)
result = arr3.expand(dim3=['x', 'y', 'z'])
result
[k]array
Long object size | 384 bytes |
Data object type | dense |
Data object size | 96 bytes |
Dimensions | ['dim1', 'dim2', 'dim3'] |
Shape | [2, 2, 3] |
Capacity | 12 |
Rows | 12 |
Coords
Dimension | Length | Type | Items |
---|---|---|---|
dim1 | 2 | object | ['a' 'b'] |
dim2 | 2 | int64 | [1 2] |
dim3 | 3 | object | ['x' 'y' 'z'] |
Data
dim1 | dim2 | dim3 | value | |
---|---|---|---|---|
0 | a | 1 | x | 10.00 |
1 | a | 1 | y | 10.00 |
2 | a | 1 | z | 10.00 |
3 | a | 2 | x | 20.00 |
4 | a | 2 | y | 20.00 |
5 | a | 2 | z | 20.00 |
6 | b | 1 | x | 30.00 |
7 | b | 1 | y | 30.00 |
8 | b | 1 | z | 30.00 |
9 | b | 2 | x | 40.00 |
10 | b | 2 | y | 40.00 |
11 | b | 2 | z | 40.00 |
- ufunc operations
arr3.ufunc(dim='dim2', func=np.prod, keepdims=True)
[k]array
Long object size | 96 bytes |
Data object type | dense |
Data object size | 32 bytes |
Dimensions | ['dim1', 'dim2'] |
Shape | [2, 2] |
Capacity | 4 |
Rows | 4 |
Coords
Dimension | Length | Type | Items |
---|---|---|---|
dim1 | 2 | object | ['a' 'b'] |
dim2 | 2 | int64 | [1 2] |
Data
dim1 | dim2 | value | |
---|---|---|---|
0 | a | 1 | 200.00 |
1 | a | 2 | 200.00 |
2 | b | 1 | 1200.00 |
3 | b | 2 | 1200.00 |
!Note
The dim argument is passed to ufunc as axis argument in numpy and keepdims argument is passed with the same name. You can add more arguments depending on the ufunc.
Saving and Loading Arrays
karray supports saving and loading arrays using the Feather format:
- Saving an array to a Feather file:
arr1.to_feather('array.feather')
- Loading an array from a Feather file:
loaded_arr1 = ka.from_feather('array.feather')
loaded_arr1
[k]array
Long object size | 64 bytes |
Data object type | dense |
Data object size | 64 bytes |
Dimensions | ['dim1', 'dim2', 'dim3'] |
Shape | [2, 2, 2] |
Capacity | 8 |
Rows | 2 |
Coords
Dimension | Length | Type | Items |
---|---|---|---|
dim1 | 2 | object | ['a' 'b'] |
dim2 | 2 | int64 | [1 2] |
dim3 | 2 | int64 | [1577836800000000000 1577923200000000000] |
Data
dim1 | dim2 | dim3 | value | |
---|---|---|---|---|
0 | a | 1 | 2020-01-01T00:00:00.000000000 | 10.00 |
1 | b | 2 | 2020-01-02T00:00:00.000000000 | 20.00 |
Interoperability with Other Libraries
karray provides interoperability with other popular data manipulation libraries:
- Converting an array to a pandas DataFrame and then back to an array:
df = arr2.to_pandas()
new_arr = ka.from_pandas(df, coords=coords2)
new_arr
[k]array
Long object size | 48 bytes |
Data object type | dense |
Data object size | 32 bytes |
Dimensions | ['dim1', 'dim2'] |
Shape | [2, 2] |
Capacity | 4 |
Rows | 2 |
Coords
Dimension | Length | Type | Items |
---|---|---|---|
dim1 | 2 | object | ['a' 'b'] |
dim2 | 2 | int64 | [1 2] |
Data
dim1 | dim2 | value | |
---|---|---|---|
0 | a | 1 | 10 |
1 | b | 2 | 20 |
- Converting an array to a polars DataFrame and then back to an array:
df = arr2.to_polars()
new_arr = ka.from_polars(df, coords=coords2)
new_arr
[k]array
Long object size | 48 bytes |
Data object type | dense |
Data object size | 32 bytes |
Dimensions | ['dim1', 'dim2'] |
Shape | [2, 2] |
Capacity | 4 |
Rows | 2 |
Coords
Dimension | Length | Type | Items |
---|---|---|---|
dim1 | 2 | object | ['a' 'b'] |
dim2 | 2 | int64 | [1 2] |
Data
dim1 | dim2 | value | |
---|---|---|---|
0 | a | 1 | 10 |
1 | b | 2 | 20 |
There are many more features and functionalities. Please refer to the source code section for more details.
!Note
karray is a work in progress. The API is subject to change in the future. We are looking for feedback, suggestions, and we appreciate your contributions.
© 2024 Carlos Gaete-Morales
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Hashes for karray-2024.3.7-py3-none-any.whl
Algorithm | Hash digest | |
---|---|---|
SHA256 | 0d915adfd1ef0c2abdfc460432fe2d778f336a1b636a2e17edd5a0687da30b51 |
|
MD5 | faa193ba4eca86fdadbe1511bc50b830 |
|
BLAKE2b-256 | 098a7a72b5dd2d9d75d0e7e5e991909fbfa73b1de1c1751a6f8674897e55509b |