Using attrs and numpy to define clear data structures with multiple tensors
Overview
Attensors was born out of the need to work with multiple input data for machine learning models, while keeping things tidy and having self-documented code.
Working with mixed input data, we know the shape of each tensor component of a single input. This is often specified in intermediary preprocessing or data loading mechanisms (e.g. dataset schemas, tensor specs). All these tensors eventually gain the same "prefix" shape, which refers to global dimensions (such as a batch_size or a sequence_size).
As problems and models get more complicated, so, too, do their inputs. These collections of input tensors are almost never defined as separate entities, instead being grouped in generic data structures such as dictionaries, lists or tuples.
Through this package we propose the combined usage of attrs and numpy's structured arrays to provide easy definition of tensor collections and intuitive means to work with them.
Furthermore, we can also make use of type hints (specifically Annotated) to provide the metadata describing the tensors' dtypes and shapes.
Quick example
In general, the minimum types we need to import are the tensors decorator from our package and Annotated from typing if we plan to define fields via annotations.
from attensors import tensors
from typing import Annotated
# numpy is also needed, since the examples below refer to it as np
import numpy as np
If we were to take for example the New York Real Estate Data as in this tutorial, there are multiple ways we could define our multi-tensor:
# Following the tutorial, 2 tensors are needed:
# one for image data, one for tabular data
@tensors
class NYRealEstateData:
    image: Annotated[np.ndarray, {"dtype": np.float32, "shape": (3, 224, 224)}]
    tabular: Annotated[np.ndarray, {"dtype": np.float32, "shape": (5,)}]

# If we'd encode each scalar feature separately, we could define it as such
@tensors
class NYRealEstateData:
    image: Annotated[np.ndarray, {"dtype": np.float32, "shape": (3, 224, 224)}]
    latitude: float
    longitude: float
    zpid: int
    beds: int
    baths: int

# We could also group latitude and longitude separately and nest types
@tensors
class Coordinates:
    latitude: float
    longitude: float

@tensors
class NYRealEstateData:
    image: Annotated[np.ndarray, {"dtype": np.float32, "shape": (3, 224, 224)}]
    location: Coordinates
    zpid: int
    beds: int
    baths: int
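Since these classes are backed by numpy structured arrays, the nested definition presumably corresponds to a nested structured dtype. The following is a plain-numpy sketch of such a dtype written by hand, without attensors. The names `coordinates_dtype` and `estate_dtype` are illustrative, and the mapping of plain `float` and `int` hints to `np.dtype(float)` and `np.dtype(int)` is an assumption about what the package builds.

```python
import numpy as np

# Hand-written nested structured dtype mirroring the Coordinates example.
# Assumption: plain float / int hints map to np.dtype(float) / np.dtype(int).
coordinates_dtype = np.dtype([("latitude", float), ("longitude", float)])
estate_dtype = np.dtype([
    ("image", np.float32, (3, 224, 224)),
    ("location", coordinates_dtype),
    ("zpid", int),
    ("beds", int),
    ("baths", int),
])

# A batch of 2 records: the prefix shape is (2,)
batch = np.zeros((2,), dtype=estate_dtype)

# Nested fields are reached by chaining field names; the views write through
batch["location"]["latitude"] = [40.7, 40.8]

print(batch["location"].dtype.names)
print(batch["image"].shape)
```

With attensors, the same nested access would go through attributes rather than string keys, as described in the documentation sections.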
Instantiation can be done by simply providing the data, or via numpy-style routines. In the following examples we consider the first definition provided above and 2 samples given with their respective tensors i1, t1 and i2, t2.
>>> sample = NYRealEstateData(image=[i1, i2], tabular=[t1, t2])
Since NYRealEstateData is actually a numpy.ndarray subclass, you could also define a sample using numpy:
>>> sample = np.array([(i1, t1), (i2, t2)], dtype=NYRealEstateData._dtype).view(NYRealEstateData)
Generating dummy data can be accomplished via shortcuts to the usual numpy creation routines.
>>> sample = NYRealEstateData.empty((2,))
All above examples will result in a sample with a prefix shape of (2,).
>>> sample.shape
(2,)
>>> sample.image.shape
(2, 3, 224, 224)
>>> sample.tabular.shape
(2, 5)
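For comparison, the same layout can be reproduced with a hand-written structured dtype in plain numpy. This is only a sketch of the underlying mechanism and does not use attensors: `sample_dtype` stands in for `NYRealEstateData._dtype`, and the dummy tensors are made up for illustration.

```python
import numpy as np

# Stand-in for NYRealEstateData._dtype (assumed layout, written by hand)
sample_dtype = np.dtype([
    ("image", np.float32, (3, 224, 224)),
    ("tabular", np.float32, (5,)),
])

# Two dummy samples, each with an image tensor and a tabular tensor
i1 = np.zeros((3, 224, 224), dtype=np.float32)
i2 = np.ones((3, 224, 224), dtype=np.float32)
t1 = np.arange(5, dtype=np.float32)
t2 = np.arange(5, 10, dtype=np.float32)

# One tuple per record; the record count becomes the prefix shape
raw = np.array([(i1, t1), (i2, t2)], dtype=sample_dtype)

print(raw.shape)             # prefix shape
print(raw["image"].shape)    # prefix shape + image component shape
print(raw["tabular"].shape)  # prefix shape + tabular component shape
```

The difference with a decorated class is mainly ergonomic: plain structured arrays use string keys such as `raw["image"]`, while the package exposes the fields as attributes.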
Of course, indexing, universal functions and classic array manipulation routines such as reshape, stack, concatenate are also supported.
You can read more about this in the documentation.
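To illustrate what these manipulation routines do to the prefix shape, here is a small sketch on plain numpy structured arrays. It does not use attensors, and the dtype `dt` is a reduced, hypothetical layout chosen to keep the arrays small.

```python
import numpy as np

# Reduced, hypothetical layout: a 2x2 "image" and a 5-element "tabular" vector
dt = np.dtype([("image", np.float32, (2, 2)), ("tabular", np.float32, (5,))])

a = np.zeros((4,), dtype=dt)
b = np.zeros((4,), dtype=dt)
b["tabular"] = 1.0

# Each routine acts on the prefix shape; component shapes are left untouched
reshaped = a.reshape(2, 2)       # prefix (4,) -> (2, 2)
stacked = np.stack([a, b])       # prefix (4,) -> (2, 4)
joined = np.concatenate([a, b])  # prefix (4,) -> (8,)

print(reshaped["image"].shape)
print(stacked["tabular"].shape)
print(joined.shape)
```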
Documentation
Type Definition
Tensors is the base class used for defining tensor collection types. It is created as a subclass of numpy's ndarray, using the functionality of structured arrays. On top of this, some specific rules are implemented to handle broadcasting and universal functions, and to shortcut some numpy routines based on the underlying dtype defined by the provided attributes.
Defining Tensors types is done via the @tensors decorator, which is a wrapper over attrs' @define.
The decorator provides the same functionality; however, it also makes the decorated class a subclass of Tensors.
All arguments available for @define are forwarded, with the exception of the following:
- slots: forced to False due to Tensors being a subclass of np.ndarray
- init: forced to False due to Tensors having a dedicated constructor which handles expected attr wrapped subclasses
- on_setattr: makes sure that setting attributes is mapped to the underlying np.ndarray fields. Adding more is possible as attrs supports this
- field_transformer: default field transformer to translate annotated fields to metadata. Adding more is possible and works in the same manner as on_setattr
Using this decorator will also translate annotated fields to metadata.
Annotated fields are expected to have a metadata dictionary with dtype and shape defined.
If fields are defined through type hints, but not using Annotated, the type will be considered as the dtype of the underlying tensor and its shape will be ().
Fields can also be defined using attrs' field function.
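The translation from type hints to a structured dtype can be pictured with the standard library alone. The sketch below is not the package's field transformer: `dtype_from_hints` and the `Sample` class are hypothetical names, and the code only mirrors the two rules stated above (Annotated metadata supplies dtype and shape; a plain hint becomes the dtype with shape ()).

```python
from typing import Annotated, get_args, get_origin, get_type_hints

import numpy as np


class Sample:
    # Plain class used only to carry annotations (no decorator involved)
    image: Annotated[np.ndarray, {"dtype": np.float32, "shape": (3, 224, 224)}]
    beds: int


def dtype_from_hints(cls):
    """Hypothetical helper: build a structured dtype from class annotations."""
    fields = []
    for name, hint in get_type_hints(cls, include_extras=True).items():
        if get_origin(hint) is Annotated:
            # get_args returns (underlying type, *metadata); take the dict
            meta = get_args(hint)[1]
            fields.append((name, meta["dtype"], meta["shape"]))
        else:
            # Plain hint: the type itself is the dtype, the shape is ()
            fields.append((name, hint))
    return np.dtype(fields)


dt = dtype_from_hints(Sample)
print(dt.names)
print(dt["image"].shape, dt["beds"].shape)
```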
Instantiation
Directly instantiating Tensors is possible. In this case, you provide the fields and values you want as keyword arguments. Either shape or dtype must be provided. Providing one will cause the other to be inferred.
If only shape is provided, this will be considered the shape of the instance, which must prefix the shapes of all included tensors. The dtype will be inferred based on the provided values.
If only dtype is provided, the instance's shape will be considered () as long as mc_shape is False. If mc_shape is set to True, the instance's shape will be computed as the maximum common shape amongst the provided values, while respecting the provided dtype.
Providing both fully specifies the instance. In this case the provided shape is validated against the dtype, since dtype takes precedence.
Arguments shape, dtype, mc_shape are NOT available for subclasses of Tensors.
Instead, the dtype is inferred from the class fields information (type and metadata).
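The two inference directions can be sketched in plain Python. Both helpers below are hypothetical and are not part of the package. `component_shapes` follows the stated rule that the instance shape must prefix every tensor's shape. `common_prefix` is one possible reading of "maximum common shape", taken here to mean the longest leading run of dimensions shared by all values once their component shapes are removed; the package may define it differently.

```python
import numpy as np


def component_shapes(shape, values):
    """Given the instance shape, return what remains of each value's shape."""
    shape = tuple(shape)
    remaining = {}
    for name, value in values.items():
        full = np.shape(value)
        if full[:len(shape)] != shape:
            raise ValueError(f"{name!r}: shape {full} is not prefixed by {shape}")
        remaining[name] = full[len(shape):]
    return remaining


def common_prefix(values, base_shapes):
    """Longest shared leading shape, given each field's component shape."""
    prefixes = []
    for name, value in values.items():
        full = np.shape(value)
        base = tuple(base_shapes[name])
        cut = len(full) - len(base)
        if cut < 0 or full[cut:] != base:
            raise ValueError(f"{name!r}: shape {full} does not end with {base}")
        prefixes.append(full[:cut])
    shared = []
    for dims in zip(*prefixes):
        if len(set(dims)) != 1:
            break
        shared.append(dims[0])
    return tuple(shared)


values = {"image": np.zeros((2, 3, 8, 8)), "tabular": np.zeros((2, 5))}
print(component_shapes((2,), values))
print(common_prefix(values, {"image": (3, 8, 8), "tabular": (5,)}))
```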
Indexing and attributes
Classes defined with @tensors, in the style of attrs, will provide attributes corresponding to the underlying ndarray structured array fields. Besides this, classic ndarray attributes such as shape and dtype will also be available. Types defined with the decorator will also contain a _dtype attribute corresponding to the inferred numpy dtype, which is used upon instantiation.
Indexing works in the same manner as it does with structured numpy arrays, however some type casting may occur, in the following manner:
- Any indexing that would result in an unstructured array will be cast to ndarray
- Any indexing that would result in a structured array which has a different dtype than the class, will be cast to Tensors
- Any indexing that would extract a single element (type numpy.void) will be cast to the source class.
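The three cases correspond to what plain numpy structured arrays return before any casting is applied. The sketch below shows only the numpy side, without attensors, using a small hypothetical dtype `dt`.

```python
import numpy as np

# Small hypothetical layout with two tensor fields and one scalar field
dt = np.dtype([
    ("image", np.float32, (2, 2)),
    ("tabular", np.float32, (5,)),
    ("beds", int),
])
batch = np.zeros((3,), dtype=dt)

field = batch["image"]              # single field: unstructured ndarray
subset = batch[["image", "beds"]]   # several fields: structured, other dtype
element = batch[0]                  # single record: numpy.void scalar
sliced = batch[1:]                  # slice: structured, same dtype

print(field.dtype.names)
print(subset.dtype.names)
print(type(element).__name__)
print(sliced.dtype == batch.dtype)
```

According to the rules above, the package would return these as a plain ndarray, a generic Tensors, the source class and the source class respectively. That mapping is read from the list and is not verified here.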
Broadcast rules
Additional broadcasting rules have been implemented through the class method broadcast_tensors.
Tensors types cannot be broadcast together unless they have the same field identifiers.
Scalars and non-structured ndarrays broadcast with Tensors types will be brought to that type's structure by broadcasting independently with each tensor component.
Tensors with the same fields will broadcast the base shape of each corresponding component and then broadcast the prefix shape as well.
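The rule for scalars and non-structured arrays can be pictured with a simplified plain-numpy sketch. `broadcast_to_structure` is a hypothetical helper, not the package's broadcast_tensors method: it keeps the structured operand's shape fixed and broadcasts the other value against each component separately.

```python
import numpy as np


def broadcast_to_structure(value, like):
    """Hypothetical helper: give `value` the field structure of `like`."""
    out = np.empty(like.shape, dtype=like.dtype)
    for name in like.dtype.names:
        # Broadcast independently against each component's full shape
        out[name] = np.broadcast_to(value, like[name].shape)
    return out


dt = np.dtype([("image", np.float32, (2, 2)), ("tabular", np.float32, (5,))])
like = np.zeros((3,), dtype=dt)

filled = broadcast_to_structure(2.0, like)
print(filled["image"].shape)
print(filled["tabular"].shape)
```

A value that cannot be broadcast to some component's shape makes `np.broadcast_to` raise a ValueError, which matches the idea that every component must be compatible on its own.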
Universal functions
Numpy provides an API for a multitude of mathematical operations via universal functions. However, these do not work for structured arrays, perhaps because no universal standard approach can be defined for them.
In this package, universal functions are implemented for given types that can be broadcast following the above rules and will work as such:
Any universal function called on Tensors types will essentially be called on each underlying tensor component.
The result(s) will be one or more Tensors types, having the same field identifiers, resulting dtypes based on the operation and broadcast shapes.
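The per-component behaviour can be approximated on plain structured arrays as follows. `apply_per_field` is a hypothetical helper and only a sketch: it handles binary universal functions with a single output, and it assumes both operands already share the same field names and prefix shape, so no broadcasting of the prefix is attempted.

```python
import numpy as np


def apply_per_field(ufunc, a, b):
    """Hypothetical helper: call a binary ufunc on each pair of fields."""
    if a.dtype.names != b.dtype.names:
        raise TypeError("operands must share the same field identifiers")
    results = {name: ufunc(a[name], b[name]) for name in a.dtype.names}

    # Result dtypes come from the operation; component shapes are whatever
    # follows the prefix dimensions in each per-field result
    fields = []
    for name, res in results.items():
        base = res.shape[a.ndim:]
        fields.append((name, res.dtype, base) if base else (name, res.dtype))

    out = np.empty(a.shape, dtype=np.dtype(fields))
    for name, res in results.items():
        out[name] = res
    return out


dt = np.dtype([("x", np.float32, (2,)), ("n", int)])
a = np.zeros((3,), dtype=dt)
a["x"] = 1.5
a["n"] = 2

total = apply_per_field(np.add, a, a)       # field dtypes are preserved
mask = apply_per_field(np.greater, total, a)  # field dtypes become bool
print(total.dtype.names, mask.dtype["n"])
```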