A Python library of data structures optimized for machine learning tasks
py4ai core
A Python library defining data structures optimized for machine learning pipelines
What is it?
py4ai-core is a modular Python package that provides abstractions for building data ingestion pipelines and running end-to-end machine learning pipelines. The library offers a lightweight object-oriented interface to MongoDB as well as pandas-based data structures. Its aim is to support the development of machine learning applications, with a focus on clean code and modular design.
Features
Some cool features that we are proud to mention are:
Data layers
- Archiver: Offers an object-oriented design to perform ETL on MongoDB collections as well as pandas DataFrames.
- DAO: Data Access Object that lets archivers serialize domain objects into the object type expected by the persistence layer (e.g. for MongoDB, a DAO serializes a domain object into a MongoDB document) and parse objects retrieved from the persistence layer into the corresponding representation in the framework (e.g. text is parsed into a Document, while tabular data is parsed into a pandas DataFrame).
- Database: Object representing a relational database
- Table: Object representing a table of a relational database
Data Model
Offers the following data structures:
- Document: Data structure designed for NLP applications that parses a JSON-like document into a pair made of a UUID and a dictionary of information.
- Sample: Data structure representing an observation (a.k.a. sample) as used in machine learning applications.
- MultiFeatureSample: Data structure representing an observation defined by a nested list of arrays.
- Dataset: Data structure designed for machine learning applications, representing a collection of samples.
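To make the relationship between these structures concrete, here is a minimal sketch built only on the standard library. The classes `SketchDocument`, `SketchSample` and `SketchDataset` are hypothetical stand-ins written for illustration; they are not the classes shipped with py4ai-core, whose constructors and methods may differ.

```python
import json
from dataclasses import dataclass, field
from typing import Any, Dict, List
from uuid import uuid4


@dataclass
class SketchDocument:
    """Hypothetical stand-in: a UUID paired with a dictionary of information."""

    uuid: str
    data: Dict[str, Any]

    @classmethod
    def from_json(cls, raw: str) -> "SketchDocument":
        # Parse a JSON string and attach a freshly generated identifier.
        return cls(uuid=str(uuid4()), data=json.loads(raw))


@dataclass
class SketchSample:
    """Hypothetical stand-in: one observation with its features and an optional label."""

    features: List[float]
    label: Any = None


@dataclass
class SketchDataset:
    """Hypothetical stand-in: a collection of samples."""

    samples: List[SketchSample] = field(default_factory=list)

    def features(self) -> List[List[float]]:
        return [sample.features for sample in self.samples]

    def labels(self) -> List[Any]:
        return [sample.label for sample in self.samples]


doc = SketchDocument.from_json('{"text": "hello world", "lang": "en"}')
dataset = SketchDataset([SketchSample([1.0, 1.0], 0), SketchSample([3.0, 4.0], 1)])
```

Here `doc.data` holds the parsed dictionary, while `dataset.features()` and `dataset.labels()` expose the samples column-wise.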
Installation
From the PyPI server
pip install py4ai-core
From source
git clone https://github.com/NicolaDonelli/py4ai-core
cd py4ai-core
make install
Tests
make tests
Checks
To run predefined checks (unit-tests, linting checks, formatting checks and static typing checks):
make checks
Examples
Data Layers
The Data Layer abstractions are designed to decouple the business layers from the details of the persistence layer implementation. The basic abstraction that makes this possible is the Repository.
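The decoupling idea can be sketched in a few lines of plain Python. The names `SketchRepository`, `InMemoryRepository` and `count_entities` are hypothetical and only illustrate the pattern; the actual Repository interface of py4ai-core is the one used in the examples that follow.

```python
from abc import ABC, abstractmethod
from typing import Dict, Generic, List, Optional, TypeVar

K = TypeVar("K")
E = TypeVar("E")


class SketchRepository(ABC, Generic[K, E]):
    """Hypothetical interface: all the business layer is allowed to rely on."""

    @abstractmethod
    def create(self, key: K, entity: E) -> E: ...

    @abstractmethod
    def retrieve(self, key: K) -> Optional[E]: ...

    @abstractmethod
    def list(self) -> List[E]: ...


class InMemoryRepository(SketchRepository[K, E]):
    """One possible backend: entities kept in a plain dictionary."""

    def __init__(self) -> None:
        self._store: Dict[K, E] = {}

    def create(self, key: K, entity: E) -> E:
        self._store[key] = entity
        return entity

    def retrieve(self, key: K) -> Optional[E]:
        return self._store.get(key)

    def list(self) -> List[E]:
        return [*self._store.values()]


def count_entities(repo: SketchRepository) -> int:
    # Business code sees only the interface, never the storage details.
    return len(repo.list())


repo: SketchRepository[int, str] = InMemoryRepository()
repo.create(1234, "Important data")
```

Because `count_entities` depends only on the abstract interface, a different backend can be swapped in without touching it.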
As an example, imagine we have a domain business object Entity:
from pydantic import BaseModel

class Entity(BaseModel):
    my_id: int
    my_data: str
To start with, imagine we want to use csv files stored on disk as the persistence layer. To do so, we will use the CsvRepository, which keeps pandas DataFrames in memory and writes them to disk as csv. Thus, we need to write the business logic to serialize the Entity into a row of the pandas DataFrame, i.e. a pandas Series:
import pandas as pd
from py4ai.core.data.layer.common.serialiazer import DataSerializer

class EntitySerializer(DataSerializer[int, int, Entity, pd.Series]):
    def to_object(self, entity: Entity) -> pd.Series:
        return pd.Series(entity.dict())

    def to_entity(self, document: pd.Series) -> Entity:
        return Entity(**document)

    def to_object_key(self, key: int) -> int:
        return key

    def get_key(self, entity: Entity) -> int:
        return entity.my_id
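The role of the four serializer methods is easier to see from the side of whoever calls them. The sketch below is an assumption about how a repository might use a serializer, not a description of py4ai-core internals; `SketchEntity`, `SketchSerializer` and `SketchStore` are hypothetical names, and a dataclass replaces the pydantic model to keep the example dependency-free.

```python
from dataclasses import asdict, dataclass
from typing import Dict


@dataclass
class SketchEntity:
    my_id: int
    my_data: str


class SketchSerializer:
    """Hypothetical serializer exposing the same four method names."""

    def get_key(self, entity: SketchEntity) -> int:
        return entity.my_id

    def to_object_key(self, key: int) -> str:
        # Storage keys are strings in this sketch.
        return f"entity-{key}"

    def to_object(self, entity: SketchEntity) -> dict:
        return asdict(entity)

    def to_entity(self, document: dict) -> SketchEntity:
        return SketchEntity(**document)


class SketchStore:
    """Hypothetical repository that delegates every conversion to the serializer."""

    def __init__(self, serializer: SketchSerializer) -> None:
        self.serializer = serializer
        self.rows: Dict[str, dict] = {}

    def create(self, entity: SketchEntity) -> None:
        # Domain key -> storage key, domain object -> storage object.
        key = self.serializer.to_object_key(self.serializer.get_key(entity))
        self.rows[key] = self.serializer.to_object(entity)

    def retrieve(self, key: int) -> SketchEntity:
        # Storage object -> domain object.
        return self.serializer.to_entity(self.rows[self.serializer.to_object_key(key)])


store = SketchStore(SketchSerializer())
store.create(SketchEntity(my_id=1234, my_data="Important data"))
restored = store.retrieve(1234)
```

The store never inspects the entity itself, so changing the storage format only requires a different serializer.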
We can now instantiate the repository class, which has all the methods for reading objects from and writing objects to the persistence layer:
from py4ai.core.data.layer.pandas.repository import CsvRepository
# `filename` is the path of the csv file backing the repository
repo = CsvRepository(filename, EntitySerializer())
entity = Entity(my_id=1234, my_data="Important data")
# This will create the entity in the persistence layer
await repo.create(entity)
# Retrieving the entity
retrieved = repo.retrieve(key=1234)
# Retrieving all entities
all_entities = repo.list()
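The snippet above uses `await`, which Python only accepts inside a coroutine (or an async-aware REPL). The following sketch shows one way to drive such calls from a regular script with `asyncio.run`. `AsyncSketchRepository` is a hypothetical in-memory stand-in, and which repository methods of py4ai-core are coroutines is not stated here, so check the signatures before copying the pattern.

```python
import asyncio
from typing import Dict, Optional


class AsyncSketchRepository:
    """Hypothetical async repository backed by a dictionary."""

    def __init__(self) -> None:
        self._rows: Dict[int, str] = {}

    async def create(self, key: int, value: str) -> str:
        await asyncio.sleep(0)  # stands in for real I/O
        self._rows[key] = value
        return value

    async def retrieve(self, key: int) -> Optional[str]:
        await asyncio.sleep(0)  # stands in for real I/O
        return self._rows.get(key)


async def main() -> Optional[str]:
    repo = AsyncSketchRepository()
    await repo.create(1234, "Important data")
    return await repo.retrieve(1234)


# asyncio.run starts an event loop, runs the coroutine and returns its result.
result = asyncio.run(main())
```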
Imagine now that, as the data grows in size, we would like to move the persistence layer to a more structured and scalable backend, such as a NoSQL document-based data platform like MongoDB. We only need to write new business logic to serialize/deserialize our class to and from JSON (represented in Python by a dictionary):
from bson import ObjectId
from py4ai.core.data.layer.mongo.serializer import create_mongo_id
from py4ai.core.data.layer.common.serialiazer import DataSerializer

class MongoDataSerializer(DataSerializer[int, ObjectId, Entity, dict]):
    def get_key(self, entity: Entity) -> int:
        return entity.my_id

    def to_object(self, entity: Entity) -> dict:
        doc = entity.dict()
        doc["_id"] = self.to_object_key(self.get_key(entity))
        return doc

    def to_entity(self, document: dict) -> Entity:
        return Entity(**document)

    def to_object_key(self, key: int) -> ObjectId:
        # This converts a string into a hash compatible with the MongoDB id format
        return create_mongo_id(str(key))
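How `create_mongo_id` computes its result is not shown here. As an illustration of the general idea only, a deterministic identifier of the right size can be derived from a string by hashing it and keeping 24 hexadecimal characters, since an ObjectId is 12 bytes long. The helper `sketch_mongo_id` and the choice of SHA-256 are assumptions made for this sketch, not the library's implementation.

```python
import hashlib


def sketch_mongo_id(value: str) -> str:
    """Hypothetical helper: derive a 24-character hex id from a string."""
    digest = hashlib.sha256(value.encode("utf-8")).hexdigest()  # 64 hex characters
    return digest[:24]  # 12 bytes, the size of a MongoDB ObjectId


first = sketch_mongo_id("1234")
second = sketch_mongo_id("1234")
other = sketch_mongo_id("1235")
```

The same input always yields the same id, which is what lets `to_object_key` locate a stored entity again from its domain key.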
A new repository based on the MongoDB persistence layer can now be created using:
from py4ai.core.data.layer.mongo.repository import MongoRepository
repo = MongoRepository(collection, MongoDataSerializer())
This repository exposes the same method signatures as the previous one, so it can be used as a drop-in replacement.
Abstracting Data Querying
The Repository abstraction also allows retrieving data based on queries/filters:
entities = repo.retrieve_by_criteria(criteria)
However, the format of the query depends on the type of persistence layer and, more specifically, on how the data is organized. Therefore, to decouple the business logic from the underlying persistence layer, we define a general class listing the possible queries for a given database:
from abc import ABC, abstractmethod
from typing import Generic, TypeVar

from py4ai.core.data.layer.common.criteria import SearchCriteria

# Type of the backend-specific query wrapped by the criteria
Q = TypeVar("Q")

class EntityCriteriaFactory(ABC, Generic[Q]):
    @abstractmethod
    def by_id(self, my_id: int) -> SearchCriteria[Q]:
        ...
When considering a particular persistence layer, the querying business logic needs to be specified:
from typing import Any, Dict

from py4ai.core.data.layer.mongo.criteria import MongoSearchCriteria

class MongoCriteriaFactory(EntityCriteriaFactory[Dict[str, Any]]):
    def by_id(self, my_id: int) -> MongoSearchCriteria:
        return MongoSearchCriteria({"my_id": my_id})

criteria = MongoCriteriaFactory()
entities = repo.retrieve_by_criteria(criteria.by_id(1234))
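To see why the factory is worth the indirection, the sketch below implements the same `by_id` query for two different backends: one returns a filter dictionary, the other a predicate applied to in-memory rows. All class names here are hypothetical and standard-library only; they mirror the pattern, not the py4ai-core classes.

```python
from abc import ABC, abstractmethod
from typing import Any, Callable, Dict, Generic, List, TypeVar

Q = TypeVar("Q")
Row = Dict[str, Any]


class SketchCriteriaFactory(ABC, Generic[Q]):
    """Hypothetical: the queries the business layer may ask for, backend-neutral."""

    @abstractmethod
    def by_id(self, my_id: int) -> Q: ...


class DictQueryFactory(SketchCriteriaFactory[Dict[str, Any]]):
    # Document-store flavour: the query is a filter dictionary.
    def by_id(self, my_id: int) -> Dict[str, Any]:
        return {"my_id": my_id}


class PredicateFactory(SketchCriteriaFactory[Callable[[Row], bool]]):
    # In-memory flavour: the query is a function applied to each row.
    def by_id(self, my_id: int) -> Callable[[Row], bool]:
        return lambda row: row["my_id"] == my_id


rows: List[Row] = [{"my_id": 1234, "my_data": "a"}, {"my_id": 1235, "my_data": "b"}]
is_match = PredicateFactory().by_id(1234)
matches = [row for row in rows if is_match(row)]
mongo_style = DictQueryFactory().by_id(1234)
```

Business code asks for `by_id(1234)` in both cases; only the factory knows what a query looks like for its backend.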
Note that SearchCriteria objects can also be joined using logical operators:
entities = repo.retrieve_by_criteria(
    criteria.by_id(1234) & criteria.by_id(1235)
)
entities = repo.retrieve_by_criteria(
    criteria.by_id(1234) | criteria.by_id(1235)
)
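One way such operators can be supported is by overloading `__and__` and `__or__` so that two criteria combine into a single query. The sketch below does this for MongoDB-style filter dictionaries using the `$and` and `$or` query operators. `SketchCriteria` is a hypothetical class, and the way py4ai-core combines criteria internally may differ.

```python
from typing import Any, Dict


class SketchCriteria:
    """Hypothetical criteria object wrapping a MongoDB-style filter dictionary."""

    def __init__(self, query: Dict[str, Any]) -> None:
        self.query = query

    def __and__(self, other: "SketchCriteria") -> "SketchCriteria":
        # Both filters must hold.
        return SketchCriteria({"$and": [self.query, other.query]})

    def __or__(self, other: "SketchCriteria") -> "SketchCriteria":
        # At least one filter must hold.
        return SketchCriteria({"$or": [self.query, other.query]})


def by_id(my_id: int) -> SketchCriteria:
    return SketchCriteria({"my_id": my_id})


either = by_id(1234) | by_id(1235)
both = by_id(1234) & by_id(1235)
```

Note that an entity has a single `my_id`, so the `&` combination of two different ids cannot match anything; `&` is more useful for combining criteria on different fields.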
Data Model
Creating a PandasDataset object
import pandas as pd
import numpy as np
from py4ai.core.data.model.ml import PandasDataset
dataset = PandasDataset(features=pd.concat([pd.Series([1, np.nan, 2, 3], name="feat1"),
                                            pd.Series([1, 2, 3, 4], name="feat2")], axis=1),
                        labels=pd.Series([0, 0, 0, 1], name="Label"))
# access features as a pandas dataframe
print(dataset.features.head())
# access labels as a pandas dataframe
print(dataset.labels.head())
# access features as a python dictionary
dataset.getFeaturesAs('dict')
# access features as numpy array
dataset.getFeaturesAs('array')
# indexing operations
# access features and labels at the given index as a pandas dataframe
print(dataset.loc([2]).features.head())
print(dataset.loc([2]).labels.head())
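For readers who want to relate these accessors to plain pandas, the sketch below builds the same features and labels and performs comparable conversions directly. It is only an approximation under the assumption that the dict and array views resemble `DataFrame.to_dict` and `DataFrame.to_numpy`; the exact return formats of `getFeaturesAs` may differ.

```python
import numpy as np
import pandas as pd

features = pd.concat(
    [pd.Series([1, np.nan, 2, 3], name="feat1"),
     pd.Series([1, 2, 3, 4], name="feat2")],
    axis=1,
)
labels = pd.Series([0, 0, 0, 1], name="Label")

# Dictionary view: one entry per row index, mapping feature name to value.
features_as_dict = features.to_dict(orient="index")

# Array view: a 2-D numpy array with one row per sample.
features_as_array = features.to_numpy()

# Label-based indexing with a list keeps the result tabular.
row_features = features.loc[[2]]
row_labels = labels.loc[[2]]
```

The missing value in `feat1` forces that column to a floating-point dtype, which carries over to the array view.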
Creating a PandasTimeIndexedDataset object
import pandas as pd
import numpy as np
from py4ai.core.data.model.ml import PandasTimeIndexedDataset
dateStr = [str(x) for x in pd.date_range('2010-01-01', '2010-01-04')]
dataset = PandasTimeIndexedDataset(
    features=pd.concat([
        pd.Series([1, np.nan, 2, 3], index=dateStr, name="feat1"),
        pd.Series([1, 2, 3, 4], index=dateStr, name="feat2")
    ], axis=1))
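The example passes dates as strings. What a time index buys you can be seen in plain pandas: once the index is converted to datetimes, rows can be selected by date range. Whether `PandasTimeIndexedDataset` performs this conversion itself is not stated here, so the sketch below works on a bare DataFrame and makes the conversion explicit.

```python
import numpy as np
import pandas as pd

date_str = [str(x) for x in pd.date_range("2010-01-01", "2010-01-04")]
features = pd.concat(
    [pd.Series([1, np.nan, 2, 3], index=date_str, name="feat1"),
     pd.Series([1, 2, 3, 4], index=date_str, name="feat2")],
    axis=1,
)

# The index still holds strings here; converting it enables date-based slicing.
features.index = pd.to_datetime(features.index)

# Both endpoints are included when slicing a DatetimeIndex by label.
window = features.loc["2010-01-02":"2010-01-03"]
```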
How to contribute?
We welcome any kind of contribution, whether it is a bug report, a bug fix, a contribution to the existing codebase or an improvement to the documentation.
Where to start?
Please look at the GitHub issues tab to start working on open issues.
Contributing to py4ai-core
Please make sure the general guidelines for contributing to the code base are respected:
- Fork the py4ai-core repository.
- Create/choose an issue to work on in the GitHub issues page.
- Create a new branch to work on the issue.
- Commit your changes and run the tests to make sure the changes do not break any test.
- Open a Pull Request on GitHub referencing the issue.
- Once the PR is approved, the maintainers will merge it on the main branch.
File details
Details for the file py4ai-core-1.0.0.tar.gz.
File metadata
- Download URL: py4ai-core-1.0.0.tar.gz
- Upload date:
- Size: 41.7 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.1 CPython/3.11.1
File hashes
Algorithm | Hash digest
---|---
SHA256 | 13266824e4af70d8beb88a19283adf1d208ce707e9c0fe3ac848f8b006b4984e
MD5 | 08bcc9946c085d722ddf843bde35411b
BLAKE2b-256 | cd7ac2361ab9195c4c8c9a0f3bcd1151ebef4bbf6e5ffa6cf5f76565f0da7dfc
File details
Details for the file py4ai_core-1.0.0-py3-none-any.whl.
File metadata
- Download URL: py4ai_core-1.0.0-py3-none-any.whl
- Upload date:
- Size: 23.4 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.1 CPython/3.11.1
File hashes
Algorithm | Hash digest
---|---
SHA256 | 80cfdb2a88901f8771b13fea64aa696726e352e589b7258fd042e7f3eae3a4c3
MD5 | a9e61355881f2f28f23a6d8c10d6fdd9
BLAKE2b-256 | 8c6871c88db259a6a9b22e9c27a8f6cf7311f8b2bf847d2e7acb5c18c0d67f8b