A Python library of data structures optimized for machine learning tasks
CGnal core
A Python library defining data structures optimized for machine learning pipelines
What is it?
cgnal-core is a Python package with a modular design that provides abstractions for building data ingestion pipelines and running end-to-end machine learning pipelines. The library offers a lightweight object-oriented interface to MongoDB as well as pandas-based data structures. Its aim is to support the development of machine learning applications with a focus on clean code and modular design.
Features
Some cool features that we are proud to mention are:
Data layers
- Archiver: Offers an object-oriented design to perform ETL on MongoDB collections as well as pandas DataFrames.
- DAO: Data Access Object that lets archivers serialize domain objects into the representation required by the persistence layer (e.g. for MongoDB, a DAO serializes a domain object into a MongoDB document) and parse objects retrieved from that layer into the framework's own representation (e.g. text is parsed into a Document, while tabular data is parsed into a pandas DataFrame).
- Database: Object representing a relational database.
- Table: Object representing a table of a relational database.
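As a rough illustration of the DAO idea described above, the sketch below converts a domain object into a MongoDB-style dictionary and back. `Note`, `NoteDAO` and the `parse` method are hypothetical names chosen for this example and are not taken from cgnal-core; `get` simply echoes the `dao.get` call used in the Examples section. Only the standard library is used.

```python
from dataclasses import dataclass, field
from typing import Any, Dict
from uuid import uuid4


@dataclass
class Note:
    """Hypothetical domain object: an identifier plus a dictionary of data."""
    uuid: str
    data: Dict[str, Any] = field(default_factory=dict)


class NoteDAO:
    """Hypothetical DAO translating between Note objects and plain dicts."""

    def get(self, note: Note) -> Dict[str, Any]:
        # Serialize: the domain object becomes a MongoDB-style document.
        return {"_id": note.uuid, **note.data}

    def parse(self, raw: Dict[str, Any]) -> Note:
        # Deserialize: split the stored identifier from the payload.
        payload = {k: v for k, v in raw.items() if k != "_id"}
        return Note(uuid=raw["_id"], data=payload)


dao = NoteDAO()
original = Note(uuid=str(uuid4()), data={"text": "hello", "lang": "en"})
stored = dao.get(original)    # what would be written to the persistence layer
restored = dao.parse(stored)  # what an archiver would hand back to the caller
```

Keeping both directions in one small object means the storage format can change without touching the code that works with domain objects.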
Data Model
Offers the following data structures:
- Document: Data structure designed for NLP applications that parses a JSON-like document into a pair made of a uuid and a dictionary of information.
- Sample: Data structure representing an observation (a.k.a. sample) as used in machine learning applications.
- MultiFeatureSample: Data structure representing an observation defined by a nested list of arrays.
- Dataset: Data structure designed specifically for machine learning applications, representing a collection of samples.
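To make the relationship between a sample and a dataset concrete, here is a minimal sketch using plain dataclasses. `ToySample` and `ToyDataset` are hypothetical stand-ins written for this example; they are not the classes shipped with cgnal-core and make no claim about their actual attributes.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass(frozen=True)
class ToySample:
    """Hypothetical observation: a feature vector and an optional label."""
    features: Sequence[float]
    label: Optional[int] = None


@dataclass
class ToyDataset:
    """Hypothetical collection of samples."""
    samples: List[ToySample]

    def features(self) -> List[List[float]]:
        # One row of features per sample.
        return [list(s.features) for s in self.samples]

    def labels(self) -> List[Optional[int]]:
        # One label per sample, in the same order as the features.
        return [s.label for s in self.samples]


dataset = ToyDataset([
    ToySample([1.0, 1.0], 0),
    ToySample([2.0, 3.0], 0),
    ToySample([3.0, 4.0], 1),
])
feature_rows = dataset.features()
label_column = dataset.labels()
```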
Installation
From pypi server
pip install cgnal-core
From source
git clone https://github.com/CGnal/cgnal-core
cd cgnal-core
make install
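After either installation route, one way to confirm that the package is visible to the current interpreter is to query its metadata with the standard library. This is only a suggested check, assuming Python 3.8 or later; `installed_version` is a helper name introduced here.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(distribution: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if absent."""
    try:
        return version(distribution)
    except PackageNotFoundError:
        return None


# Prints the version string if cgnal-core is installed, otherwise None.
print(installed_version("cgnal-core"))
```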
Tests
make tests
Checks
To run predefined checks (unit-tests, linting checks, formatting checks and static typing checks):
make checks
Examples
Data Layers
Creating a Database of Table objects
import pandas as pd
from cgnal.core.data.layer.pandas.databases import Database
# sample df
df1 = pd.DataFrame([[1, 2, 3], [6, 5, 4]], columns=['a', 'b', 'c'])
# creating a database
db = Database('/path/to/db')
table1 = db.table('df1')
# write table to path
table1.write(df1)
# get path
table1.filename
# convert to pandas dataframe
table1.to_df()
# get table from database
db['df1']
Using an Archiver with Dao objects
from cgnal.core.data.layer.pandas.archivers import CsvArchiver
from cgnal.core.data.layer.pandas.dao import DataFrameDAO
# create a dao object
dao = DataFrameDAO()
# create a csv archiver
arch = CsvArchiver('/path/to/csvfile.csv', dao)
# get pandas dataframe
arch.data
# retrieve a single document object
doc = next(arch.retrieve())
# retrieve a list of document objects
docs = list(arch.retrieve())
# retrieve a document by its id
arch.retrieveById(doc.uuid)
# retrieve a single document to modify
doc = next(arch.retrieve())
# update column_name field of the document with the given value
value = 'new value'  # placeholder value
doc.data.update({'column_name': value})
# archive the document
arch.archiveOne(doc)
# archive list of documents
arch.archiveMany([doc, doc])
# get a document object as a pandas series
arch.dao.get(doc)
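The retrieve, update, archive cycle shown above can be pictured with a small in-memory stand-in. The method names mirror those used in the example (`retrieve`, `retrieveById`, `archiveOne`, `archiveMany`), but `Doc` and `InMemoryArchiver` are hypothetical classes written for this sketch and say nothing about how `CsvArchiver` is implemented.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Iterable, Iterator


@dataclass
class Doc:
    """Hypothetical stand-in for a document: a uuid plus a data dictionary."""
    uuid: str
    data: Dict[str, Any] = field(default_factory=dict)


class InMemoryArchiver:
    """Hypothetical archiver that keeps documents in a dict keyed by uuid."""

    def __init__(self) -> None:
        self._store: Dict[str, Doc] = {}

    def archiveOne(self, doc: Doc) -> "InMemoryArchiver":
        # Insert the document, or overwrite the one with the same uuid.
        self._store[doc.uuid] = doc
        return self

    def archiveMany(self, docs: Iterable[Doc]) -> "InMemoryArchiver":
        for doc in docs:
            self.archiveOne(doc)
        return self

    def retrieve(self) -> Iterator[Doc]:
        # Yield documents lazily, in insertion order.
        yield from self._store.values()

    def retrieveById(self, uuid: str) -> Doc:
        return self._store[uuid]


arch = InMemoryArchiver()
arch.archiveMany([Doc("a", {"x": 1}), Doc("b", {"x": 2})])

doc = next(arch.retrieve())   # first stored document
doc.data.update({"x": 10})    # modify one field
arch.archiveOne(doc)          # write it back under the same uuid
```

Because documents are keyed by uuid in this sketch, archiving the same document twice overwrites it rather than duplicating it.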
Data Model
Creating a PandasDataset object
import numpy as np
import pandas as pd
from cgnal.core.data.model.ml import PandasDataset
dataset = PandasDataset(
    features=pd.concat([
        pd.Series([1, np.nan, 2, 3], name="feat1"),
        pd.Series([1, 2, 3, 4], name="feat2")
    ], axis=1),
    labels=pd.Series([0, 0, 0, 1], name="Label"))
# access features as a pandas dataframe
dataset.features
# access labels as a pandas dataframe
dataset.labels
# access features as a python dictionary
dataset.getFeaturesAs('dict')
# access features as numpy array
dataset.getFeaturesAs('array')
# indexing operations
# access features and labels at the given index as a pandas dataframe
dataset.loc(2).features
dataset.loc(2).labels
Creating a PandasTimeIndexedDataset object
import numpy as np
import pandas as pd
from cgnal.core.data.model.ml import PandasTimeIndexedDataset
# string dates used as the index of each feature series
dateStr = [str(x) for x in pd.date_range('2010-01-01', '2010-01-04')]
dataset = PandasTimeIndexedDataset(
    features=pd.concat([
        pd.Series([1, np.nan, 2, 3], index=dateStr, name="feat1"),
        pd.Series([1, 2, 3, 4], index=dateStr, name="feat2")
    ], axis=1))
How to contribute?
We welcome any kind of contribution, whether it is a bug report, a bug fix, a contribution to the existing codebase or an improvement to the documentation.
Where to start?
Please look at the GitHub issues tab to start working on open issues.
Contributing to cgnal-core
Please make sure the following general guidelines for contributing to the code base are respected:
- Fork the cgnal-core repository.
- Create/choose an issue to work on in the GitHub issues page.
- Create a new branch to work on the issue.
- Commit your changes and run the tests to make sure the changes do not break any test.
- Open a Pull Request on GitHub referencing the issue.
- Once the PR is approved, the maintainers will merge it on the main branch.
Hashes for cgnal_core-2.0.2-py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | 61daad3e544e7223b6bc2c13c06a00f5ad939aa0ef87da7faf4b23185d763f7f
MD5 | 49eabed1f1e5c0a8f2be1a4042d8af73
BLAKE2b-256 | 432f898d8d64b0f42959130cb3edd4d4ae59342fcf47b2dfbde02cffea574b5d
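A downloaded wheel can be checked against the SHA256 digest listed above with the standard library. The sketch below assumes the file was saved in the current directory under its published name; `sha256_of` is a helper name introduced here.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Compute the SHA256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Digest copied from the hash table for cgnal_core-2.0.2-py3-none-any.whl.
EXPECTED = "61daad3e544e7223b6bc2c13c06a00f5ad939aa0ef87da7faf4b23185d763f7f"
wheel = Path("cgnal_core-2.0.2-py3-none-any.whl")  # assumed download location

if wheel.exists():
    print("match" if sha256_of(wheel) == EXPECTED else "MISMATCH")
```

Reading in chunks keeps memory use flat regardless of file size; the comparison only runs when the file is actually present.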