A Python library of data structures optimized for machine learning tasks

CGnal core


A Python library defining data structures optimized for machine learning pipelines

What is it?

cgnal-core is a modular Python package that provides abstractions for building data ingestion pipelines and running end-to-end machine learning pipelines. The library offers a lightweight object-oriented interface to MongoDB as well as pandas-based data structures. Its aim is to support the development of machine-learning-based applications, with a focus on clean code and modular design.

Features

Some cool features that we are proud to mention are:

Data layers

  1. Archiver: Offers an object-oriented design to perform ETL on MongoDB collections as well as pandas DataFrames.
  2. DAO: Data Access Object that lets archivers serialize domain objects into the representation required by the persistence layer (e.g. for MongoDB, a DAO serializes a domain object into a MongoDB document) and parse objects retrieved from that layer into the correct representation in our framework (e.g. text is parsed into a Document, while tabular data is parsed into a pandas DataFrame).
  3. Database: Object representing a relational database
  4. Table: Object representing a table of a relational database
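
The division of labour between an archiver and a DAO described above can be illustrated with a small, library-independent sketch. Every class and method name below (DictDAO, InMemoryArchiver, archive_one, retrieve_by_id) is hypothetical and chosen only for illustration; this is not the cgnal-core implementation, just the general pattern under the assumption that records are plain dictionaries keyed by "_id".

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Iterator
from uuid import uuid4


@dataclass
class Document:
    """Illustrative domain object: a unique id plus a dictionary of information."""
    uuid: str
    data: Dict[str, Any] = field(default_factory=dict)


class DictDAO:
    """Hypothetical DAO translating between Documents and plain dict records."""

    def get(self, doc: Document) -> Dict[str, Any]:
        # Serialize a domain object into the record kept by the persistence layer.
        return {"_id": doc.uuid, **doc.data}

    def parse(self, record: Dict[str, Any]) -> Document:
        # Rebuild the domain object from a stored record.
        data = {key: value for key, value in record.items() if key != "_id"}
        return Document(uuid=record["_id"], data=data)


class InMemoryArchiver:
    """Hypothetical archiver: owns the storage, delegates conversion to the DAO."""

    def __init__(self, dao: DictDAO) -> None:
        self.dao = dao
        self._store: Dict[str, Dict[str, Any]] = {}

    def archive_one(self, doc: Document) -> None:
        record = self.dao.get(doc)
        self._store[record["_id"]] = record

    def retrieve(self) -> Iterator[Document]:
        for record in self._store.values():
            yield self.dao.parse(record)

    def retrieve_by_id(self, uuid: str) -> Document:
        return self.dao.parse(self._store[uuid])


archiver = InMemoryArchiver(DictDAO())
original = Document(uuid=str(uuid4()), data={"text": "hello", "lang": "en"})
archiver.archive_one(original)
restored = archiver.retrieve_by_id(original.uuid)
```

Because the archiver only ever talks to the DAO, swapping the storage backend in such a design would mean replacing the DAO and the store while leaving the domain objects untouched.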

Data Model

Offers the following data structures:

  1. Document: Data structure designed for NLP applications that parses a JSON-like document into a pair consisting of a uuid and a dictionary of information.
  2. Sample: Data structure representing an observation (a.k.a. sample) as used in machine learning applications.
  3. MultiFeatureSample: Data structure representing an observation defined by a nested list of arrays.
  4. Dataset: Data structure representing a collection of samples, designed specifically for machine learning applications.
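
To give a feel for how a sample relates to a dataset, here is a minimal sketch using only the standard library. The classes SketchSample and SketchDataset and their attributes are hypothetical stand-ins, not the classes shipped with cgnal-core; the sketch assumes a sample is a feature vector with an optional label and a dataset is an ordered collection of such samples.

```python
from dataclasses import dataclass
from typing import Any, Iterator, List, Optional, Sequence


@dataclass(frozen=True)
class SketchSample:
    """Hypothetical observation: a feature vector with an optional label."""
    features: Sequence[float]
    label: Optional[Any] = None


class SketchDataset:
    """Hypothetical collection of samples exposing features and labels in bulk."""

    def __init__(self, samples: Sequence[SketchSample]) -> None:
        self.samples = list(samples)

    def __len__(self) -> int:
        return len(self.samples)

    def __iter__(self) -> Iterator[SketchSample]:
        return iter(self.samples)

    def features(self) -> List[List[float]]:
        # One row per sample, in insertion order.
        return [list(sample.features) for sample in self.samples]

    def labels(self) -> List[Any]:
        return [sample.label for sample in self.samples]


sketch = SketchDataset([
    SketchSample(features=[1.0, 1.0], label=0),
    SketchSample(features=[2.0, 3.0], label=0),
    SketchSample(features=[3.0, 4.0], label=1),
])
feature_rows = sketch.features()
label_column = sketch.labels()
```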

Installation

From the PyPI server

pip install cgnal-core
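
After installing, one way to confirm which version ended up in the environment is to query the package metadata from Python. The helper name installed_version below is hypothetical; the sketch relies only on the standard library module importlib.metadata (Python 3.8+) and makes no assumption about the package's own modules.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(distribution: str = "cgnal-core") -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return version(distribution)
    except PackageNotFoundError:
        return None


found = installed_version()
print(f"cgnal-core {found}" if found else "cgnal-core is not installed")
```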

From source

git clone https://github.com/CGnal/cgnal-core
cd cgnal-core
make install

Tests

make tests

Checks

To run predefined checks (unit-tests, linting checks, formatting checks and static typing checks):

make checks

Examples

Data Layers

Creating a Database of Table objects

import pandas as pd
from cgnal.core.data.layer.pandas.databases import Database

# sample df
df1 = pd.DataFrame([[1, 2, 3], [6, 5, 4]], columns=['a', 'b', 'c'])

# creating a database 
db = Database('/path/to/db')
table1 = db.table('df1')

# write table to path
table1.write(df1)
# get path  
table1.filename

# convert to pandas dataframe 
table1.to_df()

# get table from database 
db['df1']
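
The example above treats a database as a path and each table as something written under it. The sketch below reproduces that idea with the standard library only, assuming one CSV file per table inside a directory. The on-disk format used by cgnal-core is not stated here, so the CSV choice and the names DirectoryDatabase and CsvTable are assumptions made for illustration.

```python
import csv
import tempfile
from pathlib import Path
from typing import Dict, List


class CsvTable:
    """Hypothetical table backed by a single CSV file."""

    def __init__(self, directory: Path, name: str) -> None:
        self.filename = directory / f"{name}.csv"

    def write(self, rows: List[Dict[str, object]]) -> None:
        # Assumes a non-empty list of rows sharing the same keys.
        with self.filename.open("w", newline="") as handle:
            writer = csv.DictWriter(handle, fieldnames=list(rows[0]))
            writer.writeheader()
            writer.writerows(rows)

    def read(self) -> List[Dict[str, str]]:
        # csv returns every value as a string; type conversion is left to the caller.
        with self.filename.open(newline="") as handle:
            return [dict(row) for row in csv.DictReader(handle)]


class DirectoryDatabase:
    """Hypothetical database: a directory in which every CSV file is a table."""

    def __init__(self, path: str) -> None:
        self.path = Path(path)
        self.path.mkdir(parents=True, exist_ok=True)

    def table(self, name: str) -> CsvTable:
        return CsvTable(self.path, name)

    def __getitem__(self, name: str) -> CsvTable:
        return self.table(name)

    @property
    def tables(self) -> List[str]:
        return sorted(p.stem for p in self.path.glob("*.csv"))


with tempfile.TemporaryDirectory() as tmp:
    sketch_db = DirectoryDatabase(tmp)
    sketch_db.table("df1").write([{"a": 1, "b": 2, "c": 3}, {"a": 6, "b": 5, "c": 4}])
    rows = sketch_db["df1"].read()
    names = sketch_db.tables
```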

Using an Archiver with Dao objects

from cgnal.core.data.layer.pandas.archivers import CsvArchiver
from cgnal.core.data.layer.pandas.dao import DataFrameDAO

# create a dao object 
dao = DataFrameDAO()

# create a csv archiver 
arch = CsvArchiver('/path/to/csvfile.csv', dao)

# get pandas dataframe 
arch.data

# retrieve a single document object
doc = next(arch.retrieve())
# retrieve a list of document objects
docs = list(arch.retrieve())
# retrieve a document by its id
arch.retrieveById(doc.uuid)

# update the column_name field of the document with the given value
# ('column_name' and new_value are placeholders for your own column and value)
new_value = 0
doc.data.update({'column_name': new_value})
# archive a single document
arch.archiveOne(doc)
# archive a list of documents
arch.archiveMany([doc, doc])

# get a document object as a pandas series 
arch.dao.get(doc)

Data Model

Creating a PandasDataset object

import numpy as np
import pandas as pd
from cgnal.core.data.model.ml import PandasDataset

dataset = PandasDataset(features=pd.concat([pd.Series([1, np.nan, 2, 3], name="feat1"),
                                            pd.Series([1, 2, 3, 4], name="feat2")], axis=1),
                        labels=pd.Series([0, 0, 0, 1], name="Label"))

# access features as a pandas dataframe 
dataset.features
# access labels as pandas dataframe 
dataset.labels
# access features as a python dictionary 
dataset.getFeaturesAs('dict')
# access features as numpy array 
dataset.getFeaturesAs('array')

# indexing operations 
# access features and labels at the given index as a pandas dataframe  
dataset.loc(2).features
dataset.loc(2).labels

Creating a PandasTimeIndexedDataset object

import numpy as np
import pandas as pd
from cgnal.core.data.model.ml import PandasTimeIndexedDataset

dateStr = [str(x) for x in pd.date_range('2010-01-01', '2010-01-04')]
dataset = PandasTimeIndexedDataset(
    features=pd.concat([
        pd.Series([1, np.nan, 2, 3], index=dateStr, name="feat1"),
        pd.Series([1, 2, 3, 4], index=dateStr, name="feat2")
    ], axis=1))

How to contribute?

We welcome any kind of contribution, whether it is a bug report, a bug fix, a contribution to the existing codebase or an improvement to the documentation.

Where to start?

Please look at the GitHub issues tab to start working on open issues.

Contributing to cgnal-core

Please make sure the general guidelines for contributing to the code base are respected:

  1. Fork the cgnal-core repository.
  2. Create or choose an issue to work on in the GitHub issues page.
  3. Create a new branch to work on the issue.
  4. Commit your changes and run the tests to make sure the changes do not break any test.
  5. Open a Pull Request on GitHub referencing the issue.
  6. Once the PR is approved, the maintainers will merge it into the main branch.
