
kururu - data science in the classroom

WARNING: This project will undergo major changes in the next rewrite.

[Image: Sapo Cururu - DISC 1328]

Installation

Examples

Evaluated training

from aiuna import *
from kururu import *

d = dataset("abalone").data

# Each imported step is a callable object which can be used with no parameters.
steps = binarize * split * pca * svm * metric

# After the Data object goes through the steps, its last version has the test accuracy value at 'r'.
d2 = d >> steps
print(d2.r)
"""
[0.56698565]
"""

Essential concepts

The kururu framework simplifies data-related tasks (like data science) by providing straightforward tools for a few carefully chosen general concepts:

  1. step - kururu unifies all data-related processes under the concept of (data science) step
    • steps are instances of Python classes derived from Step
    • steps can be partially configured (e.g., svm(kernel="poly"))
      or not configured at all (e.g., svm() or svm for short),
      which has two different meanings, depending on the use case:
      1. a step ready to process data, using default values for the omitted parameters, e.g.:
        • result = data >> svm(kernel="poly")  # more on `>>` later
          
      2. a sampleable (ordered¹) set of different steps, e.g.:
        • svms = svm(kernel="poly")  # take the subset of all polynomial kernel SVMs
          svm0 = ~svms  # sample a single configuration randomly, more on `~` later
          result = data >> svm0
          
  2. data - all input to (and output from) each step is a Data object
    • when in a machine learning context, this includes both training and test sets, and the results as well
    • inner/outer - steps that process two data sets expect the (outer) Data object to contain an inner data field
    • stream - steps that process several data sets (e.g., partitioned or streamed) at once expect the outer data to contain a stream field
  3. operator - steps are combined with one another, and applied to Data objects, through operators
    • product - steps are chained by the * operator, which, analogously to the previous step definition, has two different meanings, depending on the use case:
      1. a sequence of steps (i.e., a Product object) ready to process data, using default values for the omitted parameters, e.g.:
        • sequence = pca * svm(kernel="poly") 
          result = data >> sequence  # more on `>>` later
          
      2. a sampleable set of different sequences of steps, e.g.:
        • sequences = pca * svm(kernel="poly")  # take the subset of all combinations between all PCAs and polynomial kernel SVMs
          sequence = ~sequences  # sample a sequence randomly, more on `~` later
          result = data >> sequence
          

ONGOING WORK FROM HERE UNTIL THE END OF THE PAGE....

  • union - sets of steps are united by the + operator, which returns a Union object. The class Union does not strictly follow the mathematical concept of a union of sets: it can have repeated elements, for three reasons. There is no gain in enforcing such a mathematical requirement; repetition can be useful in some scenarios involving the stream field, where it is actually needed; and, more importantly, the result is ordered (this applies to all sets of steps as well). Like any Step object, Union has two different use cases (a consolidated sketch combining the operators is given after the footnote below):
    1. a set of steps ready to process data, using default values for the omitted parameters, e.g.:
      • ONGOING WORK ....
        union = mlp + svm(kernel="poly") 
        result = data >> ... * union *   # more on `...` and `>>` later
        
    2. a sampleable set of different sequences of steps, e.g.:
      • sequences = pca * svm(kernel="poly")  # take the subset of all combinations between all PCAs and polynomial kernel SVMs
        sequence = ~sequences  # sample a sequence randomly, more on `~` later
        result = data >> sequence
        

¹ Not important for the context, but it is worth mentioning that this is an ordered set, since the mathematical definition of a set has no ordering. We use the term "set" instead of "list" to avoid having to explicitly borrow all the concepts related to set operations.
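
To tie the concepts above together, here is a hedged sketch combining the `*`, `+`, `~` and `>>` operators on the steps already imported in the first example. That `~` can be applied directly to a Union, and that mlp is importable alongside the other steps, are assumptions based on the descriptions above, not confirmed API details.

from aiuna import *
from kururu import *

d = dataset("abalone").data

# A sampleable (ordered) set of classifiers: an MLP or some polynomial-kernel SVM.
classifiers = mlp + svm(kernel="poly")

# Sample one concrete classifier from the set.
classifier = ~classifiers

# Chain it after preprocessing, as in the "Evaluated training" example,
# and read the test accuracy from the result.
d2 = d >> binarize * split * pca * classifier * metric
print(d2.r)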

Data exploration versus machine learning

Whether in a console (e.g., IPython) or a temporary script (e.g., a Jupyter notebook), one sometimes needs to easily manipulate data or to easily set up a machine learning workflow. To fit both of these distinct use cases (leaning slightly towards the writing of workflows), kururu provides step suffixes. Some representative examples are given in the following table, where

  • XXX_ means the step XXX will be trained and tested on outer² data (useful for data exploration);
  • in all other cases, XXX will be trained on inner data:
    • XXXi means XXX tested on inner data
    • XXXo means XXX tested on outer data
    • XXXb means XXX tested on both inner and outer data
Use case          Name                          Shorthand (if any)  Suffixed Form  Training set  Test set(s)
data exploration  Principal Component Analysis                      PCA_           outer²        outer²
machine learning  Noise Reduction               NR                  NRi            inner         inner
machine learning  Support Vector Machines       SVM                 SVMo           inner         outer
machine learning  Principal Component Analysis  PCA                 PCAb           inner         both

² Please note that, to simplify the terminology across the documentation, the Data object is called "outer" even if it has no inner field.

Finally, for completeness, the other meaningful versions of the steps presented in the previous table are listed below.

Use case          Name                          Class to use  Training set  Test set(s)
data exploration  Principal Component Analysis  PCA_          outer         outer
machine learning  Principal Component Analysis  PCA           inner         both
data exploration  Noise Reduction               NR_           outer         outer
machine learning  Noise Reduction³              NR            inner         inner
machine learning  Support Vector Machines       SVM           inner         outer
machine learning  Support Vector Machines       SVMb          inner         both

³ Some steps (e.g., NR) may not implement the XXXb version (e.g., NRb) when it is not applicable.
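
To make the suffix convention concrete, here is a hypothetical sketch. Whether the suffixed classes from the tables (PCA_, SVMo, ...) are importable from kururu under exactly these names, and how they combine with the shorthand steps of the first example, are assumptions rather than documented facts.

from aiuna import *
from kururu import *

d = dataset("abalone").data

# Data exploration: PCA_ is trained and tested on the outer data itself.
explored = d >> binarize * PCA_

# Machine learning: split creates the inner field; SVMo trains on inner data
# and is tested on the outer data, so metric reports the outer (test) score.
evaluated = d >> binarize * split * pca * SVMo * metric
print(evaluated.r)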

Implementing custom Step classes

All fields that update a Data object should be lazy, i.e., each should be a callable taking no arguments. An easy way to do that is to put lambda: before the value intended to be returned. Usually, however, the return value depends on fields of a previous Data object; such fields should be accessed only from within the callable/lambda for a proper (iterator-safe) implementation.

Example: Creating a custom step

from akangatu.distep import DIStep


# DIStep means "Data Independent Step", i.e. it does not depend on previously known data.
class MyMultiplicationStep(DIStep):
    """Multiplies the given field by a factor."""

    def __init__(self, field, factor):
        # All relevant step parameters should be passed to super() as keyword arguments.
        super().__init__(field=field, factor=factor)

        # Instance attributes are set as usual.
        self.field = field
        self.factor = factor

    def _process_(self, data):
        # All calculations (including access to data fields)
        #   are deferred to a future access to the returned field, R in this case.
        return data.update(self, R=lambda: data[self.field] * self.factor)
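
For illustration, a hypothetical usage sketch of the custom step defined above follows; the field name "X" and the chaining via `>>` are assumptions based on the earlier examples, not confirmed API details.

from aiuna import *

d = dataset("abalone").data

# Chain the custom step like any other step; the new field R is only
# computed when it is actually accessed.
d2 = d >> MyMultiplicationStep(field="X", factor=2)
print(d2.R)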

Field name rules

A field named with a single letter has a lower case shortcut for automatic conversion from matrix to scalar/vector and vice versa. Suffixes: ... Reserved names: ...
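
For instance, continuing the "Evaluated training" example above, a result field R would have a lower-case shortcut r; the sketch below assumes d2.R holds the matrix form and d2.r its scalar/vector conversion.

# Hypothetical illustration of the single-letter shortcut (see "Evaluated training" above).
print(d2.R)  # matrix form of the field
print(d2.r)  # lower-case shortcut: automatic conversion to scalar/vector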

Contribution

Nothing clear yet, but one way to contribute is by creating your own repository (to be listed here as a partner), using this repository (and/or other related ones) as a dependency. Monkey-patching can be used if you urgently need to integrate a module into the same class tree used here; alternatively, ask for access to this repository or submit a pull request. The software architecture was planned with that in mind: it provides clear interface classes to guide the implementer/IDE, and each repository has a specific, well-defined purpose.

Grants

Part of the effort spent on the present code was kindly supported by Fapesp, under the supervision of Prof. André C. P. L. F. de Carvalho, at CEPID-CeMEAI (Grants 2013/07375-0 – 2019/01735-0).

History

Apart from dependencies like sklearn and other libraries, the novel ideas presented here are the result of a years-long process of drafts, thinking, trial and error, and rewrites from scratch in several languages, going from Delphi, through Haskell, Java and Scala, to Python. The fundamental concepts were loosely borrowed from basic category theory, such as algebraic data structures, which permeate many recent trends, e.g., in programming language design.

For code (and academic) details, refer to the following projects (only a few of them are currently usable):

2003 [TUPI imaging: there was no widespread use of git at that time / retroactive repo yet to be created]

2006 [Multicore NN: there was no widespread use of git at that time / retroactive repo yet to be created]

2013 Functional language parser/interpreter https://github.com/davips/lamdheal-j

2014 Machine learning library including Weka algorithms, optimized immutable data structure and models, hand-made BLAS/LAPACK neural networks, transparent distributed processing (in conjunction with active-learning-scala), plotting, evaluation, early replicability https://github.com/davips/mls

2015 Active learning library https://github.com/davips/active-learning-scala

2016 Thesis and dataset generation and visualization https://github.com/davips/tese https://github.com/davips/knowledge-boundary https://github.com/davips/image2arff

2018 Gaussian processes https://github.com/davips/surface

2019 Client to generate reports from stored results https://github.com/davips/mysql2csv

2020 Python project where previous attempts and evolving ideas were tested https://github.com/davips/pjml-may_archived
