
.. |Codecov| image:: https://img.shields.io/codecov/c/github/SIMPLE-DVS/rain
   :alt: Codecov
   :target: https://app.codecov.io/gh/SIMPLE-DVS/rain

.. |License| image:: https://img.shields.io/badge/License-GPLv3-blue.svg

|Codecov| |License|

====
Rain
====

.. this is a comment, insert badge here

   .. image:: https://img.shields.io/pypi/v/rain.svg
      :target: https://pypi.python.org/pypi/rain

   .. image:: https://img.shields.io/travis/SIMPLE-DVS/rain.svg
      :target: https://travis-ci.com/SIMPLE-DVS/rain

What is it?
-----------

Rain is a Python library that supports data scientists in developing data pipelines, here called Dataflows, in a rapid and easy way following a declarative approach. In particular, it helps with data preparation/engineering, where data are processed, and with data analysis, which consists in defining the most suitable learning algorithm.

Rain contains a collection of nodes that abstract functions of the main Python ML libraries, such as Scikit-learn, Pandas and PySpark. The ability to combine multiple Python libraries and the possibility to define new nodes or add support for other libraries are Rain's main strengths. Currently the library contains several nodes implementing Anomaly Detection strategies.

Dataflow
--------

A DataFlow represents a Directed Acyclic Graph (DAG). Since a DataFlow may be executed on a remote machine, the acyclicity of the DAG must be ensured to avoid deadlocks.

Nodes can be added to the DataFlow and connected to each other by edges. A node can be seen as a meta-function, a combination of several methods of a particular ML library embedded in Rain, that provides one or more functionalities (for instance, a Pandas node/meta-function could compute the mean of a column and then round it to a given number of decimals).
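Stripped of the Rain wrapper, such a meta-function reduces to a couple of plain Pandas calls; the following is a minimal sketch (the column name and values are illustrative, not part of Rain's API):

.. code-block:: python

    import pandas as pd

    # What the hypothetical node above would compute internally: the mean
    # of a column, rounded to a given number of decimals.
    data = pd.DataFrame({"sepal_length": [5.1, 4.9, 4.7]})
    rounded_mean = round(data["sepal_length"].mean(), 2)
    print(rounded_mean)  # 4.9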

Edges connect meta-function outputs to meta-function inputs using a specific semantics. In general, an output can be connected to an input if and only if their types match (semantic verification). Moreover, an output can have one or more outgoing edges, while an input can have at most one ingoing edge.

The library also contains the so-called executors to run the Dataflow. Currently there are the Local executor, where the computation is performed on a single local machine, and the Spark executor, which harnesses an Apache Spark cluster. A DataFlow is run on a single device because the data transformed by a node are passed directly to the following ones.
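The executor is chosen when the Dataflow is instantiated, as in the minimal sketch below (the LocalExecutor call matches the QuickStart; the Spark executor's exact class name is an assumption and should be checked against the executors documentation):

.. code-block:: python

    import rain

    # The executor is fixed when the DataFlow is created. LocalExecutor
    # runs the whole graph on the local machine, as in the QuickStart.
    df = rain.DataFlow("df1", executor=rain.LocalExecutor())

    # To target an Apache Spark cluster, pass the Spark executor instead.
    # NOTE: the class name below is an assumption, not confirmed by this
    # README -- check the executors documentation for the actual name.
    # df_spark = rain.DataFlow("df2", executor=rain.SparkExecutor())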

Installation
------------

The library can be used stand-alone in Python simply by installing it.

To install Rain, run this command in your terminal (preferred way to install the most recent stable release):

.. code-block:: console

    $ pip install git+https://github.com/SIMPLE-DVS/rain.git

It is also possible to install Rain with all the optional dependencies by running the following command:

.. code-block:: console

$ pip install "rain[full] @ git+https://github.com/SIMPLE-DVS/rain"

If you don't have pip_ installed, this `Python installation guide`_ can guide you through the process.

.. _pip: https://pip.pypa.io
.. _Python installation guide: http://docs.python-guide.org/en/latest/starting/installation/

Furthermore, the tool comes with a back-end that leverages the library and exposes its functionalities to a GUI, which eases the use of the library itself.

QuickStart
----------

Here we provide a simple Python script in which Rain is used and a Dataflow is configured::

    import rain

    df = rain.DataFlow("df1", executor=rain.LocalExecutor())

    csv_loader = rain.PandasCSVLoader("load", path="./iris.csv")
    filter_col = rain.PandasColumnsFiltering("filter", column_indexes=[0, 1])
    writer = rain.PandasCSVWriter("write", path="./new_iris.csv")

    df.add_edges(
        csv_loader @ "dataset" > filter_col @ "dataset",
        filter_col @ "transformed_dataset" > writer @ "dataset"
    )

    df.execute()

In the above script we:

* first import the library;

* instantiate a Dataflow (with id "df1", referenced as df), passing a Local Executor, meaning that the Dataflow will be executed on the local machine that runs the script;

* instantiate 3 nodes (csv_loader, filter_col, writer):

  * the first one loads the "iris.csv" file, containing the Iris dataset, from the current directory, using the PandasCSVLoader node;

  * the second node keeps only the columns at indexes 0 and 1, using a PandasColumnsFiltering node with its column_indexes parameter;

  * the last one saves the transformed dataset to a new file called "new_iris.csv", using the PandasCSVWriter node;

* create 2 edges to link the 3 nodes:

  * the dataset output variable of the csv_loader node is sent to the dataset input variable of the filter_col node;

  * the transformed_dataset output of filter_col is then sent to the dataset input of the writer node;

* finally call the execute method of the Dataflow df. When the script is run we get the expected result, which can be checked as in the sketch below.
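As a quick sanity check (a sketch, assuming the script above was run in the same directory), the written file can be inspected with plain Pandas:

.. code-block:: python

    import pandas as pd

    # new_iris.csv was produced by the writer node and should contain only
    # the two columns selected by filter_col (indexes 0 and 1).
    result = pd.read_csv("./new_iris.csv")
    print(result.head())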

In general, to use the library you have to perform the following steps:

* create a Dataflow, specifying the type of executor;

* define all the nodes with the desired parameters to achieve your ML task;

* define the edges to link the nodes using the specific semantics (see the sketch after this list):

  * > is the symbol used to create an edge: on its left you specify the output of the source node, and on its right the input of the destination node;

  * @ is the symbol used to access an input/output variable of a node: on its left you specify the node (the variable name it is assigned to in the script), and on its right the name of the output/input variable of the source/destination node;

* execute the Dataflow and run the script.
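For instance, recalling that an output can feed several outgoing edges while an input accepts at most one, an alternative wiring of the QuickStart nodes would read as follows (a sketch; here the raw dataset goes both to the filter and directly to the writer):

.. code-block:: python

    # One output may have several outgoing edges (fan-out), while each
    # input accepts at most one ingoing edge. Reusing the QuickStart nodes:
    df.add_edges(
        csv_loader @ "dataset" > filter_col @ "dataset",  # first consumer
        csv_loader @ "dataset" > writer @ "dataset",      # same output, second edge
    )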

More information about Rain usage, the edges' semantics and all the available executors can be found here_. A complete description of all the available nodes, with their behavior, accepted parameters, inputs and outputs, is available at this link_.

.. _here: https://rain-library.readthedocs.io/en/latest/usage.html
.. _link: https://rain-library.readthedocs.io/en/latest/rain.nodes.html

Full Documentation
------------------

To build the full documentation, follow these steps:

* Install Sphinx and the Sphinx theme specified in the requirements_dev.txt file, or install all the requirements listed in that file (suggested choice).

* From the main directory, cd to the 'docs' directory:

.. code-block:: console

    $ cd docs

Then run 'make.bat singlehtml' on Windows, or run the following command:

.. code-block:: console

    $ sphinx-build . ./_build

The _build directory will contain the HTML files; open the index.html file to read the full documentation.

Authors
-------

* Alessandro Antinori, Rosario Capparuccia, Riccardo Coltrinari, Flavio Corradini, Marco Piangerelli, Barbara Re, Marco Scarpetta, Luca Mozzoni, Vincenzo Nucci
