.. |Codecov| image:: https://img.shields.io/codecov/c/github/SIMPLE-DVS/rain
   :alt: Codecov
   :target: https://app.codecov.io/gh/SIMPLE-DVS/rain

.. |License| image:: https://img.shields.io/badge/License-GPLv3-blue.svg

|Codecov| |License|

====
Rain
====

.. this is a comment, insert badge here
   .. image:: https://img.shields.io/pypi/v/rain.svg
      :target: https://pypi.python.org/pypi/rain
   .. image:: https://img.shields.io/travis/SIMPLE-DVS/rain.svg
      :target: https://travis-ci.com/SIMPLE-DVS/rain

What is it?

Rain is a Python library that supports data scientists in developing data pipelines, here called Dataflows, quickly and easily through a declarative approach. In particular, it helps with data preparation/engineering, where data are processed, and with data analysis, which consists of defining the most suitable learning algorithm.

Rain contains a collection of nodes that abstract functions of the main Python ML libraries, such as Scikit-learn, Pandas and PySpark. Rain's main strengths are the ability to combine multiple Python libraries and the possibility of defining new nodes or adding support for other libraries. Currently the library contains several nodes for Anomaly Detection strategies.

Dataflow

A DataFlow represents a Directed Acyclic Graph (DAG). Because a DataFlow may be executed on a remote machine, the acyclicity of the graph must be ensured to avoid deadlocks.
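
The acyclicity requirement can be illustrated with a minimal sketch. This is not Rain's own implementation: the function name ``is_acyclic`` and the plain edge-list representation are assumptions made for illustration. The sketch uses Kahn's algorithm, which repeatedly removes nodes that have no remaining incoming edges; if some nodes can never be removed, the graph contains a cycle.

.. code-block:: python

    from collections import deque


    def is_acyclic(nodes, edges):
        """Return True if the directed graph has no cycle (Kahn's algorithm).

        ``nodes`` is an iterable of hashable node ids and ``edges`` a list of
        (source, destination) pairs. Both names are hypothetical.
        """
        indegree = {n: 0 for n in nodes}
        successors = {n: [] for n in nodes}
        for src, dst in edges:
            successors[src].append(dst)
            indegree[dst] += 1

        # Start from the nodes that depend on nothing.
        ready = deque(n for n, d in indegree.items() if d == 0)
        visited = 0
        while ready:
            node = ready.popleft()
            visited += 1
            for nxt in successors[node]:
                indegree[nxt] -= 1
                if indegree[nxt] == 0:
                    ready.append(nxt)

        # Every node was removed only if no cycle blocked the process.
        return visited == len(indegree)

A linear chain such as load, filter, write passes the check, while two nodes that feed each other do not.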

Nodes can be added to the DataFlow and connected to each other by edges. A node can be seen as a meta-function: a combination of several methods of a particular ML library embedded in Rain that provides one or more functionalities (for instance, a Pandas node/meta-function could compute the mean of a column and then round it to a given number of decimals).

Edges connect the outputs of meta-functions to the inputs of other meta-functions following specific semantics. In general, an output can be connected to an input if and only if their types match (semantic verification). Moreover, an output can have one or more outgoing edges, while an input can have at most one incoming edge.
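
The two edge rules can be sketched as a small validation routine. Everything here is hypothetical and only illustrates the rules stated above: the ``Port`` class, the ``validate_edges`` function and the use of exact type identity as the meaning of "types match" are assumptions, not Rain's API.

.. code-block:: python

    from dataclasses import dataclass


    @dataclass(frozen=True)
    class Port:
        """Hypothetical description of one input or output variable."""
        node: str
        name: str
        type: type


    def validate_edges(edges):
        """Check a list of (output_port, input_port) pairs.

        Rule 1: the output type must match the input type.
        Rule 2: an input accepts at most one incoming edge.
        Outputs may appear in any number of edges.
        """
        used_inputs = set()
        for out_port, in_port in edges:
            if out_port.type is not in_port.type:
                raise TypeError(
                    f"{out_port.node}.{out_port.name} does not match "
                    f"{in_port.node}.{in_port.name}"
                )
            key = (in_port.node, in_port.name)
            if key in used_inputs:
                raise ValueError(
                    f"{in_port.node}.{in_port.name} already has an incoming edge"
                )
            used_inputs.add(key)

With this sketch, one output feeding two different inputs is accepted, while two edges into the same input raise ``ValueError`` and mismatched types raise ``TypeError``.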

The library also contains so-called executors to run a Dataflow. Currently there are the Local executor, where the computation is performed on a single local machine, and the Spark executor, which harnesses an Apache Spark cluster. A DataFlow runs on a single device because the data transformed by a node are passed directly to the following ones.
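
The idea of passing results directly from one node to the next can be sketched with the standard library alone. This is an illustration under stated assumptions, not the code of Rain's executors: ``run_locally`` and its arguments are hypothetical, each node is reduced to a plain callable, and the ordering comes from ``graphlib.TopologicalSorter`` (Python 3.9 or later).

.. code-block:: python

    from graphlib import TopologicalSorter


    def run_locally(functions, edges):
        """Run callables in dependency order, keeping all results in memory.

        ``functions`` maps a node id to a callable that receives the results
        of its predecessors as positional arguments; ``edges`` is a list of
        (source, destination) pairs. Both are hypothetical structures.
        """
        predecessors = {name: [] for name in functions}
        for src, dst in edges:
            predecessors[dst].append(src)

        results = {}
        # static_order() yields each node after all of its predecessors
        # and raises graphlib.CycleError if the graph is not acyclic.
        for name in TopologicalSorter(predecessors).static_order():
            inputs = [results[p] for p in predecessors[name]]
            results[name] = functions[name](*inputs)
        return results

Because every intermediate result lives in the ``results`` dictionary of one process, the sketch also shows why such a run stays on a single device.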

Installation

The library can be used stand-alone from Python simply by installing it.

To install Rain, run this command in your terminal (this is the preferred way to install the most recent stable release):

.. code-block:: console

    $ pip install git+https://github.com/SIMPLE-DVS/rain.git

It is also possible to install Rain with all the optional dependencies by running the following command:

.. code-block:: console

    $ pip install "rain[full] @ git+https://github.com/SIMPLE-DVS/rain"

If you don't have `pip`_ installed, this `Python installation guide`_ can guide you through the process.

.. _pip: https://pip.pypa.io
.. _Python installation guide: http://docs.python-guide.org/en/latest/starting/installation/

Furthermore, the tool comes with a back end that leverages the library and exposes its functionality to a GUI, which eases the use of the library itself.

QuickStart

Here we provide a simple Python script in which Rain is used and a Dataflow is configured::

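    import rain

    # Create the Dataflow and choose where it runs (here: the local machine)
    df = rain.DataFlow("df1", executor=rain.LocalExecutor())

    # Define the nodes: load a CSV file, filter its columns, write the result
    csv_loader = rain.PandasCSVLoader("load", path="./iris.csv")
    filter_col = rain.PandasColumnsFiltering("filter", column_indexes=[0, 1])
    writer = rain.PandasCSVWriter("write", path="./new_iris.csv")

    # Link each output variable to the input variable of the next node
    df.add_edges(
        csv_loader @ "dataset" > filter_col @ "dataset",
        filter_col @ "transformed_dataset" > writer @ "dataset"
    )

    # Run the Dataflow
    df.execute()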

In the above script we:

  • first import the library;

  • instantiate a Dataflow (with Id "df1" and referenced as df), passing a Local Executor, meaning that the Dataflow will be executed on the local machine that runs the script;

  • instantiate 3 nodes (csv_loader, filter_col, writer):

      • the first one uses the node PandasCSVLoader to load the "iris.csv" file, which contains the Iris dataset and is stored in the root directory;

      • the second one filters some columns using a PandasColumnsFiltering node with its parameter column_indexes;

      • the last one saves the transformed dataset in a new file called "new_iris.csv" using the node PandasCSVWriter;

  • create 2 edges to link the 3 nodes:

      • the dataset output variable of the node csv_loader is sent to the dataset input variable of the node filter_col;

      • the transformed_dataset output variable of the node filter_col is then sent to the dataset input variable of the node writer;

  • finally call the execute method of the Dataflow df, so that running the script produces the expected result.
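
For comparison, the effect of this three-node Dataflow can be approximated without Rain. The sketch below deliberately uses the standard library ``csv`` module instead of Pandas, and it assumes that the filtering node keeps the columns at the listed indexes; the function name ``keep_columns`` is hypothetical.

.. code-block:: python

    import csv


    def keep_columns(src_path, dst_path, column_indexes):
        """Copy a CSV file, keeping only the columns at ``column_indexes``.

        Assumption: this mirrors load -> filter columns -> write, with the
        filter interpreted as "keep the columns at these positions".
        """
        with open(src_path, newline="") as src, \
                open(dst_path, "w", newline="") as dst:
            writer = csv.writer(dst)
            for row in csv.reader(src):
                writer.writerow([row[i] for i in column_indexes])

Calling ``keep_columns("./iris.csv", "./new_iris.csv", [0, 1])`` would then play the role of the whole script, under the assumption stated above.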

In general, to use the library you have to perform the following steps:

  • create a Dataflow, specifying the type of executor;

  • define all the nodes with the desired parameters to achieve your ML task;

  • define the edges to link the nodes using the specific semantics:

      • > is the symbol used to create an edge: on the left you must specify the output of the source node, while on the right you must specify the input of the destination node;

      • @ is the symbol used to access an input/output variable of a node: on the left you must specify the Python variable that holds the node, while on the right you must specify the name of the output/input variable of the source/destination node;

  • execute the Dataflow and run the script.
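
The ``@`` and ``>`` symbols are ordinary Python operators, so an edge syntax like the one above can be obtained through operator overloading. The following sketch shows one way this could work; the classes ``Node``, ``Variable`` and ``Edge`` are hypothetical and are not taken from Rain's source code.

.. code-block:: python

    class Node:
        """Hypothetical node identified by an id."""

        def __init__(self, node_id):
            self.node_id = node_id

        def __matmul__(self, variable_name):
            # node @ "name" selects one input/output variable of the node.
            return Variable(self, variable_name)


    class Variable:
        """Hypothetical reference to a named variable of a node."""

        def __init__(self, node, name):
            self.node = node
            self.name = name

        def __gt__(self, other):
            # output > input builds an edge instead of comparing values.
            return Edge(source=self, destination=other)


    class Edge:
        """Hypothetical edge from an output variable to an input variable."""

        def __init__(self, source, destination):
            self.source = source
            self.destination = destination

        def __repr__(self):
            return (f"{self.source.node.node_id}.{self.source.name} -> "
                    f"{self.destination.node.node_id}.{self.destination.name}")


    loader = Node("load")
    filt = Node("filter")
    edge = loader @ "dataset" > filt @ "dataset"
    print(edge)  # prints: load.dataset -> filter.dataset

No parentheses are needed in the edge expression because ``@`` binds more tightly than ``>`` in Python, so both variables are resolved before the edge is created.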

More information about Rain usage, the semantics of edges and all the available executors is available here_. A complete description of all the available nodes, with their behavior, accepted parameters, inputs and outputs, is available at this link_.

.. _link: https://rain-library.readthedocs.io/en/latest/rain.nodes.html
.. _here: https://rain-library.readthedocs.io/en/latest/usage.html

Full Documentation

To build the full documentation, follow these steps:

  • Install Sphinx and the Sphinx theme specified in the requirements_dev.txt file, or install all the requirements listed in that file (suggested choice).

  • From the main directory, cd to the 'docs' directory:

.. code-block:: console

    $ cd docs

Then run ``make.bat singlehtml`` on Windows, or run the following command:

.. code-block:: console

    $ sphinx-build . ./_build

The _build directory will contain the HTML files; open the index.html file to read the full documentation.

Authors

  • Alessandro Antinori, Rosario Capparuccia, Riccardo Coltrinari, Flavio Corradini, Marco Piangerelli, Barbara Re, Marco Scarpetta, Luca Mozzoni, Vincenzo Nucci
