.. |Codecov| image:: https://img.shields.io/codecov/c/github/SIMPLE-DVS/rain
   :alt: Codecov
   :target: https://app.codecov.io/gh/SIMPLE-DVS/rain
.. |License| image:: https://img.shields.io/badge/License-GPLv3-blue.svg
|Codecov| |License|
Rain
====
.. image:: https://img.shields.io/pypi/v/rain.svg
   :target: https://pypi.python.org/pypi/rain

.. image:: https://img.shields.io/travis/SIMPLE-DVS/rain.svg
   :target: https://travis-ci.com/SIMPLE-DVS/rain
What is it?
-----------
Rain is a Python library that supports the data scientist in developing data pipelines, here called Dataflows, in a rapid and easy way following a declarative approach. In particular, it helps with data preparation/engineering, where data are processed, and with data analysis, consisting in the definition of the most suitable learning algorithm.

Rain contains a collection of nodes that abstract functions of the main Python ML libraries, such as scikit-learn, Pandas and PySpark. The ability to combine multiple Python libraries, together with the possibility of defining new nodes or adding support for other libraries, is Rain's main strength. Currently the library contains several nodes implementing Anomaly Detection strategies.
Dataflow
--------
A DataFlow represents a Directed Acyclic Graph (DAG). Since a DataFlow may be executed on a remote machine, the acyclicity of the graph must be ensured to avoid deadlocks.
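The acyclicity requirement can be checked with a standard topological-sort pass (Kahn's algorithm): the graph is a DAG if and only if every node can be scheduled. The sketch below is purely illustrative and is not Rain's actual implementation; the function name and the node/edge representation are invented for the example.

```python
from collections import deque

def is_acyclic(nodes, edges):
    """Kahn's algorithm: the graph is a DAG iff every node
    can be scheduled in a topological order."""
    indegree = {n: 0 for n in nodes}
    adjacency = {n: [] for n in nodes}
    for src, dst in edges:
        adjacency[src].append(dst)
        indegree[dst] += 1
    # Start from the nodes with no incoming edges.
    queue = deque(n for n in nodes if indegree[n] == 0)
    visited = 0
    while queue:
        node = queue.popleft()
        visited += 1
        for succ in adjacency[node]:
            indegree[succ] -= 1
            if indegree[succ] == 0:
                queue.append(succ)
    # If a cycle exists, its nodes never reach indegree 0.
    return visited == len(nodes)

print(is_acyclic(["load", "filter", "write"],
                 [("load", "filter"), ("filter", "write")]))  # True
print(is_acyclic(["a", "b"], [("a", "b"), ("b", "a")]))       # False
```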
Nodes can be added to the DataFlow and connected to each other by edges. A node can be seen as a meta-function, a combination of several methods of a particular ML library embedded in Rain, that provides one or more functionalities (for instance, a Pandas node/meta-function could compute the mean of a column and then round it to a given number of decimals).
Edges connect meta-function outputs to meta-function inputs using a specific semantic. In general, an output can be connected to an input if and only if their types match (semantic verification). Moreover, an output can have one or more outgoing edges, while an input can have at most one incoming edge.
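The semantic verification described above can be pictured as a type check on declared ports. The snippet below is a minimal, hypothetical sketch of that rule (the port tables, type names and `can_connect` helper are invented for the example and are not part of Rain's API):

```python
# Hypothetical port declarations: each output/input declares the type it carries.
outputs = {("load", "dataset"): "DataFrame"}
inputs = {("filter", "dataset"): "DataFrame",
          ("filter", "threshold"): "float"}

def can_connect(output_port, input_port):
    """An output may feed an input only if their declared types match."""
    return outputs[output_port] == inputs[input_port]

print(can_connect(("load", "dataset"), ("filter", "dataset")))    # True
print(can_connect(("load", "dataset"), ("filter", "threshold")))  # False
```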
The library also contains the so-called executors, which run the Dataflow. Currently there are the Local executor, where the computation is performed on a single local machine, and the Spark executor, which harnesses an Apache Spark cluster. A DataFlow runs on a single device, because the data transformed by each node are passed directly to the following ones.
Installation
------------
The library can be accessed in a stand-alone way using Python simply by installing it.
To install Rain, run this command in your terminal (preferred way to install the most recent stable release):
.. code-block:: console

   $ pip install git+https://github.com/SIMPLE-DVS/rain.git
It is also possible to install Rain with all the optional dependencies by running the following command:
.. code-block:: console

   $ pip install "rain[full] @ git+https://github.com/SIMPLE-DVS/rain"
If you don't have `pip`_ installed, this `Python installation guide`_ can guide
you through the process.

.. _pip: https://pip.pypa.io
.. _Python installation guide: http://docs.python-guide.org/en/latest/starting/installation/
Furthermore, the tool comes with a back-end that leverages the library and exposes its functionalities to a GUI, which eases the usage of the library itself.
QuickStart
----------
Here we provide a simple Python script in which Rain is used and a Dataflow is configured::

    import rain

    df = rain.DataFlow("df1", executor=rain.LocalExecutor())

    csv_loader = rain.PandasCSVLoader("load", path="./iris.csv")
    filter_col = rain.PandasColumnsFiltering("filter", column_indexes=[0, 1])
    writer = rain.PandasCSVWriter("write", path="./new_iris.csv")

    df.add_edges(
        csv_loader @ "dataset" > filter_col @ "dataset",
        filter_col @ "transformed_dataset" > writer @ "dataset"
    )

    df.execute()
In the above script we:

- first import the library;
- instantiate a Dataflow (with id "df1", referenced as ``df``) passing a Local Executor, meaning that the Dataflow will be executed on the local machine that runs the script;
- instantiate 3 nodes (``csv_loader``, ``filter_col``, ``writer``):

  - the first one, a ``PandasCSVLoader``, loads the "iris.csv" file stored in the root directory and containing the Iris dataset;
  - the second one, a ``PandasColumnsFiltering``, filters some columns through its ``column_indexes`` parameter;
  - the last one, a ``PandasCSVWriter``, saves the transformed dataset in a new file called "new_iris.csv";

- create 2 edges to link the 3 nodes:

  - the ``dataset`` output variable of the node ``csv_loader`` is sent to the ``dataset`` input variable of the node ``filter_col``;
  - the ``transformed_dataset`` output of ``filter_col`` is then sent to the ``dataset`` input of the node ``writer``;

- finally call the ``execute`` method of the Dataflow ``df``. In this way, when the script is run we get the expected result.
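To make the data flow concrete, the same transformation can be reproduced with plain Python. The sketch below mirrors what the three nodes do, using only the standard library's ``csv`` module to stay dependency-free; the column names of the Iris file are invented for the example, and the real Rain nodes may behave differently internally.

```python
import csv

# Create a tiny stand-in for iris.csv so the example is self-contained
# (the real file and its column names are assumptions).
with open("iris.csv", "w", newline="") as f:
    w = csv.writer(f)
    w.writerow(["sepal_length", "sepal_width", "petal_length"])
    w.writerow([5.1, 3.5, 1.4])
    w.writerow([4.9, 3.0, 1.4])

# Equivalent of PandasCSVLoader: read all rows.
with open("iris.csv", newline="") as f:
    rows = list(csv.reader(f))

# Equivalent of PandasColumnsFiltering(column_indexes=[0, 1]): keep columns 0 and 1.
filtered = [[row[0], row[1]] for row in rows]

# Equivalent of PandasCSVWriter: write the transformed dataset.
with open("new_iris.csv", "w", newline="") as f:
    csv.writer(f).writerows(filtered)

with open("new_iris.csv", newline="") as f:
    print(next(csv.reader(f)))  # ['sepal_length', 'sepal_width']
```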
In general, to use the library you have to perform the following steps:

- create a Dataflow, specifying the type of executor;
- define all the nodes with the desired parameters to achieve your ML task;
- define the edges to link the nodes using the specific semantic:

  - ``>`` is the symbol used to create an edge: on the left you must specify the output of the source node, while on the right you must specify the input of the destination node;
  - ``@`` is the symbol used to access an input/output variable of a node: on the left you must specify the Python variable referencing the node, while on the right you must specify the name of the output/input variable of that source/destination node;

- execute the Dataflow and run the script.
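The ``@``/``>`` semantic can be emulated with standard Python operator overloading (``__matmul__`` for ``@`` and ``__gt__`` for ``>``; ``@`` binds more tightly than ``>``, so no parentheses are needed). The classes below are a hypothetical sketch of the idea, not Rain's real implementation:

```python
class Port:
    """Hypothetical stand-in for the result of `node @ "variable"`."""
    def __init__(self, node_id, name):
        self.node_id, self.name = node_id, name

    def __gt__(self, other):
        # `output > input`: build an edge from this port to the other one.
        return (self.node_id, self.name, other.node_id, other.name)

class Node:
    """Hypothetical node: only an id is needed for this sketch."""
    def __init__(self, node_id):
        self.node_id = node_id

    def __matmul__(self, variable_name):
        # `node @ "variable"`: select one of the node's I/O variables.
        return Port(self.node_id, variable_name)

csv_loader, filter_col = Node("load"), Node("filter")
edge = csv_loader @ "dataset" > filter_col @ "dataset"
print(edge)  # ('load', 'dataset', 'filter', 'dataset')
```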
More information about Rain usage, the edges' semantic and all the available executors is available `here`_.

A complete description of all the available nodes, with their behavior, accepted parameters, inputs and outputs, is available at this `link`_.

.. _here: https://rain-library.readthedocs.io/en/latest/usage.html
.. _link: https://rain-library.readthedocs.io/en/latest/rain.nodes.html
Full Documentation
------------------
To build the full documentation, follow these steps:

- Download Sphinx and the Sphinx theme specified in the requirements_dev.txt file, or install all the requirements listed in that file (suggested choice).
- From the main directory, cd to the 'docs' directory:

  .. code-block:: console

     $ cd docs

- Run the 'make.bat singlehtml' file on Windows, or run the command:

  .. code-block:: console

     $ sphinx-build . ./_build

The _build directory will contain the HTML files; open the index.html file to read the full documentation.
Authors
-------
- Alessandro Antinori, Rosario Capparuccia, Riccardo Coltrinari, Flavio Corradini, Marco Piangerelli, Barbara Re, Marco Scarpetta, Luca Mozzoni, Vincenzo Nucci
History
=======

0.1.0 (2021-07-19)
------------------

- First release on PyPI.