A library that takes care of several tedious aspects of working with big data on an HPC cluster.

Welcome to idact!

Idact, or Interactive Data Analysis Convenience Tools, is a Python 3.5+ library that takes care of several tedious aspects of working with big data on an HPC cluster.

Who is it for?

Data scientists and big data enthusiasts who:

  • Perform computations in Jupyter Notebook, using libraries such as NumPy, pandas, Matplotlib, or Bokeh.
  • Have access to an HPC cluster with Slurm as the job scheduler.
  • Would like to parallelize their computations across many nodes using Dask.distributed, a library for distributed computing.
  • May find that it takes too much manual effort to deploy Jupyter Notebook and Dask on the cluster each time they need it.

Requirements

Python 3.5+.

Client

Cluster

Installation

python -m pip install idact

If you're using Conda, you may want to update your environment first:

conda update --all

Code samples

Accessing a cluster

A cluster can be accessed over SSH with a public/private key pair.

from idact import *

# Register the cluster under a short name, using key-based SSH authentication.
cluster = add_cluster(name="short-cluster-name",
                      user="user",
                      host="login-node.cluster.example.com",
                      port=22,
                      auth=AuthMethod.PUBLIC_KEY,
                      key="~/.ssh/id_rsa",
                      install_key=False)
# Connect to the cluster's access node.
node = cluster.get_access_node()
node.connect()

Tutorial: 01. Connecting to a cluster

Allocating and deallocating nodes

Nodes are allocated as a Slurm job. Afterwards, they can be used for deployments.

import bitmath
nodes = cluster.allocate_nodes(nodes=8,
                               cores=12,
                               memory_per_node=bitmath.GiB(120),
                               walltime=Walltime(hours=1, minutes=30),
                               native_args={
                                   '--partition': 'debug',
                                   '--account': 'data-analysis-group'
                               })
# Wait up to 120 seconds for the allocation; cancel the job on timeout.
try:
    nodes.wait(timeout=120.0)
except TimeoutError:
    nodes.cancel()
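
The wait-then-cancel pattern above can be tried out without a cluster. The following is a minimal sketch, not idact's implementation: `wait_until` and its parameters are hypothetical names. It polls a condition until a deadline passes and then raises `TimeoutError`, the exception the sample above catches.

```python
import time


def wait_until(condition, timeout, poll_interval=0.01):
    """Poll `condition` until it returns True or `timeout` seconds elapse.

    Hypothetical helper for illustration; idact's `nodes.wait` may work differently.
    """
    deadline = time.monotonic() + timeout
    while True:
        if condition():
            return
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within {} s".format(timeout))
        time.sleep(poll_interval)


# Stand-in for an allocation that never starts: the wait times out,
# and the caller reacts in the except branch, where the sample calls nodes.cancel().
cancelled = False
try:
    wait_until(lambda: False, timeout=0.05)
except TimeoutError:
    cancelled = True
```

Cancelling in the `except` branch keeps a job that never started from lingering in the Slurm queue.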

Tutorial: 02. Allocating nodes

Deploying Jupyter Notebook

Jupyter Notebook is deployed on a cluster node, and made accessible through an SSH tunnel.

nb = nodes[0].deploy_notebook()
nb.open_in_browser()
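
idact sets up the tunnel itself, so no manual step is needed. For readers unfamiliar with SSH tunnels, the sketch below builds the equivalent plain `ssh` local port forwarding command. The helper name, the node name `compute-node-1`, and both port numbers are assumptions made for illustration; they are not values used by idact.

```python
def ssh_tunnel_command(user, login_host, node_host, remote_port,
                       local_port, ssh_port=22):
    """Build an `ssh` argument list that forwards a local port to a node.

    Hypothetical helper; all names and values here are illustrative.
    """
    forward = "{}:{}:{}".format(local_port, node_host, remote_port)
    return ["ssh",
            "-N",             # do not run a remote command, only forward ports
            "-L", forward,    # local_port -> node_host:remote_port
            "-p", str(ssh_port),
            "{}@{}".format(user, login_host)]


command = ssh_tunnel_command(user="user",
                             login_host="login-node.cluster.example.com",
                             node_host="compute-node-1",
                             remote_port=8888,
                             local_port=8080)
print(" ".join(command))
# prints: ssh -N -L 8080:compute-node-1:8888 -p 22 user@login-node.cluster.example.com
```

With such a tunnel open, a notebook listening on port 8888 of the compute node would be reachable at `localhost:8080` on the local machine.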

Tutorial: 03. Deploying Jupyter

Deploying Dask.distributed

Dask.distributed scheduler and workers are deployed on cluster nodes, and their dashboards are made available through SSH tunnels.

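# Deploy Dask on the remaining nodes (nodes[0] runs the notebook above).
dd = deploy_dask(nodes[1:])
# Get a Dask client connected to the deployed scheduler.
client = dd.get_client()
client.submit(...)
# Open the Dask dashboards in the browser.
dd.diagnostics.open_all()

Tutorials: 04. Deploying Dask, 09. Demo analysis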

Managing cluster config

Local and remote cluster configuration can be saved, loaded, and copied to and from the cluster.

save_environment()
load_environment()

push_environment(cluster)
pull_environment(cluster)

Tutorials: 01. Connecting to a cluster, 05. Configuring idact on a cluster

Managing deployments

Deployment objects can be serialized and copied between running program instances, local or remote.

cluster.push_deployment(nodes)
cluster.push_deployment(nb)
cluster.push_deployment(dd)

cluster.pull_deployments()
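
The idea of passing a deployment between program instances can be sketched with plain JSON. This is an illustration only: the serialized format idact actually uses is not described in this document, and the field names below are hypothetical.

```python
import json


def serialize_deployment(deployment):
    """Turn a deployment description (a plain dict here) into a JSON string."""
    return json.dumps(deployment, sort_keys=True)


def deserialize_deployment(text):
    """Rebuild the deployment description from its JSON form."""
    return json.loads(text)


# Hypothetical fields; not idact's real format.
notebook = {"type": "notebook", "node": "compute-node-1", "port": 8888}

# One program instance serializes the description; another instance
# would read the text and restore an equal description.
restored = deserialize_deployment(serialize_deployment(notebook))
assert restored == notebook
```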

Tutorials: 06. Working on a cluster, 07. Adjusting timeouts

Quick deployment app

The quick deployment app allocates nodes and deploys Jupyter Notebook from the command line:

idact-notebook short-cluster-name --nodes 3 --walltime 0:20:00
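
The `--walltime 0:20:00` value above reads as hours, minutes, and seconds. The sketch below shows one way such a string could be split into its parts. `parse_walltime` is a hypothetical helper, and the set of formats accepted by `idact-notebook` may be broader than this.

```python
def parse_walltime(text):
    """Parse an `H:MM:SS` string into (hours, minutes, seconds).

    Hypothetical helper; the formats accepted by idact-notebook may differ.
    """
    parts = text.split(":")
    if len(parts) != 3:
        raise ValueError("expected H:MM:SS, got {!r}".format(text))
    hours, minutes, seconds = (int(part) for part in parts)
    if hours < 0 or not 0 <= minutes < 60 or not 0 <= seconds < 60:
        raise ValueError("walltime out of range: {!r}".format(text))
    return hours, minutes, seconds


hours, minutes, seconds = parse_walltime("0:20:00")
print(hours, minutes, seconds)  # prints: 0 20 0
```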

Tutorial: 08. Using the quick deployment app

Documentation

The documentation contains a detailed API description, tutorial notebooks, and other helpful information.

Source code

The source code is available on GitHub.

License

MIT License.

This library was developed under the supervision of Leszek Grzanka, PhD, as a final project of the BEng in Computer Science program at the Faculty of Computer Science, Electronics and Telecommunications at AGH University of Science and Technology, Krakow.

Download files

Source Distributions

No source distribution files available for this release. See tutorial on generating distribution archives.

Built Distribution

idact-0.7-py3-none-any.whl (147.1 kB view details)

Uploaded Python 3

File details

Details for the file idact-0.7-py3-none-any.whl.

File metadata

  • Download URL: idact-0.7-py3-none-any.whl
  • Upload date:
  • Size: 147.1 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/1.13.0 pkginfo/1.5.0.1 requests/2.21.0 setuptools/41.0.0 requests-toolbelt/0.9.1 tqdm/4.31.1 CPython/3.6.6

File hashes

Hashes for idact-0.7-py3-none-any.whl:

Algorithm    Hash digest
SHA256       539f63edce9d4f47dcd0492c548319a857e892b05b0ffface62941481be0b5d5
MD5          cf8c62b1476981d6d43bf5161380ee38
BLAKE2b-256  5adb28b0faf494fa887ded6e926817b4643e5f9be34b8373ce9e136906956c75
