
Parallel, distributed NumPy-like arrays backed by Chapel

Project description

Arkouda (αρκούδα): NumPy-like arrays at massive scale backed by Chapel.

NOTE: Arkouda is under the MIT license.

Gitter channels

Arkouda Gitter channel

Chapel Gitter channel

Talks on Arkouda

Mike Merrill's CHIUW 2019 talk

Bill Reus' CLSAC 2019 talk

PAW-ATM talk and abstract

Abstract:

Exploratory data analysis (EDA) is a prerequisite for all data science, as illustrated by the ubiquity of Jupyter notebooks, the preferred interface for EDA among data scientists. The operations involved in exploring and transforming the data are often at least as computationally intensive as downstream applications (e.g. machine learning algorithms), and as datasets grow, so does the need for HPC-enabled EDA. However, the inherently interactive and open-ended nature of EDA does not mesh well with current HPC usage models. Meanwhile, several existing projects from outside the traditional HPC space attempt to combine interactivity and distributed computation using programming paradigms and tools from cloud computing, but none of these projects have come close to meeting our needs for high-performance EDA.

To fill this gap, we have developed a software package, called Arkouda, which allows a user to interactively issue massively parallel computations on distributed data using functions and syntax that mimic NumPy, the underlying computational library used in the vast majority of Python data science workflows. The computational heart of Arkouda is a Chapel interpreter that accepts a pre-defined set of commands from a client (currently implemented in Python) and uses Chapel's built-in machinery for multi-locale and multithreaded execution. Arkouda has benefited greatly from Chapel's distinctive features and has also helped guide the development of the language.

In early applications, users of Arkouda have tended to iterate rapidly between multi-node execution with Arkouda and single-node analysis in Python, relying on Arkouda to filter a large dataset down to a smaller collection suitable for analysis in Python, and then feeding the results back into Arkouda computations on the full dataset. This paradigm has already proved very fruitful for EDA. Our goal is to enable users to progress seamlessly from EDA to specialized algorithms by making Arkouda an integration point for HPC implementations of expensive kernels like FFTs, sparse linear algebra, and graph traversal. With Arkouda serving the role of a shell, a data scientist could explore, prepare, and call optimized HPC libraries on massive datasets, all within the same interactive session.
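As a rough sketch of this round-trip (the server address, array sizes, and variable names below are illustrative, and assume an arkouda_server is already running):

# Python client: filter a large distributed array down to a small one,
# then hand the result to NumPy for single-node analysis.
import arkouda as ak
import numpy as np

ak.connect('localhost', 5555)        # attach the client to a running server

a = ak.randint(0, 2**32, 10**8)      # large array, held and computed on the server
small = a[a < 1000]                  # NumPy-style filter, evaluated in parallel

local = small.to_ndarray()           # pull the small result into local NumPy
b = ak.array(np.unique(local))       # push results back for full-scale computation

ak.disconnect()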

Installation

Requirements:

  • Chapel 1.22.0
  • ZeroMQ version >= 4.2.5 (tested with 4.2.5 and 4.3.1)
  • HDF5
  • Python 3.6 or greater
  • NumPy
  • Sphinx and sphinx-argparse (to build the Python documentation)
  • pytest and pytest-env (to run the Python test harness)

macOS Environment Installation

It is usually very simple to get things going on a Mac:

brew install zeromq

brew install hdf5

brew install chapel

# Although not required, it is highly recommended to install Anaconda to provide a
# Python 3 environment and manage Python dependencies.
# Note: use the macOS installer here, not the Linux one.
wget https://repo.anaconda.com/archive/Anaconda3-2020.02-MacOSX-x86_64.sh
sh Anaconda3-2020.02-MacOSX-x86_64.sh
source ~/.bashrc
source ~/.bashrc

# Otherwise, Python 3 can be installed with brew
brew install python3
# NOTE: the arkouda python client is available via pip, and pip will
# automatically install the python dependencies (zmq and numpy).
# However, the version on PyPI may be out of date, and pip will not
# build the arkouda server (see below), so do NOT simply run:
# pip3 install arkouda
#
# Instead, install the version of the arkouda python package that came
# with the arkouda_server. If you plan on editing the arkouda python
# package, use the -e flag. From the local arkouda repo/directory run:
pip3 install -e .
#
# these packages are nice to have but not required
pip3 install pandas
pip3 install jupyter

If you prefer to build Chapel from source instead of using the brew install, the process is as follows:

# Build Chapel in your home directory with these settings...
export CHPL_HOME=~/chapel/chapel-1.22.0
source $CHPL_HOME/util/setchplenv.bash
export CHPL_COMM=gasnet
export CHPL_COMM_SUBSTRATE=smp
export CHPL_TARGET_CPU=native
export GASNET_QUIET=Y
export CHPL_RT_OVERSUBSCRIBED=yes
cd $CHPL_HOME
make

# Build chpldoc to enable generation of Arkouda docs
make chpldoc

# Add the Chapel compiler (chpl) to PATH, e.g. in ~/.bashrc.
# Note: on macOS the bin subdirectory is darwin-x86_64, not linux64-x86_64:

export PATH=$CHPL_HOME/bin/darwin-x86_64/:$PATH

Linux Environment Installation

Chapel is not available as a prebuilt Linux package, so the first two steps of the Linux Arkouda install are to install the Chapel dependencies and then download and build Chapel:

# Update the package index and install the Chapel dependencies
sudo apt-get update
sudo apt-get install gcc g++ m4 perl python python-dev python-setuptools bash make mawk git pkg-config

# Download the Chapel 1.22.0 release, extract the archive, and navigate to the source root directory
wget https://github.com/chapel-lang/chapel/releases/download/1.22.0/chapel-1.22.0.tar.gz
tar xvf chapel-1.22.0.tar.gz
cd chapel-1.22.0/

# Set CHPL_HOME
export CHPL_HOME=$PWD

# Add chpl to PATH
source $CHPL_HOME/util/setchplenv.bash

# Set remaining env variables and execute make
export CHPL_COMM=gasnet
export CHPL_COMM_SUBSTRATE=smp
export CHPL_TARGET_CPU=native
export GASNET_QUIET=Y
export CHPL_RT_OVERSUBSCRIBED=yes
cd $CHPL_HOME
make

# Build chpldoc to enable generation of Arkouda docs
make chpldoc

# Optionally add the Chapel compiler (chpl) to PATH for all users by adding this line to /etc/environment
export PATH=$CHPL_HOME/bin/linux64-x86_64/:$PATH

As with the macOS install, it is highly recommended to install Anaconda to provide a Python environment and manage Python dependencies:

 wget https://repo.anaconda.com/archive/Anaconda3-2020.02-Linux-x86_64.sh
 sh Anaconda3-2020.02-Linux-x86_64.sh
 source ~/.bashrc

Building Arkouda

Download, clone, or fork the arkouda repo. Further instructions assume that the current directory is the top-level directory of the repo.

If your environment requires non-system paths to find dependencies (e.g., if using the ZMQ and HDF5 bundled with Anaconda), append each path to a new file named Makefile.paths like so:

# Makefile.paths

# Custom Anaconda environment for Arkouda
$(eval $(call add-path,/home/user/anaconda3/envs/arkouda))
#                      ^ Note: No space after comma.

The chpl compiler will then be invoked with -I, -L, and -rpath flags for each of these paths.

# If zmq and hdf5 have not been installed previously, execute make install-deps
make install-deps

# Run make to build the arkouda_server executable
make

Now that arkouda_server is built, install the Python library.

Installing the Arkouda Python Library

 pip3 install -e .
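
A quick way to confirm the client is importable (no server needs to be running for this; the printed path is just illustrative):

# Sanity check: the import should succeed without a running server
import arkouda as ak
print(ak.__file__)   # shows where the package was installed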

Testing Arkouda

There are two unit test suites for Arkouda, one for Python and one for Chapel. As mentioned above, the Arkouda Python test harness leverages the pytest and pytest-env libraries, whereas the Chapel test harness does not require any external libraries.

For more details regarding Arkouda testing, please consult the Python test README and Chapel test README, respectively.
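
As a sketch, the Python suite can also be driven through pytest's programmatic API (this assumes the tests live under tests/, as the check script path below suggests; server-related environment variables may still need configuring per the Python test README):

# Run the Python test harness via pytest's API
# (equivalent to invoking pytest on the repo's tests/ directory)
import pytest
pytest.main(["tests"])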

Building the Arkouda documentation

Make sure you've installed the Sphinx and sphinx-argparse packages (e.g. pip3 install -U Sphinx sphinx-argparse). Important: if you've built Chapel, you must execute make chpldoc as detailed above.

Run make doc to build both the Arkouda Python documentation and the Chapel server documentation.

The output is placed in subdirectories of arkouda/doc:

arkouda/doc/python # python frontend documentation
arkouda/doc/server # chapel backend server documentation 

To view the documentation for the Arkouda Python client, point your browser at file:///path/to/arkouda/doc/python/index.html, substituting the appropriate path for your configuration.

Running arkouda_server

The command-line invocation depends on whether you built a single-locale version (with CHPL_COMM=none) or a multi-locale version (with CHPL_COMM set).

Single-locale startup:

./arkouda_server

Multi-locale startup (user selects the number of locales):

./arkouda_server -nl 2

The server can also be run with memory tracking turned on:

./arkouda_server --memTrack=true

By default, the server listens on port 5555 and prints verbose output. These options can be changed with the command-line flags --ServerPort=1234 and --v=false.

Memory tracking is turned off by default and can be turned on with --memTrack=true.

Logging messages are turned on by default and can be turned off with --logging=false.

Verbose messages are turned on by default and can be turned off with --v=false.

Other command-line options are available; view them with --help.
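
Once the server is up, the Python client can attach to it. A minimal sketch (host and port here assume the defaults above):

# Connect the Python client to a running arkouda_server
import arkouda as ak

ak.connect('localhost', 5555)   # matches the default --ServerPort=5555
print(ak.arange(10))            # trivial server-side computation

ak.disconnect()                 # detach, leaving the server running
# ak.shutdown()                 # or stop the server entirely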

Testing arkouda_server

To sanity-check the arkouda server, you can run:

make check

This will start the server, run a few computations, and shut the server down. In addition, the check script can be run against an already-running server with the following Python command:

python3 tests/check.py localhost 5555

Contributing to Arkouda

If you'd like to contribute, please see CONTRIBUTING.md.
