A numerical computing library for Python that scales.

NumS

What is NumS?

NumS is a numerical cloud computing library that translates Python and NumPy to distributed systems code at runtime. NumS scales NumPy operations horizontally and provides inter-operation (task) parallelism for those operations. NumS remains faithful to the NumPy API and integrates tightly with the Python programming language by supporting loop parallelism and branching. NumS's system-level operations are written against the Ray API; it supports S3 and basic distributed filesystem operations for storage, and uses NumPy as a backend for CPU-based array operations.

Usage

Obtain the latest release of NumS using pip install nums.

NumS explicitly implements the NumPy API, which gives a clear API with code hinting in IDEs (e.g. PyCharm) and interpreters (e.g. IPython, Jupyter Notebook) that support it.

Basics

Below is a quick snippet that simply samples a few large arrays and performs basic array operations.

import nums.numpy as nps

# Compute some products.
x = nps.random.rand(10**8)
# Note below the use of `get`, which blocks the executing process until
# an operation is completed, and constructs a numpy array
# from the blocks that comprise the output of the operation.
print((x.T @ x).get())
x = nps.random.rand(10**4, 10**4)
y = nps.random.rand(10**4)
print((x @ y).shape)
print((x.T @ x).shape)

# NumS also provides a speedup on basic array operations,
# such as array search.
x = nps.random.permutation(10**8)
idx = nps.where(x == 10**8 // 2)

# Whenever possible, NumS automatically evaluates boolean operations
# to support Python branching.
if x[idx] == 10**8 // 2:
    print("The numbers are equal.")
else:
    raise Exception("This is impossible.")
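To make the role of blocks and `get` more concrete, here is a NumPy-only sketch of a block-partitioned matrix-vector product. It is an illustration of the general idea, not NumS's implementation: the helpers `split_rows` and `block_matvec` are hypothetical names, and the final `np.concatenate` only plays a role loosely analogous to `get`, gathering the output blocks into a single NumPy array.

```python
import numpy as np


def split_rows(a, block_rows):
    """Split an array into a list of row blocks (hypothetical helper)."""
    return [a[i:i + block_rows] for i in range(0, a.shape[0], block_rows)]


def block_matvec(x_blocks, y):
    """Multiply each row block by y independently.

    Each product depends on only one block, so a distributed system
    could run every product as a separate task.
    """
    return [block @ y for block in x_blocks]


rng = np.random.default_rng(0)
x = rng.random((1000, 200))
y = rng.random(200)

x_blocks = split_rows(x, block_rows=256)
out_blocks = block_matvec(x_blocks, y)

# Gather the output blocks into one local NumPy array.
result = np.concatenate(out_blocks)
assert np.allclose(result, x @ y)
print(result.shape)
```

Row blocks are used here because a matrix-vector product over rows needs no communication between blocks; other operations, such as `x.T @ x`, would also require a reduction across blocks.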

I/O

NumS provides an optimized I/O interface for fast persistence of block arrays. See below for a basic example.

import nums
import nums.numpy as nps

# Write an 800MB object in parallel, utilizing all available cores and
# write speeds available to the OS file system.
x1 = nps.random.rand(10**8)
# We invoke `get` to block until the object is written.
# The result of the write operation provides status of the write
# for each block as a numpy array.
print(nums.write("x.nps", x1).get())

# Read the object back into memory in parallel, utilizing all available cores.
x2 = nums.read("x.nps")
assert nps.allclose(x1, x2)
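As a rough picture of what writing and reading a block array in parallel can look like, the sketch below saves each block to its own `.npy` file from a thread pool and then reads the blocks back. Everything here is an assumption made for illustration: the file layout, the helper names `write_blocks` and `read_blocks`, and the use of threads are not taken from NumS.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

import numpy as np


def write_blocks(directory, blocks):
    """Save each block to its own file; return one status flag per block."""
    def write_one(item):
        index, block = item
        np.save(os.path.join(directory, f"block_{index}.npy"), block)
        return True

    with ThreadPoolExecutor() as pool:
        return np.array(list(pool.map(write_one, enumerate(blocks))))


def read_blocks(directory, num_blocks):
    """Load the blocks concurrently and reassemble them in their original order."""
    def read_one(index):
        return np.load(os.path.join(directory, f"block_{index}.npy"))

    with ThreadPoolExecutor() as pool:
        return np.concatenate(list(pool.map(read_one, range(num_blocks))))


x1 = np.random.default_rng(0).random(10**5)
blocks = np.array_split(x1, 8)

with tempfile.TemporaryDirectory() as tmp:
    status = write_blocks(tmp, blocks)
    x2 = read_blocks(tmp, len(blocks))

assert status.all()
assert np.allclose(x1, x2)
```

The per-block status array mirrors the description above, where the result of a write reports the status of each block.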

NumS automatically loads CSV files in parallel as distinct arrays, and intelligently constructs a partitioned array for block-parallel linear algebra operations.

# Specifying has_header=True discards the first line of the CSV.
dataset = nums.read_csv("path/to/csv", has_header=True)
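The following sketch shows, in plain Python and NumPy, the effect of discarding a header line and splitting the remaining rows into blocks. It reads sequentially from an in-memory string, so it does not reproduce parallel loading; `read_csv_blocks` and its `block_rows` parameter are hypothetical and exist only for this example.

```python
import csv
import io

import numpy as np


def read_csv_blocks(text_stream, has_header=False, block_rows=2):
    """Parse numeric CSV rows and return them as a list of row blocks."""
    reader = csv.reader(text_stream)
    if has_header:
        next(reader, None)  # Discard the first line of the CSV.
    rows = [[float(value) for value in row] for row in reader if row]
    data = np.array(rows)
    return [data[i:i + block_rows] for i in range(0, len(data), block_rows)]


sample = io.StringIO("a,b\n1,2\n3,4\n5,6\n")
blocks = read_csv_blocks(sample, has_header=True)

# Reassemble the blocks into a single array for inspection.
dataset = np.concatenate(blocks)
print(dataset.shape)
```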

Logistic Regression

In this example, we'll run logistic regression on a bimodal Gaussian. We'll begin by importing the necessary modules.

import nums.numpy as nps
from nums.models.glms import LogisticRegression

NumS initializes its system dependencies automatically as soon as an operation is performed. Thus, importing modules triggers no systems-related initializations.

Parallel RNG

NumS's random number generation is based on NumPy's parallel random number generators. You can sample billions of random numbers in parallel, and the samples are automatically block-partitioned for parallel linear algebra operations.

Below, we sample a bimodal Gaussian of roughly 1.6GB (10**8 rows of two 64-bit floats), which is generated asynchronously and stored by the system's workers.

size = 10**8
X_train = nps.concatenate([nps.random.randn(size // 2, 2), 
                           nps.random.randn(size // 2, 2) + 2.0], axis=0)
y_train = nps.concatenate([nps.zeros(shape=(size // 2,), dtype=nps.int), 
                           nps.ones(shape=(size // 2,), dtype=nps.int)], axis=0)
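For readers unfamiliar with parallel random number generation in NumPy, the sketch below uses `SeedSequence.spawn`, one of the approaches NumPy documents for creating independent streams. Whether NumS uses this exact mechanism internally is an assumption; the block count and block size are arbitrary values chosen for the example.

```python
import numpy as np

num_blocks = 4
block_size = 25_000

# One root seed is expanded into an independent child seed per block.
root = np.random.SeedSequence(1337)
child_seeds = root.spawn(num_blocks)

# Each block gets its own generator, so blocks can be sampled
# by different workers without coordinating with each other.
generators = [np.random.default_rng(seed) for seed in child_seeds]
blocks = [gen.standard_normal((block_size, 2)) for gen in generators]

# The blocks together form one block-partitioned sample.
X = np.concatenate(blocks, axis=0)
print(X.shape)
```

Because every child stream is derived deterministically from the root seed, the full sample is reproducible no matter which worker generates which block.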

Training

NumS's logistic regression API follows the scikit-learn API, which is familiar to most of the Python scientific computing community.

model = LogisticRegression(solver="newton-cg", penalty="l2", C=10)
model.fit(X_train, y_train)

We train our logistic regression model using Newton's method. NumS automatically optimizes the scheduling of operations using a mixture of block-cyclic heuristics and a fast, tree-based optimizer, minimizing memory and network load across distributed memory devices. For tall-skinny design matrices, NumS will automatically perform data-parallel distributed training, a near-optimal solution to the optimizer's objective.
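To illustrate why tall-skinny design matrices suit data-parallel training, here is a minimal NumPy sketch of Newton's method for L2-regularized logistic regression over row blocks. It is not NumS's solver. It assumes a direct linear solve in place of the conjugate-gradient step a `newton-cg` solver would use, it regularizes the intercept for simplicity, and `reg` is a hypothetical parameter playing the role of `1 / C`. Each block contributes only a small gradient vector and a small Hessian matrix, which are summed before the update.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def block_grad_hess(X_block, y_block, beta):
    """Gradient and Hessian contributions of one row block (unregularized)."""
    mu = sigmoid(X_block @ beta)
    grad = X_block.T @ (mu - y_block)
    hess = X_block.T @ (X_block * (mu * (1.0 - mu))[:, None])
    return grad, hess


def fit_newton(X_blocks, y_blocks, reg=0.1, max_iter=20, tol=1e-8):
    d = X_blocks[0].shape[1]
    beta = np.zeros(d)
    for _ in range(max_iter):
        # Each block could be processed by a different worker.
        parts = [block_grad_hess(Xb, yb, beta)
                 for Xb, yb in zip(X_blocks, y_blocks)]
        # Only d-sized and (d, d)-sized results need to be combined.
        grad = sum(g for g, _ in parts) + reg * beta
        hess = sum(h for _, h in parts) + reg * np.eye(d)
        step = np.linalg.solve(hess, grad)
        beta -= step
        if np.linalg.norm(step) < tol:
            break
    return beta


rng = np.random.default_rng(0)
n = 2000
X = np.concatenate([rng.standard_normal((n, 2)),
                    rng.standard_normal((n, 2)) + 2.0], axis=0)
X = np.hstack([X, np.ones((2 * n, 1))])  # Append an intercept column.
y = np.concatenate([np.zeros(n), np.ones(n)])

beta = fit_newton(np.array_split(X, 4), np.array_split(y, 4))
accuracy = np.mean((sigmoid(X @ beta) > 0.5) == y)
print("train accuracy", accuracy)
```

With `d` features, the data exchanged per iteration is on the order of `d * d` numbers per block, independent of the number of rows, which is why splitting by rows keeps network load low when the matrix is tall and skinny.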

Evaluation

We evaluate our model by computing its accuracy on the training set and on a sampled test set.

X_test = nps.concatenate([nps.random.randn(10**3, 2), 
                          nps.random.randn(10**3, 2) + 2.0], axis=0)
y_test = nps.concatenate([nps.zeros(shape=(10**3,), dtype=nps.int), 
                          nps.ones(shape=(10**3,), dtype=nps.int)], axis=0)
print("train accuracy", (nps.sum(y_train == model.predict(X_train)) / X_train.shape[0]).get())
print("test accuracy", (nps.sum(y_test == model.predict(X_test)) / X_test.shape[0]).get())

We perform the get operation to transmit the computed accuracy from distributed memory to "driver" (the locally running process) memory.

Training on HIGGS

Below is an example of loading the HIGGS dataset (download here), partitioning it for training, and running logistic regression.

import nums
import nums.numpy as nps
from nums.models.glms import LogisticRegression

higgs_dataset = nums.read_csv("HIGGS.csv")
y, X = higgs_dataset[:, 0].astype(nps.int), higgs_dataset[:, 1:]
model = LogisticRegression(solver="newton-cg")
model.fit(X, y)
y_pred = model.predict(X)
print("accuracy", (nps.sum(y == y_pred) / X.shape[0]).get())

Installation

NumS releases are tested on Linux-based systems running Python 3.6, 3.7, and 3.8.

NumS runs on Windows, but not all features are tested. We recommend using Anaconda on Windows. Download and install Anaconda for Windows here. Make sure to add Anaconda to your PATH environment variable during installation.

pip installation

To install NumS on Ray with CPU support, simply run the following command.

pip install nums

conda installation

We are working on providing support for conda installations, but in the meantime, run the following with your conda environment activated.

pip install nums
# Run below to have NumPy use MKL.
conda install -fy mkl
conda install -fy numpy scipy

S3 Configuration

To run NumS with S3, configure credentials for access by following instructions here: https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-files.html

Contributing

To contribute to NumS on Ray, we recommend cloning the repository and installing the project in developer mode using the following set of commands:

cd nums
conda create --name nums python=3.7 -y
conda activate nums
pip install -e ".[testing]"

Contributing NumPy Functionality

To make basic contributions to the NumPy API, follow these steps:

  1. Replicate the function signature in nums.numpy.api. If it's a np.ndarray method, add the function signature to nums.core.array.blockarray.BlockArray.
  2. If possible, implement the function using existing methods in nums.core.array.application.ArrayApplication or nums.core.array.blockarray.BlockArray.
  3. Write a new implementation in ArrayApplication or BlockArray if it's not possible to implement the function using existing methods, or if the implementation's execution speed can be improved beyond what is achievable using existing methods.
  4. Add kernel interfaces to nums.core.systems.interfaces.ComputeInterface, and implement the interface methods for all existing compute implementations. Currently, the only compute implementation is nums.core.systems.numpy_compute.
  5. Write tests covering all branches of your implementation in the corresponding test module in the project's tests/ directory.
  6. Do your best to implement the API in its entirety. It's generally better to have a partial implementation than no implementation, so if for whatever reason certain arguments are difficult to support, follow the convention we use to raise errors for unsupported arguments in functions like nums.numpy.api.min.
  7. If you run into any problems and need help with your implementation, open an issue describing what you're experiencing.

We encourage you to follow the nums.numpy.api.arange implementation as a reference.
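The advice in step 6, to prefer a partial implementation that rejects unsupported arguments explicitly, can be sketched as follows. This is a hypothetical stand-alone function over a list of 1-D NumPy blocks, not the code in nums.numpy.api.min; the name `block_min`, the error messages, and the set of supported arguments are assumptions made for illustration.

```python
import numpy as np


def block_min(blocks, axis=None, out=None, keepdims=False):
    """Hypothetical partial `min` over a list of 1-D blocks."""
    # Keep the full NumPy-style signature, but fail loudly on
    # arguments this partial implementation does not handle.
    if out is not None:
        raise NotImplementedError("'out' is currently not supported.")
    if axis not in (None, 0):
        raise NotImplementedError("Only axis=None or axis=0 is supported.")

    # Reduce each block independently, then reduce the per-block minima.
    block_minima = np.array([np.min(block) for block in blocks])
    result = np.min(block_minima)
    return np.reshape(result, (1,)) if keepdims else result


blocks = [np.array([3.0, 1.0]), np.array([2.0, 5.0])]
print(block_min(blocks))
```

Raising `NotImplementedError` keeps the function signature compatible with NumPy while making it obvious to callers which arguments still need work.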
