Python library for SciDB streaming

Requirements

SciDB 16.9 or newer

Apache Arrow 0.6.0 or newer.

Python 2.7.x, 3.4.x, 3.5.x, 3.6.x or newer.

Required Python packages:

dill
feather-format
pandas
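
As a quick sanity check, the sketch below reports which of the required packages cannot be found by the current interpreter. It assumes the import names are dill, feather (for feather-format), and pandas. The helper name missing_packages is made up for this illustration, and importlib.util.find_spec requires Python 3.4 or newer.

```python
import importlib.util

# PyPI package name -> module name it is assumed to install.
REQUIRED = {
    "dill": "dill",
    "feather-format": "feather",
    "pandas": "pandas",
}


def missing_packages(required=REQUIRED):
    """Return the PyPI names of packages whose module cannot be found."""
    return [
        pip_name
        for pip_name, module_name in required.items()
        if importlib.util.find_spec(module_name) is None
    ]
```

Calling missing_packages() on the SciDB server (and on the client, if it sends code) should return an empty list when everything is installed.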

Note

Apache Arrow versions older than 0.8.0 contain a bug that might affect Stream users. The bug manifests on chunks of more than 128 records with nullable values. For more details, see the full bug description here. This bug has been fixed in Apache Arrow version 0.8.0.

Installation

Install the latest release:

pip install scidb-strm

Install the development version from GitHub:

pip install git+http://github.com/paradigm4/stream.git#subdirectory=py_pkg

The Python library must be installed on the SciDB server. It must also be installed on the client if Python code is to be sent from the client to the server.

SciDB-Strm Python API and Examples

Once installed, the SciDB-Strm Python library can be imported with import scidbstrm. The library provides both high-level and low-level access to the SciDB stream operator, as well as the ability to send Python code to the SciDB server.

High-level access is provided by the function map:

map(map_fun, finalize_fun=None)
Read SciDB chunks. For each chunk, call map_fun and stream its result back to SciDB. If finalize_fun is provided, call it after all the chunks have been processed.

See 0-iquery.txt for a succinct example using the map function.

See 1-map-finalize.py for an example using the map function. The Python script has to be copied onto the SciDB instance.
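
To make the map_fun and finalize_fun roles concrete, here is a minimal sketch, not the content of 1-map-finalize.py. It assumes that map_fun receives each chunk as a pandas DataFrame, that it may return None to stream nothing back for that chunk, and that finalize_fun takes no arguments and returns a DataFrame. The column name x, the totals dictionary, and the run function are hypothetical.

```python
import pandas as pd

# Running total shared between the two callbacks.
totals = {"sum": 0}


def map_fun(df):
    """Add the chunk's sum of the hypothetical column 'x'; stream nothing back."""
    totals["sum"] += int(df["x"].sum())
    return None


def finalize_fun():
    """Return the grand total as a one-row DataFrame after the last chunk."""
    return pd.DataFrame({"sum": [totals["sum"]]})


def run():
    """Entry point for a script on the SciDB instance (scidbstrm imported lazily)."""
    import scidbstrm

    scidbstrm.map(map_fun, finalize_fun)
```

A script copied onto the SciDB instance would end with a call to run(). Keeping the import inside run lets the two callbacks be tried on plain DataFrames without a SciDB server.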

Python code can be sent to the SciDB server for execution using the pack_func and read_func functions:

pack_func(func)
Serialize Python function for use as upload_data in input or load operators.
read_func()
Read and de-serialize function from SciDB.

See 2-pack-func.py for an example of using the pack_func and read_func functions.

Low-level access is provided by the read and write functions:

read()
Read a data chunk from SciDB. Returns a Pandas DataFrame or None.
write(df=None)
Write a data chunk to SciDB.

See 3-read-write.py for an example using the read and write functions. The Python script has to be copied onto the SciDB instance.
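
The read and write functions are typically combined in a loop. The sketch below shows one possible shape for that loop, not the content of 3-read-write.py. It assumes that read returns None at the end of the stream and that one final write with no data is expected after the last chunk. Both are assumptions to check against 3-read-write.py. The function name stream_chunks and its parameters are hypothetical. read and write are passed in as arguments so the loop can be tried without SciDB.

```python
def stream_chunks(read, write, process):
    """Read chunks until end of stream, writing one response per chunk.

    Returns the number of chunks processed.
    """
    count = 0
    while True:
        chunk = read()
        if chunk is None:  # assumed end-of-stream signal
            break
        write(process(chunk))
        count += 1
    write()  # assumed: one final, empty response after the last chunk
    return count
```

On the SciDB instance, the call would look like stream_chunks(scidbstrm.read, scidbstrm.write, some_function), where some_function takes and returns a DataFrame.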

A convenience invocation of the Python interpreter is provided in the python_map variable, which is set to:

python -uc "import scidbstrm; scidbstrm.map(scidbstrm.read_func())"

Finally, see 4-machine-learning.py for a more complex example that goes through the steps of using machine learning (preprocessing, training, and prediction).

Debugging Python Code

When debugging Python code executed as part of the stream operator, do not use the print function. The stream operator communicates with the Python process over stdout, and print writes to stdout by default, so its output would interfere with the inter-process communication.

Instead, write debug output to stderr using sys.stderr.write. For example:

import sys

x = [1, 2, 3]
sys.stderr.write("{}\n".format(x))

The output is written to the scidb-stderr.log file of each instance. For example, with SciDB 18.1 installed in the default location and configured with one server and two instances, the files are:

/opt/scidb/18.1/DB-scidb/0/0/scidb-stderr.log
/opt/scidb/18.1/DB-scidb/0/1/scidb-stderr.log
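
For repeated debug output, the sys.stderr.write call can be wrapped in a small helper. This is only a sketch, and the name debug is hypothetical. It formats its arguments much like print does, but writes only to stderr, so stdout stays reserved for the stream operator.

```python
import sys


def debug(*args):
    """Write one space-separated debug line to stderr, leaving stdout untouched."""
    sys.stderr.write(" ".join(str(a) for a in args) + "\n")
    sys.stderr.flush()  # make the line appear in the log promptly
```

For example, debug("chunk", [1, 2, 3]) writes the line "chunk [1, 2, 3]" to stderr.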

Files for scidb-strm, version 19.3.0

scidb_strm-19.3.0-py2.py3-none-any.whl (4.7 kB), wheel, Python version py2.py3
scidb-strm-19.3.0.tar.gz (4.1 kB), source