
Python library for SciDB streaming



Requirements

SciDB 19.11 or newer.

Apache Arrow 5.0.0 to 11.0.0.

Python 3.6.x, 3.7.x, 3.8.x, 3.9.x, or 3.10.x

Required Python packages:

dill
pandas
pyarrow

Installation

Install latest release:

pip install scidb-strm

Install development version from GitHub:

pip install git+http://github.com/paradigm4/stream.git#subdirectory=py_pkg

The Python library needs to be installed on the SciDB server. It also needs to be installed on the client if Python code is to be sent from the client to the server.

SciDB-Strm Python API and Examples

Once installed, the SciDB-Strm Python library can be imported with import scidbstrm. The library provides high-level and low-level access to the SciDB stream operator, as well as the ability to send Python code to the SciDB server.

High-level access is provided by the function map:

map(map_fun, finalize_fun=None)

Read SciDB chunks. For each chunk, call map_fun and stream its result back to SciDB. If finalize_fun is provided, call it after all the chunks have been processed.
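The control flow described above can be sketched in plain Python. This is an illustration only, not the library's implementation: map_sketch, fake_read, fake_write, double and total_rows are hypothetical names, plain lists stand in for the Pandas DataFrames that SciDB chunks arrive as, and the empty write() issued when no finalize_fun is given is an assumption about how the end of the stream is signalled.

```python
def map_sketch(read, write, map_fun, finalize_fun=None):
    """Apply map_fun to every chunk, then optionally emit a final result."""
    while True:
        chunk = read()
        if chunk is None:  # no more chunks to process
            break
        write(map_fun(chunk))
    # After the last chunk: send the finalize result, or an empty write
    # (assumption: an empty write tells the other side there is no more data).
    if finalize_fun is None:
        write()
    else:
        write(finalize_fun())


# Stand-ins for SciDB: a queue of incoming chunks and a list of replies.
incoming = [[1, 2], [3, 4, 5]]
outgoing = []
state = {"rows": 0}


def fake_read():
    return incoming.pop(0) if incoming else None


def fake_write(chunk=None):
    outgoing.append(chunk)


def double(chunk):
    # Per-chunk work: count the rows seen and return the transformed chunk.
    state["rows"] += len(chunk)
    return [2 * value for value in chunk]


def total_rows():
    # Runs once, after every chunk has been processed.
    return [state["rows"]]


map_sketch(fake_read, fake_write, double, total_rows)
```

With the real library, the equivalent call would be scidbstrm.map(double, total_rows), with both functions working on DataFrames instead of lists.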

See 0-iquery.txt for a succinct example using the map function.

See 1-map-finalize.py for an example using the map function. The Python script has to be copied onto the SciDB instance.

Python code can be sent to the SciDB server for execution using the pack_func and read_func functions:

pack_func(func)

Serialize Python function for use as upload_data in input or load operators.

read_func()

Read and de-serialize function from SciDB.

See 2-pack-func.py for an example of using the pack_func and read_func functions.

Low-level access is provided by the read and write functions:

read()

Read a data chunk from SciDB. Returns a Pandas DataFrame or None.

write(df=None)

Write a data chunk to SciDB.

See 3-read-write.py for an example using the read and write functions. The Python script has to be copied onto the SciDB instance.

A convenience invocation of the Python interpreter is provided in the python_map variable, which is set to:

python -uc "import scidbstrm; scidbstrm.map(scidbstrm.read_func())"

Finally, see 4-machine-learning.py for a more complex example of going through the steps of using machine learning (preprocessing, training, and prediction).

Debugging Python Code

When debugging Python code executed as part of the stream operator, do not use the print function. The stream operator communicates with the Python process over stdout, and print writes to stdout, so using it would interfere with the inter-process communication.

Instead, use the debug function provided by the library. The function formats its arguments as strings and prints them separated by spaces. For example:

debug("Value of i is", 10)

Alternatively, output can be written directly to stderr using sys.stderr.write. For example:

import sys

x = [1, 2, 3]
sys.stderr.write("{}\n".format(x))

The output is written to the scidb-stderr.log file of each instance. For example, with SciDB 18.1 installed in the default location and configured with one server and two instances, the files are:

/opt/scidb/18.1/DB-scidb/0/0/scidb-stderr.log
/opt/scidb/18.1/DB-scidb/0/1/scidb-stderr.log
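To make the stdout versus stderr distinction concrete, the following self-contained sketch captures both streams and shows where each kind of message ends up. The helper debug_sketch is a hypothetical stand-in written for this illustration; it is not the library's debug function and only assumes that diagnostics should go to stderr.

```python
import contextlib
import io
import sys


def debug_sketch(*args):
    """Hypothetical stand-in: join the arguments with spaces, write to stderr."""
    sys.stderr.write(" ".join(str(arg) for arg in args) + "\n")


stdout_capture = io.StringIO()
stderr_capture = io.StringIO()

with contextlib.redirect_stdout(stdout_capture), \
        contextlib.redirect_stderr(stderr_capture):
    # print lands on stdout, the channel the stream operator reads data from.
    print("Value of i is", 10)
    # The stderr helper leaves stdout untouched.
    debug_sketch("Value of i is", 10)

polluted_stdout = stdout_capture.getvalue()
diagnostics = stderr_capture.getvalue()
```

Here polluted_stdout holds the text that would have been mixed into the data channel, while diagnostics holds what would reach the log file.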

ImportError: No module named

When trying to de-serialize a Python function uploaded to SciDB using pack_func, one might encounter:

ImportError: No module named ...

This error occurs because dill, the Python serialization library, links the function to the module in which it is defined. This can be resolved in two ways:

  1. Make the named module available on all the SciDB instances

  2. If the module is small, the recursive dill mode can be used. Replace:

    foo_pack = scidbstrm.pack_func(foo)

    with:

    foo_pack = numpy.array([dill.dumps(foo, 0, recurse=True)])
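The "links the function to the module" behaviour can be observed with the standard pickle module, which dill extends. The sketch below uses pickle and json.dumps purely as an illustration; it assumes dill treats functions from importable modules the same way, and it does not exercise dill or scidbstrm.

```python
import json
import pickle

# json.dumps lives in an importable module, so pickle stores only a reference
# to it (module name plus function name), not the function's code.
payload = pickle.dumps(json.dumps)

# The serialized bytes name both the module and the function.
names_module = b"json" in payload
names_function = b"dumps" in payload

# Loading re-imports the module and looks the function up by name. On a
# machine where the module is missing, this is the step that would fail
# with an import error.
restored = pickle.loads(payload)
same_object = restored is json.dumps
```

Because only the names travel with the payload, the receiving side must be able to import the module, which is what option 1 ensures. Option 2 sidesteps the lookup by serializing more of the function itself.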
