
Project description

SSEC-JHU dplutils



Distributed Data Pipeline Utilities

Usage:

Check out the docs at https://dplutils.readthedocs.io/en/stable/

Setup locally

pip install dplutils

You can create pipelines and run them in a local ray cluster as in the example below, either in client mode directly in an interactive python session, or by submission to a locally running ray cluster.
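A pipeline task is essentially a function from one pandas DataFrame batch to another. As a pandas-only sketch of what such a task does (mirroring the `lambda x: x.assign(newcol=1)` task used in the example script below; the column names here are illustrative):

```python
import pandas as pd

# A task function receives a batch as a DataFrame and returns a DataFrame;
# this is the same transform as `lambda x: x.assign(newcol=1)`.
def add_constant(batch: pd.DataFrame) -> pd.DataFrame:
    return batch.assign(newcol=1)

batch = pd.DataFrame({"value": [10, 20, 30]})
out = add_constant(batch)
print(out["newcol"].tolist())  # → [1, 1, 1]
```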

Further details on running a pipeline are given below in the section on using the docker container environment.

Setup Using Docker

Get (or build; see below) the docker image:

docker pull ghcr.io/ssec-jhu/dplutils:latest

Start cluster

To start a cluster, start one ray head node and any number of worker nodes on network-connected hosts. To start the head node in a container using the docker engine, the following can be used:

docker run -d -n rayhead -v /path/to/data:/data --net host \
  dplutils /opt/startray.sh --head --block

which will start the head node (blocking so that the container stays up). The --net host option is given to expose all open ports on the host, as ray requires several bi-directional connections to workers. The -v ... option is an example of mounting a local path into the container so it can access files (for example a directory containing source data, and an output directory). The head node also exposes a dashboard on port 8265 that can be viewed in a web browser.

On the workers, similarly start the container using the command:

docker run -d -v /path/to/data:/data --net host \
  dplutils /opt/startray.sh --block --address={head-node}:6379

For hosts with custom resources (i.e. other than those that are auto-detected, such as CPUs and GPUs), you can pass resources to the start command:

docker run -d --net host -v /path/to/data:/data \
  dplutils /opt/startray.sh --block --address={head-node}:6379 \
  --resources '{"mycustomresource": 1}'

In the dashboard you should see the workers listed in the cluster tab.

Start pipeline

Pipelines can be run via interactive python sessions or asynchronously. In an interactive session one would import or define a pipeline within the session and then call its run or writeto method to kick off execution. For longer-running or production jobs it is generally advisable to submit a job to ray. dplutils contains helpers to make it easy to run configurable pipelines via the command line. For example, assuming a script like:

from dplutils.pipeline import PipelineTask
from dplutils.pipeline.ray import RayDataPipelineExecutor
from dplutils.cli import cli_run

if __name__ == '__main__':
    pl = RayDataPipelineExecutor([PipelineTask('task1', lambda x: x.assign(newcol=1))])
    cli_run(pl)

We can submit the job in the following way, assuming a container is already running and has had the ray head node started (here named rayhead; see above):

docker exec -it rayhead ray job submit -- python /path/to/script.py -o outdir

Note that since this is run within the container environment, the paths are those exposed within the container and not necessarily the same as on the host.

The progress and log files can be viewed on the ray dashboard, and generated data will be available as a parquet table written to /outdir (one file per batch, as they are completed).

Installation, Build, & Run instructions

An "official" docker image based on the latest release is provided, but for development, or for those needing a custom build or running outside of a containerized environment, below are instructions for installing the code from the source repository.

Setup

Install dependencies:

  • pip install -r requirements/dev.txt

Tests

Run tox:

  • tox -e test, to run just the tests
  • tox, to run linting, tests, and build. This should run without errors prior to committing

Docker

From the repo directory, run

docker build -f docker/Dockerfile --tag dplutils .

Project details


Download files

Download the file for your platform.

Source Distribution

dplutils-0.8.0.tar.gz (164.4 kB)

Uploaded Source

Built Distribution

dplutils-0.8.0-py3-none-any.whl (28.9 kB)

Uploaded Python 3

File details

Details for the file dplutils-0.8.0.tar.gz.

File metadata

  • Download URL: dplutils-0.8.0.tar.gz
  • Upload date:
  • Size: 164.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/5.1.1 CPython/3.12.7

File hashes

Hashes for dplutils-0.8.0.tar.gz:

  • SHA256: 5ed4f1ee7c1aed5c3e801ae9711a39370afc14b179af63868b596163e48b8b68
  • MD5: 3e67cf60d3c9bd77fe68d41ac146206c
  • BLAKE2b-256: 34e1e8bc44ab5cfa0f2d3bacf5615e5afde91a432b804ead2fdb491b6499a3e7

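To check a downloaded file against a published digest, the standard-library hashlib suffices. A minimal sketch (the file path in the usage comment is hypothetical; the expected digest is the sdist SHA256 from the list above):

```python
import hashlib

# Published SHA256 for dplutils-0.8.0.tar.gz (copied from the hash list above).
EXPECTED = "5ed4f1ee7c1aed5c3e801ae9711a39370afc14b179af63868b596163e48b8b68"

def sha256sum(path: str) -> str:
    """Stream a file through hashlib in chunks and return its hex digest."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 16), b""):
            h.update(chunk)
    return h.hexdigest()

# Usage (path is hypothetical):
# assert sha256sum("dplutils-0.8.0.tar.gz") == EXPECTED
```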

Provenance

The following attestation bundles were made for dplutils-0.8.0.tar.gz:

Publisher: dist.yml on ssec-jhu/dplutils


File details

Details for the file dplutils-0.8.0-py3-none-any.whl.

File metadata

  • Download URL: dplutils-0.8.0-py3-none-any.whl
  • Upload date:
  • Size: 28.9 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/5.1.1 CPython/3.12.7

File hashes

Hashes for dplutils-0.8.0-py3-none-any.whl:

  • SHA256: 8b041333e1bcbd4c707eb1dffe89a005c272fdf8a94b4fa9884b5c67ea28aa4d
  • MD5: 3101350326beb3daffe7e2c26671f8e9
  • BLAKE2b-256: 0d582252b699d7dfb7e83011383bc000774989f706239e305452644e2606753e


Provenance

The following attestation bundles were made for dplutils-0.8.0-py3-none-any.whl:

Publisher: dist.yml on ssec-jhu/dplutils

