OpusPocus

Modular NLP pipeline manager.

OpusPocus aims to simplify the description and execution of popular and custom NLP pipelines, including dataset preprocessing, model training, fine-tuning, and evaluation. The pipeline manager supports execution through a simple CLI (Bash) or common HPC schedulers (Slurm).

It uses OpusCleaner for data preparation and OpusTrainer for training scheduling (development in progress).

Structure

  • go.py - pipeline manager entry script
  • opuspocus/ - OpusPocus modules
  • opuspocus_cli/ - OpusPocus CLI subcommands
  • config/ - default configuration files (pipeline config, marian training config, ...)
  • examples/ - pipeline manager usage examples
  • scripts/ - helper scripts that are not yet integrated directly into OpusPocus
  • tests/ - unit tests

Installation

  1. Install MarianNMT
$ ./scripts/install_marian_gpu.sh PATH_TO_CUDA CUDNN_VERSION [NUM_THREADS]

Alternatively, you can use scripts/install_marian_cpu.sh for the CPU version. Note that the scripts may require modification based on your system.

  2. (Optional) Set up the Python virtual environment (using virtualenv):
$ /usr/bin/virtualenv -p /usr/bin/python3.10 python-venv
  3. Install the Python dependencies.
$ source python-venv/bin/activate  # if using the virtual environment
$ pip install --upgrade pip setuptools
$ pip install -r requirements.txt
  4. Set up the Python virtual environment for OpusCleaner. (OpusCleaner currently does not support Python>=3.10.)
$ /usr/bin/virtualenv -p /usr/bin/python3.9 opuscleaner-venv
  5. Activate the OpusCleaner virtualenv and install OpusCleaner's dependencies.
$ source opuscleaner-venv/bin/activate
$ pip install --upgrade pip setuptools
$ pip install -r requirements-opuscleaner.txt
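
The virtualenv commands above assume that specific interpreters exist at /usr/bin/python3.10 and /usr/bin/python3.9. A minimal sketch for checking this before creating the environments is shown here; the helper name check_interpreter is hypothetical and not part of OpusPocus, and the paths may need adjusting for your system:

```shell
# Hypothetical helper (not part of OpusPocus): succeeds if the given
# interpreter can be found, either as an absolute path or on PATH.
check_interpreter() {
    command -v "$1" >/dev/null 2>&1
}

# Interpreter paths taken from the virtualenv commands above.
for py in /usr/bin/python3.10 /usr/bin/python3.9; do
    if check_interpreter "${py}"; then
        echo "found:   ${py}"
    else
        echo "missing: ${py}"
    fi
done
```

If an interpreter is reported as missing, pass the path of a matching interpreter on your system to the -p option instead.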

Usage (Simple Pipeline)

Either run the main script go.py or the subcommand scripts from the opuspocus_cli/ directory. Run the scripts directly from the root directory of this repository. (You may need to add the path to the local OpusPocus repository directory to your PYTHONPATH.)
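
One way to extend PYTHONPATH is sketched below. It assumes that the current working directory is the root of the OpusPocus repository; REPO_ROOT is a variable name chosen for illustration only:

```shell
# Assumes the current working directory is the root of the OpusPocus repository.
REPO_ROOT="$(pwd)"

# Prepend the repository root; keep any existing PYTHONPATH entries after it.
export PYTHONPATH="${REPO_ROOT}${PYTHONPATH:+:${PYTHONPATH}}"
echo "PYTHONPATH=${PYTHONPATH}"
```

The export only affects the current shell session, so it has to be repeated (or added to a shell startup file) for new sessions.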

You can execute ./go.py --help for a general description or ./go.py <subcommand> --help to list the options available for a subcommand.

Pipeline execution

Run ./go.py run (or opuspocus_cli/run) while providing a pipeline configuration file to execute a new pipeline:

$ ./go.py run --pipeline-dir <pipeline_destination> --pipeline-config <config_file> --runner <runner>

Alternatively, run ./go.py run while providing an existing pipeline directory to rerun a failed pipeline execution:

$ ./go.py run --pipeline-dir <pipeline_dir> --runner <runner>

You can use --reinit to reinitialize the existing pipeline before running. You can use --resubmit-done to also execute pipeline steps in the DONE state.

Lastly, you can also stop and resubmit a running pipeline using --stop-previous-run:

$ ./go.py run --pipeline-dir <pipeline_dir> --stop-previous-run

This is similar to:

$ ./go.py stop --pipeline-dir <pipeline_dir>
$ ./go.py run --pipeline-dir <pipeline_dir>
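
The two-step variant can be wrapped in a small shell function. This is only a sketch: restart_pipeline and the GO_CMD variable are hypothetical names, not part of OpusPocus, and the sketch assumes that ./go.py stop exits with a non-zero status when it fails.

```shell
# GO_CMD (hypothetical) allows substituting another command for ./go.py,
# e.g. `echo` to print the calls instead of executing them.
GO_CMD="${GO_CMD:-./go.py}"

restart_pipeline() {
    local pipeline_dir="$1"
    # Rerun only if stopping the previous run succeeded.
    "${GO_CMD}" stop --pipeline-dir "${pipeline_dir}" &&
        "${GO_CMD}" run --pipeline-dir "${pipeline_dir}"
}
```

Setting GO_CMD=echo before calling restart_pipeline <pipeline_dir> prints the two commands without running them, which can be useful for checking the arguments first.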

Other subcommands

  • stop - stops the execution of a running pipeline
  • status - prints the status of a pipeline and its steps
  • traceback - prints the dependency structure of a pipeline

Examples

I. Data preprocessing example

  1. Download the data and set up the dataset directory structure.
$ scripts/prepare_data.en-eu.sh
  2. Initialize and execute the (data preprocessing) pipeline.
$ mkdir -p experiments/en-eu/preprocess.simple
$ ./go.py run \
    --pipeline-config config/pipeline.preprocess.yml \
    --pipeline-dir experiments/en-eu/preprocess.simple \
    --runner bash
  • --pipeline-config (required) describes the pipeline steps and their dependencies
  • --pipeline-dir (optional) overrides the pipeline.pipeline_dir value from the pipeline config
  • --runner (required) selects the runner used for pipeline execution. Use --runner slurm for more effective HPC execution (if Slurm is available)
  3. Check the pipeline status.
$ ./go.py traceback --pipeline-dir experiments/en-eu/preprocess.simple

OR

$ ./go.py status --pipeline-dir experiments/en-eu/preprocess.simple

II. Model training example (preprocessing follow-up)

  1. Check the preprocessing pipeline status. (The data preprocessing pipeline must be finished, i.e. all steps must be in the DONE state.)
$ ./go.py status --pipeline-dir experiments/en-eu/preprocess.simple
  2. Initialize and execute the training pipeline.
$ mkdir -p experiments/en-eu/train.simple
$ ./go.py run \
    --pipeline-config config/pipeline.train.simple.yml \
    --pipeline-dir experiments/en-eu/train.simple \
    --runner bash
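
Both examples follow the same pattern: create the pipeline directory, then call ./go.py run with a pipeline config, the pipeline directory, and a runner. A minimal sketch of that pattern as a shell function follows; run_example and GO_CMD are hypothetical names, not part of OpusPocus, and the function does not check that a preceding pipeline has finished.

```shell
# GO_CMD (hypothetical) allows substituting another command for ./go.py,
# e.g. `echo` to print the call instead of executing it.
GO_CMD="${GO_CMD:-./go.py}"

# Usage: run_example <pipeline_config> <pipeline_dir> [runner]
run_example() {
    local config="$1" pipeline_dir="$2" runner="${3:-bash}"
    # Create the pipeline directory, as in the mkdir -p step of the examples.
    mkdir -p "${pipeline_dir}" || return 1
    "${GO_CMD}" run \
        --pipeline-config "${config}" \
        --pipeline-dir "${pipeline_dir}" \
        --runner "${runner}"
}
```

For instance, run_example config/pipeline.train.simple.yml experiments/en-eu/train.simple corresponds to the training commands above, and passing slurm as the third argument selects the Slurm runner.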

Acknowledgements

This project has received funding from the European Union’s Horizon Europe research and innovation programme under grant agreement No 101070350 and from UK Research and Innovation (UKRI) under the UK government’s Horizon Europe funding guarantee [grant number 10052546].
