

PyMarian

  • Python bindings to Marian (C++), built using PyBind11
  • The Python package is built using scikit-build-core

Install

# build marian with -DPYMARIAN=on option to create a pymarian wheel
cmake . -Bbuild -DCOMPILE_CUDA=off -DPYMARIAN=on -DCMAKE_BUILD_TYPE=Release
cmake --build build -j       # -j option parallelizes build on all cpu cores
python -m pip install build/pymarian-*.whl

The above commands use the python executable found in PATH to determine the Python version for compiling the marian native extension. Make sure the desired python executable is in your environment before invoking these cmake commands.
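
A quick import confirms that the installed wheel matches your interpreter (a minimal smoke test; Translator and Evaluator are the classes documented in the Python API section below):

# smoke test: the import fails if the wheel was built for a different Python version
import pymarian
from pymarian import Translator, Evaluator

print("pymarian imported successfully")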

Python API

The Python API is designed to take the same arguments as the marian CLI, passed as a single string.

NOTE: these APIs are experimental and not finalized. See mtapi_server.py for an example use of the Translator API.

Translator

# Translator
from pymarian import Translator
cli_string = "..."
translator = Translator(cli_string)

sources = ["sent1", "sent2"]
result = translator.translate(sources)
print(result)
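
For illustration, the same call with a concrete CLI string; the model and vocabulary paths below are hypothetical placeholders, and the options are standard marian decoder flags:

# hypothetical paths -- point these at a real marian model and SentencePiece vocabulary
from pymarian import Translator

cli_string = "-m /path/to/model.npz -v /path/to/vocab.spm /path/to/vocab.spm --beam-size 4"
translator = Translator(cli_string)
print(translator.translate(["Good morning.", "How are you?"]))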

Evaluator

# Evaluator
from pymarian import Evaluator
cli_string = '-m path/to/model.npz -v path/to/vocab.spm path/to/vocab.spm --like comet-qe'
evaluator = Evaluator(cli_string)

data = [
    ["Source1", "Hyp1"],
    ["Source2", "Hyp2"]
]
scores = evaluator.run(data)
for score in scores:
    print(score)
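
To obtain a single system-level score, analogous to the --average option of the CLI described below, the segment scores can simply be averaged (a sketch, assuming run() yields one float per input row):

# average the segment scores into a system score (assumes one float per row)
scores = list(evaluator.run(data))  # materialize in case run() returns an iterator
system_score = sum(scores) / len(scores)
print(f"system score: {system_score:.4f}")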

CLI Usage

  • pymarian-eval : CLI to download and use pretrained metrics such as COMETs, COMETOIDs, ChrFoid, and BLEURT
  • pymarian-mtapi : REST API demo powered by Flask
  • pymarian-qtdemo : GUI App demo powered by QT

pymarian-eval

$ pymarian-eval -h 
usage: pymarian-eval [-h] [-m MODEL] [-v VOCAB] [-l {comet-qe,bleurt,comet}] [-V] [-] [-t MT_FILE] [-s SRC_FILE] [-r REF_FILE] [-f FIELD [FIELD ...]] [-o OUT] [-a {skip,append,only}] [-w WIDTH] [--debug] [--fp16] [--mini-batch MINI_BATCH] [-d [DEVICES ...] | -c
                     CPU_THREADS] [-ws WORKSPACE] [-pc]

options:
  -h, --help            show this help message and exit
  -m MODEL, --model MODEL
                        Model name, or path. Known models: bleurt-20, wmt20-comet-da, wmt20-comet-qe-da, wmt20-comet-qe-da-v2, wmt21-comet-da, wmt21-comet-qe-da, wmt21-comet-qe-mqm, wmt22-comet-da, wmt22-cometkiwi-da, xcomet-xl, xcomet-xxL (default: wmt22-cometkiwi-da)
  -v VOCAB, --vocab VOCAB
                        Vocabulary file (default: None)
  -l {comet-qe,bleurt,comet}, --like {comet-qe,bleurt,comet}
                        Model type. Required if --model is a local file (auto inferred for known models) (default: None)
  -V, --version         show program's version number and exit
  -, --stdin            Read input from stdin. TSV file with following format: QE metrics: "src<tab>mt", Ref based metrics ref: "src<tab>mt<tab>ref" or "mt<tab>ref" (default: False)
  -t MT_FILE, --mt MT_FILE
                        MT output file. Ignored when --stdin (default: None)
  -s SRC_FILE, --src SRC_FILE
                        Source file. Ignored when --stdin (default: None)
  -r REF_FILE, --ref REF_FILE
                        Ref file. Ignored when --stdin (default: None)
  -f FIELD [FIELD ...], --fields FIELD [FIELD ...]
                        Input fields, an ordered sequence of {src, mt, ref} (default: ['src', 'mt', 'ref'])
  -o OUT, --out OUT     output file (default: <_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>)
  -a {skip,append,only}, --average {skip,append,only}
                        Average segment scores to produce system score. skip=do not output average (default; segment scores only); append=append average at the end; only=output the average only (i.e. system score only) (default: skip)
  -w WIDTH, --width WIDTH
                        Output score width (default: 4)
  --debug               Debug or verbose mode (default: False)
  --fp16                Enable FP16 mode (default: False)
  --mini-batch MINI_BATCH
                        Mini-batch size (default: 16)
  -d [DEVICES ...], --devices [DEVICES ...]
                        GPU device IDs (default: None)
  -c CPU_THREADS, --cpu-threads CPU_THREADS
                        Use CPU threads. 0=use GPU device 0 (default: None)
  -ws WORKSPACE, --workspace WORKSPACE
                        Workspace memory (default: 8000)
  -pc, --print-cmd      Print marian evaluate command and exit (default: False)
  --cache CACHE         Cache directory for storing models (default: $HOME/.cache/marian/metric)

More info at https://github.com/marian-nmt/marian-dev. This CLI is loaded from .../python3.10/site-packages/pymarian/eval.py (version: 1.12.25)

Performance Tuning Tips:

  • For CPU parallelization, use --cpu-threads <n>.
  • For GPU parallelization, assuming pymarian was compiled with CUDA support, use e.g. --devices 0 1 2 3 to run on the four specified GPU devices.
  • On out-of-memory errors, reduce the --mini-batch size.
  • To see full logs from marian, set --debug.
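
Since the Python API takes the same arguments as the marian CLI, the same tuning options can also be embedded in an Evaluator CLI string. This is only a sketch with hypothetical paths, assuming the embedded evaluator accepts these marian options:

# hypothetical paths; --cpu-threads, --mini-batch and --workspace mirror the tips above
from pymarian import Evaluator

cli_string = (
    "-m /path/to/model.npz -v /path/to/vocab.spm /path/to/vocab.spm --like comet-qe "
    "--cpu-threads 8 --mini-batch 16 --workspace 8000"
)
evaluator = Evaluator(cli_string)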

pymarian-mtapi

Launch server

# example model: download and extract
wget http://data.statmt.org/romang/marian-regression-tests/models/wngt19.tar.gz 
tar xvf wngt19.tar.gz 

# launch server
pymarian-mtapi -s en -t de "-m wngt19/model.base.npz -v wngt19/en-de.spm wngt19/en-de.spm"

Example request from client

URL="http://127.0.0.1:5000/translate"
curl $URL --header "Content-Type: application/json" --request POST --data '[{"text":["Good Morning."]}]'
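
The same request can be sent from Python (a sketch using the requests library, assuming the server is running locally on the port shown above):

# equivalent of the curl call above
import requests

url = "http://127.0.0.1:5000/translate"
payload = [{"text": ["Good Morning."]}]
response = requests.post(url, json=payload)
print(response.json())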

pymarian-qtdemo

pymarian-qtdemo

Code Formatting

pip install black isort
cd src/python
isort .
black .

Run Tests

# install pytest if necessary
python -m pip install pytest

# run tests in quiet mode
python -m pytest src/python/tests/regression

# or, add -s to see STDOUT/STDERR from tests
python -m pytest -s src/python/tests/regression

Release Instructions

Building Pymarian for Multiple Python Versions

Our CMake scripts detect the python3.* executables available in PATH and build pymarian for each of them. To support a specific version of Python, make that python3.x executable available in PATH before running cmake. This can be done without conflicts using conda or mamba.

# setup mamba if not already; Note: you may use conda as well
which mamba || {
   name=Miniforge3-$(uname)-$(uname -m).sh
   wget "https://github.com/conda-forge/miniforge/releases/latest/download/$name" \
      && bash $name -b -p ~/mambaforge && ~/mambaforge/bin/mamba init bash && rm $name
}

# create environment for each version
versions="$(echo 3.{12,11,10,9,8,7})"
for version in $versions; do
   echo "python $version"
   mamba env list | grep -q "^py${version}" || mamba create -q -y -n py${version} python=${version}
done

# stack all environments
for version in $versions; do mamba activate py${version} --stack; done
# check if all python versions are available
for version in $versions; do which python$version; done


# Build as usual
cmake . -B build -DCOMPILE_CUDA=off -DPYMARIAN=on
cmake --build build -j
ls build/pymarian*.whl

Release Pymarian to PyPI

Releasing Pymarian on PyPI is a two step process:

  1. Building maximally compatible package
  2. Uploading to PyPI

1. Build Pymarian for Public Release

We want to ensure that pymarian is compatible with many versions of Python and operating systems. Currently we support Linux builds only. For compatibility across Linux distros, we build in an environment with an older GLIBC, which is achieved using docker.

Run bash build.sh to build the wheels for Linux. Inspect build-manylinux.sh for the actual build script that runs inside the docker environment.

2. Upload to PyPI

# pip install twine # if required

# upload to the test index first
twine upload -r testpypi $MARIAN_ROOT/build-pymarian/manylinux/*.whl

# then upload to the production index
twine upload -r pypi $MARIAN_ROOT/build-pymarian/manylinux/*.whl

Initial Setup: create ~/.pypirc with the following:

[distutils]
index-servers =
    pypi
    testpypi

[pypi]
repository: https://upload.pypi.org/legacy/
username:__token__
password:<token>

[testpypi]
repository: https://test.pypi.org/legacy/
username:__token__
password:<token>

Obtain token from https://pypi.org/manage/account/

Known issues

  1. In a conda or mamba environment, if you see the error .../miniconda3/envs/<envname>/bin/../lib/libstdc++.so.6: version 'GLIBCXX_3.4.30' not found, install libstdcxx-ng:

    conda install -c conda-forge libstdcxx-ng
    


Download files

Source Distributions

No source distribution files are available for this release.

Built Distributions

  • pymarian-1.12.42-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (604.2 MB; CPython 3.12; manylinux: glibc 2.17+ x86-64)
  • pymarian-1.12.42-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (604.2 MB; CPython 3.11; manylinux: glibc 2.17+ x86-64)
  • pymarian-1.12.42-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (604.2 MB; CPython 3.10; manylinux: glibc 2.17+ x86-64)
  • pymarian-1.12.42-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (604.2 MB; CPython 3.9; manylinux: glibc 2.17+ x86-64)
  • pymarian-1.12.42-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (604.2 MB; CPython 3.8; manylinux: glibc 2.17+ x86-64)

File hashes

pymarian-1.12.42-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
  SHA256       e0d0ff8fb885364ca32ed66ca6e1976dec3557fad528a4cf183aac4f3c1453ea
  MD5          69c7759e80e0c237ece9a4bbf85b67d7
  BLAKE2b-256  147718bdebd274130f098ee065cd003e3f8ae6bfea1dbe6b30142bef5ef1da35

pymarian-1.12.42-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
  SHA256       47a6674303bc1fd2250e388aad90f6e1dbe1d5a40b850a18c8c4f2a33cc44b52
  MD5          2ea68a7c2c529a3e37585f81108f414c
  BLAKE2b-256  6638891a26cb20ad9c5e6633152a785ecb44b21f7dd27e02cdab74e5984f4ce4

pymarian-1.12.42-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
  SHA256       be303eb93b3ac6dac44ce19bb1852f20f6ba8040f1013819eda72946ee1cc9b5
  MD5          40bdfaeeab8724e9d21cdadb1303e423
  BLAKE2b-256  69c4a3ed4d72be3af8241b33d732f6c96c0b4aed81f2ed037312744f54c1eba8

pymarian-1.12.42-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
  SHA256       3b768242e520ab0c9fd2dc612ff15b20e5030a19c281f8f25b6a9079ba9afb4e
  MD5          0c82f0c45d8f5537877f7f15ea5302ab
  BLAKE2b-256  977e0180098b6c7dc3ed57a2ec714bfdbbfb05c76918ce855d7248eb5237e66d

pymarian-1.12.42-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
  SHA256       d600f4ff70be385e79f0b63c818ef39881839f5cbbbf1be0b54e685163b3e27f
  MD5          438d01bda6ec5179773dbcf9674052fc
  BLAKE2b-256  a605282484299f8e03e25014eab31d2127fc108ab0e04c1217d81b85217c4cfa
