teradata-opus-translate
Convert Helsinki-NLP OPUS (Marian) translation models into self-contained ONNX files for deployment to Teradata Vantage via Bring Your Own Model (BYOM).
What it does
teradata-opus-translate turns a HuggingFace OPUS / Marian translation model
into a single self-contained ONNX file with com.microsoft.BeamSearch
embedded in the graph, plus a single-file fast tokenizer.json. The resulting
artifacts load directly into Teradata BYOM tables and are scored at SQL time
through the TD_MLDB.ONNXSeq2Seq table operator.
The package handles the parts of the conversion that aren't obvious from the
HuggingFace optimum / torch.onnx.export defaults: stripping generation
features the BYOM beam-search op silently ignores, wiring the SQL-tunable
generation parameters as graph inputs (so customers override them per query
rather than baking them in), keeping the ONNX opset on the BYOM ORT 1.16.3
ceiling, and verifying token-level parity against MarianMTModel.generate()
before the file is written.
The intended workflow is: convert once, upload the artifacts to BYOM with
teradataml.save_byom, then translate at scale through the database with no
Python hop. The callable surface (convert_model and convert_tokenizer) is
deliberately small: everything tunable at scoring time stays out of the
export step.
Install
pip install teradata-opus-translate
Requires Python 3.12+. Key dependencies are pulled in automatically:
transformers, torch, onnx, onnxruntime, tokenizers,
sentencepiece, sacremoses, numpy.
To upload converted models to Teradata BYOM using the teradataml example
below, install teradataml separately:
pip install teradataml
Quickstart
from teradata_opus_translate import convert_model, convert_tokenizer
convert_model(
"Helsinki-NLP/opus-mt-de-en",
output_path="model.onnx",
)
convert_tokenizer(
"Helsinki-NLP/opus-mt-de-en",
output_path="tokenizer.json",
)
This downloads the model from the HuggingFace Hub, exports it to ONNX with
com.microsoft.BeamSearch embedded, runs token-parity verification against
MarianMTModel.generate(), and writes both files to your working directory.
The two files together are everything BYOM needs.
API reference
convert_model
def convert_model(
source: str | os.PathLike[str],
*,
precision: Literal["fp32", "int8"] = "fp32",
output_path: str | os.PathLike[str],
opset: int = 14,
verify: bool = True,
verify_samples: list[str] | None = None,
no_repeat_ngram_size: int | None = None,
early_stopping: bool | None = None,
cache_dir: str | os.PathLike[str] | None = None,
verbose: bool = False,
log_level: str | int | None = None,
) -> ConvertModelResult
| name | type | default | description |
|---|---|---|---|
| source | str \| PathLike | required | HuggingFace repo id (e.g. "Helsinki-NLP/opus-mt-de-en") or a local directory containing a downloaded HF repo. An existing directory is auto-detected as local; everything else is treated as an HF id. |
| precision | "fp32" \| "int8" | "fp32" | Precision mode. v1 ships dynamic int8 only; static int8 is deferred. |
| output_path | str \| PathLike | required | Destination .onnx path. Parent directories are created if missing; existing files are overwritten. |
| opset | int | 14 | ONNX opset for the encoder/decoder subgraphs. 14 matches BYOM 7.x's ORT 1.16.3 ceiling. Increase only if your BYOM version is newer and you've verified compatibility. |
| verify | bool | True | Run full token-parity verification against MarianMTModel.generate() after export. Always runs when set, regardless of model size. Failure raises AssertionError naming the first divergent sample. |
| verify_samples | list[str] \| None | None | Source-language strings used for verification. None picks a sensible default set per inferred source language (covers every language in the curated Helsinki-NLP/opustranslate collection). Pass an explicit list for custom pairs or local paths. |
| no_repeat_ngram_size | int \| None | None | Baked into the BeamSearch graph as a node attribute at export time. Defaults to model.config.no_repeat_ngram_size. Cannot be overridden at SQL-scoring time. |
| early_stopping | bool \| None | None | Baked into the BeamSearch graph as a node attribute at export time. Defaults to model.config.early_stopping. Cannot be overridden at SQL-scoring time. |
| cache_dir | str \| PathLike \| None | None | Optional HuggingFace cache directory, passed through to from_pretrained. |
| verbose | bool | False | If true, configures the package logger at INFO. |
| log_level | str \| int \| None | None | Explicit logging level; takes precedence over verbose. |
Returns a ConvertModelResult.
Raises: NotImplementedError (precision="int8" until that path lands),
ValueError (unknown precision), RuntimeError (output ONNX > 2 GiB; v1
does not support external_data), AssertionError (parity divergence when
verify=True).
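The local-vs-Hub auto-detection rule for source amounts to a single filesystem check. A minimal sketch of the documented behavior (classify_source is an illustrative helper, not part of the package API):

```python
from pathlib import Path

def classify_source(source: str) -> str:
    """Mirror the documented rule: an existing directory is treated as a
    local HF snapshot; anything else is assumed to be a Hub repo id."""
    return "local" if Path(source).is_dir() else "hf"

# A Hub-style id is not a directory on this machine, so it resolves to "hf".
print(classify_source("Helsinki-NLP/opus-mt-de-en"))  # -> hf
```

This is why a relative path that happens not to exist falls through to the Hub lookup: pass an absolute path to a local snapshot to avoid ambiguity.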
Note — SQL-tunable parameters are not export arguments. The five generation
parameters customers most often want to tune (num_beams, max_length,
min_length, length_penalty, repetition_penalty) are deliberately not exposed
by convert_model. They remain inputs of the produced ONNX graph and are
overridden per query at SQL time using the BYOM Const_* USING clause:
SELECT * FROM TD_MLDB.ONNXSeq2Seq (
ON inputs PARTITION BY ANY
ON onnx_models AS ModelTable DIMENSION
ON sequence_tokenizers AS TokenizerTable DIMENSION
USING
ModelOutputFields('sequences')
Const_num_beams(4)
Const_max_length(64)
Const_min_length(1)
Const_length_penalty(1.0)
Const_repetition_penalty(1.0)
) AS dt;
This keeps a single exported artifact tunable across many SQL workloads
without re-export.
no_repeat_ngram_size and early_stopping are baked in at export because the
BeamSearch contrib op only accepts them as node attributes.
num_return_sequences is also fixed at export (locked to 1 via a Constant node
in the produced graph): each input always returns exactly one translation, and
Const_num_return_sequences(N) on the BYOM USING clause has no effect.
Rationale and history in docs/decisions.md, Decision 10, and the v1.0.1
CHANGELOG entry.
convert_tokenizer
def convert_tokenizer(
source: str | os.PathLike[str],
*,
output_path: str | os.PathLike[str],
cache_dir: str | os.PathLike[str] | None = None,
verbose: bool = False,
) -> ConvertTokenizerResult
| name | type | default | description |
|---|---|---|---|
| source | str \| PathLike | required | HF repo id or local directory; same auto-detection rule as convert_model. |
| output_path | str \| PathLike | required | Destination tokenizer.json path. Parent directories are created if missing; existing files are overwritten. |
| cache_dir | str \| PathLike \| None | None | Optional HuggingFace cache directory, passed through to from_pretrained. |
| verbose | bool | False | If true, configures the package logger at INFO. |
Returns a ConvertTokenizerResult.
Raises RuntimeError if the in-process round-trip self-check fails
(the writer refuses to produce a tokenizer.json that disagrees with
MarianTokenizer on a canary sentence).
The output is a single-file tokenizer.json that loads via
tokenizers.Tokenizer.from_file(...) with no external dependencies. Marian's
MarianTokenizer is a slow Python tokenizer wrapping source.spm /
target.spm / vocab.json; this function rebuilds an equivalent fast
tokenizer (Unigram + Metaspace + EOS template) directly from the source-side
SentencePiece scores.
Result types
All result types are frozen dataclasses.
ConvertModelResult — returned by convert_model.
| field | type | description |
|---|---|---|
| output_path | Path | Resolved absolute path of the written .onnx file. |
| size_bytes | int | Size of the written ONNX file, in bytes. |
| source | str | Original source argument as resolved for from_pretrained. |
| source_kind | "hf" \| "local" | "local" if source was a local directory, otherwise "hf". |
| precision | "fp32" \| "int8" | Precision mode used for the export. |
| parity | ParityResult \| None | Token-parity result; None when verify=False. |
ConvertTokenizerResult — returned by convert_tokenizer.
| field | type | description |
|---|---|---|
| output_path | Path | Resolved absolute path of the written tokenizer.json. |
| size_bytes | int | Size of the written file, in bytes. |
| source | str | Original source argument as resolved for from_pretrained. |
| source_kind | "hf" \| "local" | "local" or "hf". |
ParityResult — produced by the verification step inside convert_model.
| field | type | description |
|---|---|---|
| samples | list[str] | Source-language strings used for verification. |
| hf_token_ids | list[list[int]] | Per-sample token-id sequence from MarianMTModel.generate() (after canonicalising trailing EOS / pad). |
| onnx_token_ids | list[list[int]] | Per-sample token-id sequence from the exported ONNX BeamSearch graph (same canonicalisation). |
| mismatches | int | Count of samples whose two id-sequences differ. 0 means full token-parity. |
End-to-end BYOM example
For a runnable walkthrough of the flow below, see the demo notebook.
The full customer workflow is convert → upload → score.
1. Convert the model and tokenizer locally
from teradata_opus_translate import convert_model, convert_tokenizer
convert_model(
"Helsinki-NLP/opus-mt-de-en",
output_path="opus-mt-de-en.onnx",
)
convert_tokenizer(
"Helsinki-NLP/opus-mt-de-en",
output_path="opus-mt-de-en.tokenizer.json",
)
2. Upload to Teradata BYOM with save_byom
from teradataml import create_context, save_byom
create_context(host="...", username="...", password="...")
# Model -> onnx_models(model_id, model)
save_byom(
model_id="opus-mt-de-en",
model_file="opus-mt-de-en.onnx",
table_name="onnx_models",
schema_name="OPUS_BYOM",
)
# Tokenizer goes into a separate table. save_byom always writes a column
# called "model" regardless of what the artifact actually is, so we save the
# tokenizer.json into a (model_id, model)-shaped table and alias the column
# back to "tokenizer" on SELECT in the SQL below.
save_byom(
model_id="opus-mt-de-en",
model_file="opus-mt-de-en.tokenizer.json",
table_name="sequence_tokenizers",
schema_name="OPUS_BYOM",
)
3. Score with TD_MLDB.ONNXSeq2Seq
CREATE MULTISET TABLE OPUS_BYOM.de_en_inputs (
id INTEGER,
input_text VARCHAR(2000) CHARACTER SET UNICODE
) PRIMARY INDEX (id);
INSERT INTO OPUS_BYOM.de_en_inputs VALUES (1, 'Hallo Welt.');
INSERT INTO OPUS_BYOM.de_en_inputs VALUES (2, 'Das Wetter ist heute schön.');
SELECT id, input_text, output_text
FROM TD_MLDB.ONNXSeq2Seq (
ON OPUS_BYOM.de_en_inputs PARTITION BY ANY
ON (SELECT model_id, model FROM OPUS_BYOM.onnx_models
WHERE model_id = 'opus-mt-de-en') AS ModelTable DIMENSION
ON (SELECT model_id, model AS tokenizer FROM OPUS_BYOM.sequence_tokenizers
WHERE model_id = 'opus-mt-de-en') AS TokenizerTable DIMENSION
USING
Accumulate('id', 'input_text')
TextColumn('input_text')
ModelOutputFields('sequences')
OutputFormat('FLOAT(1)')
EnableMemoryCheck('false')
OverwriteCachedModel('*')
Const_num_beams(4)
Const_max_length(64)
Const_min_length(1)
Const_length_penalty(1.0)
Const_repetition_penalty(1.0)
) AS dt;
Adjust the Const_* cluster per workload — those are exactly the SQL-time
overrides for the parameters convert_model deliberately keeps off the
export API. Note that num_return_sequences is not in the cluster:
each input always returns exactly one translation. The value is baked
into the produced ONNX graph as a Constant node, and any
Const_num_return_sequences(N) USING clause is silently ignored by BYOM.
Supported models
teradata-opus-translate v1 targets the MarianMT / OPUS family. It is
tested against Helsinki-NLP/opus-mt-* models and the curated
Helsinki-NLP/opustranslate collection. Other seq2seq architectures (T5,
BART, NLLB, mBART) are not supported in v1.
The default verification samples cover every source language in the curated
opustranslate collection plus every ISO 639-1 source code that appears in
five or more Helsinki-NLP/opus-mt-* repos, so most production language
pairs land on language-appropriate samples without you needing to pass
verify_samples= explicitly.
Verification (verify=True)
When verify=True (the default), convert_model runs full token-parity
verification immediately after the ONNX file is written:
- For each sample in verify_samples, run MarianMTModel.generate() with the
BeamSearch-compatible parameter intersection (bad_words_ids=None,
forced_eos_token_id=None, renormalize_logits=False, do_sample=False).
- Run the same sample through the exported ONNX graph in onnxruntime.
- Canonicalise the trailing-EOS / pad-to-max difference (HF emits a trailing
EOS; the BeamSearch op omits it and pads with pad_token_id).
- Compare token-id sequences exactly.
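The canonicalisation step can be sketched in plain Python (an illustrative helper with made-up ids, not the package's internal function): strip BeamSearch's pad tail, then a trailing EOS, so both sides compare on the same span.

```python
def canonicalise(ids: list[int], *, eos_id: int, pad_id: int) -> list[int]:
    """Strip the pad tail the BeamSearch op emits, then a trailing EOS."""
    out = list(ids)
    while out and out[-1] == pad_id:   # ONNX side pads to max_length
        out.pop()
    if out and out[-1] == eos_id:      # HF side emits a trailing EOS
        out.pop()
    return out

# Illustrative ids: HF ends with EOS (0); ONNX omits it and pads with 58100.
hf_out = [123, 45, 6, 0]
onnx_out = [123, 45, 6, 58100, 58100]
assert canonicalise(hf_out, eos_id=0, pad_id=58100) == \
       canonicalise(onnx_out, eos_id=0, pad_id=58100) == [123, 45, 6]
```

After this normalisation, any remaining difference between the two sequences is a genuine divergence.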
Any divergence raises an AssertionError that names the first divergent sample,
shows the HF and ONNX id sequences side by side, and reports the position of
the first differing token. This catches silent export-time regressions before
the artifact ever lands in BYOM.
Pass verify_samples=[...] to use your own canary inputs (recommended for
local-path sources where the language can't be inferred from a HuggingFace
id). The default sample set covers every language in the
Helsinki-NLP/opustranslate collection.
Limitations
- 2 GiB ONNX size ceiling. ONNX uses protobuf for serialization, which
caps a single message at 2 GiB. Larger models error at export time with a
clear RuntimeError. v1 does not support the external_data workaround; the
path forward for larger pairs is precision="int8" (in flight).
- int8 quantization is dynamic only in v1. Static (calibration-based)
quantization is deferred.
- The BeamSearch contrib op silently ignores three generation features
that MarianMTModel.generate() accepts: bad_words_ids, forced_eos_token_id,
and renormalize_logits. The internal verification step disables all three on
the HF side. If you compare the exported ONNX against
MarianMTModel.generate() outside this package, disable these features
explicitly or you'll see false-positive divergences.
- Default opset is 14 because BYOM 7.x ships ORT 1.16.3, which is the
ceiling for the BeamSearch contrib op signature this package targets.
Don't change opset= unless you know your BYOM version ships a newer ORT and
you've verified the T5EncoderSubgraph::Validate() check still accepts our
3-input encoder layout.
- MarianMT / OPUS only. No T5, BART, NLLB, or mBART support in v1.
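The 2 GiB ceiling comes from protobuf's serialized-message limit (a signed 32-bit byte count). A pre-flight check of the kind the exporter performs might look like this (a sketch; the package's actual message wording may differ):

```python
PROTOBUF_LIMIT = 2**31 - 1  # max serialized protobuf message size, in bytes

def check_onnx_size(num_bytes: int) -> None:
    """Raise, as the exporter does, when a model won't fit in one message."""
    if num_bytes > PROTOBUF_LIMIT:
        raise RuntimeError(
            f"ONNX model is {num_bytes} bytes, over the 2 GiB protobuf limit; "
            "v1 has no external_data support -- try precision='int8'."
        )

check_onnx_size(300 * 2**20)   # a ~300 MiB fp32 Marian export: fine
# check_onnx_size(3 * 2**30)   # a 3 GiB model would raise RuntimeError
```

Dynamic int8 roughly quarters fp32 weight size, which is why it is the documented path for pairs that would otherwise exceed the cap.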
Acknowledgements
- Microsoft, for the com.microsoft.BeamSearch contrib op pattern, which makes
single-graph beam search inside onnxruntime possible.
- The Helsinki-NLP group, for the OPUS-MT model family and the curated
opustranslate HuggingFace collection.
Project details
Download files
File details
Details for the file teradata_opus_translate-1.0.3.tar.gz.
File metadata
- Download URL: teradata_opus_translate-1.0.3.tar.gz
- Upload date:
- Size: 79.4 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | f7e281a9e02485516379dc5060b9948076943f32f04ce6327c3824ccd7d9cc73 |
| MD5 | b288c82f79044153159051a7bc2c0a3e |
| BLAKE2b-256 | 6e94ed9a536d03a4996d1eaeef85750b37f1fad17899daa07e5e03c88cb651ab |
File details
Details for the file teradata_opus_translate-1.0.3-py3-none-any.whl.
File metadata
- Download URL: teradata_opus_translate-1.0.3-py3-none-any.whl
- Upload date:
- Size: 84.2 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 812738dee6670eee5c66a9008f9d0680a38c8a813fde6727e563729e15b3f85f |
| MD5 | 7c1e499b81d511b05644930f0be26f68 |
| BLAKE2b-256 | 9487d065c93c695e23a7b9ed59dbbf517be382d983149abf4c92a542c93cde4f |