
teradata-opus-translate

Convert Helsinki-NLP OPUS (Marian) translation models into self-contained ONNX files for deployment to Teradata Vantage via Bring Your Own Model (BYOM).


What it does

teradata-opus-translate turns a HuggingFace OPUS / Marian translation model into a single self-contained ONNX file with com.microsoft.BeamSearch embedded in the graph, plus a single-file fast tokenizer.json. The resulting artifacts load directly into Teradata BYOM tables and are scored at SQL time through the TD_MLDB.ONNXSeq2Seq table operator.

The package handles the parts of the conversion that aren't obvious from the HuggingFace optimum / torch.onnx.export defaults: stripping generation features the BYOM beam-search op silently ignores, wiring the SQL-tunable generation parameters as graph inputs (so customers override them per query rather than baking them in), keeping the ONNX opset at the ceiling imposed by BYOM's ORT 1.16.3, and verifying token-level parity against MarianMTModel.generate() before the file is written.

The intended workflow is: convert once, upload the artifacts to BYOM with teradataml.save_byom, then translate at scale through the database with no Python hop. The callable surface is deliberately small, just two functions (convert_model, convert_tokenizer): everything tunable at scoring time stays out of the export step.

Install

pip install teradata-opus-translate

Requires Python 3.12+. Key dependencies are pulled in automatically: transformers, torch, onnx, onnxruntime, tokenizers, sentencepiece, sacremoses, numpy.

To upload converted models to Teradata BYOM using the teradataml example below, install teradataml separately:

pip install teradataml

Quickstart

from teradata_opus_translate import convert_model, convert_tokenizer

convert_model(
    "Helsinki-NLP/opus-mt-de-en",
    output_path="model.onnx",
)
convert_tokenizer(
    "Helsinki-NLP/opus-mt-de-en",
    output_path="tokenizer.json",
)

This downloads the model from the HuggingFace Hub, exports it to ONNX with com.microsoft.BeamSearch embedded, runs token-parity verification against MarianMTModel.generate(), and writes both files to your working directory. The two files together are everything BYOM needs.
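
If you want a quick local sanity check before uploading, the two files can be inspected directly with onnxruntime and tokenizers. A minimal sketch (it only lists the graph's input/output names rather than assuming them, since those depend on the export):

import onnxruntime as ort
from tokenizers import Tokenizer

# List the graph's inputs and outputs; 'sequences' should appear among the outputs.
sess = ort.InferenceSession("model.onnx")
print([i.name for i in sess.get_inputs()])
print([o.name for o in sess.get_outputs()])

# Confirm the single-file tokenizer loads and encodes with no external files.
tok = Tokenizer.from_file("tokenizer.json")
print(tok.encode("Das Wetter ist heute schön.").ids)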

API reference

convert_model

def convert_model(
    source: str | os.PathLike[str],
    *,
    precision: Literal["fp32", "int8"] = "fp32",
    output_path: str | os.PathLike[str],
    opset: int = 14,
    ir_version: int = 8,
    verify: bool = True,
    verify_samples: list[str] | None = None,
    no_repeat_ngram_size: int | None = None,
    early_stopping: bool | None = None,
    cache_dir: str | os.PathLike[str] | None = None,
    verbose: bool = False,
    log_level: str | int | None = None,
) -> ConvertModelResult

  • source (str | PathLike, required): HuggingFace repo id (e.g. "Helsinki-NLP/opus-mt-de-en") or a local directory containing a downloaded HF repo. An existing directory is auto-detected as local; everything else is treated as an HF id.
  • precision ("fp32" | "int8", default "fp32"): Precision mode. v1 ships dynamic int8 only; static int8 is deferred.
  • output_path (str | PathLike, required): Destination .onnx path. Parent directories are created if missing; existing files are overwritten.
  • opset (int, default 14): ONNX opset for the encoder/decoder subgraphs. 14 matches BYOM 7.x's ORT 1.16.3 ceiling. Increase only if your BYOM version is newer and you've verified compatibility.
  • ir_version (int, default 8): ONNX IR version stamped on the produced graph. 8 matches BYOM 7.x's bundled ORT (1.16.3 lineage), which rejects IR ≥ 9 with "Unsupported model IR version: 9, max supported IR version: 8". Set to 9+ only if your downstream ORT supports newer IRs. See docs/decisions.md Decision 12.
  • verify (bool, default True): Run full token-parity verification against MarianMTModel.generate() after export. Always runs when set, regardless of model size. Failure raises AssertionError naming the first divergent sample.
  • verify_samples (list[str] | None, default None): Source-language strings used for verification. None picks a sensible default set per inferred source language (covers every language in the curated Helsinki-NLP/opustranslate collection). Pass an explicit list for custom pairs or local paths.
  • no_repeat_ngram_size (int | None, default None): Baked into the BeamSearch graph as a node attribute at export time. Defaults to model.config.no_repeat_ngram_size. Cannot be overridden at SQL-scoring time.
  • early_stopping (bool | None, default None): Baked into the BeamSearch graph as a node attribute at export time. Defaults to model.config.early_stopping. Cannot be overridden at SQL-scoring time.
  • cache_dir (str | PathLike | None, default None): Optional HuggingFace cache directory, passed through to from_pretrained.
  • verbose (bool, default False): If true, configure the package logger at INFO.
  • log_level (str | int | None, default None): Explicit logging level; takes precedence over verbose.

Returns a ConvertModelResult.

Raises: NotImplementedError (precision="int8" until that path lands), ValueError (unknown precision), RuntimeError (output ONNX > 2 GiB; v1 does not support external_data), AssertionError (parity divergence when verify=True).

Why is ir_version default 8? BYOM 7.x ships an ONNX runtime (1.16.3 lineage) that caps at IR v8 and rejects IR ≥ 9 at model-load time with Unsupported model IR version: 9, max supported IR version: 8. The upstream torch.onnx.export default is now 9, so without this pin any artifact this package produces would fail to load on BYOM 7.x. Pinning to 8 keeps the output drop-in compatible with the BYOM versions Teradata ships today; opset stays at 14 either way.
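
To confirm what a produced file is actually stamped with (for example after overriding opset or ir_version), a quick check with the onnx package, which is already a dependency:

import onnx

m = onnx.load("model.onnx")
print(m.ir_version)                                      # 8 with the defaults
print([(o.domain, o.version) for o in m.opset_import])   # opset 14 for the default domain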

Note — SQL-tunable parameters are not export arguments. The five generation parameters customers most often want to tune (num_beams, max_length, min_length, length_penalty, repetition_penalty) are deliberately not exposed by convert_model. They remain as inputs of the produced ONNX graph and are overridden per query at SQL time using the BYOM Const_* USING clause:

SELECT * FROM TD_MLDB.ONNXSeq2Seq (
  ON inputs PARTITION BY ANY
  ON onnx_models AS ModelTable DIMENSION
  ON sequence_tokenizers AS TokenizerTable DIMENSION
  USING
    ModelOutputFields('sequences')
    Const_num_beams(4)
    Const_max_length(64)
    Const_min_length(1)
    Const_length_penalty(1.0)
    Const_repetition_penalty(1.0)
) AS dt;

This keeps a single exported artifact tunable across many SQL workloads without re-export. no_repeat_ngram_size and early_stopping are baked in at export because the BeamSearch contrib op only accepts them as node attributes. num_return_sequences is also fixed at export (locked to 1 via a Constant node in the produced graph): each input always returns exactly one translation, and Const_num_return_sequences(N) on the BYOM USING clause has no effect. Rationale and history in docs/decisions.md Decision 10 and the v1.0.1 CHANGELOG entry.
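
A hedged way to see what ended up baked in versus tunable is to inspect the BeamSearch node of the produced graph; the attribute names follow the com.microsoft.BeamSearch contrib-op spec:

import onnx

m = onnx.load("model.onnx")
beam = next(n for n in m.graph.node if n.op_type == "BeamSearch")

# The export-time parameters live as node attributes (the encoder/decoder
# subgraphs are attributes too, so filter to the scalar ones of interest).
print({a.name: onnx.helper.get_attribute_value(a)
       for a in beam.attribute
       if a.name in ("no_repeat_ngram_size", "early_stopping")})

# Graph-level inputs are what the Const_* USING clause overrides at SQL time.
print([i.name for i in m.graph.input])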

convert_tokenizer

def convert_tokenizer(
    source: str | os.PathLike[str],
    *,
    output_path: str | os.PathLike[str],
    cache_dir: str | os.PathLike[str] | None = None,
    verbose: bool = False,
) -> ConvertTokenizerResult

  • source (str | PathLike, required): HF repo id or local directory; same auto-detection rule as convert_model.
  • output_path (str | PathLike, required): Destination tokenizer.json path. Parent directories are created if missing; existing files are overwritten.
  • cache_dir (str | PathLike | None, default None): Optional HuggingFace cache directory, passed through to from_pretrained.
  • verbose (bool, default False): If true, configure the package logger at INFO.

Returns a ConvertTokenizerResult.

Raises RuntimeError if the in-process round-trip self-check fails (the writer refuses to produce a tokenizer.json that disagrees with MarianTokenizer on a canary sentence).

The output is a single-file tokenizer.json that loads via tokenizers.Tokenizer.from_file(...) with no external dependencies. Marian's MarianTokenizer is a slow Python tokenizer wrapping source.spm / target.spm / vocab.json; this function rebuilds an equivalent fast tokenizer (Unigram + Metaspace + EOS template) directly from the source-side SentencePiece scores.
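
A minimal spot check of the written file against the slow tokenizer, mirroring the round-trip self-check the writer runs internally (the sentence is just an example):

from tokenizers import Tokenizer
from transformers import MarianTokenizer

fast = Tokenizer.from_file("tokenizer.json")
slow = MarianTokenizer.from_pretrained("Helsinki-NLP/opus-mt-de-en")

text = "Das Wetter ist heute schön."
# Both should produce the same ids, including the trailing EOS added by the template.
assert fast.encode(text).ids == slow(text).input_ids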

Result types

All result types are frozen dataclasses.

ConvertModelResult — returned by convert_model.

  • output_path (Path): Resolved absolute path of the written .onnx file.
  • size_bytes (int): Size of the written ONNX file, in bytes.
  • source (str): Original source argument as resolved for from_pretrained.
  • source_kind ("hf" | "local"): "local" if source was a local directory, otherwise "hf".
  • precision ("fp32" | "int8"): Precision mode used for the export.
  • parity (ParityResult | None): Token-parity result; None when verify=False.

ConvertTokenizerResult — returned by convert_tokenizer.

  • output_path (Path): Resolved absolute path of the written tokenizer.json.
  • size_bytes (int): Size of the written file, in bytes.
  • source (str): Original source argument as resolved for from_pretrained.
  • source_kind ("hf" | "local"): "local" or "hf".

ParityResult — produced by the verification step inside convert_model.

  • samples (list[str]): Source-language strings used for verification.
  • hf_token_ids (list[list[int]]): Per-sample token-id sequence from MarianMTModel.generate() (after canonicalising trailing EOS / pad).
  • onnx_token_ids (list[list[int]]): Per-sample token-id sequence from the exported ONNX BeamSearch graph (same canonicalisation).
  • mismatches (int): Count of samples whose two id-sequences differ. 0 means full token-parity.
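
For example, a conversion script might log the result like this (a hedged sketch using only the documented fields):

from teradata_opus_translate import convert_model

result = convert_model("Helsinki-NLP/opus-mt-de-en", output_path="model.onnx")

print(result.output_path, result.size_bytes, result.source_kind, result.precision)
if result.parity is not None:          # None only when verify=False
    assert result.parity.mismatches == 0
    print(f"token parity verified on {len(result.parity.samples)} samples")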

End-to-end BYOM example

For a runnable walkthrough of the flow below, see the demo notebook.

The full customer workflow is convert → upload → score.

1. Convert the model and tokenizer locally

from teradata_opus_translate import convert_model, convert_tokenizer

convert_model(
    "Helsinki-NLP/opus-mt-de-en",
    output_path="opus-mt-de-en.onnx",
)
convert_tokenizer(
    "Helsinki-NLP/opus-mt-de-en",
    output_path="opus-mt-de-en.tokenizer.json",
)

2. Upload to Teradata BYOM with save_byom

from teradataml import create_context, save_byom

create_context(host="...", username="...", password="...")

# Model -> onnx_models(model_id, model)
save_byom(
    model_id="opus-mt-de-en",
    model_file="opus-mt-de-en.onnx",
    table_name="onnx_models",
    schema_name="OPUS_BYOM",
)

# Tokenizer goes into a separate table. save_byom always writes a column
# called "model" regardless of what the artifact actually is, so we save the
# tokenizer.json into a (model_id, model)-shaped table and alias the column
# back to "tokenizer" on SELECT in the SQL below.
save_byom(
    model_id="opus-mt-de-en",
    model_file="opus-mt-de-en.tokenizer.json",
    table_name="sequence_tokenizers",
    schema_name="OPUS_BYOM",
)
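
To confirm both rows landed before moving on to SQL scoring, the standard teradataml BYOM helpers can be used (a hedged aside; exact availability and signatures depend on your teradataml version):

from teradataml import list_byom

# Each call lists the rows currently stored in the BYOM table.
list_byom(table_name="onnx_models", schema_name="OPUS_BYOM")
list_byom(table_name="sequence_tokenizers", schema_name="OPUS_BYOM")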

3. Score with TD_MLDB.ONNXSeq2Seq

CREATE MULTISET TABLE OPUS_BYOM.de_en_inputs (
    id INTEGER,
    input_text VARCHAR(2000) CHARACTER SET UNICODE
) PRIMARY INDEX (id);

INSERT INTO OPUS_BYOM.de_en_inputs VALUES (1, 'Hallo Welt.');
INSERT INTO OPUS_BYOM.de_en_inputs VALUES (2, 'Das Wetter ist heute schön.');

SELECT id, input_text, output_text
FROM TD_MLDB.ONNXSeq2Seq (
    ON OPUS_BYOM.de_en_inputs PARTITION BY ANY
    ON (SELECT model_id, model FROM OPUS_BYOM.onnx_models
        WHERE model_id = 'opus-mt-de-en') AS ModelTable DIMENSION
    ON (SELECT model_id, model AS tokenizer FROM OPUS_BYOM.sequence_tokenizers
        WHERE model_id = 'opus-mt-de-en') AS TokenizerTable DIMENSION
    USING
        Accumulate('id', 'input_text')
        TextColumn('input_text')
        ModelOutputFields('sequences')
        OutputFormat('FLOAT(1)')
        EnableMemoryCheck('false')
        OverwriteCachedModel('*')
        Const_num_beams(4)
        Const_max_length(64)
        Const_min_length(1)
        Const_length_penalty(1.0)
        Const_repetition_penalty(1.0)
) AS dt;

Adjust the Const_* cluster per workload — those are exactly the SQL-time overrides for the parameters convert_model deliberately keeps off the export API. Note that num_return_sequences is not in the cluster: each input always returns exactly one translation. The value is baked into the produced ONNX graph as a Constant node, and any Const_num_return_sequences(N) USING clause is silently ignored by BYOM.

Supported models

teradata-opus-translate v1 targets the MarianMT / OPUS family. It is tested against Helsinki-NLP/opus-mt-* models and the curated Helsinki-NLP/opustranslate collection. Other seq2seq architectures (T5, BART, NLLB, mBART) are not supported in v1.

The default verification samples cover every source language in the curated opustranslate collection plus every ISO 639-1 source code that appears in five or more Helsinki-NLP/opus-mt-* repos, so most production language pairs land on language-appropriate samples without you needing to pass verify_samples= explicitly.

Verification (verify=True)

When verify=True (the default), convert_model runs full token-parity verification immediately after the ONNX file is written:

  1. For each sample in verify_samples, run MarianMTModel.generate() with the BeamSearch-compatible parameter intersection (bad_words_ids=None, forced_eos_token_id=None, renormalize_logits=False, do_sample=False).
  2. Run the same sample through the exported ONNX graph in onnxruntime.
  3. Canonicalise the trailing-EOS / pad-to-max difference (HF emits a trailing EOS, the BeamSearch op omits it and pads with pad_token_id).
  4. Compare token-ID sequences exactly.

Any divergence raises AssertionError naming the first divergent sample, the HF and ONNX id sequences side by side, and the position of the first differing token. This catches export-time silent regressions before the artifact ever lands in BYOM.
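
To reproduce the HF side of this comparison outside the package (for example while debugging a reported divergence), mirror the same parameter intersection. A minimal sketch; the model id and sentence are examples, and the generation values should match whatever you feed the ONNX side:

from transformers import MarianMTModel, MarianTokenizer

model = MarianMTModel.from_pretrained("Helsinki-NLP/opus-mt-de-en")
tok = MarianTokenizer.from_pretrained("Helsinki-NLP/opus-mt-de-en")

batch = tok(["Das Wetter ist heute schön."], return_tensors="pt")
ids = model.generate(
    **batch,
    num_beams=4,          # match the values used on the ONNX side
    max_length=64,
    do_sample=False,
    # Features the BeamSearch contrib op ignores, disabled for a fair comparison:
    bad_words_ids=None,
    forced_eos_token_id=None,
    renormalize_logits=False,
)
print(ids[0].tolist())    # compare against the ONNX 'sequences' ids after EOS/pad canonicalisation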

Pass verify_samples=[...] to use your own canary inputs (recommended for local-path sources where the language can't be inferred from a HuggingFace id). The default sample set covers every language in the Helsinki-NLP/opustranslate collection.
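
For example, converting from a local snapshot with explicit canaries (the directory path and sentences are placeholders):

from teradata_opus_translate import convert_model

convert_model(
    "./opus-mt-de-en-snapshot",   # local HF directory, auto-detected as local
    output_path="model.onnx",
    verify_samples=[
        "Hallo Welt.",
        "Das Wetter ist heute schön.",
    ],
)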

Limitations

  • 2 GiB ONNX size ceiling. ONNX uses protobuf for serialization, which caps a single message at 2 GiB. Larger models error at export time with a clear RuntimeError. v1 does not support the external_data workaround; the path forward for larger pairs is precision="int8" (in flight).
  • int8 quantization is dynamic only in v1. Static (calibration-based) quantization is deferred.
  • The BeamSearch contrib op silently ignores three generation features that MarianMTModel.generate() accepts: bad_words_ids, forced_eos_token_id, and renormalize_logits. The internal verification step disables all three on the HF side. If you compare the exported ONNX against MarianMTModel.generate() outside this package, disable these features explicitly or you'll see false-positive divergences.
  • Default opset is 14 because BYOM 7.x ships ORT 1.16.3, which is the ceiling for the BeamSearch contrib op signature this package targets. Don't change opset= unless you know your BYOM version ships a newer ORT and you've verified the T5EncoderSubgraph::Validate() check still accepts our 3-input encoder layout.
  • MarianMT / OPUS only. No T5, BART, NLLB, or mBART support in v1.

Acknowledgements

  • Microsoft for the com.microsoft.BeamSearch contrib op pattern, which makes single-graph beam-search inside onnxruntime possible.
  • The Helsinki-NLP group for the OPUS-MT model family and the curated opustranslate HuggingFace collection.

Download files

Source Distribution

teradata_opus_translate-1.0.5.tar.gz (80.3 kB)

Built Distribution

teradata_opus_translate-1.0.5-py3-none-any.whl (85.0 kB)

File details

Details for the file teradata_opus_translate-1.0.5.tar.gz.

File metadata

  • Download URL: teradata_opus_translate-1.0.5.tar.gz
  • Size: 80.3 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.13

File hashes

Hashes for teradata_opus_translate-1.0.5.tar.gz
Algorithm Hash digest
SHA256 de7c4e534f53b7f60a1caa2d381c7a42357db823855cf6058dbc3bf82a821a94
MD5 c564b2b4aa0d1c39aeec099cee75c0c3
BLAKE2b-256 75cfc99ff924e5e26d1aa26f414bf9ecb3b9c357eb1093acb07b26f8b41e625b


File details

Details for the file teradata_opus_translate-1.0.5-py3-none-any.whl.

File hashes

Hashes for teradata_opus_translate-1.0.5-py3-none-any.whl
Algorithm Hash digest
SHA256 58f029905eb501cc6c0e74858a23379ac3858cd145ac26a72d97706f278f07b4
MD5 26b7643f7de1cfbe29fd7c824029ae80
BLAKE2b-256 fe7e7d1fdf802a8c09edcf3ad10802bb2e30ac30f94a48ea6f9b2fd2b1dc987d
