
The framework supports fine-tuning and few-shot learning of models for text classification and named entity recognition (NER).

Project description

OpenAutoNLU Pipeline


OpenAutoNLU is an open-source pipeline for training natural language understanding (NLU) models for text classification (multiclass) and named entity recognition (NER). It supports few-shot learning (SetFit, AncSetFit with optional anchor labels), classic fine-tuning, data quality diagnostics, out-of-distribution (OOD) detection, optional LLM-based augmentation and synthetic test generation, and ONNX export for deployment.

You provide train (and optionally test) data; high-level pipelines (TextClassificationTrainingPipeline, TokenClassificationTrainingPipeline) load it, run optional data-quality checks, then automatically choose the training method from the data: AncSetFit for very small datasets (2–5 samples per class), SetFit for medium size (6–80), and fine-tuning for larger data. You can override configs (batch size, OOD method, augmentation, etc.) and save models in ONNX format. A Streamlit app and Docker images (CPU/GPU) are included for interactive use.
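As a rough illustration of the data-driven selection rule described above (the function name and return values here are illustrative, not the package's API):

```python
def choose_training_method(samples_per_class: int) -> str:
    """Sketch of the automatic method selection:
    AncSetFit for very small data, SetFit for medium, fine-tuning otherwise."""
    if samples_per_class <= 5:
        return "AncSetFit"   # 2-5 samples per class
    if samples_per_class <= 80:
        return "SetFit"      # 6-80 samples per class
    return "finetuning"      # larger datasets
```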

Built by MWS AI and contributors (see pyproject.toml for authors). Aimed at practitioners and researchers who want a single, data-driven workflow for few-shot and full-size NLU training without manually picking methods or tuning low-level knobs.

Requires Python >=3.12, <3.13.

Usage examples are located in the examples folder.

Installation

To work with the repository in developer mode, install it as an editable package:

pip install -e .

This way you don't need to reinstall the package after code changes. To install all development dependencies, run one of the two available configurations (cpu or cuda):

uv sync --extra cpu
uv sync --extra cuda

Documentation

To build and view the documentation locally:

uv sync
cd docs && uv run make html
open build/html/index.html

Running with Docker

With GPU (recommended host: 16GB RAM, 8 CPU, A100 40GB, ~30GB disk):

docker-compose up -d

Without GPU (macOS or CPU-only):

docker build --build-arg EXTRA=cpu -t open-autonlu .
docker run -p 8501:8501 open-autonlu

Code examples with default parameters:

Training

from open_autonlu.auto_classes import (
    TextClassificationTrainingPipeline,
    TokenClassificationTrainingPipeline
)
from open_autonlu.methods.data_types import SaveFormat

# Text Classification training
pipeline = TextClassificationTrainingPipeline(
    train_path="train.csv",
    test_path="test.csv",
    config_overrides={"language": "en"}  # "en" or "ru"
)
result = pipeline.train()
pipeline.save("./model", SaveFormat.ONNX)

# NER training
pipeline = TokenClassificationTrainingPipeline(
    train_path="train.json",
    test_path="test.json",
    config_overrides={"language": "en"}  # "en" or "ru"
)
result = pipeline.train()
pipeline.save("./model", SaveFormat.ONNX)

Inference

from open_autonlu.auto_classes import (
    TextClassificationInferenceManager,
    TokenClassificationInferenceManager
)

# Text Classification inference
inferer = TextClassificationInferenceManager("./model")
results = inferer.predict(["Hello world", "Goodbye"], batch_size=32)
for r in results:
    print(f"{r.most_probable.label}: {r.most_probable.score:.3f}")

# NER inference
ner_inferer = TokenClassificationInferenceManager("./ner_model")
results = ner_inferer.predict(["John works at Google"], batch_size=1)
for r in results:
    for entity in r.labels:
        print(f"{entity.text}: {entity.label}")

Data Quality Diagnostics

The diagnose() method evaluates training data quality using multiple evaluators. To run the data-quality stage on its own:

from open_autonlu.auto_classes import TextClassificationTrainingPipeline

pipeline = TextClassificationTrainingPipeline(train_path="train.csv")
evaluation_result = pipeline.diagnose()

Configuration Overrides

The config_overrides parameter allows you to customize training behavior by modifying the default configurations.

Basic Usage

from open_autonlu.auto_classes import TextClassificationTrainingPipeline
from open_autonlu.methods.data_types import OodMethod, SaveFormat

pipeline = TextClassificationTrainingPipeline(
    train_path="train.csv",
    config_overrides={
        "language": "en",                # Prompt language for LLM pipelines ("en" or "ru")
        "ood_method": OodMethod.LOGIT,   # OOD detection method
        "batch_size": 32,                # Batch size
    }
)
result = pipeline.train()
pipeline.save("./model", SaveFormat.ONNX)

OOD Detection Methods

Out-of-Distribution detection identifies inputs that don't belong to any trained class.

Method | Description | Best for
OodMethod.AUTO | Auto-select based on training method | Default
OodMethod.MARGINAL_MAHALANOBIS_OOD | Mahalanobis distance from the embedding distribution | Fine-tuning
OodMethod.MSP_OOD | Maximum softmax probability threshold | SetFit, AncSetFit
OodMethod.LOGIT | Adds an outOfScope class during training | Alternative approach
OodMethod.NONE | Disables OOD detection | When not needed

The threshold_factor parameter controls OOD detection sensitivity. It is a multiplier applied to the OOD detection threshold. Higher values make detection more conservative (fewer samples are marked as OOD), while lower values make it more aggressive (more samples are flagged as OOD). Default value is 1.0.

from open_autonlu.methods.data_types import OodMethod

# Override ood_method and adjust sensitivity
config_overrides = {
    "ood_method": OodMethod.MARGINAL_MAHALANOBIS_OOD,
    "threshold_factor": 1.5,  # More conservative OOD detection
}

LLM Data Augmentation

Automatically augment underrepresented classes using LLM generation. The language parameter controls which prompts are sent to the LLM ("en" for English, "ru" for Russian).

import os
from open_autonlu.auto_classes import TextClassificationTrainingPipeline

pipeline = TextClassificationTrainingPipeline(
    train_path="train.csv",
    config_overrides={
        "language": "en",
        "llm_augmentation": {
            "enabled": True,
            "use_domain_analysis": True,  # Analyze domain for better prompts
            "threshold": 81,               # Augment classes with < 81 samples
            "max_attempts": 10,            # Max generation attempts
            "num_shot": 5,                 # Examples in prompt
            "config_overrides": {
                "LlmClientConfig": {
                    "api_key": os.environ["MODEL_API_KEY"],
                    "model_id": "gpt-4",
                }
            }
        }
    }
)

Synthetic Test Generation

Generate synthetic test data using LLM when no test set is provided.

import os
from open_autonlu.auto_classes import TextClassificationTrainingPipeline

pipeline = TextClassificationTrainingPipeline(
    train_path="train.csv",  # No test_path provided
    config_overrides={
        "language": "en",  
        "llm_test_generation": {
            "enabled": True,
            "num_samples_per_class": 100,
            "use_domain_analysis": True,
            "synthetic_test_path": "./synthetic_test.csv",  # Save generated data
            "config_overrides": {
                "LlmClientConfig": {
                    "api_key": os.environ["MODEL_API_KEY"],
                    "model_id": "gpt-4",
                }
            }
        }
    }
)
result = pipeline.train()  # Test data generated automatically

Method-Specific Overrides

# SetFit configuration
config_overrides = {
    "SetFitMethodConfig": {
        "num_iterations": 25,
        "body_lr": 2e-5,
        "batch_size": 16,
    }
}

# Finetuner configuration
config_overrides = {
    "FinetunerConfig": {
        "num_hpo_trials": 15,  # Hyperparameter optimization trials
    }
}

Data Formats

Text Classification (CSV)

text,label,anc_label
"Remove my meeting tomorrow",calendar_remove,remove calendar event
"Add a dentist appointment on Friday",calendar_set,add calendar event

The anc_label column is optional: a human-readable natural language description of what each class means, used as the anchor label by the AncSetFit method.
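A minimal stdlib sketch for producing a training CSV in this format (the file name is just an example):

```python
import csv

rows = [
    ("Remove my meeting tomorrow", "calendar_remove", "remove calendar event"),
    ("Add a dentist appointment on Friday", "calendar_set", "add calendar event"),
]

with open("train.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["text", "label", "anc_label"])  # header row
    writer.writerows(rows)
```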

NER (JSON)

The package supports two NER data formats:

Offsets format — entities are defined by character spans with start and end positions:

[
  {"text": "What time is it in Australia", "spans": [{"start": 19, "end": 28, "label": "place_name"}]},
  {"text": "What is the forecast today for Moscow", "spans": [{"start": 21, "end": 26, "label": "date"}, {"start": 31, "end": 37, "label": "place_name"}]}
]

Brackets format — entities are marked inline using [label : entity] notation:

[
  {"text": "play a track by [artist : the rolling stones]"},
  {"text": "play [song : hello] by [artist : adele]"}
]
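The two formats carry the same information; here is a small sketch converting an offsets-format record to the brackets notation (the helper name is hypothetical, and it assumes non-overlapping spans):

```python
def offsets_to_brackets(item: dict) -> dict:
    """Rewrite character-span entities as inline [label : entity] markup."""
    text = item["text"]
    parts, prev = [], 0
    for span in sorted(item.get("spans", []), key=lambda s: s["start"]):
        parts.append(text[prev:span["start"]])           # text before the entity
        parts.append(f"[{span['label']} : {text[span['start']:span['end']]}]")
        prev = span["end"]
    parts.append(text[prev:])                            # trailing text
    return {"text": "".join(parts)}
```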

Example data

The files in examples/test_data/noise_n_shot_data/ (text classification) and examples/test_data/noise_n_shot_data_ner/ (NER) were made with external sampling scripts.

  • Text classification: the scripts use the SNIPS dataset (intent/slot-style). They build train/test splits with optional n-shot sampling and label noise. In the included example, 1% of training labels were noised (randomly flipped to another class). The resulting CSVs follow the formats described above.
  • NER: the scripts use the MASSIVE dataset. They produce few-shot train/test subsets with optional label noise (1% of labels noised) and export data in the offsets/BIO-style JSON expected by the NER pipeline.
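For reference, the 1% label-noise step can be sketched as follows (the function and parameter names are illustrative, not the actual sampling scripts):

```python
import random

def add_label_noise(labels, classes, noise_rate=0.01, seed=0):
    """Flip a noise_rate fraction of labels to a uniformly chosen different class."""
    rng = random.Random(seed)
    noisy = list(labels)
    n_flip = max(1, int(len(noisy) * noise_rate))
    for i in rng.sample(range(len(noisy)), n_flip):
        noisy[i] = rng.choice([c for c in classes if c != noisy[i]])
    return noisy
```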



Download files

Download the file for your platform.

Source Distribution

open_autonlu-1.0.0.tar.gz (361.4 kB)

Uploaded Source

Built Distribution


open_autonlu-1.0.0-py3-none-any.whl (136.1 kB)

Uploaded Python 3

File details

Details for the file open_autonlu-1.0.0.tar.gz.

File metadata

  • Download URL: open_autonlu-1.0.0.tar.gz
  • Upload date:
  • Size: 361.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for open_autonlu-1.0.0.tar.gz
Algorithm Hash digest
SHA256 0a9d2a4c5c48fe465f41d4d93e9df99d3ac26dc5cf02ccbb3ebbd8da8ee62f4d
MD5 1c7988e163610ec9464bcd31ea368ab0
BLAKE2b-256 eafce64fe283e50f17928bc99a33e488bc7e678a25bce655a43d2c1aec3b7bc5
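To verify a downloaded archive against the published SHA256 digest, a minimal stdlib sketch:

```python
import hashlib

def sha256_of(path: str) -> str:
    """Stream the file in chunks and return its hex SHA256 digest."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

# Compare sha256_of("open_autonlu-1.0.0.tar.gz") with the digest above before use.
```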


Provenance

The following attestation bundles were made for open_autonlu-1.0.0.tar.gz:

Publisher: publish.yml on mts-ai/OpenAutoNLU

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file open_autonlu-1.0.0-py3-none-any.whl.

File metadata

  • Download URL: open_autonlu-1.0.0-py3-none-any.whl
  • Upload date:
  • Size: 136.1 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for open_autonlu-1.0.0-py3-none-any.whl
Algorithm Hash digest
SHA256 64a4a60ad3493812783b9ffac1056e406a6d2393e48b464720eb92e580352c17
MD5 b67d89e2e95997ba69375257de271fba
BLAKE2b-256 61b0d0e6a279425e8491f199f6db3e9d3a73b1fe88d17232f82cb932b4fc4a39


Provenance

The following attestation bundles were made for open_autonlu-1.0.0-py3-none-any.whl:

Publisher: publish.yml on mts-ai/OpenAutoNLU

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
