
The framework supports fine-tuning and few-shot learning of models for text classification and named entity recognition (NER).

Project description

OpenAutoNLU Pipeline


OpenAutoNLU is an open-source pipeline for training natural language understanding (NLU) models for text classification (multiclass) and named entity recognition (NER). It supports few-shot learning (SetFit, AncSetFit with optional anchor labels), classic fine-tuning, data quality diagnostics, out-of-distribution (OOD) detection, optional LLM-based augmentation and synthetic test generation, and ONNX export for deployment.

You provide train (and optionally test) data; high-level pipelines (TextClassificationTrainingPipeline, TokenClassificationTrainingPipeline) load it, run optional data-quality checks, then automatically choose the training method from the data: AncSetFit for very small datasets (2–5 samples per class), SetFit for medium size (6–80), and fine-tuning for larger data. You can override configs (batch size, OOD method, augmentation, etc.) and save models in ONNX format. A Streamlit app and Docker images (CPU/GPU) are included for interactive use.
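As a rough illustration of the data-driven selection rule described above (the function name and return values here are illustrative, not the package's API):

```python
def choose_training_method(samples_per_class: int) -> str:
    """Sketch of the automatic method selection:
    AncSetFit for very small data, SetFit for medium, fine-tuning otherwise."""
    if samples_per_class <= 5:
        return "AncSetFit"   # 2-5 samples per class
    if samples_per_class <= 80:
        return "SetFit"      # 6-80 samples per class
    return "finetuning"      # larger datasets
```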

Built by MWS AI and contributors (see pyproject.toml for authors). Aimed at practitioners and researchers who want a single, data-driven workflow for few-shot and full-size NLU training without manually picking methods or tuning low-level knobs.

Requires Python >=3.12, <3.13.

Usage examples are located in the examples folder.

Installation

To work with the repository in developer mode, install it as an editable package:

pip install -e .

This way you don't need to reinstall the package after code changes. To install all development dependencies, run one of the two available configurations (cpu or cuda):

uv sync --extra cpu
uv sync --extra cuda

Documentation

To build and view the documentation locally:

uv sync
cd docs && uv run make html
open build/html/index.html

Running with Docker

With GPU (recommended host: 16GB RAM, 8 CPU, A100 40GB, ~30GB disk):

docker-compose up -d

Without GPU (macOS or CPU-only):

docker build --build-arg EXTRA=cpu -t open-autonlu .
docker run -p 8501:8501 open-autonlu

Code examples with default parameters:

Training

from open_autonlu.auto_classes import (
    TextClassificationTrainingPipeline,
    TokenClassificationTrainingPipeline
)
from open_autonlu.methods.data_types import SaveFormat

# Text Classification training
pipeline = TextClassificationTrainingPipeline(
    train_path="train.csv",
    test_path="test.csv",
    config_overrides={"language": "en"}  # "en" or "ru"
)
result = pipeline.train()
pipeline.save("./model", SaveFormat.ONNX)

# NER training
pipeline = TokenClassificationTrainingPipeline(
    train_path="train.json",
    test_path="test.json",
    config_overrides={"language": "en"}  # "en" or "ru"
)
result = pipeline.train()
pipeline.save("./model", SaveFormat.ONNX)

Inference

from open_autonlu.auto_classes import (
    TextClassificationInferenceManager,
    TokenClassificationInferenceManager
)

# Text Classification inference
inferer = TextClassificationInferenceManager("./model")
results = inferer.predict(["Hello world", "Goodbye"], batch_size=32)
for r in results:
    print(f"{r.most_probable.label}: {r.most_probable.score:.3f}")

# NER inference
ner_inferer = TokenClassificationInferenceManager("./ner_model")
results = ner_inferer.predict(["John works at Google"], batch_size=1)
for r in results:
    for entity in r.labels:
        print(f"{entity.text}: {entity.label}")

Data Quality Diagnostics

The diagnose() method evaluates training data quality using multiple evaluators. To run the data-quality stage on its own:

from open_autonlu.auto_classes import TextClassificationTrainingPipeline

pipeline = TextClassificationTrainingPipeline(train_path="train.csv")
evaluation_result = pipeline.diagnose()

Configuration Overrides

The config_overrides parameter allows you to customize training behavior by modifying the default configurations.

Basic Usage

from open_autonlu.auto_classes import TextClassificationTrainingPipeline
from open_autonlu.methods.data_types import OodMethod, SaveFormat

pipeline = TextClassificationTrainingPipeline(
    train_path="train.csv",
    config_overrides={
        "language": "en",                # Prompt language for LLM pipelines ("en" or "ru")
        "ood_method": OodMethod.LOGIT,   # OOD detection method
        "batch_size": 32,                # Batch size
    }
)
result = pipeline.train()
pipeline.save("./model", SaveFormat.ONNX)

OOD Detection Methods

Out-of-Distribution detection identifies inputs that don't belong to any trained class.

Method | Description | Best for
OodMethod.AUTO | Auto-select based on training method | Default
OodMethod.MARGINAL_MAHALANOBIS_OOD | Mahalanobis distance from the embedding distribution | Fine-tuning
OodMethod.MSP_OOD | Maximum softmax probability threshold | SetFit, AncSetFit
OodMethod.LOGIT | Adds an outOfScope class during training | Alternative approach
OodMethod.NONE | Disables OOD detection | When not needed

The threshold_factor parameter controls OOD detection sensitivity. It is a multiplier applied to the OOD detection threshold. Higher values make detection more conservative (fewer samples are marked as OOD), while lower values make it more aggressive (more samples are flagged as OOD). Default value is 1.0.

from open_autonlu.methods.data_types import OodMethod

# Override ood_method and adjust sensitivity
config_overrides = {
    "ood_method": OodMethod.MARGINAL_MAHALANOBIS_OOD,
    "threshold_factor": 1.5,  # More conservative OOD detection
}

LLM Data Augmentation

Automatically augment underrepresented classes using LLM generation. The language parameter controls which prompts are sent to the LLM ("en" for English, "ru" for Russian).

import os
from open_autonlu.auto_classes import TextClassificationTrainingPipeline

pipeline = TextClassificationTrainingPipeline(
    train_path="train.csv",
    config_overrides={
        "language": "en",
        "llm_augmentation": {
            "enabled": True,
            "use_domain_analysis": True,  # Analyze domain for better prompts
            "threshold": 81,               # Augment classes with < 81 samples
            "max_attempts": 10,            # Max generation attempts
            "num_shot": 5,                 # Examples in prompt
            "config_overrides": {
                "LlmClientConfig": {
                    "api_key": os.environ["MODEL_API_KEY"],
                    "model_id": "gpt-4",
                }
            }
        }
    }
)

Synthetic Test Generation

Generate synthetic test data using LLM when no test set is provided.

import os
from open_autonlu.auto_classes import TextClassificationTrainingPipeline

pipeline = TextClassificationTrainingPipeline(
    train_path="train.csv",  # No test_path provided
    config_overrides={
        "language": "en",  
        "llm_test_generation": {
            "enabled": True,
            "num_samples_per_class": 100,
            "use_domain_analysis": True,
            "synthetic_test_path": "./synthetic_test.csv",  # Save generated data
            "config_overrides": {
                "LlmClientConfig": {
                    "api_key": os.environ["MODEL_API_KEY"],
                    "model_id": "gpt-4",
                }
            }
        }
    }
)
result = pipeline.train()  # Test data generated automatically

Method-Specific Overrides

# SetFit configuration
config_overrides = {
    "SetFitMethodConfig": {
        "num_iterations": 25,
        "body_lr": 2e-5,
        "batch_size": 16,
    }
}

# Finetuner configuration
config_overrides = {
    "FinetunerConfig": {
        "num_hpo_trials": 15,  # Hyperparameter optimization trials
    }
}

Data Formats

Text Classification (CSV)

text,label,anc_label
"Remove my meeting tomorrow",calendar_remove,remove calendar event
"Add a dentist appointment on Friday",calendar_set,add calendar event

The anc_label column is optional: a human-readable natural language description of what each class means, used as the anchor label by the AncSetFit method.
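A minimal stdlib sketch for producing a training CSV in this format (the file name is just an example):

```python
import csv

rows = [
    ("Remove my meeting tomorrow", "calendar_remove", "remove calendar event"),
    ("Add a dentist appointment on Friday", "calendar_set", "add calendar event"),
]

with open("train.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["text", "label", "anc_label"])  # header row
    writer.writerows(rows)
```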

NER (JSON)

The package supports two NER data formats:

Offsets format — entities are defined by character spans with start and end positions:

[
  {"text": "What time is it in Australia", "spans": [{"start": 19, "end": 28, "label": "place_name"}]},
  {"text": "What is the forecast today for Moscow", "spans": [{"start": 21, "end": 26, "label": "date"}, {"start": 31, "end": 37, "label": "place_name"}]}
]

Brackets format — entities are marked inline using [label : entity] notation:

[
  {"text": "play a track by [artist : the rolling stones]"},
  {"text": "play [song : hello] by [artist : adele]"}
]
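The two formats carry the same information; here is a small sketch converting an offsets-format record to the brackets notation (the helper name is hypothetical, and it assumes non-overlapping spans):

```python
def offsets_to_brackets(item: dict) -> dict:
    """Rewrite character-span entities as inline [label : entity] markup."""
    text = item["text"]
    parts, prev = [], 0
    for span in sorted(item.get("spans", []), key=lambda s: s["start"]):
        parts.append(text[prev:span["start"]])           # text before the entity
        parts.append(f"[{span['label']} : {text[span['start']:span['end']]}]")
        prev = span["end"]
    parts.append(text[prev:])                            # trailing text
    return {"text": "".join(parts)}
```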

Example data

The files in examples/test_data/noise_n_shot_data/ (text classification) and examples/test_data/noise_n_shot_data_ner/ (NER) were made with external sampling scripts.

  • Text classification: the scripts use the SNIPS dataset (intent/slot-style). They build train/test splits with optional n-shot sampling and label noise. In the included example, 1% of training labels were noised (randomly flipped to another class). The resulting CSVs follow the formats described above.
  • NER: the scripts use the MASSIVE dataset. They produce few-shot train/test subsets with optional label noise (1% of labels noised) and export data in the offsets/BIO-style JSON expected by the NER pipeline.
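For reference, the 1% label-noise step can be sketched as follows (the function and parameter names are illustrative, not the actual sampling scripts):

```python
import random

def add_label_noise(labels, classes, noise_rate=0.01, seed=0):
    """Flip a noise_rate fraction of labels to a uniformly chosen different class."""
    rng = random.Random(seed)
    noisy = list(labels)
    n_flip = max(1, int(len(noisy) * noise_rate))
    for i in rng.sample(range(len(noisy)), n_flip):
        noisy[i] = rng.choice([c for c in classes if c != noisy[i]])
    return noisy
```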



Download files

Download the file for your platform.

Source Distribution

open_autonlu-1.0.0.tar.gz (361.4 kB)

Uploaded Source

Built Distribution


open_autonlu-1.0.0-py3-none-any.whl (136.1 kB)

Uploaded Python 3

File details

Details for the file open_autonlu-1.0.0.tar.gz.

File metadata

  • Download URL: open_autonlu-1.0.0.tar.gz
  • Upload date:
  • Size: 361.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for open_autonlu-1.0.0.tar.gz
Algorithm Hash digest
SHA256 0a9d2a4c5c48fe465f41d4d93e9df99d3ac26dc5cf02ccbb3ebbd8da8ee62f4d
MD5 1c7988e163610ec9464bcd31ea368ab0
BLAKE2b-256 eafce64fe283e50f17928bc99a33e488bc7e678a25bce655a43d2c1aec3b7bc5
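To verify a downloaded archive against the published SHA256 digest, a minimal stdlib sketch:

```python
import hashlib

def sha256_of(path: str) -> str:
    """Stream the file in chunks and return its hex SHA256 digest."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

# Compare sha256_of("open_autonlu-1.0.0.tar.gz") with the digest above before use.
```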


Provenance

The following attestation bundles were made for open_autonlu-1.0.0.tar.gz:

Publisher: publish.yml on mts-ai/OpenAutoNLU

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file open_autonlu-1.0.0-py3-none-any.whl.

File metadata

  • Download URL: open_autonlu-1.0.0-py3-none-any.whl
  • Upload date:
  • Size: 136.1 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for open_autonlu-1.0.0-py3-none-any.whl
Algorithm Hash digest
SHA256 64a4a60ad3493812783b9ffac1056e406a6d2393e48b464720eb92e580352c17
MD5 b67d89e2e95997ba69375257de271fba
BLAKE2b-256 61b0d0e6a279425e8491f199f6db3e9d3a73b1fe88d17232f82cb932b4fc4a39


Provenance

The following attestation bundles were made for open_autonlu-1.0.0-py3-none-any.whl:

Publisher: publish.yml on mts-ai/OpenAutoNLU

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
