The framework is useful for fine-tuning and few-shot training of models for text classification and NER.
Project description
OpenAutoNLU Pipeline
OpenAutoNLU is an open-source pipeline for training natural language understanding (NLU) models for text classification (multiclass) and named entity recognition (NER). It supports few-shot learning (SetFit, AncSetFit with optional anchor labels), classic fine-tuning, data quality diagnostics, out-of-distribution (OOD) detection, optional LLM-based augmentation and synthetic test generation, and ONNX export for deployment.
You provide train (and optionally test) data; the high-level pipelines (TextClassificationTrainingPipeline, TokenClassificationTrainingPipeline) load it, run optional data-quality checks, and then automatically choose the training method from the data: AncSetFit for very small datasets (2–5 samples per class), SetFit for medium-sized datasets (6–80), and fine-tuning for larger datasets. You can override configs (batch size, OOD method, augmentation, etc.) and save models in ONNX format. A Streamlit app and Docker images (CPU/GPU) are included for interactive use.
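For intuition, the size-based method selection described above can be sketched as a function of the minimum per-class sample count (thresholds taken from this description; the actual pipeline may use additional signals):

```python
from collections import Counter

def choose_training_method(labels, ancsetfit_max=5, setfit_max=80):
    """Pick a training method from the minimum per-class sample count.

    Thresholds mirror the description above (2-5 -> AncSetFit,
    6-80 -> SetFit, more -> fine-tuning); this is an illustration,
    not the library's actual selection code.
    """
    min_per_class = min(Counter(labels).values())
    if min_per_class <= ancsetfit_max:
        return "ancsetfit"
    if min_per_class <= setfit_max:
        return "setfit"
    return "finetuning"
```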
Built by MWS AI and contributors (see pyproject.toml for authors). Aimed at practitioners and researchers who want a single, data-driven workflow for few-shot and full-size NLU training without manually picking methods or tuning low-level knobs.
Requires Python >=3.12, <3.13.
Usage examples are located in the examples folder.
Installation
To work with the repository in developer mode, install it as an editable package:
```shell
pip install -e .
```
This way you don't need to reinstall the package after code changes. To install all development dependencies (two configurations are available: cpu and cuda), run:

```shell
uv sync --extra cuda
```
Documentation
To build and view the documentation locally:
```shell
uv sync
cd docs && uv run make html
open build/html/index.html
```
Running with Docker
With GPU (recommended host: 16GB RAM, 8 CPU, A100 40GB, ~30GB disk):
```shell
docker-compose up -d
```
Without GPU (macOS or CPU-only):
```shell
docker build --build-arg EXTRA=cpu -t open-autonlu .
docker run -p 8501:8501 open-autonlu
```
Code example with default parameters:
Training
```python
from open_autonlu.auto_classes import (
    TextClassificationTrainingPipeline,
    TokenClassificationTrainingPipeline,
)
from open_autonlu.methods.data_types import SaveFormat

# Text classification training
pipeline = TextClassificationTrainingPipeline(
    train_path="train.csv",
    test_path="test.csv",
    config_overrides={"language": "en"},  # "en" or "ru"
)
result = pipeline.train()
pipeline.save("./model", SaveFormat.ONNX)

# NER training
pipeline = TokenClassificationTrainingPipeline(
    train_path="train.json",
    test_path="test.json",
    config_overrides={"language": "en"},  # "en" or "ru"
)
result = pipeline.train()
pipeline.save("./model", SaveFormat.ONNX)
```
Inference
```python
from open_autonlu.auto_classes import (
    TextClassificationInferenceManager,
    TokenClassificationInferenceManager,
)

# Text classification inference
inferer = TextClassificationInferenceManager("./model")
results = inferer.predict(["Hello world", "Goodbye"], batch_size=32)
for r in results:
    print(f"{r.most_probable.label}: {r.most_probable.score:.3f}")

# NER inference
ner_inferer = TokenClassificationInferenceManager("./ner_model")
results = ner_inferer.predict(["John works at Google"], batch_size=1)
for r in results:
    for entity in r.labels:
        print(f"{entity.text}: {entity.label}")
```
Data Quality Diagnostics
The diagnose() method evaluates training data quality using multiple evaluators:
- cartography (MulticlassCLF): Dataset Cartography
- vinfo (MulticlassCLF): V-usable information
- uncertainty (MulticlassCLF, NER)
- retag (MulticlassCLF, NER)
- label aggregation (NER)
Run the data quality stage:
```python
from open_autonlu.auto_classes import TextClassificationTrainingPipeline

pipeline = TextClassificationTrainingPipeline(train_path="train.csv")
evaluation_result = pipeline.diagnose()
```
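As a generic illustration of the cartography evaluator's idea (not OpenAutoNLU's implementation), Dataset Cartography tracks each training example's gold-label probability across epochs and summarizes it as confidence (mean) and variability (standard deviation); easy-to-learn examples have high confidence and low variability, while low confidence suggests hard or mislabeled samples:

```python
import statistics

def cartography(epoch_probs):
    """Compute Dataset Cartography statistics per training example.

    epoch_probs: {example_id: [probability of the gold label at each epoch]}.
    Returns {example_id: (confidence, variability)}.
    """
    stats = {}
    for ex_id, probs in epoch_probs.items():
        confidence = statistics.mean(probs)      # high -> easy to learn
        variability = statistics.pstdev(probs)   # high -> ambiguous
        stats[ex_id] = (confidence, variability)
    return stats
```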
Configuration Overrides
The config_overrides parameter allows you to customize training behavior by modifying the default configurations.
Basic Usage
```python
from open_autonlu.auto_classes import TextClassificationTrainingPipeline
from open_autonlu.methods.data_types import OodMethod, SaveFormat

pipeline = TextClassificationTrainingPipeline(
    train_path="train.csv",
    config_overrides={
        "language": "en",               # Prompt language for LLM pipelines ("en" or "ru")
        "ood_method": OodMethod.LOGIT,  # OOD detection method
        "batch_size": 32,               # Batch size
    },
)
result = pipeline.train()
pipeline.save("./model", SaveFormat.ONNX)
```
OOD Detection Methods
Out-of-Distribution detection identifies inputs that don't belong to any trained class.
| Method | Description | Best for |
|---|---|---|
| OodMethod.AUTO | Auto-select based on training method | Default |
| OodMethod.MARGINAL_MAHALANOBIS_OOD | Mahalanobis distance from embedding distribution | Fine-tuning |
| OodMethod.MSP_OOD | Maximum Softmax Probability threshold | SetFit, AncSetFit |
| OodMethod.LOGIT | Adds outOfScope class during training | Alternative approach |
| OodMethod.NONE | Disable OOD detection | When not needed |
The threshold_factor parameter controls OOD detection sensitivity. It is a multiplier applied to the OOD detection threshold. Higher values make detection more conservative (fewer samples are marked as OOD), while lower values make it more aggressive (more samples are flagged as OOD). Default value is 1.0.
```python
from open_autonlu.methods.data_types import OodMethod

# Override ood_method and adjust sensitivity
config_overrides = {
    "ood_method": OodMethod.MARGINAL_MAHALANOBIS_OOD,
    "threshold_factor": 1.5,  # More conservative OOD detection
}
```
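To see how the multiplier behaves, here is a minimal, hypothetical distance-based check in the spirit of MARGINAL_MAHALANOBIS_OOD, simplified to Euclidean distance from the training centroid (not the library's implementation); raising threshold_factor raises the effective threshold, so fewer inputs are flagged:

```python
import math

def centroid(vectors):
    """Mean vector of the training embeddings."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

def is_ood(embedding, train_embeddings, base_threshold, threshold_factor=1.0):
    """Flag an input as OOD when its distance to the training centroid
    exceeds base_threshold * threshold_factor. Higher factors are more
    conservative (fewer samples flagged), lower factors more aggressive."""
    dist = math.dist(embedding, centroid(train_embeddings))
    return dist > base_threshold * threshold_factor
```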
LLM Data Augmentation
Automatically augment underrepresented classes using LLM generation. The language parameter controls which prompts are sent to the LLM ("en" for English, "ru" for Russian).
```python
import os

from open_autonlu.auto_classes import TextClassificationTrainingPipeline

pipeline = TextClassificationTrainingPipeline(
    train_path="train.csv",
    config_overrides={
        "language": "en",
        "llm_augmentation": {
            "enabled": True,
            "use_domain_analysis": True,  # Analyze domain for better prompts
            "threshold": 81,              # Augment classes with < 81 samples
            "max_attempts": 10,           # Max generation attempts
            "num_shot": 5,                # Examples in prompt
            "config_overrides": {
                "LlmClientConfig": {
                    "api_key": os.environ["MODEL_API_KEY"],
                    "model_id": "gpt-4",
                }
            },
        },
    },
)
```
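The threshold logic can be illustrated with a small helper (hypothetical, not part of the library) that reports which classes fall below the augmentation threshold and how many synthetic samples each would need:

```python
from collections import Counter

def augmentation_targets(labels, threshold=81):
    """Find classes that would trigger LLM augmentation: for each class
    with fewer than `threshold` samples, report how many synthetic
    examples are needed to reach the threshold. Illustrative only; the
    real pipeline's generation loop also retries up to max_attempts."""
    counts = Counter(labels)
    return {label: threshold - n for label, n in counts.items() if n < threshold}
```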
Synthetic Test Generation
Generate synthetic test data using LLM when no test set is provided.
```python
import os

from open_autonlu.auto_classes import TextClassificationTrainingPipeline

pipeline = TextClassificationTrainingPipeline(
    train_path="train.csv",  # No test_path provided
    config_overrides={
        "language": "en",
        "llm_test_generation": {
            "enabled": True,
            "num_samples_per_class": 100,
            "use_domain_analysis": True,
            "synthetic_test_path": "./synthetic_test.csv",  # Save generated data
            "config_overrides": {
                "LlmClientConfig": {
                    "api_key": os.environ["MODEL_API_KEY"],
                    "model_id": "gpt-4",
                }
            },
        },
    },
)
result = pipeline.train()  # Test data generated automatically
```
Method-Specific Overrides
```python
# SetFit configuration
config_overrides = {
    "SetFitMethodConfig": {
        "num_iterations": 25,
        "body_lr": 2e-5,
        "batch_size": 16,
    }
}

# Finetuner configuration
config_overrides = {
    "FinetunerConfig": {
        "num_hpo_trials": 15,  # Hyperparameter optimization trials
    }
}
```
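Conceptually, such nested override dicts are applied on top of the default configurations; the merge can be sketched as a recursive dict merge (an assumption about the mechanism, not the library's actual code):

```python
def deep_merge(defaults, overrides):
    """Recursively merge override values into a copy of the defaults,
    the way nested config_overrides (e.g. SetFitMethodConfig) are
    conceptually layered on top of default configurations.
    A sketch; the library's actual merge logic may differ."""
    merged = dict(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)  # merge nested configs
        else:
            merged[key] = value  # scalar override replaces the default
    return merged
```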
Data Formats
Text Classification (CSV)
```csv
text,label,anc_label
"Remove my meeting tomorrow",calendar_remove,remove calendar event
"Add a dentist appointment on Friday",calendar_set,add calendar event
```
The anc_label column is optional; it holds a short, human-readable natural-language description of what the class means.
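A small sketch of reading this format with the standard csv module, treating anc_label as optional (illustrative; the pipeline has its own loaders):

```python
import csv
import io

def load_classification_csv(csv_text):
    """Parse the text/label/anc_label CSV format described above.
    anc_label is optional and defaults to None when the column is absent."""
    rows = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        rows.append((row["text"], row["label"], row.get("anc_label")))
    return rows
```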
NER (JSON)
The package supports two NER data formats:
Offsets format — entities are defined by character spans with start and end positions:
```json
[
  {"text": "What time is it in Australia", "spans": [{"start": 19, "end": 28, "label": "place_name"}]},
  {"text": "What is the forecast today for Moscow", "spans": [{"start": 21, "end": 26, "label": "date"}, {"start": 31, "end": 37, "label": "place_name"}]}
]
```
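End offsets are exclusive, so entity surface forms can be recovered by slicing the text; a minimal helper (not part of the library) to check spans against the text:

```python
def extract_entities(record):
    """Slice entity surface forms out of an offsets-format record.
    End offsets are exclusive, matching the examples above."""
    return [(record["text"][s["start"]:s["end"]], s["label"])
            for s in record.get("spans", [])]
```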
Brackets format — entities are marked inline using [label : entity] notation:
```json
[
  {"text": "play a track by [artist : the rolling stones]"},
  {"text": "play [song : hello] by [artist : adele]"}
]
```
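The two formats are interconvertible; a hypothetical converter from the brackets notation to the offsets format (illustrative only, not the library's own parser):

```python
import re

BRACKET = re.compile(r"\[\s*([^:\]]+?)\s*:\s*([^\]]+?)\s*\]")

def brackets_to_offsets(marked_text):
    """Convert inline [label : entity] markup to plain text plus
    character-offset spans with exclusive end offsets."""
    plain, spans = [], []
    pos = 0      # position in the marked-up input
    cursor = 0   # length of plain text built so far
    for m in BRACKET.finditer(marked_text):
        plain.append(marked_text[pos:m.start()])
        cursor += m.start() - pos
        label, entity = m.group(1), m.group(2)
        spans.append({"start": cursor, "end": cursor + len(entity), "label": label})
        plain.append(entity)
        cursor += len(entity)
        pos = m.end()
    plain.append(marked_text[pos:])
    return {"text": "".join(plain), "spans": spans}
```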
Example data
The files in examples/test_data/noise_n_shot_data/ (text classification) and examples/test_data/noise_n_shot_data_ner/ (NER) were made with external sampling scripts.
- Text classification: the scripts use the SNIPS dataset (intent/slot-style). They build train/test splits with optional n-shot sampling and label noise. In the included example, 1% of training labels were noised (randomly flipped to another class). The resulting CSVs follow the formats described above.
- NER: the scripts use the MASSIVE dataset. They produce few-shot train/test subsets with optional label noise (1% of labels noised) and export data in the offsets/BIO-style JSON expected by the NER pipeline.
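The label-noise step can be sketched as flipping a fixed fraction of labels to a different class (an illustration; the original sampling scripts are external to this repository):

```python
import random

def add_label_noise(labels, classes, noise_rate=0.01, seed=0):
    """Randomly flip a fraction of labels to a *different* class,
    mimicking the 1% label noise used for the example data."""
    rng = random.Random(seed)
    labels = list(labels)  # work on a copy
    n_noisy = max(1, int(len(labels) * noise_rate))
    for i in rng.sample(range(len(labels)), n_noisy):
        labels[i] = rng.choice([c for c in classes if c != labels[i]])
    return labels
```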
Project details
File details
Details for the file open_autonlu-1.0.0.tar.gz:
- Size: 361.4 kB
- Tags: Source
- Uploaded using Trusted Publishing: Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes:

| Algorithm | Hash digest |
|---|---|
| SHA256 | 0a9d2a4c5c48fe465f41d4d93e9df99d3ac26dc5cf02ccbb3ebbd8da8ee62f4d |
| MD5 | 1c7988e163610ec9464bcd31ea368ab0 |
| BLAKE2b-256 | eafce64fe283e50f17928bc99a33e488bc7e678a25bce655a43d2c1aec3b7bc5 |
Provenance
The following attestation bundle was made for open_autonlu-1.0.0.tar.gz:
- Publisher: publish.yml on mts-ai/OpenAutoNLU
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: open_autonlu-1.0.0.tar.gz
- Subject digest: 0a9d2a4c5c48fe465f41d4d93e9df99d3ac26dc5cf02ccbb3ebbd8da8ee62f4d
- Sigstore transparency entry: 1017914876
- Permalink: mts-ai/OpenAutoNLU@8646e10991db4db1955f00b632353945bdac4365
- Branch / Tag: refs/tags/v1.0.0
- Owner: https://github.com/mts-ai
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@8646e10991db4db1955f00b632353945bdac4365
- Trigger Event: release
File details
Details for the file open_autonlu-1.0.0-py3-none-any.whl:
- Size: 136.1 kB
- Tags: Python 3
- Uploaded using Trusted Publishing: Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes:

| Algorithm | Hash digest |
|---|---|
| SHA256 | 64a4a60ad3493812783b9ffac1056e406a6d2393e48b464720eb92e580352c17 |
| MD5 | b67d89e2e95997ba69375257de271fba |
| BLAKE2b-256 | 61b0d0e6a279425e8491f199f6db3e9d3a73b1fe88d17232f82cb932b4fc4a39 |
Provenance
The following attestation bundle was made for open_autonlu-1.0.0-py3-none-any.whl:
- Publisher: publish.yml on mts-ai/OpenAutoNLU
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: open_autonlu-1.0.0-py3-none-any.whl
- Subject digest: 64a4a60ad3493812783b9ffac1056e406a6d2393e48b464720eb92e580352c17
- Sigstore transparency entry: 1017914878
- Permalink: mts-ai/OpenAutoNLU@8646e10991db4db1955f00b632353945bdac4365
- Branch / Tag: refs/tags/v1.0.0
- Owner: https://github.com/mts-ai
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@8646e10991db4db1955f00b632353945bdac4365
- Trigger Event: release