GigaAM: A package for audio modeling and ASR.
GigaAM: the family of open-source acoustic models for speech processing
Latest News
- 2024/12 — MIT License, GigaAM-v2 (15% and 12% WER reduction for the CTC and RNN-T models, respectively), ONNX export support
- 2024/05 — GigaAM-RNNT (19% WER reduction), long-form inference using external Voice Activity Detection
- 2024/04 — GigaAM Release: GigaAM-CTC (SoTA Speech Recognition model for the Russian language), GigaAM-Emo
Table of Contents
- Overview
- Installation
- GigaAM: The Foundational Model
- GigaAM for Speech Recognition
- GigaAM-Emo: Emotion Recognition
- License
- Links
Overview
GigaAM (Giga Acoustic Model) is a family of open-source models for Russian speech processing tasks, including speech recognition and emotion recognition. The models are built on top of the Conformer architecture and leverage self-supervised learning (wav2vec2-based for GigaAM-v1 and HuBERT-based for GigaAM-v2).
GigaAM models are state-of-the-art open-source solutions for their respective tasks in the Russian language.
This repository includes:
- GigaAM: A foundational self-supervised model pre-trained on massive Russian speech datasets.
- GigaAM-CTC and GigaAM-RNNT: Fine-tuned models for automatic speech recognition (ASR).
- GigaAM-Emo: A fine-tuned model for emotion recognition.
Installation
Requirements
- Python ≥ 3.8
- ffmpeg installed and added to your system's PATH
Install the GigaAM Package
1. Clone the repository:

```bash
git clone https://github.com/salute-developers/GigaAM.git
cd GigaAM
```

2. Install the package in editable mode:

```bash
pip install -e .
```

3. Verify the installation:

```python
import gigaam

model = gigaam.load_model("ctc")
print(model)
```
GigaAM: The Foundational Model
GigaAM is a Conformer-based foundational model (240M parameters) pre-trained on 50,000+ hours of diverse Russian speech data.
It serves as the backbone for the entire GigaAM family, enabling state-of-the-art fine-tuned performance in speech recognition and emotion recognition.
There are two available versions:
- GigaAM-v1 was trained with a wav2vec2-like approach and can be used by loading the `v1_ssl` model version.
- GigaAM-v2 was trained with a HuBERT-like approach, which yields higher-quality ASR models. It can be used by loading the `v2_ssl` or `ssl` model version.
More information about GigaAM-v1 can be found in our post on Habr.
GigaAM Usage Example
```python
import gigaam

audio_path = "example.wav"  # path to your audio file

model = gigaam.load_model("ssl")  # Options: "ssl", "v1_ssl"
embedding, _ = model.embed_audio(audio_path)
```
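Since the model serves as a backbone, the embedding can feed a downstream head. Below is a minimal sketch of a linear probe on time-pooled features; the `torch.nn.Linear` head and the four-class setup are illustrative assumptions (not part of the gigaam API), and it assumes `embed_audio` returns a `[batch, time, dim]` tensor:

```python
import torch

# Illustrative linear probe (not part of gigaam): pool the SSL features
# over time and classify. Assumes `embedding` is a [batch, time, dim] tensor.
pooled = embedding.mean(dim=1).float().cpu()   # [batch, dim], fp32 on CPU
probe = torch.nn.Linear(pooled.shape[-1], 4)   # hypothetical 4-class head
logits = probe(pooled)
print(logits.shape)                            # e.g. torch.Size([1, 4])
```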
GigaAM for Speech Recognition
We fine-tuned the GigaAM encoder for ASR using two different architectures:
- GigaAM-CTC was fine-tuned with Connectionist Temporal Classification and a character-based tokenizer.
- GigaAM-RNNT was fine-tuned with RNN Transducer loss.
Fine-tuning was done for both GigaAM-v1 and GigaAM-v2 SSL models, so we have 4 ASR models: v1 and v2 versions for both CTC and RNNT.
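All four checkpoints share the same loading and transcription interface, as the sketch below shows (it assumes a local `example.wav`):

```python
import gigaam

# The four fine-tuned ASR checkpoints share one interface.
for name in ["v1_ctc", "v1_rnnt", "v2_ctc", "v2_rnnt"]:
    model = gigaam.load_model(name)
    print(name, model.transcribe("example.wav"))
```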
Training Data
The models were trained on publicly available Russian datasets:
| Dataset | Size (hours) | Weight |
|---|---|---|
| Golos | 1227 | 0.6 |
| SOVA | 369 | 0.2 |
| Russian Common Voice | 207 | 0.1 |
| Russian LibriSpeech | 93 | 0.1 |
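The Weight column describes how the datasets are mixed during training. As a rough illustration of what weighted sampling can look like (a sketch only, not the actual training pipeline):

```python
import random

# Hypothetical per-example dataset sampling driven by the table's weights.
weights = {
    "Golos": 0.6,
    "SOVA": 0.2,
    "Russian Common Voice": 0.1,
    "Russian LibriSpeech": 0.1,
}
names = list(weights)
batch_sources = random.choices(names, weights=[weights[n] for n in names], k=8)
print(batch_sources)  # e.g. ['Golos', 'Golos', 'SOVA', 'Golos', ...]
```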
Performance Metrics (Word Error Rate)
| Model | Parameters | Golos Crowd | Golos Farfield | OpenSTT YouTube | OpenSTT Phone Calls | OpenSTT Audiobooks | Mozilla Common Voice 12 | Mozilla Common Voice 19 | Russian LibriSpeech |
|---|---|---|---|---|---|---|---|---|---|
| Whisper-large-v3 | 1.5B | 13.9 | 16.6 | 18.0 | 28.0 | 14.4 | 5.7 | 5.5 | 9.5 |
| NVIDIA FastConformer | 115M | 2.2 | 6.6 | 21.2 | 30.0 | 13.9 | 2.7 | 5.7 | 11.3 |
| GigaAM-CTC-v1 | 242M | 3.0 | 5.7 | 16.0 | 23.2 | 12.5 | 2.0 | 10.5 | 7.5 |
| GigaAM-RNNT-v1 | 243M | 2.3 | 5.0 | 14.0 | 21.7 | 11.7 | 1.9 | 9.9 | 7.7 |
| GigaAM-CTC-v2 | 242M | 2.5 | 4.3 | 14.1 | 21.1 | 10.7 | 2.1 | 3.1 | 5.5 |
| GigaAM-RNNT-v2 | 243M | 2.2 | 3.9 | 13.3 | 20.0 | 10.2 | 1.8 | 2.7 | 5.5 |
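The scores above use word error rate (WER): word-level edit distance divided by the number of reference words. A minimal self-contained implementation, for readers who want to reproduce the metric on their own data:

```python
# Word error rate: Levenshtein distance over words / reference word count.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[-1][-1] / max(len(ref), 1)

print(wer("привет как дела", "привет дела"))  # 0.33 (one deleted word)
```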
Speech Recognition Example (GigaAM-ASR)
Basic usage: short audio transcription (up to 30 seconds)

```python
import gigaam

audio_path = "example.wav"  # path to your audio file

model_name = "rnnt"  # Options: "v2_ctc" or "ctc", "v2_rnnt" or "rnnt", "v1_ctc", "v1_rnnt"
model = gigaam.load_model(model_name)
transcription = model.transcribe(audio_path)
```
Long-form audio transcription

1. Install the external VAD dependencies (the pyannote.audio library):

```bash
pip install gigaam[longform]
```

2. Set up Hugging Face access:
   - Generate a Hugging Face API token.
   - Accept the conditions to access the pyannote/voice-activity-detection files and content.
   - Accept the conditions to access the pyannote/segmentation files and content.

3. Use the `model.transcribe_longform` method:

```python
import os

import gigaam

os.environ["HF_TOKEN"] = "<HF_TOKEN>"

model = gigaam.load_model("ctc")
recognition_result = model.transcribe_longform("long_example.wav")

for utterance in recognition_result:
    transcription = utterance["transcription"]
    start, end = utterance["boundaries"]
    print(f"[{gigaam.format_time(start)} - {gigaam.format_time(end)}]: {transcription}")
```
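The utterance boundaries make the result easy to postprocess, for example into a simple SRT-style subtitle file. A sketch building on the variables from the example above (note that `gigaam.format_time` may not emit the exact `HH:MM:SS,mmm` timestamps strict SRT parsers expect):

```python
# Sketch: dump the longform result into a simple SRT-style file.
with open("long_example.srt", "w", encoding="utf-8") as srt:
    for idx, utt in enumerate(recognition_result, start=1):
        start, end = utt["boundaries"]
        srt.write(f"{idx}\n")
        srt.write(f"{gigaam.format_time(start)} --> {gigaam.format_time(end)}\n")
        srt.write(utt["transcription"] + "\n\n")
```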
ONNX inference example
1. Export the model to ONNX using the `model.to_onnx` method:

```python
import gigaam

onnx_dir = "onnx"
model_type = "rnnt"  # or "ctc"

model = gigaam.load_model(
    model_type,
    fp16_encoder=False,  # only fp32 tensors
    use_flash=False,     # disable flash attention
)
model.to_onnx(dir_path=onnx_dir)
```

2. Run ONNX inference:

```python
from gigaam.onnx_utils import load_onnx_sessions, transcribe_sample

sessions = load_onnx_sessions(onnx_dir, model_type)
transcribe_sample("example.wav", model_type, sessions)
```
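One quick way to sanity-check the export is to compare the ONNX output against the PyTorch model on the same file. The sketch below assumes `transcribe_sample` returns the decoded string, which is an assumption about its return value:

```python
# Sketch: compare ONNX and PyTorch outputs on one file
# (assumes transcribe_sample returns the decoded string).
pytorch_text = model.transcribe("example.wav")
onnx_text = transcribe_sample("example.wav", model_type, sessions)
print(pytorch_text == onnx_text)
```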
All these examples can also be found in the inference_example.ipynb notebook.
GigaAM-Emo: Emotion Recognition
GigaAM-Emo is a fine-tuned model for emotion recognition trained on the Dusha dataset. It significantly outperforms existing models on several metrics.
Performance Metrics
| Model | Crowd: Unweighted Acc. | Crowd: Weighted Acc. | Crowd: Macro F1 | Podcast: Unweighted Acc. | Podcast: Weighted Acc. | Podcast: Macro F1 |
|---|---|---|---|---|---|---|
| DUSHA baseline (MobileNetV2 + Self-Attention) | 0.83 | 0.76 | 0.77 | 0.89 | 0.53 | 0.54 |
| АБК (TIM-Net) | 0.84 | 0.77 | 0.78 | 0.90 | 0.50 | 0.55 |
| GigaAM-Emo | 0.90 | 0.87 | 0.84 | 0.90 | 0.76 | 0.67 |
Emotion Recognition Example (GigaAM-Emo)
```python
from typing import Dict

import gigaam

model = gigaam.load_model("emo")
emotion2prob: Dict[str, float] = model.get_probs("example.wav")

print(", ".join(f"{emotion}: {prob:.3f}" for emotion, prob in emotion2prob.items()))
```
License
GigaAM's code and model weights are released under the MIT License.
Links