GigaAM: A package for audio modeling and ASR.
GigaAM: the family of open-source acoustic models for speech processing
Latest News
- 2024/12 — MIT License, GigaAM-v2 (15% and 12% WER reduction for the CTC and RNN-T models, respectively), ONNX export support
- 2024/05 — GigaAM-RNNT (19% WER reduction), long-form inference using external Voice Activity Detection
- 2024/04 — GigaAM Release: GigaAM-CTC (SoTA Speech Recognition model for the Russian language), GigaAM-Emo
Table of Contents
- Overview
- Installation
- GigaAM: The Foundational Model
- GigaAM for Speech Recognition
- GigaAM-Emo: Emotion Recognition
- License
- Links
Overview
GigaAM (Giga Acoustic Model) is a family of open-source models for Russian speech processing tasks, including speech recognition and emotion recognition. The models are built on top of the Conformer architecture and leverage self-supervised learning (wav2vec2-based for GigaAM-v1 and HuBERT-based for GigaAM-v2).
GigaAM models are state-of-the-art open-source solutions for their respective tasks in the Russian language.
This repository includes:
- GigaAM: A foundational self-supervised model pre-trained on massive Russian speech datasets.
- GigaAM-CTC and GigaAM-RNNT: Fine-tuned models for automatic speech recognition (ASR).
- GigaAM-Emo: A fine-tuned model for emotion recognition.
Installation
Requirements
- Python ≥ 3.8
- ffmpeg installed and added to your system's PATH
Install the GigaAM Package
1. Clone the repository:

```bash
git clone https://github.com/salute-developers/GigaAM.git
cd GigaAM
```

2. Install the package in editable mode:

```bash
pip install -e .
```

3. Verify the installation:

```python
import gigaam

model = gigaam.load_model("ctc")
print(model)
```
GigaAM: The Foundational Model
GigaAM is a Conformer-based foundational model (240M parameters) pre-trained on 50,000+ hours of diverse Russian speech data.
It serves as the backbone for the entire GigaAM family, enabling state-of-the-art fine-tuned performance in speech recognition and emotion recognition.
There are two available versions:
- GigaAM-v1 was trained with a wav2vec2-like approach and can be used by loading the `v1_ssl` model version.
- GigaAM-v2 was trained with a HuBERT-like approach, which yields higher-quality ASR models. It can be used by loading the `v2_ssl` or `ssl` model version.
More information about GigaAM-v1 can be found in our post on Habr.
GigaAM Usage Example
```python
import gigaam

audio_path = "example.wav"  # path to your audio file

model = gigaam.load_model("ssl")  # Options: "ssl", "v1_ssl"
embedding, _ = model.embed_audio(audio_path)
```
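Since the model serves as a backbone, the embedding can feed a downstream head. Below is a minimal sketch of a linear probe on time-pooled features; the `torch.nn.Linear` head and the four-class setup are illustrative assumptions (not part of the gigaam API), and it assumes `embed_audio` returns a `[batch, time, dim]` tensor:

```python
import torch

# Illustrative linear probe (not part of gigaam): pool the SSL features
# over time and classify. Assumes `embedding` is a [batch, time, dim] tensor.
pooled = embedding.mean(dim=1).float().cpu()   # [batch, dim], fp32 on CPU
probe = torch.nn.Linear(pooled.shape[-1], 4)   # hypothetical 4-class head
logits = probe(pooled)
print(logits.shape)                            # e.g. torch.Size([1, 4])
```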
GigaAM for Speech Recognition
We fine-tuned the GigaAM encoder for ASR using two different architectures:
- GigaAM-CTC was fine-tuned with Connectionist Temporal Classification and a character-based tokenizer.
- GigaAM-RNNT was fine-tuned with RNN Transducer loss.
Fine-tuning was done for both GigaAM-v1 and GigaAM-v2 SSL models, so we have 4 ASR models: v1 and v2 versions for both CTC and RNNT.
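All four checkpoints share the same loading and transcription interface, as the sketch below shows (it assumes a local `example.wav`):

```python
import gigaam

# The four fine-tuned ASR checkpoints share one interface.
for name in ["v1_ctc", "v1_rnnt", "v2_ctc", "v2_rnnt"]:
    model = gigaam.load_model(name)
    print(name, model.transcribe("example.wav"))
```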
Training Data
The models were trained on publicly available Russian datasets:
| Dataset | Size (hours) | Weight |
|---|---|---|
| Golos | 1227 | 0.6 |
| SOVA | 369 | 0.2 |
| Russian Common Voice | 207 | 0.1 |
| Russian LibriSpeech | 93 | 0.1 |
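The Weight column describes how the datasets are mixed during training. As a rough illustration of what weighted sampling can look like (a sketch only, not the actual training pipeline):

```python
import random

# Hypothetical per-example dataset sampling driven by the table's weights.
weights = {
    "Golos": 0.6,
    "SOVA": 0.2,
    "Russian Common Voice": 0.1,
    "Russian LibriSpeech": 0.1,
}
names = list(weights)
batch_sources = random.choices(names, weights=[weights[n] for n in names], k=8)
print(batch_sources)  # e.g. ['Golos', 'Golos', 'SOVA', 'Golos', ...]
```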
Performance Metrics (Word Error Rate)
| Model | Parameters | Golos Crowd | Golos Farfield | OpenSTT YouTube | OpenSTT Phone Calls | OpenSTT Audiobooks | Mozilla Common Voice 12 | Mozilla Common Voice 19 | Russian LibriSpeech |
|---|---|---|---|---|---|---|---|---|---|
| Whisper-large-v3 | 1.5B | 13.9 | 16.6 | 18.0 | 28.0 | 14.4 | 5.7 | 5.5 | 9.5 |
| NVIDIA FastConformer | 115M | 2.2 | 6.6 | 21.2 | 30.0 | 13.9 | 2.7 | 5.7 | 11.3 |
| GigaAM-CTC-v1 | 242M | 3.0 | 5.7 | 16.0 | 23.2 | 12.5 | 2.0 | 10.5 | 7.5 |
| GigaAM-RNNT-v1 | 243M | 2.3 | 5.0 | 14.0 | 21.7 | 11.7 | 1.9 | 9.9 | 7.7 |
| GigaAM-CTC-v2 | 242M | 2.5 | 4.3 | 14.1 | 21.1 | 10.7 | 2.1 | 3.1 | 5.5 |
| GigaAM-RNNT-v2 | 243M | 2.2 | 3.9 | 13.3 | 20.0 | 10.2 | 1.8 | 2.7 | 5.5 |
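The scores above use word error rate (WER): word-level edit distance divided by the number of reference words. A minimal self-contained implementation, for readers who want to reproduce the metric on their own data:

```python
# Word error rate: Levenshtein distance over words / reference word count.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[-1][-1] / max(len(ref), 1)

print(wer("привет как дела", "привет дела"))  # 0.33 (one deleted word)
```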
Speech Recognition Example (GigaAM-ASR)
Basic usage: short audio transcription (up to 30 seconds)

```python
import gigaam

audio_path = "example.wav"  # path to your audio file

model_name = "rnnt"  # Options: "v2_ctc" or "ctc", "v2_rnnt" or "rnnt", "v1_ctc", "v1_rnnt"
model = gigaam.load_model(model_name)
transcription = model.transcribe(audio_path)
```
Long-form audio transcription

1. Install the external VAD dependencies (the pyannote.audio library):

```bash
pip install gigaam[longform]
```

2. Set up Hugging Face access:
   - Generate a Hugging Face API token.
   - Accept the conditions to access the pyannote/voice-activity-detection files and content.
   - Accept the conditions to access the pyannote/segmentation files and content.

3. Use the `model.transcribe_longform` method:

```python
import os

import gigaam

os.environ["HF_TOKEN"] = "<HF_TOKEN>"

model = gigaam.load_model("ctc")
recognition_result = model.transcribe_longform("long_example.wav")

for utterance in recognition_result:
    transcription = utterance["transcription"]
    start, end = utterance["boundaries"]
    print(f"[{gigaam.format_time(start)} - {gigaam.format_time(end)}]: {transcription}")
```
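The utterance boundaries make the result easy to postprocess, for example into a simple SRT-style subtitle file. A sketch building on the variables from the example above (note that `gigaam.format_time` may not emit the exact `HH:MM:SS,mmm` timestamps strict SRT parsers expect):

```python
# Sketch: dump the longform result into a simple SRT-style file.
with open("long_example.srt", "w", encoding="utf-8") as srt:
    for idx, utt in enumerate(recognition_result, start=1):
        start, end = utt["boundaries"]
        srt.write(f"{idx}\n")
        srt.write(f"{gigaam.format_time(start)} --> {gigaam.format_time(end)}\n")
        srt.write(utt["transcription"] + "\n\n")
```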
ONNX inference example
1. Export the model to ONNX using the `model.to_onnx` method:

```python
import gigaam

onnx_dir = "onnx"
model_type = "rnnt"  # or "ctc"

model = gigaam.load_model(
    model_type,
    fp16_encoder=False,  # only fp32 tensors
    use_flash=False,     # disable flash attention
)
model.to_onnx(dir_path=onnx_dir)
```

2. Run ONNX inference:

```python
from gigaam.onnx_utils import load_onnx_sessions, transcribe_sample

sessions = load_onnx_sessions(onnx_dir, model_type)
transcribe_sample("example.wav", model_type, sessions)
```
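One quick way to sanity-check the export is to compare the ONNX output against the PyTorch model on the same file. The sketch below assumes `transcribe_sample` returns the decoded string, which is an assumption about its return value:

```python
# Sketch: compare ONNX and PyTorch outputs on one file
# (assumes transcribe_sample returns the decoded string).
pytorch_text = model.transcribe("example.wav")
onnx_text = transcribe_sample("example.wav", model_type, sessions)
print(pytorch_text == onnx_text)
```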
All these examples can also be found in the inference_example.ipynb notebook.
GigaAM-Emo: Emotion Recognition
GigaAM-Emo is a fine-tuned model for emotion recognition trained on the Dusha dataset. It significantly outperforms existing models on several metrics.
Performance Metrics
| Model | Crowd: Unweighted Acc. | Crowd: Weighted Acc. | Crowd: Macro F1 | Podcast: Unweighted Acc. | Podcast: Weighted Acc. | Podcast: Macro F1 |
|---|---|---|---|---|---|---|
| DUSHA baseline (MobileNetV2 + Self-Attention) | 0.83 | 0.76 | 0.77 | 0.89 | 0.53 | 0.54 |
| АБК (TIM-Net) | 0.84 | 0.77 | 0.78 | 0.90 | 0.50 | 0.55 |
| GigaAM-Emo | 0.90 | 0.87 | 0.84 | 0.90 | 0.76 | 0.67 |
Emotion Recognition Example (GigaAM-Emo)
```python
from typing import Dict

import gigaam

model = gigaam.load_model("emo")
emotion2prob: Dict[str, float] = model.get_probs("example.wav")

print(", ".join(f"{emotion}: {prob:.3f}" for emotion, prob in emotion2prob.items()))
```
License
GigaAM's code and model weights are released under the MIT License.
Links