
GigaAM: A package for audio modeling and ASR.


GigaAM: the family of open-source acoustic models for speech processing


Overview

GigaAM (Giga Acoustic Model) is a family of open-source models for Russian speech processing tasks, including speech recognition and emotion recognition. The models are built on top of the Conformer architecture and leverage self-supervised learning (wav2vec2-based for GigaAM-v1 and HuBERT-based for GigaAM-v2).

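GigaAM models are state-of-the-art open-source solutions for Russian speech recognition and emotion recognition: on the benchmarks reported below, the best GigaAM model matches or outperforms every other listed system on each test set.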

This repository includes:

  • GigaAM: A foundational self-supervised model pre-trained on 50,000+ hours of Russian speech.
  • GigaAM-CTC and GigaAM-RNNT: Fine-tuned models for automatic speech recognition (ASR).
  • GigaAM-Emo: A fine-tuned model for emotion recognition.

Installation

Requirements

  • Python ≥ 3.8
  • ffmpeg installed and added to your system's PATH
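
Both requirements can be checked before installing. The snippet below is a minimal sketch using only the standard library; the function name `check_requirements` is illustrative and not part of the gigaam package.

```python
import shutil
import sys


def check_requirements(min_python=(3, 8)):
    """Report whether the two requirements listed above are met."""
    return {
        # Compare only (major, minor) against the minimum version.
        "python_ok": sys.version_info[:2] >= min_python,
        # shutil.which returns None when the executable is not on PATH.
        "ffmpeg_on_path": shutil.which("ffmpeg") is not None,
    }


if __name__ == "__main__":
    for name, ok in check_requirements().items():
        print(f"{name}: {'ok' if ok else 'MISSING'}")
```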

Install the GigaAM Package

  1. Clone the repository:

    git clone https://github.com/salute-developers/GigaAM.git
    cd GigaAM
    
  2. Install the package in editable mode:

    pip install -e .
    
  3. Verify the installation:

    import gigaam
    model = gigaam.load_model("ctc")
    print(model)
    

GigaAM: The Foundational Model

GigaAM is a Conformer-based foundational model (240M parameters) pre-trained on 50,000+ hours of diverse Russian speech data.

It serves as the backbone for the entire GigaAM family, enabling state-of-the-art fine-tuned performance in speech recognition and emotion recognition.

There are 2 available versions:

  • GigaAM-v1 was trained with a wav2vec2-like approach. Load it with the v1_ssl model version.
  • GigaAM-v2 was trained with a HuBERT-like approach and yields higher-quality ASR models after fine-tuning. Load it with the v2_ssl or ssl model version.

More information about GigaAM-v1 can be found in our post on Habr.

GigaAM Usage Example

import gigaam

audio_path = "example.wav"  # path to the audio file to embed
model = gigaam.load_model("ssl")  # Options: "ssl" or "v2_ssl", "v1_ssl"
embedding, _ = model.embed_audio(audio_path)

GigaAM for Speech Recognition

We fine-tuned the GigaAM encoder for ASR using two different architectures: CTC (Connectionist Temporal Classification) and RNNT (RNN Transducer).

Fine-tuning was done for both the GigaAM-v1 and GigaAM-v2 SSL models, so there are four ASR models: v1 and v2 versions of both CTC and RNNT.

Training Data

The models were trained on publicly available Russian datasets:

| Dataset              | Size (hours) | Weight |
|----------------------|--------------|--------|
| Golos                | 1227         | 0.6    |
| SOVA                 | 369          | 0.2    |
| Russian Common Voice | 207          | 0.1    |
| Russian LibriSpeech  | 93           | 0.1    |
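
The table does not say how the weights are applied. Assuming each weight is the probability of drawing a training example from that dataset (an assumption, not something stated here), the sketch below compares each weight with the dataset's share of the total hours. The names `DATASETS` and `sampling_ratio` are illustrative.

```python
# Hours and weights copied from the table above.
DATASETS = {
    "Golos": (1227, 0.6),
    "SOVA": (369, 0.2),
    "Russian Common Voice": (207, 0.1),
    "Russian LibriSpeech": (93, 0.1),
}


def sampling_ratio(datasets):
    """Map each dataset to weight / (its share of the total hours).

    A ratio above 1 means the dataset would be drawn more often than its
    size alone suggests; below 1 means less often.
    """
    total_hours = sum(hours for hours, _ in datasets.values())
    return {
        name: weight / (hours / total_hours)
        for name, (hours, weight) in datasets.items()
    }


ratios = sampling_ratio(DATASETS)
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f}x")
```

Under this reading, Russian LibriSpeech would be sampled about twice as often as its 93 hours imply, while the other three datasets stay close to proportional.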

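Performance Metrics (Word Error Rate, %)

| Model              | Parameters | Golos Crowd | Golos Farfield | OpenSTT YouTube | OpenSTT Phone Calls | OpenSTT Audiobooks | Mozilla Common Voice 12 | Mozilla Common Voice 19 | Russian LibriSpeech |
|--------------------|------------|-------------|----------------|-----------------|---------------------|--------------------|-------------------------|-------------------------|---------------------|
| Whisper-large-v3   | 1.5B       | 13.9        | 16.6           | 18.0            | 28.0                | 14.4               | 5.7                     | 5.5                     | 9.5                 |
| NVIDIA FastConformer | 115M     | 2.2         | 6.6            | 21.2            | 30.0                | 13.9               | 2.7                     | 5.7                     | 11.3                |
| GigaAM-CTC-v1      | 242M       | 3.0         | 5.7            | 16.0            | 23.2                | 12.5               | 2.0                     | 10.5                    | 7.5                 |
| GigaAM-RNNT-v1     | 243M       | 2.3         | 5.0            | 14.0            | 21.7                | 11.7               | 1.9                     | 9.9                     | 7.7                 |
| GigaAM-CTC-v2      | 242M       | 2.5         | 4.3            | 14.1            | 21.1                | 10.7               | 2.1                     | 3.1                     | 5.5                 |
| GigaAM-RNNT-v2     | 243M       | 2.2         | 3.9            | 13.3            | 20.0                | 10.2               | 1.8                     | 2.7                     | 5.5                 |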
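
For readers unfamiliar with the metric, word error rate is commonly computed as the word-level edit distance between reference and hypothesis, divided by the number of reference words. The sketch below implements that standard definition; the text normalization used for the numbers above is not described here, so this is not a reproduction of the authors' evaluation.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference words processed
    # so far (minus the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# One substitution and one deletion against four reference words.
print(f"{100 * word_error_rate('a b c d', 'a x c'):.1f}%")
```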

Speech Recognition Example (GigaAM-ASR)

Basic usage: short audio transcription (up to 30 seconds)

import gigaam

audio_path = "example.wav"  # path to an audio file of up to 30 seconds
model_name = "rnnt"  # Options: "v2_ctc" or "ctc", "v2_rnnt" or "rnnt", "v1_ctc", "v1_rnnt"
model = gigaam.load_model(model_name)
transcription = model.transcribe(audio_path)
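
Because basic usage is limited to 30 seconds, it can help to check a file's duration before choosing a method. The sketch below reads the duration from a WAV header with the standard-library `wave` module, which handles uncompressed PCM WAV only; the helper names are illustrative and not part of gigaam.

```python
import wave


def wav_duration_seconds(path: str) -> float:
    """Duration of a PCM WAV file, computed from its header."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()


def needs_longform(path: str, limit_seconds: float = 30.0) -> bool:
    """True when the file exceeds the short-audio limit stated above."""
    return wav_duration_seconds(path) > limit_seconds
```

A caller could then use `model.transcribe_longform(path)` when `needs_longform(path)` is true and `model.transcribe(path)` otherwise.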

Long-form audio transcription

  1. Install the external VAD dependencies (the pyannote.audio library):
    pip install gigaam[longform]
    
  2. Use the model.transcribe_longform method:
    import os
    import gigaam
    
    os.environ["HF_TOKEN"] = "<HF_TOKEN>"  # replace with your Hugging Face access token
    
    model = gigaam.load_model("ctc")
    recognition_result = model.transcribe_longform("long_example.wav")
    
    for utterance in recognition_result:
       transcription = utterance["transcription"]
       start, end = utterance["boundaries"]
       print(f"[{gigaam.format_time(start)} - {gigaam.format_time(end)}]: {transcription}")
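
The utterance dictionaries above carry a "transcription" string and a "boundaries" pair, which is enough to produce subtitles. The sketch below renders such a list in SubRip (SRT) format, assuming the boundaries are expressed in seconds; `srt_timestamp` and `to_srt` are illustrative helpers, not part of gigaam.

```python
from typing import Iterable


def srt_timestamp(seconds: float) -> str:
    """Format seconds as HH:MM:SS,mmm, the timestamp style used by SRT."""
    total_ms = int(round(seconds * 1000))
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, millis = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def to_srt(utterances: Iterable[dict]) -> str:
    """Render utterances with "transcription" and "boundaries" keys as SRT."""
    blocks = []
    for index, utterance in enumerate(utterances, start=1):
        start, end = utterance["boundaries"]
        blocks.append(
            f"{index}\n"
            f"{srt_timestamp(start)} --> {srt_timestamp(end)}\n"
            f"{utterance['transcription']}\n"
        )
    # Blank lines separate consecutive subtitle entries.
    return "\n".join(blocks)
```

Passing `recognition_result` to `to_srt` and writing the returned string to a `.srt` file would give a subtitle track aligned with the detected speech segments.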
    

ONNX inference example

  1. Export the model to ONNX using the model.to_onnx method:
    import gigaam

    onnx_dir = "onnx"
    model_type = "rnnt"  # or "ctc"

    model = gigaam.load_model(
        model_type,
        fp16_encoder=False,  # only fp32 tensors
        use_flash=False,  # disable flash attention
    )
    model.to_onnx(dir_path=onnx_dir)
    
  2. Run ONNX inference:
    from gigaam.onnx_utils import load_onnx_sessions, transcribe_sample
    
    sessions = load_onnx_sessions(onnx_dir, model_type)
    transcribe_sample("example.wav", model_type, sessions)
    

All of these examples can also be found in the inference_example.ipynb notebook.


GigaAM-Emo: Emotion Recognition

GigaAM-Emo is a model for emotion recognition, fine-tuned on the Dusha dataset. On both the Crowd and Podcast subsets it matches or outperforms the baselines listed below on every reported metric, with the largest gains in weighted accuracy.

Performance Metrics

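| Model                                         | Crowd: Unweighted Accuracy | Crowd: Weighted Accuracy | Crowd: Macro F1-score | Podcast: Unweighted Accuracy | Podcast: Weighted Accuracy | Podcast: Macro F1-score |
|-----------------------------------------------|----------------------------|--------------------------|-----------------------|------------------------------|----------------------------|-------------------------|
| DUSHA baseline (MobileNetV2 + Self-Attention) | 0.83                       | 0.76                     | 0.77                  | 0.89                         | 0.53                       | 0.54                    |
| АБК (TIM-Net)                                 | 0.84                       | 0.77                     | 0.78                  | 0.90                         | 0.50                       | 0.55                    |
| GigaAM-Emo                                    | 0.90                       | 0.87                     | 0.84                  | 0.90                         | 0.76                       | 0.67                    |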
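
Of the three metrics, macro F1 has the least ambiguous definition: the per-class F1 scores are averaged with equal weight, regardless of how many examples each class has. The sketch below computes it from two label sequences; it assumes the classes are those appearing in either sequence, which may differ from the label handling in the authors' evaluation.

```python
from typing import Sequence


def macro_f1(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Unweighted mean of per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class is
        # neither present nor predicted correctly.
        denominator = 2 * tp + fp + fn
        scores.append(2 * tp / denominator if denominator else 0.0)
    return sum(scores) / len(scores)
```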

Emotion Recognition Example (GigaAM-Emo)

from typing import Dict

import gigaam

model = gigaam.load_model("emo")
emotion2prob: Dict[str, float] = model.get_probs("example.wav")

print(", ".join([f"{emotion}: {prob:.3f}" for emotion, prob in emotion2prob.items()]))
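
When only the most likely emotion is needed, the dictionary can be reduced to a single label. The sketch below does this with plain Python; the emotion names and probabilities in `example` are illustrative placeholders, not real model output, and `top_emotion` is not part of gigaam.

```python
from typing import Dict, Tuple


def top_emotion(emotion2prob: Dict[str, float]) -> Tuple[str, float]:
    """Return the (label, probability) pair with the highest probability."""
    label = max(emotion2prob, key=emotion2prob.get)
    return label, emotion2prob[label]


# Illustrative values only; not real model output.
example = {"angry": 0.05, "sad": 0.10, "neutral": 0.70, "positive": 0.15}
label, prob = top_emotion(example)
print(f"{label} ({prob:.2f})")
```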

License

GigaAM's code and model weights are released under the MIT License.

