Audio transcription and alignment library using Google Gemini API for TTS labeling

Audio Transcription

A library for audio transcription and alignment using Google Gemini API, designed for TTS labeling.

Introduction

Audio Transcription is a Python library that uses the Google Gemini API to convert speech to text. It is designed for TTS (text-to-speech) labeling: alongside each transcript it returns a detailed description of the voice, and it aligns the text with the audio. This simplifies transcribing audio and matching it with the corresponding text for speech analysis and TTS development. The library also handles long audio files by splitting them at silence points into manageable segments and processing each segment separately.

Features

High-Accuracy Audio Transcription: Convert speech to text using Google Gemini API
Long Audio Processing: Efficiently handle lengthy audio files that conventional solutions struggle with
Smart Audio Segmentation: Automatically split audio at appropriate silence points
Voice Description: Provide detailed voice characterization compatible with modern LLM-based TTS systems
Text-Audio Alignment: Synchronize text with audio content
Customizable: Support for custom prompts and device selection for alignment
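The idea behind silence-based segmentation can be illustrated with a minimal sketch. This is not the library's implementation: the function name, the amplitude threshold, and the minimum silence length are assumptions chosen for illustration, and the audio is represented as a plain list of samples in the range -1.0 to 1.0.

```python
from typing import List


def silence_split_points(
    samples: List[float],
    sample_rate: int,
    threshold: float = 0.01,      # assumed amplitude below which audio counts as silent
    min_silence_ms: int = 300,    # assumed minimum length of a usable pause
) -> List[int]:
    """Return sample indices at which the audio could be cut.

    A cut is placed in the middle of every run of quiet samples that lasts
    at least `min_silence_ms`.
    """
    min_len = sample_rate * min_silence_ms // 1000
    points: List[int] = []
    run_start = None  # index where the current quiet run began, if any

    for i, value in enumerate(samples):
        if abs(value) < threshold:
            if run_start is None:
                run_start = i
        else:
            if run_start is not None and i - run_start >= min_len:
                points.append((run_start + i) // 2)
            run_start = None

    # Handle a quiet run that extends to the end of the audio.
    if run_start is not None and len(samples) - run_start >= min_len:
        points.append((run_start + len(samples)) // 2)
    return points
```

Cutting in the middle of a pause, rather than at its start or end, leaves some silence on both sides of the cut so that words are not clipped.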

Installation and Usage

Installation

pip install gemini-audio-transcription

Prerequisites

This library requires a Google Gemini API key. You can obtain one from Google AI Studio.
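A small helper can make the key lookup explicit and fail early when no key is available. The sketch below is an assumption about how an application might wrap this, not part of the library; `resolve_api_key` is a hypothetical name, and only the `GOOGLE_API_KEY` variable comes from this README.

```python
import os
from typing import Optional


def resolve_api_key(explicit_key: Optional[str] = None) -> str:
    """Return the explicit key if given, otherwise the GOOGLE_API_KEY variable.

    Hypothetical helper: raises RuntimeError when neither source provides a key.
    """
    key = explicit_key or os.environ.get("GOOGLE_API_KEY")
    if not key:
        raise RuntimeError(
            "No Gemini API key found: pass one explicitly or set GOOGLE_API_KEY."
        )
    return key
```

The returned value can then be passed as `api_key` when constructing the classes shown in the usage sections.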

Basic Usage

Simple Transcription

from gemini_audio_transcription import AudioTranscriber

# Initialize with API key
transcriber = AudioTranscriber(api_key="your-api-key")
# Or use environment variable: export GOOGLE_API_KEY="your-api-key"

# Transcribe an audio file
results = transcriber.transcribe("path/to/audio.wav")
print(results)

Complete Audio Processing

from gemini_audio_transcription import AudioProcessor

# Initialize with options
processor = AudioProcessor(
    api_key="your-api-key",  # Optional if GOOGLE_API_KEY is set
    transcription_model="gemini-2.0-flash",
    whisper_model="large-v3",
    device="cpu"  # Use "cuda" for GPU or "mps" for Apple Silicon
)

# Process audio: transcribe and align
results = processor.process_audio(
    "path/to/audio.wav",
    save_folder="output_dir",  # Optional, to save audio chunks
    leading_silence_ms=100,    # Optional, silence at beginning of chunks
    trailing_silence_ms=100,   # Optional, silence at end of chunks
    language="en"              # Language code for alignment
)

# Save results to JSON
processor.save_transcription_json(results, "output_dir/results.json")
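
The `leading_silence_ms` and `trailing_silence_ms` options are given in milliseconds. As a rough illustration of how such values translate into audio samples, the sketch below pads a chunk with silence on both sides. Whether the library pads, trims, or does something else with these values is not stated here, so treat this as one plausible reading; `pad_with_silence` is a hypothetical helper working on a plain list of samples.

```python
from typing import List


def pad_with_silence(
    samples: List[float],
    sample_rate: int,
    leading_ms: int = 100,
    trailing_ms: int = 100,
) -> List[float]:
    """Return a copy of `samples` with zero-valued samples added at both ends."""
    # Milliseconds to sample count: sample_rate samples per 1000 ms.
    lead = [0.0] * (sample_rate * leading_ms // 1000)
    trail = [0.0] * (sample_rate * trailing_ms // 1000)
    return lead + list(samples) + trail
```

At a sample rate of 16000 Hz, 100 ms corresponds to 1600 samples on each side.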

Text-Audio Alignment Only

If you already have the transcript and just want to align it with audio:

from gemini_audio_transcription import TextAligner

aligner = TextAligner(
    model_name="large-v3",
    device="cuda"  # Use GPU for faster processing
)

# Align existing text with audio
text = "This is the transcript text that needs to be aligned with the audio."
chunks = aligner.align_text(
    text=text,
    audio_file="path/to/audio.wav",
    save_folder="aligned_chunks",
    language="en"
)
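
Rather than hard-coding the `device` value, it can be chosen at runtime. The sketch below assumes that alignment runs on PyTorch (stable-ts is built on Whisper, which uses it) and falls back to `"cpu"` when PyTorch is not installed; `pick_device` is a hypothetical helper, not part of the library.

```python
def pick_device() -> str:
    """Return "cuda", "mps", or "cpu" depending on what PyTorch reports."""
    try:
        import torch  # assumed to be installed alongside the alignment backend
    except ImportError:
        return "cpu"

    if torch.cuda.is_available():
        return "cuda"
    # Older PyTorch builds have no MPS backend, so look it up defensively.
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"
```

The result can be passed directly, for example `TextAligner(model_name="large-v3", device=pick_device())`.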

Default Prompt and Explanation

The default prompt used by this library is designed specifically for TTS evaluation and transcription. It guides the Gemini model to:

Listen to audio and compare it to input text for accuracy
Consider text preprocessing (number-to-word conversion, special character removal)
Return results in a structured JSON format with transcript text and voice description
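
To make the preprocessing assumption concrete, here is a minimal sketch of number-to-word conversion and special-character removal. It is an illustration only, not the preprocessing used by the library or by any particular TTS system: it handles English, spells out integers from 0 to 999, reads longer digit runs digit by digit, and drops every character that is not an ASCII letter, apostrophe, or space.

```python
import re

ONES = [
    "zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
    "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
    "sixteen", "seventeen", "eighteen", "nineteen",
]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def number_to_words(n: int) -> str:
    """Spell out an integer between 0 and 999 in English."""
    if not 0 <= n <= 999:
        raise ValueError("only 0..999 is supported in this sketch")
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] if ones == 0 else f"{TENS[tens]} {ONES[ones]}"
    hundreds, rest = divmod(n, 100)
    words = f"{ONES[hundreds]} hundred"
    return words if rest == 0 else f"{words} {number_to_words(rest)}"


def _spell(match: "re.Match[str]") -> str:
    digits = match.group()
    value = int(digits)
    if value <= 999:
        return number_to_words(value)
    # Longer numbers are read digit by digit to keep the sketch short.
    return " ".join(ONES[int(d)] for d in digits)


def normalize_for_tts(text: str) -> str:
    """Convert digit runs to words, then strip symbols and extra whitespace."""
    text = re.sub(r"\d+", _spell, text)
    text = re.sub(r"[^A-Za-z' ]+", " ", text)
    return " ".join(text.split())
```

For example, `normalize_for_tts("Room 42 costs $105!")` yields `"Room forty two costs one hundred five"`.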

Here's the default prompt structure:

**Please evaluate this TTS-generated audio file based on the provided text input, following these guidelines:**

1. Carefully listen to the audio and compare it to the input text to ensure the speech matches the text exactly, without missing, mispronounced, or added words.

2. The input text is preprocessed such that:

   * All numbers are converted to words.
   * The text does not contain any special characters or symbols.

3. Return the output in the following JSON structure:
[
    {
        "text": "The original text input used to generate the speech. Each text segment is a single, complete sentence",
        "description": "Provide a detailed and objective description of the synthesized voice characteristics, including speaker gender (if perceivable), tone, emotion, pronunciation clarity, prosody (rhythm, intonation), and overall naturalness. For example: A male voice with a neutral tone, moderately expressive speaking style, and clearly articulated words. Slight robotic timbre but minimal distortion. No background noise detected. Each description must be specific, varied, and not repeated across entries."
    }
]

4. Do not include any commentary or content outside the specified JSON format.
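
Because the prompt asks for a fixed JSON structure, the model's reply can be checked before it is used. The sketch below is a hypothetical validator, not part of the library: it parses the reply, tolerates a Markdown code fence around the JSON (an assumption about how models sometimes respond), and verifies that every entry has non-empty `text` and `description` strings.

```python
import json
from typing import Dict, List

FENCE = "`" * 3  # Markdown code-fence marker


def parse_segments(raw: str) -> List[Dict[str, str]]:
    """Parse and validate a reply in the JSON format requested by the prompt."""
    lines = raw.strip().splitlines()
    # Drop a surrounding code fence if the model added one.
    if lines and lines[0].startswith(FENCE):
        lines = lines[1:]
    if lines and lines[-1].strip() == FENCE:
        lines = lines[:-1]

    data = json.loads("\n".join(lines))
    if not isinstance(data, list):
        raise ValueError("expected a JSON array of segments")

    for index, item in enumerate(data):
        if not isinstance(item, dict):
            raise ValueError(f"segment {index} is not an object")
        for field in ("text", "description"):
            value = item.get(field)
            if not isinstance(value, str) or not value.strip():
                raise ValueError(f"segment {index} has no usable '{field}'")
    return data
```

Rejecting malformed replies early keeps incomplete entries out of a TTS labeling dataset.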

The Voice Description feature is particularly useful for modern TTS systems built on Large Language Models (LLMs), such as Parler-TTS. These systems can use detailed descriptions of voice characteristics to generate more natural and expressive speech. The metadata produced by Audio Transcription helps with:

Training and fine-tuning voice models with specific characteristics
Generating speech with desired emotional qualities and prosody
Creating consistent voice personalities across different text inputs

You can customize this prompt by providing your own when initializing the transcriber:

from gemini_audio_transcription import AudioTranscriber

custom_prompt = """
Your custom prompt here...
"""

transcriber = AudioTranscriber(
    api_key="your-api-key",
    custom_prompt=custom_prompt
)

Note

This library requires a valid Google API key, because transcription is performed by the Google Gemini API. Provide the key either when initializing the transcriber or through the GOOGLE_API_KEY environment variable.

Acknowledgements

Google Gemini API for providing the advanced transcription capabilities
stable-ts for the robust text-audio alignment functionality

Release history

0.1.1 (this version): May 8, 2025
0.1.0: May 8, 2025

Download files

Source Distribution

gemini_audio_transcription-0.1.1.tar.gz (16.9 kB)

Uploaded May 8, 2025 Source

Built Distribution

gemini_audio_transcription-0.1.1-py3-none-any.whl (16.4 kB)

Uploaded May 8, 2025 Python 3

File details

Details for the file gemini_audio_transcription-0.1.1.tar.gz.

File metadata

Download URL: gemini_audio_transcription-0.1.1.tar.gz
Upload date: May 8, 2025
Size: 16.9 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.10.16

File hashes

Hashes for gemini_audio_transcription-0.1.1.tar.gz
Algorithm	Hash digest
SHA256	`d5662f96df1fef0b0d7850915c4d850988dfe91a691e678597243d1acaa61689`
MD5	`ebb364a4d72805a5179dea43b1f8a3c3`
BLAKE2b-256	`5b5aeb945b8fb02ab7890c5b38940d686f990f0263816dc236ad5afa938e105c`

File details

Details for the file gemini_audio_transcription-0.1.1-py3-none-any.whl.

File metadata

Download URL: gemini_audio_transcription-0.1.1-py3-none-any.whl
Upload date: May 8, 2025
Size: 16.4 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.10.16

File hashes

Hashes for gemini_audio_transcription-0.1.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`d5b7280c4e1b32c6e1d886e868f8fac517f7750fd6d5398b3d4a4b916a2a5c47`
MD5	`f403a7eb7c09728438b4f373967ca54e`
BLAKE2b-256	`1f66b340634e2b67e9bdaf1f7cc5396541f571e3cb7a7c328b88d3f5f7aa4cf7`
