SONATA: SOund and Narrative Advanced Transcription Assistant
SONATA is an advanced Automatic Speech Recognition (ASR) system that captures the symphony of human expression by recognizing and transcribing both verbal content and emotive sounds.
Features
- High-accuracy speech-to-text transcription
- Recognition of emotive sounds and non-verbal cues
- Support for tags like <laugh>, <sigh>, <yawn>, <surprise>, <inhale>, <groan>, <cough>, <sneeze>, <sniffle>
- Open-source and extensible architecture
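To make the tag format concrete, here is a minimal sketch of how a downstream script might separate emotive tags from the spoken words in a transcript string. It assumes the tags appear inline in the plain text exactly as listed above; the helper name split_tags is hypothetical and not part of SONATA.

```python
import re

# Tag names taken from the feature list above.
EMOTIVE_TAGS = ("laugh", "sigh", "yawn", "surprise", "inhale",
                "groan", "cough", "sneeze", "sniffle")
TAG_PATTERN = re.compile(r"<(" + "|".join(EMOTIVE_TAGS) + r")>")


def split_tags(text):
    """Return (text_without_tags, list_of_tag_names). Hypothetical helper."""
    tags = TAG_PATTERN.findall(text)
    # Replace each tag with a space, then collapse the leftover whitespace.
    cleaned = re.sub(r"\s+", " ", TAG_PATTERN.sub(" ", text)).strip()
    return cleaned, tags


cleaned, tags = split_tags("That was great <laugh> I mean it <sigh>")
print(cleaned)
print(tags)
```

Keeping the tag list in one tuple means new tags can be supported by editing a single line.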
Installation
Install the package from PyPI:
pip install sonata-asr
Or install from source:
git clone https://github.com/hwk06023/SONATA.git
cd SONATA
pip install -e .
Usage Examples
Basic Transcription
from sonata.core.transcriber import IntegratedTranscriber
# Initialize the transcriber
transcriber = IntegratedTranscriber(asr_model="large-v3", device="cpu")
# Transcribe an audio file
result = transcriber.process_audio("path/to/audio.wav", language="en")
# Save the result to a file
transcriber.save_result(result, "output.json")
# Get the plain text transcript
plain_text = result["integrated_transcript"]["plain_text"]
print(plain_text)
Extracting Timestamps
from sonata.core.transcriber import IntegratedTranscriber
# Initialize the transcriber
transcriber = IntegratedTranscriber()
# Process audio with timestamps
result = transcriber.process_audio("path/to/audio.wav")
# Extract words with their timestamps
for item in result["integrated_transcript"]["rich_text"]:
    if item["type"] == "word":
        word = item["content"]
        start_time = item["start"]
        end_time = item["end"]
        print(f"{word}: {start_time:.2f}s - {end_time:.2f}s")
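Word-level timestamps are often regrouped into longer lines, for example for subtitles. The sketch below shows one possible way to do that, breaking a line wherever the pause between two words exceeds a threshold. It relies only on the type, content, start, and end keys used above; the group_words helper, the 0.8 second default, and the shape of the non-word item in the sample data are assumptions made for illustration.

```python
def group_words(rich_text, max_gap=0.8):
    """Group word items into (start, end, text) lines, split on long pauses."""
    lines = []
    current = []
    for item in rich_text:
        if item["type"] != "word":
            continue  # skip non-word items; their exact shape is not assumed
        # Start a new line when the silence since the previous word is too long.
        if current and item["start"] - current[-1]["end"] > max_gap:
            lines.append(current)
            current = []
        current.append(item)
    if current:
        lines.append(current)
    return [
        (words[0]["start"], words[-1]["end"],
         " ".join(w["content"] for w in words))
        for words in lines
    ]


# Hand-written sample data standing in for a real transcription result.
sample = [
    {"type": "word", "content": "hello", "start": 0.00, "end": 0.40},
    {"type": "word", "content": "there", "start": 0.45, "end": 0.80},
    {"type": "emotive", "content": "laugh", "start": 0.90, "end": 1.50},
    {"type": "word", "content": "welcome", "start": 2.10, "end": 2.60},
]

for start, end, text in group_words(sample):
    print(f"[{start:.2f}s - {end:.2f}s] {text}")
```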
Processing with GPU Acceleration
from sonata.core.transcriber import IntegratedTranscriber
# Initialize with CUDA device
transcriber = IntegratedTranscriber(
    asr_model="large-v3",
    device="cuda",
    compute_type="float16",  # Use float16 for faster GPU processing
)
# Process audio
result = transcriber.process_audio("path/to/audio.wav")
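When the same script has to run on machines with and without a GPU, the device can be chosen at runtime. The following is a sketch rather than part of SONATA: pick_device is a hypothetical helper, it probes for CUDA through PyTorch if that package is importable, and the "float32" value used for the CPU case is an assumption about what compute_type accepts.

```python
def pick_device():
    """Return (device, compute_type), falling back to CPU without CUDA."""
    try:
        import torch  # only used to probe for a usable GPU
    except ImportError:
        return "cpu", "float32"  # assumed CPU compute type
    if torch.cuda.is_available():
        return "cuda", "float16"
    return "cpu", "float32"  # assumed CPU compute type


device, compute_type = pick_device()
print(device, compute_type)

# The pair could then be passed to the constructor shown above:
# transcriber = IntegratedTranscriber(
#     asr_model="large-v3", device=device, compute_type=compute_type
# )
```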
Command Line Interface
SONATA provides a command-line interface for quick transcription:
# Basic usage
sonata-asr path/to/audio.wav
# Save output to specific file
sonata-asr path/to/audio.wav --output result.json
# Use GPU acceleration
sonata-asr path/to/audio.wav --device cuda
# Process audio with preprocessing (format conversion and silence trimming)
sonata-asr path/to/audio.wav --preprocess
# Split and process long audio files
sonata-asr path/to/audio.wav --split --split-length 30 --split-overlap 5
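To illustrate what splitting with an overlap means, here is a small sketch that computes overlapping chunk boundaries for a file of a given duration. It is not SONATA's implementation: the split_windows helper is hypothetical, and it assumes the split length and overlap are expressed in seconds.

```python
def split_windows(duration, length=30.0, overlap=5.0):
    """Return (start, end) chunk boundaries covering [0, duration]."""
    if length <= 0 or not 0 <= overlap < length:
        raise ValueError("need length > 0 and 0 <= overlap < length")
    duration = float(duration)
    step = length - overlap  # each chunk starts this far after the previous one
    windows = []
    start = 0.0
    while start < duration:
        end = min(start + length, duration)
        windows.append((start, end))
        if end >= duration:
            break  # the last chunk already reaches the end of the audio
        start += step
    return windows


for start, end in split_windows(70, length=30, overlap=5):
    print(f"{start:.1f}s - {end:.1f}s")
```

Each chunk repeats the last few seconds of the previous one, so a word cut at a chunk boundary appears whole in at least one of the two chunks.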
Inference Tools
The test directory contains additional inference tools for more advanced usage:
Basic Inference Script
# Process a single file
python test/infer.py path/to/audio.wav
# Specify output file and use GPU
python test/infer.py path/to/audio.wav -o output.json -d cuda
Advanced Processing
The advanced inference script supports batch processing and additional features:
# Process a directory of audio files in parallel
python test/advanced_infer.py path/to/audio_directory/ --batch --max-workers 4
# Preprocess audio before transcription
python test/advanced_infer.py path/to/audio.wav --preprocess
The preprocessing option performs two important operations:
- Converts audio to WAV format for maximum compatibility
- Trims silence from the beginning and end, improving accuracy and reducing processing time
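The silence trimming step can be pictured with a simple amplitude threshold. The sketch below is an illustration only, not the method SONATA uses: trim_silence is a hypothetical helper, samples are assumed to be floats in the range -1.0 to 1.0, and the 0.01 threshold is arbitrary.

```python
def trim_silence(samples, threshold=0.01):
    """Drop leading and trailing samples quieter than the threshold."""
    start = 0
    end = len(samples)
    # Advance past the quiet samples at the beginning.
    while start < end and abs(samples[start]) < threshold:
        start += 1
    # Walk back past the quiet samples at the end.
    while end > start and abs(samples[end - 1]) < threshold:
        end -= 1
    return samples[start:end]


audio = [0.0, 0.001, 0.2, -0.3, 0.0, 0.1, 0.002, 0.0]
print(trim_silence(audio))
```

Quiet samples in the middle of the signal are kept, which matches the description above: only the beginning and end are trimmed.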
See the inference tools documentation for more details.
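The parallel batch mode mentioned above can be approximated in a few lines with the standard library. This is a sketch under stated assumptions: process_batch and fake_transcribe are hypothetical names, a thread pool is used here although the actual script may use a different mechanism, and fake_transcribe stands in for a real call such as transcriber.process_audio.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def process_batch(paths, transcribe, max_workers=4):
    """Apply transcribe to every path and return {path: result}."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the same order as the input paths.
        results = list(pool.map(transcribe, paths))
    return dict(zip(paths, results))


def fake_transcribe(path):
    """Placeholder for a real transcription call (hypothetical)."""
    return {"file": Path(path).name}


print(process_batch(["a.wav", "b.wav"], fake_transcribe, max_workers=2))
```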
Future Roadmap
SONATA is under active development. Here are some planned features and improvements:
Enhanced Multilingual Support
- Expand language coverage beyond current supported languages
- Improve transcription quality for non-English languages
- Add language auto-detection capabilities
ASR Model Diversity
- Support for additional ASR models beyond WhisperX
- Integration with local models for offline/private use
- Finetuned models for specific domains (medical, legal, etc.)
Advanced Emotive Detection
- Expand the range of detectable emotive events
- Improve accuracy of emotive event detection
- Add custom emotive event training capabilities
Performance Improvements
- Optimize processing for large audio files
- Enhance parallel processing capabilities
- Reduce memory footprint for resource-constrained environments
User Interface
- Add web-based UI for transcription monitoring
- Develop visualization tools for speech analytics
- Create interactive transcript editor
We welcome contributions in any of these areas!
Contributing
To propose a change, please submit a Pull Request.
License
This project is licensed under the GNU General Public License v3.0; see the LICENSE file for details. Under this license, derivative works that are distributed must also be released under the GPL v3.0 with their source code available.
Acknowledgements
This project leverages the following key open source components:
- WhisperX - Fast speech recognition with word-level timestamps
- Laughter-Detection - Automatic detection of laughter in audio files (MIT License)
We are grateful to the developers and contributors of these libraries for their valuable work.