A toolkit for audio transcription, speaker diarization, and text processing
Audio Transcription Toolkit
A toolkit for audio processing, including transcription, speaker diarization, and stopword removal. This project is designed to deliver a seamless pipeline for processing audio data, identifying speakers, and generating clean text transcriptions using Whisper, PyAnnote, and Natasha.
Key Features
- Transcription: Converts audio to text using Whisper and FasterWhisper (for faster processing).
- Speaker Diarization: Separates and identifies individual speakers using PyAnnote.
- Text Post-Processing:
  - Remove stopwords and swear words using Natasha.
  - Customize stopword behavior by adding your own rules.
Installation
Make sure you have Python 3.8+ installed on your machine. To install this package, run:
pip install audio_transcribing
Other requirements
If you want to use a GPU for better performance, make sure torch is installed with GPU support. The command below installs a build for CUDA 11.7; choose the index URL that matches your CUDA version:
pip install torch --index-url https://download.pytorch.org/whl/cu117
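To confirm that the installed torch build can actually see a GPU, a short check such as the following may help. This is a minimal sketch: the helper name `describe_torch_device` is made up for illustration and is not part of this package; it relies only on `torch.cuda.is_available()`.

```python
def describe_torch_device() -> str:
    """Report which device torch would use, or that torch is missing."""
    try:
        import torch  # imported lazily so the check also works without torch
    except ImportError:
        return "torch-not-installed"
    # True only if this torch build has CUDA support and a GPU is visible.
    return "cuda" if torch.cuda.is_available() else "cpu"


if __name__ == "__main__":
    print(describe_torch_device())
```

If this prints `cpu` on a machine with a GPU, the installed wheel is probably a CPU-only build and torch should be reinstalled from the matching CUDA index.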
Quick Start
Transcription of Audio with Speaker Diarization and Cleaned Output
from audio_transcriber.audio_transcribing import Transcriber, NatashaStopwordsRemover

# Initialize the transcriber
transcriber = Transcriber(
    token="your-huggingface-token",  # Token for PyAnnote diarization
    whisper_model="medium",          # Size of Whisper model
    use_faster_whisper=True,         # Use Faster Whisper if performance is a priority
)

# Load the audio file as raw bytes
with open("your_file.mp3", "rb") as f:
    audio_content = f.read()

# Transcribe audio
result = transcriber.transcribe(audio_content, language="ru")
print("Transcription with speaker diarization:")
print(result)

# Post-process text (remove stopwords and, optionally, swear words)
stopwords_remover = NatashaStopwordsRemover()
cleaned_result = stopwords_remover.remove_stopwords(result, remove_swear_words=True)
print("\nCleaned transcription:")
print(cleaned_result)
Add Stopwords or Swear Words to Natasha Processor
Natasha works only with Russian-language text.
from audio_transcriber.audio_transcribing import NatashaStopwordsRemover
# Initialize Natasha processor
stopwords_remover = NatashaStopwordsRemover()
# Add new custom stopwords
stopwords_remover.add_words_to_stopwords(["эм", "эй"])
# Add additional swear words
stopwords_remover.add_words_to_swear_words(["тварь"])
Modules Overview
1. Transcriber
The core of the project, managing transcription, speaker diarization, and post-processing. Key methods:
transcribe(content: bytes, language=None, max_speakers=None): Transcribes audio content and includes speaker annotations.
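Since `transcribe` takes raw bytes rather than a file path, a small wrapper can be convenient. The sketch below assumes only the signature listed above; the helper name `transcribe_file` is hypothetical and not part of the package, and it accepts any object that exposes a compatible `transcribe` method.

```python
from pathlib import Path
from typing import Any, Optional


def transcribe_file(
    transcriber: Any,
    path: str,
    language: Optional[str] = None,
    max_speakers: Optional[int] = None,
) -> Any:
    """Read an audio file as bytes and pass it to transcriber.transcribe()."""
    content = Path(path).read_bytes()
    # Forward the optional arguments unchanged; None leaves the choice
    # of language and speaker count to the transcriber.
    return transcriber.transcribe(
        content, language=language, max_speakers=max_speakers
    )
```

With a `Transcriber` instance created as in the Quick Start, a call might look like `transcribe_file(transcriber, "your_file.mp3", language="ru", max_speakers=2)`.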
2. NatashaStopwordsRemover
Text post-processing with Natasha NLP:
remove_stopwords(text: str, remove_swear_words=True, go_few_times=False): Removes stopwords and optionally swear words from transcribed text.
remove_words(text: str, words: list[str]): Removes predefined words from text.
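To illustrate the general idea of removing a fixed list of words from text, here is a minimal sketch in plain Python. It is not the package's implementation (which relies on Natasha); the function name `strip_words` and the regex-based approach are assumptions made for illustration only.

```python
import re
from typing import List


def strip_words(text: str, words: List[str]) -> str:
    """Remove whole-word, case-insensitive occurrences of `words` from `text`."""
    if not words:
        return text
    # \b matches at word boundaries, so "эм" does not match inside "эмаль".
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(w) for w in words) + r")\b",
        flags=re.IGNORECASE,
    )
    without_words = pattern.sub("", text)
    # Collapse the extra whitespace left behind by the removed words.
    return re.sub(r"\s+", " ", without_words).strip()


if __name__ == "__main__":
    print(strip_words("эм привет эй мир", ["эм", "эй"]))  # привет мир
```

This sketch ignores punctuation and morphology; a removed word followed by a comma leaves the comma behind, which is one reason a dedicated NLP library is used in the package itself.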
Limitations
- Audio Format: Tested on WAV and MP3 formats.
- Speaker Diarization: PyAnnote separates speakers but does not assign "real names" like "John" or "Mary".
- Stopword Customization: Additional stopwords and swear words must be in Russian.
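Because only WAV and MP3 have been tested, it may be worth checking the file extension before reading the audio. The following is a minimal sketch; the helper name `read_supported_audio` is hypothetical and not part of the package, and it looks only at the extension, not at the actual file contents.

```python
from pathlib import Path

# Formats the project description reports as tested.
TESTED_EXTENSIONS = {".wav", ".mp3"}


def read_supported_audio(path: str) -> bytes:
    """Return the file's bytes, or raise ValueError for an untested format."""
    file_path = Path(path)
    extension = file_path.suffix.lower()
    if extension not in TESTED_EXTENSIONS:
        raise ValueError(
            f"Untested audio format {extension!r}; expected one of "
            f"{sorted(TESTED_EXTENSIONS)}"
        )
    return file_path.read_bytes()
```

The returned bytes can be passed directly as the `content` argument of `transcribe`.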
License
This project is licensed under the MIT License.
Source Distribution
File details
Details for the file audio_transcribing-0.2.6.tar.gz.
File metadata
- Download URL: audio_transcribing-0.2.6.tar.gz
- Size: 15.6 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.1
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 2e56e1bfbd9d9fc4a81d62874351158a34c90b815d564c66a5e561ef9dbdb326 |
| MD5 | 1c0dcea6f9e6e819adf64083c6fa4c30 |
| BLAKE2b-256 | e4b53d19173d6c7d1d12f47731e2fab230b5a06faa66c599d3f6a3fd023f158a |