A toolkit for audio transcription, speaker diarization, and text processing

Audio Transcription Toolkit

A toolkit for audio processing that covers transcription, speaker diarization, and stopword removal. It provides a single pipeline that transcribes audio with Whisper, identifies speakers with PyAnnote, and cleans the resulting text with Natasha.

Key Features

  • Transcription: Converts audio to text using Whisper or, for faster processing, FasterWhisper.
  • Speaker Diarization: Separates and identifies individual speakers using PyAnnote.
  • Text Post-Processing:
    • Remove stopwords and swear words using Natasha.
    • Customize stopword behaviors by adding your own rules.

Installation

Make sure you have Python 3.8+ installed on your machine. To install this package, run:

pip install audio_transcribing

Other requirements

If you use a GPU for better performance, make sure torch is installed with GPU support. For example, for a CUDA 11.7 build:

pip install torch --index-url https://download.pytorch.org/whl/cu117
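
After installing, it can help to confirm that the torch build actually sees a GPU. The sketch below relies only on `torch.cuda.is_available()`; the helper name `describe_torch_device` is made up for illustration and is not part of this package.

```python
import importlib.util


def describe_torch_device() -> str:
    """Return "cuda", "cpu", or "torch-not-installed" (hypothetical helper)."""
    # Check for torch first so a missing install does not raise ImportError.
    if importlib.util.find_spec("torch") is None:
        return "torch-not-installed"
    import torch

    # True only when torch was built with CUDA support and a GPU is visible.
    return "cuda" if torch.cuda.is_available() else "cpu"


if __name__ == "__main__":
    print(describe_torch_device())
```

If this prints "cpu" on a machine with a GPU, the installed wheel is most likely a CPU-only build.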

Quick Start

Transcription of Audio with Speaker Diarization and Cleaned Output

from audio_transcriber.audio_transcribing import Transcriber, NatashaStopwordsRemover

# Initialize the transcriber
transcriber = Transcriber(
  token="your-huggingface-token",  # Token for PyAnnote diarization
  whisper_model="medium",  # Size of Whisper model
  use_faster_whisper=True  # Use Faster Whisper if performance is a priority
)

# Load the audio file
with open("your_file.mp3", "rb") as f:
  audio_content = f.read()

# Transcribe audio
result = transcriber.transcribe(audio_content, language="ru")

print("Transcription with speaker diarization:")
print(result)

# Post-process text (remove stopwords and, optionally, swear words)
cleaned_result = NatashaStopwordsRemover().remove_stopwords(result, remove_swear_words=True)

print("\nCleaned transcription:")
print(cleaned_result)

Add Stopwords or Swear Words to Natasha Processor

Natasha supports only the Russian language.

from audio_transcriber.audio_transcribing import NatashaStopwordsRemover

# Initialize Natasha processor
stopwords_remover = NatashaStopwordsRemover()

# Add new custom stopwords
stopwords_remover.add_words_to_stopwords(["эм", "эй"])

# Add additional swear words
stopwords_remover.add_words_to_swear_words(["тварь"])

Modules Overview

1. Transcriber

The core of the project, managing transcription, speaker diarization, and post-processing. Key methods:

  • transcribe(content: bytes, language=None, max_speakers=None): Transcribes audio content and includes speaker annotations.
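
Because `transcribe` takes raw bytes rather than a path, a thin wrapper can keep call sites short. This is a minimal sketch built on the signature listed above; `transcribe_file` is a hypothetical helper, not something the package provides, and the return type is left open because the source does not specify it.

```python
from pathlib import Path
from typing import Any, Optional


def transcribe_file(
    transcriber: Any,
    path: str,
    language: Optional[str] = None,
    max_speakers: Optional[int] = None,
) -> Any:
    """Read an audio file as bytes and pass it to transcriber.transcribe."""
    # The documented signature expects the file content, not a file name.
    content = Path(path).read_bytes()
    return transcriber.transcribe(
        content, language=language, max_speakers=max_speakers
    )
```

A call might look like `transcribe_file(transcriber, "your_file.mp3", language="ru", max_speakers=2)`, where `max_speakers` is presumably an upper bound on the number of speakers passed to diarization.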

2. NatashaStopwordsRemover

Text post-processing with Natasha NLP:

  • remove_stopwords(text: str, remove_swear_words=True, go_few_times=False): Removes stopwords and optionally swear words from transcribed text.
  • remove_words(text: str, words: list[str]): Removes predefined words from text.
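
To make the idea behind `remove_words` concrete, here is a stand-alone sketch that drops whole-word matches with a regular expression. It is a stand-in for illustration only, not the package's implementation, which is built on Natasha and may treat word forms differently. The name `remove_words_sketch` is hypothetical.

```python
import re
from typing import List


def remove_words_sketch(text: str, words: List[str]) -> str:
    """Drop whole-word, case-insensitive matches of `words` from `text`."""
    if not words:
        return text
    # \b is Unicode-aware for str patterns, so Cyrillic words match correctly.
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(w) for w in words) + r")\b",
        flags=re.IGNORECASE,
    )
    without_words = pattern.sub("", text)
    # Collapse the runs of whitespace left behind by removed words.
    return re.sub(r"\s{2,}", " ", without_words).strip()


print(remove_words_sketch("эм привет эй мир", ["эм", "эй"]))  # привет мир
```

Punctuation next to a removed word is left in place, and inflected forms are not matched; handling those is where a morphology-aware library such as Natasha becomes useful.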

Limitations

  • Audio Format: Tested on WAV and MP3 formats.
  • Speaker Diarization: PyAnnote separates speakers but does not assign "real names" like "John" or "Mary".
  • Stopword Customization: Additional stopwords and swear words must be in Russian.
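
Since diarization yields only anonymous speaker labels, real names have to be mapped in afterwards. The sketch below assumes the transcription is a string containing labels of the form `SPEAKER_00`, which is the style PyAnnote pipelines typically emit; whether this toolkit keeps that format in its output is an assumption, and `rename_speakers` is a hypothetical helper.

```python
import re
from typing import Dict


def rename_speakers(transcript: str, names: Dict[str, str]) -> str:
    """Replace diarization labels with names; unknown labels stay as-is."""
    # ASSUMPTION: labels look like SPEAKER_00, SPEAKER_01, ... in the text.
    return re.sub(
        r"SPEAKER_\d+",
        lambda match: names.get(match.group(0), match.group(0)),
        transcript,
    )


sample = "SPEAKER_00: hi\nSPEAKER_01: hello"
print(rename_speakers(sample, {"SPEAKER_00": "John"}))
```

Labels missing from the mapping are left untouched, so a partial mapping is safe to apply.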

License

This project is licensed under the MIT License.

