A Python package providing a complete speech machine learning pipeline that automatically produces transcriptions with speaker labels from audio inputs

Project description

SpeechMLPipeline

SpeechMLPipeline is a Python package that runs the complete speech machine learning pipeline (audio-to-text transcription, speaker change detection, and speaker identification) via one simple function to produce transcriptions with speaker labels from input audio files. SpeechMLPipeline applies and implements widely used and innovative machine learning models at each step of the pipeline:

OpenAI Whisper is selected for transcription as it is one of the most accurate models available for English transcription. Whisper with timestamp adjustment is used to reduce misalignment between timestamps and transcription text by identifying silent parts and predicting timestamps at the word level.

The PyAnnotate models are among the most popular models for speaker diarization. Speaker change detection results are inferred directly from the speaker diarization output.

The Audio-based Spectral Clustering Model is developed by extracting audio features with Librosa and applying spectral clustering to them. This is one of the most common speaker change detection approaches in academic research.
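
As a hedged illustration of the clustering step (a simplified sketch, not the package's actual code), suppose each fixed-length audio window has already been reduced to a feature vector (e.g. mean MFCCs from Librosa); spectral clustering then assigns a speaker cluster to each window, and a change is flagged wherever the label flips between consecutive windows:

```python
import numpy as np
from sklearn.cluster import SpectralClustering

def speaker_change_points(window_features, n_speakers=2):
    """Cluster per-window feature vectors and report the window indices
    where the cluster label differs from the previous window."""
    labels = SpectralClustering(n_clusters=n_speakers,
                                random_state=0).fit_predict(window_features)
    return [i for i in range(1, len(labels)) if labels[i] != labels[i - 1]]

# Synthetic example: two "speakers" with clearly distinct feature profiles
rng = np.random.default_rng(0)
feats = np.vstack([rng.normal(0.0, 0.1, (10, 20)),
                   rng.normal(3.0, 0.1, (10, 20))])
print(speaker_change_points(feats))  # a single change, at window 10
```

In practice the feature-extraction step (e.g. `librosa.feature.mfcc` averaged per window) and the choice of window length matter far more than the clustering call itself.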

The Text-based Llama2-70b Speaker Change Detection Model is developed by asking Llama2 whether the speaker changes across two consecutive text segments, leveraging the semantic relationship between the two texts.
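
A prompt for this kind of query might look like the following (an illustrative template only; the package's actual prompt wording is not shown here):

```python
def build_speaker_change_prompt(segment_a: str, segment_b: str) -> str:
    """Build a yes/no prompt asking an LLM whether the speaker changes
    between two consecutive transcript segments (hypothetical template)."""
    return (
        "You are given two consecutive segments from a conversation transcript.\n"
        f"Segment 1: {segment_a}\n"
        f"Segment 2: {segment_b}\n"
        "Based on the semantic relationship between the segments, "
        "does the speaker change between them? Answer YES or NO."
    )

print(build_speaker_change_prompt("How are you today?", "I'm doing well, thanks."))
```

The model's YES/NO answer is then parsed into a binary speaker-change label per segment boundary.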

The Rule-based NLP Speaker Change Detection Model detects speaker changes by analyzing text with well-defined, hand-crafted rules.
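
One hypothetical example of such a rule (purely illustrative; the package's actual rule set is not reproduced here) is that a question followed by a first-person answer suggests a speaker change:

```python
def question_answer_rule(prev_text: str, curr_text: str) -> bool:
    """Toy rule: flag a speaker change when the previous segment is a
    question and the current one opens like an answer. Hypothetical
    illustration, not the package's actual rules."""
    is_question = prev_text.strip().endswith("?")
    looks_like_answer = curr_text.strip().lower().startswith(("yes", "no", "i "))
    return is_question and looks_like_answer

print(question_answer_rule("Are you ready?", "Yes, let's start."))  # True
print(question_answer_rule("Hello there.", "Hi."))                  # False
```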

The Ensemble Audio-and-text-based Speaker Change Detection Model is built by ensembling the audio-based and text-based speaker change detection models. Voting methods aggregate the predictions of all the speaker change detection models above except the Rule-based NLP model; the aggregated predictions are then corrected by the Rule-based NLP model.
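
The voting idea can be sketched as follows (a minimal majority-vote illustration, not the package's implementation; it assumes an odd number of models so no ties occur):

```python
from collections import Counter

def majority_vote(predictions):
    """predictions: list of per-model label lists, one 0/1 speaker-change
    label per segment boundary. Returns the majority label per boundary."""
    return [Counter(column).most_common(1)[0][0]
            for column in zip(*predictions)]

# Three models vote on four segment boundaries
votes = [[1, 0, 1, 0],
         [1, 1, 0, 0],
         [1, 0, 0, 1]]
print(majority_vote(votes))  # [1, 0, 0, 0]
```

A rule-based correction pass could then override individual boundaries where a high-precision rule fires.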

The Speechbrain models are used for speaker identification by comparing the similarities between the vector embeddings of each input audio segment and those of labelled speakers' audio segments.
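
The matching step amounts to nearest-neighbour search over embeddings. A hedged sketch with cosine similarity (the embeddings here are toy vectors standing in for SpeechBrain speaker embeddings):

```python
import numpy as np

def identify_speaker(segment_emb, labelled_embs):
    """Return the labelled speaker whose reference embedding is most
    similar (by cosine similarity) to the segment embedding."""
    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    return max(labelled_embs, key=lambda name: cosine(segment_emb, labelled_embs[name]))

# Toy reference embeddings for two labelled speakers
refs = {"alice": np.array([1.0, 0.0]), "bob": np.array([0.0, 1.0])}
print(identify_speaker(np.array([0.9, 0.1]), refs))  # alice
```

In the real pipeline the embeddings would come from a pretrained SpeechBrain speaker-embedding model rather than hand-written vectors.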

Create a New Python Environment to Avoid Package Version Conflicts (If Needed)

python -m venv <envname>
source <envname>/bin/activate

Install speechmlpipeline and its dependencies via GitHub

git lfs install
git clone https://github.com/princeton-ddss/SpeechMLPipeline
cd <.../SpeechMLPipeline>
pip install -r requirements.txt
pip install .

Install speechmlpipeline via PyPI

pip install speechmlpipeline

Download Models in Advance to Run Them without an Internet Connection

Download PyAnnotate Models using Git Large File Storage (LFS)

The PyAnnotate models are already included in the models folder of this repo.

To download the PyAnnotate models, please clone the repo first.

git lfs install
git clone https://github.com/princeton-ddss/SpeechMLPipeline

To use the PyAnnotate models, please replace <local_path> with the local parent folder of the downloaded repo in models/pyannote3.1/Diarization/config.yaml and models/pyannote3.1/Segmentation/config.yaml.

Download Spacy, Llama2, and Speechbrain Models Using the Download Module in the Repo

<hf_access_token> is the access token for Hugging Face. Please create a Hugging Face account if you do not have one.
A new access token can be created by following the Hugging Face instructions.

<models_list> is the list of names of models to download. Typically, models_list should be set to ['whisper', 'speechbrain', 'llama2-70b'].

<download_model_path> is the local path where all downloaded models will be saved.

from speechmlpipeline.DownloadModels.download_models_main_function import download_models_main_function

download_models_main_function(<download_model_path>, <models_list>, <hf_access_token>)

Usage

The complete pipeline can be run using the run_speech_ml_pipeline function.

Please view the function and its input descriptions in the Python file src/speechmlpipeline/main_pipeline_local_function.py.

Please view sample code for running the function in sample_run.py and sample_run_existingllama2output.py in src/speechmlpipeline.

from main_pipeline_local_function import (TranscriptionInputs, SpeakerChangeDetectionInputs,
    EnsembleDetectionInputs, SpeakerIdentificationInputs, run_speech_ml_pipeline)

# Run Whole Pipeline except for Downloading Models
run_speech_ml_pipeline(transcription=<transcription_inputs>,
                       speakerchangedetection=<detection_inputs>,
                       ensembledetection=<ensemble_detection_inputs>,
                       speakeridentification=<speaker_identification_inputs>)

License

License: MIT
