Skip to main content

Distill the Automatic Speech Recognition (TensorFlow)

Project description

Automatic Speech Recognition

The project aim is to distill the Automatic Speech Recognition research. At the beginning, you can load a ready-to-use pipeline with a pre-trained model. Benefit from the eager TensorFlow 2.0 and freely monitor model weights, activations or gradients.

import automatic_speech_recognition as asr

file = 'to/test/sample.wav'  # sample rate 16 kHz, and 16 bit depth
sample = asr.utils.read_audio(file)
pipeline = asr.load('deepspeech2', lang='en')
pipeline.model.summary()     # TensorFlow model
sentences = pipeline.predict([sample])

We support english (thanks to Open Seq2Seq). The evaluation results of the English benchmark LibriSpeech dev-clean are in the table. To reference, the DeepSpeech (Mozilla) achieves around 7.5% WER, whereas the state-of-the-art (RWTH Aachen University) equals 2.3% WER (recent evaluation results can be found here). Both of them, use the external language model to boost results.

Model Name Decoder WER-dev
deepspeech2 greedy 6.71

Shortly it turns out that you need to adjust pipeline a little bit. Take a look at the CTC Pipeline. The pipeline is responsible for connecting a neural network model with all non-differential transformations (features extraction or prediction decoding). Pipeline components are independent. You can adjust them to your needs e.g. use more sophisticated feature extraction, different data augmentation, or add the language model decoder (static n-grams or huge transformers). You can do much more like distribute the training using the Strategy, or experiment with mixed precision policy.


import numpy as np
import automatic_speech_recognition as asr

dataset = asr.dataset.Audio.from_csv('train.csv', batch_size=32)
dev_dataset = asr.dataset.Audio.from_csv('dev.csv', batch_size=32)
alphabet = asr.text.Alphabet(lang='en')
features_extractor = asr.features.FilterBanks(
    features_num=160,
    winlen=0.02,
    winstep=0.01,
    winfunc=np.hanning
)
model = asr.model.get_deepspeech2(
    input_dim=160,
    output_dim=29,
    rnn_units=800,
    is_mixed_precision=True
)
optimizer = asr.optimizer.Adam(
    lr=1e-4,
    beta_1=0.9,
    beta_2=0.999,
    epsilon=1e-8
)
decoder = asr.decoder.GreedyDecoder()
pipeline = asr.pipeline.CTCPipeline(
    alphabet, features_extractor, model, optimizer, decoder
)
pipeline.fit(dataset, dev_dataset, epochs=25)
pipeline.save('/checkpoint')

test_dataset = asr.dataset.Audio.from_csv('test.csv')
wer, cer = asr.evaluate.calculate_error_rates(pipeline, test_dataset)
print(f'WER: {wer}   CER: {cer}')

Installation

You can use pip:

pip install automatic-speech-recognition

Otherwise clone the code and create a new environment via conda:

git clone https://github.com/rolczynski/Automatic-Speech-Recognition.git
conda env create -f=environment.yml     # or use: environment-gpu.yml
conda activate Automatic-Speech-Recognition

References

The fundamental repositories:

Moreover, you can explore the GitHub using key phrases like ASR, DeepSpeech, or Speech-To-Text.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Files for automatic-speech-recognition, version 1.0.2
Filename, size File type Python version Upload date Hashes
Filename, size automatic_speech_recognition-1.0.2-py3-none-any.whl (40.3 kB) File type Wheel Python version py3 Upload date Hashes View hashes
Filename, size automatic-speech-recognition-1.0.2.tar.gz (19.0 kB) File type Source Python version None Upload date Hashes View hashes

Supported by

Elastic Elastic Search Pingdom Pingdom Monitoring Google Google BigQuery Sentry Sentry Error logging AWS AWS Cloud computing DataDog DataDog Monitoring Fastly Fastly CDN DigiCert DigiCert EV certificate StatusPage StatusPage Status page