Keras (TensorFlow) implementation of Automatic Speech Recognition
DeepAsr
DeepAsr is an open-source implementation of an end-to-end Automatic Speech Recognition (ASR) engine.
DeepAsr is designed to support multiple speech recognition architectures. It currently provides Baidu's Deep Speech 2, implemented in Keras (TensorFlow).
Using DeepAsr you can:
- perform speech-to-text using pre-trained models
- tune pre-trained models to your needs
- create new models on your own
DeepAsr key features:
- Multi-GPU support: You can distribute training using a TensorFlow distribution Strategy, or experiment with a mixed precision policy.
- CuDNN support: The model uses NVIDIA's CuDNNLSTM implementation. CPU devices are also supported.
- DataGenerator: Feature extraction (on CPU) can run in parallel with model training (on GPU).
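The DataGenerator idea of overlapping CPU feature extraction with GPU training is a producer-consumer pattern. The following is a minimal sketch of that pattern using only the Python standard library. It is not DeepAsr's implementation: `extract_features`, `producer`, and `train` are hypothetical names, and the "training step" is a placeholder.

```python
import queue
import threading


def extract_features(clip_id):
    # Hypothetical stand-in for CPU-side feature extraction (e.g. MFCC).
    return [clip_id * 0.1] * 4


def producer(clip_ids, batch_queue):
    # Runs in a background thread and prepares features ahead of the consumer.
    for clip_id in clip_ids:
        batch_queue.put(extract_features(clip_id))
    batch_queue.put(None)  # sentinel: no more data


def train(clip_ids, prefetch=2):
    # A bounded queue keeps only a few prepared items in memory at a time.
    batch_queue = queue.Queue(maxsize=prefetch)
    worker = threading.Thread(
        target=producer, args=(clip_ids, batch_queue), daemon=True
    )
    worker.start()

    steps = 0
    while True:
        features = batch_queue.get()
        if features is None:
            break
        # Hypothetical stand-in for one training step on the GPU.
        steps += 1

    worker.join()
    return steps


print(train(range(10)))
```

The bounded queue is the key design choice: the producer blocks when it gets too far ahead, so memory use stays flat while the consumer rarely has to wait for data.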
Installation
You can use pip:
pip install deepasr
Getting started
Speech recognition is a tough task, but you don't need to know every detail to use one of the pre-trained models. It is still worth understanding the key conceptual components:
- Input: mono, 16-bit, 16 kHz WAVE files (up to 5 seconds)
- FeaturesExtractor: Converts audio files into MFCC features or a spectrogram
- Model: CTC model defined in Keras (references: [1], [2])
- Decoder: Greedy algorithm, with language model support, that decodes a sequence of probabilities using the Alphabet
- DataGenerator: Streams data to the model via a generator
- Callbacks: A set of functions that monitor the training
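Because the expected input format is strict, it can help to validate files before feeding them to a pipeline. Below is a minimal sketch that checks the properties listed above using Python's standard `wave` module. The helper `check_wave` is hypothetical and not part of DeepAsr.

```python
import wave


def check_wave(path, max_seconds=5.0):
    """Return a list of problems; an empty list means the file looks usable."""
    problems = []
    with wave.open(path, "rb") as wav:
        channels = wav.getnchannels()
        sample_width = wav.getsampwidth()  # bytes per sample
        rate = wav.getframerate()
        duration = wav.getnframes() / rate

    if channels != 1:
        problems.append("expected mono, got %d channels" % channels)
    if sample_width != 2:
        problems.append("expected 16-bit samples, got %d-bit" % (sample_width * 8))
    if rate != 16000:
        problems.append("expected 16000 Hz, got %d Hz" % rate)
    if duration > max_seconds:
        problems.append("clip is %.2f s, longer than %.1f s" % (duration, max_seconds))
    return problems
```

Calling `check_wave('clip.wav')` on a conforming file returns an empty list. Files that fail the check would need resampling or trimming with an audio tool of your choice.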
import numpy as np
import pandas as pd
import tensorflow as tf
import deepasr as asr


def get_config(features, multi_gpu):
    # Build every pipeline component: alphabet, features, model, optimizer, decoder.
    alphabet_en = asr.vocab.Alphabet(lang='en')
    features_extractor = asr.features.preprocess(feature_type=features, features_num=161,
                                                 samplerate=16000,
                                                 winlen=0.02,
                                                 winstep=0.01,
                                                 winfunc=np.hanning)
    model = asr.model.get_deepspeech2(
        input_dim=161,
        output_dim=29,
        is_mixed_precision=True
    )
    # `learning_rate` replaces the deprecated `lr` argument in TensorFlow 2.
    optimizer = tf.keras.optimizers.Adam(
        learning_rate=1e-4,
        beta_1=0.9,
        beta_2=0.999,
        epsilon=1e-8
    )
    decoder = asr.decoder.GreedyDecoder()
    pipeline = asr.pipeline.ctc_pipeline.CTCPipeline(
        alphabet=alphabet_en, features_extractor=features_extractor, model=model,
        optimizer=optimizer, decoder=decoder,
        sample_rate=16000, mono=True, multi_gpu=multi_gpu
    )
    return pipeline


def run(train_data, test_data, features='fbank', batch_size=32, epochs=10, multi_gpu=True):
    pipeline = get_config(features, multi_gpu)
    # history = pipeline.fit_iter(train_data, batch_size=batch_size, epochs=epochs, iter_num=1000)
    history = pipeline.fit_generator(train_data, batch_size=batch_size, epochs=epochs)
    pipeline.save('./checkpoints')

    # Compare the first test transcript with the model's prediction.
    # The column name matches the loading example; adjust it to your CSV.
    print("Truth:", test_data['transcripts'].to_list()[0])
    print("Prediction", pipeline.predict(test_data['path'].to_list()[0]))
    return history


train = pd.read_csv('train_data.csv')
test = pd.read_csv('test_data.csv')
run(train, test, features='fbank', batch_size=32, epochs=100, multi_gpu=True)
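The windowing parameters in the configuration above (`samplerate=16000`, `winlen=0.02`, `winstep=0.01`) determine how many samples fall into each analysis window and how many frames a clip produces. The sketch below works through that arithmetic. It assumes simple framing without padding, so the exact frame count may differ from what DeepAsr's extractor produces. It also shows that 161 equals the number of FFT bins for a 320-sample window. Whether `features_num=161` is derived this way in DeepAsr is an assumption. The helper `frame_geometry` is hypothetical.

```python
def frame_geometry(samplerate=16000, winlen=0.02, winstep=0.01, seconds=5.0):
    """Return (window, hop, fft_bins, frames) for simple framing without padding."""
    window = round(winlen * samplerate)   # samples per analysis window
    hop = round(winstep * samplerate)     # samples between window starts
    fft_bins = window // 2 + 1            # one-sided spectrum of a window-length FFT
    samples = round(seconds * samplerate)
    # Number of full windows that fit into the clip.
    frames = 0 if samples < window else 1 + (samples - window) // hop
    return window, hop, fft_bins, frames


print(frame_geometry())  # (320, 160, 161, 499)
```

Under these assumptions, a 5-second clip becomes a sequence of 499 frames with 161 values each.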
A loaded pre-trained pipeline contains all of its components, so a prediction only requires calling pipeline.predict().
import pandas as pd
import deepasr as asr
pipeline = asr.pipeline.load('./checkpoints')
test_data = pd.read_csv('test_data.csv')
print("Truth:", test_data['transcripts'].to_list()[0])
print("Prediction", pipeline.predict(test_data['path'].to_list()[0]))
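To illustrate what a greedy CTC decoder does conceptually, here is a minimal pure-Python sketch: pick the most likely symbol at each time step, collapse consecutive repeats, and drop the blank symbol. This is a generic illustration, not DeepAsr's `GreedyDecoder`. The function name, the alphabet layout, and the position of the blank index are assumptions, and language model support is not shown.

```python
def greedy_ctc_decode(probabilities, alphabet, blank_index):
    """Decode a list of per-step probability lists into a string."""
    # 1. Best symbol index at every time step.
    best = [max(range(len(step)), key=step.__getitem__) for step in probabilities]

    # 2. Collapse repeats and remove blanks.
    decoded = []
    previous = None
    for index in best:
        if index != previous and index != blank_index:
            decoded.append(alphabet[index])
        previous = index
    return "".join(decoded)


# Toy alphabet: index 2 is the CTC blank (an assumption for this example).
alphabet = ["a", "b", "_"]
steps = [
    [0.8, 0.1, 0.1],  # a
    [0.7, 0.2, 0.1],  # a (repeat, collapsed)
    [0.1, 0.1, 0.8],  # blank
    [0.6, 0.3, 0.1],  # a (new, because a blank separated it)
    [0.2, 0.7, 0.1],  # b
    [0.1, 0.8, 0.1],  # b (repeat, collapsed)
]
print(greedy_ctc_decode(steps, alphabet, blank_index=2))  # aab
```

The blank symbol is what lets CTC output genuine double letters: without the blank between the second and third steps, the two "a" predictions would collapse into one.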
References
The fundamental repositories:
- Baidu - DeepSpeech2 - A PaddlePaddle implementation of DeepSpeech2 architecture for ASR
- NVIDIA - Toolkit for efficient experimentation with Speech Recognition, Text2Speech and NLP
- TensorFlow - The implementation of DeepSpeech2 model
- Mozilla - DeepSpeech - A TensorFlow implementation of Baidu's DeepSpeech architecture
- Espnet - End-to-End Speech Processing Toolkit
- Automatic Speech Recognition - Distill the Automatic Speech Recognition research