
A study to benchmark Whisper-based ASR models in Malayalam

Project description

malayalam_asr_benchmarking

Objective of the project

Note

A study to benchmark automatic speech recognition (ASR) models in Malayalam. So far, the project has benchmarked Malayalam ASR models based on Whisper.

Benchmarked Datasets

So far, benchmarking has mainly been done on two datasets:

  1. Common Voice 11 Dataset

I have benchmarked on the Malayalam subset of Mozilla’s Common Voice 11. The benchmarking results can be found in the dataset below.

  2. Malayalam Speech Corpus

I have benchmarked on SMC’s Malayalam Speech Corpus dataset. The benchmarking results can be found in the dataset below.

Install

pip install malayalam_asr_benchmarking

Or from the GitHub repository

# Ensure git is installed; if not, install it first (e.g. on Ubuntu: apt install git)
pip install git+https://github.com/kurianbenoy/malayalam_asr_benchmarking.git

Or locally

# Ensure git is installed; if not, install it first (e.g. on Ubuntu: apt install git)
git clone https://github.com/kurianbenoy/malayalam_asr_benchmarking.git
cd malayalam_asr_benchmarking
pip install -e .
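
After any of the install routes above, a quick way to confirm that the package is visible to the current Python environment is to query its installed metadata. The sketch below uses only the standard library; the helper name `installed_version` is hypothetical and not part of this project.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package: str):
    """Return the installed version string of `package`, or None if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        # The distribution is not installed in this environment.
        return None


if __name__ == "__main__":
    found = installed_version("malayalam_asr_benchmarking")
    if found is None:
        print("malayalam_asr_benchmarking is not installed")
    else:
        print(f"malayalam_asr_benchmarking {found} is installed")
```

If the helper returns None after an editable install, the most likely cause is that pip ran in a different environment from the interpreter being used.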

Setting up your development environment

I am developing this project with nbdev. Please take some time to read up on nbdev (how it works, its directives, and so on) by checking out the walkthroughs and tutorials on the nbdev website.

Step 1: Install Quarto

nbdev_install_quarto

Other installation options are described in Quarto's getting started guide.

Step 2: Install hooks

nbdev_install_hooks

Step 3: Install our library

pip install -e '.[dev]'

How to use

from malayalam_asr_benchmarking.commonvoice import evaluate_whisper_model_common_voice

werlist = []
cerlist = []
modelsizelist = []
timelist = []

evaluate_whisper_model_common_voice("parambharat/whisper-tiny-ml", werlist, cerlist, modelsizelist, timelist)
Downloading (…)lve/main/config.json: 100%|██████████| 1.03k/1.03k [00:00<00:00, 6.09MB/s]
Downloading pytorch_model.bin: 100%|██████████| 151M/151M [00:24<00:00, 6.07MB/s]
Downloading (…)okenizer_config.json: 100%|██████████| 827/827 [00:00<00:00, 2.64MB/s]
Downloading (…)olve/main/vocab.json: 100%|██████████| 1.04M/1.04M [00:00<00:00, 1.14MB/s]
Downloading (…)olve/main/merges.txt: 100%|██████████| 494k/494k [00:00<00:00, 2.65MB/s]
Downloading (…)main/normalizer.json: 100%|██████████| 52.7k/52.7k [00:00<00:00, 252kB/s]
Downloading (…)in/added_tokens.json: 100%|██████████| 2.11k/2.11k [00:00<00:00, 8.53MB/s]
Downloading (…)cial_tokens_map.json: 100%|██████████| 2.06k/2.06k [00:00<00:00, 5.10MB/s]
Downloading (…)rocessor_config.json: 100%|██████████| 185k/185k [00:02<00:00, 76.2kB/s]

AssertionError: Torch not compiled with CUDA enabled
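
The `AssertionError` above indicates that the installed PyTorch build has no CUDA support, so this call appears to expect a GPU-enabled PyTorch. A minimal pre-flight check, assuming the PyTorch build is the only thing at issue, might look like the sketch below. The helper name `cuda_ready` is hypothetical and not part of this package.

```python
import importlib.util


def cuda_ready() -> bool:
    """Return True only if PyTorch is importable and reports a usable CUDA device."""
    if importlib.util.find_spec("torch") is None:
        # PyTorch is not installed at all.
        return False
    import torch

    return bool(torch.cuda.is_available())


if __name__ == "__main__":
    if cuda_ready():
        print("CUDA is available; the evaluation call can be attempted.")
    else:
        print("No CUDA-enabled PyTorch found; the evaluation may fail with the error shown above.")
```

Running this check before a long model download avoids waiting for the weights only to hit the error afterwards.
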
from malayalam_asr_benchmarking.commonvoice import evaluate_faster_whisper_model_common_voice

werlist = []
cerlist = []
modelsizelist = []
timelist = []

evaluate_faster_whisper_model_common_voice("parambharat/whisper-tiny-ml", werlist, cerlist, modelsizelist, timelist)
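
The names `werlist` and `cerlist` suggest that the evaluation functions collect a word error rate (WER) and a character error rate (CER) for each model; the source does not show how the package computes them. As an illustration only, both metrics can be sketched as an edit distance divided by the reference length. The functions below are hypothetical helpers, not this package's API, and they apply no text normalization.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance over the number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: code-point-level edit distance over the reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(list(reference), list(hypothesis)) / len(reference)


if __name__ == "__main__":
    reference = "the cat sat on the mat"
    hypothesis = "the cat sat on mat"
    print(f"WER: {wer(reference, hypothesis):.3f}")
    print(f"CER: {cer(reference, hypothesis):.3f}")
```

Counting Unicode code points is a simplification for Malayalam, where one visible character is often made up of several code points, so CER values from this sketch may differ from those of a tool that normalizes text or counts grapheme clusters.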


Built Distribution

File details

Details for the file malayalam_asr_benchmarking-0.0.4-py3-none-any.whl.

File metadata

File hashes

Hashes for malayalam_asr_benchmarking-0.0.4-py3-none-any.whl
Algorithm Hash digest
SHA256 42127df88e40a2caefdc836cddf3ef216cbd142d42afb762f63db411bd0979f1
MD5 9e32fd7871c3761ab6c78dec534aebbb
BLAKE2b-256 d24f940c98fe4256be138126c05435791d8d1bfd165f6acbbf4d9d2b354d36d8
