A study to benchmark Whisper-based ASR models in Malayalam

These details have not been verified by PyPI

Project links

Homepage

Development Status
- 4 - Beta
Intended Audience
- Developers
License
- OSI Approved :: MIT License
Natural Language
- English
Programming Language

Project description

malayalam_asr_benchmarking

This work is still in progress. So far I have benchmarked models on the Common Voice 11 Malayalam dataset, and the results have been uploaded to Hugging Face as a dataset. I am currently benchmarking the Malayalam Speech Corpus dataset as well; once complete, those results will be uploaded to Hugging Face datasets in the same manner.

Install

pip install malayalam_asr_benchmarking

Or locally, from the root of a source checkout

pip install -e .
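
After either install command, one way to confirm that the package is visible to the current Python environment is to ask the standard library for its installed version. This is only a minimal sketch: the helper name installed_version is hypothetical and is not part of this package.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package_name):
    """Return the installed version string, or None if the package is not installed.

    Hypothetical helper for illustration; it only wraps importlib.metadata.
    """
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None


# Prints a version string such as the one you installed, or None if the install failed.
print(installed_version("malayalam_asr_benchmarking"))
```

If this prints None, the install most likely went into a different environment than the one running the script.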

Setting up your development environment

I am developing this project with nbdev. Before contributing, please take some time to read up on nbdev (how it works, its directives, and so on) by going through the walkthroughs and tutorials on the nbdev website.

Step 1: Install Quarto

nbdev_install_quarto

Other installation options are described in Quarto's getting started guide.

Step 2: Install hooks

nbdev_install_hooks

Step 3: Install our library

pip install -e '.[dev]'

How to use

from malayalam_asr_benchmarking.commonvoice import evaluate_whisper_model_common_voice

evaluate_whisper_model_common_voice("parambharat/whisper-tiny-ml")

Found cached dataset common_voice_11_0 (/home/.cache/huggingface/datasets/mozilla-foundation___common_voice_11_0/ml/11.0.0/2c65b95d99ca879b1b1074ea197b65e0497848fd697fdb0582e0f6b75b6f4da0)
Loading cached processed dataset at /home/.cache/huggingface/datasets/mozilla-foundation___common_voice_11_0/ml/11.0.0/2c65b95d99ca879b1b1074ea197b65e0497848fd697fdb0582e0f6b75b6f4da0/cache-374585c2877047e3.arrow
Loading cached processed dataset at /home/.cache/huggingface/datasets/mozilla-foundation___common_voice_11_0/ml/11.0.0/2c65b95d99ca879b1b1074ea197b65e0497848fd697fdb0582e0f6b75b6f4da0/cache-22670505c562e0d4.arrow
/opt/conda/lib/python3.8/site-packages/transformers/generation_utils.py:1359: UserWarning: Neither `max_length` nor `max_new_tokens` has been set, `max_length` will default to 448 (`self.config.max_length`). Controlling `max_length` via the config is deprecated and `max_length` will be removed from the config in v5 of Transformers -- we recommend using `max_new_tokens` to control the maximum length of the generation.
  warnings.warn(

Total time taken: 133.23447608947754
The WER of model: 38.31
The CER of model: 21.93
The model size is: 37.76M
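
The WER (word error rate) and CER (character error rate) reported above are, by their standard definitions, edit distances between the reference transcript and the model output, divided by the reference length and expressed as a percentage. The sketch below illustrates those definitions in plain Python. It is not taken from this package, which may compute the metrics differently (for example with text normalization applied first). The function names are hypothetical, and the CER here counts Unicode code points, which is an assumption that matters for Malayalam, where one visible character can span several code points.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists of words)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # delete a reference item
                curr[j - 1] + 1,     # insert a hypothesis item
                prev[j - 1] + cost,  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate in percent, splitting on whitespace."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return 100.0 * edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate in percent, counted over Unicode code points."""
    if not reference:
        raise ValueError("reference must not be empty")
    return 100.0 * edit_distance(reference, hypothesis) / len(reference)


# Toy English pair, purely illustrative (not from any benchmark dataset).
reference = "the cat sat on the mat"
hypothesis = "the cat sit on mat"
print(f"WER: {wer(reference, hypothesis):.2f}")
print(f"CER: {cer(reference, hypothesis):.2f}")
```

For the toy pair, two word-level edits against six reference words give a WER of about 33.33. Lower values are better for both metrics, and either can exceed 100 when the hypothesis contains many insertions.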


Release history

- 0.0.4: Feb 10, 2024
- 0.0.3: Feb 8, 2024
- 0.0.2: Mar 21, 2023 (this version)
- 0.0.1: Mar 5, 2023

Download files


Source Distribution

malayalam_asr_benchmarking-0.0.2.tar.gz (7.3 kB)

Uploaded Mar 21, 2023 Source

Built Distribution

malayalam_asr_benchmarking-0.0.2-py3-none-any.whl (8.1 kB)

Uploaded Mar 21, 2023 Python 3

File details

Details for the file malayalam_asr_benchmarking-0.0.2.tar.gz.

File metadata

Download URL: malayalam_asr_benchmarking-0.0.2.tar.gz
Upload date: Mar 21, 2023
Size: 7.3 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/4.0.2 CPython/3.10.6

File hashes

Hashes for malayalam_asr_benchmarking-0.0.2.tar.gz
Algorithm	Hash digest
SHA256	`109106e4e58dfc0f312548758f7b8b3c2658fd9238245d2c187b2935f9c91d47`
MD5	`f9dd2ce683bdc2c962417ba8adf6ccfb`
BLAKE2b-256	`8282e453c255028ca97fc55b00e4abd861958de7f8a8bf8422ea04091627581a`


File details

Details for the file malayalam_asr_benchmarking-0.0.2-py3-none-any.whl.

File metadata

Download URL: malayalam_asr_benchmarking-0.0.2-py3-none-any.whl
Upload date: Mar 21, 2023
Size: 8.1 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/4.0.2 CPython/3.10.6

File hashes

Hashes for malayalam_asr_benchmarking-0.0.2-py3-none-any.whl
Algorithm	Hash digest
SHA256	`94bc218f4d480824483819eded03f7ca07605c7d38daad8a7277477e6d4d9941`
MD5	`fab6ca8ff9ab8963ddc5278b20fc33b2`
BLAKE2b-256	`88d31d0f0a86a7f0fcf0c0b74fa167728bb8345760ac40320d3ddc0ad4e88dc0`
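
The digests listed above can be used to check that a downloaded file is intact. The following is a minimal sketch using only Python's hashlib; the helper names file_digests and matches_sha256 are hypothetical, and it assumes the file has already been downloaded to a local path. BLAKE2b-256 is taken here to mean BLAKE2b with a 32-byte digest.

```python
import hashlib


def file_digests(path, chunk_size=65536):
    """Compute SHA256, MD5 and BLAKE2b-256 hex digests of a file in one pass."""
    hashers = {
        "sha256": hashlib.sha256(),
        "md5": hashlib.md5(),
        "blake2b-256": hashlib.blake2b(digest_size=32),  # 32 bytes = 256 bits
    }
    with open(path, "rb") as f:
        # Read in chunks so large files do not need to fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            for hasher in hashers.values():
                hasher.update(chunk)
    return {name: hasher.hexdigest() for name, hasher in hashers.items()}


def matches_sha256(path, expected_sha256):
    """Return True if the file's SHA256 digest equals the expected hex string."""
    return file_digests(path)["sha256"] == expected_sha256.strip().lower()
```

For example, calling matches_sha256 with the local path of malayalam_asr_benchmarking-0.0.2.tar.gz and the SHA256 value from the table for that file should return True for an uncorrupted download.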