Linguistic processing for languages in Common Voice

Common Voice Utils

This repository collects basic linguistic processing tools and data for working with dataset dumps from the Common Voice project. It aims to be a one-stop shop for utilities and data that are useful when training ASR and TTS systems.

Tools

Phonemiser:
- A rudimentary grapheme to phoneme (g2p) system based on either:
  - a deterministic longest-match left-to-right replacement of orthographic units; or
  - a weighted finite-state transducer
Validator:
- A validation script that can be used with import_cv2.py from coqui-ai/STT
- It checks whether a sentence can be converted and, if so, normalises the encoding, removes punctuation and returns the result
Alphabet:
- The relevant alphabet of the language, appropriate for use in training ASR
Segmenter:
- A deterministic sentence segmentation algorithm tuned for segmenting paragraphs from Wikipedia
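
The first phonemiser strategy listed above, deterministic longest-match left-to-right replacement, can be sketched as follows. This is a minimal illustration and not the package's own code: the function name `longest_match_g2p` and the toy table `TOY_TABLE` are hypothetical, and real tables are language-specific.

```python
def longest_match_g2p(word, table):
    """Transcribe `word` by greedy longest-match, left-to-right replacement."""
    # The longest orthographic unit in the table bounds how far we look ahead.
    max_len = max(len(unit) for unit in table)
    out = []
    i = 0
    while i < len(word):
        # Try the longest candidate first, then progressively shorter ones.
        for length in range(min(max_len, len(word) - i), 0, -1):
            unit = word[i:i + length]
            if unit in table:
                out.append(table[unit])
                i += length
                break
        else:
            # No rule matched: pass the character through unchanged.
            out.append(word[i])
            i += 1
    return "".join(out)


# Hypothetical toy table, for illustration only (not the package's data).
TOY_TABLE = {"ch": "ʃ", "c": "k", "ou": "u", "o": "o", "u": "y", "j": "ʒ"}

print(longest_match_g2p("chou", TOY_TABLE))  # ʃu
```

Because longer units are tried first, "ch" is consumed as one unit rather than as "c" followed by "h".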

Language support

Language	Autonym	ISO 639-3 code	Common Voice code	Wikipedia code	Phonemiser	Validator	Alphabet	Segmenter
Abaza	Абаза	`abq`	`ab`	—	✔	✔	✔
Arabic	اَلْعَرَبِيَّةُ	`ara`	`ar`	`ar`	—	✔	✔
Assamese	অসমীয়া	`asm`	`as`	`as`
Basaa	Basaa	`bas`	`bas`	—	✔		✔
Breton	Brezhoneg	`bre`	`br`	`br`	✔	✔	✔	✔
Catalan	Català	`cat`	`ca`	`ca`			✔
Czech	Čeština	`ces`	`cs`	`cs`	✔	✔	✔
Chuvash	Чӑвашла	`chv`	`cv`	`cv`	✔	✔	✔	✔
Hakha Chin	Hakha Lai	`cnh`	`cnh`	—			✔
Welsh	Cymraeg	`cym`	`cy`	`cy`	✔		✔
Dhivehi	ދިވެހި	`div`	`dv`	`dv`	✔
Greek	Ελληνικά	`ell`	`el`	`el`	✔		✔
German	Deutsch	`deu`	`de`	`de`		✔	✔
English	English	`eng`	`en`	`en`	—		✔
Esperanto	Esperanto	`epo`	`eo`	`eo`			✔
Spanish	Español	`spa`	`es`	`es`	✔		✔
Estonian	Eesti	`est`	`et`	`et`	✔		✔
Basque	Euskara	`eus`	`eu`	`eu`	✔	✔	✔
Persian	فارسی	`pes`	`fa`	`fa`	—
Finnish	Suomi	`fin`	`fi`	`fi`	✔	✔	✔
French	Français	`fra`	`fr`	`fr`	—		✔
Frisian	Frysk	`fry`	`fy-NL`	`fy`			✔
Irish	Gaeilge	`gle`	`ga-IE`	`ga`			✔
Hindi	हिन्दी	`hin`	`hi`	`hi`
Upper Sorbian	Hornjoserbšćina	`hsb`	`hsb`	`hsb`			✔
Hungarian	Magyar nyelv	`hun`	`hu`	`hu`	✔		✔
Armenian	Հայերեն	`hye`	`hy-AM`	`hy`	✔		✔
Interlingua	Interlingua	`ina`	`ia`	`ia`	✔		✔
Indonesian	Bahasa Indonesia	`ind`	`id`	`id`	✔		✔
Italian	Italiano	`ita`	`it`	`it`	✔		✔
Japanese	日本語	`jpn`	`ja`	`ja`	—		—
Georgian	ქართული ენა	`kat`	`ka`	`ka`	✔		✔
Kabyle	Taqbaylit	`kab`	`kab`	`kab`	✔		✔
Kazakh	Қазақша	`kaz`	`kk`	`kk`	✔		✔
Kyrgyz	Кыргызча	`kir`	`ky`	`ky`	✔		✔
Komi-Zyrian	Коми кыв	`kpv`	`kv`	`kv`	✔		✔
Luganda	Luganda	`lug`	`lg`	`lg`	✔		✔
Lithuanian	Lietuvių kalba	`lit`	`lt`	`lt`	✔		✔
Latvian	Latviešu valoda	`lvs`	`lv`	`lv`	✔		✔
Mongolian	Монгол хэл	`khk`	`mn`	`mn`	✔		✔
Maltese	Malti	`mlt`	`mt`	`mt`	✔		✔
Dutch	Nederlands	`nld`	`nl`	`nl`	✔		✔
Oriya	ଓଡ଼ିଆ	`ori`	`or`	`or`
Punjabi	ਪੰਜਾਬੀ	`pan`	`pa-IN`	`pa`
Polish	Polski	`pol`	`pl`	`pl`	✔		✔
Portuguese	Português	`por`	`pt`	`pt`			✔
Kʼicheʼ	Kʼicheʼ	`quc`	`quc`	—	✔	✔	✔
Romansch (Sursilvan)	Romontsch	`roh`	`rm-sursilv`	`rm`			✔
Romansch (Vallader)	Rumantsch	`roh`	`rm-vallader`	`rm`			✔
Romanian	Românește	`ron`	`ro`	`ro`	✔		✔
Russian	Русский	`rus`	`ru`	`ru`			✔
Kinyarwanda	Kinyarwanda	`kin`	`rw`	`rw`	✔		✔
Sakha	Саха тыла	`sah`	`sah`	`sah`	✔		✔
Slovenian	Slovenščina	`slv`	`sl`	`sl`	✔		✔
Swedish	Svenska	`swe`	`sv-SE`	`sv`	✔		✔
Tamil	தமிழ்	`tam`	`ta`	`ta`	✔		✔
Thai	ภาษาไทย	`tha`	`th`	`th`	✔		✔
Turkish	Türkçe	`tur`	`tr`	`tr`	✔		✔
Tatar	Татар теле	`tat`	`tt`	`tt`	✔		✔
Ukrainian	Українська мова	`ukr`	`uk`	`uk`	✔		✔
Vietnamese	Tiếng Việt	`vie`	`vi`	`vi`	✔		✔
Votic	Vaďďa tšeeli	`vot`	`vot`	—			✔
Chinese (China)	中文	`cmn`	`zh-CN`	`zh`	—		—
Chinese (Hong Kong)	中文	`cmn`	`zh-HK`	`zh`	—		—
Chinese (Taiwan)	中文	`cmn`	`zh-TW`	`zh`	—		—

How to use it

Alphabet

>>> from cvutils import Alphabet
>>> a = Alphabet('cv')
>>> a.get_alphabet()
' -абвгдежзийклмнопрстуфхцчшщыэюяёҫӑӗӳ'
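
One way to use such an alphabet string is to flag characters in a transcript that fall outside it. The sketch below is only an illustration under that assumption: the helper `out_of_alphabet` is hypothetical and not part of `cvutils`, and the alphabet is hard-coded from the output shown above so that the snippet runs on its own.

```python
# Alphabet string as shown in the Chuvash ('cv') example above.
ALPHABET = " -абвгдежзийклмнопрстуфхцчшщыэюяёҫӑӗӳ"


def out_of_alphabet(sentence, alphabet=ALPHABET):
    """Return the sorted characters of `sentence` that are not in `alphabet`."""
    return sorted(set(sentence.lower()) - set(alphabet))


# "\u04ab\u0443\u043b" spells the Cyrillic letters ҫ, у, л.
print(out_of_alphabet("\u04ab\u0443\u043b"))      # []
print(out_of_alphabet("\u04ab\u0443\u043b 21!"))  # ['!', '1', '2']
```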

Grapheme to phoneme

>>> from cvutils import Phonemiser
>>> p = Phonemiser('ab')
>>> p.phonemise('гӏапынхъамыз')
'ʕapənqaməz'

>>> p = Phonemiser('br')
>>> p.phonemise("implijout")
'impliʒut'

Validator

>>> from cvutils import Validator
>>> v = Validator('ab')
>>> v.validate('Аллаҳ хаҵеи-ԥҳәыси иеилыхны, аҭыԥҳацәа роума иалихыз?')
'аллаҳ хаҵеи-ԥҳәыси иеилыхны аҭыԥҳацәа роума иалихыз'

>>> v = Validator('br')
>>> v.validate('Ha cʼhoant hocʼh eus da gendercʼhel da implijout ar servijer-mañ ?')
"ha c'hoant hoc'h eus da genderc'hel da implijout ar servijer-mañ"

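The kind of processing described for the validator (normalise the encoding, remove punctuation, return the result) might look roughly like the sketch below. It is an assumption-laden illustration, not the package's implementation: `validate_sketch`, the apostrophe table and the set of kept punctuation are all hypothetical choices.

```python
import unicodedata

# Assumed apostrophe variants to fold into ASCII "'" (illustrative, not exhaustive).
APOSTROPHES = str.maketrans({"\u02bc": "'", "\u2019": "'", "\u2018": "'"})
KEEP = {"'", "-"}  # punctuation treated as part of a word in this sketch


def validate_sketch(sentence, alphabet=None):
    """Return a normalised sentence, or None if it contains unexpected characters."""
    # Normalise the encoding, unify apostrophes and lowercase.
    text = unicodedata.normalize("NFC", sentence).translate(APOSTROPHES).lower()
    # Drop every Unicode punctuation character except the ones we keep.
    text = "".join(
        ch for ch in text
        if ch in KEEP or not unicodedata.category(ch).startswith("P")
    )
    # Collapse runs of whitespace left behind by removed punctuation.
    text = " ".join(text.split())
    # Optionally reject sentences with characters outside a given alphabet.
    if alphabet is not None and any(ch not in alphabet for ch in text):
        return None
    return text


print(validate_sketch("Ha c\u02bchoant hoc\u02bch eus ?"))  # ha c'hoant hoc'h eus
```

Returning `None` for sentences with characters outside the alphabet is one possible way to signal that a sentence cannot be converted.
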
Sentence segmentation

>>> from cvutils import Segmenter 
>>> s = Segmenter('br')
>>> para = "Peurliesañ avat e kemm ar vogalennoù e c'hengerioù evit dont da vezañ heñvel ouzh ar vogalennoù en nominativ (d.l.e. ar stumm-meneg), da skouer e hungareg: Aour, tungsten, zink, uraniom, h.a., a vez kavet e kondon Bouryatia. A-bouez-bras evit armerzh ar vro eo al labour-douar ivez pa vez gounezet gwinizh ha legumaj dreist-holl. A-hend-all e vez gounezet arc'hant dre chaseal ha pesketa."
>>> for sent in s.segment(para):
...     print(sent)
... 
Peurliesañ avat e kemm ar vogalennoù e c'hengerioù evit dont da vezañ heñvel ouzh ar vogalennoù en nominativ (d.l.e. ar stumm-meneg), da skouer e hungareg: Aour, tungsten, zink, uraniom, h.a., a vez kavet e kondon Bouryatia.
A-bouez-bras evit armerzh ar vro eo al labour-douar ivez pa vez gounezet gwinizh ha legumaj dreist-holl.
A-hend-all e vez gounezet arc'hant dre chaseal ha pesketa.
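
A deterministic segmenter of this general kind can be approximated with a simple rule: split after terminal punctuation when the next word starts with an uppercase letter, unless the preceding token is a known abbreviation. The sketch below is a hypothetical illustration of that rule, not the algorithm used by `Segmenter`; the name `segment_sketch` and the abbreviation list are assumptions.

```python
import re

# Hypothetical abbreviation list; a real one would be language-specific.
ABBREVIATIONS = {"h.a.", "d.l.e."}


def segment_sketch(paragraph, abbreviations=ABBREVIATIONS):
    """Split `paragraph` into sentences with a simple deterministic rule."""
    sentences = []
    start = 0
    # Candidate boundary: terminal punctuation followed by whitespace.
    for match in re.finditer(r"[.!?]\s+", paragraph):
        end = match.start() + 1  # index just past the punctuation mark
        next_char = paragraph[match.end():match.end() + 1]
        last_token = paragraph[start:end].split()[-1]
        # Skip if the next word is not capitalised or we just saw an abbreviation.
        if not next_char.isupper() or last_token.lower() in abbreviations:
            continue
        sentences.append(paragraph[start:end].strip())
        start = match.end()
    tail = paragraph[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


para = ("A-bouez-bras evit armerzh ar vro eo al labour-douar ivez pa vez gounezet "
        "gwinizh ha legumaj dreist-holl. A-hend-all e vez gounezet arc'hant dre "
        "chaseal ha pesketa.")
for sent in segment_sketch(para):
    print(sent)
```

On this two-sentence excerpt from the paragraph above, the rule splits once, after "dreist-holl.".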

Frequently asked questions

Why not use [insert better system] for [insert task here]?

There may well be better language-specific systems for each of these tasks, but each one has a slightly different API. To support all the Common Voice languages, or even a reasonable subset, you would have to learn and use a separate API for each of those systems.

The idea of these utilities is to provide adequate implementations of things that are likely to be useful when working with all the languages in Common Voice. If you are working on a single language, have a specific setup, or are using more data than just Common Voice, maybe this isn't for you. But if you just want to train coqui-ai/STT on Common Voice, then maybe it is :)

Why not just make the alphabet from the transcripts?

Depending on the language, the Common Voice transcripts can contain a lot of stray punctuation, numerals, and incorrectly encoded characters (for example Latin ç instead of Cyrillic ҫ for Chuvash). These may look the same, but they are distinct symbols to the model and make its training data sparser. Additionally, some languages have several encodings of the same character, such as the apostrophe. Ideally these are normalised before training.
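
The Chuvash example can be checked directly with Python's standard `unicodedata` module. The snippet below shows that the two look-alike letters are different code points, and sketches one possible normalisation step; the mapping table and the function `normalise_chuvash` are hypothetical and only cover this one pair of letters.

```python
import unicodedata

LATIN_C_CEDILLA = "\u00e7"        # looks like the Chuvash letter but is Latin
CYRILLIC_ES_DESCENDER = "\u04ab"  # the letter the Chuvash alphabet expects

for ch in (LATIN_C_CEDILLA, CYRILLIC_ES_DESCENDER):
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
# U+00E7 LATIN SMALL LETTER C WITH CEDILLA
# U+04AB CYRILLIC SMALL LETTER ES WITH DESCENDER

# Hypothetical fix-up table: map the Latin look-alikes to their Cyrillic counterparts.
CHUVASH_FIXES = str.maketrans({"\u00e7": "\u04ab", "\u00c7": "\u04aa"})


def normalise_chuvash(text):
    """Compose characters (NFC), then replace Latin look-alikes with Cyrillic."""
    return unicodedata.normalize("NFC", text).translate(CHUVASH_FIXES)
```

Applying NFC first means that a plain "c" followed by a combining cedilla is composed into U+00E7 and then caught by the same table.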

Also, if you are working with a single language you probably have time to look through all the transcripts for the alphabetic symbols, but if you want to work with a large number of Common Voice languages at the same time it's useful to have them all in one place.

Acknowledgements

The grapheme-to-phoneme correspondences for the following languages come from epitran:
- vi, uk, kk, ky, ta
Code for transducer lookup from Måns Huldén.

Release history

Version	Released
0.2.30	Dec 8, 2022
0.2.29	Nov 29, 2022
0.2.28	Nov 29, 2022
0.2.27	May 27, 2022
0.2.26	May 9, 2022
0.2.25	Apr 15, 2022
0.2.24	Apr 12, 2022
0.2.23	Apr 11, 2022
0.2.22	Apr 10, 2022
0.2.21	Feb 2, 2022
0.2.18	Feb 1, 2022
0.2.17	Feb 1, 2022
0.2.16	Jan 27, 2022
0.2.15	Jan 27, 2022
0.2.14	Dec 27, 2021
0.2.13	Dec 9, 2021
0.2.12	Sep 25, 2021
0.2.11	Sep 24, 2021
0.2.10	Sep 24, 2021
0.2.9	Sep 8, 2021
0.2.8	Sep 2, 2021
0.2.7	Jun 24, 2021
0.2.6	Jun 24, 2021
0.2.5	Jun 24, 2021
0.2.4	May 7, 2021
0.2.3	May 5, 2021
0.2.2	May 5, 2021
0.2.1	May 5, 2021
0.2.0	May 5, 2021
0.1.9	Apr 10, 2021
0.1.8	Apr 3, 2021
0.1.7	Apr 2, 2021
0.1.6 (this version)	Apr 2, 2021
0.1.5	Apr 2, 2021
0.1.4	Apr 2, 2021

Download files

Download the file for your platform.

Source Distribution

commonvoice-utils-0.1.6.tar.gz (33.4 kB)

Uploaded Apr 2, 2021 Source

Built Distribution

commonvoice_utils-0.1.6-py3-none-any.whl (58.9 kB)

Uploaded Apr 2, 2021 Python 3

File details

Details for the file commonvoice-utils-0.1.6.tar.gz.

File metadata

Download URL: commonvoice-utils-0.1.6.tar.gz
Upload date: Apr 2, 2021
Size: 33.4 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/3.3.0 pkginfo/1.4.2 requests/2.24.0 setuptools/50.3.0 requests-toolbelt/0.9.1 tqdm/4.51.0 CPython/3.8.7

File hashes

Hashes for commonvoice-utils-0.1.6.tar.gz
Algorithm	Hash digest
SHA256	`d42d655e27d4aaeb1f7c25456ba3798cd19f067e1a68c482291bba6efb1de71f`
MD5	`1c29dc9217e22296bf7e84418917be5c`
BLAKE2b-256	`3c77ca852f4b2044ef2c305414fbc3980fcfa775aa18e5aa2887c9c067bda834`

File details

Details for the file commonvoice_utils-0.1.6-py3-none-any.whl.

File metadata

Download URL: commonvoice_utils-0.1.6-py3-none-any.whl
Upload date: Apr 2, 2021
Size: 58.9 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/3.3.0 pkginfo/1.4.2 requests/2.24.0 setuptools/50.3.0 requests-toolbelt/0.9.1 tqdm/4.51.0 CPython/3.8.7

File hashes

Hashes for commonvoice_utils-0.1.6-py3-none-any.whl
Algorithm	Hash digest
SHA256	`2c72a285af8f8bb9b26992c995acc7e8e02f2329d88859e29eab261fe463c297`
MD5	`3a494ea312b994436be3a33e2e76dd5a`
BLAKE2b-256	`8aedd24b7bcdabe9fa91f91581b56b870ce746d400f801cbaf86e32df2e65250`
