Catalan Punctuation and Capitalization Restoration Model

This repo contains the implementation and an explanation of a punctuation and capitalization system for ASR models.

Introduction

Almost all automatic speech recognition (ASR) systems convert speech into text that has no capitalization or punctuation, which can lead to misunderstanding of the generated text. In this blog I explain and implement a capitalization and punctuation model for the Catalan language, built on a RoBERTa language model. This tutorial is mainly based on the NVIDIA NeMo tutorial on punctuation and capitalization models, found here.

Language Model Based Capitalization and Punctuation Model

  • For each word, the model predicts which punctuation mark, if any, should follow it (comma, period, question mark, ...).
  • The model also predicts whether each word should be capitalized.

As described here, this method jointly trains two token-level classifiers on top of a pretrained language model.
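
To make the idea of two token-level classifiers sharing one encoder more concrete, here is a toy sketch in plain Python. It is not the NeMo implementation: the hidden vector, the weight matrices, and the choice to add the two cross-entropy losses with equal weight are all assumptions made for illustration.

```python
import math

def softmax(scores):
    """Turn raw class scores into probabilities."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def linear(weights, hidden):
    """One linear classification head: one row of weights per class."""
    return [sum(w * h for w, h in zip(row, hidden)) for row in weights]

def joint_loss(hidden, punct_w, capit_w, punct_target, capit_target):
    """Cross-entropy of both heads for one token, added with equal weight (assumption)."""
    punct_probs = softmax(linear(punct_w, hidden))
    capit_probs = softmax(linear(capit_w, hidden))
    return -math.log(punct_probs[punct_target]) - math.log(capit_probs[capit_target])

# Hypothetical 3-dimensional encoder output for a single token.
hidden = [0.5, -1.0, 2.0]
# Punctuation head: 4 classes (O , . ?). Capitalization head: 2 classes (O U).
punct_w = [[0.1, 0.0, 0.2], [0.3, -0.2, 0.0], [0.0, 0.1, 0.4], [-0.1, 0.0, 0.1]]
capit_w = [[0.2, 0.1, -0.3], [-0.2, 0.0, 0.3]]

loss = joint_loss(hidden, punct_w, capit_w, punct_target=2, capit_target=1)
print(round(loss, 4))
```

Both heads read the same hidden vector, so a gradient step on the combined loss would update the shared encoder for both tasks at once.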

Data Format

The Punctuation and Capitalization model expects the data in the following format:

The training data and the evaluation data are each divided into two files: text.txt and labels.txt.

Each line of the text.txt file contains text sequences, where words are separated with spaces.

[WORD] [SPACE] [WORD] [SPACE] [WORD], for example:

when is the next flight to new york
the next flight is ...

The labels.txt file contains corresponding labels for each word in text.txt, the labels are separated with spaces. Each label in labels.txt file consists of 2 symbols:

  • The first symbol of the label indicates what punctuation mark should follow the word (where O means no punctuation needed).
  • The second symbol determines whether the word needs to be capitalized (where U indicates that the word should be upper cased, and O means no capitalization needed).

By default, the following punctuation marks are considered: commas, periods, and question marks; the remaining punctuation marks were removed from the data. This can be changed by introducing new labels in the labels.txt files.
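
As a rough illustration of how the label vocabulary grows when new punctuation marks are introduced, the sketch below builds the set of valid two-symbol labels and flags labels outside it. The helper names build_label_set and invalid_labels are hypothetical and are not part of NeMo.

```python
from itertools import product

def build_label_set(punct_marks=",.?"):
    """All valid two-symbol labels: punctuation symbol (or O) followed by O or U."""
    punct_symbols = ["O"] + list(punct_marks)
    capit_symbols = ["O", "U"]
    return {p + c for p, c in product(punct_symbols, capit_symbols)}

def invalid_labels(label_line, label_set):
    """Return the labels on one line of labels.txt that are not in label_set."""
    return [label for label in label_line.split() if label not in label_set]

default_labels = build_label_set()
print(len(default_labels))                                   # 8
print(invalid_labels("OU ,O !U", default_labels))            # ['!U']
print(invalid_labels("OU ,O !U", build_label_set(",.?!")))   # []
```

With the default marks there are 4 punctuation symbols times 2 capitalization symbols, so 8 labels; each extra mark adds 2 more.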

Each line of labels.txt should follow the format: [LABEL] [SPACE] [LABEL] [SPACE] [LABEL]. For example, the labels for the above text.txt file should be:

OU OO OO OO OO OO OU ?U
OU OO OO OO ...
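
One way to check your understanding of the format is to apply a labels line back onto its text line. The sketch below does that for the first example above; apply_labels is a hypothetical helper, and it assumes that U capitalizes only the first letter of the word.

```python
def apply_labels(text_line, label_line):
    """Rebuild a punctuated, capitalized sentence from one text/labels line pair."""
    words = text_line.split()
    labels = label_line.split()
    if len(words) != len(labels):
        raise ValueError("text and labels must contain the same number of tokens")
    restored = []
    for word, label in zip(words, labels):
        punct, capit = label[0], label[1]
        if capit == "U":
            # Assumption: U means "capitalize the first letter".
            word = word[0].upper() + word[1:]
        if punct != "O":
            word += punct
        restored.append(word)
    return " ".join(restored)

print(apply_labels("when is the next flight to new york",
                   "OU OO OO OO OO OO OU ?U"))
# When is the next flight to New York?
```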

Catalan Punctuation and Capitalization Data

For this tutorial I used this repo and merged the files common-voice-sentences.txt, dogc.txt, dogv.txt, riuraueditors.txt, softcatala.txt, wiki.ca.txt, and wiki.ca-mozilla_script.txt.
The following script converts any correctly capitalized and punctuated text into the training data format described above.

import random
import string

# Read the merged corpus: one correctly punctuated and capitalized sentence per line.
with open('/content/output_file.txt') as source:
    data_into_list = [line.strip() for line in source]

text_train = open("text_train.txt", "a")  # append mode
labels_train = open("labels_train.txt", "a")  # append mode

text_dev = open("text_dev.txt", "a")  # append mode
labels_dev = open("labels_dev.txt", "a")  # append mode

for j in data_into_list:
    # Keep only sentences shorter than 100 characters.
    if len(j) < 100:
        label = ""
        text = ""
        for i in j.split(" "):
            if not i:
                continue  # empty token produced by consecutive spaces
            # First label symbol: the trailing punctuation mark, or O for none.
            # Note that any character in string.punctuation is kept as a label here.
            has_punct = i[-1] in string.punctuation
            punct = i[-1] if has_punct else "O"
            # Second label symbol: U if the word starts with an upper-case letter.
            if i[0].isupper():
                capit = "U"
            elif i[0].islower():
                capit = "O"
            else:
                continue  # skip tokens that start with a digit or a symbol
            label += f"{punct}{capit} "
            text += f"{i[:-1].lower() if has_punct else i.lower()} "
        # Write the pair only if words and labels line up; about 15% goes to the dev set.
        if len(text.split()) == len(label.split()) and len(text) > 0:
            if random.random() < .15:
                text_dev.write(text + "\n")
                labels_dev.write(label + "\n")
            else:
                text_train.write(text + "\n")
                labels_train.write(label + "\n")

text_dev.close()
labels_dev.close()
text_train.close()
labels_train.close()
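
Before training, it may be worth confirming that the generated text and label files still line up. The sketch below is one possible check, not part of the original notebook: find_misaligned and read_lines are hypothetical helpers, and the in-memory sample deliberately contains one bad line.

```python
def find_misaligned(text_lines, label_lines):
    """Return 1-based line numbers where the word count and label count differ."""
    if len(text_lines) != len(label_lines):
        raise ValueError("files have different numbers of lines")
    return [
        n
        for n, (text, labels) in enumerate(zip(text_lines, label_lines), start=1)
        if len(text.split()) != len(labels.split())
    ]

def read_lines(path):
    """Read a file into a list of lines without trailing newlines."""
    with open(path, encoding="utf-8") as handle:
        return [line.rstrip("\n") for line in handle]

# In-memory demo; the second line has three words but only two labels.
sample_text = ["necessitem vacances", "a partir d'aquí"]
sample_labels = ["OU .O", "OU OO"]
print(find_misaligned(sample_text, sample_labels))  # [2]
```

For the real files, the call would look like find_misaligned(read_lines("text_train.txt"), read_lines("labels_train.txt")), and an empty list means every line is aligned.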

Once the training and validation data are ready, it is time to train your model.


Model

For this tutorial I used about 60,000 sample sentences and trained the model on top of roberta-base-ca. The complete notebook for gathering the data and training the punctuation and capitalization model for the Catalan language can be found here: Open In Colab

A pretrained model for inference can also be found here: Open In Colab

Install with pip

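import subprocess

# Install NeMo with all optional dependencies.
subprocess.run(["pip", "install", "nemo_toolkit[all]"], check=True)

# Clone NVIDIA Apex and build it from inside the cloned directory.
# os.system('cd apex') has no lasting effect: the directory change ends with that subshell.
subprocess.run(["git", "clone", "https://github.com/NVIDIA/apex"], check=True)
subprocess.run(
    ["pip", "install", "-v", "--disable-pip-version-check", "--no-cache-dir",
     "--global-option=--cpp_ext", "--global-option=--cuda_ext",
     "--global-option=--fast_layer_norm", "./"],
    cwd="apex", check=True,
)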

pip install pun8==0.0.1

from pun.main import setpath, init_model, correct

setpath("/content/Punctuation_and_Capitalization.nemo")

init_model()

correct(["si acabo d'hora aniré a mirar roba"])


Some examples from the model

Original text with capitalization and punctuation:

Si acabo d'hora, aniré a mirar roba.
Necessitem vacances.
A partir d'aquí?
Acabat el debat, procedirem a la votació.
Ah, Déu meu!
Bona tarda, diputats, diputades.
A Barcelona i a Cubells, deu mules són cinc parells.
A beure i a menjar, mesura has de posar.

And Model Output:

---------------------------------------------------------------------------------------
Query : si acabo d'hora aniré a mirar roba
Combined: Si acabo d'hora, aniré a mirar roba.
---------------------------------------------------------------------------------------
Query : necessitem vacances
Combined: Necessitem vacances.
---------------------------------------------------------------------------------------
Query : a partir d'aquí
Combined: A partir d'aquí.
---------------------------------------------------------------------------------------
Query : acabat el debat procedirem a la votació
Combined: Acabat el debat, procedirem a la votació.
---------------------------------------------------------------------------------------
Query : ah déu meu
Combined: Ah, Déu meu.
---------------------------------------------------------------------------------------
Query : bona tarda diputats diputades
Combined: Bona tarda Diputats diputades.
---------------------------------------------------------------------------------------
Query : a barcelona i a cubells deu mules són cinc parells
Combined: A Barcelona i a Cubells, deu mules són cinc parells.
---------------------------------------------------------------------------------------
Query : a beure i a menjar mesura has de posar
Combined: A beure i a menjar mesura, has de posar.
---------------------------------------------------------------------------------------
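
To put a number on comparisons like these, one option is to convert both the original sentence and the model output back into two-symbol labels and count how many agree. This is only a sketch: to_labels and label_accuracy are hypothetical helpers, and this metric is not necessarily the one reported by NeMo.

```python
import string

def to_labels(sentence):
    """Convert a punctuated sentence into (lower-cased word, two-symbol label) pairs."""
    pairs = []
    for token in sentence.split():
        punct = token[-1] if token[-1] in string.punctuation else "O"
        word = token[:-1] if punct != "O" else token
        capit = "U" if word[:1].isupper() else "O"
        pairs.append((word.lower(), punct + capit))
    return pairs

def label_accuracy(reference, hypothesis):
    """Fraction of words whose punctuation and capitalization label both match."""
    ref, hyp = to_labels(reference), to_labels(hypothesis)
    if [w for w, _ in ref] != [w for w, _ in hyp]:
        raise ValueError("sentences must contain the same words")
    matches = sum(r == h for (_, r), (_, h) in zip(ref, hyp))
    return matches / len(ref)

print(label_accuracy("Bona tarda, diputats, diputades.",
                     "Bona tarda Diputats diputades."))  # 0.5
```

In this pair the model gets "Bona" and "diputades." right but misses the comma after "tarda" and both the comma and the case of "diputats", so two of four labels match.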

As the results show, question marks and exclamation marks are predicted less accurately than commas and periods, because they occur much less often in the training data. Adding more training sentences that contain these marks should help.
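
A quick way to see this imbalance in your own data is to count the punctuation symbols in the label file. The sketch below assumes the two-symbol label format described earlier; punctuation_frequencies is a hypothetical helper, and the second sample line is made up for the demo.

```python
from collections import Counter

def punctuation_frequencies(label_lines):
    """Count how often each punctuation symbol (first label character) occurs."""
    counts = Counter()
    for line in label_lines:
        counts.update(label[0] for label in line.split())
    return counts

# First line is the example from the data format section; the second is invented.
sample = ["OU OO OO OO OO OO OU ?U", "OU ,O OO .O"]
counts = punctuation_frequencies(sample)
total = sum(counts.values())
for symbol, count in counts.most_common():
    print(symbol, count, f"{count / total:.0%}")
```

Run on labels_train.txt, a table like this shows how rare each mark is relative to O, which helps decide how many extra examples to add.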


Here are some statistics for the punctuation and capitalization model for the Catalan language:

[Image: screenshot of the model statistics]

