Trinidad English Creole to Standard English

Caribe

This Python library converts Trinidadian English Creole to Standard English. Future updates will add conversion of other Caribbean English Creole languages to Standard English, along with additional natural language processing methods.


Installation

Use the command below to install the library:

pip install Caribe 


Usage

Sample 1: Checks the English Creole input against known Creole phrases before decoding the sentence into a more standardized form of English. A corrector is then used to fix small grammatical errors.

# Sample 1
import Caribe as cb


sentence = "Ah wah mi modda phone"
standard = cb.phrase_decode(sentence)
standard = cb.trinidad_decode(standard)
fixed = cb.caribe_corrector(standard)
print(fixed) #Output: I want my mother phone

Sample 2: Checks the Trinidad English Creole input against known phrases.

# Sample 2 
import Caribe as cb


sentence = "Waz de scene"
standard = cb.phrase_decode(sentence)

print(standard) # Outputs: How are you

Sample 3: Checks the sentence for grammatical errors or incomplete words and corrects them.

#Sample 3
import Caribe as cb


sentence = "I am playin fotball outsde"
standard = cb.caribe_corrector(sentence)

print(standard) # Outputs: I am playing football outside

Sample 4: Performs part-of-speech tagging on Creole words.

#Sample 4
import Caribe as cb

sentence = "wat iz de time there"
analyse = cb.nlp()
output = analyse.caribe_pos(sentence)

print(output) # Outputs: ["('wat', 'PRON')", "('iz', 'VERB')", "('de', 'DET')", "('time', 'NOUN')", "('there', 'ADV')"]
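
The sample output above shows each tag pair as a string, while the changelog notes that from version 0.1.2 the members are tuples. The following is a minimal sketch, assuming either shape may be encountered, that normalises the result into (word, tag) tuples. The helper name normalise_pos is hypothetical and is not part of Caribe.

```python
import ast


def normalise_pos(tagged):
    """Return a list of (word, tag) tuples from either output shape."""
    pairs = []
    for item in tagged:
        # Older output: "('wat', 'PRON')" stored as a string; parse it safely.
        if isinstance(item, str):
            item = ast.literal_eval(item)
        pairs.append(tuple(item))
    return pairs


# Illustrative inputs in both shapes (hypothetical, based on the sample above).
old_style = ["('wat', 'PRON')", "('iz', 'VERB')"]
new_style = [("wat", "PRON"), ("iz", "VERB")]

print(normalise_pos(old_style))
print(normalise_pos(new_style))
```

Both calls yield the same list of tuples, so downstream code does not need to care which version produced the tags.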

Sample 5: Removes punctuation marks.

#Sample 5
import Caribe as cb

sentence = "My aunt, Shelly is a lawyer!"
analyse = cb.remove_signs(sentence)


print(analyse) # Outputs: My aunt Shelly is a lawyer
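
For readers curious how punctuation removal of this kind can work, here is a minimal standard-library sketch. It is not the implementation of remove_signs(), and the function name strip_punctuation is hypothetical. It assumes only ASCII punctuation needs removing.

```python
import string


def strip_punctuation(sentence):
    # Map every ASCII punctuation character to None, which deletes it.
    table = str.maketrans("", "", string.punctuation)
    return sentence.translate(table)


print(strip_punctuation("My aunt, Shelly is a lawyer!"))  # My aunt Shelly is a lawyer
```

Note that this approach also strips apostrophes, so a contraction such as "Wat's" would become "Wats".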

Sample 6: Sentence Correction using T5-KES.

#Sample 6 Using t5_kes_corrector
import Caribe as cb


sentence = "Wat you doin for d the christmas"
correction = cb.t5_kes_corrector(sentence)


print(correction) # Output: What are you doing for christmas?

Sample 7: Sentence Correction using Decoder and T5-KES.

#Sample 7 Using t5_kes_corrector and decoder
import Caribe as cb


sentence = "Ah want ah phone for d christmas"
decoded = cb.trinidad_decode(sentence)
correction = cb.t5_kes_corrector(decoded)


print(correction) # Output: I want a phone for christmas.

  • Additional Information

    • trinidad_decode() : Decodes the sentence as a whole string.
    • trinidad_decode_split(): Decodes the sentence word by word.
    • phrase_decode(): Decodes the sentence against known creole phrases.
    • caribe_corrector(): Corrects grammatical errors in a sentence using gingerit.
    • t5_kes_corrector(): Corrects grammatical errors in a sentence using a trained NLP model.
    • trinidad_encode(): Encodes a sentence to Trinidadian English Creole.
    • caribe_pos(): Generates part-of-speech tags for Creole words.
    • pos_report(): Generates part-of-speech tags for English words.
    • remove_signs(): Takes any sentence and removes punctuation marks.

  • File Encodings on NLP datasets

Caribe introduced file encoding (Beta) in version 0.1.0. This allows a dataset in any supported file type to be translated into Trinidad English Creole. The file encoding feature supports only txt, json and csv files.

  • Usage of File Encodings:

import Caribe as cb

convert = cb.file_encode("test.txt", "text")
# Generates a translated text file
convert = cb.file_encode("test.json", "json")
# Generates a translated json file
convert = cb.file_encode("test.csv", "csv")
# Generates a translated csv file
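
To illustrate the idea behind encoding a dataset, the sketch below rewrites every cell of a CSV stream using a small word mapping. It is an assumption-laden illustration, not how file_encode() is implemented: the ENCODE entries are hypothetical (borrowed from spellings that appear in this README), and the function names encode_sentence and encode_csv are invented for the example.

```python
import csv
import io

# Hypothetical Standard English -> Creole entries, based on spellings in this README.
ENCODE = {"i": "ah", "my": "meh", "the": "d", "phone": "fone"}


def encode_sentence(sentence):
    # Replace each known word; leave unknown words unchanged.
    return " ".join(ENCODE.get(word.lower(), word) for word in sentence.split())


def encode_csv(source, target):
    """Encode every cell of a CSV stream, row by row."""
    writer = csv.writer(target, lineterminator="\n")
    for row in csv.reader(source):
        writer.writerow([encode_sentence(cell) for cell in row])


# In-memory streams stand in for real files so the sketch is self-contained.
source = io.StringIO("I want my phone\nI live in the city\n")
target = io.StringIO()
encode_csv(source, target)
print(target.getvalue())
```

A real dataset would also need a decision about header rows and label columns, which this sketch encodes like any other cell.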

  • First Parser for the Trinidad English Creole Language

This model is built on the pre-trained T5-base model. It was fine-tuned on a combination of a custom dataset and a creolised JFLEG dataset. The JFLEG dataset was translated using the file encoding feature of the library.

Within the Creole continuum there exist different lects. These include:

  • Acrolect: The version of a language closest to Standard International English.
  • Mesolect: The version that consists of a mixture of acrolectal and basilectal features.
  • Basilect: The version closest to a Creole language.

This NLP task was difficult because the structure of the local dialect is not codified and often relies on the structure of its lexifier (English). Spelling also varies from speaker to speaker. In addition, Creole words and spellings are not initially present in the model's vector space, which made training and optimization take longer.

Results

Initial results have been mixed.

Original Text | Parsed Text | Expected or Correctly Parsed Text
Ah have live with mi paremnts en London. | Ah live with meh parents in London. | Ah live with meh parents in London.
Ah can get me fone? | Ah cud get meh fone? | Ah cud get meh fone?
muh moda an fada is nt relly home | muh moda an fada is nt relly home. | mi moda an fada not really home.
Jack isa honrable ma | Jack issa honrable mah. | Jack issa honourable man.
Ah waz a going tu school. | Ah going to school. | Ah going to school.
Wat's up buddy. | Waz de scn buddy? | Waz de scn buddy? / Wat's up buddy?
Ah waz thinking bout goeng tuh d d Mall. | Ah thinking bout going tuh d Mall. | Ah thinking bout going tuh d mall.

Usage of the TrinEC Parser

import Caribe as cb

text = "Ah have live with mi paremnts en London"

s = cb.Parser(text)

print(s.TT_Parser()) #Output: Ah live with meh parents in London.

Dictionary Data

The encoder and decoder use a dictionary data structure. The data for the dictionary was gathered by web scraping social media sites and other websites, and from Lise Winer's Dictionary of the English/Creole of Trinidad & Tobago among other scholarly resources.
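
A dictionary-based decoder of this kind can be sketched in a few lines. The example below is illustrative only: the entries are hypothetical and taken from the samples in this README, and the function names decode_phrases and decode_words are invented here rather than being Caribe's own. It assumes a whole-sentence phrase lookup followed by a word-by-word lookup, mirroring the order used in Sample 1.

```python
# Hypothetical entries, taken from the samples in this README.
PHRASES = {"waz de scene": "how are you"}
WORDS = {"ah": "I", "wah": "want", "mi": "my", "modda": "mother", "de": "the"}


def decode_phrases(sentence):
    # Whole-sentence lookup against known phrases; return the input if unknown.
    return PHRASES.get(sentence.lower().strip(), sentence)


def decode_words(sentence):
    # Word-by-word lookup; unknown words pass through unchanged.
    return " ".join(WORDS.get(word.lower(), word) for word in sentence.split())


print(decode_words(decode_phrases("Ah wah mi modda phone")))  # I want my mother phone
print(decode_phrases("Waz de scene"))  # how are you
```

Because unknown words pass through untouched, a grammar corrector can still be applied afterwards to clean up whatever the dictionary did not cover.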


Transformer Models

T5-KES, a state-of-the-art NLP grammar correction model, was trained on a modified version of the JFLEG dataset and is currently being tested against existing models and benchmarks. The T5-KES model may be used both as a grammar corrector and, when trained on custom and translated datasets, as a parser for Trinidad English Creole.


T5_KES_Corrector vs Caribe Corrector (Ginger Grammar Corrector)

Initial tests measured two performance factors: the accuracy of the correction and the degree to which the output was distorted relative to the correct sentence. In these tests the T5 corrector was more accurate and produced less sentence distortion.
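
The README does not specify how accuracy and distortion were computed. As one possible reading, the sketch below uses exact-match accuracy and a character-level dissimilarity from the standard library's difflib. Both metric definitions and the function names are assumptions, and the example sentences are made-up inputs for illustration, not measurements of either corrector.

```python
from difflib import SequenceMatcher


def exact_match_accuracy(predictions, references):
    # Fraction of predictions identical to their reference sentence.
    hits = sum(p == r for p, r in zip(predictions, references))
    return hits / len(references)


def distortion(prediction, reference):
    # 0.0 means identical; larger values mean less overlap with the reference.
    return 1.0 - SequenceMatcher(None, prediction, reference).ratio()


# Illustrative data only (hypothetical corrector output, not real results).
references = ["I am playing football outside", "I want a phone for christmas."]
predictions = ["I am playing football outside", "I want phone for christmas."]

print(exact_match_accuracy(predictions, references))
print([round(distortion(p, r), 3) for p, r in zip(predictions, references)])
```

Running the same two functions over the output of each corrector on a shared test set would give directly comparable numbers.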


  • Contact

For any concerns or issues with this library, or if you would like to collaborate on this project, please get in touch.

Email: keston.smith@my.uwi.edu


Changelog

Version 0.0.1 (16/09/2021)

  • Initial Release

Version 0.0.2 (16/09/2021)

  • Minor bugs fixed
  • More words added

Version 0.0.3 (16/09/2021)

  • Minor bugs fixed
  • More words added
  • phrase_decode method created

Version 0.0.4 (17/09/2021)

  • More words added
  • caribe_corrector method created

Version 0.0.5 (17/09/2021)

  • Minor Dependency issues resolved

Version 0.0.6 (17/09/2021)

  • More Words and phrases added

Version 0.0.7 (21/09/2021)

  • Major bug fixed where individual letters in words were translated randomly
  • More words added to the corpus.

Version 0.0.8 (30/09/2021)

  • caribe_pos tagging method introduced.
  • pos_report method introduced.
  • remove_signs method introduced.

Version 0.1.0 (14/10/2021)

  • trinidad_encode method converts a Standard English sentence to a creolised form.
  • Caribe introduces dialect file encoding for text, json and csv files. This makes it possible to creolise NLP datasets.

Version 0.1.1 (20/10/2021)

  • More words added to the corpus.

Version 0.1.2 (27/10/2021)

  • caribe_pos members converted from string to tuple.
  • More words added to the corpus.

Version 0.1.5 (13/11/2021)

  • More words added to the corpus.

Version 0.1.8 (10/12/2021)

  • Introducing a new trained NLP grammar correction model "T5-KES".
  • Major Update in version 0.2.0 coming January.

Version 0.2.0 (06/02/2022)

  • New Sentence Corrector introduced: t5_kes_corrector()

Version 0.2.1 (07/02/2022)

  • More Words added to the corpus

Version 0.2.2 (13/02/2022)

  • Removal of a dependency that causes slower functionality

Version 0.2.4 (27/02/2022)

  • Introduced a new API function for the corrector, t5_kes_api_corrector(sentence), for faster performance

Version 0.2.9 (28/02/2022)

  • More Words added
  • Fixed an issue where certain words were not being encoded on NLP datasets.

Version 0.3.0 (03/03/2022)

  • Introducing the first ever parser for Trinidad and Tobago Creole English.
  • Model was trained on a combination of a custom dataset and a creolised JFLEG dataset using the file encoding feature.

Version 0.3.1 (06/03/2022)

  • Updated TTEC Parser model (Reduced training loss and trained on more custom data).
  • New API function for the parser, TTparser_api(), for faster performance.
  • Warnings/Issues: API function calls may return results that contradict the main function, or may sometimes raise Key Errors. Use the non-API functions for better results.

Version 0.3.4 (11/03/2022)

  • All API functions will be disabled in V-0.3.6 until further notice.

Version 0.3.6 (14/03/2022)

  • More words added to the corpus.
  • API function issues resolved.

