Caribbean English Creoles to Standard English

These details have not been verified by PyPI

Development Status
- 5 - Production/Stable
Intended Audience
- Education
License
- OSI Approved :: Apache Software License
Operating System
- Microsoft :: Windows :: Windows 10
Programming Language
- Python :: 3

Project description

Caribe

Caribe is a natural language processing Python library that takes Caribbean English Creoles and converts them to Standard English. Future updates will include the conversion of other Caribbean English Creole languages to Standard English and additional natural language processing methods.

Installation

Use the command below to install the package:

pip install Caribe

Visit the website: https://www.thecaribe.org/

Main Usage

Trinidad English Creole to English

import Caribe as cb

text= "Dem men doh kno wat dey doing wid d money bai"

output= cb.tec_translator(text)

print(output.tec_translate()) #Output: These men do not know what they are doing with the money.

English to Trinidad English Creole

import Caribe as cb

text= "Where are you going now?"

output= cb.english_to_tec(text)

print(output.translate()) #Output: Weh yuh going now

Guyanese English Creole to English

from Caribe import guyanese as gy

text= "Me and meh kozn waan ah job"

output= gy.gec_translator(text)

print(output.gec_translate()) #Output: Me and my cousin want a job.

Other Usages

Sample 1: Checks the English Creole input against known Creole phrases before decoding the sentence into a more standardized form of English. A corrector then checks for and fixes small grammatical errors.

# Sample 1
import Caribe as cb


sentence = "Dey have dey reasons"
standard = cb.phrase_decode(sentence)
standard = cb.trinidad_decode(standard)
fixed = cb.caribe_corrector(standard)
print(fixed) #Output: They have their reasons.

Sample 2: Checks the Trinidad English Creole input against known phrases.

# Sample 2 
import Caribe as cb


sentence = "Waz de scene"
standard = cb.phrase_decode(sentence)

print(standard) # Outputs: How are you

Sample 3: Checks the sentence for grammatical errors or incomplete words and corrects them.

#Sample 3
import Caribe as cb


sentence = "I am playin fotball outsde"
standard = cb.caribe_corrector(sentence)

print(standard) # Outputs: I am playing football outside

Sample 4: Performs part-of-speech tagging on Creole words.

#Sample 4
import Caribe as cb

sentence = "wat iz de time there"
analyse = cb.nlp()
output = analyse.caribe_pos(sentence)

print(output) # Outputs: ["('wat', 'PRON')", "('iz', 'VERB')", "('de', 'DET')", "('time', 'NOUN')", "('there', 'ADV')"]
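
The output above shows each tag pair as a string, while the changelog entry for version 0.1.2 says the members were converted to tuples. If the version you have installed returns strings in the form shown, they can be turned into real tuples with the standard library. This is a minimal sketch; `parse_pos_pairs` is a hypothetical helper and not part of Caribe.

```python
import ast


def parse_pos_pairs(pairs):
    """Convert items such as "('wat', 'PRON')" into real (word, tag) tuples.

    Items that are already tuples are passed through unchanged.
    """
    parsed = []
    for item in pairs:
        if isinstance(item, str):
            item = ast.literal_eval(item)  # safely evaluates the tuple literal
        parsed.append(tuple(item))
    return parsed


sample = ["('wat', 'PRON')", "('iz', 'VERB')", "('de', 'DET')"]
tags = parse_pos_pairs(sample)
print(tags)
```

Once the pairs are tuples, the words and tags can be unpacked directly, for example with `for word, tag in tags`.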

Sample 5: Removes punctuation marks.

#Sample 5
import Caribe as cb

sentence = "My aunt, Shelly is a lawyer!"
analyse = cb.remove_signs(sentence)


print(analyse) # Outputs: My aunt Shelly is a lawyer
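
For comparison, the same kind of clean-up can be done with the standard library alone. The sketch below is not how remove_signs is implemented (the source does not say); `strip_punctuation` is a hypothetical name, and it only removes ASCII punctuation.

```python
import string


def strip_punctuation(sentence):
    """Remove every ASCII punctuation character from the sentence."""
    table = str.maketrans("", "", string.punctuation)
    return sentence.translate(table)


cleaned = strip_punctuation("My aunt, Shelly is a lawyer!")
print(cleaned)  # My aunt Shelly is a lawyer
```

Note that this approach also removes apostrophes, so "sister's" becomes "sisters".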

Sample 6: Sentence Correction using T5-KES.

#Sample 6 Using t5_kes_corrector
import Caribe as cb


sentence = "Wat you doin for d the christmas"
correction = cb.t5_kes_corrector(sentence)


print(correction) # Output: What are you doing for christmas?

Sample 7: Sentence Correction using Decoder and T5-KES.

#Sample 7 Using t5_kes_corrector and decoder
import Caribe as cb


sentence = "Ah want ah phone for d christmas"
decoded= cb.trinidad_decode(sentence)
correction = cb.t5_kes_corrector(decoded)


print(correction) # Output: I want a phone for christmas.

Sample 8: Sentence Capitalisation.

#Sample 8 Using Caribe sentence capitalization model
import Caribe as cb


sentence = "john is a boy. he is 12 years old. his sister's name is Joy."

capitalized_text= cb.capitalize(sentence)

print(capitalized_text) # Output: John is a boy. He is 12 years old. His sister's name is Joy.

Additional Information
- trinidad_decode(): Decodes the sentence as a whole string.
- guyanese_decode(): Decodes the sentence as a whole string.
- trinidad_decode_split(): Decodes the sentence word by word.
- phrase_decode(): Decodes the sentence against known Creole phrases.
- caribe_corrector(): Corrects grammatical errors in a sentence using a trained NLP model.
- t5_kes_corrector(): Corrects grammatical errors in a sentence using the trained T5-KES model.
- trinidad_encode(): Encodes a sentence to Trinidadian English Creole.
- guyanese_encode(): Encodes a sentence to Guyanese English Creole.
- trinidad_direct_translation(): Translates Trinidad English Creole to English.
- capitalize(): Capitalizes groups of sentences using an NLP model.
- caribe_pos(): Performs part-of-speech tagging on Creole words.
- pos_report(): Performs part-of-speech tagging on English words.
- remove_signs(): Takes any sentence and removes punctuation marks.

File Encodings on NLP datasets

Caribe introduced file encoding (Beta) in version 0.1.0. This allows a dataset in any supported file type to be translated into Trinidad English Creole. The file encoding feature supports only txt, json and csv files.

Usage of File Encodings:

import Caribe as cb

convert = cb.file_encode("test.txt", "text")
# Generates a translated text file
convert = cb.file_encode("test.json", "json")
# Generates a translated json file
convert = cb.file_encode("test.csv", "csv")
# Generates a translated csv file
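
To make the idea concrete, the sketch below applies a word mapping to text, JSON and CSV content using only the standard library. It is an illustration under stated assumptions, not Caribe's implementation: `WORD_MAP`, `encode_text`, `encode_json` and `encode_csv` are hypothetical names, the two-entry mapping is made up for the example, and the functions work on strings rather than on files.

```python
import csv
import io
import json
import re

# Hypothetical two-entry mapping, for illustration only.
WORD_MAP = {"the": "d", "my": "meh"}


def encode_text(text, word_map=WORD_MAP):
    """Replace each whole word found in word_map (case-insensitive lookup)."""
    def swap(match):
        word = match.group(0)
        return word_map.get(word.lower(), word)
    return re.sub(r"[A-Za-z']+", swap, text)


def encode_json(raw, word_map=WORD_MAP):
    """Encode every string value in a JSON document, keeping keys intact."""
    def walk(node):
        if isinstance(node, str):
            return encode_text(node, word_map)
        if isinstance(node, list):
            return [walk(item) for item in node]
        if isinstance(node, dict):
            return {key: walk(value) for key, value in node.items()}
        return node
    return json.dumps(walk(json.loads(raw)))


def encode_csv(raw, word_map=WORD_MAP):
    """Encode every cell of a CSV document (header row included)."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for row in csv.reader(io.StringIO(raw)):
        writer.writerow([encode_text(cell, word_map) for cell in row])
    return out.getvalue()


print(encode_text("Where is my phone?"))
print(encode_json('{"text": "the car"}'))
print(encode_csv("id,text\n1,my house\n"))
```

Parsing JSON and CSV with their own modules, rather than running a plain search and replace over the raw file, keeps keys, delimiters and quoting intact while only the text values are changed.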

First Parser for the Trinidad English Creole Language

This model utilises the pre-trained T5-base model. It was fine-tuned on a combination of a custom dataset and a creolised JFLEG dataset. The JFLEG dataset was translated using the file encoding feature of the library.

Within the Creole continuum, there exist different levels of lects. These include:

Acrolect: The version of a language closest to standard international English.
Mesolect: The version that consists of a mixture of acrolectal and basilectal features.
Basilect: The version closest to a Creole language.

This NLP task was difficult because the structure of the local dialect is not standardised and often relies on the structure of its lexifier (English). Spelling also varies from speaker to speaker. Additionally, Creole words and spellings are not initially present in the vector space, which made training and optimization take longer.

Results

Initial results have been mixed.

Original Text	Parsed Text	Expected or Correctly Parsed Text
Ah have live with mi paremnts en London.	Ah live with meh parents in London.	Ah live with meh parents in London.
Ah can get me fone?	Ah cud get meh fone?	Ah cud get meh fone?
muh moda an fada is nt relly home	muh moda an fada is nt relly home.	muh moda an fada not really home.
Me ah go market	Ah going tuh d market.	Ah going tuh d market. / I going tuh de market.
Ah waz a going tu school.	Ah going to school.	Ah going to school.
Ah don't like her.	Ah doh like she.	Ah doh like she. / I doh like she.
Ah waz thinking bout goeng tuh d d Mall.	Ah thinking bout going tuh d Mall.	Ah thinking bout going tuh d mall.
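
One simple way to summarise the table is to count how many parsed sentences exactly match an accepted answer. The sketch below does this for the rows shown; `exact_match` is a hypothetical helper, and exact matching is an assumption made for illustration, since the source does not say how the results were scored.

```python
def exact_match(parsed, expected, ignore_case=False):
    """Return True if parsed equals any '/'-separated expected alternative."""
    alternatives = [alt.strip() for alt in expected.split("/")]
    if ignore_case:
        return parsed.lower() in [alt.lower() for alt in alternatives]
    return parsed in alternatives


# (parsed text, expected text) pairs copied from the table above.
rows = [
    ("Ah live with meh parents in London.", "Ah live with meh parents in London."),
    ("Ah cud get meh fone?", "Ah cud get meh fone?"),
    ("muh moda an fada is nt relly home.", "muh moda an fada not really home."),
    ("Ah going tuh d market.", "Ah going tuh d market. / I going tuh de market."),
    ("Ah going to school.", "Ah going to school."),
    ("Ah doh like she.", "Ah doh like she. / I doh like she."),
    ("Ah thinking bout going tuh d Mall.", "Ah thinking bout going tuh d mall."),
]

strict = sum(exact_match(p, e) for p, e in rows)
relaxed = sum(exact_match(p, e, ignore_case=True) for p, e in rows)
print(f"{strict}/{len(rows)} strict, {relaxed}/{len(rows)} ignoring case")
```

The gap between the two counts comes from the last row, where only the capitalisation of "Mall" differs.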

Usage of the TrinEC Parser

import Caribe as cb

text= "Ah have live with mi paremnts en London"

s= cb.Parser(text)

print(s.TT_Parser()) #Output: Ah live with meh parents in London.

Trinidad English Creole to English Translator using the T5 model

A model was fine-tuned (supervised) on a custom dataset to translate from Trinidad English Creole to English. This was done as an alternative to the method that combines dictionary decoding with sentence correction. Future testing will compare the two methods.

import Caribe as cb

text= "Dem men doh kno wat dey doing wid d money bai"

output= cb.tec_translator(text)

print(output.tec_translate()) #Output: These men do not know what they are doing with the money.

Dictionary Data

The encoder and decoder utilise a dictionary data structure. The data for the dictionary was gathered by web scraping social media sites and other websites, and from Lise Winer's Dictionary of the English/Creole of Trinidad & Tobago among other scholarly resources.
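
As a rough picture of how a dictionary-backed decoder can work, the sketch below replaces known phrases first and then looks up the remaining words one by one. It is not Caribe's code: `PHRASES`, `WORDS` and `decode` are hypothetical, and the handful of entries is made up for the example.

```python
import re

# Hypothetical entries, for illustration only; real dictionaries are far larger.
PHRASES = {"waz de scene": "how are you"}
WORDS = {"dey": "they", "doh": "do not", "wid": "with", "d": "the"}


def decode(sentence, phrases=PHRASES, words=WORDS):
    """Replace known phrases first, then remaining words, one token at a time."""
    text = sentence.lower()
    for creole, english in phrases.items():
        text = re.sub(rf"\b{re.escape(creole)}\b", english, text)
    # Words (apostrophes included) and punctuation marks become separate tokens.
    tokens = re.findall(r"[\w']+|[^\w\s]", text)
    decoded = [words.get(token, token) for token in tokens]
    # Re-join with spaces, then pull punctuation back onto the previous word.
    return re.sub(r"\s+([^\w\s])", r"\1", " ".join(decoded))


print(decode("Dey doh play wid d ball."))
print(decode("Waz de scene"))
```

Matching phrases before single words matters because a phrase such as "waz de scene" has a meaning that cannot be recovered by translating its words separately. A lookup like this leaves grammar untouched, which is why a correction step is applied afterwards.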

Fine-tune a T5 model on custom datasets more easily using Caribe, built on HuggingFace Transformers APIs

Training Section:

Caribe allows any user to fine-tune a T5 model on a custom dataset. The snippet below trains and generates a model in the "model/" folder. Please ensure that your training and evaluation datasets are in the recommended format before training. For more information, check the T5 documentation.

from Caribe import T5_Caribe as t5

model = t5.T5_Trainer("train_dataset.csv", "eval_dataset.csv")
connect = model.connect_datasets("csv")
train = model.caribe_training(output_path="./content", epochs=10, eval_strategy="steps", decay=0.01, l_rate=2e-5, train_batch_size=8, eval_batch_size=8, checkpoints=2)

Parameters:

epochs: Number of passes over the training dataset.
eval_strategy: When evaluation is run during training, either every set number of 'steps' or at the end of each 'epoch'.
decay: Weight decay, a regularization applied to the model weights during training.
l_rate: Learning rate.
train_batch_size: Number of samples from the training data per iteration.
eval_batch_size: Number of samples from the evaluation data per iteration.
checkpoints: Number of model versions (checkpoints) produced and saved during training.
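
The batch size and epoch count together determine how long a run is. As a back-of-the-envelope sketch, the number of optimizer steps can be estimated as below, assuming a single device, no gradient accumulation and a final partial batch that is kept; `total_training_steps` is a hypothetical helper, and the dataset size is made up.

```python
import math


def total_training_steps(num_examples, train_batch_size, epochs):
    """Optimizer steps for a run with no gradient accumulation on one device."""
    steps_per_epoch = math.ceil(num_examples / train_batch_size)
    return steps_per_epoch * epochs


# Hypothetical dataset of 1,000 rows with the batch size and epochs used above.
print(total_training_steps(1000, train_batch_size=8, epochs=10))  # 1250
```

Knowing the step count helps when choosing 'steps' as the evaluation strategy, since evaluation then happens at intervals measured in these steps.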

Generating Text Section:

Generates text from the output of the model. Please note that the code below will raise an error unless you have already trained a model and generated the model folder, or have a pre-existing model folder with the required files.

from Caribe import caribe_generate

g = caribe_generate.generate_text("eng:How are you", temperature=1.7, num_beams=10)
output=g.output()
print(output)

Parameters:

min_length: Minimum number of generated tokens.
max_length: Maximum number of generated tokens.
do_sample: When True, samples the next token from the model's probability distribution instead of always taking the most likely token.
early_stopping: When True, stops the beam search once at least num_beams sentences have been completed for each batch.
num_beams: Number of beams (candidate sequences) kept during beam search.
temperature: Value used to scale the next-token probabilities; higher values make the output more random.
top_k: Number of highest-probability tokens kept for top-k sampling.
top_p: When set below 1, only the smallest set of most probable tokens whose probabilities add up to top_p or higher is kept.
no_repeat_ngram_size: When set above 0, every n-gram of that size can occur only once.
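
To show how temperature, top_k and top_p interact when do_sample is True, the sketch below filters a toy next-token distribution in plain Python. It is a simplified illustration, not the code used by Caribe or HuggingFace Transformers; `next_token_distribution` is a hypothetical function and the logits are made up.

```python
import math


def next_token_distribution(logits, temperature=1.0, top_k=0, top_p=1.0):
    """Return {token: probability} after temperature, top-k and top-p filtering."""
    # Temperature rescales the logits before the softmax.
    scaled = {tok: value / temperature for tok, value in logits.items()}
    peak = max(scaled.values())
    exps = {tok: math.exp(value - peak) for tok, value in scaled.items()}
    total = sum(exps.values())
    ranked = sorted(((tok, e / total) for tok, e in exps.items()),
                    key=lambda pair: pair[1], reverse=True)
    if top_k > 0:
        ranked = ranked[:top_k]  # keep only the k most likely tokens
        mass = sum(prob for _, prob in ranked)
        ranked = [(tok, prob / mass) for tok, prob in ranked]
    kept, cumulative = [], 0.0
    for tok, prob in ranked:
        kept.append((tok, prob))
        cumulative += prob
        if cumulative >= top_p:  # smallest set whose mass reaches top_p
            break
    mass = sum(prob for _, prob in kept)
    return {tok: prob / mass for tok, prob in kept}


logits = {"you": 2.0, "yuh": 1.5, "we": 0.2, "dem": -1.0}
print(next_token_distribution(logits, temperature=1.7, top_k=3, top_p=0.9))
```

A temperature above 1 flattens the distribution, so more tokens survive the top_p cut; a temperature below 1 sharpens it. The next token is then sampled from the probabilities that remain.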

Caribe_Corrector (T5-KES) vs Gingerit Corrector

Initial tests were carried out with performance factors that measure the accuracy of the correction and the degree of sentence distortion from the correct sentence. These tests showed that the T5 corrector was more accurate, distorted sentences less and attained higher MT scores. The T5 corrector also outperforms Gingerit on positional translations, as shown in the table below.

Original Creole Text	Decoded Sentence	Caribe Corrector (T5-KES)	Gingerit Corrector	Correct Output
Ah need ah car fuh meh birthday	I need I car for me birthday	I need a car for my birthday.	I need my car for my birthday	I need a car for my birthday
Wah iz d time on yuh side?	want is the time on you side?	What is the time on your side?	Want is the time on your side?	What is the time on your side?
Ah man is d provider of de house	I man is the provider of the house	A man is the provider of the house.	My man is the provider of the house	A man is the provider of the house.
Ah orange issa fruit	I orange is a fruit	An orange is a fruit.	My orange is a fruit	An orange is a fruit.
Wah time yuh wah come over?	want time you want come over?	What time do you want to come over?	Want time you want to come over?	What time do you want to come over?
Dey make dey own choices at the end of de day	their make their own choices at the end of the day	They make their own choices at the end of the day.	they make their own choices at the end of the day.	They make their own choices at the end of the day.
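
The source does not state how sentence distortion was measured. One possible proxy is the word-level edit distance between a corrector's output and the correct sentence, sketched below on the first row of the table. `word_edit_distance` is a hypothetical helper, and ignoring case and final punctuation is an assumption made for this example.

```python
def word_edit_distance(candidate, reference):
    """Levenshtein distance counted in words, ignoring case and final punctuation."""
    def words(text):
        return text.lower().rstrip(".?!").split()

    a, b = words(candidate), words(reference)
    previous = list(range(len(b) + 1))
    for i, word_a in enumerate(a, start=1):
        current = [i]
        for j, word_b in enumerate(b, start=1):
            cost = 0 if word_a == word_b else 1
            current.append(min(previous[j] + 1,          # delete word_a
                               current[j - 1] + 1,       # insert word_b
                               previous[j - 1] + cost))  # substitute
        previous = current
    return previous[-1]


reference = "I need a car for my birthday"
print(word_edit_distance("I need a car for my birthday.", reference))  # 0
print(word_edit_distance("I need my car for my birthday", reference))  # 1
print(word_edit_distance("I need I car for me birthday", reference))   # 2
```

A distance of 0 means the output matches the correct sentence; larger values mean more words had to be changed, inserted or deleted to reach it.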

Guyanese English Creole Features

Caribe introduces decoding, encoding and file encoding for Guyanese English Creole, along with translation of the Guyanese dialect.

Guyanese English Creole to English

from Caribe import guyanese as gy

text= "Me and meh kozn waan ah job"

output= gy.gec_translator(text)

print(output.gec_translate()) #Output: Me and my cousin want a job.

Decoding a sentence

# Decoding a creole sentence

from Caribe import guyanese as gy

sentence = "waam star, me ga fu go di markit"
output = gy.guyanese_decode(sentence)
print(output) # Output: what going on brother, me have to go the market

Encoding a sentence

# Encoding a sentence

from Caribe import guyanese as gy

sentence = "I do not want nothing to do with him"
output = gy.guyanese_encode(sentence)
print(output) # Output: ah du not waan notn tuh du wid him

File Encodings on NLP datasets using Guyanese English Creole:

from Caribe import guyanese as gy

convert = gy.guyanese_file_encode("test.txt", "text")
# Generates a translated text file
convert = gy.guyanese_file_encode("test.json", "json")
# Generates a translated json file
convert = gy.guyanese_file_encode("test.csv", "csv")
# Generates a translated csv file

News, Issues and Future Plans (14/08/2022)

Datasets are continuously being updated.
NLP Models and Dictionaries are continuously updated.
Future plans include creating translations, models and datasets for Caribbean French and Spanish Creoles to their respective lexifiers (requires extensive research).
Some users complained of problems importing some of the dependencies. This is currently being monitored (10/06/2022).
New model introduced for sentence capitalization (09/06/2022).
New model introduced for direct translation from Trinidad English Creole (TEC) to English (26/06/2022).
New model introduced for direct translation from Guyanese English Creole (GEC) to English (26/09/2022).
The gingerit_corrector function is deprecated.

Contact

For any concerns or issues with this library, use the details below.

Email: keston.smith@my.uwi.edu

Website: https://www.thecaribe.org/

CHANGELOG

Version 0.0.1 (16/09/2021)

Initial Release

Version 0.0.2 (16/09/2021)

Minor bugs fixed
More words added

Version 0.0.3 (16/09/2021)

Minor bugs fixed
More words added
phrase_decode method created

Version 0.0.4 (17/09/2021)

More words added
caribe corrector method created

Version 0.0.5 (17/09/2021)

Minor Dependency issues resolved

Version 0.0.6 (17/09/2021)

More Words and phrases added

Version 0.0.7 (21/09/2021)

Major bug fixed where individual letters in words were translated randomly
More words added to the corpus.

Version 0.0.8 (30/09/2021)

caribe_pos tagging method introduced.
pos_report method introduced.
remove_signs method introduced.

Version 0.1.0 (14/10/2021)

trinidad_encode method converts standard english sentence to a creolised form.
Caribe introduces dialect file encoding on text, json and csv files. This makes it possible to creolise NLP datasets.

Version 0.1.1 (20/10/2021)

More words added to the corpus.

Version 0.1.2 (27/10/2021)

caribe_pos members converted from string to tuple.
More words added to the corpus.

Version 0.1.5 (13/11/2021)

More words added to the corpus.

Version 0.1.8 (10/12/2021)

Introducing a new trained NLP grammar correction model "T5-KES".
Major Update in version 0.2.0 coming January.

Version 0.2.0 (06/02/2022)

New Sentence Corrector introduced: t5_kes_corrector()

Version 0.2.1 (07/02/2022)

More Words added to the corpus

Version 0.2.2 (13/02/2022)

Removal of a dependency that causes slower functionality

Version 0.2.4 (27/02/2022)

Introduced a new api function to the corrector: t5_kes_api_corrector(sentence) for faster performance

Version 0.2.9 (28/02/2022)

More Words added
Fixed an issue where certain words were not being encoded on NLP datasets.

Version 0.3.0 (03/03/2022)

Introducing the first ever parser for Trinidad and Tobago Creole English.
Model was trained on a combination of a custom dataset and a creolised JFLEG dataset using the file encoding feature.

Version 0.3.1 (06/03/2022)

Updated TTEC Parser model (Reduced training loss and trained on more custom data).
New API function for the parser for faster performance: TTparser_api().
Warnings/Issues: API function calls may give results that contradict the main function or may sometimes result in Key Errors. Use non-API functions to achieve better results.

Version 0.3.4 (11/03/2022)

All API functions will be disabled in V-0.3.6 until further notice.

Version 0.3.6 (14/03/2022)

More words added to the corpus.
Api functions issues resolved.

Version 0.4.7 (20/03/2022)

NLP model retrained and improved.
Official Logo Added.

Version 0.5.0 (27/03/2022)

More Words Added.

Version 0.5.1 (21/04/2022)

More Words Added.
trinidad_direct_translation function added to directly translate Trinidad English Creole to English.
caribe_corrector function will no longer use gingerit; it now uses the T5 model instead.
gingerit_corrector function added to facilitate the gingerit model.

Version 0.5.3 (05/06/2022)

More Words Added.
Trinidad English Creole Parser Model updated.

Version 0.5.5 (09/06/2022)

New NLP model introduced for sentence capitalization.
T5-KES sentence correction model updated.

Version 0.5.9 (10/06/2022)

Resolving minor dependency issues.

Version 0.6.4 (26/06/2022)

Introducing a NEW NLP model to directly translate Trinidad English Creole to English.

Version 0.6.6 (29/06/2022)

More Words Added.

Version 0.6.8 (02/07/2022)

More Words Added
Model Updated
Entries Changed or Removed

Version 0.7.5 (31/07/2022)

Introducing an easier method for fine-tuning a T5 model on custom datasets.

Version 0.7.7 (14/08/2022)

Introducing Guyanese English Creole

Version 0.7.8 (04/09/2022)

More Words Added

Version 0.7.9 (26/09/2022)

Introducing a NEW NLP model to directly translate Guyanese English Creole to English.

Version 0.8.1 (02/10/2022)

Introducing a NEW NLP model to directly translate Standard English to Trinidad English Creole.

Version 0.9.0 (08/08/2023)

Resolved dependency issues with cloud computing services where the service refused to install the latest version of package because of a missing/deleted dependency.
The gingerit_corrector function is deprecated from version 0.9.0 and above.
Gingerit package no longer exists. Dependency removed.

Version 0.9.2 (02/12/2023)

More words added to Guyanese creole corpus.
More words added to Trinidad Creole corpus.

Version 0.9.4 (15/07/2024)

More words added to Trinidad Creole corpus.
Fixed minor dependency issues.

Release history

0.9.4

Jul 15, 2024

0.9.2

Dec 2, 2023

0.9.0

Aug 9, 2023

0.8.9

Jan 26, 2023

0.8.8

Dec 26, 2022

0.8.5

Oct 24, 2022

0.8.3

Oct 3, 2022

0.8.1

Oct 2, 2022

0.8.0

Sep 28, 2022

0.7.8

Sep 4, 2022

0.7.7

Aug 15, 2022

0.7.5

Jul 31, 2022

0.7.0

Jul 28, 2022

0.6.9

Jul 6, 2022

0.6.8

Jul 2, 2022

0.6.6

Jun 29, 2022

0.6.4

Jun 26, 2022

0.6.3

Jun 26, 2022

0.6.2

Jun 24, 2022

0.5.9

Jun 11, 2022

0.5.5

Jun 10, 2022

0.5.4

Jun 5, 2022

0.5.2

May 31, 2022

0.5.1

Apr 21, 2022

0.5.0

Mar 28, 2022

0.4.9

Mar 27, 2022

0.4.8

Mar 24, 2022

0.4.7

Mar 21, 2022

0.4.2

Mar 19, 2022

0.3.8

Mar 17, 2022

0.3.7

Mar 16, 2022

0.3.6

Mar 15, 2022

0.3.4

Mar 11, 2022

0.3.3

Mar 11, 2022

0.3.2

Mar 10, 2022

0.3.1

Mar 6, 2022

0.3.0

Mar 3, 2022

0.2.9

Feb 28, 2022

0.2.8

Feb 28, 2022

0.2.4

Feb 28, 2022

0.2.2

Feb 13, 2022

0.2.1

Feb 7, 2022

0.2.0

Feb 6, 2022

0.1.9

Jan 13, 2022

0.1.8

Dec 12, 2021

0.1.5

Nov 13, 2021

0.1.4

Oct 28, 2021

0.1.3 yanked

Oct 27, 2021

0.1.2

Oct 27, 2021

0.1.1

Oct 20, 2021

0.1.0

Oct 14, 2021

0.0.8

Sep 30, 2021

0.0.7

Sep 21, 2021

0.0.6

Sep 18, 2021

0.0.5

Sep 17, 2021

0.0.4

Sep 17, 2021

0.0.3

Sep 16, 2021

0.0.2

Sep 16, 2021

0.0.1

Sep 16, 2021

Download files

Source Distribution

Caribe-0.9.4.tar.gz (27.4 kB view details)

Uploaded Jul 15, 2024 Source

File details

Details for the file Caribe-0.9.4.tar.gz.

File metadata

Download URL: Caribe-0.9.4.tar.gz
Upload date: Jul 15, 2024
Size: 27.4 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/3.4.2 importlib_metadata/4.8.1 pkginfo/1.7.1 requests/2.25.1 requests-toolbelt/0.9.1 tqdm/4.62.3 CPython/3.9.0

File hashes

Hashes for Caribe-0.9.4.tar.gz
Algorithm	Hash digest
SHA256	`659475a52fce78101901c5edfe5544063e033038c9091b2b1895de4b34b042df`
MD5	`a1e5e6c8565b2daffdf0361e6624477c`
BLAKE2b-256	`0f84a09e041652ddfdc30fdc270aaa1f36a3444742fe401073e8b6bd90cc3637`
