A Python library for Creole text preprocessing

CreoleNLTK: Creole Natural Language Toolkit

CreoleNLTK is a Python library for preprocessing Creole text ahead of natural language processing (NLP) tasks. It provides text cleaning, tokenization, lowercasing, stopword removal, contraction expansion, spell checking, part-of-speech tagging, and number-to-word conversion.

Features

Spelling Check: Identify and correct spelling errors.
Part-of-Speech Tagging: Assign part-of-speech tags (e.g., noun, verb, adjective) to words in Creole sentences.
Number to Words (Cardinal and Ordinal): Convert numbers to their cardinal or ordinal word forms.
Contraction to Expansion: Expand contractions in the text.
Stopword Removal: Remove common words that do not contribute much to the meaning.
Tokenization: Break the text into words or tokens.
Text Cleaning: Remove unwanted characters and clean the text.

Installation

You can install CreoleNLTK using pip:

pip install creolenltk
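To confirm that the install succeeded, the standard library can report the installed version. This is a minimal sketch; the helper name `installed_version` is illustrative and is not part of CreoleNLTK.

```python
from importlib import metadata


def installed_version(package="creolenltk"):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


version = installed_version()
print(version if version else "creolenltk is not installed")
```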

Usage

Spelling Checker

from creolenltk.spelling_checker import SpellingChecker

# Initialize the spelling checker
spell_checker = SpellingChecker()

# Correct spelling errors in a word
corrected_word = spell_checker.correction('òtgraf')

print(f"Original Word: òtgraf, Corrected Word: {corrected_word}") # òtograf
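For intuition about what a correction step does, here is a toy sketch built on the standard library's `difflib`. It is not how `SpellingChecker` is implemented (the source does not say), and the vocabulary, the `suggest` helper, and the cutoff value are all assumptions made for illustration.

```python
import difflib

# Hypothetical toy vocabulary; a real checker would load a full Creole lexicon.
VOCABULARY = ["òtograf", "fraz", "senp", "chat", "manje"]


def suggest(word, vocabulary=VOCABULARY, cutoff=0.8):
    """Return the closest vocabulary entry, or the word unchanged if none is close."""
    if word in vocabulary:
        return word
    matches = difflib.get_close_matches(word, vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else word


print(suggest("òtgraf"))  # òtograf
```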

Number to Words (Cardinal and Ordinal)

from creolenltk.num2word import CreoleNumberConverter

# Initialize the number converter
num_converter = CreoleNumberConverter()

# Convert numbers to cardinal words
print(num_converter.number_to_word(2024))  # de mil venntkat

# Convert numbers to ordinal words
print(num_converter.number_to_ordinal(21))  # venteyinyèm

# Replace numbers in text with cardinal form
text = "Mwen genyen 3 chat ak 21 ti chen."
converted = num_converter.replace_cardinals_in_text(text)
print(converted)  # Mwen genyen twa chat ak venteyen ti chen.
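The in-text replacement above can be pictured as a regular-expression substitution with a converter callback. The sketch below is an assumption about the general technique, not the library's code: `replace_numbers`, `lookup`, and the two-entry `WORDS` table (copied from the example output above) are hypothetical stand-ins for a real converter.

```python
import re

# Word forms copied from the example above; a real converter computes them.
WORDS = {3: "twa", 21: "venteyen"}


def lookup(number):
    """Toy converter: fall back to the digits when the table has no entry."""
    return WORDS.get(number, str(number))


def replace_numbers(text, to_word):
    """Replace each run of digits in text with to_word(number)."""
    return re.sub(r"\d+", lambda match: to_word(int(match.group())), text)


print(replace_numbers("Mwen genyen 3 chat ak 21 ti chen.", lookup))
# Mwen genyen twa chat ak venteyen ti chen.
```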

Part-of-Speech Tagging

from creolenltk.pos_tagger import PosTagger

# Initialize the POS tagger (uses the trained Creole model)
tagger = PosTagger()

sentence = "Mwen renmen Ayiti anpil."
tags = tagger.tag(sentence)
print(tags)  # [('Mwen', 'PRON'), ('renmen', 'VERB'), ('Ayiti', 'PROPN'), ('anpil', 'ADV'), ('.', 'PUNCT')]

sentence2 = "Poukisa panse yo chanje?"
tags2 = tagger.tag(sentence2)
print(tags2)  # [('Poukisa', 'ADV'), ('panse', 'VERB'), ('yo', 'PRON'), ('chanje', 'VERB'), ('?', 'PUNCT')]
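Because the tagger returns a list of (word, tag) pairs, its output can be filtered or counted with plain Python. The sketch below hardcodes the first result shown above so that it runs without the library; `words_with_tag` is an illustrative helper, not part of CreoleNLTK.

```python
from collections import Counter

# Tagger output copied from the first example above.
tags = [("Mwen", "PRON"), ("renmen", "VERB"), ("Ayiti", "PROPN"),
        ("anpil", "ADV"), (".", "PUNCT")]


def words_with_tag(tagged, wanted):
    """Return the words whose tag equals wanted, in sentence order."""
    return [word for word, tag in tagged if tag == wanted]


tag_counts = Counter(tag for _, tag in tags)

print(words_with_tag(tags, "VERB"))  # ['renmen']
print(tag_counts["PUNCT"])           # 1
```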

Contraction to Expansion

from creolenltk.contraction_expansion import ContractionToExpansion

# Initialize the contraction expander
contraction_expander = ContractionToExpansion()

# Expand contractions in a sentence
original_sentence = "L'ap manje. m'ap rete lakay mw."
expanded_sentence = contraction_expander.expand_contractions(original_sentence)

print(f"Original Sentence: {original_sentence}\nExpanded Sentence: {expanded_sentence}") # li ap manje. mwen ap rete lakay mwen.
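One common way to expand contractions is a lookup table applied through a single regular expression. The sketch below illustrates that idea only; it is not the library's implementation, and the three-entry `CONTRACTIONS` table (derived from the example above) and the `expand` helper are assumptions.

```python
import re

# Toy table derived from the example above; a real table would be far larger.
CONTRACTIONS = {"l'ap": "li ap", "m'ap": "mwen ap", "mw": "mwen"}

# Longest keys first so longer contractions win over shorter prefixes.
_KEYS = sorted(CONTRACTIONS, key=len, reverse=True)
_PATTERN = re.compile(
    r"(?<!\w)(" + "|".join(re.escape(key) for key in _KEYS) + r")(?!\w)",
    re.IGNORECASE,
)


def expand(text):
    """Replace each known contraction with its lowercase expanded form."""
    return _PATTERN.sub(lambda match: CONTRACTIONS[match.group(1).lower()], text)


print(expand("L'ap manje. m'ap rete lakay mw."))
# li ap manje. mwen ap rete lakay mwen.
```

The lookarounds keep "mw" from matching inside a longer word such as "mwen".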

Stopword Removal

from creolenltk.stopword import Stopword

# Initialize the stopword handler
stopword_handler = Stopword()

# Remove stopwords from a sentence
sentence_with_stopwords = "Sa se yon fraz tès ak kèk stopwords nan Kreyòl Ayisyen."
sentence_without_stopwords = stopword_handler.remove_stopwords(sentence_with_stopwords)

print(f"Sentence with Stopwords: {sentence_with_stopwords}\nWithout Stopwords: {sentence_without_stopwords}") # fraz tès stopwords Kreyòl Ayisyen.
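Conceptually, stopword removal is a set-membership filter over tokens. The following sketch shows that filter in isolation; the `STOPWORDS` set contains only the words dropped in the example above, and `drop_stopwords` is a hypothetical helper rather than the library's method.

```python
import string

# Only the words removed in the example above; the library's list is not shown here.
STOPWORDS = {"sa", "se", "yon", "ak", "kèk", "nan"}


def drop_stopwords(sentence, stopwords=STOPWORDS):
    """Keep whitespace-separated tokens whose normalized form is not a stopword."""
    kept = [
        token for token in sentence.split()
        # Compare case-insensitively and ignore attached punctuation.
        if token.strip(string.punctuation).lower() not in stopwords
    ]
    return " ".join(kept)


print(drop_stopwords("Sa se yon fraz tès ak kèk stopwords nan Kreyòl Ayisyen."))
# fraz tès stopwords Kreyòl Ayisyen.
```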

Tokenizer

from creolenltk.tokenizer import Tokenizer

# Initialize the tokenizer
tokenizer = Tokenizer()

# Tokenize a sentence
sentence = "Sa se yon fraz senp"
tokens = tokenizer.word_tokenize(sentence, expand_contractions=True, lowercase=True)

print(f"Sentence: {sentence}\nTokens: {tokens}") # ["sa", "se", "yon", "fraz", "senp"]
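To show what word-level tokenization with optional lowercasing involves, here is a small regex-based sketch. It is an assumption about the general approach, not the behavior of `Tokenizer.word_tokenize`; `simple_tokenize` and `TOKEN_PATTERN` are illustrative names.

```python
import re

# A word (optionally with internal apostrophes) or a single punctuation mark.
TOKEN_PATTERN = re.compile(r"\w+(?:'\w+)*|[^\w\s]")


def simple_tokenize(text, lowercase=False):
    """Split text into word and punctuation tokens."""
    if lowercase:
        text = text.lower()
    return TOKEN_PATTERN.findall(text)


print(simple_tokenize("Sa se yon fraz senp", lowercase=True))
# ['sa', 'se', 'yon', 'fraz', 'senp']
print(simple_tokenize("Mwen renmen Ayiti anpil."))
# ['Mwen', 'renmen', 'Ayiti', 'anpil', '.']
```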

For more detailed usage and examples, refer to the documentation.

License

MIT licensed. See the bundled LICENSE file for more details.

Release history

1.0.10 (this version): Jul 11, 2025
1.0.8: Jun 27, 2025
1.0.6: Jun 27, 2025
1.0.4: Jun 26, 2025
1.0.3: Feb 1, 2024
1.0.0: Feb 1, 2024

Download files

Source Distribution

creolenltk-1.0.10.tar.gz (17.4 MB), uploaded Jul 11, 2025, Source

Built Distribution

creolenltk-1.0.10-py3-none-any.whl (17.4 MB), uploaded Jul 11, 2025, Python 3

File details

Details for the file creolenltk-1.0.10.tar.gz.

File metadata

Download URL: creolenltk-1.0.10.tar.gz
Upload date: Jul 11, 2025
Size: 17.4 MB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.10.18

File hashes

Hashes for creolenltk-1.0.10.tar.gz
Algorithm	Hash digest
SHA256	`7d1a4364388b8210eded0030199c7848a01ee1a3cc718c3034a351b6862c4322`
MD5	`062d1224a9c6e139bbd71970840fc816`
BLAKE2b-256	`5aaf8d239b6fb9668070fda61873ae0fa1ddfa01dd48946738693f7456202186`
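A downloaded archive can be checked against the SHA256 digest listed above with the standard library's `hashlib`. This is a minimal sketch: the helper names are illustrative, and the local path assumes the file was saved under its published name in the current directory.

```python
import hashlib
from pathlib import Path

# SHA256 digest copied from the table above.
EXPECTED_SHA256 = "7d1a4364388b8210eded0030199c7848a01ee1a3cc718c3034a351b6862c4322"


def sha256_of_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so large archives are not loaded into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_digest(path, expected_hex):
    """Return True when the file's SHA256 digest equals expected_hex."""
    return sha256_of_file(path) == expected_hex.strip().lower()


# Assumed local path; the check is skipped if the file is not present.
archive = Path("creolenltk-1.0.10.tar.gz")
if archive.exists():
    print("digest OK" if matches_digest(archive, EXPECTED_SHA256) else "digest MISMATCH")
```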

File details

Details for the file creolenltk-1.0.10-py3-none-any.whl.

File metadata

Download URL: creolenltk-1.0.10-py3-none-any.whl
Upload date: Jul 11, 2025
Size: 17.4 MB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.10.18

File hashes

Hashes for creolenltk-1.0.10-py3-none-any.whl
Algorithm	Hash digest
SHA256	`b3bac9aaaddb5759346f1fa2366f84704f3cb86f8075fc6d4533447af7c4cda9`
MD5	`65d5dc2bc3afe3c3bd72bc336c5de135`
BLAKE2b-256	`02e6b9b30495372fcd2758f0a30758e903ebe5163178c0da9988bed4ebab492d`

