A Python library for Creole text preprocessing
CreoleNLTK: Creole Natural Language Toolkit
CreoleNLTK is a Python library for preprocessing Creole text. It provides tools that prepare text data for natural language processing (NLP) tasks: text cleaning, tokenization, lowercasing, stopword removal, contraction expansion, spelling correction, part-of-speech tagging, and number-to-word conversion.
Features
- Spelling Check: Identify and correct spelling errors.
- Part-of-Speech Tagging: Assign part-of-speech tags (e.g., noun, verb, adjective) to words in Creole sentences.
- Number to Words (Cardinal and Ordinal): Convert numbers to their word forms.
- Contraction to Expansion: Expand contractions in the text.
- Stopword Removal: Remove common words that do not contribute much to the meaning.
- Tokenization: Break the text into words or tokens.
- Text Cleaning: Remove unwanted characters and clean the text.
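Text cleaning is the one feature in this list without a usage example, so here is a minimal standard-library sketch of what such a step typically does. The function name `clean_text` and the choice of which characters to keep are assumptions made for illustration; this is not CreoleNLTK's implementation or API.

```python
import re
import unicodedata


def clean_text(text: str) -> str:
    """Illustrative cleaner (hypothetical helper, not part of CreoleNLTK)."""
    # Normalize to NFC so accented letters such as "ò" are single code points.
    text = unicodedata.normalize("NFC", text)
    # Replace URLs with a space.
    text = re.sub(r"https?://\S+", " ", text)
    # Keep word characters, whitespace, apostrophes, and basic punctuation.
    text = re.sub(r"[^\w\s'.,!?-]", " ", text)
    # Collapse runs of whitespace and trim the ends.
    return re.sub(r"\s+", " ", text).strip()


print(clean_text("Mwen   renmen #Ayiti @ anpil!  https://example.com"))
# Mwen renmen Ayiti anpil!
```

Apostrophes are kept on purpose so that contractions such as `m'ap` survive cleaning and can still be expanded later.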
Installation
You can install CreoleNLTK using pip:
pip install creolenltk
Usage
Spelling Checker
from creolenltk.spelling_checker import SpellingChecker
# Initialize the spelling checker
spell_checker = SpellingChecker()
# Correct spelling errors in a word
corrected_word = spell_checker.correction('òtgraf')
print(f"Original Word: òtgraf, Corrected Word: {corrected_word}") # òtograf
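To give a feel for how a dictionary-based correction like the one above can work, the sketch below picks the closest entry from a word list using `difflib` from the standard library. The vocabulary, the `suggest` function, and the 0.8 similarity cutoff are all assumptions for illustration; the source does not say which algorithm `SpellingChecker` uses.

```python
import difflib
import unicodedata

# Hypothetical mini vocabulary; a real checker would load a full word list.
VOCAB = [unicodedata.normalize("NFC", w) for w in ["òtograf", "fraz", "kreyòl", "manje"]]


def suggest(word: str, vocab=VOCAB, cutoff: float = 0.8) -> str:
    """Return the closest vocabulary entry, or the word itself if none is close."""
    word = unicodedata.normalize("NFC", word.lower())
    if word in vocab:
        return word
    matches = difflib.get_close_matches(word, vocab, n=1, cutoff=cutoff)
    return matches[0] if matches else word


print(suggest("òtgraf"))  # òtograf
```

Words with no sufficiently similar entry are returned unchanged, which avoids "correcting" names and loanwords into unrelated vocabulary items.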
Number to Words (Cardinal and Ordinal)
from creolenltk.num2word import CreoleNumberConverter
# Initialize the number converter
num_converter = CreoleNumberConverter()
# Convert numbers to cardinal words
print(num_converter.number_to_word(2024)) # de mil venntkat
# Convert numbers to ordinal words
print(num_converter.number_to_ordinal(21)) # venteyinyèm
# Replace numbers in text with cardinal form
text = "Mwen genyen 3 chat ak 21 ti chen."
converted = num_converter.replace_cardinals_in_text(text)
print(converted) # Mwen genyen twa chat ak venteyen ti chen.
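In-text replacement of this kind can be sketched with `re.sub` and a callback. The lookup table below is hypothetical and contains only the two values that appear in the example above; the function name `replace_cardinals` is also an assumption, not the library's internals.

```python
import re

# Hypothetical lookup table, limited to the values shown in the example above.
CARDINALS = {3: "twa", 21: "venteyen"}


def replace_cardinals(text: str, table=CARDINALS) -> str:
    """Replace each standalone integer with its word form when the table has one."""
    def to_word(match: re.Match) -> str:
        number = int(match.group())
        # Leave numbers without a known word form untouched.
        return table.get(number, match.group())

    return re.sub(r"\b\d+\b", to_word, text)


print(replace_cardinals("Mwen genyen 3 chat ak 21 ti chen."))
# Mwen genyen twa chat ak venteyen ti chen.
```

A full converter would compute word forms for arbitrary numbers instead of looking them up; the table only keeps the sketch short.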
Part-of-Speech Tagging
from creolenltk.pos_tagger import PosTagger
# Initialize the POS tagger (uses the trained Creole model)
tagger = PosTagger()
sentence = "Mwen renmen Ayiti anpil."
tags = tagger.tag(sentence)
print(tags) # [('Mwen', 'PRON'), ('renmen', 'VERB'), ('Ayiti', 'PROPN'), ('anpil', 'ADV'), ('.', 'PUNCT')]
sentence2 = "Poukisa panse yo chanje?"
tags2 = tagger.tag(sentence2)
print(tags2) # [('Poukisa', 'ADV'), ('panse', 'VERB'), ('yo', 'PRON'), ('chanje', 'VERB'), ('?', 'PUNCT')]
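Tagger output is usually post-processed, for example to pull out all verbs. The sketch below groups tokens by tag, assuming `tag()` returns a list of `(token, tag)` tuples as the printed output above suggests. The tag list is copied from that output so the snippet runs without the library; `group_by_tag` is a hypothetical helper.

```python
from collections import defaultdict

# Copied from the example output above, so no model is needed to run this.
tags = [("Mwen", "PRON"), ("renmen", "VERB"), ("Ayiti", "PROPN"),
        ("anpil", "ADV"), (".", "PUNCT")]


def group_by_tag(tagged):
    """Map each POS tag to the list of tokens that carry it, in sentence order."""
    groups = defaultdict(list)
    for token, tag in tagged:
        groups[tag].append(token)
    return dict(groups)


groups = group_by_tag(tags)
print(groups["VERB"])   # ['renmen']
print(groups["PROPN"])  # ['Ayiti']
```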
Contraction to Expansion
from creolenltk.contraction_expansion import ContractionToExpansion
# Initialize the contraction expander
contraction_expander = ContractionToExpansion()
# Expand contractions in a sentence
original_sentence = "L'ap manje. m'ap rete lakay mw."
expanded_sentence = contraction_expander.expand_contractions(original_sentence)
print(f"Original Sentence: {original_sentence}\nExpanded Sentence: {expanded_sentence}") # li ap manje. mwen ap rete lakay mwen.
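A rule-based expander for the contractions in this example can be sketched with two regular expressions. The mappings below cover only the forms shown above (`l'`, `m'`, `mw`), and both the tables and the `expand` function are assumptions for illustration, not the rules `ContractionToExpansion` actually applies.

```python
import re

# Hypothetical tables covering only the forms in the example above.
PREFIX_FORMS = {"l": "li", "m": "mwen"}  # l'ap -> li ap, m'ap -> mwen ap
WORD_FORMS = {"mw": "mwen"}              # standalone short form


def expand(text: str) -> str:
    """Expand apostrophe clitics and standalone short forms (lowercased output)."""
    # A clitic letter followed by an apostrophe and another word character.
    text = re.sub(
        r"\b([lm])'(?=\w)",
        lambda m: PREFIX_FORMS[m.group(1).lower()] + " ",
        text,
        flags=re.IGNORECASE,
    )
    # Whole-word short forms; \b prevents matching the "mw" inside "mwen".
    text = re.sub(
        r"\b(mw)\b",
        lambda m: WORD_FORMS[m.group(1).lower()],
        text,
        flags=re.IGNORECASE,
    )
    return text


print(expand("L'ap manje. m'ap rete lakay mw."))
# li ap manje. mwen ap rete lakay mwen.
```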
Stopword Removal
from creolenltk.stopword import Stopword
# Initialize the stopword handler
stopword_handler = Stopword()
# Remove stopwords from a sentence
sentence_with_stopwords = "Sa se yon fraz tès ak kèk stopwords nan Kreyòl Ayisyen."
sentence_without_stopwords = stopword_handler.remove_stopwords(sentence_with_stopwords)
print(f"Sentence with Stopwords: {sentence_with_stopwords}\nWithout Stopwords: {sentence_without_stopwords}") # fraz tès stopwords Kreyòl Ayisyen.
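Stopword removal is essentially a set-membership filter. The sketch below reproduces the result above with a hypothetical six-word stopword set, chosen only to cover this sentence; the list CreoleNLTK ships is not shown in the source and is certainly larger.

```python
import string
import unicodedata

# Hypothetical stopword set covering only the example sentence.
STOPWORDS = {unicodedata.normalize("NFC", w)
             for w in ["sa", "se", "yon", "ak", "kèk", "nan"]}


def drop_stopwords(text: str, stopwords=STOPWORDS) -> str:
    """Remove whitespace-separated tokens whose lowercased core is a stopword."""
    kept = []
    for token in text.split():
        # Ignore surrounding punctuation and case when checking membership.
        core = unicodedata.normalize("NFC", token.strip(string.punctuation).lower())
        if core not in stopwords:
            kept.append(token)
    return " ".join(kept)


print(drop_stopwords("Sa se yon fraz tès ak kèk stopwords nan Kreyòl Ayisyen."))
# fraz tès stopwords Kreyòl Ayisyen.
```

Kept tokens retain their original casing and punctuation; only the membership test is case-insensitive.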
Tokenizer
from creolenltk.tokenizer import Tokenizer
# Initialize the tokenizer
tokenizer = Tokenizer()
# Tokenize a sentence
sentence = "Sa se yon fraz senp"
tokens = tokenizer.word_tokenize(sentence, expand_contractions=True, lowercase=True)
print(f"Sentence: {sentence}\nTokens: {tokens}") # ["sa", "se", "yon", "fraz", "senp"]
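For readers curious what word tokenization involves, here is a minimal regex-based sketch. The pattern and the `simple_word_tokenize` name are assumptions; unlike the library call above, this sketch does not expand contractions and keeps forms such as `m'ap` as single tokens.

```python
import re

# A word (optionally with internal apostrophes) or a single punctuation mark.
TOKEN_RE = re.compile(r"\w+(?:'\w+)*|[^\w\s]")


def simple_word_tokenize(text: str, lowercase: bool = True) -> list:
    """Split text into word and punctuation tokens (illustrative only)."""
    if lowercase:
        text = text.lower()
    return TOKEN_RE.findall(text)


print(simple_word_tokenize("Sa se yon fraz senp"))
# ['sa', 'se', 'yon', 'fraz', 'senp']
print(simple_word_tokenize("Mwen renmen Ayiti anpil.", lowercase=False))
# ['Mwen', 'renmen', 'Ayiti', 'anpil', '.']
```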
For more detailed usage and examples, refer to the documentation.
License
MIT licensed. See the bundled LICENSE file for more details.
File details
Details for the file creolenltk-1.0.10.tar.gz.
File metadata
- Download URL: creolenltk-1.0.10.tar.gz
- Upload date:
- Size: 17.4 MB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.10.18
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 7d1a4364388b8210eded0030199c7848a01ee1a3cc718c3034a351b6862c4322 |
| MD5 | 062d1224a9c6e139bbd71970840fc816 |
| BLAKE2b-256 | 5aaf8d239b6fb9668070fda61873ae0fa1ddfa01dd48946738693f7456202186 |
File details
Details for the file creolenltk-1.0.10-py3-none-any.whl.
File metadata
- Download URL: creolenltk-1.0.10-py3-none-any.whl
- Upload date:
- Size: 17.4 MB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.10.18
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | b3bac9aaaddb5759346f1fa2366f84704f3cb86f8075fc6d4533447af7c4cda9 |
| MD5 | 65d5dc2bc3afe3c3bd72bc336c5de135 |
| BLAKE2b-256 | 02e6b9b30495372fcd2758f0a30758e903ebe5163178c0da9988bed4ebab492d |