
A Python library for Creole text preprocessing


CreoleNLTK: Creole Natural Language Toolkit


CreoleNLTK is a Python library for preprocessing Creole text. It provides tools that prepare text data for natural language processing (NLP) tasks: cleaning, tokenization, lowercasing, stopword removal, contraction expansion, spell checking, number-to-word conversion, and part-of-speech tagging.

Features

  • Spelling Check: Identify and correct spelling errors.
  • Number to Words: Convert numbers to cardinal or ordinal words.
  • Part-of-Speech Tagging: Label each token with its part of speech.
  • Contraction to Expansion: Expand contractions in the text.
  • Stopword Removal: Remove common words that contribute little to the meaning.
  • Tokenization: Split the text into words or tokens.
  • Text Cleaning: Remove unwanted characters from the text.

Installation

You can install CreoleNLTK using pip:

pip install creolenltk
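To confirm that the install succeeded, one option is to ask the standard library for the installed version. This is a minimal sketch: `installed_version` is a hypothetical helper, not part of CreoleNLTK, and it only relies on `importlib.metadata` (Python 3.8+).

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package="creolenltk"):
    """Return the installed version string, or None if the package is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


found = installed_version()
if found is None:
    print("creolenltk is not installed in this environment")
else:
    print(f"creolenltk {found} is installed")
```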

Usage

Spelling Checker

from creolenltk.spelling_checker import SpellingChecker

# Initialize the spelling checker
spell_checker = SpellingChecker()

# Correct spelling errors in a word
corrected_word = spell_checker.correction('òtgraf')

print(f"Original Word: òtgraf, Corrected Word: {corrected_word}")  # Corrected word: òtograf
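For intuition about what a corrector of this kind does, here is a minimal sketch that picks the closest entry from a word list using the standard library's `difflib`. It is not CreoleNLTK's implementation: the `VOCAB` list and the `suggest` helper are hypothetical, and the library may well use a different algorithm and dictionary.

```python
import difflib

# Hypothetical toy vocabulary; a real checker would load a full Creole word list.
VOCAB = ["òtograf", "fraz", "lakay", "manje", "renmen"]


def suggest(word, vocab=VOCAB, cutoff=0.6):
    """Return the closest vocabulary word, or the input unchanged if none is close."""
    matches = difflib.get_close_matches(word.lower(), vocab, n=1, cutoff=cutoff)
    return matches[0] if matches else word


print(suggest("òtgraf"))  # closest entry in the toy vocabulary
print(suggest("xyz"))     # nothing similar enough, so the input comes back
```

Returning the input when nothing is close keeps the helper safe to apply to names and loanwords that are missing from the word list.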

Number to Words (Cardinal and Ordinal)

from creolenltk.num2word import CreoleNumberConverter

# Initialize the number converter
num_converter = CreoleNumberConverter()

# Convert numbers to cardinal words
print(num_converter.number_to_word(2024))  # de mil venntkat

# Convert numbers to ordinal words
print(num_converter.number_to_ordinal(21))  # venteyinyèm

# Replace numbers in text with cardinal form
text = "Mwen genyen 3 chat ak 21 ti chen."
converted = num_converter.replace_cardinals_in_text(text)
print(converted)  # Mwen genyen twa chat ak venteyen ti chen.
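The in-text replacement can be pictured as a regular-expression substitution over digit runs. The sketch below is an illustration only, not the library's code: `CARDINALS` is a hypothetical lookup that contains just the two values from the example above, and `replace_known_cardinals` is a made-up helper name.

```python
import re

# Hypothetical lookup limited to the two values shown in the example above.
CARDINALS = {"3": "twa", "21": "venteyen"}


def replace_known_cardinals(text):
    """Replace standalone digit runs found in CARDINALS; leave other numbers as-is."""
    return re.sub(
        r"\b\d+\b",
        lambda match: CARDINALS.get(match.group(), match.group()),
        text,
    )


print(replace_known_cardinals("Mwen genyen 3 chat ak 21 ti chen."))
```

A real converter would compute the words for any number rather than look them up, which is what `CreoleNumberConverter` is for.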

Part-of-Speech Tagging

from creolenltk.pos_tagger import PosTagger

# Initialize the POS tagger (uses the trained Creole model)
tagger = PosTagger()

sentence = "Mwen renmen Ayiti anpil."
tags = tagger.tag(sentence)
print(tags)  # [('Mwen', 'PRON'), ('renmen', 'VERB'), ('Ayiti', 'PROPN'), ('anpil', 'ADV'), ('.', 'PUNCT')]

sentence2 = "Poukisa panse yo chanje?"
tags2 = tagger.tag(sentence2)
print(tags2)  # [('Poukisa', 'ADV'), ('panse', 'VERB'), ('yo', 'PRON'), ('chanje', 'VERB'), ('?', 'PUNCT')]
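Once a sentence is tagged, the result can be filtered or summarized with plain Python. The sketch below assumes, as the printed output above suggests, that `tag` returns a list of `(token, tag)` tuples; the list is copied from the first example so the snippet runs without the library, and `words_with_tag` is a hypothetical helper.

```python
from collections import Counter

# Tagged output copied from the first example above.
tags = [
    ("Mwen", "PRON"),
    ("renmen", "VERB"),
    ("Ayiti", "PROPN"),
    ("anpil", "ADV"),
    (".", "PUNCT"),
]


def words_with_tag(tagged, wanted):
    """Return the tokens whose tag equals `wanted`, in sentence order."""
    return [word for word, tag in tagged if tag == wanted]


tag_counts = Counter(tag for _, tag in tags)

print(words_with_tag(tags, "VERB"))
print(tag_counts.most_common())
```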

Contraction to Expansion

from creolenltk.contraction_expansion import ContractionToExpansion

# Initialize the contraction expander
contraction_expander = ContractionToExpansion()

# Expand contractions in a sentence
original_sentence = "L'ap manje. m'ap rete lakay mw."
expanded_sentence = contraction_expander.expand_contractions(original_sentence)

print(f"Original Sentence: {original_sentence}\nExpanded Sentence: {expanded_sentence}")  # Expanded: li ap manje. mwen ap rete lakay mwen.
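Conceptually, contraction expansion is a dictionary lookup applied at word boundaries. The following sketch shows that idea under stated assumptions: the `CONTRACTIONS` mapping is hypothetical and built only from the forms in the example above, and `expand` is not the library's method.

```python
import re

# Hypothetical mapping built only from the forms in the example above.
CONTRACTIONS = {"l'ap": "li ap", "m'ap": "mwen ap", "mw": "mwen"}

# Longest keys first so a short key never shadows a longer one.
_PATTERN = re.compile(
    r"\b("
    + "|".join(re.escape(key) for key in sorted(CONTRACTIONS, key=len, reverse=True))
    + r")\b",
    re.IGNORECASE,
)


def expand(text):
    """Replace each known contraction with its lowercase expansion."""
    return _PATTERN.sub(lambda match: CONTRACTIONS[match.group(0).lower()], text)


print(expand("L'ap manje. m'ap rete lakay mw."))
```

The word-boundary anchors keep `mw` from matching inside `mwen`, so text that is already expanded passes through unchanged.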

Stopword Removal

from creolenltk.stopword import Stopword

# Initialize the stopword handler
stopword_handler = Stopword()

# Remove stopwords from a sentence
sentence_with_stopwords = "Sa se yon fraz tès ak kèk stopwords nan Kreyòl Ayisyen."
sentence_without_stopwords = stopword_handler.remove_stopwords(sentence_with_stopwords)

print(f"Sentence with Stopwords: {sentence_with_stopwords}\nWithout Stopwords: {sentence_without_stopwords}")  # Without stopwords: fraz tès stopwords Kreyòl Ayisyen.
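Stopword removal amounts to filtering tokens against a set. The sketch below illustrates this with a hypothetical `STOPWORDS` set inferred from the example above (the words that disappear from the output); the list CreoleNLTK actually ships is likely much larger, and this is not its implementation.

```python
# Hypothetical stopword set inferred from the example above.
STOPWORDS = {"sa", "se", "yon", "ak", "kèk", "nan"}


def remove_stopwords(sentence, stopwords=STOPWORDS):
    """Drop whitespace-separated words whose lowercase, punctuation-stripped form is a stopword."""
    kept = [
        word
        for word in sentence.split()
        if word.lower().strip(".,!?;:") not in stopwords
    ]
    return " ".join(kept)


print(remove_stopwords("Sa se yon fraz tès ak kèk stopwords nan Kreyòl Ayisyen."))
```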

Tokenizer

from creolenltk.tokenizer import Tokenizer

# Initialize the tokenizer
tokenizer = Tokenizer()

# Tokenize a sentence
sentence = "Sa se yon fraz senp"
tokens = tokenizer.word_tokenize(sentence, expand_contractions=True, lowercase=True)

print(f"Sentence: {sentence}\nTokens: {tokens}")  # Tokens: ['sa', 'se', 'yon', 'fraz', 'senp']
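As a rough picture of word tokenization with optional lowercasing, here is a regex-based sketch. It is an assumption-laden stand-in, not the library's tokenizer: `simple_word_tokenize` is a hypothetical name, and it does not expand contractions.

```python
import re

# Words (optionally joined by apostrophes, as in "m'ap") or single punctuation marks.
_TOKEN = re.compile(r"\w+(?:'\w+)*|[^\w\s]")


def simple_word_tokenize(text, lowercase=True):
    """Split text into word and punctuation tokens, optionally lowercased."""
    if lowercase:
        text = text.lower()
    return _TOKEN.findall(text)


print(simple_word_tokenize("Sa se yon fraz senp"))
print(simple_word_tokenize("Mwen renmen Ayiti anpil.", lowercase=False))
```

In Python 3, `\w` matches accented letters such as `ò` and `è`, so Creole words stay in one piece.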

For more detailed usage and examples, refer to the documentation.

License

MIT licensed. See the bundled LICENSE file for more details.
