A Python library for Creole text preprocessing

Project description

CreoleNLTK: Creole Natural Language Toolkit

CreoleNLTK is a Python library for preprocessing Creole text. It provides the functions and tools needed to prepare text data for natural language processing (NLP) tasks, including text cleaning, tokenization, lowercasing, stopword removal, contraction expansion, and spell checking.

Features

  • Spelling Check: Identify and correct spelling errors.
  • Contraction to Expansion: Expand contractions in the text.
  • Stopword Removal: Remove common words that do not contribute much to the meaning.
  • Tokenization: Break the text into words or tokens.
  • Text Cleaning: Remove unwanted characters and clean the text.

Installation

You can install CreoleNLTK using pip:

pip install creolenltk
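
If the installation succeeded, the package's modules should import cleanly. A minimal sanity check, using only the classes documented in the usage examples below:

from creolenltk.tokenizer import Tokenizer

# Tokenize a short phrase to confirm the package is importable and working
print(Tokenizer().word_tokenize("Bonjou tout moun", expand_contractions=True, lowercase=True))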

Usage

Spelling Checker

# -*- coding: utf-8 -*-

from creolenltk.spelling_checker import SpellingChecker

# Initialize the spelling checker
spell_checker = SpellingChecker()

# Correct spelling errors in a word
corrected_word = spell_checker.correction('òtgraf')

print(f"Original Word: òtgraf, Corrected Word: {corrected_word}") # òtograf

Contraction to Expansion

from creolenltk.contraction_expansion import ContractionToExpansion

# Initialize the contraction expander
contraction_expander = ContractionToExpansion()

# Expand contractions in a sentence
original_sentence = "L'ap manje. m'ap rete lakay mw."
expanded_sentence = contraction_expander.expand_contractions(original_sentence)

print(f"Original Sentence: {original_sentence}\nExpanded Sentence: {expanded_sentence}") # li ap manje. mwen ap rete lakay mwen.

Stopword Removal

# -*- coding: utf-8 -*-

from creolenltk.stopword import Stopword

# Initialize the stopword handler
stopword_handler = Stopword()

# Remove stopwords from a sentence
sentence_with_stopwords = "Sa se yon fraz tès ak kèk stopwords nan Kreyòl Ayisyen."
sentence_without_stopwords = stopword_handler.remove_stopwords(sentence_with_stopwords)

print(f"Sentence with Stopwords: {sentence_with_stopwords}\nWithout Stopwords: {sentence_without_stopwords}") # fraz tès stopwords Kreyòl Ayisyen.

Tokenizer

from creolenltk.tokenizer import Tokenizer

# Initialize the tokenizer
tokenizer = Tokenizer()

# Tokenize a sentence
sentence = "Sa se yon fraz senp"
tokens = tokenizer.word_tokenize(sentence, expand_contractions=True, lowercase=True)

print(f"Sentence: {sentence}\nTokens: {tokens}") # ["sa", "se", "yon", "fraz", "senp"]

For more detailed usage and examples, refer to the documentation.

License

MIT licensed. See the bundled LICENSE file for more details.

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

creolenltk-1.0.3.tar.gz (6.8 kB)

Built Distribution

creolenltk-1.0.3-py3-none-any.whl (8.0 kB)

File details

Details for the file creolenltk-1.0.3.tar.gz.

File metadata

  • Download URL: creolenltk-1.0.3.tar.gz
  • Upload date:
  • Size: 6.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.11.5

File hashes

Hashes for creolenltk-1.0.3.tar.gz

  • SHA256: 3d33a074c8dd8d7fd29d07c329d04b38ef12d80afe2cfe9c821e4bf45b73aba7
  • MD5: 76b47fa2e3c1803f3a54e5350283f3e2
  • BLAKE2b-256: ad7b6b85187674b46740a880e7e28f5b8e3894eb5664fd79cace7842f368625a

See more details on using hashes here.
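
To check a downloaded archive against the SHA256 digest above, a short standard-library snippet is enough; this sketch assumes the file sits in the current working directory.

import hashlib

expected = "3d33a074c8dd8d7fd29d07c329d04b38ef12d80afe2cfe9c821e4bf45b73aba7"

with open("creolenltk-1.0.3.tar.gz", "rb") as f:
    actual = hashlib.sha256(f.read()).hexdigest()

print(actual == expected)  # True when the download matches the published hash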

File details

Details for the file creolenltk-1.0.3-py3-none-any.whl.

File metadata

  • Download URL: creolenltk-1.0.3-py3-none-any.whl
  • Upload date:
  • Size: 8.0 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.11.5

File hashes

Hashes for creolenltk-1.0.3-py3-none-any.whl

  • SHA256: 5e140a8a00a53af9c31249faa430bbda8e10103ec28453312253c7990699eaa3
  • MD5: 1def5931efc52c6bc375da340d60d895
  • BLAKE2b-256: 34c600bae18a12f9414890f02e19b298864cf891f46a87dbc37468b9322af2c5

See more details on using hashes here.
