Preprocessing Library for Natural Language Processing

PreNLP

Installation

Requirements

  • Python >= 3.6
  • Mecab morphological analyzer for Korean
    • sh scripts/install_mecab.sh

With pip

prenlp can be installed using pip as follows:

pip install prenlp
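
After installing, it can be useful to confirm that the interpreter meets the stated Python requirement and that the package is importable. The following is a minimal sketch using only the standard library; `check_environment` is a hypothetical helper, not part of prenlp, and it does not check for Mecab.

```python
import importlib.util
import sys


def check_environment(package="prenlp", min_python=(3, 6)):
    """Report whether Python is new enough and the package can be found."""
    return {
        # sys.version_info compares element-wise against a plain tuple.
        "python_ok": sys.version_info >= min_python,
        # find_spec returns None when a top-level package is not installed.
        "installed": importlib.util.find_spec(package) is not None,
    }


print(check_environment())
```

If `installed` is reported as `False`, the `pip install` step above probably ran against a different interpreter or virtual environment.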

Usage

Data

Dataset Loading

Popular datasets for NLP tasks are provided in prenlp.

  • Text Classification: IMDB, NSMC

General use cases (for IMDB) are as follows:

>>> import prenlp
>>> imdb_train, imdb_test = prenlp.data.IMDB()
>>> len(imdb_train), len(imdb_test)
(25000, 25000)
>>> imdb_train[0]
("Minor Spoilers<br /><br />Alison Parker (Cristina Raines) is a successful top model, living with the lawyer Michael Lerman (Chris Sarandon) in his apartment. She tried to commit ...", 'pos')
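
Judging from the example above, each split behaves like a sequence of `(text, label)` tuples. Under that assumption, a quick check of class balance could be sketched as follows. `label_counts` and the small `sample` list are hypothetical names used for illustration and are not part of prenlp.

```python
from collections import Counter


def label_counts(dataset):
    """Count how often each label occurs in a sequence of (text, label) pairs."""
    return Counter(label for _text, label in dataset)


# Hypothetical stand-in for a split such as imdb_train.
sample = [
    ("A moving, well-acted film.", "pos"),
    ("Dull and far too long.", "neg"),
    ("One of the best of the year.", "pos"),
]

counts = label_counts(sample)
print(counts["pos"], counts["neg"])  # 2 1
```

The same function should work on `imdb_train` directly if the split is iterable as pairs, which the indexing example suggests but does not guarantee.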

Normalization

Frequently used normalization functions for text pre-processing are provided in prenlp.

Supported patterns include URLs, HTML tags, emoticons, email addresses, phone numbers, etc.

General use cases are as follows:

>>> from prenlp.data import Normalizer
>>> normalizer = Normalizer()

>>> normalizer.normalize('Visit this link for more details: https://github.com/')
'Visit this link for more details: [URL]'

>>> normalizer.normalize('Use HTML with the desired attributes: <img src="cat.jpg" height="100" />')
'Use HTML with the desired attributes: [TAG]'

>>> normalizer.normalize('Hello 🤩, I love you 💓 !')
'Hello [EMOJI], I love you [EMOJI] !'

>>> normalizer.normalize('Contact me at lyeoni.g@gmail.com')
'Contact me at [EMAIL]'

>>> normalizer.normalize('Call +82 10-1234-5678')
'Call [TEL]'
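
To illustrate the general idea behind this kind of placeholder substitution, here is a minimal regex-based sketch. It is not prenlp's implementation: `normalize_sketch` and its patterns are assumptions made for this example, and they cover only tags, URLs, and email addresses in a deliberately simplified way.

```python
import re

# Patterns are applied in list order. Tags go first so that a URL inside
# a tag attribute is swallowed together with the tag.
_PATTERNS = [
    (re.compile(r"<[^>]+>"), "[TAG]"),
    (re.compile(r"https?://\S+"), "[URL]"),
    (re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"), "[EMAIL]"),
]


def normalize_sketch(text):
    """Replace each matched span with its placeholder token."""
    for pattern, placeholder in _PATTERNS:
        text = pattern.sub(placeholder, text)
    return text


print(normalize_sketch("Contact me at lyeoni.g@gmail.com"))
# Contact me at [EMAIL]
```

Simple patterns like these misfire on edge cases (for example, `a < b and c > d` would be treated as a tag), which is one reason to prefer a maintained normalizer over ad hoc regexes.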

Tokenizer

Frequently used tokenizers for text pre-processing are provided in prenlp.

NLTKMosesTokenizer

General use cases (for Moses tokenizer) are as follows:

>>> from prenlp.tokenizer import NLTKMosesTokenizer
>>> tokenizer = NLTKMosesTokenizer()
>>> tokenizer('PreNLP package provides a variety of text preprocessing tools.')
['PreNLP', 'package', 'provides', 'a', 'variety', 'of', 'text', 'preprocessing', 'tools', '.']
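
Normalization and tokenization are typically chained. The sketch below shows one way to compose them, assuming only that the normalizer and the tokenizer are callables that take a string. `preprocess`, `toy_normalize`, and `toy_tokenize` are hypothetical names; the toy functions are stand-ins so the example runs without prenlp installed.

```python
import re


def preprocess(texts, normalize, tokenize):
    """Normalize each text, then split the result into tokens."""
    return [tokenize(normalize(text)) for text in texts]


# Hypothetical stand-ins for a real normalizer and tokenizer.
def toy_normalize(text):
    return re.sub(r"https?://\S+", "[URL]", text)


def toy_tokenize(text):
    return text.split()


tokens = preprocess(
    ["See https://github.com/ for details"], toy_normalize, toy_tokenize
)
print(tokens)  # [['See', '[URL]', 'for', 'details']]
```

With prenlp installed, `normalizer.normalize` and an `NLTKMosesTokenizer()` instance could presumably be passed in place of the stand-ins, since both are shown above accepting a string. Whether the Moses tokenizer keeps placeholders such as `[URL]` as a single token is not confirmed here and is worth checking.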
