
nlptools: Python NLP Tools

A straightforward Natural Language Processing Toolbox

NLP Tools is a set of tools written in Python that covers the most common NLP tasks in an easy-to-read, easy-to-understand coding style.

It is being developed alongside a series of Medium articles about NLP by the main author. You can find the articles at tfduque.medium.com

Installation

Installing with pip

pip install nlpytools

Usage example

Tokenization

  • Using the tokenizer:
from nlptools.core.structures import tokenize

tokenize("This is a sentence")
[<SOS>, this, is, a, sentence, <EOS>]
  • Using sentence/document format:
from nlptools.core.structures import Document
doc = Document("This is a sentence. This is another sentence.")

for sentence in doc:
    print(sentence, sentence.tokens)
This is a sentence. [<SOS>, This, is, a, sentence, ., <EOS>]
This is another sentence. [<SOS>, This, is, another, sentence, ., <EOS>]
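Conceptually, tokenization splits the text into word and punctuation tokens and wraps each sentence in boundary markers. A minimal standalone sketch of the idea (illustrative only, not the library's implementation):

```python
import re

def simple_tokenize(text):
    # split into word tokens and individual punctuation marks
    tokens = re.findall(r"\w+|[^\w\s]", text)
    # wrap in start/end-of-sentence markers, as nlptools does
    return ["<SOS>"] + tokens + ["<EOS>"]

print(simple_tokenize("This is a sentence."))
# ['<SOS>', 'This', 'is', 'a', 'sentence', '.', '<EOS>']
```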

Normalization

These are the currently available normalization steps:

pre_tokenization_functions = {'simplify_punctuation': simplify_punctuation,
                              'normalize_whitespace': normalize_whitespace}
post_tokenization_functions = {'normalize_contractions': normalize_contractions,
                               'spell_correction': spell_correction,
                               'remove_stopwords': remove_stopwords}

Usage:

from nlptools.preprocessing.normalization import Normalizer
normalizer = Normalizer(pre_tokenization_steps=['simplify_punctuation', 'normalize_whitespace'],
                        post_tokenization_steps=['normalize_contractions', 'spell_correction'])
normalizer.normalize_string("This is a nnormalized sentence!!!!         Yeah,,!!") # one can also use normalize_document
'This is a normalized sentence! Yeah,!'
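To give a feel for what two of the named steps do, here are rough standalone sketches of `simplify_punctuation` and `normalize_whitespace` (illustrative only, not the library's code):

```python
import re

def simplify_punctuation(text):
    # collapse repeated punctuation marks ("!!!!" -> "!", ",," -> ",")
    return re.sub(r"([!?,;.])\1+", r"\1", text)

def normalize_whitespace(text):
    # collapse runs of spaces/tabs/newlines into a single space
    return re.sub(r"\s+", " ", text).strip()

s = "This is a sentence!!!!         Yeah,,!!"
print(normalize_whitespace(simplify_punctuation(s)))
# This is a sentence! Yeah,!
```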

Stemming:

from nlptools.preprocessing.stemming import PorterStemmer
from nlptools.core.structures import tokenize
stemmer = PorterStemmer()
tokens = tokenize("The words in this sentence will be stemmed.")
stemmed_tokens = [stemmer.stem(token) for token in tokens]
print(stemmed_tokens)
['<sos>', 'the', 'word', 'in', 'thi', 'sent', 'will', 'be', 'stem', '.', '<eos>']
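Stemming strips suffixes to reduce inflected words to a common root. The real Porter algorithm applies ordered phases of rewrite rules with measure checks; a toy version that only strips a few common suffixes conveys the idea (illustrative only):

```python
def toy_stem(word):
    # strip the first matching suffix, keeping at least a short stem
    for suffix in ("sses", "ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

print([toy_stem(w) for w in ["words", "running", "stemmed"]])
# ['word', 'runn', 'stemm']
```

Note that stems need not be dictionary words ("runn", "stemm"); the goal is only that inflected forms collapse to the same string.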

Lemmatizing and Tagging

First: tagging

from nlptools.preprocessing.tagging import MLTagger
tagger = MLTagger()
tag_pairs = tagger.tag("Tag this sentence")
for tag in tag_pairs:
    print(tag, tag.PoS)
<SOS> None
Tag NNP
this DT
sentence NN
<EOS> None

After tagging, every token carries its own part-of-speech tag in its PoS attribute.

Then, after tagging, we can lemmatize:

from nlptools.preprocessing.tagging import MLTagger
from nlptools.preprocessing.lemmatization import Lemmatizer  # import path assumed; the original snippet omits it

tagger = MLTagger(force_ud=True) # Force UD format to use compatible tags
lemmatizer = Lemmatizer()
tag_pairs = tagger.tag("The cars are running")
lemmatized_words = [lemmatizer.lemmatize(word, word.PoS) for word in tag_pairs.tokens]
print(" ".join(lemmatized_words[1:-1]))
the car are run
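Lemmatization maps an inflected word plus its PoS tag to a dictionary form, which is why the tagger must run first. A toy lookup-table version of the idea (illustrative only; real lemmatizers combine a lexicon with morphological rules, and the tags here are made up for the example):

```python
# (word, PoS) -> lemma; unknown pairs fall back to the lowercased word
LEMMA_TABLE = {
    ("cars", "NOUN"): "car",
    ("running", "VERB"): "run",
}

def toy_lemmatize(word, pos):
    return LEMMA_TABLE.get((word.lower(), pos), word.lower())

tagged = [("The", "DET"), ("cars", "NOUN"), ("are", "VERB"), ("running", "VERB")]
print(" ".join(toy_lemmatize(w, p) for w, p in tagged))
# the car are run
```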

Featurization

from nlptools.preprocessing.featurization import Tfidf
tfidf = Tfidf()
tfidf.fit(["The first sentence", "The second sentence", "The third sentence", "First, second, third."])
tfidf.transform(["The first sentence", "The second sentence", "The third sentence", "First, second, third."]) #or just go with fit_transform
matrix([[0.30543024, 0.        , 0.        , 0.        , 0.        ,
         0.07438118, 0.        , 0.07438118],
        [0.        , 0.30543024, 0.        , ...]])  # output truncated
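Conceptually, each cell of that matrix is a term frequency weighted by inverse document frequency. A bare-bones version of the computation (illustrative only; the library's exact weighting and smoothing may differ):

```python
import math

docs = [["the", "first", "sentence"], ["the", "second", "sentence"]]

def toy_tfidf(term, doc, corpus):
    tf = doc.count(term) / len(doc)           # term frequency in this document
    df = sum(1 for d in corpus if term in d)  # number of documents containing the term
    idf = math.log(len(corpus) / df)          # inverse document frequency
    return tf * idf

print(toy_tfidf("first", docs[0], docs))  # positive: "first" is distinctive
print(toy_tfidf("the", docs[0], docs))    # 0.0: "the" occurs in every document
```

Terms that appear in every document get an idf of log(1) = 0, which is why common words end up with zero weight.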

For more examples and usage, please refer to the Medium series.

Release History

  • 0.1.0
    • Pypi release

Meta

Tiago Duque – tfduque.medium.com

Distributed under the MIT license. See LICENSE for more information.

Find me on GitHub

Find me on LinkedIn

Contributing

  1. Fork it (https://github.com/yourname/yourproject/fork)
  2. Create your feature branch (git checkout -b feature/fooBar)
  3. Write understandable code!
  4. Commit your changes (git commit -am 'Add some fooBar')
  5. Push to the branch (git push origin feature/fooBar)
  6. Create a new Pull Request
