
nlptools: Python NLP Tools

A straightforward Natural Language Processing Toolbox

NLP Tools is a set of tools written in Python that covers the most common NLP tasks in an easy-to-read, easy-to-understand coding style.

It is being developed alongside a series of Medium articles about NLP by the main author. You can find the articles at tfduque.medium.com

Installation

Installing with pip

pip install nlpytools

Usage example

Tokenization

  • Using the tokenizer:
from nlptools.core.structures import tokenize

tokenize("This is a sentence")
[<SOS>, this, is, a, sentence, <EOS>]
  • Using sentence/document format:
from nlptools.core.structures import Document
doc = Document("This is a sentence. This is another sentence.")

for sentence in doc:
    print(sentence, sentence.tokens)
This is a sentence. [<SOS>, This, is, a, sentence, ., <EOS>]
This is another sentence. [<SOS>, This, is, another, sentence, ., <EOS>]
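Conceptually, tokenization splits the text into word and punctuation tokens and wraps each sentence in boundary markers. A minimal standalone sketch of the idea (illustrative only, not the library's implementation):

```python
import re

def simple_tokenize(text):
    # split into word tokens and individual punctuation marks
    tokens = re.findall(r"\w+|[^\w\s]", text)
    # wrap in start/end-of-sentence markers, as nlptools does
    return ["<SOS>"] + tokens + ["<EOS>"]

print(simple_tokenize("This is a sentence."))
# ['<SOS>', 'This', 'is', 'a', 'sentence', '.', '<EOS>']
```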

Normalization

These are the currently available normalization steps:

pre_tokenization_functions = {'simplify_punctuation': simplify_punctuation,
                              'normalize_whitespace': normalize_whitespace}
post_tokenization_functions = {'normalize_contractions': normalize_contractions,
                               'spell_correction': spell_correction,
                               'remove_stopwords': remove_stopwords}

Usage:

from nlptools.preprocessing.normalization import Normalizer
normalizer = Normalizer(pre_tokenization_steps=['simplify_punctuation', 'normalize_whitespace'],
                        post_tokenization_steps=['normalize_contractions', 'spell_correction'])
normalizer.normalize_string("This is a nnormalized sentence!!!!         Yeah,,!!") # one can also use normalize_document
'This is a normalized sentence! Yeah,!'
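To give a feel for what two of the named steps do, here are rough standalone sketches of `simplify_punctuation` and `normalize_whitespace` (illustrative only, not the library's code):

```python
import re

def simplify_punctuation(text):
    # collapse repeated punctuation marks ("!!!!" -> "!", ",," -> ",")
    return re.sub(r"([!?,;.])\1+", r"\1", text)

def normalize_whitespace(text):
    # collapse runs of spaces/tabs/newlines into a single space
    return re.sub(r"\s+", " ", text).strip()

s = "This is a sentence!!!!         Yeah,,!!"
print(normalize_whitespace(simplify_punctuation(s)))
# This is a sentence! Yeah,!
```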

Stemming:

from nlptools.preprocessing.stemming import PorterStemmer
from nlptools.core.structures import tokenize
stemmer = PorterStemmer()
tokens = tokenize("The words in this sentence will be stemmed.")
stemmed_tokens = [stemmer.stem(token) for token in tokens]
print(stemmed_tokens)
['<sos>', 'the', 'word', 'in', 'thi', 'sent', 'will', 'be', 'stem', '.', '<eos>']
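Stemming strips suffixes to reduce inflected words to a common root. The real Porter algorithm applies ordered phases of rewrite rules with measure checks; a toy version that only strips a few common suffixes conveys the idea (illustrative only):

```python
def toy_stem(word):
    # strip the first matching suffix, keeping at least a short stem
    for suffix in ("sses", "ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

print([toy_stem(w) for w in ["words", "running", "stemmed"]])
# ['word', 'runn', 'stemm']
```

Note that stems need not be dictionary words ("runn", "stemm"); the goal is only that inflected forms collapse to the same string.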

Lemmatizing and Tagging

First: tagging

from nlptools.preprocessing.tagging import MLTagger
tagger = MLTagger()
tag_pairs = tagger.tag("Tag this sentence")
for tag in tag_pairs:
    print(tag, tag.PoS)
<SOS> None
Tag NNP
this DT
sentence NN
<EOS> None

After tagging, every token carries its own part-of-speech tag in its PoS attribute.

Then, after tagging, we can lemmatize:

from nlptools.preprocessing.tagging import MLTagger
from nlptools.preprocessing.lemmatization import Lemmatizer  # import path assumed; the original snippet omits it

tagger = MLTagger(force_ud=True) # Force UD format to use compatible tags
lemmatizer = Lemmatizer()
tag_pairs = tagger.tag("The cars are running")
lemmatized_words = [lemmatizer.lemmatize(word, word.PoS) for word in tag_pairs.tokens]
print(" ".join(lemmatized_words[1:-1]))
the car are run
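Lemmatization maps an inflected word plus its PoS tag to a dictionary form, which is why the tagger must run first. A toy lookup-table version of the idea (illustrative only; real lemmatizers combine a lexicon with morphological rules, and the tags here are made up for the example):

```python
# (word, PoS) -> lemma; unknown pairs fall back to the lowercased word
LEMMA_TABLE = {
    ("cars", "NOUN"): "car",
    ("running", "VERB"): "run",
}

def toy_lemmatize(word, pos):
    return LEMMA_TABLE.get((word.lower(), pos), word.lower())

tagged = [("The", "DET"), ("cars", "NOUN"), ("are", "VERB"), ("running", "VERB")]
print(" ".join(toy_lemmatize(w, p) for w, p in tagged))
# the car are run
```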

Featurization

from nlptools.preprocessing.featurization import Tfidf
tfidf = Tfidf()
tfidf.fit(["The first sentence", "The second sentence", "The third sentence", "First, second, third."])
tfidf.transform(["The first sentence", "The second sentence", "The third sentence", "First, second, third."]) #or just go with fit_transform
matrix([[0.30543024, 0.        , 0.        , 0.        , 0.        ,
         0.07438118, 0.        , 0.07438118],
        [0.        , 0.30543024, 0.        , ...]])  # output truncated
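Conceptually, each cell of that matrix is a term frequency weighted by inverse document frequency. A bare-bones version of the computation (illustrative only; the library's exact weighting and smoothing may differ):

```python
import math

docs = [["the", "first", "sentence"], ["the", "second", "sentence"]]

def toy_tfidf(term, doc, corpus):
    tf = doc.count(term) / len(doc)           # term frequency in this document
    df = sum(1 for d in corpus if term in d)  # number of documents containing the term
    idf = math.log(len(corpus) / df)          # inverse document frequency
    return tf * idf

print(toy_tfidf("first", docs[0], docs))  # positive: "first" is distinctive
print(toy_tfidf("the", docs[0], docs))    # 0.0: "the" occurs in every document
```

Terms that appear in every document get an idf of log(1) = 0, which is why common words end up with zero weight.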

For more examples and usage, please refer to the Medium series.

Release History

  • 0.1.0
    • Pypi release

Meta

Tiago Duque – tfduque.medium.com

Distributed under the MIT license. See LICENSE for more information.

Find me on GitHub

Find me on LinkedIn

Contributing

  1. Fork it (https://github.com/yourname/yourproject/fork)
  2. Create your feature branch (git checkout -b feature/fooBar)
  3. Write understandable code!
  4. Commit your changes (git commit -am 'Add some fooBar')
  5. Push to the branch (git push origin feature/fooBar)
  6. Create a new Pull Request
