A small package that provides decorators for text preprocessing with nltk

Project description

NLD - Natural Language Decorators

This package carries out common text preprocessing tasks in NLTK through decorators provided by a single class. The class also keeps track of the preprocessing steps taken and the time they took, which makes it faster to build simple pipelines, especially for simple exploratory analysis.
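To illustrate how a class can record steps and timings through decorators, here is a minimal sketch. The `StepTracker` class and its `step` and `timeit` methods are hypothetical names chosen for illustration and are not NLD's actual implementation. The sketch assumes each decorator post-processes the return value of the function it wraps.

```python
import functools
import time


class StepTracker:
    """Hypothetical stand-in for a decorator-providing class (not the NLD source)."""

    def __init__(self):
        self.chain = []           # names of the steps, in execution order
        self.process_time = None  # seconds taken by the last timed run

    def step(self, transform):
        """Build a decorator that applies `transform` to the wrapped function's result."""
        def decorator(func):
            @functools.wraps(func)
            def wrapper(*args, **kwargs):
                result = func(*args, **kwargs)
                self.chain.append(transform.__name__)
                return transform(result)
            return wrapper
        return decorator

    def timeit(self, func):
        """Time the whole decorated pipeline and reset the recorded chain."""
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            self.chain.clear()
            start = time.perf_counter()
            result = func(*args, **kwargs)
            self.process_time = time.perf_counter() - start
            return result
        return wrapper


def lower(tokens):
    return [t.lower() for t in tokens]


def drop_short(tokens):
    return [t for t in tokens if len(t) > 2]


tracker = StepTracker()


@tracker.timeit
@tracker.step(drop_short)
@tracker.step(lower)  # closest to the function, so it runs first
def tokenize(text):
    return text.split()


result = tokenize("Emma Woodhouse is handsome")
print(result)         # ['emma', 'woodhouse', 'handsome']
print(tracker.chain)  # ['lower', 'drop_short']
```

The decorator written closest to the function is applied first, which is why the order of the stack matters in the examples that follow.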

Install

To install the package, from the root directory of the repository, run the following:

python3 setup.py install --user

This will also install nltk 3.4.5.

Examples

The following are some examples of the preprocessing steps that can be applied, including mistakes that may go undetected.

import os

import nld
from nltk.tokenize import word_tokenize

# Jane Austen's novel Emma, obtained from Project Gutenberg, which is for some weird reason banned in Italy.
# open() does not expand "~", so the path is expanded explicitly.
with open(os.path.expanduser("~/Documents/austen.txt")) as text_file:
    raw_text = text_file.read()

# Instantiate the NLD object; optionally turn the logger on and store the timings of every run
nldecorator = nld.NLD(logger=True, store_all_process_times=True)

@nldecorator.timeit
@nldecorator.freq_dist()
@nldecorator.stem
@nldecorator.remove_stopwords(punct=True) # this keyword argument must be specified
@nldecorator.substitute([("emma", "peppa")])
@nldecorator.lower
def tokenize(_input: str) -> list:
    # This single function with a pipeline of decorators is a single run
    return word_tokenize(_input)

first_result = tokenize(raw_text)

print("1\n", first_result, "\n")
print("PROCESS TIME:", nldecorator.process_time)
print("Run ID: ", nldecorator.id)
print("Decorators used:", nldecorator.chain[nldecorator.id])
1
 [('mr.', 1080), ('peppa', 860), ('could', 834), ("'s", 831), ('would', 815)] 

PROCESS TIME: 2.593135118484497
Run ID:  2d1c614f-e5ba-41d3-aea7-ffab8e5e7a7b
Decorators used: tokenize-lower_wrapper-sub_wrapper-rm_stopwords_wrapper-stem_wrapper-freq_dist_wrapper-

# Wrong order: stopwords are removed before lowercasing, so capitalized stopwords are still present
@nldecorator.timeit
@nldecorator.n_grams(3)
@nldecorator.stem
@nldecorator.lower
@nldecorator.remove_stopwords()
def tokenize(_input):
    return word_tokenize(_input)

third_result = tokenize(raw_text[:])

print("3\n", list(third_result)[:20], "\n")
print("PROCESS TIME:", nldecorator.process_time)
print("Run ID: ", nldecorator.id)
print("Decorators used:", nldecorator.chain[nldecorator.id])
3
 [('\ufeffthe', 'project', 'gutenberg'), ('project', 'gutenberg', 'ebook'), ('gutenberg', 'ebook', 'emma'), ('ebook', 'emma', ','), ('emma', ',', 'jane'), (',', 'jane', 'austen'), ('jane', 'austen', 'this'), ('austen', 'this', 'ebook'), ('this', 'ebook', 'use'), ('ebook', 'use', 'anyon'), ('use', 'anyon', 'anywher'), ('anyon', 'anywher', 'cost'), ('anywher', 'cost', 'almost'), ('cost', 'almost', 'restrict'), ('almost', 'restrict', 'whatsoev'), ('restrict', 'whatsoev', '.'), ('whatsoev', '.', 'you'), ('.', 'you', 'may'), ('you', 'may', 'copi'), ('may', 'copi', ',')] 

PROCESS TIME: 2.7864768505096436
Run ID:  6f42acfa-0407-4ddd-8f91-40309db5f08b
Decorators used: tokenize-rm_stopwords_wrapper-lower_wrapper-stem_wrapper-ngrams_wrapper-
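
The run above keeps 'this' and 'you' in its n-grams because the stopwords were filtered before the tokens were lowercased. The same effect can be reproduced in plain Python. This is a minimal sketch that uses a tiny made-up stopword set and hypothetical helper functions rather than NLD's decorators or NLTK's stopword list:

```python
# Tiny illustrative stopword set (an assumption; NLTK's list is much longer).
STOPWORDS = {"the", "this", "is", "you"}


def remove_stopwords(tokens):
    # Case-sensitive membership test against a lowercase stopword set.
    return [t for t in tokens if t not in STOPWORDS]


def lower(tokens):
    return [t.lower() for t in tokens]


tokens = ["This", "ebook", "is", "for", "You"]

# Wrong order: "This" and "You" do not match the lowercase stopwords,
# so they survive the filter and are only lowercased afterwards.
wrong = lower(remove_stopwords(tokens))

# Right order: lowercase first, then filter.
right = remove_stopwords(lower(tokens))

print(wrong)  # ['this', 'ebook', 'for', 'you']
print(right)  # ['ebook', 'for']
```

Neither order raises an error, which is why this kind of mistake can go undetected unless the chain of steps is inspected.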

The following example uses the iterator decorator and then the open_from_path decorator.

nldecorator = nld.NLD(logger=True, store_all_process_times=True)

# With the open_from_path decorator you can run a single file, or every file in a given directory, through the pipeline

@nldecorator.timeit
@nldecorator.lower
@nldecorator.word_tokenizer
@nldecorator.iterator()
@nldecorator.open_from_path
def return_directory():
    # there are three files of the same book in this directory.
    return "~/Documents/books/"

return_directory()
print(nldecorator.all_process_times)
return_directory()
print(nldecorator.all_process_times)
return_directory()
print(nldecorator.all_process_times)
[('lower_wrapper', 1.363213062286377)]
[('lower_wrapper', 1.363213062286377), ('lower_wrapper', 1.3505115509033203)]
[('lower_wrapper', 1.363213062286377), ('lower_wrapper', 1.3505115509033203), ('lower_wrapper', 1.2218332290649414)]
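
For readers who want to see what "a single file or every file in a directory" can look like without the decorator, here is a rough sketch using only the standard library. The `read_texts` function is a hypothetical helper and not part of NLD. Whether NLD expands "~" itself is not stated here, so the sketch expands it explicitly, because Python's own file functions do not.

```python
import tempfile
from pathlib import Path


def read_texts(path):
    """Return a list with the text of one file, or of every file in a directory."""
    target = Path(path).expanduser()  # turns "~/..." into an absolute path
    if target.is_dir():
        # Sort so the files are always read in the same order.
        files = sorted(p for p in target.iterdir() if p.is_file())
        return [p.read_text(encoding="utf-8") for p in files]
    return [target.read_text(encoding="utf-8")]


# Demonstration on a temporary directory holding two small files.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("a.txt", "b.txt"):
        (Path(tmp) / name).write_text(f"Text of {name}", encoding="utf-8")

    from_dir = read_texts(tmp)
    from_file = read_texts(Path(tmp) / "a.txt")

print(from_dir)   # ['Text of a.txt', 'Text of b.txt']
print(from_file)  # ['Text of a.txt']
```

Returning a list in both cases lets the downstream steps iterate over the texts the same way whether one file or many were read.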

Build Dataframes

@nldecorator.build_df(column="tags")
@nldecorator.pos_tagger
@nldecorator.iterator()
def preprocess_tags(sents):
    return sents

@nldecorator.build_df("tokens")
@nldecorator.stem
@nldecorator.remove_stopwords()
@nldecorator.word_tokenizer
@nldecorator.lower
@nldecorator.iterator()
def preprocess_tokens_iter(sents):
    return sents


sents = ["This one is my awesome string, written by myself personally.", 
         "This two is my awesome string, written by myself personally 2.",
         "This three is my awesome string, written by myself personally 3."]

for i in range(3):
    preprocess_tokens_iter(sents)
    preprocess_tags(sents)

nldecorator.df
   tokens                                             tags
0  [one, awesom, string, ,, written, person, .]       [(This, DT), (one, CD), (is, VBZ), (my, PRP$),...
1  [two, awesom, string, ,, written, person, 2, .]    [(This, DT), (two, CD), (is, VBZ), (my, PRP$),...
2  [three, awesom, string, ,, written, person, 3, .]  [(This, DT), (three, CD), (is, VBZ), (my, PRP$...
