A small package that provides decorators for text preprocessing with nltk

License
- OSI Approved :: MIT License
Operating System
- OS Independent
Programming Language
- Python :: 3

Project description

NLD - Natural Language Decorators

This package carries out common text preprocessing tasks in NLTK through decorators provided by a single class. The class also keeps track of the preprocessing steps applied and the time they took, which makes it quicker to build simple pipelines, especially for simple exploratory analysis.
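To make the idea concrete, here is a minimal sketch of how a class can expose decorators that record each applied step and time a whole run. It is not NLD's implementation: `MiniPipeline`, `step`, `lower` and `drop_short` are hypothetical names used only for illustration, and only the standard library is needed.

```python
import functools
import time


class MiniPipeline:
    """Hypothetical stand-in (not the NLD class): records steps and timing."""

    def __init__(self):
        self.chain = []           # names of the steps, in execution order
        self.process_time = None  # seconds taken by the last timed run

    def step(self, transform):
        """Build a decorator that applies `transform` to the wrapped function's result."""
        def decorator(func):
            @functools.wraps(func)
            def wrapper(*args, **kwargs):
                result = transform(func(*args, **kwargs))
                self.chain.append(transform.__name__)
                return result
            return wrapper
        return decorator

    def timeit(self, func):
        """Time the whole decorated pipeline and reset the recorded chain."""
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            self.chain.clear()
            start = time.perf_counter()
            result = func(*args, **kwargs)
            self.process_time = time.perf_counter() - start
            return result
        return wrapper


def lower(tokens):
    return [t.lower() for t in tokens]


def drop_short(tokens):
    return [t for t in tokens if len(t) > 2]


pipe = MiniPipeline()


@pipe.timeit
@pipe.step(drop_short)
@pipe.step(lower)
def tokenize(text):
    return text.split()


tokens = tokenize("Emma Woodhouse is handsome")
print(tokens)      # ['emma', 'woodhouse', 'handsome']
print(pipe.chain)  # ['lower', 'drop_short']
```

The decorator closest to the function runs first, so the recorded chain reads bottom-up relative to the decorator stack.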

Install

To install the package, run the following from the root directory of the repository:

python3 setup.py install --user

This will also install nltk 3.4.5.

Examples

The following are some examples of the preprocessing steps that can be applied, including an ordering mistake that may go undetected.

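import os

from nltk.tokenize import word_tokenize
import nld

# Jane Austen's Emma, obtained from Project Gutenberg, which is for some weird reason banned in Italy.
# open() does not expand "~", so the home directory is resolved explicitly.
with open(os.path.expanduser("~/Documents/austen.txt")) as text_file:
    raw_text = text_file.read()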

# Instantiate the NLD object. Optionally, turn the logger on and store the timings of every run.
nldecorator = nld.NLD(logger=True, store_all_process_times=True)

@nldecorator.timeit
@nldecorator.freq_dist()
@nldecorator.stem
@nldecorator.remove_stopwords(punct=True) # this keyword argument must be specified
@nldecorator.substitute([("emma", "peppa")])
@nldecorator.lower
def tokenize(_input: str) -> list:
    # One function together with its stack of decorators counts as a single run
    return word_tokenize(_input)

first_result = tokenize(raw_text)

print("1\n", first_result, "\n")
print("PROCESS TIME:", nldecorator.process_time)
print("Run ID: ", nldecorator.id)
print("Decorators used:", nldecorator.chain[nldecorator.id])

1
 [('mr.', 1080), ('peppa', 860), ('could', 834), ("'s", 831), ('would', 815)] 

PROCESS TIME: 2.593135118484497
Run ID:  2d1c614f-e5ba-41d3-aea7-ffab8e5e7a7b
Decorators used: tokenize-lower_wrapper-sub_wrapper-rm_stopwords_wrapper-stem_wrapper-freq_dist_wrapper-
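The result above is a list of (token, count) pairs ranked by frequency. As a rough standard-library illustration of that output shape, `collections.Counter` produces the same kind of ranking. This is not a claim about how `freq_dist` is implemented, which this page does not show, and the sample tokens are made up.

```python
from collections import Counter

# Made-up tokens standing in for the output of the earlier pipeline steps.
tokens = ["peppa", "could", "mr.", "peppa", "mr.", "mr."]

# most_common(n) returns the n most frequent items as (item, count) pairs.
top = Counter(tokens).most_common(2)
print(top)  # [('mr.', 3), ('peppa', 2)]
```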

# Wrong order: stopwords are removed before lowercasing, so capitalized stopwords are still present
@nldecorator.timeit
@nldecorator.n_grams(3)
@nldecorator.stem
@nldecorator.lower
@nldecorator.remove_stopwords()
def tokenize(_input):
    return word_tokenize(_input)

third_result = tokenize(raw_text[:])

print("3\n", list(third_result)[:20], "\n")
print("PROCESS TIME:", nldecorator.process_time)
print("Run ID: ", nldecorator.id)
print("Decorators used:", nldecorator.chain[nldecorator.id])

3
 [('\ufeffthe', 'project', 'gutenberg'), ('project', 'gutenberg', 'ebook'), ('gutenberg', 'ebook', 'emma'), ('ebook', 'emma', ','), ('emma', ',', 'jane'), (',', 'jane', 'austen'), ('jane', 'austen', 'this'), ('austen', 'this', 'ebook'), ('this', 'ebook', 'use'), ('ebook', 'use', 'anyon'), ('use', 'anyon', 'anywher'), ('anyon', 'anywher', 'cost'), ('anywher', 'cost', 'almost'), ('cost', 'almost', 'restrict'), ('almost', 'restrict', 'whatsoev'), ('restrict', 'whatsoev', '.'), ('whatsoev', '.', 'you'), ('.', 'you', 'may'), ('you', 'may', 'copi'), ('may', 'copi', ',')] 

PROCESS TIME: 2.7864768505096436
Run ID:  6f42acfa-0407-4ddd-8f91-40309db5f08b
Decorators used: tokenize-rm_stopwords_wrapper-lower_wrapper-stem_wrapper-ngrams_wrapper-
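In the run above, tokens such as 'this' and 'you' survive because stopword removal ran before lowercasing. The effect can be reproduced in isolation with the short sketch below. It assumes a case-sensitive stopword match and uses a tiny hand-written stopword set rather than NLTK's list; the two helper functions are illustrative, not NLD's.

```python
STOPWORDS = {"this", "is", "the", "you"}  # tiny illustrative set, not NLTK's list


def remove_stopwords(tokens):
    # Case-sensitive membership test, so "This" does not match "this".
    return [t for t in tokens if t not in STOPWORDS]


def lower(tokens):
    return [t.lower() for t in tokens]


tokens = "This eBook is for you".split()

wrong = lower(remove_stopwords(tokens))  # stopwords removed first, then lowercased
right = remove_stopwords(lower(tokens))  # lowercased first, then stopwords removed

print(wrong)  # ['this', 'ebook', 'for']
print(right)  # ['ebook', 'for']
```

With stacked decorators the one closest to the function runs first, so the fix is to place the lowercasing decorator below the stopword-removal decorator.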

The following example uses the iterator decorator together with the open_from_path decorator.

nldecorator = nld.NLD(logger=True, store_all_process_times=True)

# With the open_from_path decorator you can run every file in a given directory, or a single file, through the pipeline

@nldecorator.timeit
@nldecorator.lower
@nldecorator.word_tokenizer
@nldecorator.iterator()
@nldecorator.open_from_path
def return_directory():
    # there are three files of the same book in this directory.
    return "~/Documents/books/"

return_directory()
print(nldecorator.all_process_times)
return_directory()
print(nldecorator.all_process_times)
return_directory()
print(nldecorator.all_process_times)

[('lower_wrapper', 1.363213062286377)]
[('lower_wrapper', 1.363213062286377), ('lower_wrapper', 1.3505115509033203)]
[('lower_wrapper', 1.363213062286377), ('lower_wrapper', 1.3505115509033203), ('lower_wrapper', 1.2218332290649414)]
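For readers curious what "a directory or a single file" handling can look like, here is a standard-library sketch. `read_texts` is a hypothetical helper written for this illustration, and nothing here is taken from the `open_from_path` source; the example creates its own temporary files so it can run anywhere.

```python
import tempfile
from pathlib import Path


def read_texts(path):
    """Return the text of one file, or of every file in a directory, as a list."""
    target = Path(path).expanduser()  # resolve a leading "~" to the home directory
    if target.is_dir():
        files = sorted(p for p in target.iterdir() if p.is_file())
    else:
        files = [target]
    return [f.read_text(encoding="utf-8") for f in files]


# Demonstrate both cases on throwaway files.
with tempfile.TemporaryDirectory() as tmp:
    for name, body in [("a.txt", "Emma"), ("b.txt", "Persuasion")]:
        (Path(tmp) / name).write_text(body, encoding="utf-8")
    texts = read_texts(tmp)                    # whole directory
    single = read_texts(Path(tmp) / "a.txt")   # single file

print(texts)   # ['Emma', 'Persuasion']
print(single)  # ['Emma']
```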

Build Dataframes

@nldecorator.build_df(column="tags")
@nldecorator.pos_tagger
@nldecorator.iterator()
def preprocess_tags(sents):
    return sents

@nldecorator.build_df("tokens")
@nldecorator.stem
@nldecorator.remove_stopwords()
@nldecorator.word_tokenizer
@nldecorator.lower
@nldecorator.iterator()
def preprocess_tokens_iter(sents):
    return sents


sents = ["This one is my awesome string, written by myself personally.", 
         "This two is my awesome string, written by myself personally 2.",
         "This three is my awesome string, written by myself personally 3."]

for i in range(3):
    preprocess_tokens_iter(sents)
    preprocess_tags(sents)

nldecorator.df

	tokens	tags
0	[one, awesom, string, ,, written, person, .]	[(This, DT), (one, CD), (is, VBZ), (my, PRP$),...
1	[two, awesom, string, ,, written, person, 2, .]	[(This, DT), (two, CD), (is, VBZ), (my, PRP$),...
2	[three, awesom, string, ,, written, person, 3, .]	[(This, DT), (three, CD), (is, VBZ), (my, PRP$...

Release history

0.0.1, released Dec 6, 2020

Download files

Source Distribution

nld-0.0.1.tar.gz (9.6 kB)

Uploaded Dec 6, 2020 Source

Built Distribution

nld-0.0.1-py3-none-any.whl (8.5 kB)

Uploaded Dec 6, 2020 Python 3

File details

Details for the file nld-0.0.1.tar.gz.

File metadata

Download URL: nld-0.0.1.tar.gz
Upload date: Dec 6, 2020
Size: 9.6 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/3.2.0 pkginfo/1.6.1 requests/2.25.0 setuptools/50.3.2 requests-toolbelt/0.9.1 tqdm/4.54.1 CPython/3.8.5

File hashes

Hashes for nld-0.0.1.tar.gz
Algorithm	Hash digest
SHA256	`393b0b6331fc653bd072f5ae443e075d0456b42c729f154620bad9d45931a374`
MD5	`dfcc61d6cc4e11db94030aef72f9ad9c`
BLAKE2b-256	`5e060f57b10d968ccc0d504476f76ead7ac0cc2fe59ff8c869c6022d3f826efc`

File details

Details for the file nld-0.0.1-py3-none-any.whl.

File metadata

Download URL: nld-0.0.1-py3-none-any.whl
Upload date: Dec 6, 2020
Size: 8.5 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/3.2.0 pkginfo/1.6.1 requests/2.25.0 setuptools/50.3.2 requests-toolbelt/0.9.1 tqdm/4.54.1 CPython/3.8.5

File hashes

Hashes for nld-0.0.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`33ec2b48999aab0730feb050e565d4794b111df02fdd99d328190088f20e4aa5`
MD5	`0b05909590e5b36f702fd56598afb32f`
BLAKE2b-256	`a61403c785a0390fe98f3ead9eb4340d0e3c0dc13c7003f06f01abfbbd38bc7b`
