A tokenizer with options to expand contractions, lemmatize, and stem.

Introduction

This package aggregates several packages I found useful for text pre-processing, including gensim and nltk. I put them together to create a more comprehensive and convenient pipeline.

Installation

pip install tokenizer_xm

Usage

Processing a single text string

from tokenizer_xm import TextPreProcessor

# An example text
example_text = "This is an amazing product! I've been using it for almost a year now and it's clearly better than any other products I've used."

print("Original text:")
print(example_text)
print("---")

print("Simple Preprocessed:")
print("---")
tk = TextPreProcessor(text=example_text, lemma_flag=False, stem_flag=False, stopwords=[])
print(tk.process())
print("---")

print("Pre-processing with regular contractions (e.g. I've -> I have):")
# The package ships with a dictionary of common contractions for your convenience
tk = TextPreProcessor(text=example_text, lemma_flag=False, stem_flag=False,
                      contractions=[], stopwords=[])
print(tk.process())
print("---")

print("Pre-processing with lemmatization:")
tk = TextPreProcessor(text=example_text, lemma_flag=True, stem_flag=False,
                      stopwords=[])
print(tk.process())
print("---")

print("Pre-processing with lemmatization and stemming:")
# This package uses the SnowballStemmer from nltk.stem. I will try to make it customizable later
tk = TextPreProcessor(text=example_text, lemma_flag=True, stem_flag=True,
                      stopwords=[])
print(tk.process())
print("---")

print("Adding stop words")
# Stop words are removed after lemmatization and stemming, so list them in their processed form (e.g. 'be', not 'is')
tk = TextPreProcessor(text=example_text, lemma_flag=True, stem_flag=True,
                      stopwords=["this", "be", "an", "it"])
print(tk.process())
print("---")

Original text:
This is an amazing product! I've been using it for almost a year now and it's clearly better than any other products I've used.
---
Simple Preprocessed:
---
['this', 'is', 'an', 'amazing', 'product', 'i', 'have', 'been', 'using', 'it', 'for', 'almost', 'a', 'year', 'now', 'and', 'it', 'has', 'it', 'is', 'clearly', 'better', 'than', 'any', 'other', 'products', 'i', 'have', 'used']
---
Pre-processing with regular contractions (e.g. I've -> I have):
['this', 'is', 'an', 'amazing', 'product', 'i', 'have', 'been', 'using', 'it', 'for', 'almost', 'a', 'year', 'now', 'and', 'it', 'has', 'it', 'is', 'clearly', 'better', 'than', 'any', 'other', 'products', 'i', 'have', 'used']
---
Pre-processing with lemmatization:
['this', 'be', 'an', 'amaze', 'product', 'i', 'have', 'be', 'use', 'it', 'for', 'almost', 'a', 'year', 'now', 'and', 'it', 'have', 'it', 'be', 'clearly', 'better', 'than', 'any', 'other', 'product', 'i', 'have', 'use']
---
Pre-processing with lemmatization and stemming:
['this', 'be', 'an', 'amaz', 'product', 'i', 'have', 'be', 'use', 'it', 'for', 'almost', 'a', 'year', 'now', 'and', 'it', 'have', 'it', 'be', 'clear', 'better', 'than', 'ani', 'other', 'product', 'i', 'have', 'use']
---
Adding stop words
['amaz', 'product', 'i', 'have', 'use', 'for', 'almost', 'a', 'year', 'now', 'and', 'have', 'clear', 'better', 'than', 'ani', 'other', 'product', 'i', 'have', 'use']
---
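
To make the contraction step concrete, here is a minimal sketch of dictionary-based contraction expansion followed by tokenization. It does not use tokenizer_xm: the CONTRACTIONS table and the expand_contractions and tokenize helpers are hypothetical names invented for illustration, and the package's own dictionary is likely larger and may behave differently. The sample output above, for instance, suggests that the package emits both readings of an ambiguous contraction such as "it's", whereas this sketch picks a single expansion.

```python
import re

# Hypothetical mini-dictionary for illustration only;
# the package's bundled dictionary may differ.
CONTRACTIONS = {
    "i've": "i have",
    "it's": "it is",
    "don't": "do not",
}


def expand_contractions(text, contractions=CONTRACTIONS):
    """Lowercase the text and replace each known contraction with its expansion."""
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(c) for c in contractions) + r")\b"
    )
    return pattern.sub(lambda m: contractions[m.group(1)], text.lower())


def tokenize(text):
    """Expand contractions, then keep runs of letters as tokens."""
    return re.findall(r"[a-z]+", expand_contractions(text))


tokens = tokenize("I've been using it and it's great!")
print(tokens)
# ['i', 'have', 'been', 'using', 'it', 'and', 'it', 'is', 'great']
```

Expanding before tokenizing keeps the apostrophe available for matching; splitting on punctuation first would break "I've" into fragments that no longer match the dictionary.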

The order of stop words removal and lemmatization/stemming

The current algorithm performs lemmatization and stemming before stop-word removal. As a result:

Be careful when defining a list of stop words. For example, including the term "product" will also remove the term "production" if stem_flag is set to True, and the term "products" if lemma_flag is set to True.
When lemma_flag is set to True, terms like "is" and "are" are lemmatized to "be". If "be" is not in the list of stop words, it will remain in the output. It is recommended that you lemmatize the list of stop words as well if you decide to perform lemmatization.

"""
Example
"""

text = "products, production, is"
stop_words = ['product', 'is']

# No lemmatization or stemming
tk = TextPreProcessor(text=text, lemma_flag=False, stem_flag=False,
                      stopwords=stop_words)
# Use the .txt_pre_pros_all method instead when the input is a corpus
print(tk.process())

['products', 'production']

# Lemmatization only
tk = TextPreProcessor(text=text, lemma_flag=True, stem_flag=False,
                      stopwords=stop_words)
print(tk.process())

['production', 'be']

# Lemmatization and stemming
tk = TextPreProcessor(text=text, lemma_flag=True, stem_flag=True,
                      stopwords=stop_words)
print(tk.process())

['be']
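
The recommendation to process the stop-word list as well can be sketched without the package. The example below is a toy model of the normalize-then-filter order described in this section: TOY_LEMMAS, normalize, and preprocess are hypothetical names, and the tiny lemma table stands in for a real lemmatizer. It is not how tokenizer_xm is implemented.

```python
# Toy lemma table for illustration only; a real lemmatizer covers far more words.
TOY_LEMMAS = {"is": "be", "are": "be", "products": "product"}


def normalize(token):
    """Map a token to its toy lemma, leaving unknown tokens unchanged."""
    return TOY_LEMMAS.get(token, token)


def preprocess(tokens, stopwords):
    """Normalize tokens first, then drop stop words (the order described above)."""
    normalized = [normalize(t) for t in tokens]
    return [t for t in normalized if t not in stopwords]


tokens = ["products", "production", "is"]
raw_stopwords = ["product", "is"]

# Raw stop words: "is" has already become "be", so it slips through.
print(preprocess(tokens, set(raw_stopwords)))
# ['production', 'be']

# Passing the stop words through the same normalization closes the gap.
normalized_stopwords = {normalize(w) for w in raw_stopwords}
print(preprocess(tokens, normalized_stopwords))
# ['production']
```

The same idea should carry over to the package: run the stop-word list through whatever lemmatization or stemming you enable, then pass the processed list as stopwords.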

Release history

1.0.2 (this version): Aug 31, 2021
1.0.1: Aug 31, 2021
1.0: Aug 31, 2021
0.5: Dec 9, 2019
0.4: Nov 19, 2019
0.2: Sep 20, 2019
0.1: Sep 20, 2019

