
Statistical NLP


Statistical NLP (SNLP) is a practical package of statistical tools for natural language processing. SNLP builds on statistical and distributional attributes of natural language, and hence most of its functionality is unsupervised.

Features

  • Text cleaning
  • Text analysis
  • Extraction of Fixed (Idiosyncratic) Expressions
  • Identification of statistically redundant words for filtering

Upcoming Features

  • Anomaly detection
  • Identification of non-compositional compounds such as red tape and brain drain in the corpus

Usage

Install the package:

pip3 install snlp

See the description of different functionalities with worked examples below.

Text Cleaning

snlp implements an easy-to-use yet powerful function for cleaning up text: clean_text. Using clean_text, you can choose which pattern to accept via the regex_pattern argument, which patterns to drop via the drop argument, and which patterns to replace via the replace argument. You can also specify the maximum length of tokens. Let's use Stanford's IMDB Sentiment Dataset as an example. A sample of this data can be found in resources/data/imdb_train_sample.tsv.

import pandas as pd

from snlp.preprocessing import clean_text

# Read a sample of the IMDB data (tab-separated, with label and text columns):
imdb_train = pd.read_csv('resources/data/imdb_train_sample.tsv', sep='\t', names=['label', 'text'])

# Let's only keep alphanumeric tokens as well as important punctuation marks:
regex_pattern='^[a-zA-Z0-9!.,?\';:$/_-]+$'

# In this corpus, one can frequently see HTML tags such as `< br / >`. So let's drop them:
drop={'< br / >'}

# Skimming through the text, one can frequently see patterns such as !!! or ???. Let's replace them:
replace={'!!!':'!', r'\?\?\?':'?'}

# Finally, let's set the maximum length of a token to 15:
maxlen=15

imdb_train.text = imdb_train.text.apply(clean_text, args=(regex_pattern, drop, replace, maxlen,))

clean_text returns a tokenized version of the text.
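
For a quick sanity check, you can print one of the cleaned rows:

# Inspect the first cleaned review:
print(imdb_train.text.iloc[0])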

Text Analysis

snlp provides an easy-to-use function (text_analysis.generate_report) for analyzing text via an extensive analysis report. text_analysis.generate_report receives as input a dataframe that contains a text column and, optionally, a number of label columns, and can generate plots for up to 4 numerical or categorical labels. See the example below for more details.

from snlp.text_analysis import generate_report

generate_report(df=imdb_train,
                out_dir='output_dir',
                text_col='text',
                label_cols=[('label', 'categorical')])

The above script creates an analysis report that includes distribution plots and word clouds for different POS tags in the text, as well as bar plots and histograms for the labels. You can specify up to 4 labels of type categorical or numerical. See the example below for including another label of numerical type. The report is automatically rendered in the browser via plotly's default port assignment, but you can also save the report as an HTML file by setting the save_report argument to True.

import numpy as np
import random

# In addition to the original label, for illustration purpose, let's create two random labels:
imdb_train['numerical_label'] = np.random.randint(1, 500, imdb_train.shape[0])
imdb_train['new_label'] = random.choices(['a', 'b', 'c', 'd'], [0.2, 0.5, 0.8, 0.9], k=imdb_train.shape[0])

generate_report(df=imdb_train,
                out_dir='output_dir',
                text_col='text',
                label_cols=[('label', 'categorical'), ('new_label', 'categorical'), ('numerical_label', 'numerical')])

The above yields a report in HTML, with interactive plotly plots as can be seen in example screenshots below.

[Example screenshots: annotated analysis report, interactive plot toolbar with zoom, and word clouds.]
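
To keep an HTML copy of the report rather than only rendering it in the browser, set the save_report argument mentioned above to True:

generate_report(df=imdb_train,
                out_dir='output_dir',
                text_col='text',
                label_cols=[('label', 'categorical')],
                save_report=True)  # saves the report as HTML in addition to rendering it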

Extraction of Fixed (Idiosyncratic) Expressions

Identifying fixed expressions has applications in a wide range of NLP tasks, ranging from sentiment analysis to topic models and keyphrase extraction. Fixed expressions are multiword units whose components cannot be replaced with their near synonyms, e.g. swimming pool, which cannot be replaced with swim pool or swimmers pool.

You can use snlp to identify fixed noun-noun and adjective-noun expressions in your text, leveraging statistical association measures such as PMI and NPMI. First run get_counts to extract compounds and their corresponding frequencies, then run get_ams to calculate their PMI and rank them by PMI value:

from snlp.mwes import get_counts, get_ams

get_counts(imdb_train, text_column='text', output_dir='tmp/')
get_ams(path_to_counts='tmp/')

Running the above yields two sets of ranked noun-noun and adjective-noun expressions, saved in the directory passed to output_dir ('tmp/' in this example) as nn_pmi.json and jn_pmi.json, respectively. Some examples from the top of the ranked fixed expressions can be seen below:

nn_pmi.json
-----------
jet li
clint eastwood
monty python
kung fu
blade runner


jn_pmi.json
-----------
spinal tap
martial arts
citizen kane
facial expressions
global warming
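
PMI here is pointwise mutual information: PMI(x, y) = log p(x, y) / (p(x) p(y)), which grows the more often two words co-occur relative to chance. A minimal, self-contained sketch of such a ranking (an illustration only: it ranks all adjacent word pairs and skips the filtering to noun-noun and adjective-noun pairs that snlp performs):

import math
from collections import Counter

def pmi_rank(docs):
    """Rank bigrams of tokenized docs by PMI (illustration only)."""
    unigrams, bigrams = Counter(), Counter()
    for tokens in docs:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    n_uni, n_bi = sum(unigrams.values()), sum(bigrams.values())
    def pmi(pair):
        p_xy = bigrams[pair] / n_bi
        p_x, p_y = unigrams[pair[0]] / n_uni, unigrams[pair[1]] / n_uni
        return math.log(p_xy / (p_x * p_y))
    return sorted(bigrams, key=pmi, reverse=True)

print(pmi_rank([['kung', 'fu', 'movie'], ['kung', 'fu', 'master']])[:2])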

The main idea behind the extraction of fixed expressions is to treat them as a single token. Research shows that when fixed expressions are treated as a single token rather than the sum of their components, they can improve the performance of downstream applications such as classification and NER. Using the snlp.mwes.replace_compounds function, you can replace the extracted expressions in the corpus with their hyphenated version (global warming --> global-warming) so that they are treated as a single token by downstream applications, as sketched below.
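
The replacement itself is simple to picture; here is a standalone sketch of the hyphenation step (an illustration, not snlp's implementation):

def hyphenate_mwes(tokens, mwes):
    """Merge known two-word expressions into single hyphenated tokens."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in mwes:
            out.append(tokens[i] + '-' + tokens[i + 1])  # e.g. global warming --> global-warming
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

print(hyphenate_mwes(['global', 'warming', 'is', 'real'], {('global', 'warming')}))
# ['global-warming', 'is', 'real']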

Identification of Statistically Redundant Words

Words can be represented with various statistics. For instance, they can be represented by term frequency (tf) or inverse document frequency (idf). Terms with anomalous (very high or very low) statistics usually carry no value for document classification. This package provides a functionality (snlp.preprocessing.WordFilter) to identify such terms in a completely automatic fashion. The logic is to first Gaussianize the distribution of the specified statistic (tf or idf), and then identify words with anomalous values on the Gaussianized distribution by looking at their z-scores. This way, one does not have to manually provide upper and lower thresholds.
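
The logic can be illustrated with a standalone sketch (not WordFilter's actual code): Gaussianize the term-frequency distribution with a Box-Cox transform, then flag words whose z-score on that distribution exceeds a cutoff:

import numpy as np
from collections import Counter
from scipy.stats import boxcox

def redundant_words(docs, z_cutoff=2.0):
    """Flag words with anomalous term frequency (illustration of the logic)."""
    tf = Counter(w for tokens in docs for w in tokens)
    words = list(tf)
    # Box-Cox requires positive values; raw counts are always >= 1.
    gaussianized, _ = boxcox(np.array([tf[w] for w in words], dtype=float))
    z = (gaussianized - gaussianized.mean()) / gaussianized.std()
    return {w for w, score in zip(words, z) if abs(score) > z_cutoff}

With z_cutoff=2.0, words more than two standard deviations from the mean of the Gaussianized tf distribution are flagged; in practice these are either extremely frequent function words or very rare noise, neither of which helps classification.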
