wayward·PyPI

Wayward is a Python package that helps to identify characteristic terms from single documents or groups of documents.

Project description

Wayward is a Python package that helps to identify characteristic terms from single documents or groups of documents. It can be used to create word clouds.

Rather than use simple term frequency, it weighs terms by statistical models known as parsimonious language models. These models are good at picking up the terms that distinguish a text document from other documents in a collection.
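As a rough illustration of how such a model discounts common terms, the following is a minimal sketch of the expectation-maximization updates described by Hiemstra et al. (2004). It is not Wayward's implementation: the function name `parsimonious_lm`, the `iterations` parameter, and the reading of `w` as the weight of the document model are assumptions made for this sketch.

```python
from collections import Counter


def parsimonious_lm(doc_tokens, background_tokens, w=0.1, iterations=50):
    """Estimate P(term | document) with a two-component mixture and EM.

    Assumption of this sketch: `w` is the weight of the document model and
    the background model gets weight 1 - w. The background is assumed to be
    non-empty.
    """
    tf = Counter(doc_tokens)
    bg = Counter(background_tokens)
    bg_total = sum(bg.values())
    p_bg = {t: bg[t] / bg_total for t in tf}

    # Start from the maximum-likelihood estimate (relative term frequency).
    doc_total = sum(tf.values())
    p_doc = {t: n / doc_total for t, n in tf.items()}

    for _ in range(iterations):
        # E-step: expected number of occurrences explained by the document model.
        expected = {}
        for t, n in tf.items():
            doc_part = w * p_doc[t]
            denom = doc_part + (1 - w) * p_bg[t]
            expected[t] = n * doc_part / denom if denom > 0 else 0.0
        # M-step: renormalize the expected counts into a distribution.
        total = sum(expected.values())
        p_doc = {t: e / total for t, e in expected.items()}
    return p_doc


doc = "the cat sat on the mat".split()
background = doc + "the dog and the bird saw the tree by the river".split()
model = parsimonious_lm(doc, background)
# 'the' is common in the background, so its probability should fall below
# its raw relative frequency of 2/6, while 'cat' should rise above 1/6.
for term, prob in sorted(model.items(), key=lambda item: -item[1]):
    print(term, round(prob, 4))
```

Terms that the background already explains well lose probability mass with each iteration, and the remaining mass is redistributed over the terms that are more specific to the document.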

For this to work, a background collection is needed to compare the documents of interest against, and it should preferably be large. This could be a random sample of newspaper articles, for instance, but for many applications it works better to take a natural collection, such as a periodical publication, and to fit the model for separate parts (e.g. individual issues, or yearly groups of issues).
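One way to prepare yearly groups from a periodical could look like the sketch below. The helper name `group_by_year`, the `YYYY-MM` issue labels, and the toy tokens are all hypothetical and only serve to illustrate the grouping step.

```python
from collections import defaultdict


def group_by_year(issues):
    """Concatenate the tokens of all issues that share a year.

    `issues` is an iterable of (label, tokens) pairs, where the label is
    assumed to start with a four-digit year, e.g. "1850-03".
    """
    groups = defaultdict(list)
    for label, tokens in issues:
        groups[label[:4]].extend(tokens)
    return dict(groups)


# Hypothetical toy data: three issues from two years.
issues = [
    ("1850-03", ["household", "words", "began"]),
    ("1850-04", ["a", "second", "issue"]),
    ("1851-01", ["the", "new", "year"]),
]
yearly = group_by_year(issues)
print(sorted(yearly))  # ['1850', '1851']
```

The token lists of all issues can then serve as the background corpus, and each yearly group as a document of interest.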

See the References section for more information about parsimonious language models and their applications.

Wayward does not do visualization of word clouds. For that, you can paste its output into a tool like http://wordle.net or the IBM Word-Cloud Generator.
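To paste the output into such a tool, the (term, probability) pairs usually need to be turned into plain text first. The sketch below assumes the tool accepts one `term:weight` pair per line with integer weights; that input format, the helper name `to_weighted_lines`, and the `scale` parameter are assumptions, so check the format your tool expects.

```python
def to_weighted_lines(top_terms, scale=1000):
    """Format (term, probability) pairs as 'term:weight' lines.

    Probabilities are multiplied by `scale` and rounded to integers,
    with a minimum weight of 1 so that no term disappears.
    """
    lines = []
    for term, prob in top_terms:
        weight = max(1, round(prob * scale))
        lines.append(f"{term}:{weight}")
    return "\n".join(lines)


top_terms = [("lover", 0.1538461408077277), ("eyes", 0.0769230704038643)]
print(to_weighted_lines(top_terms))
# lover:154
# eyes:77
```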

Installation

Either install the latest release from PyPI:

pip install wayward

or clone the git repository, and use Poetry to install the package in editable mode:

git clone https://github.com/aolieman/wayward.git
cd wayward/
poetry install

Usage

>>> import re
>>> quotes = [
...     "Love all, trust a few, Do wrong to none",
...     ...
...     "A lover's eyes will gaze an eagle blind. "
...     "A lover's ear will hear the lowest sound.",
... ]
>>> doc_tokens = [
...     re.sub(r"[.,:;!?\"‘’]|'s\b", " ", quote).lower().split()
...     for quote in quotes
... ]
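To see what this tokenization does to a single quote, the same expression can be wrapped in a small helper. The function name `tokenize` is hypothetical; the regular expression is the one used above.

```python
import re


def tokenize(text):
    # Replace punctuation and possessive 's with spaces,
    # then lowercase and split on whitespace.
    return re.sub(r"[.,:;!?\"‘’]|'s\b", " ", text).lower().split()


print(tokenize("Love all, trust a few, Do wrong to none"))
# ['love', 'all', 'trust', 'a', 'few', 'do', 'wrong', 'to', 'none']
print(tokenize("A lover's eyes will gaze an eagle blind."))
# ['a', 'lover', 'eyes', 'will', 'gaze', 'an', 'eagle', 'blind']
```

Note that the possessive in "lover's" is dropped, so "lover" is counted as one term in both sentences of the last quote.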

The ParsimoniousLM is initialized with all document tokens as a background corpus, and subsequently takes a single document’s tokens as input. Its top method returns the top terms and their probabilities:

>>> from wayward import ParsimoniousLM
>>> plm = ParsimoniousLM(doc_tokens, w=.1)
>>> plm.top(10, doc_tokens[-1])
[('lover', 0.1538461408077277),
 ('will', 0.1538461408077277),
 ('eyes', 0.0769230704038643),
 ('gaze', 0.0769230704038643),
 ('an', 0.0769230704038643),
 ('eagle', 0.0769230704038643),
 ('blind', 0.0769230704038643),
 ('ear', 0.0769230704038643),
 ('hear', 0.0769230704038643),
 ('lowest', 0.0769230704038643)]
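The reported probabilities are very close to 2/13 and 1/13. One possible reading, offered as an interpretation rather than as documented behavior, is that the 16 tokens of the last quote lose almost all of the mass for "a" (twice) and "the" (once) to the background model, leaving 13 counts to share the probability. The sketch below only checks that arithmetic.

```python
import re
from collections import Counter

last_quote = (
    "A lover's eyes will gaze an eagle blind. "
    "A lover's ear will hear the lowest sound."
)
tokens = re.sub(r"[.,:;!?\"‘’]|'s\b", " ", last_quote).lower().split()
counts = Counter(tokens)

# Assumption: 'a' and 'the' are treated as fully explained by the background.
remaining = {t: n for t, n in counts.items() if t not in {"a", "the"}}
total = sum(remaining.values())

print(len(tokens), total)  # 16 13
print(abs(remaining["lover"] / total - 0.1538461408077277) < 1e-6)  # True
print(abs(remaining["eyes"] / total - 0.0769230704038643) < 1e-6)   # True
```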

The SignificantWordsLM is similarly initialized with a background corpus, but subsequently takes a group of document tokens as input. Its group_top method returns the top terms and their probabilities:

>>> from wayward import SignificantWordsLM
>>> swlm = SignificantWordsLM(doc_tokens, lambdas=(.7, .1, .2))
>>> swlm.group_top(10, doc_tokens[-3:])
[('in', 0.37875318027881),
 ('is', 0.07195732361699828),
 ('mortal', 0.07195732361699828),
 ('nature', 0.07195732361699828),
 ('all', 0.07110584778711342),
 ('we', 0.03597866180849914),
 ('true', 0.03597866180849914),
 ('lovers', 0.03597866180849914),
 ('strange', 0.03597866180849914),
 ('capers', 0.03597866180849914)]
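The three values in `lambdas` weigh the components of a mixture. As a generic sketch of the expectation step in such a mixture, the function below computes how much of one term occurrence is attributed to each component. The function name `responsibilities`, the component names, and the assignment of the values .7, .1 and .2 to those names are assumptions for illustration; this is not Wayward's code, and the order of `lambdas` is not stated here.

```python
def responsibilities(term_probs, weights):
    """Posterior share of one term occurrence per mixture component.

    `term_probs` maps a component name to P(term | component);
    `weights` maps the same names to their mixture weights.
    """
    joint = {name: weights[name] * term_probs[name] for name in weights}
    total = sum(joint.values())
    if total == 0:
        return {name: 0.0 for name in joint}
    return {name: value / total for name, value in joint.items()}


# Hypothetical weights and term probabilities for a single term.
weights = {"corpus": 0.7, "specific": 0.1, "group": 0.2}
term_probs = {"corpus": 0.05, "specific": 0.0, "group": 0.2}
shares = responsibilities(term_probs, weights)
print({name: round(share, 3) for name, share in shares.items()})
# {'corpus': 0.467, 'specific': 0.0, 'group': 0.533}
```

A term that is frequent in the background gives most of its occurrences to the corpus component, which is how general terms are kept out of the group model.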

See example/dickens.py for a running example with more realistic data.

Background

This package started out as WeighWords, written by Lars Buitinck at the University of Amsterdam. It provides an efficient parsimonious LM implementation, and a very accessible API.

A recent innovation in language modeling, Significant Words Language Models, led to the addition of a two-way parsimonious language model to this package. This new version targets Python 3.x and, after a long slumber, deserved a fresh name. The name “Wayward” was chosen because it is a near-homophone of WeighWords, and as a nod to parsimonious language modeling: it uncovers which terms “depart” most from the background collection. The parsimonization algorithm discounts terms that are already well explained by the background model, until the most wayward terms come out on top.

References

D. Hiemstra, S. Robertson, and H. Zaragoza (2004). Parsimonious Language Models for Information Retrieval. Proc. SIGIR’04.

R. Kaptein, D. Hiemstra, and J. Kamps (2010). How different are Language Models and word clouds? Proc. ECIR’10.

M. Dehghani, H. Azarbonyad, J. Kamps, D. Hiemstra, and M. Marx (2016). Luhn Revisited: Significant Words Language Models. Proc. CIKM’16.

Release history

0.3.2: Jun 9, 2019
0.3.1: Jun 5, 2019 (this version)
0.3.0: Jun 4, 2019

Download files

Source Distribution

wayward-0.3.1.tar.gz (11.4 kB), uploaded Jun 5, 2019, Source

Built Distribution

wayward-0.3.1-py3-none-any.whl (12.6 kB), uploaded Jun 5, 2019, Python 3

File details

Details for the file wayward-0.3.1.tar.gz.

File metadata

Download URL: wayward-0.3.1.tar.gz
Upload date: Jun 5, 2019
Size: 11.4 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: poetry/0.12.16 CPython/3.7.3 Linux/4.15.0-50-generic

File hashes

Hashes for wayward-0.3.1.tar.gz
Algorithm	Hash digest
SHA256	`a7e98253443086a7419b707aba0c89fa826e7f9e69d5a51bf3d32615aa86e30e`
MD5	`0bfe3ca8877027cab017066680bb14ab`
BLAKE2b-256	`812907a7d35270d4e49a182f5d952e5d3be20ec21f5c2925d789f17f793be5af`
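A downloaded file can be checked against the published SHA256 digest with the standard library. The following is a minimal sketch; the helper names `sha256_of_file` and `matches_published_hash` are hypothetical, and the path passed in is assumed to point at your local copy of the source distribution.

```python
import hashlib

# SHA256 digest listed above for wayward-0.3.1.tar.gz.
EXPECTED_SHA256 = "a7e98253443086a7419b707aba0c89fa826e7f9e69d5a51bf3d32615aa86e30e"


def sha256_of_file(path, chunk_size=65536):
    """Return the hex SHA256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path, expected=EXPECTED_SHA256):
    """Compare a file's digest with the expected hex digest."""
    return sha256_of_file(path) == expected.lower()
```

For example, `matches_published_hash("wayward-0.3.1.tar.gz")` should return `True` for an unmodified download of that file.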

File details

Details for the file wayward-0.3.1-py3-none-any.whl.

File metadata

Download URL: wayward-0.3.1-py3-none-any.whl
Upload date: Jun 5, 2019
Size: 12.6 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: poetry/0.12.16 CPython/3.7.3 Linux/4.15.0-50-generic

File hashes

Hashes for wayward-0.3.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`ffd7507f40297dcdb5ec0dac802bb8a1451df585ce6f7ffe2c1550ee2dfa0bf4`
MD5	`449589186aa74af6e2b9404b8d0a9867`
BLAKE2b-256	`a6ab2ff2775433fee869e74738c43a45c70352ec01f5f8f39f0e6c7216cb1a3d`
