An implementation of the Hybrid TF-IDF microblog summarisation algorithm as proposed by David Inouye and Jugal K. Kalita.

Hybrid TF-IDF

This is an implementation of the Hybrid TF-IDF algorithm as proposed by David Inouye and Jugal K. Kalita (2011).

Hybrid TF-IDF is designed with Twitter data in mind, where documents are short. It is an approach to generating Multiple Post Summaries of a collection of documents.

Simply install with:

pip install hybridtfidf

Load some short texts of the form:

documents = ["This is one example of a short text.",
            "Designed for twitter posts, a typical 'short document' will have fewer than 280 characters!"
            ]

The algorithm works best on tokenized data with stopwords removed, although this is not required. You can tokenize your documents any way you like. Here is an example using the popular NLTK package:

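import nltk
nltk.download('stopwords')
nltk.download('punkt')  # tokenizer models used by word_tokenize; recent NLTK releases ask for 'punkt_tab' instead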

documents = ["This is one example of a short text.",
            "Designed for twitter posts, a typical 'short document' will have fewer than 280 characters!"
            ]

stop_words = set(nltk.corpus.stopwords.words('english'))

tokenized_documents = []

for document in documents:
    tokens = nltk.tokenize.word_tokenize(document)
    tokenized_document = [token for token in tokens if token not in stop_words]
    tokenized_documents.append(tokenized_document)

# tokenized_documents[0] = ['This','one','example','short','text','.']

However, the algorithm requires each document to be a single string. If you use NLTK's tokenizer, make sure to re-join each document's tokens into one string.

tokenized_documents = [' '.join(document) for document in tokenized_documents]

# tokenized_documents[0] = 'This one example short text .'

Create a HybridTfidf object and fit it on the data:

from hybridtfidf import HybridTfidf  # assumed import path; adjust if your installed version differs

hybridtfidf = HybridTfidf(threshold=7)
hybridtfidf.fit(tokenized_documents)

# The threshold value affects how strongly the algorithm biases towards longer documents.
# A higher threshold gives longer documents a higher post weight
# (see the next code snippets for what the post weight does).
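To make the role of the threshold concrete, here is a minimal from-scratch sketch of the Hybrid TF-IDF post weight as described by Inouye and Kalita: term frequency is counted over the whole collection, document frequency is counted per post, and each post's summed word weight is divided by max(threshold, post length). The function name `hybrid_tfidf_weights` and the toy posts are assumptions made for illustration. This is not the package's internal code, and its exact numbers may differ.

```python
import math
from collections import Counter


def hybrid_tfidf_weights(posts, threshold):
    """Score whitespace-tokenized posts with the Hybrid TF-IDF formula.

    Illustrative re-implementation (hypothetical helper), not the package's code.
    """
    tokenized = [post.split() for post in posts]

    # Term frequency treats the whole collection as one big document.
    all_words = [word for tokens in tokenized for word in tokens]
    term_counts = Counter(all_words)
    total_words = len(all_words)

    # Document frequency still counts individual posts.
    doc_counts = Counter(word for tokens in tokenized for word in set(tokens))
    n_posts = len(posts)

    def word_weight(word):
        tf = term_counts[word] / total_words
        idf = n_posts / doc_counts[word]
        return tf * math.log2(idf)

    weights = []
    for tokens in tokenized:
        # Dividing by max(threshold, length) stops very short posts
        # from receiving inflated scores.
        normaliser = max(threshold, len(tokens))
        weights.append(sum(word_weight(w) for w in tokens) / normaliser)
    return weights


# Made-up posts: a 2-word post, a 7-word post, and a 1-word post.
posts = [
    "storm warning",
    "storm warning issued for the coast tonight",
    "coffee",
]
low = hybrid_tfidf_weights(posts, threshold=1)
high = hybrid_tfidf_weights(posts, threshold=7)
```

With `threshold=1` the one-word post scores highest because its weight is divided only by its own length. With `threshold=7` every short post is divided by 7, so the seven-word post comes out on top while its own score is unchanged.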

Transform the documents into their Hybrid TF-IDF vector representations, and get the saliency values for each document.

post_vectors = hybridtfidf.transform(tokenized_documents)
post_weights = hybridtfidf.transform_to_weights(tokenized_documents)

The post vectors represent the documents as embedded in Hybrid TF-IDF vector space, so any linear algebra technique can be applied to them.

The post weights list gives a single number for each document. This number reflects how salient the document is (how strongly it contributes to a topical discussion). In theory, spammy documents will have a low saliency weight.
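As a small illustration of using the weights, the sketch below ranks documents from most to least salient by index, so the ordering can be applied to the documents, vectors, and weights alike. The helper `rank_by_weight` and the weight values are made up for this example. In practice the weights would come from transform_to_weights.

```python
def rank_by_weight(weights):
    """Return document indices ordered from most to least salient."""
    return sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)


example_documents = [
    "Storm warning issued for the coast tonight",
    "coffee",
    "win a free phone click here",
]
example_weights = [0.147, 0.023, 0.051]  # made-up values for illustration only

ranking = rank_by_weight(example_weights)
for i in ranking:
    print(f"{example_weights[i]:.3f}  {example_documents[i]}")
```

Working with indices rather than sorting the documents themselves keeps the three lists lined up.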

Lastly, Inouye and Kalita proposed using Hybrid TF-IDF to summarise the collection of documents. We select the 'k' most salient documents and, to avoid redundancy, skip any document that is too cosine-similar to one already selected. In effect, we pick the top 'k' most important documents while passing over those that repeat a topic already covered, so the collection is summarised by 'k' representative documents.

# Get the indices of the most significant documents.
# (select_salient_posts is provided by the package and must be imported first.)
most_significant = select_salient_posts(post_vectors, post_weights, k=5, similarity_threshold=0.5)

for i in most_significant:
    print(documents[i])  # Prints the 'k' most significant documents that are each about a separate topic
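The selection step described above can be sketched in plain Python as a greedy loop: visit documents in order of decreasing weight and keep one only if its cosine similarity to every already kept document is at or below the threshold. The names `cosine_similarity` and `greedy_select`, the strict `>` comparison, and the toy vectors are assumptions for illustration. The package's select_salient_posts may differ in its details.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def greedy_select(vectors, weights, k, similarity_threshold):
    """Pick up to k indices by descending weight, skipping near-duplicates."""
    order = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    selected = []
    for i in order:
        if len(selected) == k:
            break
        too_similar = any(
            cosine_similarity(vectors[i], vectors[j]) > similarity_threshold
            for j in selected
        )
        if not too_similar:
            selected.append(i)
    return selected


# Toy data: documents 0 and 1 point in almost the same direction.
toy_vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
toy_weights = [0.9, 0.8, 0.5]
picked = greedy_select(toy_vectors, toy_weights, k=2, similarity_threshold=0.5)
```

Here document 1 is skipped because it nearly duplicates document 0, so the lower-weighted but distinct document 2 takes the second slot.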

Note: the fit() input (the starting document list), the post_vectors, and the post_weights all share the same index order. Do not re-order one without re-ordering the others in the same way.

Release history

Version	Release date
1.1.0	Jun 3, 2021
1.0.11	Jun 3, 2021
1.0.10	Jun 3, 2021
1.0.9	Jun 3, 2021
1.0.8	Jun 3, 2021
1.0.7	Jun 3, 2021
1.0.6	Jun 20, 2020
1.0.5	Jun 7, 2020
1.0.4.1	Jun 7, 2020
1.0.4	Jun 7, 2020
1.0.3	Jun 7, 2020
1.0.2	Jun 7, 2020
1.0.1 (this version)	Jun 7, 2020
1.0	Jun 5, 2020
0.6.3	Jun 4, 2020
0.6.2	Jun 4, 2020
0.6.1	Jun 4, 2020
0.6	Jun 4, 2020
0.5	Jun 4, 2020

Download files

Source Distribution

hybridtfidf-1.0.1.tar.gz (8.2 kB)

Uploaded Jun 7, 2020 Source

File details

Details for the file hybridtfidf-1.0.1.tar.gz.

File metadata

Download URL: hybridtfidf-1.0.1.tar.gz
Upload date: Jun 7, 2020
Size: 8.2 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.22.0 setuptools/40.8.0 requests-toolbelt/0.9.1 tqdm/4.42.0 CPython/3.7.3

File hashes

Hashes for hybridtfidf-1.0.1.tar.gz
Algorithm	Hash digest
SHA256	`eb07a2608ac386db6ebf1cdc23809603aa477da961fcb39646fe1749ce5b37b1`
MD5	`e1de5384d7b053659a6e1e08c881b033`
BLAKE2b-256	`8d24ab46f85d30f20e4a3a08f5abb9a2353e46ca681d901f21529c285adb9f29`