An implementation of the Hybrid TF-IDF microblog summarisation algorithm as proposed by David Inouye and Jugal K. Kalita.

Hybrid TF-IDF


This is an implementation of the Hybrid TF-IDF algorithm as proposed by David Inouye and Jugal K. Kalita (2011).

Hybrid TF-IDF is designed with Twitter data in mind, where documents are short. It is an approach to generating Multiple Post Summaries of a collection of documents.

Simply install with:

pip install hybridtfidf

Load some short texts of the form:

documents = ["This is one example of a short text.",
             "Designed for twitter posts, a typical 'short document' will have fewer than 280 characters!"
             ]

The algorithm works best on tokenized data with stopwords removed, although this is not required. You can tokenize your documents any way you like. Here is an example using the popular NLTK package:

import nltk

# word_tokenize needs the 'punkt' tokenizer models as well as the stopword list
# (recent NLTK releases may ask for 'punkt_tab' instead)
nltk.download('punkt')
nltk.download('stopwords')

documents = ["This is one example of a short text.",
             "Designed for twitter posts, a typical 'short document' will have fewer than 280 characters!"
             ]

stop_words = set(nltk.corpus.stopwords.words('english'))

tokenized_documents = []

for document in documents:
    tokens = nltk.tokenize.word_tokenize(document)
    # The comparison is case-sensitive, so capitalised words such as 'This' are kept
    tokenized_document = [token for token in tokens if token not in stop_words]
    tokenized_documents.append(tokenized_document)

# tokenized_documents[0] = ['This', 'one', 'example', 'short', 'text', '.']

The algorithm does, however, require each document to be a single string. If you use NLTK's tokenizer, make sure to re-join the tokens of each document into one string.

tokenized_documents = [' '.join(document) for document in tokenized_documents]

# tokenized_documents[0] = 'This one example short text .'

Create a HybridTfidf object and fit it on the data:

from hybridtfidf import HybridTfidf  # assumed import path; check the package if this fails

hybridtfidf = HybridTfidf(threshold=7)
hybridtfidf.fit(tokenized_documents)

# The threshold value affects how strongly the algorithm biases towards longer documents.
# A higher threshold gives longer documents a higher post weight
# (see the next snippets of code for what the post weight does).
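To make the effect of the threshold concrete, here is a minimal sketch of the post weighting as the Inouye and Kalita paper describes it, as far as we understand it: term frequency is taken over the whole collection, inverse document frequency over individual posts, and each post's summed word weights are divided by the larger of the threshold and the post length. The function name hybrid_tfidf_weights is hypothetical, and this is an illustration under those assumptions rather than the package's internal code.

```python
import math
from collections import Counter


def hybrid_tfidf_weights(posts, threshold=7):
    """Return one saliency weight per post (hypothetical helper, not the package API).

    Each post is a string whose tokens are separated by whitespace.
    """
    tokenized = [post.split() for post in posts]
    total_words = sum(len(tokens) for tokens in tokenized)

    # Term frequency is computed over the whole collection, treated as one document.
    term_counts = Counter(word for tokens in tokenized for word in tokens)
    # Document frequency counts the posts that contain each word at least once.
    post_counts = Counter(word for tokens in tokenized for word in set(tokens))

    def word_weight(word):
        tf = term_counts[word] / total_words
        idf = math.log2(len(posts) / post_counts[word])
        return tf * idf

    weights = []
    for tokens in tokenized:
        # Posts shorter than the threshold are divided by the threshold
        # instead of their own length, which lowers their weight.
        normaliser = max(threshold, len(tokens))
        weights.append(sum(word_weight(word) for word in tokens) / normaliser)
    return weights


posts = ["a b", "a c", "a d e f"]
low_threshold = hybrid_tfidf_weights(posts, threshold=1)
high_threshold = hybrid_tfidf_weights(posts, threshold=7)
```

In this toy collection the four-word post scores 1.5 times the two-word post with a threshold of 1, and 3 times as much with a threshold of 7, which is the bias towards longer documents described above.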

Transform the documents into their Hybrid TF-IDF vector representations, and get the saliency values for each document.

post_vectors = hybridtfidf.transform(tokenized_documents)
post_weights = hybridtfidf.transform_to_weights(tokenized_documents)

The post vectors represent the documents as embedded in Hybrid TF-IDF vector space, so any linear algebra technique can be applied to them.

The post weights list gives you a single number for each document. This number reflects how salient the document is, that is, how strongly it contributes towards a topical discussion. In theory, spammy documents will have a low post saliency weight.
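One simple use of the post weights is to rank the documents from most to least salient. The sketch below assumes only that the weights form a sequence with one number per document, in the same order as the input. The helper rank_by_weight and the example values are hypothetical.

```python
def rank_by_weight(weights):
    """Return document indices ordered from most to least salient."""
    return sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)


# Illustrative weights only; real values would come from transform_to_weights().
example_weights = [0.12, 0.45, 0.03]
ranking = rank_by_weight(example_weights)
```

The indices at the end of the ranking point at the documents with the lowest saliency, which is where spam-like posts would be expected to land.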

Lastly, Inouye and Kalita proposed using Hybrid TF-IDF to summarise the collection of documents. We select 'k' of the most relevant/salient documents, and to avoid redundancy we do not select any document that is too cosine-similar to a previously selected one. In effect we select the top 'k' most important documents, skipping over documents that talk about the same topic. That is, we summarise the collection of documents into 'k' representative documents.

from hybridtfidf.utils import select_salient_posts  # assumed import path; check the package if this fails

# Get the indices of the most significant documents.
most_significant = select_salient_posts(post_vectors, post_weights, k=5, similarity_threshold=0.5)

for i in most_significant:
    print(documents[i])  # Prints the 'k' most significant documents, each about a separate topic

Note: The indices of the fit() input (the starting document list), the post_vectors, and the post_weights all line up. Do not re-order one without re-ordering the others in the same way.
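The selection step described above can be sketched as a greedy loop: visit the documents in descending order of weight and keep a document only if its cosine similarity to every document kept so far does not exceed the threshold. The functions cosine_similarity and greedy_select are hypothetical illustrations that assume the vectors are plain sequences of numbers. They are not the package's select_salient_posts, whose exact tie-breaking and threshold handling may differ.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def greedy_select(vectors, weights, k=5, similarity_threshold=0.5):
    """Return the indices of up to k salient, mutually dissimilar documents."""
    ranked = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    selected = []
    for i in ranked:
        if len(selected) == k:
            break
        # Skip a document that is too similar to one already selected.
        if all(cosine_similarity(vectors[i], vectors[j]) <= similarity_threshold
               for j in selected):
            selected.append(i)
    return selected


# Toy vectors: documents 0 and 1 point in almost the same direction.
toy_vectors = [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0]]
toy_weights = [0.9, 0.8, 0.5]
summary_indices = greedy_select(toy_vectors, toy_weights, k=2)
```

Here document 1 is skipped even though it has the second-highest weight, because it is nearly identical to document 0. The returned indices refer to positions in the original document list, which is why the ordering note above matters.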
