
An implementation of the Hybrid TF-IDF microblog summarisation algorithm as proposed by David Inouye and Jugal K. Kalita.

Project description

Hybrid TF-IDF



This is an implementation of the Hybrid TF-IDF algorithm as proposed by David Inouye and Jugal K. Kalita (2011).

Hybrid TF-IDF is designed with Twitter data in mind, where documents are short. It is an approach to generating Multiple Post Summaries of a collection of documents.
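For intuition, the weighting can be sketched as follows (notation mine, following the 2011 paper; the threshold is the normalisation floor exposed by this package):

% Sketch of the Hybrid TF-IDF weighting (notation mine, after Inouye & Kalita, 2011).
% Word weight: tf is counted over the whole collection treated as one document,
% df over individual posts, and N is the number of posts.
W(w) = \mathrm{tf}(w) \cdot \log_2\!\left(\frac{N}{\mathrm{df}(w)}\right)

% Post weight: the sum of the word weights, normalised by a floored post length,
% so the threshold parameter controls the bias against very short posts.
W(S) = \frac{\sum_{w \in S} W(w)}{\max(\mathrm{threshold},\ |S|)}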

Simply install with:

pip install hybridtfidf

Load some short texts of the form:

documents = ['This is one example of a short text.',
             "Designed for twitter posts, a typical 'short document' will have fewer than 280 characters!"
             ]

The algorithm works best on tokenized data with stopwords removed, although this is not required. You can tokenize your documents any way you like. Here is an example using the popular NLTK package:

import nltk
nltk.download('stopwords')
nltk.download('punkt')    # word_tokenize needs the Punkt tokenizer models

documents = ["This is one example of a short text.",
            "Designed for twitter posts, a typical 'short document' will have fewer than 280 characters!"
            ]

stop_words = set(nltk.corpus.stopwords.words('english'))

tokenized_documents = []

for document in documents:
    tokens = nltk.tokenize.word_tokenize(document)
    tokenized_document = [token for token in tokens if token not in stop_words]
    tokenized_documents.append(tokenized_document)    

# tokenized_documents[0] = ['This','one','example','short','text','.']

The algorithm, however, requires each document to be a single string. If you use NLTK's tokenizer, make sure to re-join the tokens of each document.

tokenized_documents = [' '.join(document) for document in tokenized_documents]

# tokenized_documents[0] = 'This one example short text .'

Create a HybridTfidf object and fit it on the data:

from hybridtfidf import HybridTfidf

hybridtfidf = HybridTfidf(threshold=7)
hybridtfidf.fit(tokenized_documents)

# The threshold value affects how strongly the algorithm biases towards longer documents.
# A higher threshold will make longer documents have a higher post weight
# (see the next snippets of code for what post weight does).

Transform the documents into their Hybrid TF-IDF vector representations, and get the saliency values for each document.

document_vectors = hybridtfidf.transform(tokenized_documents)
document_weights = hybridtfidf.transform_to_weights(tokenized_documents)

The document vectors represent the documents embedded in Hybrid TF-IDF vector space; any linear-algebra technique can be applied to them!
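For example, you can measure how similar two posts are with cosine similarity. A minimal sketch using NumPy (NumPy is an assumption here, not a dependency of this package):

import numpy as np

def cosine_similarity(u, v):
    # Cosine of the angle between two Hybrid TF-IDF vectors.
    u, v = np.asarray(u, dtype=float), np.asarray(v, dtype=float)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cosine_similarity(document_vectors[0], document_vectors[1]))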

The document weights list gives you a single number for each document, reflecting how salient that document is (how strongly it contributes towards a topical discussion). In theory, spammy documents will have a low post saliency weight.
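For instance, ranking the posts by saliency is a one-liner (a sketch reusing document_weights from above):

# Indices of the documents, most salient first.
ranking = sorted(range(len(document_weights)), key=lambda i: document_weights[i], reverse=True)
print(documents[ranking[0]])    # the single most salient post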

Lastly, Inouye and Kalita proposed using Hybrid TF-IDF to summarise the collection of documents. We select 'k' of the most salient documents, and to avoid redundancy we do not select any document that is too cosine-similar to a previously selected one. In effect we select the top 'k' most important documents, skipping over documents that talk about the same topic. That is, we summarise the collection of documents into 'k' representative documents.

# Get the indices of the most significant documents. 
from hybridtfidf.utils import select_salient_documents

most_significant = select_salient_documents(document_vectors, document_weights, k=5, similarity_threshold=0.5)

for i in most_significant:
    print(documents[i])         # Prints the 'k' most significant documents that are each about a separate topic

Note: the indices of the fit() input (the original document list), of document_vectors, and of document_weights are all aligned. Make sure not to re-order one without re-ordering the others in the same way.
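For intuition, the selection can be pictured as a greedy pass over the documents in order of decreasing weight. This is an illustrative sketch only, not the library's actual source; it reuses the cosine_similarity helper sketched earlier:

def select_salient_sketch(vectors, weights, k=5, similarity_threshold=0.5):
    # Visit documents from most to least salient.
    order = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    selected = []
    for i in order:
        if len(selected) == k:
            break
        # Keep document i only if it is not too similar to any document already chosen.
        if all(cosine_similarity(vectors[i], vectors[j]) < similarity_threshold for j in selected):
            selected.append(i)
    return selected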

