Multithreaded TF-IDF vectorization for similarity search, using sparse matrices for the computations.

Project description

Threaded-Sparse-TFIDF

This repository implements multithreaded TF-IDF vectorization for similarity search, using sparse matrices for the computations.

Usage:

from TF_IDF import TF_IDF_Vectorizer

tf_idf = TF_IDF_Vectorizer(use_cached=True, print_output=False)
k = 4  # number of workers; the original example left k undefined
_, ranking = tf_idf.get_similarity_score("science fiction super hero movie", num_workers=k)
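
To make the idea behind the call above concrete, here is a minimal sketch of partitioned TF-IDF cosine scoring with a thread pool. It is not the package's implementation: the function names (`build_tfidf`, `vectorize`, `score_partition`, `rank`), the whitespace tokenizer, and the smoothed IDF formula are all assumptions made for illustration, and plain dictionaries stand in for sparse matrix rows so the example runs with the standard library only.

```python
import math
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def vectorize(tokens, idf):
    """Sparse, L2-normalised TF-IDF vector; terms without an IDF entry are dropped."""
    counts = Counter(t for t in tokens if t in idf)
    vec = {t: c * idf[t] for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else {}


def build_tfidf(docs):
    """Return (rows, idf); each row is a dict mapping term -> weight."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed IDF (an assumed variant), so no term gets zero weight.
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1 for t, c in df.items()}
    return [vectorize(tokens, idf) for tokens in tokenized], idf


def score_partition(rows, start, stop, query_vec):
    """Cosine similarity of the query against rows[start:stop]."""
    return [
        (i, sum(w * rows[i].get(t, 0.0) for t, w in query_vec.items()))
        for i in range(start, stop)
    ]


def rank(rows, idf, query, num_workers=4):
    """Score every document, one contiguous partition per worker, best first."""
    query_vec = vectorize(query.lower().split(), idf)
    n = len(rows)
    size = max(1, math.ceil(n / num_workers))
    bounds = [(s, min(s + size, n)) for s in range(0, n, size)]
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        parts = pool.map(
            lambda b: score_partition(rows, b[0], b[1], query_vec), bounds
        )
        scores = [pair for part in parts for pair in part]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


docs = [
    "science fiction super hero movie",
    "romantic comedy set in paris",
    "documentary about deep sea fish",
]
rows, idf = build_tfidf(docs)
ranking = rank(rows, idf, "super hero movie", num_workers=2)
print(ranking[0])  # (document index, cosine similarity) of the best match
```

Because only non-zero weights are stored, each dot product touches just the query's terms. Note that in CPython, pure-Python scoring like this is constrained by the global interpreter lock, so threads are most likely to help when the per-partition work is done by native code that releases the lock.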

Performance:

Table:

num_workers  time                partition_size
1            1.1117637634277344  6.778499999999999
2            0.8195240020751953  3.4149000000000003
3            0.7357232332229614  2.2773
4            0.7232689380645752  1.7081
5            0.7375946760177612  1.3555999999999997
6            0.7682486534118652  1.1307000000000003
7            0.7640876531600952  0.9618
8            0.7513441801071167  0.8506
9            0.7795052766799927  0.7587
10           0.8141436100006103  0.6807
11           0.8003325223922729  0.6195000000000002
12           0.8441393852233887  0.5697
13           0.8490614175796509  0.5258000000000002
14           0.9322290658950806  0.48739999999999994
15           0.8824400186538697  0.45729999999999993
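
The table is easier to read as speedup relative to a single worker. The short sketch below copies the `num_workers` and `time` columns from the table and computes that ratio; the variable names are illustrative, and the unit of `time` is not stated in the source, so only the ratios are meaningful here.

```python
# (num_workers, time) pairs copied from the table above.
# The unit of `time` is not stated in the source.
timings = [
    (1, 1.1117637634277344),
    (2, 0.8195240020751953),
    (3, 0.7357232332229614),
    (4, 0.7232689380645752),
    (5, 0.7375946760177612),
    (6, 0.7682486534118652),
    (7, 0.7640876531600952),
    (8, 0.7513441801071167),
    (9, 0.7795052766799927),
    (10, 0.8141436100006103),
    (11, 0.8003325223922729),
    (12, 0.8441393852233887),
    (13, 0.8490614175796509),
    (14, 0.9322290658950806),
    (15, 0.8824400186538697),
]

baseline = timings[0][1]  # time with a single worker
speedup = {workers: baseline / t for workers, t in timings}

# Worker count with the lowest measured time.
best_workers, best_time = min(timings, key=lambda row: row[1])

for workers, t in timings:
    print(f"{workers:>2} workers: {speedup[workers]:.2f}x")
print("fastest:", best_workers, "workers")
```

On these measurements the lowest time is at 4 workers, roughly 1.5 times faster than a single worker, and times drift upward again as more workers are added.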

Data

A subset of the Information Retrieval Dataset - Internet Movie Database (IMDB), restricted to movies released after 2007.

Download files

Source distribution: Threaded_Sparse_TFIDF-0.2.tar.gz (4.9 kB)

Built distribution: Threaded_Sparse_TFIDF-0.2-py2.py3-none-any.whl (7.4 kB, Python 2 and Python 3)
