Multithreaded TF-IDF vectorization for similarity search, using sparse matrices for computation.
Threaded-Sparse-TFIDF
A repository for multithreaded TF-IDF vectorization for similarity search, using sparse matrices for computation.
Usage:
from TF_IDF import TF_IDF_Vectorizer
tf_idf = TF_IDF_Vectorizer(use_cached=True, print_output=False)
k = 4  # number of worker threads to use
_, ranking = tf_idf.get_similarity_score("science fiction super hero movie", num_workers=k)
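To make the idea behind the call above concrete, here is a minimal, standard-library-only sketch of TF-IDF ranking over sparse vectors. It is not this package's implementation: the helper names (`build_tfidf`, `vectorize`, `rank`), the whitespace tokenizer, the smoothed IDF formula, and the three toy documents are all assumptions made for illustration, and plain dictionaries stand in for a sparse matrix.

```python
import math
from collections import Counter

def vectorize(tokens, idf):
    """L2-normalised sparse TF-IDF vector; terms missing from idf are dropped."""
    counts = Counter(t for t in tokens if t in idf)
    weights = {t: c * idf[t] for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()} if norm else {}

def build_tfidf(docs):
    """Return (idf, vectors) with one sparse {term: weight} dict per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed IDF (an assumed variant) keeps every weight positive.
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1 for t, c in df.items()}
    return idf, [vectorize(tokens, idf) for tokens in tokenized]

def rank(query, docs):
    """Return (scores, ranking); ranking lists document indices, best first."""
    idf, vectors = build_tfidf(docs)
    q = vectorize(query.lower().split(), idf)
    # For unit-length vectors, cosine similarity is a sparse dot product:
    # only terms present in the query contribute.
    scores = [sum(w * vec.get(t, 0.0) for t, w in q.items()) for vec in vectors]
    ranking = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
    return scores, ranking

# Hypothetical toy corpus, not the IMDB subset used by the package.
docs = [
    "a super hero saves the city in this science fiction movie",
    "a romantic comedy set in paris",
    "documentary about deep sea science",
]
scores, ranking = rank("science fiction super hero movie", docs)
print(ranking)
```

Because only non-zero weights are stored, the cost of scoring a document depends on the number of query terms rather than on the vocabulary size, which is the main reason to keep the computation sparse.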
Performance:
Table:
num_workers | time | partition_size |
---|---|---|
1 | 1.1118 | 6.7785 |
2 | 0.8195 | 3.4149 |
3 | 0.7357 | 2.2773 |
4 | 0.7233 | 1.7081 |
5 | 0.7376 | 1.3556 |
6 | 0.7682 | 1.1307 |
7 | 0.7641 | 0.9618 |
8 | 0.7513 | 0.8506 |
9 | 0.7795 | 0.7587 |
10 | 0.8141 | 0.6807 |
11 | 0.8003 | 0.6195 |
12 | 0.8441 | 0.5697 |
13 | 0.8491 | 0.5258 |
14 | 0.9322 | 0.4874 |
15 | 0.8824 | 0.4573 |
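The table relates `num_workers` to the size of each partition. As a rough sketch of how such a split could work, the example below divides the rows of a sparse collection into contiguous slices and scores each slice in its own thread. This is an assumption about the general technique, not the package's code: `sparse_dot`, `partition`, and `threaded_scores` are hypothetical helpers, and rows are plain `{index: value}` dictionaries rather than a sparse matrix type.

```python
from concurrent.futures import ThreadPoolExecutor

def sparse_dot(a, b):
    """Dot product of two sparse vectors stored as {index: value} dicts."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the shorter vector
    return sum(v * b.get(k, 0.0) for k, v in a.items())

def partition(n_rows, num_workers):
    """Split range(n_rows) into at most num_workers contiguous (start, stop) slices."""
    size = max(1, -(-n_rows // num_workers))  # ceiling division, at least 1
    return [(start, min(start + size, n_rows)) for start in range(0, n_rows, size)]

def threaded_scores(query, rows, num_workers):
    """Score every row against the query, one partition per worker thread."""
    def score_slice(bounds):
        start, stop = bounds
        return [sparse_dot(query, rows[i]) for i in range(start, stop)]

    slices = partition(len(rows), num_workers)
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        # map() yields results in input order, so the partial lists
        # concatenate back into the original row order.
        parts = list(pool.map(score_slice, slices))
    return [score for part in parts for score in part]

# Hypothetical toy data for illustration only.
rows = [{0: 1.0, 2: 2.0}, {1: 3.0}, {0: 0.5, 1: 0.5}, {2: 4.0}, {}]
query = {0: 2.0, 2: 1.0}
print(threaded_scores(query, rows, num_workers=2))
```

Adding workers shrinks each partition but also adds thread start-up and scheduling overhead, and in CPython pure-Python loops like the one in this sketch are serialized by the global interpreter lock. That is one plausible reason the measured time stops improving after about four workers.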
Data
A subset of the Information Retrieval Dataset - Internet Movie Database (IMDB), restricted to movies released after 2007.
Hashes for Threaded_Sparse_TFIDF-0.2.tar.gz
Algorithm | Hash digest |
---|---|
SHA256 | cf0491f15cb60f8460092e62dbcac699583d9bae154cbf6d74ff9ac6d46367f0 |
MD5 | 5f6e987edf34301ddc92c8fee52eb9b1 |
BLAKE2b-256 | 2a1e5dbf77455132525d214cb71ec44550cd3546188e3a163c06c47fea4ea21d |
Hashes for Threaded_Sparse_TFIDF-0.2-py2.py3-none-any.whl
Algorithm | Hash digest |
---|---|
SHA256 | b509f44fa51a97eae25b1974e46c488ec8a3b02bb71d8799d452a63c606e0478 |
MD5 | 9db27031a4122526ca8fab0b841942cd |
BLAKE2b-256 | b23e7e80051362febe646470602249448278add08129c0876c8dc36718dcf56c |