A library for document similarity based on TF-IDF and Word2Vec.

TFW2V - A Document Similarity method

Install:

pip install tfw2v

How to use:

from tfw2v import TFW2V



# The input is a list of text documents, given as a Python list or a pandas Series.
# For example:
text = [
    "Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book.", 
    "Contrary to popular belief, Lorem Ipsum is not simply random text. It has roots in a piece of classical Latin literature from 45 BC, making it over 2000 years old.",
    "There are many variations of passages of Lorem Ipsum available, but the majority have suffered alteration in some form, by injected humour, or randomised words which don't look even slightly believable."
]

# Optionally, pass a list of stopwords to use during text processing.
stopwords = ["the", "a", "of"]

# Initialize the TFW2V model instance
model = TFW2V()

# Train a word2vec model with the library,
# or pass your own word embedding model built with the Gensim library.
w2v = model.train_w2v(size=100, epochs=5)

# The word embedding model can be saved to a user-defined path:
w2v.save("path/to/model")

# Now run the process: the model fits TF-IDF and uses the pre-trained w2v model to enhance the result.
result = model.run(text, w2v, stopwords, min_tfidf=0.1, lim_token=20, alpha=0.1, lim_most=0.3)

# The result is a dictionary: each key is a document index, and each value is a list of
# (similar doc index, similarity score) tuples sorted by score in descending order.
# E.g.: result[0] = [(5, 0.9), (3, 0.85), (8, 0.81), (10, 0.76),...]

# To get the top 10 most similar docs for document ID 7: result[7][:10]

# Given a doc index, we can also get the most similar docs together with their text.
# E.g., for doc index 43, get the 10 most similar docs
# (this assumes the corpus contains at least 44 documents):
sim_docs = model.most_similar(43, k=10)

# The output is in pandas Series format, which can be easily viewed:
sim_docs.head()
# or save:
sim_docs.to_csv("path/to/csv_file.csv")
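
The result format described above can be post-processed with plain Python. The following is a minimal sketch, not part of tfw2v: the helper name `top_similar`, its `min_score` parameter, and the sample scores are all hypothetical, and the sample dictionary only imitates the documented shape of the value returned by `model.run`.

```python
from typing import Dict, List, Tuple

# Hypothetical data in the documented shape:
# doc index -> [(similar doc index, similarity score), ...], sorted by score descending.
# These numbers are made up for illustration; they are not real tfw2v output.
sample_result: Dict[int, List[Tuple[int, float]]] = {
    0: [(2, 0.90), (1, 0.40)],
    1: [(2, 0.55), (0, 0.40)],
    2: [(0, 0.90), (1, 0.55)],
}


def top_similar(result, doc_id, k=10, min_score=0.0):
    """Return up to k (doc index, score) pairs for doc_id whose score is >= min_score."""
    neighbours = result.get(doc_id, [])  # unknown doc IDs yield an empty list
    return [(idx, score) for idx, score in neighbours[:k] if score >= min_score]


# Example calls on the hypothetical data
best_for_doc_0 = top_similar(sample_result, 0, k=1)
strong_for_doc_1 = top_similar(sample_result, 1, min_score=0.5)
```

Because each list is already sorted in descending order, slicing before filtering keeps the strongest matches.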

Parameters:

  • min_tfidf: minimum TF-IDF score for a token to be accepted as an important word. Default 0.1.
  • lim_token: maximum number of tokens treated as important words when no token meets the min_tfidf requirement. Default 20.
  • alpha: factor that controls how much the word2vec information affects the TF-IDF similarity score. A smaller alpha gives word2vec less impact; a larger alpha gives it more impact and can produce more surprising results. Default 0.1.
  • lim_most: given a doc, only recalculate the ranking for the top fraction of its most similar docs (for example, 0.2 means the top 20%). This helps the algorithm run faster and avoids overly surprising results from re-ranking the bottom of the list (the least similar docs). Default 1 (all docs). Recommended: 0.2 (top 20% of docs).
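
To make the roles of these parameters more concrete, here is a rough sketch of one way they could interact. It is an assumption for illustration only and not tfw2v's implementation: the function names `select_important_tokens` and `rerank`, the additive blend `tfidf + alpha * w2v`, and the sample numbers are all hypothetical.

```python
import math


def select_important_tokens(tfidf_scores, min_tfidf=0.1, lim_token=20):
    """Keep tokens scoring >= min_tfidf; if none qualify, fall back to the
    lim_token highest-scoring tokens (one reading of the parameter docs)."""
    ranked = sorted(tfidf_scores.items(), key=lambda kv: kv[1], reverse=True)
    kept = [token for token, score in ranked if score >= min_tfidf]
    return kept if kept else [token for token, _ in ranked[:lim_token]]


def rerank(tfidf_ranking, w2v_scores, alpha=0.1, lim_most=1.0):
    """Re-rank only the top lim_most fraction of a descending TF-IDF ranking.

    ASSUMPTION: the blended score is tfidf + alpha * w2v. The real library
    may combine the two signals differently.
    """
    n_top = math.ceil(len(tfidf_ranking) * lim_most)
    head, tail = tfidf_ranking[:n_top], tfidf_ranking[n_top:]
    rescored = [(doc, score + alpha * w2v_scores.get(doc, 0.0)) for doc, score in head]
    rescored.sort(key=lambda pair: pair[1], reverse=True)
    return rescored + tail  # the tail keeps its original order and scores


# Hypothetical inputs
tokens = select_important_tokens({"lorem": 0.4, "ipsum": 0.05, "text": 0.2})
ranking = [(5, 0.50), (3, 0.48), (8, 0.20), (10, 0.10)]
w2v = {5: 0.1, 3: 0.9, 8: 0.9, 10: 0.9}
new_order = [doc for doc, _ in rerank(ranking, w2v, alpha=0.1, lim_most=0.5)]
```

With `lim_most=0.5` only the first two documents are rescored, so documents 8 and 10 stay at the bottom even though their hypothetical word2vec scores are high. This mirrors the stated purpose of `lim_most`.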

Development

  • To build the package, go to the source folder and run:
    python -m build
  • To upload the package to PyPI:
    python -m twine upload --repository pypi dist/*
  • Install the new version:
    pip install --no-deps -U tfw2v

Download files

Source Distribution

tfw2v-0.2.tar.gz (11.7 kB view details)

Uploaded Source

Built Distribution

tfw2v-0.2-py2.py3-none-any.whl (11.1 kB view details)

Uploaded Python 2 Python 3

File details

Details for the file tfw2v-0.2.tar.gz.

File metadata

  • Download URL: tfw2v-0.2.tar.gz
  • Upload date:
  • Size: 11.7 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.7.0 importlib_metadata/2.1.1 pkginfo/1.8.2 requests/2.26.0 requests-toolbelt/0.9.1 tqdm/4.30.0 CPython/3.7.5

File hashes

Hashes for tfw2v-0.2.tar.gz
Algorithm    Hash digest
SHA256       5c1e77c02771809dfbc3be44397a3a483c6dbb9a898b53b8ac944691cd1fc322
MD5          6a7b49aabbad1d8b4f55d6f41e3d1ead
BLAKE2b-256  f68f6ec61c9052b545b97093d638906f86f4b2b7501e9927aa25ec92606a357b

File details

Details for the file tfw2v-0.2-py2.py3-none-any.whl.

File metadata

  • Download URL: tfw2v-0.2-py2.py3-none-any.whl
  • Upload date:
  • Size: 11.1 kB
  • Tags: Python 2, Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.7.0 importlib_metadata/2.1.1 pkginfo/1.8.2 requests/2.26.0 requests-toolbelt/0.9.1 tqdm/4.30.0 CPython/3.7.5

File hashes

Hashes for tfw2v-0.2-py2.py3-none-any.whl
Algorithm    Hash digest
SHA256       bbf13dc6fa5e8c5318510455c2412ac1af675207a2e9f69e1a55236a0e22e2be
MD5          1cd371aceb13a772c1b69213a0dc5ddd
BLAKE2b-256  4107b4b7b803761c96c744b7bb6c32a5e5babef8b28da24ec0de6963da0320e0
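
A downloaded file can be checked against the SHA256 digests listed above using only the Python standard library. This is a minimal sketch: the helper names `sha256_of_file` and `matches_expected` are hypothetical, and the local file path in the usage note is an assumption about where the file was saved.

```python
import hashlib


def sha256_of_file(path, chunk_size=65536):
    """Compute the SHA256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops once read() returns an empty bytes object
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected_hex):
    """Return True if the file's SHA256 digest equals the expected hex string."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

For example, assuming the source distribution was saved as `tfw2v-0.2.tar.gz` in the current directory, `matches_expected("tfw2v-0.2.tar.gz", "5c1e77c02771809dfbc3be44397a3a483c6dbb9a898b53b8ac944691cd1fc322")` should return `True` for an unmodified download.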
