A library for document similarity based on TF-IDF and Word2Vec.

TFW2V - A Document Similarity method

Install:

pip install -U tfw2v

How to use:

Given a collection of text documents as a Python list or a pandas Series:

# For example

text = [
    "Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book.", 
    "Contrary to popular belief, Lorem Ipsum is not simply random text. It has roots in a piece of classical Latin literature from 45 BC, making it over 2000 years old.",
    "There are many variations of passages of Lorem Ipsum available, but the majority have suffered alteration in some form, by injected humour, or randomised words which don't look even slightly believable."
]

First, import and initialize the model:

from tfw2v import TFW2V

# init tfw2v model instance
model = TFW2V()

You can train a word2vec model with the library, or pass in your own word embedding model built with the Gensim library.

w2v = model.train_w2v(size=100, epochs=5)

# The word embedding model can be saved to a user-defined path:
w2v.save("path/to/model")

You can optionally pass a list of stopwords to be removed during text processing.

# Example:
stopwords = ["the", "a", "of"]

Now run the process. The model trains TF-IDF on the documents and uses the pre-trained w2v model to enhance the result:

result = model.run(text, w2v, stopwords, min_tfidf=0.1, lim_token=20, alpha=0.1, lim_most=0.3)

The result is a dictionary whose keys are document indexes and whose values are lists of (similar document index, similarity score) pairs, sorted by score in descending order.
E.g.: result[0] = [(5, 0.9), (3, 0.85), (8, 0.81), (10, 0.76), ...].
To get the 10 most similar docs for document index 7: result[7][:10]
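
For intuition about the TF-IDF half of the process, here is a minimal standard-library sketch that builds a ranking with the same shape as the result dictionary described above. It is not the library's implementation: the helper names (tokenize, tfidf_vectors, cosine, tfidf_ranking), the whitespace tokenizer, and the smoothed IDF formula are all assumptions made for illustration, and the word2vec enhancement step is left out.

```python
import math
from collections import Counter


def tokenize(doc, stopwords=None):
    """Lowercase, split on whitespace, strip punctuation, drop stopwords."""
    stop = set(stopwords or [])
    tokens = (tok.strip(".,;:!?\"'()") for tok in doc.lower().split())
    return [tok for tok in tokens if tok and tok not in stop]


def tfidf_vectors(docs, stopwords=None):
    """Return one {token: tf-idf weight} dict per document."""
    tokenized = [tokenize(doc, stopwords) for doc in docs]
    n_docs = len(tokenized)
    # Number of documents each token appears in.
    doc_freq = Counter(tok for toks in tokenized for tok in set(toks))
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        total = len(toks) or 1
        vectors.append({
            # Term frequency times a smoothed inverse document frequency (assumed form).
            tok: (count / total) * (math.log((1 + n_docs) / (1 + doc_freq[tok])) + 1)
            for tok, count in counts.items()
        })
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse {token: weight} vectors."""
    dot = sum(w * b.get(tok, 0.0) for tok, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def tfidf_ranking(docs, stopwords=None):
    """Map each doc index to [(other index, score), ...] sorted descending."""
    vectors = tfidf_vectors(docs, stopwords)
    ranking = {}
    for i, vec in enumerate(vectors):
        scores = [(j, cosine(vec, other)) for j, other in enumerate(vectors) if j != i]
        ranking[i] = sorted(scores, key=lambda pair: pair[1], reverse=True)
    return ranking


docs = [
    "the cat sat on the mat",
    "a cat lay on a mat",
    "stock prices fell sharply today",
]
baseline = tfidf_ranking(docs, stopwords=["the", "a", "on"])
print(baseline[0])  # doc 1 ranks above doc 2, which shares no tokens with doc 0
```

Slicing works the same way as with the real result, for example baseline[0][:1] for the single most similar document.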

Given a doc index, we can also get the most similar docs together with their text:

# E.g.: for doc index 43, request the 10 most similar docs.
# The return value contains the similar docs together with their text.
sim_docs = model.most_similar(43, k=10)

# The output is a pandas Series, which can be easily viewed:
sim_docs.head()
# or save:
sim_docs.to_csv("path/to/csv_file.csv")

To save and load the model:

model.save("path/to/tfw2v")
model.load("path/to/tfw2v")

Parameters for model.run() function:

  • w2v: a Gensim word embedding model. Required.
  • stopwords: list of stopwords. Optional. Default None.
  • min_tfidf: minimum TF-IDF score for a token to be accepted as an important word. Default 0.1.
  • lim_token: maximum number of tokens treated as important words when no token meets the min_tfidf requirement. Default 20.
  • alpha: factor that controls how much the word2vec information affects the TF-IDF similarity score. A smaller alpha gives word2vec less impact; a larger alpha can produce more surprising results. Default 0.1.
  • lim_most: given a doc, re-calculate the ranking only for the top fraction of its most similar docs. This helps the algorithm run faster and avoids overly surprising results from re-ranking the bottom of the list (the least similar docs). Default 1 (all docs). Recommended 0.2 (top 20% of docs).
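
The parameter descriptions above can be read as the following sketch. It is only one plausible interpretation, not the library's code: the function names (select_important_tokens, rerank_top_fraction), the precomputed w2v_sim input, and in particular the linear blend of the two scores are hypothetical choices made for illustration.

```python
def select_important_tokens(tfidf_scores, min_tfidf=0.1, lim_token=20):
    """Pick the tokens treated as important for one document.

    tfidf_scores: {token: tf-idf score} for a single document.
    """
    ranked = sorted(tfidf_scores.items(), key=lambda pair: pair[1], reverse=True)
    important = [tok for tok, score in ranked if score >= min_tfidf]
    if not important:
        # Fallback described for lim_token: no token reached min_tfidf,
        # so keep the lim_token highest-scoring tokens instead.
        important = [tok for tok, _ in ranked[:lim_token]]
    return important


def rerank_top_fraction(tfidf_ranked, w2v_sim, alpha=0.1, lim_most=1.0):
    """Re-rank only the top lim_most fraction of a TF-IDF ranking.

    tfidf_ranked: [(doc_index, tfidf_score), ...] sorted descending.
    w2v_sim: {doc_index: embedding-based similarity} (hypothetical input).
    The linear blend below is an assumption, not the library's formula.
    """
    cutoff = max(1, int(len(tfidf_ranked) * lim_most)) if tfidf_ranked else 0
    head, tail = tfidf_ranked[:cutoff], tfidf_ranked[cutoff:]
    blended = [(idx, score + alpha * w2v_sim.get(idx, 0.0)) for idx, score in head]
    blended.sort(key=lambda pair: pair[1], reverse=True)
    # Docs below the cutoff keep their original TF-IDF order and scores.
    return blended + tail


scores = {"kissa": 0.42, "on": 0.03, "matto": 0.15}
important = select_important_tokens(scores)

ranked = [(5, 0.90), (3, 0.85), (8, 0.40), (10, 0.10)]
w2v_sim = {5: 0.1, 3: 0.9, 8: 0.9}
reranked = rerank_top_fraction(ranked, w2v_sim, alpha=0.1, lim_most=0.5)
print(important)
print([idx for idx, _ in reranked])
```

With lim_most=0.5 only the first two docs are re-scored, so doc 8 stays in third place even though its embedding similarity is high. This is the effect the lim_most description refers to when it mentions avoiding surprising results near the bottom of the list.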

Development

  • To build the package, go to the source folder and run:
    python -m build
  • To upload the package to PyPI:
    python -m twine upload --repository pypi dist/*
  • Install the new version:
    pip install --no-deps -U tfw2v

Cite

This work is based on the following paper:
Quan Duong, Mika Hämäläinen, and Khalid Alnajjar. (2021). TFW2V: An Enhanced Document Similarity Method for the Morphologically Rich Finnish Language. In Proceedings of the 1st Workshop on Natural Language Processing for Digital Humanities (NLP4DH).
