A library that converts words to vectors using PMI and SVD

SVD2vec

SVD2vec is a Python library for representing the words of documents as vectors. The vectors are built using PMI (Pointwise Mutual Information) and SVD (Singular Value Decomposition).

This library implements the recommendations from "Improving Distributional Similarity with Lessons Learned from Word Embeddings" (Omer Levy, Yoav Goldberg, and Ido Dagan). The paper suggests that traditional methods such as PMI and SVD can perform as well as word2vec when the same hyperparameters are applied to them.
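
To make the PMI and SVD pipeline concrete, here is a minimal NumPy sketch of the general technique. It is an illustration only, not svd2vec's implementation: the function name `ppmi_svd_vectors` and its parameters are hypothetical, it keeps only positive PMI values (PPMI), and it leaves out the hyperparameters discussed in the paper.

```python
import numpy as np


def ppmi_svd_vectors(documents, window=2, dim=2):
    """Toy PPMI + SVD embedding (hypothetical helper, not part of svd2vec).

    `documents` is a list of token lists.
    """
    vocab = sorted({word for doc in documents for word in doc})
    index = {word: i for i, word in enumerate(vocab)}

    # Count symmetric word/context co-occurrences inside the window.
    counts = np.zeros((len(vocab), len(vocab)))
    for doc in documents:
        for i, word in enumerate(doc):
            lo, hi = max(0, i - window), min(len(doc), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[index[word], index[doc[j]]] += 1

    total = counts.sum()
    p_word = counts.sum(axis=1, keepdims=True) / total
    p_context = counts.sum(axis=0, keepdims=True) / total
    p_joint = counts / total

    # PMI(w, c) = log(P(w, c) / (P(w) * P(c))); keep positive values only.
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(p_joint / (p_word * p_context))
        ppmi = np.where(np.isfinite(pmi) & (pmi > 0), pmi, 0.0)

    # Truncated SVD: keep the `dim` largest singular values.
    u, s, _ = np.linalg.svd(ppmi, full_matrices=False)
    return vocab, u[:, :dim] * s[:dim]


vocab, vectors = ppmi_svd_vectors([["a", "b", "c", "a", "b", "c"]], window=1)
print(vocab, vectors.shape)
```

Each row of the returned matrix is the vector for the word at the same position in `vocab`. A dense co-occurrence matrix is used here for brevity; for a corpus the size of text8, a sparse matrix would be the more realistic choice.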

Documentation can be found at https://valentinp72.github.io/svd2vec/index.html

Installation

pip install svd2vec

Example

wget http://mattmahoney.net/dc/text8.zip -O text8.gz
gzip -d text8.gz -f
# Building
>>> from svd2vec import svd2vec
>>> documents = [open("text8", "r").read().split(" ")]
>>> svd = svd2vec(documents, window=2, min_count=100)
# I/O
>>> svd.save("svd.bin")
>>> svd = svd2vec.load("svd.bin")
# Similarities
>>> svd.similarity("bad", "good")
# 0.4156516999158368
>>> svd.similarity("monday", "friday")
# 0.839529117681973
# Most similar words
>>> svd.most_similar(positive=["january"], topn=2)
# [('february', 0.6854849518368631), ('october', 0.6653385092683669)]
>>> svd.most_similar(positive=['moscow', 'france'], negative=['paris'], topn=4)
# [('russia', 0.6221746629754187), ('ussr', 0.6024809889985986), ('soviet', 0.5794180517326273), ('bolsheviks', 0.5365123080505297)]
# Analogies
>>> svd.analogy("paris", "france", "berlin")
# [('germany', 0.6977716641680641), ...]
>>> svd.analogy("road", "cars", "rail")
# [('trains', 0.7532519174901262), ...]
>>> svd.analogy("cow", "cows", "pig")
# [('pigs', 0.6944101149919422), ...]
>>> svd.analogy("man", "men", "woman")
# [('women', 0.7471792753875327), ...]
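
The similarity and analogy queries above can be understood through cosine similarity and vector arithmetic. The sketch below shows one common formulation (additive analogy) on hand-made toy vectors. It is an assumption for illustration, not necessarily the formula svd2vec uses internally; the helpers `cosine` and `analogy` and the `toy` dictionary are hypothetical.

```python
import numpy as np


def cosine(u, v):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


def analogy(vectors, a, b, c, topn=1):
    """Answer "a is to b as c is to ?" with additive vector arithmetic.

    `vectors` is a hypothetical dict mapping each word to a NumPy vector.
    """
    target = vectors[b] - vectors[a] + vectors[c]
    scores = [
        (word, cosine(target, vec))
        for word, vec in vectors.items()
        if word not in (a, b, c)  # the query words are never returned
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


# Hand-made toy vectors: axis 0 ~ "French", axis 1 ~ "German", axis 2 ~ "country".
toy = {
    "paris": np.array([1.0, 0.0, 0.0]),
    "france": np.array([1.0, 0.0, 1.0]),
    "berlin": np.array([0.0, 1.0, 0.0]),
    "germany": np.array([0.0, 1.0, 1.0]),
    "italy": np.array([0.0, 0.0, 1.0]),
}

print(analogy(toy, "paris", "france", "berlin"))  # "germany" ranks first
```

The argument order mirrors the `svd.analogy("paris", "france", "berlin")` call above: the offset from the first word to the second is applied to the third.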

Using Gensim, you can load an svd2vec model through its word2vec representation:

>>> from gensim.models.keyedvectors import Word2VecKeyedVectors
>>> svd.save_word2vec_format("svd_word2vec_format.txt")
>>> keyed_vector = Word2VecKeyedVectors.load_word2vec_format("svd_word2vec_format.txt")
>>> keyed_vector.similarity("good", "bad")
# 0.54922897

Improving Distributional Similarity with Lessons Learned from Word Embeddings
Omer Levy, Yoav Goldberg, and Ido Dagan
Transactions of the Association for Computational Linguistics 2015 Vol. 3, 211-225
