A library that converts words to vectors using PMI and SVD
SVD2vec
SVD2vec is a Python library for representing the words of documents as vectors. Vectors are built using PMI (Pointwise Mutual Information) and SVD (Singular Value Decomposition).
This library implements recommendations from "Improving Distributional Similarity with Lessons Learned from Word Embeddings" (Omer Levy, Yoav Goldberg, and Ido Dagan). The paper suggests that traditional methods like PMI and SVD can perform as well as word2vec when the same hyperparameters are applied.
Documentation can be found at https://valentinp72.github.io/svd2vec/index.html
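To make the PMI and SVD steps concrete, here is a minimal NumPy sketch of the general technique on a toy corpus. It is not svd2vec's actual code: the function name `ppmi_svd_vectors`, its `size` parameter, the use of positive PMI, and the square-root weighting of the singular values are all assumptions chosen for illustration.

```python
import numpy as np

def ppmi_svd_vectors(documents, window=2, size=2):
    """Toy PPMI + truncated SVD embedding (hypothetical helper, not svd2vec's API)."""
    vocab = sorted({w for doc in documents for w in doc})
    index = {w: i for i, w in enumerate(vocab)}
    counts = np.zeros((len(vocab), len(vocab)))
    # Count symmetric word/context co-occurrences inside the window.
    for doc in documents:
        for i, word in enumerate(doc):
            lo, hi = max(0, i - window), min(len(doc), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[index[word], index[doc[j]]] += 1
    total = counts.sum()
    p_w = counts.sum(axis=1, keepdims=True) / total  # word marginals
    p_c = counts.sum(axis=0, keepdims=True) / total  # context marginals
    # PMI(w, c) = log(P(w, c) / (P(w) * P(c))); unseen pairs give -inf or nan.
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(counts / total / (p_w * p_c))
    # Positive PMI: clip negatives to zero and zero out non-finite entries.
    ppmi = np.where(np.isfinite(pmi), np.maximum(pmi, 0.0), 0.0)
    # Keep the top `size` singular directions (assumed sqrt weighting).
    u, s, _ = np.linalg.svd(ppmi, full_matrices=False)
    vectors = u[:, :size] * np.sqrt(s[:size])
    return vocab, vectors

vocab, vectors = ppmi_svd_vectors([["a", "b", "c", "a", "b"]], window=1, size=2)
```

Each row of `vectors` is the embedding of the word at the same position in `vocab`. A real corpus would need sparse matrices and a truncated SVD solver rather than the dense decomposition used here.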
Installation
pip install svd2vec
Example
wget http://mattmahoney.net/dc/text8.zip -O text8.gz
gzip -d text8.gz -f
# Building
>>> from svd2vec import svd2vec
>>> documents = [open("text8", "r").read().split(" ")]
>>> svd = svd2vec(documents, window=2, min_count=100)
# I/O
>>> svd.save("svd.bin")
>>> svd = svd2vec.load("svd.bin")
# Similarities
>>> svd.similarity("bad", "good")
# 0.4156516999158368
>>> svd.similarity("monday", "friday")
# 0.839529117681973
# Most similar words
>>> svd.most_similar(positive=["january"], topn=2)
# [('february', 0.6854849518368631), ('october', 0.6653385092683669)]
>>> svd.most_similar(positive=['moscow', 'france'], negative=['paris'], topn=4)
# [('russia', 0.6221746629754187), ('ussr', 0.6024809889985986), ('soviet', 0.5794180517326273), ('bolsheviks', 0.5365123080505297)]
# Analogies
>>> svd.analogy("paris", "france", "berlin")
# [('germany', 0.6977716641680641), ...]
>>> svd.analogy("road", "cars", "rail")
# [('trains', 0.7532519174901262), ...]
>>> svd.analogy("cow", "cows", "pig")
# [('pigs', 0.6944101149919422), ...]
>>> svd.analogy("man", "men", "woman")
# [('women', 0.7471792753875327), ...]
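The analogy results above are consistent with the common vector-offset approach, where "a is to b as c is to d" is answered by ranking words by cosine similarity to b - a + c. The sketch below illustrates that idea on hand-made toy vectors. Whether svd2vec computes analogies exactly this way is an assumption, and `analogy_sketch` and the `toy` vectors are hypothetical.

```python
import numpy as np

def analogy_sketch(vectors, a, b, c, topn=1):
    """Rank candidates d for 'a is to b as c is to d' (hypothetical helper)."""
    # Normalize so that dot products are cosine similarities.
    unit = {w: v / np.linalg.norm(v) for w, v in vectors.items()}
    target = unit[b] - unit[a] + unit[c]
    target /= np.linalg.norm(target)
    # Score every word except the three query words.
    scores = [(w, float(v @ target)) for w, v in unit.items() if w not in (a, b, c)]
    return sorted(scores, key=lambda s: s[1], reverse=True)[:topn]

# Hand-made toy vectors (not real embeddings).
toy = {
    "paris": np.array([1.0, 0.0, 0.0]),
    "france": np.array([1.0, 1.0, 0.0]),
    "berlin": np.array([0.0, 0.0, 1.0]),
    "germany": np.array([0.0, 1.0, 1.0]),
    "monday": np.array([1.0, 0.0, 1.0]),
}
result = analogy_sketch(toy, "paris", "france", "berlin")
```

With these toy vectors the top-ranked word is "germany", mirroring the first analogy above.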
Using Gensim, you can load an svd2vec model through its word2vec representation:
>>> from gensim.models.keyedvectors import Word2VecKeyedVectors
>>> svd.save_word2vec_format("svd_word2vec_format.txt")
>>> keyed_vector = Word2VecKeyedVectors.load_word2vec_format("svd_word2vec_format.txt")
>>> keyed_vector.similarity("good", "bad")
# 0.54922897
Improving Distributional Similarity with Lessons Learned from Word Embeddings
Omer Levy, Yoav Goldberg, and Ido Dagan
Transactions of the Association for Computational Linguistics 2015 Vol. 3, 211-225