Interface for using the canonical C GloVe embedding implementation in Python
Project description
glovpy
Package for interfacing with Stanford's C GloVe implementation from Python.
Installation
Install glovpy from PyPI:
pip install glovpy
Additionally, the first time you import glovpy, it will build GloVe from source on your system.
Requirements
We highly recommend that you use a Unix-based system, preferably a variant of Debian.
The package needs git, make and a C compiler (clang or gcc) installed.
Otherwise, the implementation is as bare-bones as it gets: only the standard library and Gensim are used (Gensim only for producing KeyedVectors).
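As an optional sanity check before the first import triggers the build, you can verify that these tools are on your PATH. The sketch below uses only the Python standard library and is not part of glovpy itself.

```python
import shutil

# Optional check: confirm the tools needed to build the C GloVe
# implementation are available before `import glovpy` triggers the build.
missing = [tool for tool in ("git", "make") if shutil.which(tool) is None]
if missing:
    print("Missing build tools:", ", ".join(missing))

if not any(shutil.which(cc) for cc in ("gcc", "clang", "cc")):
    print("No C compiler (gcc/clang/cc) found on PATH")
```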
Example Usage
Here's a quick example of how to train GloVe on 20newsgroups using Gensim's tokenizer.
from gensim.utils import tokenize
from sklearn.datasets import fetch_20newsgroups
from glovpy import GloVe
texts = fetch_20newsgroups().data
corpus = [list(tokenize(text, lowercase=True, deacc=True)) for text in texts]
model = GloVe(vector_size=25)
model.train(corpus)
for word, similarity in model.wv.most_similar("god"):
    print(f"{word}, sim: {similarity}")
| word | similarity |
|---|---|
| existence | 0.9156746864 |
| jesus | 0.8746870756 |
| lord | 0.8555182219 |
| christ | 0.8517201543 |
| bless | 0.8298447728 |
| faith | 0.8237065077 |
| saying | 0.8204566240 |
| therefore | 0.8177698255 |
| desires | 0.8094088435 |
| telling | 0.8083973527 |
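Since model.wv is a regular Gensim KeyedVectors object (see Attributes below), you can query it with the usual Gensim API. A minimal sketch, continuing the example above:

```python
# Continues the example above; model.wv is a Gensim KeyedVectors object.
vector = model.wv["god"]    # raw embedding: a NumPy array of length vector_size
print(vector.shape)         # (25,) with vector_size=25 as above

# Cosine similarity between two tokens from the vocabulary.
print(model.wv.similarity("jesus", "christ"))
```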
API Reference
class glovpy.GloVe(vector_size, window_size, symmetric, distance_weighting, alpha, min_count, iter, initial_learning_rate, threads, memory)
Wrapper around the original C implementation of GloVe.
Parameters
| Parameter | Type | Description | Default |
|---|---|---|---|
| vector_size | int | Number of dimensions the trained word vectors should have. | 50 |
| window_size | int | Number of context words to the left (and to the right, if symmetric is True). | 15 |
| alpha | float | Exponent of the weighting function applied to co-occurrence counts. | 0.75 |
| symmetric | bool | If True, both past and future words are used as context; otherwise only past words are used. | True |
| distance_weighting | bool | If True (default), each co-occurrence count is weighted by the inverse of the distance between the target word and the context word; if False, counts are not weighted by distance. | True |
| min_count | int | Minimum number of times a token has to appear to be kept in the vocabulary. | 5 |
| iter | int | Number of training iterations. | 25 |
| initial_learning_rate | float | Initial learning rate for training. | 0.05 |
| threads | int | Number of threads to use for training. | 8 |
| memory | float | Soft limit for memory consumption, in GB (based on a simple heuristic, so not exact). | 4.0 |
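For illustration, here is a hypothetical constructor call with every parameter from the table passed explicitly as a keyword argument (the quick-start example above already passes vector_size this way). The values are arbitrary examples, not recommendations.

```python
from glovpy import GloVe

# Arbitrary example values; each keyword corresponds to a parameter above.
model = GloVe(
    vector_size=100,
    window_size=10,
    symmetric=True,
    distance_weighting=True,
    alpha=0.75,
    min_count=10,
    iter=50,
    initial_learning_rate=0.05,
    threads=4,
    memory=2.0,
)
```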
Attributes
| Name | Type | Description |
|---|---|---|
| wv | KeyedVectors | Token embeddings in the form of Gensim keyed vectors. |
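Because wv is a standard Gensim KeyedVectors object, the trained embeddings can be persisted and reloaded with Gensim's own helpers, independently of glovpy. A minimal sketch, assuming model is a trained GloVe instance:

```python
from gensim.models import KeyedVectors

# Save the trained embeddings to disk.
model.wv.save("glove.kv")

# Later, load them back without needing glovpy at all.
wv = KeyedVectors.load("glove.kv")
print(wv.most_similar("god"))
```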
Methods
glovpy.GloVe.train(tokens)
Train the model on a stream of texts.
| Parameter | Type | Description |
|---|---|---|
| tokens | Iterable[list[str]] | Stream of documents in the form of lists of tokens. The stream has to be reusable, as the model needs at least two passes over the corpus. |
glovpy.utils.reusable(gen_func)
Function decorator that turns a generator function into a reusable iterable: each time the returned object is iterated over, the generator function is called again, so multiple passes can be made over the same data.
Parameters
| Parameter | Type | Description |
|---|---|---|
| gen_func | Callable | Generator function that you want to be reusable. |
Returns
| Returns | Type | Description |
|---|---|---|
| _multigen | Callable | Iterable class wrapping the generator function. |
Example usage
Here's how to stream a very long file line by line in a reusable manner.
from gensim.utils import tokenize
from glovpy.utils import reusable
from glovpy import GloVe
@reusable
def stream_lines():
    with open("very_long_text_file.txt") as f:
        for line in f:
            yield list(tokenize(line))

model = GloVe()
model.train(stream_lines())
File details
Details for the file glovpy-0.2.0.tar.gz.
File metadata
- Download URL: glovpy-0.2.0.tar.gz
- Upload date:
- Size: 6.2 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/1.7.0 CPython/3.9.13 Linux/5.15.0-88-generic
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 5a03db0711182eb071efdf0336ba9a482918036c876708e393eed7570495a339 |
| MD5 | 213dd9b1ea50b9fbf3599e823492baf1 |
| BLAKE2b-256 | fbb321a976221b035616cc9798805c541f0d4b164d567903fa21fb6dc3b91bb5 |
File details
Details for the file glovpy-0.2.0-py3-none-any.whl.
File metadata
- Download URL: glovpy-0.2.0-py3-none-any.whl
- Upload date:
- Size: 7.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/1.7.0 CPython/3.9.13 Linux/5.15.0-88-generic
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | b2759cb176e7f3170af01ed6de016c7f22c9b07f7b1b95749a26c89a0a50b0cd |
| MD5 | 3d4e4165b036fe7a5555c11295ec0fce |
| BLAKE2b-256 | bdc86f0863ee19248ac158fe09eccc27e4f75d95aa7d5e67ec7898515e7adf27 |