
Interface for using the canonical C GloVe embedding implementation in Python

Project description

glovpy

Package for interfacing Stanford's C GloVe implementation from Python.

Installation

Install glovpy from PyPI:

pip install glovpy

Additionally, the first time you import glovpy, it will build GloVe from source on your system.

Requirements

We highly recommend that you use a Unix-based system, preferably a variant of Debian. The package needs git, make and a C compiler (clang or gcc) installed.

Otherwise the implementation is as barebones as it gets: only the standard library and Gensim are used (Gensim only for producing KeyedVectors).

Example Usage

Here's a quick example of how to train GloVe on 20newsgroups using Gensim's tokenizer.

from gensim.utils import tokenize
from sklearn.datasets import fetch_20newsgroups

from glovpy import GloVe

texts = fetch_20newsgroups().data
corpus = [list(tokenize(text, lowercase=True, deacc=True)) for text in texts]

model = GloVe(vector_size=25)
model.train(corpus)

for word, similarity in model.wv.most_similar("god"):
    print(f"{word}, sim: {similarity}")
| word | similarity |
| ---- | ---------- |
| existence | 0.9156746864 |
| jesus | 0.8746870756 |
| lord | 0.8555182219 |
| christ | 0.8517201543 |
| bless | 0.8298447728 |
| faith | 0.8237065077 |
| saying | 0.8204566240 |
| therefore | 0.8177698255 |
| desires | 0.8094088435 |
| telling | 0.8083973527 |

API Reference

class glovpy.GloVe(vector_size, window_size, symmetric, distance_weighting, alpha, min_count, iter, initial_learning_rate, threads, memory)

Wrapper around the original C implementation of GloVe.

Parameters

| Parameter | Type | Description | Default |
| --------- | ---- | ----------- | ------- |
| vector_size | int | Number of dimensions the trained word vectors should have. | 50 |
| window_size | int | Number of context words to the left (and to the right, if symmetric is True). | 15 |
| symmetric | bool | If True, both past and future words are used as context; otherwise only past words. | True |
| distance_weighting | bool | If True (default), weight each cooccurrence count by the inverse of the distance between the target word and the context word; if False, do not weight counts by distance. | True |
| alpha | float | Exponent of the weighting function in the cost function. | 0.75 |
| min_count | int | Minimum number of times a token has to appear to be kept in the vocabulary. | 5 |
| iter | int | Number of training iterations. | 25 |
| initial_learning_rate | float | Initial learning rate for training. | 0.05 |
| threads | int | Number of threads to use for training. | 8 |
| memory | float | Soft limit on memory consumption, in GB (based on a simple heuristic, so not extremely accurate). | 4.0 |
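To make the window-related parameters concrete, here is a small self-contained sketch (plain Python, not glovpy's or GloVe's actual code) of how a cooccurrence count accumulates under window_size, symmetric, and distance_weighting: every pair within the window contributes 1/distance (or 1, without distance weighting) to the count.

```python
from collections import defaultdict

def cooccurrences(tokens, window_size=15, symmetric=True, distance_weighting=True):
    """Accumulate (target, context) cooccurrence weights: each context word
    within `window_size` positions to the left of the target contributes
    1/distance (or 1.0 if distance_weighting is False); with symmetric=True
    the mirrored pair is counted too, so both sides act as context."""
    counts = defaultdict(float)
    for i, target in enumerate(tokens):
        for d in range(1, window_size + 1):
            j = i - d  # position of a past context word
            if j < 0:
                break
            weight = 1.0 / d if distance_weighting else 1.0
            counts[(target, tokens[j])] += weight
            if symmetric:
                counts[(tokens[j], target)] += weight
    return dict(counts)

counts = cooccurrences(["the", "cat", "sat"], window_size=2)
print(counts[("cat", "the")])  # 1.0 — adjacent words, distance 1
print(counts[("sat", "the")])  # 0.5 — distance 2, weighted by 1/2
```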

Attributes

| Name | Type | Description |
| ---- | ---- | ----------- |
| wv | KeyedVectors | Token embeddings in the form of Gensim keyed vectors. |
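The most_similar call in the earlier example ranks words by cosine similarity between their vectors. As a minimal sketch of what that query computes (plain Python with toy 2-d vectors, not Gensim's implementation):

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy embeddings for illustration only.
vectors = {"god": [1.0, 0.0], "jesus": [0.9, 0.1], "car": [0.0, 1.0]}

def most_similar(word, topn=2):
    """Rank all other words by cosine similarity to `word`."""
    query = vectors[word]
    scores = [(w, cosine(query, v)) for w, v in vectors.items() if w != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

print(most_similar("god"))  # "jesus" ranks above "car"
```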

Methods

glovpy.GloVe.train(tokens)

Train the model on a stream of texts.

| Parameter | Type | Description |
| --------- | ---- | ----------- |
| tokens | Iterable[list[str]] | Stream of documents in the form of lists of tokens. The stream has to be reusable, as the model needs at least two passes over the corpus. |
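The reusability requirement is easy to trip over: a list of token lists can be iterated any number of times, but a plain generator is exhausted after a single pass, leaving nothing for the model's second pass. A quick demonstration (plain Python, independent of glovpy):

```python
corpus = [["good", "morning"], ["good", "night"]]

# A list can be iterated repeatedly, so it satisfies train():
assert list(iter(corpus)) == list(iter(corpus))

# A generator, by contrast, is consumed after one pass:
gen = (doc for doc in corpus)
first_pass = list(gen)
second_pass = list(gen)
print(len(first_pass))   # 2
print(len(second_pass))  # 0 — nothing left for a second pass
```

For generator-based streams, wrap the generator function with glovpy.utils.reusable instead.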

glovpy.utils.reusable(gen_func)

Function decorator that wraps your generator function in a reusable iterable, so it can be iterated over multiple times. Use this if you want to feed a generator-based stream to a model that needs more than one pass over the data.

Parameters

| Parameter | Type | Description |
| --------- | ---- | ----------- |
| gen_func | Callable | Generator function that you want to be reusable. |

Returns

| Returns | Type | Description |
| ------- | ---- | ----------- |
| _multigen | Callable | Iterator class wrapping the generator function. |

Example usage

Here's how to stream a very long file line by line in a reusable manner.

from gensim.utils import tokenize
from glovpy.utils import reusable
from glovpy import GloVe

@reusable
def stream_lines():
    with open("very_long_text_file.txt") as f:
        for line in f:
            yield list(tokenize(line))

model = GloVe()
model.train(stream_lines())
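The pattern behind such a decorator can be sketched in a few lines (hypothetical names, not glovpy's actual source): calling the decorated function returns an object whose __iter__ restarts the generator, so every pass gets a fresh one.

```python
def reusable(gen_func):
    """Sketch of a reusable-generator decorator: the decorated function
    returns an iterable that re-invokes the generator function on every
    iteration, instead of a one-shot generator."""
    class _Multigen:  # hypothetical name, modeled on the documented return type
        def __init__(self, *args, **kwargs):
            self.args, self.kwargs = args, kwargs
        def __iter__(self):
            # A fresh generator is created for each pass.
            return gen_func(*self.args, **self.kwargs)
    return _Multigen

@reusable
def stream_numbers(n):
    yield from range(n)

stream = stream_numbers(3)
print(list(stream))  # [0, 1, 2]
print(list(stream))  # [0, 1, 2] — the second pass works too
```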

Project details


Download files

Download the file for your platform.

Source Distribution

glovpy-0.2.0.tar.gz (6.2 kB)

Uploaded Source

Built Distribution


glovpy-0.2.0-py3-none-any.whl (7.7 kB)

Uploaded Python 3

File details

Details for the file glovpy-0.2.0.tar.gz.

File metadata

  • Download URL: glovpy-0.2.0.tar.gz
  • Upload date:
  • Size: 6.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.7.0 CPython/3.9.13 Linux/5.15.0-88-generic

File hashes

Hashes for glovpy-0.2.0.tar.gz
| Algorithm | Hash digest |
| --------- | ----------- |
| SHA256 | 5a03db0711182eb071efdf0336ba9a482918036c876708e393eed7570495a339 |
| MD5 | 213dd9b1ea50b9fbf3599e823492baf1 |
| BLAKE2b-256 | fbb321a976221b035616cc9798805c541f0d4b164d567903fa21fb6dc3b91bb5 |


File details

Details for the file glovpy-0.2.0-py3-none-any.whl.

File metadata

  • Download URL: glovpy-0.2.0-py3-none-any.whl
  • Upload date:
  • Size: 7.7 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.7.0 CPython/3.9.13 Linux/5.15.0-88-generic

File hashes

Hashes for glovpy-0.2.0-py3-none-any.whl
| Algorithm | Hash digest |
| --------- | ----------- |
| SHA256 | b2759cb176e7f3170af01ed6de016c7f22c9b07f7b1b95749a26c89a0a50b0cd |
| MD5 | 3d4e4165b036fe7a5555c11295ec0fce |
| BLAKE2b-256 | bdc86f0863ee19248ac158fe09eccc27e4f75d95aa7d5e67ec7898515e7adf27 |

