glove-python fork for bicleaner-ai

bicleaner-ai-glove

NOTE: this is a fork of glove-python made for bicleaner-ai.

A toy Python implementation of GloVe.

GloVe produces dense vector embeddings of words, where words that occur together are close in the resulting vector space.

While this produces embeddings that are similar to those of word2vec (which has a great Python implementation in gensim), the method is different: GloVe produces embeddings by factorizing the logarithm of the corpus word co-occurrence matrix.
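To make "factorizing the logarithm of the co-occurrence matrix" a bit more concrete, here is a minimal pure-Python sketch of the weighted least-squares objective from the GloVe paper. It is an illustration only, not the Cython code in this package; the function names, the toy data, and the defaults x_max=100 and alpha=0.75 are assumptions taken from the paper.

```python
import math


def glove_weight(x, x_max=100.0, alpha=0.75):
    """Weighting f(x): down-weights rare pairs and caps frequent ones at 1."""
    return (x / x_max) ** alpha if x < x_max else 1.0


def glove_loss(cooccurrences, word_vecs, context_vecs, word_bias, context_bias):
    """Sum of f(X_ij) * (w_i . c_j + b_i + b_j - log X_ij)^2 over observed pairs."""
    total = 0.0
    for (i, j), count in cooccurrences.items():
        prediction = sum(a * b for a, b in zip(word_vecs[i], context_vecs[j]))
        prediction += word_bias[i] + context_bias[j]
        error = prediction - math.log(count)
        total += glove_weight(count) * error ** 2
    return total


# Toy data (made up): two words with 2-dimensional vectors.
cooccurrences = {(0, 1): 4.0, (1, 0): 4.0}
word_vecs = [[1.0, 0.0], [0.0, 1.0]]
context_vecs = [[0.0, 1.0], [1.0, 0.0]]
word_bias = [0.0, 0.0]
context_bias = [0.0, 0.0]

loss = glove_loss(cooccurrences, word_vecs, context_vecs, word_bias, context_bias)
```

Training amounts to adjusting the vectors and biases by gradient descent until this loss is small, so that dot products approximate log co-occurrence counts.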

The code uses asynchronous stochastic gradient descent, and is implemented in Cython. Most likely, it contains a tremendous number of bugs.

Installation

Install from PyPI using pip: pip install bicleaner-ai-glove (the upstream project is published as glove_python).

Note for OSX users: due to its use of OpenMP, glove-python does not compile under Clang. To install it, you will need a reasonably recent version of gcc (from Homebrew for instance). This should be picked up by setup.py; if it is not, please open an issue.

Building with the default Python distribution included in OSX is also not supported; please try the version from Homebrew or Anaconda.

Usage

Producing the embeddings is a two-step process: creating a co-occurrence matrix from the corpus, and then using it to produce the embeddings. The Corpus class helps in constructing a corpus from an iterable of tokens; the Glove class trains the embeddings (with a sklearn-esque API).
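The first of those two steps can be sketched in plain Python as follows. This is a simplified stand-in for what a corpus class has to do, not the package's implementation: build_cooccurrence is a hypothetical name, and weighting each pair by 1/distance and storing each unordered pair only once are assumptions made for the illustration.

```python
from collections import defaultdict


def build_cooccurrence(sentences, window=2):
    """Return (dictionary, counts): word -> id and (id_a, id_b) -> weighted count."""
    dictionary = {}
    counts = defaultdict(float)
    for tokens in sentences:
        # Assign ids in order of first appearance.
        ids = [dictionary.setdefault(tok, len(dictionary)) for tok in tokens]
        for pos, left in enumerate(ids):
            for offset in range(1, window + 1):
                if pos + offset >= len(ids):
                    break
                right = ids[pos + offset]
                if left == right:
                    continue  # skip a word co-occurring with itself
                pair = (min(left, right), max(left, right))
                counts[pair] += 1.0 / offset  # closer words count more
    return dictionary, dict(counts)


sentences = [["the", "cat", "sat"], ["the", "cat", "ran"]]
dictionary, counts = build_cooccurrence(sentences, window=2)
```

The resulting counts play the role of the sparse co-occurrence matrix that the second step factorizes.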

There is also support for rudimentary paragraph vectors. A paragraph vector (in this case) is an embedding of a paragraph (a multi-word piece of text) in the word vector space in such a way that the paragraph representation is close to the words it contains, adjusted for the frequency of words in the corpus (in a manner similar to tf-idf weighting). These can be obtained after having trained word embeddings by calling the transform_paragraph method on the trained model.
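A small wrapper around that call might look like the sketch below. It assumes the trained model exposes transform_paragraph(paragraph, epochs, ignore_missing) as the upstream glove-python Glove class does; paragraph_vector is a hypothetical helper name, and splitting on whitespace is a simplification.

```python
def paragraph_vector(model, text, epochs=50):
    """Embed a piece of text using an already-trained model.

    Assumes `model.transform_paragraph` accepts a list of tokens plus the
    `epochs` and `ignore_missing` keyword arguments (upstream signature).
    """
    tokens = text.lower().split()
    # ignore_missing=True skips out-of-vocabulary tokens instead of raising.
    return model.transform_paragraph(tokens, epochs=epochs, ignore_missing=True)
```

The returned vector lives in the same space as the word vectors, so it can be compared with them directly.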

Examples

example.py has some example code for running simple training scripts: ipython -i -- examples/example.py -c my_corpus.txt -t 10 should process your corpus, run 10 training epochs of GloVe, and drop you into an ipython shell where glove.most_similar('physics') should produce a list of similar words.
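For orientation, a stripped-down training script could look roughly like this. It is a sketch under the assumption that this fork keeps the upstream glove-python API (a glove module exposing Corpus and Glove with fit, add_dictionary, and most_similar); read_corpus and train are hypothetical helper names, and the hyperparameter values are only illustrative.

```python
def read_corpus(lines):
    """Split raw text lines into lowercase tokens, skipping blank lines."""
    return [line.lower().split() for line in lines if line.strip()]


def train(sentences, epochs=10, threads=1):
    """Build the co-occurrence matrix, then fit the embeddings."""
    # Imported lazily: assumes the compiled package is importable as `glove`.
    from glove import Corpus, Glove

    corpus = Corpus()
    corpus.fit(sentences, window=10)          # step 1: co-occurrence matrix

    model = Glove(no_components=100, learning_rate=0.05)
    model.fit(corpus.matrix, epochs=epochs, no_threads=threads, verbose=True)
    model.add_dictionary(corpus.dictionary)   # needed before word lookups
    return model


# Usage (not run here; my_corpus.txt is the placeholder name from the text):
# with open("my_corpus.txt") as handle:
#     model = train(read_corpus(handle), epochs=10)
# model.most_similar("physics", number=5)
```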

If you want to process a Wikipedia corpus, you can pass a Wikipedia dump file to the example.py script using the -w flag. Running make all-wiki should download a small Wikipedia dump file, process it, and train the embeddings. Building the co-occurrence matrix will take some time; training the vectors can be sped up by increasing the training parallelism to match the number of physical CPU cores available.

Running this on my machine yields roughly the following results:

In [1]: glove.most_similar('physics')
Out[1]:
[('biology', 0.89425889335342257),
 ('chemistry', 0.88913708236100086),
 ('quantum', 0.88859617025616333),
 ('mechanics', 0.88821824562025431)]

In [4]: glove.most_similar('north')
Out[4]:
[('west', 0.99047203572917908),
 ('south', 0.98655786905501008),
 ('east', 0.97914140138065575),
 ('coast', 0.97680427897282185)]

In [6]: glove.most_similar('queen')
Out[6]:
[('anne', 0.88284931171714842),
 ('mary', 0.87615260138308615),
 ('elizabeth', 0.87362497374226267),
 ('prince', 0.87011034923161801)]

In [19]: glove.most_similar('car')
Out[19]:
[('race', 0.89549347066796814),
 ('driver', 0.89350343749207217),
 ('cars', 0.83601334715106568),
 ('racing', 0.83157724991920212)]
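The scores in these lists look like cosine similarities between word vectors. Assuming that is how the ranking works, a minimal pure-Python version of such a query could be sketched as below; the function names and the toy vectors are made up for illustration and are not the package's code.

```python
import math


def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def most_similar(word, vectors, number=4):
    """Rank every other word by cosine similarity to `word`."""
    query = vectors[word]
    scored = [(other, cosine(query, vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:number]


# Toy 2-dimensional vectors (made up for the example).
vectors = {"north": [1.0, 0.1], "south": [0.9, 0.2], "car": [0.0, 1.0]}
neighbours = most_similar("north", vectors, number=2)
```

With real embeddings the same ranking is applied over the whole vocabulary, which is why related words such as compass directions end up with scores close to 1.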

Development

Pull requests are welcome.

When making changes to the .pyx extension files, you'll need to run python setup.py cythonize to regenerate the extension .c and .cpp files before running pip install -e . (an editable install of the current directory).
