Loaders and savers for different implementations of word embedding.
.. -*- coding: utf-8; -*-
Loaders and savers for different implementations of `word embedding <https://en.wikipedia.org/wiki/Word_embedding>`_. Writing a separate loader for every pretrained word embedding file format is cumbersome, so this project provides a simple, uniform interface for loading pretrained word embedding files in different formats.

.. code:: python

    from word_embedding_loader import WordEmbedding

    # The file format is determined automatically from the content
    wv = WordEmbedding.load('path/to/embedding.bin')

    # The interface is deliberately minimal: ``vocab`` maps each word to its
    # row index in ``vectors`` (keys are undecoded bytes)
    print(wv.vectors[wv.vocab[b'is']])

    # Modify the embedding and save it in any supported format
    wv.save('path/to/save.txt', 'word2vec', binary=False)

This project currently supports the following formats:

* `GloVe <https://nlp.stanford.edu/projects/glove/>`_, Global Vectors for Word Representation, by Jeffrey Pennington, Richard Socher, and Christopher D. Manning from the Stanford NLP group.
* `word2vec <https://code.google.com/archive/p/word2vec/>`_, by Mikolov.

  - text (create with the ``-binary 0`` option, which is the default)
  - binary (create with the ``-binary 1`` option)

* `gensim <https://radimrehurek.com/gensim/>`_'s ``models.word2vec`` module (coming)
* original HDFS format: a performance-centric option for loading and saving word embeddings (coming)

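
The two text formats above differ mainly in their first line: a word2vec text file starts with a header line holding the vocabulary size and the vector dimension, while a GloVe text file starts directly with the first word and its vector. The sketch below shows one way this difference could be detected from the content. It is only an illustration under that assumption, not this project's actual detection logic, and the name ``guess_text_format`` is hypothetical.

```python
def guess_text_format(first_line):
    """Guess 'word2vec' or 'glove' from the first line (bytes) of a text file.

    Heuristic: a word2vec text file starts with a header of exactly two
    integers, "<vocab_size> <dimension>". Anything else is assumed to be GloVe.
    """
    fields = first_line.split()
    if len(fields) == 2:
        try:
            int(fields[0])
            int(fields[1])
            return 'word2vec'
        except ValueError:
            # At least one field is not an integer, so this is not a header
            pass
    return 'glove'


print(guess_text_format(b'3 4\n'))                    # word2vec-style header
print(guess_text_format(b'the 0.1 -0.2 0.3 0.4\n'))   # GloVe-style first row
```

The heuristic is deliberately simple: a one-dimensional GloVe file whose first word is itself an integer would be misclassified, so a robust detector would need to inspect more than one line.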
Sometimes you want to combine an external program with a word embedding file of your own choice. This project also provides a simple executable that converts one word embedding format to another.

.. code:: bash

    # The input format is determined automatically from the content
    word-embedding-loader convert -t glove test/word_embedding_loader/word2vec.bin test.bin

    # Get help for a command or subcommand
    word-embedding-loader --help
    word-embedding-loader convert --help

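
As a rough illustration of what a conversion between the two text formats involves, the sketch below turns GloVe text into word2vec text by prepending the ``<vocab_size> <dimension>`` header line. It works on in-memory bytes in plain Python and is not how this project's ``convert`` command is implemented; ``glove_to_word2vec_text`` is a hypothetical helper.

```python
def glove_to_word2vec_text(glove_bytes):
    """Convert GloVe text content (bytes) to word2vec text content (bytes)."""
    # Keep non-empty rows only; each row is "<word> <v1> <v2> ... <vd>"
    rows = [line.strip() for line in glove_bytes.split(b'\n') if line.strip()]
    if not rows:
        raise ValueError('no vectors found')
    # The dimension is the number of fields after the word in the first row
    dim = len(rows[0].split()) - 1
    for row in rows:
        if len(row.split()) - 1 != dim:
            raise ValueError('inconsistent vector dimension')
    # word2vec text files start with "<vocab_size> <dimension>"
    header = ('%d %d' % (len(rows), dim)).encode('ascii')
    return b'\n'.join([header] + rows) + b'\n'


glove = b'the 0.1 0.2\ncat 0.3 0.4\n'
print(glove_to_word2vec_text(glove).decode('utf-8'))
```

The words are kept as bytes throughout, so the sketch never has to know the encoding of the vocabulary.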
Issues with encoding
--------------------
This project does not decode the vocabulary. It is up to users to determine the encoding and decode the bytes.

.. code:: python

    # Keys of ``wv.vocab`` are bytes; decode them with the encoding of the source file
    decoded_vocab = {k.decode('latin-1'): v for k, v in wv.vocab.items()}


.. note::

    The pretrained word2vec vectors are encoded in latin-1. The pretrained
    GloVe vectors are encoded in utf-8.

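
Because the right encoding depends on where the file came from, it can help to see how the two encodings behave on the same bytes. The sketch below (Python 3) uses a hand-made ``vocab`` dictionary as a stand-in for ``wv.vocab``; both that dictionary and the helper ``decode_vocab`` are assumptions made for illustration and are not part of this project.

```python
def decode_vocab(vocab, encoding):
    """Return a copy of ``vocab`` with its bytes keys decoded to text."""
    return {word.decode(encoding): index for word, index in vocab.items()}


# Stand-in for ``wv.vocab``: bytes keys mapped to row indices
vocab = {b'is': 0, 'caf\u00e9'.encode('utf-8'): 1}

# Decoding with the correct encoding recovers the original word
print(decode_vocab(vocab, 'utf-8'))

# latin-1 maps every byte to a character, so it never fails, but it
# silently yields wrong characters when the bytes were actually utf-8
print(decode_vocab(vocab, 'latin-1'))

# The reverse mistake is louder: latin-1 bytes are often invalid utf-8
try:
    decode_vocab({'caf\u00e9'.encode('latin-1'): 0}, 'utf-8')
except UnicodeDecodeError:
    print('these latin-1 bytes are not valid utf-8')
```

A decode that succeeds is therefore not proof that the encoding was right, which is why the choice is left to the user.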
Development
============
This project uses Cython to build some modules, so you need Cython for development.
```bash
pip install -r requirements.txt
```
If the environment variable ``DEVELOP_WE`` is set, the build will try to rebuild the ``.pyx`` modules.
```bash
DEVELOP_WE=1 python setup.py test
```