Context Encoders (ConEc)

This code lets you train and evaluate Context Encoders (ConEc), an extension of word2vec. ConEc learns word embeddings from large corpora, can create out-of-vocabulary embeddings on the spot, and can distinguish between multiple meanings of a word based on its local contexts. For further details on the model and the experiments, please refer to the paper; and of course, if any of this code was helpful for your research, please consider citing it:

    @inproceedings{horn2017conecRepL4NLP,
      author       = {Horn, Franziska},
      title        = {Context encoders as a simple but powerful extension of word2vec},
      booktitle    = {Proceedings of the 2nd Workshop on Representation Learning for NLP},
      year         = {2017},
      organization = {Association for Computational Linguistics},
      pages        = {10--14}
    }

The code is intended for research purposes. It should run with both Python 2.7 and Python 3, but there are no guarantees (please open an issue if you find a bug)!

installation

Either download the code from here and include the conec folder in your $PYTHONPATH, or install the library components only via pip:

$ pip install conec

conec library components

dependencies: numpy, scipy

  • word2vec.py: code to train a standard word2vec model, adapted from the corresponding gensim implementation.
  • context2vec.py: code to build a sparse context matrix from a large collection of texts; this context matrix can then be multiplied with the corresponding word2vec embeddings to give the context encoder embeddings:
# imports, assuming the conec package was installed via pip
# (Text8Corpus is part of the word2vec module)
import numpy as np
from conec import word2vec, context2vec
from conec.word2vec import Text8Corpus

# get the text for training
sentences = Text8Corpus('data/text8')
# train the word2vec model
w2v_model = word2vec.Word2Vec(sentences, mtype='cbow', hs=0, neg=13, vector_size=200, seed=3)
# build the sparse global context matrix for the text
context_model = context2vec.ContextModel(sentences, min_count=w2v_model.min_count,
                                         window=w2v_model.window, wordlist=w2v_model.wv.index2word)
context_mat = context_model.get_context_matrix(fill_diag=False, norm='max')
# multiply the context matrix with the (length normalized) word2vec embeddings
# to get the context encoder (ConEc) embeddings
conec_emb = context_mat.dot(w2v_model.wv.vectors_norm)
# renormalize so the word embeddings have unit length again
conec_emb /= np.linalg.norm(conec_emb, axis=1, keepdims=True)
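The same multiplication is what makes out-of-vocabulary embeddings possible: count which known words occur in the contexts of the new word, normalize this count vector, and multiply it with the word2vec embeddings. Below is a minimal numpy sketch of that idea; the toy vocabulary, counts, and dimensions are made up for illustration, and this is not the library's API:

```python
import numpy as np

# toy setup: a vocabulary of 5 known words with random 4-dim word2vec embeddings
vocab = ["the", "cat", "sat", "on", "mat"]
rng = np.random.default_rng(3)
w2v_emb = rng.normal(size=(len(vocab), 4))
# length-normalize the embeddings, as done before the ConEc multiplication
w2v_emb /= np.linalg.norm(w2v_emb, axis=1, keepdims=True)

# suppose an out-of-vocabulary word was observed in contexts containing
# "the" twice and "mat" once -> its context count vector over the vocabulary
oov_context_counts = np.array([2., 0., 0., 0., 1.])
# max-normalize the counts (mirroring norm='max' in get_context_matrix)
oov_context_vec = oov_context_counts / oov_context_counts.max()

# the OOV embedding is the context vector multiplied with the embeddings,
# i.e. a weighted average of the embeddings of the word's context words
oov_emb = oov_context_vec.dot(w2v_emb)  # shape: (4,)
# renormalize to unit length
oov_emb /= np.linalg.norm(oov_emb)
```

The same trick yields context-sensitive embeddings for ambiguous in-vocabulary words: restrict the count vector to a single occurrence's local context instead of the global statistics.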

examples

additional dependencies: sklearn

test_analogy.py and test_ner.py contain the code to replicate the analogy and named entity recognition (NER) experiments discussed in the aforementioned paper.

To run the analogy experiment, the text8 or 1-billion corpus as well as the analogy questions are assumed to be in a data directory.
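The analogy task itself boils down to vector arithmetic on the embeddings: for a question "a is to b as c is to ?", the answer is the word whose embedding is most cosine-similar to b - a + c, excluding the query words. A generic sketch of this evaluation step, with a hand-crafted toy vocabulary (this is not the experiment script's actual code):

```python
import numpy as np

def analogy(emb, vocab, a, b, c):
    """Answer 'a is to b as c is to ?' via cosine similarity to b - a + c."""
    idx = {w: i for i, w in enumerate(vocab)}
    # unit-normalize the rows so the dot product equals cosine similarity
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    target = emb[idx[b]] - emb[idx[a]] + emb[idx[c]]
    sims = emb.dot(target)
    for w in (a, b, c):  # exclude the query words themselves
        sims[idx[w]] = -np.inf
    return vocab[int(np.argmax(sims))]

# toy 2-d embeddings where the analogy holds by construction
vocab = ["man", "woman", "king", "queen", "apple"]
emb = np.array([[1., 0.], [1., 1.], [2., 0.], [2., 1.], [0., 1.]])
print(analogy(emb, vocab, "man", "woman", "king"))  # -> queen
```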

To run the named entity recognition experiment, it is assumed that the corresponding training and test files are located in the data/conll2003 directory.

If you have any questions, please don't hesitate to send me an email. And of course, if you find any bugs or want to contribute other improvements, pull requests are very welcome!
