Context Encoders (ConEc) as an extension of word2vec

Project description

With this code you can train and evaluate Context Encoders (ConEc), an extension of word2vec, which can learn word embeddings from large corpora and create out-of-vocabulary embeddings on the spot as well as distinguish between multiple meanings of words based on their local contexts. For further details on the model and experiments please refer to the paper (and of course please consider citing it ;-)).
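The idea of creating an embedding "on the spot" from a word's local context can be illustrated with a minimal sketch. It is not the library's implementation: the function `local_context_embedding`, the toy vocabulary, and the hand-picked 2-dimensional embeddings are hypothetical, and the sketch assumes that a local context embedding is the normalized sum of the embeddings of the known words surrounding each occurrence of the target word.

```python
import numpy as np

def local_context_embedding(tokens, target, vocab_index, emb, window=2):
    """Sum the embeddings of known words within `window` tokens of each
    occurrence of `target` and scale the result to unit length."""
    context_counts = np.zeros(len(vocab_index))
    for pos, tok in enumerate(tokens):
        if tok != target:
            continue
        lo, hi = max(0, pos - window), min(len(tokens), pos + window + 1)
        for neighbor in tokens[lo:pos] + tokens[pos + 1:hi]:
            if neighbor in vocab_index:  # words without an embedding are skipped
                context_counts[vocab_index[neighbor]] += 1
    vec = context_counts.dot(emb)  # context vector times embedding matrix
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec

# toy vocabulary with hand-picked unit-length embeddings (illustrative values only)
vocab = ["river", "water", "money", "loan"]
vocab_index = {w: i for i, w in enumerate(vocab)}
emb = np.array([[1.0, 0.0], [0.8, 0.6], [0.0, 1.0], [0.6, 0.8]])

# "bank" is out-of-vocabulary; its embedding depends on the surrounding words
bank_river = local_context_embedding("the river bank had water".split(), "bank", vocab_index, emb)
bank_money = local_context_embedding("the money bank gave loan".split(), "bank", vocab_index, emb)
```

In this toy setup the two occurrences of "bank" receive different embeddings, one closer to "river" and one closer to "money", which is the kind of distinction between word senses described above.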

The code is intended for research purposes. It was programmed for Python 2.7, but should theoretically also run on newer Python 3 versions - no guarantees on this though (open an issue if you find a bug, please)!


You can either download the code from here and include the conec folder in your $PYTHONPATH, or install (the library components only) via pip:

$ pip install conec

conec library components

dependencies: numpy, scipy

  • code to train a standard word2vec model, adapted from the corresponding gensim implementation.
  • code to build a sparse context matrix from a large collection of texts; this context matrix can then be multiplied with the corresponding word2vec embeddings to give the context encoder embeddings:
import numpy as np
from conec import word2vec, context2vec
# Text8Corpus is assumed to be importable from the conec package as well

# get the text for training
sentences = Text8Corpus('data/text8')
# train the word2vec model
w2v_model = word2vec.Word2Vec(sentences, mtype='cbow', hs=0, neg=13, embed_dim=200, seed=3)
# get the global context matrix for the text
context_model = context2vec.ContextModel(sentences, min_count=w2v_model.min_count, window=w2v_model.window, wordlist=w2v_model.index2word)
context_mat = context_model.get_context_matrix(fill_diag=False, norm='max')
# multiply the context matrix with the (length normalized) word2vec embeddings
# to get the context encoder (ConEc) embeddings
# (assumption: the model exposes the normalized embeddings as the gensim-style attribute syn0norm)
conec_emb = context_mat.dot(w2v_model.syn0norm)
# renormalize so the word embeddings have unit length again
conec_emb = conec_emb / np.array([np.linalg.norm(conec_emb, axis=1)]).T
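To make the matrix product above more concrete, here is a self-contained toy sketch using only numpy and scipy. It does not use the conec classes: the helper names `build_context_matrix` and `conec_embeddings` are hypothetical, random vectors stand in for trained word2vec embeddings, and scaling each row by its maximum is only one possible reading of `norm='max'`.

```python
import numpy as np
from scipy.sparse import csr_matrix

def build_context_matrix(sentences, wordlist, window=2):
    """Count co-occurrences within `window` tokens and scale each row by its maximum."""
    index = {w: i for i, w in enumerate(wordlist)}
    counts = np.zeros((len(wordlist), len(wordlist)))
    for sent in sentences:
        ids = [index[w] for w in sent if w in index]
        for pos, i in enumerate(ids):
            lo, hi = max(0, pos - window), min(len(ids), pos + window + 1)
            for j in ids[lo:pos] + ids[pos + 1:hi]:
                counts[i, j] += 1.0
    row_max = counts.max(axis=1)
    row_max[row_max == 0] = 1.0  # leave rows without any context as all zeros
    # dense counts are fine for a toy example; a large corpus needs a sparse build
    return csr_matrix(counts / row_max[:, None])

def conec_embeddings(context_mat, w2v_emb):
    """Multiply the context matrix with unit-length embeddings and renormalize."""
    unit = w2v_emb / np.linalg.norm(w2v_emb, axis=1, keepdims=True)
    emb = context_mat.dot(unit)
    norms = np.linalg.norm(emb, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # avoid dividing by zero for words without context
    return emb / norms

sentences = [["the", "cat", "sat", "on", "the", "mat"],
             ["the", "dog", "sat", "on", "the", "rug"]]
wordlist = ["the", "cat", "dog", "sat", "on", "mat", "rug"]
context_mat = build_context_matrix(sentences, wordlist)

rng = np.random.RandomState(3)
w2v_emb = rng.randn(len(wordlist), 5)  # stand-in for trained word2vec embeddings
conec_emb = conec_embeddings(context_mat, w2v_emb)
```

In this toy corpus "cat" and "dog" occur in identical contexts, so their rows in the context matrix are the same and they end up with the same ConEc embedding, even though their stand-in word2vec vectors are unrelated.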


The experiment code, which is not part of the library components, has additional dependencies (sklearn, unidecode) and contains the code to replicate the analogy and named entity recognition (NER) experiments discussed in the aforementioned paper.

To run the analogy experiment, it is assumed that the text8 corpus or 1-billion corpus as well as the analogy questions are in a data directory.
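For readers unfamiliar with the analogy task, the following sketch shows the common vector-offset approach to answering a question of the form "a is to b as c is to ?". It is only an assumption that the experiment code works along these lines; the function `solve_analogy`, the word list, and the 3-dimensional embeddings are hypothetical toy values.

```python
import numpy as np

def solve_analogy(a, b, c, words, emb):
    """Return the word whose embedding is most similar to b - a + c,
    excluding the three question words themselves."""
    index = {w: i for i, w in enumerate(words)}
    unit = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    query = unit[index[b]] - unit[index[a]] + unit[index[c]]
    scores = unit.dot(query)  # proportional to cosine similarity with the query
    for w in (a, b, c):
        scores[index[w]] = -np.inf  # never return one of the question words
    return words[int(np.argmax(scores))]

# toy embeddings, chosen by hand for illustration
words = ["man", "woman", "king", "queen", "apple"]
emb = np.array([[1.0, 0.0, 0.0],
                [1.0, 1.0, 0.0],
                [1.0, 0.0, 1.0],
                [1.0, 1.0, 1.0],
                [-1.0, 0.0, 0.2]])

answer = solve_analogy("man", "woman", "king", words, emb)
```

Accuracy on a set of analogy questions is then the fraction of questions for which the returned word matches the expected one.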

To run the named entity recognition experiment, it is assumed that the corresponding training and test files are located in the data/conll2003 directory.

If you have any questions please don’t hesitate to send me an email and of course if you should find any bugs or want to contribute other improvements, pull requests are very welcome!

Download files

Files for conec, version 1.0.0:

  • conec-1.0.0.tar.gz (12.4 kB): source distribution
  • conec-1.0.0-py2.py3-none-any.whl (13.6 kB): wheel for Python 2 and Python 3
