A Python module to generate word embeddings from tiny data

nonce2vec

Welcome to Nonce2Vec!

This is the repo accompanying the paper High-risk learning: acquiring new word vectors from tiny data (Herbelot & Baroni, 2017). If you use this code, please cite the following:

@InProceedings{herbelot-baroni:2017:EMNLP2017,
  author    = {Herbelot, Aur\'{e}lie  and  Baroni, Marco},
  title     = {High-risk learning: acquiring new word vectors from tiny data},
  booktitle = {Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing},
  month     = {September},
  year      = {2017},
  address   = {Copenhagen, Denmark},
  publisher = {Association for Computational Linguistics},
  pages     = {304--309},
  url       = {https://www.aclweb.org/anthology/D17-1030}
}

NEW! We have now released v2 of Nonce2Vec, which is packaged via pip and runs on gensim v3.4.0. This should make it much easier to replicate the experiments.

Install

pip3 install nonce2vec
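This installs the n2v command-line entry point used throughout the examples below; running n2v --help should list the train, check and test subcommands.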

Download and extract the required resources

To download the definitional, chimeras and MEN datasets:

wget http://129.194.21.122/~kabbach/noncedef.chimeras.men.7z

To use the pretrained gensim model from Herbelot and Baroni (2017):

wget http://129.194.21.122/~kabbach/wiki_all.sent.split.model.7z
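Both resources are 7-zip archives; assuming p7zip (or another 7-zip client) is installed, they can be extracted with 7z x <archive>.7z. The same applies to the Wikipedia dump below.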

Generate a pre-trained word2vec model

If you want to generate a new gensim.word2vec model from scratch and do not want to rely on the wiki_all.sent.split.model:

Download/Generate a Wikipedia dump

To use the same Wikipedia dump as Herbelot and Baroni (2017):

wget http://129.194.21.122/~kabbach/wiki.all.utf8.sent.split.lower.7z

Otherwise, to create a new Wikipedia dump from a different archive, check out WiToKit.

Train the background model

You can train Word2Vec with gensim via the nonce2vec package:

n2v train \
  --data /absolute/path/to/wikipedia/dump \
  --outputdir /absolute/path/to/dir/where/to/store/w2v/model \
  --alpha 0.025 \
  --neg 5 \
  --window 5 \
  --sample 1e-3 \
  --epochs 5 \
  --min-count 50 \
  --size 400 \
  --num-threads number_of_cpu_threads_to_use \
  --train-mode skipgram
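
The background model is a standard gensim skip-gram model, so the command above corresponds roughly to the following direct gensim v3.4 call. This is only a minimal sketch: the corpus path, worker count and output filename are placeholders, and n2v train may differ in details.

# Minimal sketch of an equivalent background-model training run with
# gensim v3.4 (nonce2vec's n2v train wraps a similar call).
# Paths and the worker count are placeholders.
from gensim.models.word2vec import LineSentence, Word2Vec

sentences = LineSentence('/absolute/path/to/wikipedia/dump')  # one sentence per line
model = Word2Vec(
    sentences,
    size=400,      # --size
    alpha=0.025,   # --alpha
    window=5,      # --window
    min_count=50,  # --min-count
    sample=1e-3,   # --sample
    negative=5,    # --neg
    sg=1,          # --train-mode skipgram
    iter=5,        # --epochs
    workers=4,     # --num-threads
)
model.save('/absolute/path/to/dir/where/to/store/w2v/model/wiki.w2v.model')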

Check the correlation with the MEN dataset

n2v check \
  --data /absolute/path/to/MEN/MEN_dataset_natural_form_full \
  --model /absolute/path/to/gensim/word2vec/model
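
The check correlates the model's cosine similarities with the human relatedness judgements in the MEN file. A minimal sketch of that computation, assuming the natural-form MEN file with one word1 word2 score triple per line (paths are placeholders, and n2v check may differ in details):

# Minimal sketch: Spearman correlation between model similarities and
# human MEN relatedness scores. Paths are placeholders.
from gensim.models import Word2Vec
from scipy.stats import spearmanr

model = Word2Vec.load('/absolute/path/to/gensim/word2vec/model')
system, gold = [], []
with open('/absolute/path/to/MEN/MEN_dataset_natural_form_full') as men:
    for line in men:
        word1, word2, score = line.split()
        if word1 in model.wv.vocab and word2 in model.wv.vocab:
            system.append(model.wv.similarity(word1, word2))
            gold.append(float(score))
print('Spearman rho: {}'.format(spearmanr(system, gold)[0]))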

Replication

Test nonce2vec on the nonce definitional dataset

n2v test \
  --on nonces \
  --model /absolute/path/to/pretrained/w2v/model \
  --data /absolute/path/to/nonce.definitions.299.test \
  --alpha 1 \
  --neg 3 \
  --window 15 \
  --sample 10000 \
  --epochs 1 \
  --lambda 70 \
  --sample-decay 1.9 \
  --window-decay 5
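
The figure reported on the definitional dataset is a Mean Reciprocal Rank (MRR): for each test sentence, the learned nonce vector is ranked against the background vocabulary and the reciprocal rank of the gold word is averaged over the dataset. A minimal sketch of the metric itself (the ranks below are placeholder values, not the output of an actual run):

# Minimal sketch of Mean Reciprocal Rank: ranks are 1-based positions of
# the gold word among vocabulary neighbours of each learned nonce vector.
def mean_reciprocal_rank(ranks):
    return sum(1.0 / rank for rank in ranks) / len(ranks)

print(mean_reciprocal_rank([1, 3, 10]))  # (1 + 1/3 + 1/10) / 3 ≈ 0.478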

Test nonce2vec on the chimeras dataset

n2v test \
  --on chimeras \
  --model /absolute/path/to/pretrained/w2v/model \
  --data /absolute/path/to/chimeras.dataset.lx.tokenised.test.txt \
  --alpha 1 \
  --neg 3 \
  --window 15 \
  --sample 10000 \
  --epochs 1 \
  --lambda 70 \
  --sample-decay 1.9 \
  --window-decay 5
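
On the chimeras dataset the reported figure is an average Spearman correlation (RHO): for each chimera, the learned vector's cosine similarities to a set of probe words are correlated with the human similarity ratings for those probes, and the correlations are averaged. A minimal sketch under that description (all names are placeholders):

# Minimal sketch of the chimeras score: average Spearman correlation
# between system similarities and human ratings, one correlation per chimera.
import numpy
from scipy.stats import spearmanr

def cosine(u, v):
    return numpy.dot(u, v) / (numpy.linalg.norm(u) * numpy.linalg.norm(v))

def average_rho(items):
    # items: iterable of (nonce_vector, probe_vectors, human_ratings) tuples
    rhos = []
    for nonce_vector, probe_vectors, human_ratings in items:
        system = [cosine(nonce_vector, p) for p in probe_vectors]
        rhos.append(spearmanr(system, human_ratings)[0])
    return sum(rhos) / len(rhos)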

Results

Results obtained with nonce2vec v2.x differ slightly from those reported in the original EMNLP paper, due to several bugfixes in how gensim originally handled subsampling with random.rand().

DATASET        MRR / RHO
Definitional   0.04846
Chimeras L2    0.3407
Chimeras L4    0.3457
Chimeras L6    0.4001
