A Python module to generate word embeddings from tiny data
nonce2vec
Welcome to Nonce2Vec!
This is the repo accompanying the paper "High-risk learning: acquiring new word vectors from tiny data" (Herbelot & Baroni, 2017). If you use this code, please cite the following:
@InProceedings{herbelot-baroni:2017:EMNLP2017,
author = {Herbelot, Aur\'{e}lie and Baroni, Marco},
title = {High-risk learning: acquiring new word vectors from tiny data},
booktitle = {Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing},
month = {September},
year = {2017},
address = {Copenhagen, Denmark},
publisher = {Association for Computational Linguistics},
pages = {304--309},
url = {https://www.aclweb.org/anthology/D17-1030}
}
NEW! We have now released v2 of Nonce2Vec, which is packaged via pip and runs on gensim v3.4.0. This should make it much easier for you to replicate experiments.
Install
pip3 install nonce2vec
Download and extract the required resources
To download the definitional, chimeras and MEN datasets:
wget http://129.194.21.122/~kabbach/noncedef.chimeras.men.7z
To use the pretrained gensim model from Herbelot and Baroni (2017):
wget http://129.194.21.122/~kabbach/wiki_all.sent.split.model.7z
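Once extracted (the archives are 7-Zip files), the pretrained background model can be loaded directly with gensim. A minimal sketch, assuming gensim v3.4.0 and that the extracted file keeps the archive's base name `wiki_all.sent.split.model`:

```python
import gensim

# Load the pretrained background model (filename assumed from the archive name).
model = gensim.models.Word2Vec.load('wiki_all.sent.split.model')

# Sanity check: query a few nearest neighbours in the background space.
print(model.wv.most_similar('table', topn=5))
```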
Generate a pre-trained word2vec model
If you want to generate a new gensim.word2vec model from scratch and do not want to rely on the wiki_all.sent.split.model:
Download/Generate a Wikipedia dump
To use the same Wikipedia dump as Herbelot and Baroni (2017):
wget http://129.194.21.122/~kabbach/wiki.all.utf8.sent.split.lower.7z
Otherwise, to create a new Wikipedia dump from a different archive, check out WiToKit.
Train the background model
You can train Word2Vec with gensim via the nonce2vec package:
n2v train \
--data /absolute/path/to/wikipedia/dump \
--outputdir /absolute/path/to/dir/where/to/store/w2v/model \
--alpha 0.025 \
--neg 5 \
--window 5 \
--sample 1e-3 \
--epochs 5 \
--min-count 50 \
--size 400 \
--num-threads number_of_cpu_threads_to_use \
--train-mode skipgram
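Under the hood, `n2v train` relies on gensim's `Word2Vec`. As a rough illustration of what the flags above control, here is a hedged sketch of the approximately equivalent gensim v3.4.0 call (not the exact nonce2vec internals; paths and `workers` count are placeholders):

```python
import gensim

# Any iterable of tokenised sentences works; LineSentence reads one
# sentence per line from the Wikipedia dump.
sentences = gensim.models.word2vec.LineSentence(
    '/absolute/path/to/wikipedia/dump')

# Approximate gensim v3.4.0 equivalents of the n2v train flags above.
model = gensim.models.Word2Vec(
    sentences,
    alpha=0.025,    # --alpha
    negative=5,     # --neg
    window=5,       # --window
    sample=1e-3,    # --sample
    iter=5,         # --epochs (gensim 3.x calls this 'iter')
    min_count=50,   # --min-count
    size=400,       # --size
    workers=4,      # --num-threads
    sg=1)           # --train-mode skipgram
model.save('/absolute/path/to/dir/where/to/store/w2v/model/wiki.w2v.model')
```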
Check the correlation with the MEN dataset
n2v check \
--data /absolute/path/to/MEN/MEN_dataset_natural_form_full \
--model /absolute/path/to/gensim/word2vec/model
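`n2v check` reports the Spearman correlation between the model's cosine similarities and the human MEN ratings. A minimal sketch of that computation, assuming the natural-form MEN file layout (two words and a score per line):

```python
import gensim
from scipy import stats

model = gensim.models.Word2Vec.load('/absolute/path/to/gensim/word2vec/model')

human, system = [], []
with open('/absolute/path/to/MEN/MEN_dataset_natural_form_full') as men:
    for line in men:
        word1, word2, score = line.split()
        # Skip pairs with out-of-vocabulary words.
        if word1 in model.wv.vocab and word2 in model.wv.vocab:
            human.append(float(score))
            system.append(model.wv.similarity(word1, word2))

print('Spearman rho = {}'.format(stats.spearmanr(human, system)[0]))
```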
Replication
Test nonce2vec on the nonce definitional dataset
n2v test \
--on nonces \
--model /absolute/path/to/pretrained/w2v/model \
--data /absolute/path/to/nonce.definitions.299.test \
--alpha 1 \
--neg 3 \
--window 15 \
--sample 10000 \
--epochs 1 \
--lambda 70 \
--sample-decay 1.9 \
--window-decay 5
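On the definitional dataset, performance is measured as the Mean Reciprocal Rank (MRR) of the gold background vector among the nearest neighbours of the learned nonce vector. A hedged sketch of the metric itself, where `ranks` is assumed to hold the 1-indexed rank of each gold word:

```python
def mean_reciprocal_rank(ranks):
    """MRR over a list of 1-indexed ranks of the gold vectors."""
    return sum(1.0 / rank for rank in ranks) / len(ranks)

# Example: gold vectors ranked 1st, 10th and 100th among neighbours.
print(mean_reciprocal_rank([1, 10, 100]))  # 0.37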
Test nonce2vec on the chimeras dataset
n2v test \
--on chimeras \
--model /absolute/path/to/pretrained/w2v/model \
--data /absolute/path/to/chimeras.dataset.lx.tokenised.test.txt \
--alpha 1 \
--neg 3 \
--window 15 \
--sample 10000 \
--epochs 1 \
--lambda 70 \
--sample-decay 1.9 \
--window-decay 5
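On the chimeras dataset, the score is instead the average Spearman correlation between human similarity judgements for each chimera's probe words and the cosine similarities computed with the learned vector. A hedged sketch of a single chimera's score, with hypothetical probe data:

```python
from scipy import stats

def chimera_rho(human_scores, cosine_scores):
    """Spearman correlation between human ratings and model cosines for
    one chimera's probes; the reported score averages over all chimeras."""
    return stats.spearmanr(human_scores, cosine_scores)[0]

# Hypothetical probe ratings and model similarities for one chimera.
print(chimera_rho([3.2, 1.1, 4.5, 2.0], [0.41, 0.12, 0.58, 0.33]))
```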
Results
Results with nonce2vec v2.x differ slightly from those reported in the original EMNLP paper, due to several bugfixes in how gensim originally handled subsampling with random.rand().
| DATASET | MRR / RHO |
|---|---|
| Definitional | 0.04846 |
| Chimeras L2 | 0.3407 |
| Chimeras L4 | 0.3457 |
| Chimeras L6 | 0.4001 |