A tool for learning vector representations of words and entities from Wikipedia

Wikipedia2Vec

Wikipedia2Vec is a tool for obtaining embeddings (i.e., vector representations) of words and entities (concepts that have corresponding pages in Wikipedia) from Wikipedia. It is developed and maintained by Studio Ousia.

This tool learns embeddings of words and entities simultaneously, placing similar words and entities close to one another in a continuous vector space. Embeddings can be trained with a single command, using a publicly available Wikipedia dump as input.

This tool implements the conventional skip-gram model to learn word embeddings, along with its extension proposed in Yamada et al. (2016) to learn entity embeddings. The resulting embeddings have been used in state-of-the-art models for NLP tasks such as entity linking, named entity recognition, knowledge graph completion, entity relatedness, and question answering.
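The idea that "similar words and entities are close in the vector space" is usually measured with cosine similarity. The sketch below illustrates this with tiny made-up 4-dimensional vectors (real models use hundreds of dimensions); the point is only that a word and its corresponding entity can share a neighborhood in the same space:

```python
import numpy as np

# Toy 4-dimensional embeddings (made-up values, for illustration only);
# a real model maps both words and entity titles into one shared space.
embeddings = {
    "tokyo":        np.array([0.9, 0.1, 0.3, 0.0]),
    "ENTITY/Tokyo": np.array([0.8, 0.2, 0.4, 0.1]),
    "banana":       np.array([0.0, 0.9, 0.1, 0.7]),
}

def cosine(u, v):
    """Cosine similarity: 1.0 for identical directions, near 0 for unrelated."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

sim_word_entity = cosine(embeddings["tokyo"], embeddings["ENTITY/Tokyo"])
sim_unrelated = cosine(embeddings["tokyo"], embeddings["banana"])
print(sim_word_entity > sim_unrelated)  # the word and its entity are closer
```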

This tool has been tested on Linux, Windows, and macOS.

An empirical comparison between Wikipedia2Vec and existing embedding tools (i.e., FastText, Gensim, RDF2Vec, and Wiki2vec) is available here.

Documentation and pretrained embeddings for 12 languages (English, Arabic, Chinese, Dutch, French, German, Italian, Japanese, Polish, Portuguese, Russian, and Spanish) are available online at http://wikipedia2vec.github.io/.

Basic Usage

Wikipedia2Vec can be installed via PyPI:

% pip install wikipedia2vec

With this tool, embeddings can be learned by running the train command with a Wikipedia dump as input. For example, the following commands download the latest English Wikipedia dump and learn embeddings from it:

% wget https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2
% wikipedia2vec train enwiki-latest-pages-articles.xml.bz2 MODEL_FILE

The learned embeddings are then written to MODEL_FILE. Note that the train command accepts many optional parameters; please refer to our documentation for further details.
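Once training finishes, the model can be queried from Python. The calls below follow the API described in the Wikipedia2Vec documentation (Wikipedia2Vec.load, get_word_vector, get_entity_vector, most_similar), though names may vary between versions; since MODEL_FILE here is only a placeholder path, the sketch checks that the file exists before loading:

```python
from pathlib import Path

MODEL_FILE = "MODEL_FILE"  # placeholder path to a trained model

if Path(MODEL_FILE).exists():
    from wikipedia2vec import Wikipedia2Vec

    model = Wikipedia2Vec.load(MODEL_FILE)
    # Word and entity vectors live in the same space.
    print(model.get_word_vector("tokyo")[:5])
    print(model.get_entity_vector("Tokyo")[:5])
    # Nearest neighbours of the entity "Tokyo" (words and entities mixed).
    for item, score in model.most_similar(model.get_entity("Tokyo"), 5):
        print(item, score)
else:
    print("no trained model found at", MODEL_FILE)
```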

Reference

If you use Wikipedia2Vec in a scientific publication, please cite the following paper:

Ikuya Yamada, Akari Asai, Hiroyuki Shindo, Hideaki Takeda, Yoshiyasu Takefuji, Wikipedia2Vec: An Optimized Tool for Learning Embeddings of Words and Entities from Wikipedia.

@article{yamada2018wikipedia2vec,
  title={Wikipedia2Vec: An Optimized Tool for Learning Embeddings of Words and Entities from Wikipedia},
  author={Yamada, Ikuya and Asai, Akari and Shindo, Hiroyuki and Takeda, Hideaki and Takefuji, Yoshiyasu},
  journal={arXiv preprint arXiv:1812.06280},
  year={2018}
}

License

Apache License 2.0
