wikivector
Tools for encoding Wikipedia articles as vectors.
Installation
To get the latest stable version:
pip install wikivector
To get the development version:
pip install git+https://github.com/mortonne/wikivector
Exporting Wikipedia text
First, run WikiExtractor on a Wikipedia dump. This will generate a directory with many subdirectories and text files within each subdirectory. Next, build a header file with a list of all articles in the extracted text data:
wiki_header wiki_dir header_file
where wiki_dir is the path to the output from WikiExtractor. The header_file will be a CSV file with the title of each article and the file in which it can be found.
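To make the header idea concrete, here is a minimal sketch of how such a title-to-file index could be built. It is not the code behind wiki_header. It assumes WikiExtractor's default text output, where each article starts with a `<doc ... title="...">` line and files are named `wiki_NN`; the function name `build_header` and the column names "title" and "file" are hypothetical.

```python
import csv
import re
from pathlib import Path

# Assumption: each article begins with a line such as
# <doc id="12" url="..." title="Anarchism">
DOC_RE = re.compile(r'<doc\b[^>]*\btitle="([^"]*)"')


def build_header(wiki_dir, header_file):
    """Write a CSV listing each article title and the file that contains it.

    Returns the number of articles found. Column names are illustrative.
    """
    wiki_dir = Path(wiki_dir)
    n_articles = 0
    with open(header_file, "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out)
        writer.writerow(["title", "file"])
        # Assumption: extracted files are named wiki_00, wiki_01, ...
        for path in sorted(wiki_dir.rglob("wiki_*")):
            with open(path, encoding="utf-8") as f:
                for line in f:
                    match = DOC_RE.match(line)
                    if match:
                        rel_path = str(path.relative_to(wiki_dir))
                        writer.writerow([match.group(1), rel_path])
                        n_articles += 1
    return n_articles
```

Storing the file path next to each title means a later export step can open only the files it needs instead of scanning the whole dump again.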
To extract specific articles, write a CSV file with two columns: "item" and "title". The "title" for each item must exactly match an article title in the Wikipedia dump. To extract the text for each item:
export_articles header_file map_file output_dir
where map_file is the CSV file with your items, and output_dir is where you want to save text files with each item's article.
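As an illustration of the map file format, the following sketch writes a two-column CSV with the "item" and "title" headers using Python's standard csv module. The item pool, the helper name `write_map_file`, and the output filename are made up for the example; the titles would need to match your Wikipedia dump exactly.

```python
import csv

# Hypothetical item pool: each item label is paired with the exact
# Wikipedia article title it should be mapped to.
pool = [
    ("Einstein", "Albert Einstein"),
    ("Eiffel Tower", "Eiffel Tower"),
]


def write_map_file(path, pairs):
    """Write (item, title) pairs to a CSV with "item" and "title" columns."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["item", "title"])
        writer.writerows(pairs)


write_map_file("map_file.csv", pool)
```

Using the csv module rather than joining strings by hand keeps titles that contain commas or quotes correctly quoted.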
Universal Sentence Encoder
Once articles have been exported, you can calculate a vector embedding for each item using the Universal Sentence Encoder.
embed_articles map_file text_dir h5_file
This reads a map file specifying an item pool (only the "item" field is used) and writes the vectors to an HDF5 file. To read the vectors in Python:
from wikivector import vector
vectors, items = vector.load_vectors(h5_file)
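Once vectors are loaded, a common next step is to compare items by cosine similarity. The sketch below assumes that `vectors` holds one row per item, in the same order as `items`; that layout is an assumption, not something documented here. It uses small made-up stand-in data instead of a real HDF5 file, and the helper names `cosine` and `rank_by_similarity` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_by_similarity(items, vectors, query):
    """Return (item, similarity) pairs for all other items, most similar first."""
    q = list(items).index(query)
    scores = [
        (item, cosine(vectors[q], vec))
        for i, (item, vec) in enumerate(zip(items, vectors))
        if i != q
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Stand-in data (assumption: one row of `vectors` per entry in `items`).
items = ["Paris", "London", "Einstein"]
vectors = [[1.0, 0.2, 0.0], [0.9, 0.3, 0.1], [0.0, 0.1, 1.0]]
ranking = rank_by_similarity(items, vectors, "Paris")
```

With real embeddings, the same functions can be called on the loaded `vectors` and `items`, provided the row-per-item assumption holds.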
Citation
If you use wikivector, please cite the following paper:
Morton, NW*, Zippi, EL*, Noh, S, Preston, AR. In revision. Semantic knowledge of famous people and places is represented in distinct networks that converge in hippocampus. * authors contributed equally