Tools for encoding Wikipedia articles as vectors.

wikivector

Installation

To get the latest stable version:

pip install wikivector

To get the development version:

pip install git+https://github.com/mortonne/wikivector

Exporting Wikipedia text

First, run WikiExtractor on a Wikipedia dump. This generates a directory of subdirectories, each containing text files. Next, build a header file listing all articles in the extracted text data:

wiki_header wiki_dir header_file

where wiki_dir is the path to the output from WikiExtractor. The header_file will be a CSV file with the title of each article and the file in which it can be found.

To extract specific articles, write a CSV file with two columns: "item" and "title". The "title" for each item must exactly match an article title in the Wikipedia dump. To extract the text for each item:

export_articles header_file map_file output_dir

where map_file is the CSV file with your items, and output_dir is where you want to save text files with each item's article.
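As a minimal sketch of preparing the map file, the following Python writes and reads a CSV with the two required columns using only the standard library. The helpers `write_map_file`, `read_map_file`, and `missing_titles` are hypothetical and not part of wikivector. How you obtain the set of known titles from the header file is left open, since its exact column layout is not described here.

```python
import csv


def write_map_file(path, pairs):
    """Write (item, title) pairs to a CSV with "item" and "title" columns."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["item", "title"])
        writer.writeheader()
        for item, title in pairs:
            writer.writerow({"item": item, "title": title})


def read_map_file(path):
    """Read the map file back as a list of {"item": ..., "title": ...} dicts."""
    with open(path, newline="", encoding="utf-8") as f:
        return [dict(row) for row in csv.DictReader(f)]


def missing_titles(rows, known_titles):
    """Return the items whose title is not an exact match for any known title."""
    return [row["item"] for row in rows if row["title"] not in known_titles]
```

For example, calling `write_map_file("map.csv", [("einstein", "Albert Einstein"), ("paris", "Paris")])` produces a file that could be passed as map_file. Checking `missing_titles` against the article titles before exporting may help catch titles that do not exactly match the dump.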

Universal Sentence Encoder

Once articles have been exported, you can calculate a vector embedding for each item using the Universal Sentence Encoder:

embed_articles map_file text_dir h5_file

This reads a map file specifying an item pool (only the "item" field is used) and writes the vectors to an HDF5 file. To read the vectors in Python:

from wikivector import vector
vectors, items = vector.load_vectors(h5_file)
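Once loaded, the vectors can be compared with a similarity measure such as cosine similarity. The sketch below is illustrative only. It assumes that `vectors` holds one row per item, in the same order as `items`, which is not confirmed here. It uses toy stand-in values rather than real embeddings, and `cosine_similarity` and `most_similar` are hypothetical helpers, not part of wikivector.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(query, items, vectors):
    """Rank all other items by cosine similarity to the query item."""
    idx = list(items).index(query)
    scores = [
        (item, cosine_similarity(vectors[idx], vec))
        for i, (item, vec) in enumerate(zip(items, vectors))
        if i != idx
    ]
    # Most similar first.
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Toy stand-ins for the output of vector.load_vectors (values are invented).
items = ["paris", "london", "einstein"]
vectors = [[1.0, 0.0, 0.0], [0.8, 0.6, 0.0], [0.0, 0.0, 1.0]]
ranked = most_similar("paris", items, vectors)
```

With real data, a NumPy-based version would likely be faster for large item pools. The pure-Python form is used here only to keep the sketch dependency-free.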

Citation

If you use wikivector, please cite the following paper:

Morton, NW*, Zippi, EL*, Noh, S, Preston, AR. In revision. Semantic knowledge of famous people and places is represented in distinct networks that converge in hippocampus. * authors contributed equally
