wikivector

Tools for encoding Wikipedia articles as vectors.

Installation

To get the latest stable version:

pip install wikivector

To get the development version:

pip install git+https://github.com/mortonne/wikivector

Exporting Wikipedia text

First, run WikiExtractor on a Wikipedia dump. This will generate a directory tree of subdirectories, each containing text files. Next, build a header file with a list of all articles in the extracted text data:

wiki_header wiki_dir

where wiki_dir is the path to the output from WikiExtractor. This will create a CSV file called header.csv with the title of each article and the file in which it can be found.
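The header can then be used directly to locate the file for a given article; a minimal sketch, assuming the CSV has "title" and "file" columns (the actual column names may differ) and using hypothetical entries:

```python
import csv
import io

# Hypothetical header.csv contents; a real header is produced by wiki_header.
header_csv = """title,file
Albert Einstein,AA/wiki_00
Marie Curie,AA/wiki_01
"""

# Build a title -> file lookup from the header rows.
lookup = {row["title"]: row["file"] for row in csv.DictReader(io.StringIO(header_csv))}
print(lookup["Marie Curie"])  # AA/wiki_01
```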

To extract specific articles, write a CSV file with two columns: "item" and "title". The "title" for each item must exactly match an article title in the Wikipedia dump. We refer to this file as the map_file.
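A map file might look like the following (hypothetical items shown; each "title" must match a title in the dump exactly):

```
item,title
einstein,Albert Einstein
curie,Marie Curie
```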

If you are working with an older Wikipedia dump, it can be difficult to find the correct titles for article pages, as page titles may have changed between the archive and the current online version of Wikipedia. To help identify mismatches between the map file and the Wikipedia dump, you can run:

wiki_check_map header_file map_file

to display any items whose article is not found in the header file. You can then use grep to search the header file for the correct title for each missing item.
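For example, a case-insensitive search for candidate titles might look like this (hypothetical header contents shown; a real header.csv comes from wiki_header):

```shell
# Hypothetical header file standing in for the output of wiki_header.
printf 'title,file\nAlbert Einstein,AA/wiki_00\n' > header.csv

# Case-insensitive search for candidate titles for a missing item.
grep -i 'einstein' header.csv
```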

When your map file is ready, extract the text for each item:

export_articles header_file map_file output_dir

where map_file is the CSV file with your items, and output_dir is where you want to save text files with each item's article. Check the output carefully to ensure that you have the correct text for each item and that XML tags have been stripped out.
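One way to spot-check for leftover markup is to scan the exported files for tag-like patterns; a rough sketch, assuming one .txt file per item in output_dir (the actual file naming is not specified here):

```python
import re
from pathlib import Path

def find_tagged_files(output_dir):
    """Return exported files that still contain XML/HTML-like tags."""
    tag = re.compile(r"</?\w+[^>]*>")
    return [p for p in Path(output_dir).glob("*.txt") if tag.search(p.read_text())]
```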

Universal Sentence Encoder

Once articles have been exported, you can calculate a vector embedding for each item using the Universal Sentence Encoder.

embed_articles map_file text_dir h5_file

This reads a map file specifying an item pool (only the "item" field is used) and writes one vector per item to an HDF5 file. To read the vectors in Python:

from wikivector import vector
vectors, items = vector.load_vectors(h5_file)
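Assuming load_vectors returns an (n_items, n_dimensions) NumPy array alongside the item list, pairwise similarities between item embeddings can then be computed, for example cosine similarity:

```python
import numpy as np

def cosine_similarity(vectors):
    """Pairwise cosine similarity for an (n_items, n_dimensions) array."""
    # Normalize each row to unit length, then take inner products.
    norm = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    return norm @ norm.T
```

The diagonal of the resulting matrix is 1 (each item compared with itself), and off-diagonal entries give the similarity between pairs of items.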

Citation

If you use wikivector, please cite the following paper:

Morton, NW*, Zippi, EL*, Noh, S, Preston, AR. In press. Semantic knowledge of famous people and places is represented in hippocampus and distinct cortical networks. Journal of Neuroscience. *authors contributed equally
