Skip to main content

A library for finding the link between two wikipedia pages.

Project description

Description

This python library leverages graphs and artificial intelligence to find the path from a start wikipedia page to a destination wikipedia page.

Installation

Install wikiwormhole through pip package manager.

pip install wikiwormhole

Then run the setup script to accomplish the following:

  • Generate config.yaml script for the pageviews api (How many pageviews does a wikipedia page have).

  • Download pre-trained Word2Vec model using Gensim.

  • Download corpus of stop words (insignificant words) to help embed titles using natural language toolkit (NLTK).

python <venv-name>/bin/setup_wormhole.py <config-path> <download-path>

Here config path is where the new config-yaml will be created. The download path is where the Word2Vec model weights and NLTK corpus will be downloaded on your system.

The final step in installation is providing a personal website and email in the config.yaml file. Once filled you’ll be able to use the pageviews API.

Algorithm

The algorithm is split into two parts: popular and similar traversal.

Similar

Similar Traversal is an algorithm that attempts to find the path of pages from the starting page to the destination page. The algorithm assumes that titles with similar words will be closer to each other in the graph than the contrary. This similarity calculation is powered by the infamous word embedder Word2Vec. The word embedder calculates the words position in a latent space and the similarity is then calculated for two separate words using a cosine similarity metric (emphasis on directional alignment rather than distance).

Psudeocode:

SimilarTraversal(InitialPage, TargetPage)
    PQ <- PriorityQueue() - only stores top x nodes
    G <- Graph(InitialPage)
    PQ.add(InitialPage, Similarity(InitialPage, TargetPage))

    while TargetPage not in G do
        SimNode <- PQ.pop()
        for LinkNode in SimNode do
        G.add(SimNode, LinkNode)
        PQ.add(LinkNode, Similarity(LinkNode, TargetPage))

    return G.path(InitialPage, TargetPage)

Similarity(Title1, Title2)
    Title1 <- Embed(Title1)
    Title2 <- Embed(Title2)
    return CosineSimilarity(Title1, Title2)

Sample Usage

from wikiwormhole.traverse.popular import PopularTraverse
from wikiwormhole.traverse.similar import SimilarTraverse
from wikiwormhole.title2vec import Title2Vec
from tqdm import tqdm

data_dir = './data'
config_path = './config.yaml'

start_title = 'Apple'
end_title= 'Cadillac'

pop_rounds = 5

pop_start = PopularTraverse(start_title, config_path)
pop_end = PopularTraverse(end_title, config_path)

for _ in tqdm(range(pop_rounds)):
    pop_start.traverse()
    pop_end.traverse()

path_start = pop_start.most_popular_pathway()
path_end = pop_end.most_popular_pathway()
print("START: ", path_start)
print("END: ", path_end)

t2v = Title2Vec(data_dir)
sim_start, sim_end = path_start[-1], path_end[-1]
sim_trav = SimilarTraverse(sim_start, sim_end, t2v)

while not sim_trav.target_reached():
    sim_trav.traverse()

path_cnx = sim_trav.path_to_target()
print("RESULT: ", path_start[:-1] + path_cnx + path_end[:-1][::-1])

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

wikiwormhole-0.0.4.tar.gz (14.4 kB view details)

Uploaded Source

Built Distribution

wikiwormhole-0.0.4-py3-none-any.whl (15.0 kB view details)

Uploaded Python 3

File details

Details for the file wikiwormhole-0.0.4.tar.gz.

File metadata

  • Download URL: wikiwormhole-0.0.4.tar.gz
  • Upload date:
  • Size: 14.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.0.0 CPython/3.11.6

File hashes

Hashes for wikiwormhole-0.0.4.tar.gz
Algorithm Hash digest
SHA256 bc352c396d964992d195cbead35fd34fb349d16fa3f32175ad05700b3da46a73
MD5 3627f85294b1ca41972eb68d813d20a4
BLAKE2b-256 be745b26ca57952084a8a67469c9abb56ba5dc6e216af4cc1ef14d2cfc66651b

See more details on using hashes here.

File details

Details for the file wikiwormhole-0.0.4-py3-none-any.whl.

File metadata

  • Download URL: wikiwormhole-0.0.4-py3-none-any.whl
  • Upload date:
  • Size: 15.0 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.0.0 CPython/3.11.6

File hashes

Hashes for wikiwormhole-0.0.4-py3-none-any.whl
Algorithm Hash digest
SHA256 9385050a85804f1d8cd91abdf104de3d6a4b353657aca78bcaa976e74bf07fcb
MD5 aacbdf328338b829f12d27f564a07a41
BLAKE2b-256 a125b8ba60b62cab7ad61524ad7a59c62b10200a30b61a654ccf886fa19c2548

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page