Toolkit for managing and navigating graphs of Wikipedia categories

Project description

`wikicat`

A Python toolkit for managing and navigating graphs of Wikipedia categories 🔖


Simple Python API for exploring graph offline
Useful CLI for processing and launching app
Interactive visualization of categories
UI to display information and filter nodes

Note: If you need help at any time, you can head over to the official documentation.

Main API

Note: The reference can be found on the doc page or in docs/wikicat.md.

The main wikicat API lets you work with category graphs generated from a specific Wikipedia dump. Once the dump is processed via wikicat.processing, you can navigate the graph with simple Python code, entirely offline (i.e., you do not need to make web requests to Wikipedia, and you can choose a dump from any date you prefer). The API is designed to be as simple as possible and is intended for researchers and developers who want to work with the Wikipedia category graph.

To install the API, run:

pip3 install wikicat

wikicat contains two classes to work with the Wikipedia category graph: CategoryGraph and Page. The CategoryGraph class is used to load the graph from a file, and to navigate the graph. The Page class is used to represent a Wikipedia page, and to retrieve information about the page from Wikipedia. They are meant to be used together, as shown in the following example:

import wikicat as wc

# Load the graph
cg = wc.CategoryGraph.read_json(
    '~/.wikicat_data/enwiki_<yyyy>_<mm>_<dd>/category_graph.json'
)

# Get the page for "Montreal"
page = cg.get_page_from_title('Montreal', 'article')

# Get the categories for "Montreal"
cats = cg.get_parents(page=page)
print(f"Category tags of {page.title}: {cats}")

# Get URL of "Montreal"
print("URL:", page.get_url())

By default, the path will be ~/.wikicat_data/, but the JSON can be stored anywhere you want (see wikicat.processing below for more information).
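
As a small illustration of the default layout described above, the sketch below builds the expected JSON path for a given dump date using only the standard library. The helper name `default_graph_path` is hypothetical (it is not part of wikicat), the date is a placeholder, and zero-padding the month and day to two digits is an assumption based on the `<mm>` and `<dd>` placeholders.

```python
import os
from pathlib import Path


def default_graph_path(year, month, day, base_dir="~/.wikicat_data"):
    """Return the expected location of category_graph.json for a dump date.

    Hypothetical helper, not part of wikicat. Assumes the dump directory is
    named enwiki_<yyyy>_<mm>_<dd> with zero-padded month and day.
    """
    dump_dir = f"enwiki_{int(year):04d}_{int(month):02d}_{int(day):02d}"
    # Expand "~" explicitly so the result is a full path on any shell.
    return Path(os.path.expanduser(base_dir)) / dump_dir / "category_graph.json"


# Placeholder date, for illustration only
path = default_graph_path(2023, 6, 1)
print(path)
```

The resulting path can then be passed, as a string, to `wc.CategoryGraph.read_json`, assuming it accepts any filesystem path as the example above suggests.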

`wikicat.processing`

wikicat.processing is a command line interface (CLI) for downloading and processing the data.

Note: The reference can be found on the doc page or in docs/wikicat/processing.md.

To install the processing tools, run:

pip3 install wikicat[processing]

Now, follow these instructions to download and process the data:

# 1. Download DB dump of Wikipedia categories (extension .sql.gz)
python3 -m wikicat.processing.download_dump \
        --year <yyyy> \
        --month <mm> \
        --day <dd> \
        --base_dir ~/.wikicat_data/  # optional, default is ~/.wikicat_data/

# 2. Process individual dumps (.sql.gz) into csv files (to be merged later)
python3 -m wikicat.processing.process_dump \
        -y <yyyy> -m <mm> -d <dd>

# 3. Merge the individual table csvs into a single category graph csv
python3 -m wikicat.processing.merge_tables \
        -y <yyyy> -m <mm> -d <dd>

# 4. Convert CSV category graph into JSON category graph (the final output)
python3 -m wikicat.processing.generate_graph \
        -y <yyyy> -m <mm> -d <dd>

Notes (by step):

1. If you do not specify --base_dir, the dump is automatically saved to ~/.wikicat_data/enwiki_<yyyy>_<mm>_<dd>.
2. This may take a while depending on your hardware, and will need plenty of RAM. It will generate two CSV files with the relevant tables.
3. This should take under 30 minutes depending on your hardware, and will generate a single CSV file with the relevant category graph links.
4. The results will be saved in ~/.wikicat_data/enwiki_<yyyy>_<mm>_<dd>/category_graph.json.
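
If you prefer to drive the four steps above from Python, the following is a minimal sketch. It only uses the module names and flags shown in the commands above; the function names `build_pipeline_commands` and `run_pipeline` are hypothetical, the date is a placeholder, and the values are passed through to the CLI unchanged (whether the CLI expects zero-padded values is not something this sketch assumes or checks).

```python
import subprocess
import sys


def build_pipeline_commands(year, month, day):
    """Return the four processing commands as argument lists, in order."""
    date_long = ["--year", str(year), "--month", str(month), "--day", str(day)]
    date_short = ["-y", str(year), "-m", str(month), "-d", str(day)]
    py = [sys.executable, "-m"]
    return [
        py + ["wikicat.processing.download_dump"] + date_long,
        py + ["wikicat.processing.process_dump"] + date_short,
        py + ["wikicat.processing.merge_tables"] + date_short,
        py + ["wikicat.processing.generate_graph"] + date_short,
    ]


def run_pipeline(year, month, day):
    """Run each step in order, stopping at the first failure."""
    for cmd in build_pipeline_commands(year, month, day):
        print("Running:", " ".join(cmd))
        subprocess.run(cmd, check=True)  # raises CalledProcessError on failure


if __name__ == "__main__":
    # Dry run: print the commands without executing them.
    for cmd in build_pipeline_commands(2023, 6, 1):
        print(" ".join(cmd))
```

Calling `run_pipeline(...)` instead of the dry run executes the steps for real, so the memory and time notes above apply.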

`wikicat.viewer`

wikicat.viewer is an application that lets you visually explore a category graph.

Note: The reference can be found on the doc page or in docs/wikicat/viewer.md.

To install the viewer, run:

pip3 install wikicat[viewer]

To run the viewer, run:

python3 -m wikicat.viewer -y <yyyy> -m <mm> -d <dd> --port 8050

Then, open your browser to http://0.0.0.0:8050.

Usage

The viewer lets you interact with the nodes. You can zoom in and out, move, and click the nodes in the graph.

When you click on a node, information about it (including a list of child articles) appears in the middle panel.
In the right panel, you will see checklists of the children and parents of the selected node. When you click on "Update", the checked parents and children are added to the graph.
There is a dropdown and a validated input. The dropdown lets you choose one of 28 top-level categories. The input lets you type the name of an article (not a category); a green check appears if the title is valid, and it remains red otherwise. When the title is valid, you can click on the "Compute Path" button, which tries to find a valid path between the chosen top-level category and the article.
Click on the "Reset" button to go back to the original view.
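
To give an intuition for what finding a path between a top-level category and an article involves, here is a minimal breadth-first search over a toy graph. This is only a sketch under stated assumptions: it is not the viewer's actual implementation, the function `find_path` is hypothetical, and the toy graph is made up for illustration rather than taken from a real dump.

```python
from collections import deque


def find_path(children, start, goal):
    """Breadth-first search from `start` to `goal` over a child-adjacency dict.

    Returns the list of nodes on a shortest path, or None if no path exists.
    """
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for child in children.get(node, []):
            if child not in visited:
                visited.add(child)
                queue.append(path + [child])
    return None


# Toy graph (made up for illustration, not real dump data)
toy_children = {
    "Geography": ["Cities", "Countries"],
    "Cities": ["Cities in Canada"],
    "Countries": ["Canada"],
    "Canada": ["Cities in Canada"],
    "Cities in Canada": ["Montreal"],
}

print(find_path(toy_children, "Geography", "Montreal"))
# ['Geography', 'Cities', 'Cities in Canada', 'Montreal']
```

The visited set matters here because a category graph can reach the same node through several parents, as "Cities in Canada" does in the toy data.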

Accessing components

wikicat.viewer was built using Dash, a Python framework for building web applications. The application is composed of several components, which can be accessed inside wikicat.viewer.components. For example, to access the Network component, you can run:

import wikicat.viewer.components as comp

# Build the network
cytoscape_graph = comp.build_cytoscape_graph(...)

# Build the right panel
panel = comp.build_panel(...)

Those can be reused in your custom Dash application. You can also create your own component and add it to the viewer. For example:

import dash
import wikicat.viewer as wcv

# ...

# Define app
app = dash.Dash(__name__, external_stylesheets=[style], title=title, **kwargs)

# Define your custom components
def build_btn(...):
    # ...

# Build regular components
cyto_graph = wcv.components.build_cytoscape_graph(root)
# ...
cards = wcv.components.build_cards(cl=cl, sw=sw)
cards_column = wcv.components.build_card_column(cards)

# Build layout
app.layout = wcv.components.build_layout(...)

# Assign callbacks to make app interactive
wcv.components.assign_callbacks(app=app, ...)

# Run app
run(app=app, ...)

See the wikicat.viewer.build_app() function for more details.

Warning

Because of the size of the graph, some parts of the API (such as the viewer and the processing CLI) require a lot of memory. We recommend using a machine with at least 32 GB of RAM. We are working on a more memory-efficient version of the API.
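
Before starting a long processing run, it can help to check how much RAM the machine has. The sketch below is one possible approach using only the standard library; the function names are hypothetical, `os.sysconf` is only available on some Unix-like systems (the code returns None elsewhere), and the threshold is interpreted as decimal gigabytes because operating systems usually report slightly less than the nominal installed amount.

```python
import os


def total_physical_memory_bytes():
    """Return total physical RAM in bytes, or None if it cannot be determined.

    Relies on os.sysconf, which is missing on Windows and may not expose
    these names on every Unix-like platform.
    """
    try:
        pages = os.sysconf("SC_PHYS_PAGES")
        page_size = os.sysconf("SC_PAGE_SIZE")
    except (AttributeError, ValueError, OSError):
        return None
    if pages <= 0 or page_size <= 0:
        return None
    return pages * page_size


def meets_recommendation(total_bytes, recommended_gb=32):
    """Compare a RAM total against the recommended amount (decimal GB)."""
    return total_bytes >= recommended_gb * 10**9


total = total_physical_memory_bytes()
if total is None:
    print("Could not determine installed RAM on this platform.")
elif meets_recommendation(total):
    print("Installed RAM meets the 32 GB recommendation.")
else:
    print(f"Only {total / 10**9:.1f} GB of RAM detected; processing may fail.")
```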

Alternatives

wikicat was designed for offline and high-throughput workflows, with support for different versions of Wikipedia (as categories change over time). As a result, there's a high overhead (long build time and high RAM usage). If this is not what you are looking for, you can check out alternatives to this library:

MediaWiki: This is Wikipedia's web API, and it contains documentation for accessing categories (see API:Categories and API:Categorymembers).
Wikipedia Histories: This library contains a domain-level analysis module that allows you to query articles associated with a certain category. Since it uses the Wikipedia web API, it does not have the same overhead.
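
For comparison with the offline approach, the sketch below builds a request URL for the MediaWiki web API's `prop=categories` query (the API:Categories module mentioned above). It only constructs the URL and does not send a request; the helper name `categories_query_url` and the default limit are choices made for this illustration.

```python
from urllib.parse import urlencode

API_ENDPOINT = "https://en.wikipedia.org/w/api.php"


def categories_query_url(title, limit=50):
    """Build a MediaWiki API URL that lists the categories of one page."""
    params = {
        "action": "query",
        "prop": "categories",
        "titles": title,
        "cllimit": limit,
        "format": "json",
    }
    return f"{API_ENDPOINT}?{urlencode(params)}"


print(categories_query_url("Montreal"))
```

Unlike wikicat, this route needs one or more web requests per page and always reflects the current state of Wikipedia rather than a chosen dump.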

Project details

Release history

This version

0.0.1.dev8 pre-release

Jun 4, 2023

0.0.1.dev7 pre-release

Jun 4, 2023
