Toolkit for managing and navigating graphs of Wikipedia categories
wikicat
A Python toolkit for managing and navigating graphs of Wikipedia categories 🔖
- Simple Python API for exploring graph offline
- Useful CLI for processing and launching app
- Interactive visualization of categories
- UI to display information and filter nodes
Note: If you need help at any time, head over to the official documentation.
Main API
Note: The reference can be found on the doc page or in docs/wikicat.md.
The main wikicat API allows you to work with category graphs generated from a specific Wikipedia dump. Once the dump is processed via wikicat.processing, you can navigate the graph using simple and clear Python code, all offline (i.e., you do not need to make web requests to Wikipedia, and you can choose a dump from any date you prefer). The API is designed to be as simple as possible, and is intended for researchers and developers who want to work with the Wikipedia category graph.
To install the API, run:
pip3 install wikicat
wikicat contains two classes to work with the Wikipedia category graph: CategoryGraph and Page. The CategoryGraph class is used to load the graph from a file and to navigate it. The Page class is used to represent a Wikipedia page and to retrieve information about that page from Wikipedia. They are meant to be used together, as shown in the following example:
import wikicat as wc
# Load the graph
cg = wc.CategoryGraph.read_json(
'~/.wikicat_data/enwiki_<yyyy>_<mm>_<dd>/category_graph.json'
)
# Get the page for "Montreal"
page = cg.get_page_from_title('Montreal', 'article')
# Get the categories for "Montreal"
cats = cg.get_parents(page=page)
print(f"Category tags of {page.title}: {cats}")
# Get URL of "Montreal"
print("URL:", page.get_url())
By default, the path will be ~/.wikicat_data/, but the JSON can be stored anywhere you want (see wikicat.processing below for more information).
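The path pattern used in the example above can be assembled from a dump date. The following is a minimal sketch only: default_graph_path is a hypothetical helper, not part of wikicat, and it assumes that the month and day are zero-padded to two digits, as the <mm> and <dd> placeholders suggest.

```python
from pathlib import Path


def default_graph_path(year, month, day, base_dir="~/.wikicat_data"):
    """Return the expected location of category_graph.json for a dump date.

    Hypothetical helper, not part of wikicat. Assumes month and day are
    zero-padded to two digits, as the <mm> and <dd> placeholders suggest.
    """
    dump_dir = f"enwiki_{year:04d}_{month:02d}_{day:02d}"
    # expanduser() turns a leading "~" into the current user's home directory.
    return Path(base_dir).expanduser() / dump_dir / "category_graph.json"
```

The result can be converted with str() before being handed to a loader; whether the loader expands "~" on its own is not stated here, so expanding it up front is the cautious choice.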
wikicat.processing
wikicat.processing is a command line interface (CLI) for downloading and processing the data.
Note: The reference can be found on the doc page or in docs/wikicat/processing.md.
To install the processing tools, run:
pip3 install wikicat[processing]
Now, follow these instructions to download and process the data:
# 1. Download DB dump of Wikipedia categories (extension .sql.gz)
python3 -m wikicat.processing.download_dump \
--year <yyyy> \
--month <mm> \
--day <dd> \
--base_dir ~/.wikicat_data/ # optional, default is ~/.wikicat_data/
# 2. Process individual dumps (.sql.gz) into csv files (to be merged later)
python3 -m wikicat.processing.process_dump \
-y <yyyy> -m <mm> -d <dd>
# 3. Merge the individual table csvs into a single category graph csv
python3 -m wikicat.processing.merge_tables \
-y <yyyy> -m <mm> -d <dd>
# 4. Convert CSV category graph into JSON category graph (the final output)
python3 -m wikicat.processing.generate_graph \
-y <yyyy> -m <mm> -d <dd>
Notes (by step):
1. If you do not specify --base_dir, it will automatically be saved to ~/.wikicat_data/enwiki_<yyyy>_<mm>_<dd>.
2. This may take a while depending on your hardware, and will need plenty of RAM. It will generate two CSV files with the relevant tables.
3. This should take under 30 mins depending on your hardware, and will generate a single CSV file with the relevant category graph links.
4. The results will be saved in ~/.wikicat_data/enwiki_<yyyy>_<mm>_<dd>/category_graph.json.
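The four steps can also be chained from Python. This is a sketch under assumptions rather than an official entry point: build_commands and run_pipeline are hypothetical names, the flags are copied from the commands shown in this section, the default base directory is used throughout, and the date values are passed through as given because the exact format the CLI expects is not specified here.

```python
import subprocess
import sys


def build_commands(year, month, day):
    """Return the four processing commands, in order, as argument lists."""
    date_flags = ["-y", str(year), "-m", str(month), "-d", str(day)]
    # Step 1 uses the long flags, exactly as shown in the instructions.
    commands = [[
        sys.executable, "-m", "wikicat.processing.download_dump",
        "--year", str(year), "--month", str(month), "--day", str(day),
    ]]
    # Steps 2 to 4 share the same short date flags.
    for step in ("process_dump", "merge_tables", "generate_graph"):
        commands.append(
            [sys.executable, "-m", f"wikicat.processing.{step}", *date_flags]
        )
    return commands


def run_pipeline(year, month, day):
    """Run each step in turn, stopping at the first failure."""
    for cmd in build_commands(year, month, day):
        print("Running:", " ".join(cmd))
        # check=True raises CalledProcessError on a non-zero exit code.
        subprocess.run(cmd, check=True)
```

Calling run_pipeline with a dump date would execute the steps one after another. Using sys.executable keeps every step on the interpreter where wikicat[processing] is installed.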
wikicat.viewer
wikicat.viewer is an application that lets you visually explore a category graph.
Note: The reference can be found on the doc page or in docs/wikicat/viewer.md.
To install the viewer, run:
pip3 install wikicat[viewer]
To launch the viewer, run:
python3 -m wikicat.viewer -y <yyyy> -m <mm> -d <dd> --port 8050
Then, open your browser to http://0.0.0.0:8050.
Usage
The viewer lets you interact with the nodes. You can zoom in and out, and move and click the nodes in the graph.
- When you click on a node, various information (including a list of child articles) appears in the middle panel.
- On the right panel, you will see various checklists of children and parents of the selected node. When you click on "Update", the checked parents and children will be added to the graph.
- There's a dropdown and a validated input. The dropdown lets you choose one of 28 top-level categories. The input lets you type the name of an article (not a category); a green check appears if the title is valid, otherwise the input remains red. When the title is valid, you can click on the "Compute Path" button, which will try to find a valid path between the top-level category and the article you chose.
- Click on the "Reset" button to go back to the original view.
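To make the idea behind "Compute Path" concrete, here is a small sketch of one way to search for a path from a category down to an article. It is not the viewer's implementation, which this page does not describe: it is a plain breadth-first search over a toy dictionary, and find_path, toy_children, and the links in the toy graph are all made up for illustration.

```python
from collections import deque


def find_path(children, start, target):
    """Breadth-first search from `start` to `target`.

    `children` maps each category name to the names directly beneath it.
    Returns the shortest path as a list of names, or None if unreachable.
    """
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == target:
            return path
        for child in children.get(node, []):
            if child not in visited:  # category graphs can contain cycles
                visited.add(child)
                queue.append(path + [child])
    return None


# Toy graph with made-up links, for illustration only.
toy_children = {
    "Geography": ["Places", "Cities"],
    "Places": ["Cities"],
    "Cities": ["Cities in Canada"],
    "Cities in Canada": ["Montreal"],
}
print(find_path(toy_children, "Geography", "Montreal"))
```

Breadth-first search returns a path with the fewest hops, and the visited set keeps the search from looping when categories reference each other.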
Accessing components
wikicat.viewer was built using Dash, a Python framework for building web applications. The application is composed of several components, which can be accessed inside wikicat.viewer.components. For example, to access the Network component, you can run:
import wikicat.viewer.components as comp
# Build the network
cytoscape_graph = comp.build_cytoscape_graph(...)
# Build the right panel
panel = comp.build_panel(...)
Those can be reused in your custom Dash application. You can also create your own component and add it to the viewer. For example:
import dash
import wikicat.viewer as wcv
# ...
# Define app
app = dash.Dash(__name__, external_stylesheets=[style], title=title, **kwargs)
# Define your custom components
def build_btn(...):
    # ...
# Build regular components
cyto_graph = wcv.components.build_cytoscape_graph(root)
# ...
cards = wcv.components.build_cards(cl=cl, sw=sw)
cards_column = wcv.components.build_card_column(cards)
# Build layout
app.layout = wcv.components.build_layout(...)
# Assign callbacks to make app interactive
wcv.components.assign_callbacks(app=app, ...)
# Run app
run(app=app, ...)
See the wikicat.viewer.build_app() function for more details.
Warning
Because of the size of the graph, some parts of the API (such as the viewer and the processing CLI) require a lot of memory. We recommend using a machine with at least 32 GB of RAM. We are working on a more memory-efficient version of the API.
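Before launching the viewer or the processing CLI, it can help to check how much memory the machine has. The sketch below is an illustration only and not part of wikicat: total_ram_bytes and meets_recommendation are hypothetical helpers, the check relies on os.sysconf (available on Linux and most Unix systems, not on Windows), and it interprets "32 GB" as decimal gigabytes, which is an assumption.

```python
import os

RECOMMENDED_GB = 32  # figure taken from the warning above


def total_ram_bytes():
    """Return total physical memory in bytes, or None if it cannot be determined.

    Relies on os.sysconf, which is available on Linux and most Unix systems
    but not on Windows.
    """
    try:
        return os.sysconf("SC_PHYS_PAGES") * os.sysconf("SC_PAGE_SIZE")
    except (AttributeError, ValueError, OSError):
        return None


def meets_recommendation(ram_bytes, required_gb=RECOMMENDED_GB):
    """True if `ram_bytes` is at least `required_gb` decimal gigabytes."""
    return ram_bytes is not None and ram_bytes >= required_gb * 10**9
```

A script could call meets_recommendation(total_ram_bytes()) and print a warning when it returns False, rather than refusing to run, since actual memory use depends on the dump.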