mcwiki
A scraping library for the Minecraft Wiki.
import mcwiki
page = mcwiki.load("Data Pack")
print(page["pack.mcmeta"].extract(mcwiki.TREE))
[TAG_Compound]
The root object.
└─ pack
   [TAG_Compound]
   Holds the data pack information.
   ├─ description
   │  [TAG_String, TAG_List, TAG_Compound]
   │  A JSON text that appears when hovering over the data pack's name in
   │  the list given by the /datapack list command, or when viewing the pack
   │  in the Create World screen.
   └─ pack_format
      [TAG_Int]
      Pack version: If this number does not match the current required
      number, the data pack displays a warning and requires additional
      confirmation to load the pack. Requires 4 for 1.13–1.14.4. Requires 5
      for 1.15–1.16.1. Requires 6 for 1.16.2–1.16.5. Requires 7 for 1.17.
Introduction
The Minecraft Wiki is a well-maintained source of information but is a bit too organic to be used as anything more than a reference. This project tries its best to make it possible to locate and extract the information you're interested in and use it as a programmatic source of truth for developing Minecraft-related tooling.
Features
- Easily navigate through page sections
- Extract paragraphs, code blocks and recursive tree-like hierarchies
- Create custom extractors or extend the provided ones
Installation
The package can be installed with `pip`.
$ pip install mcwiki
Getting Started
The `load` function allows you to load a page from the Minecraft Wiki. The page can be specified by providing a URL or simply the title of the page.
mcwiki.load("https://minecraft.fandom.com/wiki/Data_Pack")
mcwiki.load("Data Pack")
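The two calls above suggest that a bare title is turned into a wiki URL before the page is fetched. The sketch below shows one way that resolution could work. It is an illustration only: `resolve_page_url` and `WIKI_BASE` are hypothetical names, not part of mcwiki, and the base URL is simply copied from the example above.

```python
from urllib.parse import quote

# Assumption: base URL copied from the example URL in this README.
WIKI_BASE = "https://minecraft.fandom.com/wiki/"


def resolve_page_url(page: str) -> str:
    """Return `page` unchanged if it is already a URL, else build a URL from a title.

    Hypothetical helper, not mcwiki's actual implementation.
    """
    if page.startswith(("http://", "https://")):
        return page
    # MediaWiki-style titles use underscores in place of spaces.
    title = page.strip().replace(" ", "_")
    # Keep "/" and ":" readable so subpages such as "Advancement/JSON_format" survive.
    return WIKI_BASE + quote(title, safe="/:")


print(resolve_page_url("Data Pack"))
```

With this helper, a title and its full URL resolve to the same address, which matches the behaviour described for `load`.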
You can use the `load_file` function to read from a page downloaded locally, or the `from_markup` function if you already have the HTML loaded in a string.
mcwiki.load_file("Data_Pack.html")
mcwiki.from_markup("<!DOCTYPE html>\n<html ...")
Page sections can then be manipulated like dictionaries. Keys are case-insensitive section titles, and each key maps to the corresponding subsection.
page = mcwiki.load("https://minecraft.fandom.com/wiki/Advancement/JSON_format")
print(page["List of triggers"])
<PageSection ['minecraft:bee_nest_destroyed', 'minecraft:bred_animals', ...]>
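To make the dictionary-like behaviour concrete, here is a small stand-in that mimics case-insensitive section lookup. The `Section` class is hypothetical and written only for this illustration; it is not mcwiki's `PageSection` and makes no claim about how the library stores its sections.

```python
from __future__ import annotations

from collections.abc import Iterator, Mapping
from typing import Optional


class Section(Mapping):
    """Hypothetical stand-in for a page section with case-insensitive keys."""

    def __init__(self, subsections: Optional[dict[str, Section]] = None) -> None:
        self._titles = {}       # case-folded title -> original title
        self._subsections = {}  # case-folded title -> subsection
        for title, section in (subsections or {}).items():
            key = title.casefold()
            self._titles[key] = title
            self._subsections[key] = section

    def __getitem__(self, title: str) -> Section:
        # Fold the requested title so "list of triggers" matches "List of triggers".
        return self._subsections[title.casefold()]

    def __iter__(self) -> Iterator[str]:
        # Iteration reports the titles as originally written.
        return iter(self._titles.values())

    def __len__(self) -> int:
        return len(self._subsections)

    def __repr__(self) -> str:
        return f"<Section {list(self)}>"


page = Section({"List of triggers": Section({"minecraft:bred_animals": Section()})})
print(page["list of TRIGGERS"])
```

Because the class derives from `Mapping`, membership tests and `.get()` pick up the same case-insensitive lookup for free.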
Extracting Data
There are four built-in extractors. Extractors are instantiated with a CSS selector and define a `process` method that produces an item for each element returned by the selector.
Extractor | Type | Extracted Item |
---|---|---|
`PARAGRAPH` | `TextExtractor("p")` | String containing the text content of a paragraph |
`CODE` | `TextExtractor("code")` | String containing the text content of a code span |
`CODE_BLOCK` | `TextExtractor("pre")` | String containing the text content of a code block |
`TREE` | `TreeExtractor()` | An instance of `mcwiki.Tree` containing the treeview data |
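The selector-plus-`process` design can be sketched with the standard library alone. The example below is not mcwiki's implementation: `TagTextExtractor` and `ShoutingExtractor` are hypothetical classes, and the "selector" here is just a tag name rather than a full CSS selector. It only illustrates how a `process` hook turns each matched element into an item and how a subclass can customise it.

```python
from __future__ import annotations

from html.parser import HTMLParser
from typing import Optional


class TagTextExtractor(HTMLParser):
    """Hypothetical extractor: matches elements by tag name, one item per element."""

    def __init__(self, tag: str) -> None:
        super().__init__()
        self.tag = tag
        self._depth = 0    # nesting depth inside matching elements
        self._chunks = []  # text collected for the element being read
        self.items = []

    def process(self, text: str) -> str:
        """Turn the raw text of one matched element into an item."""
        return " ".join(text.split())

    def handle_starttag(self, tag, attrs):
        if tag == self.tag:
            self._depth += 1

    def handle_endtag(self, tag):
        if tag == self.tag and self._depth:
            self._depth -= 1
            if self._depth == 0:
                # The outermost matching element just closed: emit one item.
                self.items.append(self.process("".join(self._chunks)))
                self._chunks = []

    def handle_data(self, data):
        if self._depth:
            self._chunks.append(data)

    def extract_all(self, markup: str) -> list[str]:
        self.reset()
        self._depth, self._chunks, self.items = 0, [], []
        self.feed(markup)
        self.close()
        return self.items

    def extract(self, markup: str, index: int = 0) -> Optional[str]:
        items = self.extract_all(markup)
        return items[index] if index < len(items) else None


class ShoutingExtractor(TagTextExtractor):
    """Custom extractor that extends the base one by overriding `process`."""

    def process(self, text: str) -> str:
        return super().process(text).upper()


MARKUP = "<p>First  paragraph.</p><pre>code</pre><p>Second <code>inline</code> one.</p>"
print(TagTextExtractor("p").extract_all(MARKUP))
print(ShoutingExtractor("p").extract(MARKUP, index=1))
```

Keeping the per-element logic in a single overridable method means a custom extractor only has to describe how one element becomes one item; selection and iteration stay in the base class.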
Page sections can invoke extractors by using the `extract` and `extract_all` methods. The `extract` method will return the first item in the page section or `None` if the extractor couldn't extract anything.
print(page.extract(mcwiki.PARAGRAPH))
Custom advancements in data packs of a Minecraft world store the advancement data for that world as separate JSON files.
You can use the `index` argument to specify which paragraph to extract.
print(page.extract(mcwiki.PARAGRAPH, index=1))
All advancement JSON files are structured according to the following format:
The `extract_all` method will return a lazy sequence-like container of all the items the extractor could extract from the page section.
for paragraph in page.extract_all(mcwiki.PARAGRAPH):
    print(paragraph)
You can use the `limit` argument or slice the returned sequence to limit the number of extracted items.
# Both yield exactly the same list
paragraphs = page.extract_all(mcwiki.PARAGRAPH)[:10]
paragraphs = list(page.extract_all(mcwiki.PARAGRAPH, limit=10))
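The point of a lazy container is that items past the requested range are never processed. The sketch below demonstrates the idea with a hypothetical `LazySequence` class. It is a stand-in written for this illustration, under the assumption that mcwiki's container behaves similarly; it is not the library's own class.

```python
from itertools import islice


class LazySequence:
    """Hypothetical lazy container: items are produced on demand and cached."""

    def __init__(self, items, limit=None):
        self._source = iter(items if limit is None else islice(items, limit))
        self._cache = []

    def _fill(self, count=None):
        """Pull from the source until `count` items are cached (all items if None)."""
        needed = None if count is None else max(count - len(self._cache), 0)
        self._cache.extend(islice(self._source, needed))

    def __getitem__(self, index):
        if isinstance(index, slice):
            start, stop, step = index.start, index.stop, index.step
            # Only a plain forward slice with a non-negative stop bounds the work.
            simple = (
                (start is None or start >= 0)
                and stop is not None and stop >= 0
                and (step is None or step > 0)
            )
            self._fill(stop if simple else None)
            return self._cache[index]
        self._fill(index + 1 if index >= 0 else None)
        return self._cache[index]

    def __iter__(self):
        position = 0
        while True:
            self._fill(position + 1)
            if position >= len(self._cache):
                return
            yield self._cache[position]
            position += 1


processed = []  # records which items were actually produced


def paragraphs():
    for number in range(1, 1001):
        processed.append(number)
        yield f"paragraph {number}"


first_ten = LazySequence(paragraphs())[:10]
print(len(first_ten), len(processed))  # 10 10
```

Only ten of the thousand available items are generated, whether the bound comes from a slice or from the `limit` argument.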
Tree Structures
The `TREE` extractor returns recursive tree-like hierarchies. You can use the `children` property to iterate through the direct children of a tree.
def print_nodes(tree: mcwiki.Tree):
    for key, node in tree.children:
        print(key, node.text, node.icons)
        print_nodes(node.content)

print_nodes(section.extract(mcwiki.TREE))
Folded entries are automatically fetched, inlined, and cached. This means that iterating over the `children` property can yield a node that has already been visited, so a naive recursive traversal may never terminate. Make sure to guard against infinite recursion where appropriate.
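One way to guard against such cycles is to remember which trees are on the current path and skip any tree that is its own ancestor. The sketch below uses hypothetical `Node` and `Tree` stand-ins that only mirror the `children`, `text`, `icons` and `content` names described in this section, so that it runs without the library. The `walk` helper is likewise an illustration, not part of mcwiki.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass(eq=False)
class Tree:
    """Stand-in tree: `children` yields (key, node) pairs."""
    entries: list = field(default_factory=list)

    @property
    def children(self):
        return iter(self.entries)


@dataclass(eq=False)
class Node:
    """Stand-in node with the three attributes described in this section."""
    text: str = ""
    icons: tuple = ()
    content: Tree = field(default_factory=Tree)


def walk(tree, path=(), _active=None):
    """Yield (path, node) pairs depth-first, skipping trees already on the current path."""
    active = set() if _active is None else _active
    if id(tree) in active:
        return  # Recursive reference: this tree is one of its own ancestors.
    active.add(id(tree))
    for key, node in tree.children:
        yield path + (key,), node
        yield from walk(node.content, path + (key,), active)
    active.discard(id(tree))


# Build a tiny cyclic structure (the icon name is made up for the example).
root = Tree()
criteria = Node(text="A criterion", icons=("compound",))
root.entries.append(("criteria", criteria))
# Simulate an inlined folded entry that points back at the root tree.
criteria.content.entries.append(("parent", Node(text="Back to the root", content=root)))

for path, node in walk(root):
    print("/".join(path), node.text)
```

Tracking only the active path, rather than every tree ever seen, still lets a subtree that is shared by several branches be reported under each of them while cutting genuine cycles.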
Tree nodes have 3 attributes that can all be empty:
- The `text` attribute holds the text content of the node
- The `icons` attribute is a tuple that stores the names of the icons associated with the node
- The `content` attribute is a tree containing the children of the node
You can transform the tree into a shallow dictionary with the `as_dict` method.
# Both yield exactly the same dictionary
nodes = tree.as_dict()
nodes = dict(tree.children)
Contributing
Contributions are welcome. Make sure to first open an issue discussing the problem or the new feature before creating a pull request. The project uses `poetry`.
$ poetry install
You can run the tests with `poetry run pytest`.
$ poetry run pytest
The project must type-check with `pyright`. If you're using VSCode, the `pylance` extension should report diagnostics automatically. You can also install the type-checker locally with `npm install` and run it from the command-line.
$ npm run watch
$ npm run check
The code follows the `black` code style. Import statements are sorted with `isort`.
$ poetry run isort mcwiki tests
$ poetry run black mcwiki tests
$ poetry run black --check mcwiki tests
License - MIT