Fandom Scraper

A simple Fandom scraper powered by AI (SpanMarker).

Note: This package is part of the Cirilla project.

Important: An NVIDIA GPU is required to use this package.

Because Hugging Face's SpanMarker can be fragile, the requirements are pinned to fixed versions, so I advise creating a separate project that is used only for scraping the data.
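Since an NVIDIA GPU is required, a quick pre-flight check before a long scraping run can save time. The sketch below is only a heuristic and is not part of fandom-scraper: the function name `nvidia_gpu_visible` is hypothetical, and it assumes the NVIDIA driver's `nvidia-smi` tool is on the PATH when a GPU is usable.

```python
import shutil
import subprocess


def nvidia_gpu_visible() -> bool:
    """Heuristic: True if `nvidia-smi` is on PATH and lists at least one GPU."""
    exe = shutil.which("nvidia-smi")
    if exe is None:
        # No driver tooling found, so assume no usable NVIDIA GPU.
        return False
    try:
        # `nvidia-smi -L` prints one line per detected GPU.
        result = subprocess.run(
            [exe, "-L"], capture_output=True, text=True, timeout=10, check=False
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0 and "GPU" in result.stdout
```

Calling `nvidia_gpu_visible()` before starting the scraper gives an early warning; it does not guarantee that the pinned deep learning stack can actually use the device.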

Installation

# (recommended)
uv add fandom-scraper

# or
pip install fandom-scraper
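Because the requirements are pinned, it can help to confirm which version ended up in the environment. A minimal sketch using only the standard library; the helper name `installed_version` is hypothetical and not part of the package.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


# Prints the version string if the package is installed, otherwise None.
print(installed_version("fandom-scraper"))
```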

Usage

Usage is simple: the function takes a path with so-called seeds, the page titles from which scraping starts, e.g. examples/witcher_json/witcher_1.json:

[
    "Geralt of Rivia", "Triss Merigold", "Vesemir", "Leo", "Lambert", 
    "Eskel", "Alvin", "Shani", "Zoltan Chivay", "Dandelion (Jaskier)", 
    "King Foltest", "Adda the White",

    "Jacques de Aldersberg", "Azar Javed", "Professor (leader of Salamandra)", 
    ...
]

After the seeds, the scraper follows suggestions provided by a Named Entity Recognition (NER) model. The script saves the scraped pages and the instructions into their respective folders.

from fandom_scraper import scrape_fandom

in_path = "./examples/witcher_json"                 # path with the seed files
out_path = "./examples/async_fandom"                # scraped pages are saved here
instruct_path = "./examples/async_fandom_instruct"  # scraped instructions are saved here

wiki = "Witcher"
lang = "en"

scrape_fandom(in_path=in_path,
              out_path=out_path,
              instruct_path=instruct_path,
              n_workers=50,  # number of async page "fetchers"
              wiki=wiki,
              lang=lang)

See examples/async_fandom/ and examples/async_fandom_instruct/ for more examples.
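To scrape a different wiki, a seed file has to be prepared first. The sketch below assumes a seed file is a flat JSON list of page-title strings, as in the witcher_1.json excerpt; the helper `write_seed_file` and the output path are hypothetical and not part of fandom-scraper.

```python
import json
from pathlib import Path


def write_seed_file(titles, path):
    """Write page titles as a JSON list, dropping blanks and duplicates (order kept)."""
    seen = set()
    unique = []
    for title in titles:
        title = title.strip()
        if title and title not in seen:
            seen.add(title)
            unique.append(title)
    path = Path(path)
    # Create the seed folder if it does not exist yet.
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(unique, ensure_ascii=False, indent=4), encoding="utf-8")
    return unique
```

For example, `write_seed_file(["Geralt of Rivia", "Vesemir"], "./my_seeds/seeds_1.json")` would produce a file whose folder could then be passed as `in_path`.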

To turn the scraped instructions into a .jsonl file that can be used by Cirilla models, use:

from fandom_scraper import instructions_into_conv

instructions_into_conv('./examples/async_fandom_instruct', './examples/fandom_instruct.jsonl')

The resulting file will look like:

{"subject": "Shani", "text": [{"role": "user", "content": "What role did Shani play in the Battle of Brenna?"}, {"role": "assistant", "content": "Shani played a vital role in the Battle of Brenna in March 1268, serving in a field hospital under Milo Rusty Vanderbeck. Despite her youth, she effectively managed the demanding and gruesome conditions, providing essential medical aid to the wounded, contributing to the victory of the Northern Kingdoms."}], "data type": "conv", "source": "fandom"}
{"subject": "Shani", "text": [{"role": "user", "content": "How did Shani become the Dean of the Department of Medicine at Oxenfurt Academy?"}, {"role": "assistant", "content": "Shani, initially a medical student at Oxenfurt Academy, demonstrated her medical proficiency at a young age. She served in Milo Rusty Vanderbecks field hospital at the Battle of Brenna. Her exceptional skills and commitment to the field led to her appointment as the Dean of the Department of Medicine at Oxenfurt Academy."}], "data type": "conv", "source": "fandom"}
...
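The resulting file can be inspected with a few lines of standard-library Python. This sketch assumes every line is a JSON object with the "subject" and "text" keys shown above; the function names are hypothetical and not part of fandom-scraper.

```python
import json
from collections import Counter


def iter_conversations(path):
    """Yield one parsed record per non-empty line of a .jsonl file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)


def questions_per_subject(path):
    """Count user turns per subject, assuming the record layout shown above."""
    counts = Counter()
    for record in iter_conversations(path):
        user_turns = [m for m in record["text"] if m["role"] == "user"]
        counts[record["subject"]] += len(user_turns)
    return counts
```

Calling `questions_per_subject('./examples/fandom_instruct.jsonl').most_common(10)` would list the ten subjects with the most questions, which is a quick way to spot over-represented pages.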

Effectiveness

For the Witcher fandom, the scraper gathered 7506 pages and 1494 instructions: all in all, around 40 MiB of pure text in around 4 hours.
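These figures translate into rough per-page rates. The calculation below is only back-of-the-envelope arithmetic on the approximate numbers reported above; it treats the whole 40 MiB as page text, so the per-page size is an upper estimate.

```python
pages = 7506
instructions = 1494
total_mib = 40  # approximate, as reported
hours = 4       # approximate, as reported

pages_per_second = pages / (hours * 3600)
kib_per_page = total_mib * 1024 / pages
instructions_per_page = instructions / pages

print(f"{pages_per_second:.2f} pages/s")                      # 0.52 pages/s
print(f"{kib_per_page:.1f} KiB per page")                     # 5.5 KiB per page
print(f"{instructions_per_page:.2f} instructions per page")   # 0.20 instructions per page
```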

Project details

Release history

Version	Released
0.6.4 (this version)	Sep 10, 2025
0.6.3	Sep 10, 2025
0.6.2	Sep 10, 2025
0.6.1	Sep 10, 2025
0.6.0	Sep 9, 2025
0.5.0	Sep 9, 2025
0.3.1	Sep 9, 2025
0.3.0	Sep 9, 2025
0.2.1	Sep 9, 2025
0.2.0	Sep 9, 2025
0.1.34	Sep 9, 2025
0.1.33	Sep 9, 2025
0.1.32	Sep 9, 2025
0.1.31	Sep 9, 2025
0.1.4	Sep 9, 2025
0.1.3	Sep 9, 2025
0.1.2	Sep 9, 2025
0.1.1	Sep 9, 2025
0.1.0	Sep 9, 2025

Download files

Source Distribution

fandom_scraper-0.6.4.tar.gz (5.7 kB)

Uploaded Sep 10, 2025 Source

Built Distribution

fandom_scraper-0.6.4-py3-none-any.whl (6.9 kB)

Uploaded Sep 10, 2025 Python 3

File details

Details for the file fandom_scraper-0.6.4.tar.gz.

File metadata

Download URL: fandom_scraper-0.6.4.tar.gz
Upload date: Sep 10, 2025
Size: 5.7 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.8.0

File hashes

Hashes for fandom_scraper-0.6.4.tar.gz
Algorithm	Hash digest
SHA256	`67ba364b990b338f95aea87e4816e455a23e7e0ed85fb936770124c9344d948d`
MD5	`a51d170b4ec19960cabce561843da340`
BLAKE2b-256	`6fba4e830795a72e6789bdd5c8386b7f3ed7543db8826315045f7917ce61be7e`

File details

Details for the file fandom_scraper-0.6.4-py3-none-any.whl.

File metadata

Download URL: fandom_scraper-0.6.4-py3-none-any.whl
Upload date: Sep 10, 2025
Size: 6.9 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.8.0

File hashes

Hashes for fandom_scraper-0.6.4-py3-none-any.whl
Algorithm	Hash digest
SHA256	`600e5ce898d5c59298fac947f887d3fc5610882b61f5af89cc4d80f01b9fd005`
MD5	`c7081358e121c59d753c712d2b9c7633`
BLAKE2b-256	`4354ce9870400d7c1ebf1f311f8ee8c9f835e93a1f513ca2986ec53cc5228074`
