Fandom Scraper
A simple AI (span marker) powered fandom scraper.
[!NOTE]
This package is a part of the Cirilla project
[!IMPORTANT]
An NVIDIA GPU is required to use the package. Considering how fragile Hugging Face's SpanMarker can be, the requirements are pinned, so I advise creating a separate project used only for scraping the data.
Installation
# (recommended)
uv add fandom-scraper
# or
pip install fandom-scraper
Usage
Usage is simple: the function requires a path with so-called seeds to start scraping from, e.g. examples/witcher_json/witcher_1.json:
[
"Geralt of Rivia", "Triss Merigold", "Vesemir", "Leo", "Lambert",
"Eskel", "Alvin", "Shani", "Zoltan Chivay", "Dandelion (Jaskier)",
"King Foltest", "Adda the White",
"Jacques de Aldersberg", "Azar Javed", "Professor (leader of Salamandra)",
...
]
and later uses suggestions provided by a Named Entity Recognition (NER) model. The script saves the scraped pages and instructions into their respective folders.
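A seed file like the one above is a plain JSON array of page titles, so it can be written by hand or generated. The sketch below shows one way to generate it; `write_seed_file` is a hypothetical helper and not part of fandom-scraper, and it assumes the scraper picks up JSON files from the `in_path` folder, as the example layout suggests.

```python
import json
import tempfile
from pathlib import Path


def write_seed_file(folder, name, titles):
    """Write page titles to a JSON array (hypothetical helper, not part of the package)."""
    folder = Path(folder)
    folder.mkdir(parents=True, exist_ok=True)
    # Strip whitespace, drop empty entries, and remove duplicates while keeping order.
    cleaned = (title.strip() for title in titles)
    unique = list(dict.fromkeys(title for title in cleaned if title))
    path = folder / name
    path.write_text(json.dumps(unique, ensure_ascii=False, indent=2), encoding="utf-8")
    return path


# A temporary folder is used here only to keep the example self-contained.
seed_dir = Path(tempfile.mkdtemp()) / "my_seeds"
seed_path = write_seed_file(
    seed_dir,
    "characters.json",
    ["Geralt of Rivia", "Shani", "Shani ", "", "Vesemir"],
)
seeds = json.loads(seed_path.read_text(encoding="utf-8"))
print(seeds)
```

A folder produced this way could then be passed as in_path in the call below.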
from fandom_scraper import scrape_fandom
in_path = "./examples/witcher_json"
out_path = "./examples/async_fandom"
instruct_path = "./examples/async_fandom_instruct"
wiki = "Witcher"
lang = "en"
scrape_fandom(
    in_path=in_path,
    out_path=out_path,
    instruct_path=instruct_path,
    n_workers=50,  # number of async page "fetchers"
    wiki=wiki,
    lang=lang,
)
See examples/async_fandom/ and examples/async_fandom_instruct/ for more examples.
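The n_workers argument above sets the number of asynchronous page "fetchers". The package's internals are not shown here, so the following is only a generic sketch of that worker-pool pattern using `asyncio.Queue`. The names `fetch_page`, `worker`, and `fetch_all` are hypothetical, and `fetch_page` fakes the network request instead of contacting a wiki.

```python
import asyncio


async def fetch_page(title: str) -> str:
    """Stand-in for a real HTTP request (hypothetical; no network access)."""
    await asyncio.sleep(0)  # yield control, as real network I/O would
    return f"<content of {title}>"


async def worker(queue: asyncio.Queue, results: dict) -> None:
    """Take titles off the shared queue until the task is cancelled."""
    while True:
        title = await queue.get()
        try:
            results[title] = await fetch_page(title)
        finally:
            queue.task_done()


async def fetch_all(titles, n_workers: int = 50) -> dict:
    """Fetch all titles with at most n_workers requests in flight."""
    queue: asyncio.Queue = asyncio.Queue()
    for title in titles:
        queue.put_nowait(title)
    results: dict = {}
    workers = [asyncio.create_task(worker(queue, results)) for _ in range(n_workers)]
    await queue.join()  # blocks until every queued title has been processed
    for task in workers:
        task.cancel()  # workers loop forever, so stop them explicitly
    await asyncio.gather(*workers, return_exceptions=True)
    return results


pages = asyncio.run(fetch_all(["Geralt of Rivia", "Shani", "Vesemir"], n_workers=2))
print(sorted(pages))
```

With this pattern, raising the worker count increases the number of requests in flight at once, which is presumably the trade-off that n_workers controls in the scraper.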
To turn the scraped instructions into a .jsonl file that can be used by Cirilla models, use:
from fandom_scraper import instructions_into_conv
instructions_into_conv('./examples/async_fandom_instruct', './examples/fandom_instruct.jsonl')
The resulting file will look like:
{"subject": "Shani", "text": [{"role": "user", "content": "What role did Shani play in the Battle of Brenna?"}, {"role": "assistant", "content": "Shani played a vital role in the Battle of Brenna in March 1268, serving in a field hospital under Milo Rusty Vanderbeck. Despite her youth, she effectively managed the demanding and gruesome conditions, providing essential medical aid to the wounded, contributing to the victory of the Northern Kingdoms."}], "data type": "conv", "source": "fandom"}
{"subject": "Shani", "text": [{"role": "user", "content": "How did Shani become the Dean of the Department of Medicine at Oxenfurt Academy?"}, {"role": "assistant", "content": "Shani, initially a medical student at Oxenfurt Academy, demonstrated her medical proficiency at a young age. She served in Milo Rusty Vanderbecks field hospital at the Battle of Brenna. Her exceptional skills and commitment to the field led to her appointment as the Dean of the Department of Medicine at Oxenfurt Academy."}], "data type": "conv", "source": "fandom"}
...
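Each line of the resulting file is a standalone JSON object, so it can be read back with the standard library alone. The sketch below groups question and answer pairs by subject; `parse_conversations` is a hypothetical helper, and the field names ("subject", "text", "role", "content") are taken from the sample lines above. The sample records here are shortened placeholders, not real output.

```python
import json
from collections import defaultdict


def parse_conversations(lines):
    """Group (question, answer) pairs by subject from JSONL lines (hypothetical helper)."""
    by_subject = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        turns = record["text"]
        # Each record is assumed to hold one user turn and one assistant turn.
        question = next(t["content"] for t in turns if t["role"] == "user")
        answer = next(t["content"] for t in turns if t["role"] == "assistant")
        by_subject[record["subject"]].append((question, answer))
    return dict(by_subject)


# Shortened placeholder records in the same shape as the file shown above.
sample_lines = [
    '{"subject": "Shani", "text": [{"role": "user", "content": "Q1"}, '
    '{"role": "assistant", "content": "A1"}], "data type": "conv", "source": "fandom"}',
    '{"subject": "Shani", "text": [{"role": "user", "content": "Q2"}, '
    '{"role": "assistant", "content": "A2"}], "data type": "conv", "source": "fandom"}',
]
conversations = parse_conversations(sample_lines)
print(conversations)
```

To process a real file, pass an open file handle instead of the list, for example `with open(path, encoding="utf-8") as f: parse_conversations(f)`.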
Effectiveness
For the Witcher fandom, the scraper gathered 7506 pages and 1494 instructions: around 40 MiB of pure text in around 4 hours.