Fandom Scraper

A simple AI (span marker) powered fandom scraper.

[!NOTE]
This package is a part of the Cirilla project

[!IMPORTANT]
An NVIDIA GPU is required to use this package.

Because Hugging Face's SpanMarker can be fragile, the dependency versions are pinned, so I advise creating a separate project used only for scraping the data.
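Before installing, it can be useful to confirm that the machine actually exposes an NVIDIA GPU. The snippet below is a minimal sketch of one way to do that, not part of fandom-scraper: `has_nvidia_gpu` is a hypothetical helper that only checks whether the `nvidia-smi` tool is on the PATH and lists at least one device. It is a heuristic and does not guarantee that the driver or CUDA setup is compatible with the pinned requirements.

```python
import shutil
import subprocess


def has_nvidia_gpu() -> bool:
    """Heuristic (hypothetical helper): True if `nvidia-smi -L` lists a GPU."""
    exe = shutil.which("nvidia-smi")
    if exe is None:
        # The NVIDIA driver tools are not installed or not on PATH.
        return False
    try:
        result = subprocess.run(
            [exe, "-L"], capture_output=True, text=True, timeout=10
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    # `nvidia-smi -L` prints one line per device, each starting with "GPU".
    return result.returncode == 0 and "GPU" in result.stdout


if __name__ == "__main__":
    print("NVIDIA GPU detected" if has_nvidia_gpu() else "No NVIDIA GPU detected")
```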

Installation

# (recommended)
uv add fandom-scraper

# or
pip install fandom-scraper

Usage

Usage is simple: the function requires a path to so-called seeds (page titles from which to start scraping), e.g. examples/witcher_json/witcher_1.json

[
    "Geralt of Rivia", "Triss Merigold", "Vesemir", "Leo", "Lambert", 
    "Eskel", "Alvin", "Shani", "Zoltan Chivay", "Dandelion (Jaskier)", 
    "King Foltest", "Adda the White",

    "Jacques de Aldersberg", "Azar Javed", "Professor (leader of Salamandra)", 
    ...
]

and later uses suggestions provided by a Named Entity Recognition (NER) model. The script saves the scraped pages and the instructions into their respective folders.
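A seed file for another wiki can be prepared in the same shape as the example above: a flat JSON array of page titles. The sketch below assumes that this is all the scraper needs from a seed file; `write_seed_file`, the `./my_seeds` folder, and the file name are hypothetical and not part of the package.

```python
import json
from pathlib import Path


def write_seed_file(folder: str, name: str, seeds: list[str]) -> Path:
    """Write page titles as a JSON array, mirroring witcher_1.json (hypothetical helper)."""
    target_dir = Path(folder)
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / name
    # ensure_ascii=False keeps non-ASCII titles readable in the file.
    target.write_text(
        json.dumps(seeds, ensure_ascii=False, indent=4), encoding="utf-8"
    )
    return target


if __name__ == "__main__":
    path = write_seed_file(
        "./my_seeds", "seeds_1.json", ["Geralt of Rivia", "Triss Merigold", "Vesemir"]
    )
    print(f"Seeds written to {path}")
```

The folder passed as `folder` would then be the natural candidate for the scraper's input path.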

from fandom_scraper import scrape_fandom

in_path = "./examples/witcher_json"  # path with the seeds
out_path = "./examples/async_fandom"  # scraped pages are saved here
instruct_path = "./examples/async_fandom_instruct"  # instructions are saved here

wiki = "Witcher"
lang = "en"

scrape_fandom(in_path=in_path,
              out_path=out_path,
              instruct_path=instruct_path,
              n_workers=50,  # number of async page "fetchers"
              wiki=wiki,
              lang=lang)

See examples/async_fandom/ and examples/async_fandom_instruct/ for more examples.

To turn the scraped instructions into a .jsonl file that Cirilla models can use, call:

from fandom_scraper import instructions_into_conv

instructions_into_conv('./examples/async_fandom_instruct', './examples/fandom_instruct.jsonl')

The resulting file will look like:

{"subject": "Shani", "text": [{"role": "user", "content": "What role did Shani play in the Battle of Brenna?"}, {"role": "assistant", "content": "Shani played a vital role in the Battle of Brenna in March 1268, serving in a field hospital under Milo Rusty Vanderbeck. Despite her youth, she effectively managed the demanding and gruesome conditions, providing essential medical aid to the wounded, contributing to the victory of the Northern Kingdoms."}], "data type": "conv", "source": "fandom"}
{"subject": "Shani", "text": [{"role": "user", "content": "How did Shani become the Dean of the Department of Medicine at Oxenfurt Academy?"}, {"role": "assistant", "content": "Shani, initially a medical student at Oxenfurt Academy, demonstrated her medical proficiency at a young age. She served in Milo Rusty Vanderbecks field hospital at the Battle of Brenna. Her exceptional skills and commitment to the field led to her appointment as the Dean of the Department of Medicine at Oxenfurt Academy."}], "data type": "conv", "source": "fandom"}
...
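To inspect the resulting file, each line can be parsed as an independent JSON object. The following is a minimal sketch that relies only on the fields visible in the sample above ("subject", "text", "data type", "source"); `load_conversations` and `count_by_subject` are hypothetical helpers, not functions provided by fandom-scraper.

```python
import json
from collections import Counter
from pathlib import Path


def load_conversations(path: str) -> list[dict]:
    """Parse a .jsonl file: one JSON object per non-empty line."""
    records = []
    with Path(path).open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records


def count_by_subject(records: list[dict]) -> Counter:
    """Count how many conversations exist for each subject."""
    return Counter(record["subject"] for record in records)


if __name__ == "__main__":
    convs = load_conversations("./examples/fandom_instruct.jsonl")
    for subject, n in count_by_subject(convs).most_common(5):
        print(subject, n)
```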

Effectiveness

For the Witcher fandom, the scraper gathered 7506 pages and 1494 instructions: around 40 MiB of plain text in total, in around 4 hours.
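For a rough sense of scale, the figures above can be turned into averages. This is only back-of-the-envelope arithmetic on the reported numbers, which are themselves approximate; it assumes the 40 MiB is spread over the pages alone, which may overstate the per-page size.

```python
# Reported figures for the Witcher fandom run (approximate).
pages = 7506
instructions = 1494
total_mib = 40
hours = 4

pages_per_minute = pages / (hours * 60)
kib_per_page = total_mib * 1024 / pages  # assumes all text belongs to pages
pages_per_instruction = pages / instructions

print(f"{pages_per_minute:.1f} pages per minute")            # 31.3
print(f"{kib_per_page:.1f} KiB of text per page")            # 5.5
print(f"{pages_per_instruction:.1f} pages per instruction")  # 5.0
```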
