A tool for generating Anki cards by web scraping

cardscraper

Webscraping tool for generating Anki packages.

Installation

From PyPI:

pip install cardscraper

From git:

pip install git+https://github.com/sakhezech/cardscraper

Usage

Run the tool as cardscraper ... or as python -m cardscraper ...

Generate a skeleton input file:

cardscraper init filename.yaml

Edit it with your favorite text editor:

nvim filename.yaml

Generate the package:

cardscraper gen filename.yaml

For more information, run cardscraper -h.

Input files

You can generate a skeleton input file by using cardscraper init filename.yaml.

Here is a large, self-explanatory example of an input file:

# here you can specify which function to use for each step
# (every one defaults to 'default')
meta:
  # controls package details and package dumping
  package: default
  # controls deck creation
  deck: default
  # controls model creation
  model: default
  # controls scraping and note creation
  scraping: default

# anki package info
package:
  # package name
  name: package_name
  # output folder (defaults to '.')
  output: ./out/
  # media folder (defaults to null)
  # the directory will be walked recursively
  # every pattern matched file will be added to the package as media
  media: ./media/
  # pattern to match files against for media (defaults to **/*.*)
  pattern: "**/*.png"

# anki deck info
deck:
  # deck name
  name: Deck
  # deck id
  # don't forget to make this value unique
  id: 987

# anki model info
model:
  # model name
  name: Model
  # model id
  # don't forget to make this value unique
  id: 321
  # card styling (defaults to '')
  css: |
    .question, .answer {
        text-align: center;
    }
    .question {
        font-size: 5rem;
        font-weight: 700;
    }
    .answer {
        font-size: 3rem;
    }
  # list of cards
  templates:
    # card name
    - name: Front
      # front side
      qfmt: |
        <div class='question'>
        {{Question}}
        </div>
      # back side
      afmt: |
        {{FrontSide}}
        <hr id='answer'>
        <div class='answer'>
        {{Answer}}
        </div>
    # same here
    - name: Back
      qfmt: |
        <div class='question'>
        {{Answer}}
        </div>
      afmt: |
        {{FrontSide}}
        <hr id='answer'>
        <div class='answer'>
        {{Question}}
        </div>

# scraping info
scraping:
  # list of urls to scrape
  urls:
    - https://www.scrapethissite.com/pages/simple/
  # you can set your own custom user agent (defaults to null)
  agent: Mozilla/5.0 (X11; Linux x86_64; rv:120.0) Gecko/20100101 Firefox/120.0
  # list of queries
  # each query selects an html element and lets you use its text in the templates
  # each child query runs inside the parent one
  queries:
    # query name which you can use in the templates like {{Country}}
    - name: Country
      # css selector
      query: .country
      # you can select something specific from the query by providing a regex
      # this is a python regex with re.DOTALL enabled, i.e. '.' also matches '\n'
      # uses the first captured group
      # (defaults to null)
      regex: null
      # if true: we select every instance and iterate over them
      # if false: we only select the first one
      # basically it's querySelector() vs querySelectorAll()
      # (defaults to false)
      many: true
      children:
        - name: Question
          query: .country-info
          many: false
          regex: (Area .*)$
          children: null
        - name: Answer
          query: .country-name
          many: false
          regex: null
          children: null
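The regex option is easiest to understand with a small standalone sketch. The extract helper below is hypothetical and is not part of cardscraper. It only mirrors what the comments above describe: a Python regex compiled with re.DOTALL, from which the first captured group is used. The use of re.search (rather than re.match), the None fallback when nothing matches, and the sample text are all assumptions made for illustration.

```python
import re


def extract(text, pattern=None):
    """Hypothetical helper: apply an optional regex to an element's text."""
    if pattern is None:
        # regex: null -> the element's text is used unchanged
        return text
    # re.DOTALL lets '.' match newlines as well
    match = re.search(pattern, text, re.DOTALL)
    # Use the first captured group; None on no match is an assumption
    return match.group(1) if match else None


# Made-up text standing in for the content of a .country-info element
sample = "Capital: Andorra la Vella\nPopulation: 84000\nArea (km2): 468.0"

print(extract(sample, r"(Area .*)$"))
print(repr(extract(sample, r"Capital: (.*)Area")))
```

The second call shows the effect of re.DOTALL: the captured group spans two lines, which would not match without the flag.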

Usage in code

cardscraper can be used programmatically, although it is designed primarily as a CLI application.

import yaml
from cardscraper import (
    Config,
    generate_anki_package,
    select_function_by_step_and_name,
    write_package,
)
from genanki import Model, Note

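if __name__ == '__main__':
    # load the config from a YAML input file
    # (safe_load is enough for plain YAML and avoids constructing arbitrary objects)
    with open('/path/to/config.yaml', 'r') as f:
        config: Config = yaml.safe_load(f)
    # or you can make a config manually

    # look up the default function for each step you do not want to customize
    get_model = select_function_by_step_and_name('model', 'default')
    get_deck = select_function_by_step_and_name('deck', 'default')
    get_package = select_function_by_step_and_name('package', 'default')

    # custom replacement for the scraping step
    def get_notes(config: Config, model: Model) -> list[Note]:
        notes = []
        ...
        return notes

    # run every step, then write the resulting package to disk
    package, path = generate_anki_package(
        config, get_model, get_notes, get_deck, get_package
    )
    write_package(package, path)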
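As a rough sketch of making a config manually, the snippet below builds a plain dictionary that mirrors the structure of the YAML input file shown earlier. It assumes that Config is a dictionary-shaped type (it is produced by the YAML loader in the example above). That is an assumption, so check the actual Config definition in cardscraper before relying on it. The template and query values are shortened for brevity.

```python
# Assumption: Config behaves like a nested dict with the same keys as the YAML file.
config = {
    'meta': {
        'package': 'default',
        'deck': 'default',
        'model': 'default',
        'scraping': 'default',
    },
    'package': {
        'name': 'package_name',
        'output': './out/',
        'media': None,         # no media folder
        'pattern': '**/*.*',
    },
    'deck': {'name': 'Deck', 'id': 987},
    'model': {
        'name': 'Model',
        'id': 321,
        'css': '',
        'templates': [
            {
                'name': 'Front',
                'qfmt': '{{Question}}',
                'afmt': "{{FrontSide}}<hr id='answer'>{{Answer}}",
            },
        ],
    },
    'scraping': {
        'urls': ['https://www.scrapethissite.com/pages/simple/'],
        'agent': None,
        'queries': [
            {
                'name': 'Country',
                'query': '.country',
                'regex': None,
                'many': True,
                'children': [
                    {'name': 'Question', 'query': '.country-info',
                     'many': False, 'regex': r'(Area .*)$', 'children': None},
                    {'name': 'Answer', 'query': '.country-name',
                     'many': False, 'regex': None, 'children': None},
                ],
            },
        ],
    },
}
```

A dictionary like this could then be passed wherever the example above passes the loaded config.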

Plugin system

cardscraper has a plugin system. To make your own functions available to cardscraper, register them under an entry point group named cardscraper.STEPNAME, where STEPNAME is one of model, scraping, deck, or package.

This is how the default functions are exposed:

[project.entry-points.'cardscraper.model']
default = 'cardscraper.default:get_model'
[project.entry-points.'cardscraper.scraping']
default = 'cardscraper.default:get_notes'
[project.entry-points.'cardscraper.deck']
default = 'cardscraper.default:get_deck'
[project.entry-points.'cardscraper.package']
default = 'cardscraper.default:get_package'
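To check which functions are registered for a step in the current environment, the standard library can list the entry points in a group. This is a minimal sketch using importlib.metadata. The discover_step_functions helper is hypothetical, and this is not necessarily how cardscraper itself resolves plugins.

```python
from importlib.metadata import entry_points


def discover_step_functions(step):
    """Hypothetical helper: map entry point names to entry points for one step."""
    group = f'cardscraper.{step}'
    eps = entry_points()
    if hasattr(eps, 'select'):
        # Python 3.10+: entry points are selectable by group
        selected = eps.select(group=group)
    else:
        # Python 3.8/3.9: entry_points() returns a dict keyed by group name
        selected = eps.get(group, [])
    return {ep.name: ep for ep in selected}


if __name__ == '__main__':
    for name, ep in discover_step_functions('model').items():
        # ep.load() would import and return the registered function
        print(name, '->', ep.value)
```

If cardscraper is not installed, the helper simply returns an empty dictionary.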
