YAML-based lightweight crawlers


pip install skyscraper


Each web crawler is defined in a YAML file:

# the name of the crawler
name: Python 3.x docs
# the number of parallel thread workers
threads: 3

# start urls

# how/where the results are saved
  type: Json
  file: "python.json"

# on each url labeled "result", results will be extracted using
# this scheme
  - name: title
      select: h1
      text: yes
      single: true

# the first page is labeled "start" and for each extracted url, we label it
# accordingly. In this example, we extract the results directly from
# the first page
- name: start
  label: start
  - type: ahrefs
    label: result
      select: a.biglink

To run the crawler, execute:

skyscraper run examples/python_docs.yaml
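
To make the configuration above more concrete, here is a rough sketch of one plausible reading of it: follow every a.biglink link on the start page, treat each target as a "result" page, and extract a single title from its first h1 using 3 worker threads. This is not skyscraper's implementation. It uses only the Python standard library, and the PAGES dictionary, the example.org URLs, and the Extractor, parse, and crawl names are all hypothetical, introduced so the example runs without network access.

```python
import json
from concurrent.futures import ThreadPoolExecutor
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical in-memory "site" so the sketch runs without network access.
PAGES = {
    "https://example.org/": (
        '<a class="biglink" href="tutorial.html">Tutorial</a>'
        '<a class="biglink" href="library.html">Library</a>'
        '<a href="about.html">About</a>'
    ),
    "https://example.org/tutorial.html": "<h1>The Tutorial</h1><h1>ignored</h1>",
    "https://example.org/library.html": "<h1>The Library</h1>",
}


class Extractor(HTMLParser):
    """Collects hrefs of <a class="biglink"> and the text of the first <h1>."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.title = None
        self._in_h1 = False
        self._buf = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "a" and "biglink" in classes and attrs.get("href"):
            self.links.append(attrs["href"])
        elif tag == "h1" and self.title is None:
            # Only the first <h1> is captured (assumed meaning of "single: true").
            self._in_h1 = True
            self._buf = []

    def handle_data(self, data):
        if self._in_h1:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag == "h1" and self._in_h1:
            self.title = "".join(self._buf).strip()
            self._in_h1 = False


def parse(url):
    """Parse one page from the in-memory site and return the filled extractor."""
    parser = Extractor()
    parser.feed(PAGES[url])
    parser.close()
    return parser


def crawl(start_url, threads=3):
    # "start" page: collect every a.biglink target and treat it as a "result" page.
    start = parse(start_url)
    result_urls = [urljoin(start_url, href) for href in start.links]
    # "result" pages: extract the title in parallel; map() preserves input order.
    with ThreadPoolExecutor(max_workers=threads) as pool:
        parsed = list(pool.map(parse, result_urls))
    return [{"url": u, "title": p.title} for u, p in zip(result_urls, parsed)]


if __name__ == "__main__":
    # Printed as JSON here; the configuration above writes to a JSON file instead.
    print(json.dumps(crawl("https://example.org/"), indent=2))
```

The sketch keeps the two roles separate in the same way the configuration does: link rules decide which pages get which label, and the result scheme decides what is pulled out of a labeled page.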


Files for skyscraper, version 0.0.5
Filename                            Size     File type   Python version
skyscraper-0.0.5-py3-none-any.whl   6.1 kB   Wheel       py3
skyscraper-0.0.5.tar.gz             4.2 kB   Source      None
