Skip to main content

YAML based lightweight crawlers

Project description

YAML based lightweight crawlers

Installation

pip install skyscraper

Usage

Each web crawler is defined in a yml file

# the name of the crawler
name: Python 3.x docs
# the number of parallel thread workers
threads: 3

# start urls
params:
  start_url: https://docs.python.org/3/index.html

# how/where the results are saved
results:
  type: Json
  file: "python.json"

# on each url labeled "result", results will be extracted using
# this scheme
result_extractor:
  fields:
  - name: title
    rules:
      select: h1
      text: yes
      single: true


# the first page is labeled "start" and for each extracted url, we label it
# accordingly. In this example, we extract the results directly from
# the first page
steps:
- name: start
  label: start
  extract:
  - type: ahrefs
    label: result
    rules:
      select: a.biglink

To run the crawler, execute

skyscraper run examples/python_docs.yaml

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Filename, size & hash SHA256 hash help File type Python version Upload date
skyscraper-0.0.5-py3-none-any.whl (6.1 kB) Copy SHA256 hash SHA256 Wheel py3
skyscraper-0.0.5.tar.gz (4.2 kB) Copy SHA256 hash SHA256 Source None

Supported by

Elastic Elastic Search Pingdom Pingdom Monitoring Google Google BigQuery Sentry Sentry Error logging AWS AWS Cloud computing DataDog DataDog Monitoring Fastly Fastly CDN SignalFx SignalFx Supporter DigiCert DigiCert EV certificate StatusPage StatusPage Status page