YAML based lightweight crawlers

Installation

pip install skyscraper

Usage

Each web crawler is defined in a YAML file:

# the name of the crawler
name: Python 3.x docs
# the number of parallel thread workers
threads: 3

# start urls
params:
  start_url: https://docs.python.org/3/index.html

# how/where the results are saved
results:
  type: Json
  file: "python.json"

# on each url labeled "result", results will be extracted using
# this scheme
result_extractor:
  fields:
  - name: title
    rules:
      select: h1
      text: yes
      single: true


# the first page is labeled "start"; every url extracted from a page gets
# the label given in the matching extract rule. In this example, the links
# found on the first page are labeled "result", so results are extracted
# from each of them
steps:
- name: start
  label: start
  extract:
  - type: ahrefs
    label: result
    rules:
      select: a.biglink
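
To make the two selector rules above more concrete, here is a minimal sketch of what they appear to describe: collecting the `href` of every `a.biglink` link on the start page and labeling it `result`, then taking the text of the first `h1` on a result page as the `title` field. This is not skyscraper's implementation; it uses only the Python standard library, and the helper names (`LinkCollector`, `FirstHeadingText`, `extract_labeled_links`, `extract_title`) and the sample HTML are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect href values from <a> tags that carry a given CSS class."""

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr_map = dict(attrs)
        classes = (attr_map.get("class") or "").split()
        href = attr_map.get("href")
        if self.css_class in classes and href:
            self.hrefs.append(href)


class FirstHeadingText(HTMLParser):
    """Capture the text of the first <h1> element only (like `single: true`)."""

    def __init__(self):
        super().__init__()
        self._inside = False
        self._done = False
        self._parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "h1" and not self._done:
            self._inside = True

    def handle_endtag(self, tag):
        if tag == "h1" and self._inside:
            self._inside = False
            self._done = True

    def handle_data(self, data):
        if self._inside:
            self._parts.append(data)

    @property
    def text(self):
        return "".join(self._parts).strip()


def extract_labeled_links(html, base_url, css_class="biglink", label="result"):
    """Rough analogue of the `steps` rule: select a.biglink, label each url."""
    parser = LinkCollector(css_class)
    parser.feed(html)
    return [(label, urljoin(base_url, href)) for href in parser.hrefs]


def extract_title(html):
    """Rough analogue of the `result_extractor` field named `title`."""
    parser = FirstHeadingText()
    parser.feed(html)
    return {"title": parser.text}


# Hypothetical sample pages, used instead of real network requests.
START_HTML = (
    '<p><a class="biglink" href="tutorial/index.html">Tutorial</a> '
    '<a href="about.html">About</a></p>'
)
RESULT_HTML = "<html><body><h1>The Python Tutorial</h1><h1>Other</h1></body></html>"

links = extract_labeled_links(START_HTML, "https://docs.python.org/3/index.html")
record = extract_title(RESULT_HTML)
```

Only the link with the `biglink` class is kept, and relative URLs are resolved against the start URL before they are labeled.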

To run the crawler, execute:

skyscraper run examples/python_docs.yaml
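
As a rough mental model of such a run, the sketch below shows one way a label-driven crawl with a fixed number of worker threads and a JSON result could be organized. It is an assumption-laden illustration, not skyscraper's actual code: the `crawl` function, the `handlers` mapping, and the in-memory `PAGES` stand-in for HTTP requests are all hypothetical.

```python
import json
from concurrent.futures import ThreadPoolExecutor


def crawl(start_url, fetch, handlers, threads=3):
    """Crawl level by level.

    handlers[label](url, body) must return (records, links), where links is
    a list of (label, url) pairs to visit next. This interface is an
    assumption made for the sketch.
    """
    seen = {start_url}
    frontier = [("start", start_url)]  # the first page is labeled "start"
    results = []
    with ThreadPoolExecutor(max_workers=threads) as pool:
        while frontier:
            # Fetch all pages of the current level in parallel;
            # pool.map keeps the input order.
            bodies = list(pool.map(lambda item: fetch(item[1]), frontier))
            next_frontier = []
            for (label, url), body in zip(frontier, bodies):
                records, found = handlers[label](url, body)
                results.extend(records)
                for new_label, new_url in found:
                    if new_url not in seen:  # visit each url only once
                        seen.add(new_url)
                        next_frontier.append((new_label, new_url))
            frontier = next_frontier
    return results


# Hypothetical in-memory "site" standing in for real HTTP requests.
PAGES = {"start": ["a", "b"], "a": "A", "b": "B"}

handlers = {
    # The start page yields links labeled "result" and no records.
    "start": lambda url, body: ([], [("result", u) for u in body]),
    # A result page yields one record and no further links.
    "result": lambda url, body: ([{"title": body}], []),
}

results = crawl("start", PAGES.__getitem__, handlers, threads=3)
output = json.dumps(results)  # comparable in spirit to `results: type: Json`
```

Here `threads=3` mirrors the `threads: 3` setting in the example configuration, and the serialized list plays the role of the JSON results file.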
