YAML-based lightweight crawlers
Skyscraper
Usage
Each web crawler is defined in a YAML file:
# the name of the crawler
name: Python 3.x docs
# the number of parallel thread workers
threads: 3
# start urls
params:
  start_url: https://docs.python.org/3/index.html
# how/where the results are saved
results:
  type: Json
  file: "python.json"
# on each url labeled "result", results will be extracted using
# this scheme
result_extractor:
  fields:
    - name: title
      rules:
        select: h1
        text: yes
        single: true
# the first page is labeled "start" and for each extracted url, we label it
# accordingly. In this example, we extract the results directly from
# the first page
steps:
  - name: start
    label: start
    extract:
      - type: ahrefs
        label: result
        rules:
          select: a.biglink
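The label-driven flow that the configuration describes can be pictured with a small, self-contained Python sketch. This is not Skyscraper's implementation: the names `TagCollector`, `select`, `crawl` and `PAGES` are hypothetical, the in-memory `PAGES` dictionary stands in for real HTTP requests, and the interpretation of the rules (`a.biglink` links become pages labeled "result", and the first `h1` text becomes `title`) is an assumption drawn from the comments in the configuration. It runs in a single thread and does not resolve relative URLs.

```python
from collections import deque
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Collect (attributes, text) for every element with a given tag name."""

    def __init__(self, tag):
        super().__init__()
        self.tag = tag
        self.found = []        # list of (attrs_dict, text) pairs
        self._inside = False   # True while inside a matching element
        self._attrs = {}
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == self.tag:
            self._attrs, self._text, self._inside = dict(attrs), [], True

    def handle_data(self, data):
        if self._inside:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == self.tag and self._inside:
            self.found.append((self._attrs, "".join(self._text).strip()))
            self._inside = False


def select(html, tag, css_class=None):
    """Tiny stand-in for a CSS selector such as `h1` or `a.biglink`."""
    parser = TagCollector(tag)
    parser.feed(html)
    parser.close()
    return [
        (attrs, text)
        for attrs, text in parser.found
        if css_class is None or css_class in (attrs.get("class") or "").split()
    ]


def crawl(pages, start_url):
    """Walk (url, label) pairs breadth-first; `pages` replaces HTTP fetching."""
    queue = deque([(start_url, "start")])
    seen = {start_url}
    results = []
    while queue:
        url, label = queue.popleft()
        html = pages[url]
        if label == "start":
            # Assumed meaning of the "start" step: follow a.biglink links
            # and label each extracted URL "result".
            for attrs, _ in select(html, "a", "biglink"):
                href = attrs.get("href")
                if href and href not in seen:
                    seen.add(href)
                    queue.append((href, "result"))
        elif label == "result":
            # Assumed meaning of result_extractor: text of the first <h1>.
            matches = select(html, "h1")
            if matches:
                results.append({"title": matches[0][1]})
    return results


# Hypothetical pages used only for this illustration.
PAGES = {
    "start.html": '<a class="biglink" href="a.html">A</a>'
                  '<a href="skip.html">x</a>'
                  '<a class="biglink" href="b.html">B</a>',
    "a.html": "<h1>Tutorial</h1><h1>ignored</h1>",
    "b.html": "<h1> Library <em>Reference</em></h1>",
}

if __name__ == "__main__":
    print(crawl(PAGES, "start.html"))
```

In this sketch only the links carrying the `biglink` class are followed, and each followed page contributes one record with a single `title` field, which mirrors the shape of the `fields` list in the configuration.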
To run the crawler, execute:
skyscraper run examples/python_docs.yaml