# Skyscraper
YAML-based lightweight crawlers
## Usage
Each web crawler is defined in a YAML file:

    # the name of the crawler
    name: Python 3.x docs
    # the number of parallel thread workers
    threads: 3
    # start urls
    params:
      start_url: https://docs.python.org/3/index.html
    # how/where the results are saved
    results:
      type: Json
      file: "python.json"
    # on each url labeled "result", results will be extracted using
    # this scheme
    result_extractor:
      fields:
        - name: title
          rules:
            select: h1
            text: yes
            single: true
    # the first page is labeled "start" and for each extracted url, we label it
    # accordingly. In this example, we extract the results directly from
    # the first page
    steps:
      - name: start
        label: start
        extract:
          - type: ahrefs
            label: result
            rules:
              select: a.biglink

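
To make the two rule sets in the configuration more concrete, here is a minimal Python sketch of what they appear to describe: the `ahrefs` step collects the `href` of every `a.biglink` anchor on the start page, and the `title` field takes the text of a single `h1`. This is not Skyscraper's implementation. The `RuleParser` class and the `SAMPLE` markup are hypothetical, and only the standard library is used.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class RuleParser(HTMLParser):
    """Hypothetical helper: mimics `select: a.biglink` and `select: h1`."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []       # absolute URLs of <a class="biglink"> anchors
        self.title = None     # text of the first <h1> (like `single: true`)
        self._in_h1 = False
        self._h1_parts = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "a" and "biglink" in classes and attrs.get("href"):
            # Resolve relative hrefs against the page URL.
            self.links.append(urljoin(self.base_url, attrs["href"]))
        elif tag == "h1" and self.title is None:
            self._in_h1 = True

    def handle_data(self, data):
        if self._in_h1:
            self._h1_parts.append(data)

    def handle_endtag(self, tag):
        if tag == "h1" and self._in_h1:
            self._in_h1 = False
            self.title = "".join(self._h1_parts).strip()


# Made-up markup standing in for the start page.
SAMPLE = """
<html><body>
  <h1>Python 3 documentation</h1>
  <a class="biglink" href="tutorial/index.html">Tutorial</a>
  <a class="biglink" href="library/index.html">Library Reference</a>
  <a href="about.html">About</a>
</body></html>
"""

parser = RuleParser("https://docs.python.org/3/index.html")
parser.feed(SAMPLE)
print(parser.links)
print(parser.title)
```

In this sketch the plain `About` link is skipped because it lacks the `biglink` class, which is the behaviour the CSS selector `a.biglink` implies.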
To run the crawler, execute:

    skyscraper run examples/python_docs.yaml
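
As a rough mental model of what such a run does, the sketch below walks a label-driven crawl with a pool of three workers, matching `threads: 3`. Everything in it is an assumption made for illustration: the in-memory `PAGES` table replaces real HTTP requests, and the `process` and `crawl` functions and the shape of the JSON records are hypothetical. Skyscraper's actual scheduling and output format are not described here.

```python
import json
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory "site": url -> (links found on the page, page title).
PAGES = {
    "start.html": (["a.html", "b.html"], "Index"),
    "a.html": ([], "Page A"),
    "b.html": ([], "Page B"),
}


def process(task):
    """Handle one (url, label) task and return (new_tasks, results)."""
    url, label = task
    links, title = PAGES[url]
    if label == "start":
        # The "start" step labels every extracted URL as "result".
        return [(link, "result") for link in links], []
    if label == "result":
        # Pages labeled "result" yield one record each.
        return [], [{"url": url, "title": title}]
    return [], []


def crawl(start_url, threads=3):
    results = []
    seen = {start_url}
    frontier = [(start_url, "start")]
    with ThreadPoolExecutor(max_workers=threads) as pool:
        while frontier:
            next_frontier = []
            # pool.map runs tasks in parallel but yields in input order.
            for new_tasks, new_results in pool.map(process, frontier):
                results.extend(new_results)
                for new_task in new_tasks:
                    if new_task[0] not in seen:
                        seen.add(new_task[0])
                        next_frontier.append(new_task)
            frontier = next_frontier
    return results


records = crawl("start.html")
print(json.dumps(records, indent=2))
```

The `seen` set keeps a URL from being queued twice, and processing the frontier level by level keeps the output order deterministic in this toy version.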