skyscraper

YAML-based lightweight crawlers

Project description

Installation

pip install skyscraper

Usage

Each web crawler is defined in a YAML file:

# the name of the crawler
name: Python 3.x docs
# the number of parallel worker threads
threads: 3

# start urls
params:
  start_url: https://docs.python.org/3/index.html

# how/where the results are saved
results:
  type: Json
  file: "python.json"

# on each url labeled "result", results will be extracted using
# this scheme
result_extractor:
  fields:
  - name: title
    rules:
      select: h1
      text: yes
      single: true


# the first page is labeled "start"; every url extracted from a page gets
# the label given in its extract rule. In this example, the "result" urls
# are extracted directly from the first page
steps:
- name: start
  label: start
  extract:
  - type: ahrefs
    label: result
    rules:
      select: a.biglink

To run the crawler, execute

skyscraper run examples/python_docs.yaml
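
To make the two `select` rules in the configuration above more concrete, here is a minimal standard-library sketch of what they appear to describe. It is not skyscraper's implementation. It assumes that `select` takes a CSS-style selector, that `text: yes` with `single: true` means "the text of the first match", and that the `ahrefs` extractor collects the `href` of each matching anchor. The names `RuleSketch` and `apply_rules` and the sample HTML are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class RuleSketch(HTMLParser):
    """Collect the first <h1> text and the href of every <a class="biglink">.

    Illustration only: the rule semantics are assumptions, not skyscraper code.
    """

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.title = None      # text of the first <h1> (assumed "single: true")
        self.links = []        # absolute URLs taken from a.biglink anchors
        self._in_h1 = False
        self._h1_parts = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "h1" and self.title is None:
            self._in_h1 = True
        elif tag == "a" and "biglink" in (attrs.get("class") or "").split():
            href = attrs.get("href")
            if href:
                # Resolve relative links against the page URL.
                self.links.append(urljoin(self.base_url, href))

    def handle_data(self, data):
        if self._in_h1:
            self._h1_parts.append(data)

    def handle_endtag(self, tag):
        if tag == "h1" and self._in_h1:
            self._in_h1 = False
            self.title = "".join(self._h1_parts).strip()


def apply_rules(html, base_url):
    """Return the title and the labeled links found in one HTML page."""
    parser = RuleSketch(base_url)
    parser.feed(html)
    parser.close()
    return {"title": parser.title, "links": parser.links}


# Hypothetical page content, loosely modeled on a documentation index.
sample = """
<h1>Python 3 documentation</h1>
<p><a class="biglink" href="tutorial/index.html">Tutorial</a></p>
<p><a class="biglink" href="library/index.html">Library reference</a></p>
<p><a href="about.html">About</a></p>
"""

result = apply_rules(sample, "https://docs.python.org/3/index.html")
print(result)
```

In this sketch only the anchors carrying the `biglink` class are kept, which mirrors the `a.biglink` selector; the plain "About" link is ignored.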

Project details

Release history

0.0.5 (this version): Jul 1, 2018
0.0.4: Jul 1, 2018
0.0.3: Jul 1, 2018
0.0.2: Jul 1, 2018
0.0.1: Jul 1, 2018

Download files

Source Distribution

skyscraper-0.0.5.tar.gz (4.2 kB)

Uploaded Jul 1, 2018 Source

Built Distribution

skyscraper-0.0.5-py3-none-any.whl (6.1 kB)

Uploaded Jul 1, 2018 Python 3

File details

Details for the file skyscraper-0.0.5.tar.gz.

File metadata

Download URL: skyscraper-0.0.5.tar.gz
Upload date: Jul 1, 2018
Size: 4.2 kB
Tags: Source
Uploaded using Trusted Publishing? No

File hashes

Hashes for skyscraper-0.0.5.tar.gz
Algorithm	Hash digest
SHA256	`c5fb4f5ef39194a1f566a497da4053de27494c58e537d59afc09203c4e2fc74b`
MD5	`ac8c43e72c13c8417c524e030427f1ca`
BLAKE2b-256	`85f0c76e3617212afc8b846f1cabf1a17931d31d128cd88213f92164b2537ca1`

File details

Details for the file skyscraper-0.0.5-py3-none-any.whl.

File metadata

Download URL: skyscraper-0.0.5-py3-none-any.whl
Upload date: Jul 1, 2018
Size: 6.1 kB
Tags: Python 3
Uploaded using Trusted Publishing? No

File hashes

Hashes for skyscraper-0.0.5-py3-none-any.whl
Algorithm	Hash digest
SHA256	`cfd51f80d32e4c7e5803c6d9be2def7b2244141798cebda000ab986a1144e33e`
MD5	`986133cfc77d43187f565c4d11c6be19`
BLAKE2b-256	`87215939f42b87fb68389b772342468eb0651cba700cc8f69dfa4345a5ad6635`
