Project description

Scrapy Scraper

Web crawler and scraper based on Scrapy and Playwright's headless browser.

To use the headless browser, specify the -p option. Unlike standard web request libraries, a headless browser can render JavaScript-generated HTML content.
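For example, a minimal headless-browser run might look like the command below (the URL is a placeholder, and the two-second wait is just an illustrative value):

scrapy-scraper -u https://example.com/home -o results.txt -p -pw 2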

To automatically download and beautify all JavaScript files, including minified ones, specify the -dir downloads option, where downloads is your desired output directory.
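For instance, to render pages in the headless browser and save all extracted JavaScript files into a downloads directory (both the URL and the directory name are placeholders), you could run:

scrapy-scraper -u https://example.com/home -o results.txt -p -dir downloads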

Future plans:

  • detect if Playwright Chromium is not installed.

Tested on Kali Linux v2023.4 (64-bit).

Made for educational purposes. I hope it will help!

How to Install

Install Playwright and Chromium

pip3 install --upgrade playwright

playwright install chromium

Each time you upgrade the Playwright dependency, make sure to re-install Chromium; otherwise, the headless browser might fail with an error.
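If you prefer, the upgrade and the Chromium re-install can be chained into a single command:

pip3 install --upgrade playwright && playwright install chromium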

Standard Install

pip3 install --upgrade scrapy-scraper

Build and Install From the Source

git clone https://github.com/ivan-sincek/scrapy-scraper && cd scrapy-scraper

python3 -m pip install --upgrade build

python3 -m build

python3 -m pip install dist/scrapy_scraper-2.5-py3-none-any.whl

How to Run

Restricted (domain whitelisting is on):

scrapy-scraper -u https://example.com/home -o results.txt -a random -s 2 -rs -dir js -l

Unrestricted (domain whitelisting is off):

scrapy-scraper -u https://example.com/home -o results.txt -a random -s 2 -rs -dir js -l -w off
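To limit the crawling scope to a custom set of domains instead, pass a whitelist file; the file name below is a placeholder, and based on the option description it is assumed to list one domain per line:

scrapy-scraper -u https://example.com/home -o results.txt -w whitelist.txt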

Usage

Scrapy Scraper v2.5 ( github.com/ivan-sincek/scrapy-scraper )

Usage:   scrapy-scraper -u urls                     -o out         [-dir directory]
Example: scrapy-scraper -u https://example.com/home -o results.txt [-dir downloads]

DESCRIPTION
    Crawl and scrape websites
URLS
    File with URLs or a single URL to start crawling and scraping from
    -u, --urls = urls.txt | https://example.com/home | etc.
WHITELIST
    File with whitelisted domains to limit the crawling scope
    Specify 'off' to disable domain whitelisting
    Default: domains extracted from the URLs
    -w, --whitelist = whitelist.txt | off | etc.
LINKS
    Include all links and sources (incl. 3rd party) in the output file
    -l, --links
PLAYWRIGHT
    Use Playwright's headless browser
    -p, --playwright
PLAYWRIGHT WAIT
    Wait time in seconds before fetching the content from the page
    Applies only if Playwright's headless browser is used
    -pw, --playwright-wait = 2 | 4 | etc.
CONCURRENT REQUESTS
    Number of concurrent requests
    Default: 30
    -cr, --concurrent-requests = 15 | 45 | etc.
CONCURRENT REQUESTS PER DOMAIN
    Number of concurrent requests per domain
    Default: 10
    -crd, --concurrent-requests-domain = 5 | 15 | etc.
SLEEP
    Sleep time in seconds between two consecutive requests to the same domain
    -s, --sleep = 1.5 | 3 | etc.
RANDOM SLEEP
    Randomize the sleep time on each request to vary between '0.5 * sleep' and '1.5 * sleep'
    -rs, --random-sleep
AUTO THROTTLE
    Auto throttle concurrent requests based on the load and latency
    Sleep time is still respected
    -at, --auto-throttle = 0.5 | 10 | 15 | 45 | etc.
RETRIES
    Number of retries per URL
    -rt, --retries = 2 | 4 | etc.
RECURSION
    Recursion depth limit
    Specify '0' for no limit
    Default: 1
    -r, --recursion = 0 | 2 | etc.
REQUEST TIMEOUT
    Request timeout in seconds
    Default: 60
    -t, --request-timeout = 30 | 90 | etc.
HEADER
    Specify any number of extra HTTP request headers
    -H, --header = "Authorization: Bearer ey..." | etc.
COOKIE
    Specify any number of extra HTTP cookies
    -b, --cookie = PHPSESSIONID=3301 | etc.
USER AGENT
    User agent to use
    Default: Scrapy Scraper/2.5
    -a, --user-agent = curl/3.30.1 | random[-all] | etc.
PROXY
    Web proxy to use
    -x, --proxy = http://127.0.0.1:8080 | etc.
DIRECTORY
    Output directory
    All extracted JavaScript files will be saved in this directory
    -dir, --directory = downloads | etc.
OUT
    Output file
    -o, --out = results.txt | etc.
DEBUG
    Debug output
    -dbg, --debug
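As a further, illustrative example (not taken from the original documentation), the extra header, cookie, and proxy options listed above can be combined in a single run:

scrapy-scraper -u https://example.com/home -o results.txt -H "Authorization: Bearer ey..." -b PHPSESSIONID=3301 -x http://127.0.0.1:8080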

Images

Figure 1 - Scraping

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

scrapy_scraper-2.5.tar.gz (14.1 kB)

Built Distribution

scrapy_scraper-2.5-py3-none-any.whl (13.2 kB)

File details

Details for the file scrapy_scraper-2.5.tar.gz.

File metadata

  • Download URL: scrapy_scraper-2.5.tar.gz
  • Upload date:
  • Size: 14.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.11.8

File hashes

Hashes for scrapy_scraper-2.5.tar.gz

  • SHA256: a2a23f4f6025344a112b8248f073f0c2a8ca764c345f345b7f196197c8a1ec1d
  • MD5: 2ee27678e7cdd309ce9f55195f981cf5
  • BLAKE2b-256: d2294c9b464ebca0cf615ce501926db458d38e88691cfa1d1cd11fdb5c503162


File details

Details for the file scrapy_scraper-2.5-py3-none-any.whl.

File metadata

File hashes

Hashes for scrapy_scraper-2.5-py3-none-any.whl

  • SHA256: ec45535bd86c48409139ab9676f7c0ebc3033be5e6b3ab7ab37ee36e821a42b2
  • MD5: 9b69ded56811078f144e4a091a0815dd
  • BLAKE2b-256: 44d1ed772f01aee18bcff51e6af2abedb1dd6e541eb701d9e9bcef04af402cc4

