
Scrapy Scraper

Web crawler and scraper based on Scrapy and Playwright's headless browser.

To use the headless browser, specify the -p option. Unlike standard web request libraries, a browser can render JavaScript-generated HTML content.
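
For illustration, here is a minimal sketch of what the headless browser option does; this is not the tool's actual implementation, and the URL is a placeholder:

from playwright.sync_api import sync_playwright

with sync_playwright() as playwright:
    browser = playwright.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://example.com/home")
    html = page.content()  # the DOM after JavaScript has run
    browser.close()

print(html)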

To automatically download and beautify all JavaScript files, including minified ones, specify the -dir downloads option, where downloads is your desired output directory.
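
A rough sketch of the download-and-beautify step, assuming the requests and jsbeautifier packages; the URL, output directory, and overall flow are illustrative, not the tool's actual source:

import os

import jsbeautifier
import requests

url = "https://example.com/static/app.min.js"  # placeholder URL
response = requests.get(url, timeout=30)

os.makedirs("downloads", exist_ok=True)
path = os.path.join("downloads", os.path.basename(url))
with open(path, "w", encoding="utf-8") as file:
    file.write(jsbeautifier.beautify(response.text))  # un-minify the source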


Tested on Kali Linux v2023.4 (64-bit).

Made for educational purposes. I hope it will help!


How to Install

Install Playwright and Chromium

pip3 install --upgrade playwright

playwright install chromium

Each time you upgrade Playwright, make sure to re-install Chromium; otherwise, the headless browser might return no results.
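
A quick way to verify the install (an illustrative check, not part of the tool): launching Chromium through Playwright raises an error pointing you back to 'playwright install chromium' if the browser binary is missing.

from playwright.sync_api import sync_playwright

with sync_playwright() as playwright:
    browser = playwright.chromium.launch(headless=True)  # fails if Chromium is not installed
    print(browser.version)
    browser.close()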

Standard Install

pip3 install --upgrade scrapy-scraper

Build and Install From the Source

git clone https://github.com/ivan-sincek/scrapy-scraper && cd scrapy-scraper

python3 -m pip install --upgrade build

python3 -m build

python3 -m pip install dist/scrapy_scraper-1.6-py3-none-any.whl

How to Run

Restricted (domain whitelisting is on):

scrapy-scraper -u https://example.com/home -o results.txt -a random -s random -dir js -l

Unrestricted (domain whitelisting is off):

scrapy-scraper -u https://example.com/home -o results.txt -a random -s random -dir js -l -w off

Usage

Scrapy Scraper v1.6 ( github.com/ivan-sincek/scrapy-scraper )

Usage:   scrapy-scraper -u urls                     -o out         [-dir directory]
Example: scrapy-scraper -u https://example.com/home -o results.txt [-dir downloads]

DESCRIPTION
    Crawl and scrape websites
URLS
    File with URLs or a single URL to start crawling and scraping from
    -u, --urls = urls.txt | https://example.com/home | etc.
WHITELIST
    File with whitelisted domains to limit the crawling scope
    Specify 'off' to disable domain whitelisting
    Default: domains extracted from initial URLs
    -w, --whitelist = whitelist.txt | off | etc.
LINKS
    Include all links and sources (incl. 3rd party) in the output file
    -l, --links
PLAYWRIGHT
    Use Playwright's headless browser
    -p, --playwright
CONCURRENT REQUESTS
    Number of concurrent requests
    Default: 30
    -cr, --concurrent-requests = 15 | 45 | etc.
CONCURRENT REQUESTS PER DOMAIN
    Number of concurrent requests per domain
    Default: 10
    -crd, --concurrent-requests-domain = 5 | 15 | etc.
SLEEP
    Sleep time between two consecutive requests to the same domain
    Specify 'random' to sleep a random amount of time between 0.5 and 1.5 seconds
    Default: 1.5
    -s, --sleep = 0 | 2 | 4 | random | etc.
AUTO THROTTLE
    Auto throttle concurrent requests based on the load and latency
    -at, --auto-throttle = 0.5 | 10 | 15 | 45 | etc.
RECURSION
    Recursion depth limit
    Specify '0' for no limit
    Default: 1
    -r, --recursion = 0 | 2 | 4 | etc.
USER AGENT
    User agent to use
    Default: Scrapy Scraper/1.6
    -a, --user-agent = curl/3.30.1 | random | etc.
PROXY
    Web proxy to use
    -x, --proxy = http://127.0.0.1:8080 | etc.
DIRECTORY
    Output directory
    All extracted JavaScript files will be saved in this directory
    -dir, --directory = downloads | etc.
OUT
    Output file
    -o, --out = results.txt | etc.
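
For orientation, here is a hypothetical Scrapy spider showing how the defaults above map onto standard Scrapy settings; the spider and its values are illustrative assumptions, not the tool's actual source:

import scrapy

class SketchSpider(scrapy.Spider):
    # Hypothetical spider mirroring the documented defaults.
    name = "sketch"
    allowed_domains = ["example.com"]          # -w  (domain whitelisting)
    start_urls = ["https://example.com/home"]  # -u  (initial URLs)

    custom_settings = {
        "CONCURRENT_REQUESTS": 30,             # -cr  default
        "CONCURRENT_REQUESTS_PER_DOMAIN": 10,  # -crd default
        "DOWNLOAD_DELAY": 1.5,                 # -s   default
        "RANDOMIZE_DOWNLOAD_DELAY": True,      # -s random (0.5x to 1.5x of the delay in Scrapy)
        "DEPTH_LIMIT": 1,                      # -r   default ('0' means no limit)
        "USER_AGENT": "Scrapy Scraper/1.6",    # -a   default
    }

    def parse(self, response):
        # Follow every link on the page, subject to the depth limit.
        for href in response.css("a::attr(href)").getall():
            yield response.follow(href, callback=self.parse)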
