
A simple CLI image scraper tool with support for headless scraping of dynamic websites.

Project description

imgscrapy

A simple CLI image scraper written in Python, inspired by ImageScraper, with support for headless scraping of dynamic websites.

Installation

Build from source
  • git clone https://github.com/arutselvan/ImgScrapy
  • cd ImgScrapy
  • python setup.py install
As a Python package
pip install --user imgscrapy

Requirements

Python >= 3.6

Usage

usage: imgscrapy [-h] [-d DIRECTORY] [-i] [-n NFIRST] [-t NTHREADS] [-hd] [-to TIMEOUT] target_url

Downloads images from the given URL

positional arguments:
  target_url            URL to scrape images from
optional arguments:
  -h, --help            show this help message and exit
  -d DIRECTORY, --directory DIRECTORY
                        Directory in which images should be downloaded
  -i, --injected        Scrape images from a dynamic website and JS injected images
  -n NFIRST, --nfirst NFIRST
                        Scrape the first n images
  -t NTHREADS, --nthreads NTHREADS
                        Maximum number of threads to use
  -hd, --head           Open chromium for scraping JS injected source/images
  -to TIMEOUT, --timeout TIMEOUT
                        Timeout value for obtaining page source
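
The help text above can be reproduced with a small argparse definition. This is a minimal sketch reconstructed from the usage output only, not the project's actual source: the `build_parser` name, the `int`/`float` argument types, and the absence of defaults are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented imgscrapy options (hypothetical helper)."""
    parser = argparse.ArgumentParser(
        prog="imgscrapy",
        description="Downloads images from the given URL",
    )
    parser.add_argument("target_url", help="URL to scrape images from")
    parser.add_argument("-d", "--directory",
                        help="Directory in which images should be downloaded")
    parser.add_argument("-i", "--injected", action="store_true",
                        help="Scrape images from a dynamic website and JS injected images")
    # Assumption: the count options are integers; the real tool may parse them differently.
    parser.add_argument("-n", "--nfirst", type=int, help="Scrape the first n images")
    parser.add_argument("-t", "--nthreads", type=int,
                        help="Maximum number of threads to use")
    parser.add_argument("-hd", "--head", action="store_true",
                        help="Open chromium for scraping JS injected source/images")
    # Assumption: the timeout is numeric; its unit is not stated in the help text.
    parser.add_argument("-to", "--timeout", type=float,
                        help="Timeout value for obtaining page source")
    return parser


# Parse the second example invocation from this README against a placeholder URL.
args = build_parser().parse_args(["https://example.com", "-i", "--nfirst", "5"])
print(args.target_url, args.injected, args.nfirst, args.head)
```

Flags that are not passed, such as `--head`, come back as `False` (or `None` for value options), so a caller can tell "not set" apart from an explicit value.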

Examples

  • Download all images from a static website
imgscrapy <Target URL>
  • Download the first 5 images from a dynamic website
imgscrapy <Target URL> -i --nfirst 5
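
To illustrate what "the first n images" can mean for a static page, the sketch below collects `<img>` sources in document order and truncates the list. It uses only the standard library and is not imgscrapy's implementation; `ImgSrcCollector` and `first_image_urls` are hypothetical names, and the sample HTML is made up.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ImgSrcCollector(HTMLParser):
    """Collect absolute URLs from the src attribute of <img> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        src = dict(attrs).get("src")
        if src:  # skip <img> tags without a usable src
            self.urls.append(urljoin(self.base_url, src))


def first_image_urls(html, base_url, nfirst=None):
    """Return image URLs in document order, limited to the first nfirst if given."""
    collector = ImgSrcCollector(base_url)
    collector.feed(html)
    return collector.urls if nfirst is None else collector.urls[:nfirst]


sample = '<p><img src="/a.png"><img src="b.jpg"><img alt="no src"><img src="c.gif"></p>'
print(first_image_urls(sample, "https://example.com/gallery/", nfirst=2))
```

A parser like this only sees images present in the initial HTML; images injected by JavaScript need a rendered page, which is what the `-i` option is for.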
Note

ImgScrapy uses pyppeteer, which drives Chromium for headless scraping. The first time you scrape a dynamic website, Chromium is downloaded automatically, which might take some time.
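
For readers unfamiliar with pyppeteer, obtaining the rendered source of a dynamic page looks roughly like the sketch below. It assumes pyppeteer is installed and is not taken from imgscrapy's code; the `fetch_rendered_html` name, the 30-second default, and the `networkidle2` wait condition are illustrative choices.

```python
import asyncio


async def fetch_rendered_html(url, timeout_s=30.0, headless=True):
    """Return the page source after JavaScript has run (hypothetical helper)."""
    # Imported lazily so this module can be loaded without pyppeteer installed.
    from pyppeteer import launch

    # The first launch downloads Chromium if it is not already present.
    browser = await launch(headless=headless)
    try:
        page = await browser.newPage()
        # pyppeteer expects the navigation timeout in milliseconds.
        await page.goto(url, timeout=int(timeout_s * 1000), waitUntil="networkidle2")
        return await page.content()
    finally:
        # Always close the browser so no Chromium process is left behind.
        await browser.close()


# Example call (requires pyppeteer and network access):
# html = asyncio.run(fetch_rendered_html("https://example.com"))
```

Passing `headless=False` opens a visible Chromium window, which is comparable in spirit to the `-hd` option described under Usage.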

To Do

  • Write tests
  • Add support for Base64 images
  • Add support for embedded/inline svg files
  • Fix issues with headless browsing of dynamic sites that show modals/popups
  • Fix issue with missing trailing slash in URL resolution
  • Add option to dump URL of downloaded/failed images
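
The trailing-slash item above is easier to picture with the standard library's `urljoin`. Whether this is exactly the behaviour behind the open issue is an assumption; the URLs are placeholders.

```python
from urllib.parse import urljoin

# Without a trailing slash, the last path segment is treated as a file
# and is replaced by the relative reference.
no_slash = urljoin("https://example.com/gallery", "img/cat.png")

# With a trailing slash, the last segment is treated as a directory
# and the relative reference is appended to it.
with_slash = urljoin("https://example.com/gallery/", "img/cat.png")

print(no_slash)    # https://example.com/img/cat.png
print(with_slash)  # https://example.com/gallery/img/cat.png
```

If a target URL is given without its trailing slash, relative image paths can therefore resolve to a different location than the browser would use after the server redirects to the slash-terminated URL.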

License

MIT

Download files

Source Distribution

imgscrapy-1.0.0.tar.gz (6.1 kB view details)

Uploaded Source

Built Distribution

imgscrapy-1.0.0-py3-none-any.whl (6.8 kB view details)

Uploaded Python 3

File details

Details for the file imgscrapy-1.0.0.tar.gz.

File metadata

  • Download URL: imgscrapy-1.0.0.tar.gz
  • Upload date:
  • Size: 6.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.23.0 setuptools/46.4.0 requests-toolbelt/0.9.1 tqdm/4.46.0 CPython/3.8.3

File hashes

Hashes for imgscrapy-1.0.0.tar.gz
Algorithm Hash digest
SHA256 c2929761cd9f7badb4ec82956ef9eb19cf6b5c28caec065f0884d867c2247016
MD5 fbe714da9f07269b5e5f74bf7dbd2b70
BLAKE2b-256 92e42b10346de96d6db36ea3d8fe964f1115f6f8f02dc1f99e40eddc1de2f3d4

File details

Details for the file imgscrapy-1.0.0-py3-none-any.whl.

File metadata

  • Download URL: imgscrapy-1.0.0-py3-none-any.whl
  • Upload date:
  • Size: 6.8 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.23.0 setuptools/46.4.0 requests-toolbelt/0.9.1 tqdm/4.46.0 CPython/3.8.3

File hashes

Hashes for imgscrapy-1.0.0-py3-none-any.whl
Algorithm Hash digest
SHA256 5ade9e96fd6426e7bd0e3b89763740bf62f880dc20a6455630aec39ea36c0660
MD5 517fab2d1ecc663bd041ff235f01cc53
BLAKE2b-256 3440b89952e5e10afb2a614b1d8439abf13ca0e709af6d3cedf7749afd1e515d
