A simple CLI image scraper tool with support for headless scraping of dynamic websites.
imgscrapy
A simple CLI image scraper written in Python, inspired by ImageScraper, with support for headless scraping of dynamic websites.
Installation
Build from source
git clone https://github.com/arutselvan/ImgScrapy
cd ImgScrapy
python setup.py install
As a Python package
pip install --user imgscrapy
Requirements
python>=3.6
Usage
usage: imgscrapy [-h] [-d DIRECTORY] [-i] [-n NFIRST] [-t NTHREADS] [-hd] [-to TIMEOUT] target_url
Downloads images from the given URL
positional arguments:
  target_url            URL to scrape images from
optional arguments:
  -h, --help            show this help message and exit
  -d DIRECTORY, --directory DIRECTORY
                        Directory in which images should be downloaded
  -i, --injected        Scrape images from a dynamic website and JS injected images
  -n NFIRST, --nfirst NFIRST
                        Scrape the first n images
  -t NTHREADS, --nthreads NTHREADS
                        Maximum number of threads to use
  -hd, --head           Open chromium for scraping JS injected source/images
  -to TIMEOUT, --timeout TIMEOUT
                        Timeout value for obtaining page source
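For readers who want to see how an interface like this maps onto code, here is a minimal argparse sketch that mirrors the help text above. It is not taken from the imgscrapy source: the function name build_parser is hypothetical, the integer types for NFIRST, NTHREADS and TIMEOUT are assumptions, and no default values are set because the help text does not state any.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser that mirrors the documented imgscrapy options (illustrative only)."""
    parser = argparse.ArgumentParser(
        prog="imgscrapy",
        description="Downloads images from the given URL",
    )
    parser.add_argument("target_url", help="URL to scrape images from")
    parser.add_argument("-d", "--directory",
                        help="Directory in which images should be downloaded")
    parser.add_argument("-i", "--injected", action="store_true",
                        help="Scrape images from a dynamic website and JS injected images")
    # Assumption: the numeric options are parsed as integers.
    parser.add_argument("-n", "--nfirst", type=int,
                        help="Scrape the first n images")
    parser.add_argument("-t", "--nthreads", type=int,
                        help="Maximum number of threads to use")
    parser.add_argument("-hd", "--head", action="store_true",
                        help="Open chromium for scraping JS injected source/images")
    parser.add_argument("-to", "--timeout", type=int,
                        help="Timeout value for obtaining page source")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args)
```

Parsing the arguments of the second example in this README, ["<Target URL>", "-i", "--nfirst", "5"], with this sketch yields injected=True and nfirst=5, with every other option left unset.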
Examples
- Download all images from a static website
imgscrapy <Target URL>
- Download the first 5 images from a dynamic website
imgscrapy <Target URL> -i --nfirst 5
Note
ImgScrapy uses pyppeteer, which drives Chromium for headless scraping. The first time you scrape a dynamic website, Chromium is downloaded automatically, which may take some time.
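To illustrate the general approach described in the note, the sketch below renders a page with pyppeteer and then collects image URLs from the resulting HTML. It is an assumption about how such a tool could work, not ImgScrapy's actual code: the names ImgSrcParser, extract_image_urls and fetch_rendered_html are hypothetical, and the 30000 ms timeout and the "networkidle2" wait condition are illustrative choices.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ImgSrcParser(HTMLParser):
    """Collects the src attribute of every <img> tag it sees."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def extract_image_urls(html, base_url):
    """Return absolute URLs for all <img src=...> values found in html."""
    parser = ImgSrcParser()
    parser.feed(html)
    return [urljoin(base_url, src) for src in parser.sources]


async def fetch_rendered_html(url, timeout_ms=30000, headless=True):
    """Load url in Chromium via pyppeteer and return the rendered page source."""
    # Imported lazily so the parsing helper above works without pyppeteer installed.
    from pyppeteer import launch

    browser = await launch(headless=headless)
    try:
        page = await browser.newPage()
        # Wait until network activity settles so injected images are in the DOM.
        await page.goto(url, {"timeout": timeout_ms, "waitUntil": "networkidle2"})
        return await page.content()
    finally:
        await browser.close()
```

With pyppeteer installed, the coroutine can be driven with asyncio.run(fetch_rendered_html(url)) and its result passed to extract_image_urls. Passing headless=False opens a visible Chromium window, which is roughly what the -hd flag describes.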
To Do
- Write tests
- Add support for Base64 images
- Add support for embedded/inline SVG files
- Fix issues with headless browsing of dynamic sites with modals/popups
- Fix issue with missing trailing slash in URL resolution
- Add option to dump the URLs of downloaded/failed images
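The trailing-slash item in the list above can be illustrated with the standard library. Whether ImgScrapy resolves URLs with urllib.parse.urljoin is an assumption; the sketch only shows how standard relative-URL resolution treats a base URL with and without a trailing slash, using a made-up example.com address.

```python
from urllib.parse import urljoin

# Without a trailing slash, the last path segment is treated as a file
# and is replaced by the relative reference.
without_slash = urljoin("https://example.com/gallery", "img/a.png")

# With a trailing slash, the last segment is treated as a directory
# and the relative reference is appended to it.
with_slash = urljoin("https://example.com/gallery/", "img/a.png")

print(without_slash)  # https://example.com/img/a.png
print(with_slash)     # https://example.com/gallery/img/a.png
```

If a site serves its gallery at /gallery/ but the URL is given without the final slash, relative image paths resolve one directory too high, which is one plausible source of failed downloads.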
License
MIT
File details
Details for the file imgscrapy-1.0.0.tar.gz.
File metadata
- Download URL: imgscrapy-1.0.0.tar.gz
- Upload date:
- Size: 6.1 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.23.0 setuptools/46.4.0 requests-toolbelt/0.9.1 tqdm/4.46.0 CPython/3.8.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | c2929761cd9f7badb4ec82956ef9eb19cf6b5c28caec065f0884d867c2247016 |
| MD5 | fbe714da9f07269b5e5f74bf7dbd2b70 |
| BLAKE2b-256 | 92e42b10346de96d6db36ea3d8fe964f1115f6f8f02dc1f99e40eddc1de2f3d4 |
File details
Details for the file imgscrapy-1.0.0-py3-none-any.whl.
File metadata
- Download URL: imgscrapy-1.0.0-py3-none-any.whl
- Upload date:
- Size: 6.8 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.23.0 setuptools/46.4.0 requests-toolbelt/0.9.1 tqdm/4.46.0 CPython/3.8.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 5ade9e96fd6426e7bd0e3b89763740bf62f880dc20a6455630aec39ea36c0660 |
| MD5 | 517fab2d1ecc663bd041ff235f01cc53 |
| BLAKE2b-256 | 3440b89952e5e10afb2a614b1d8439abf13ca0e709af6d3cedf7749afd1e515d |