Skip to main content

A utility for scrapping images from a HTML doc from a URL.

Project description

just-another-imgscrapper

A utility for scrapping images from a HTML doc.

Uses asyncio for fast concurrent download.

Installation

Binary installers for the latest released version are available at the Python Package Index (PyPI).

$ pip install just-another-imgscrapper

Usage

1. From cli

$ imgscrapper -h

To get HTML doc, extract image links from src attribute of <img> tags and download.

$ imgscrapper "http://foo.com/bar"
[2023-06-06 23:22:56] imgscrapper.utils:INFO: ### Initializing Scrapping ###
[2023-06-06 23:23:01] imgscrapper.utils:INFO: ### Downloaded 41 images out of extracted 41 links ###

Downloads to imgs/ dir in working dir. If dir does not exists, creates.

2. From module

>>> from imgscrapper import ImgScrapper
>>> d = ImgScrapper()
>>> d.download("http://foo.com/bar") 
>>> 3

Specify path to store downloaded images.

>>> d = ImgScrapper()
>>> d.url = "http://foo.com/bar"
>>> d.path = "/path/download"
>>> d.download() # returns no. of successful downloads
>>> 3

Some servers will block the scrapping, respect robots.txt and only used in allowed hosts.

You can add request headers.

>>> ...
>>> d.request_header = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:102.0) Gecko/20100101 Firefox/102.0',
    'DNT': '1',
    }
>>> ...

You can select specific type of img tags only by passing attrs dict.

<!-- >http://helloworld.com<-->
<html>
    <body>
        <img src="https://foo.com/bar.png" class="apple ball">
        <img src="/foo.jpg" class="cat bar">
    </body>
<html>

To select only images with class: cat

>>> d = ImgScrapper()
>>> d.url = "http://helloworld.com"
>>> d.attrs = {
    'class': 'cat',
    }
>>> d.download()
>>> 1 # http://helloworld.com/foo.jpg

The downloader gives unique uuid filename to downloaded images preserving the image extension.

>>> d = ImgScrapper(
    url = "http://helloworld.com",
    attrs = {'class': 'cat'},
    max = 5,
    path = "/home/images"
)
>>> d.download()
>>> 5

Limit the no. of image downloads by passing max value.

Liscense

just-another-imgscrapper is released under the MIT liscense. See LISCENSE for details.

Contact

Connect with me on twitter @deshritbaral

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

just-another-imgscrapper-0.1.2.tar.gz (8.4 kB view hashes)

Uploaded Source

Built Distribution

just_another_imgscrapper-0.1.2-py3-none-any.whl (7.8 kB view hashes)

Uploaded Python 3

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page