Skip to main content

Simple python package for scraping images from different search engines by prompt.

Project description

Gallery Thief

Gallery Thief Logo

Gallery Thief, an artful liar
Cunningly steals your heart's desire.
Python library, so refined,
Your digital treasures are its' prize.

YandexGPT2, 2023

Introduction

Gallery Thief is a simple web-scraping tool designed for parsing images in different search engines.

It isn't fast because it tries to keep all the captcha stuff away by changing User-Agent, proxy and refer every its request. It also has some kind of a cool down between two requests to make them less repetitive and suspicious.

First run takes a while because it loads proxy list from freeproxyupdate.com. After that you can send several requests and they will be processed much faster than an initial run.

Quick Start

If you installed this package correctly you can just copy and paste this example code to test how does it work:

from GalleryThief.performer import Thief
from GalleryThief.strategies import StealingFromYandex
from GalleryThief.mask import RobberMask

PROXY_SOURCE = "https://freeproxyupdate.com/files/txt/http.txt"

strategy = StealingFromYandex()  # Creating strategy for getting images
mask = RobberMask(PROXY_SOURCE, 10)  # Creating mask to hide behind proxies
thief = Thief(strategy, mask)  # Creating thief using given strategy and mask

# Ordering thief to get one image of Pluto from yandex images
result = thief.get_images_list(['Photo of Pluto', 1])

print(result)

Guide

Gallery Thief uses several classes to retrieve images from search engines (Google, Yandex, Bing, etc.)

You should know them to achieve full potential of this little package.

Thief class

Thief is your loyal performer for your mischievous deeds involving some images web-scraping. You can find his class in GalleryThief.performer.

To create new instance of Thief you must inject StealingStrategy and RobberMask in its constructor (we'll talk about strategies and masks later). After that you can easily give orders to your helpful minion. Your put them in the simple python lists using special format: ["Prompt text": str, number_of_images: int]. To make Thief execute these orders you need call its only method get_images_list. This method can accept as many orders as you wish. For example:

result = thief.get_images_list(
    ['Photo of Pluto', 1],
    ['Doctor Who', 2],
    ['Star Trek', 3],
    ['Solaris poster', 1]
)

It will return dictionary which keys are your prompts, every key in such dictionary stores list of urls to images it found using StealingStrategy.

Your also can change strategy and mask on fly using Thief's setters:

thief.strategy = StealingFromGoogle()
thief.mask = RobberMask(ANOTHER_PROXY_SOURCE, 42)

RobberMask Class

What a thief goes on his job without proper mask to hide his identity?

This class is designed for hiding from search engines one fact. The fact that your requests are automated by python script. It uses different technics such as changing user-agent header, refer and proxy servers. When you create instance of that class you must provide url of source of proxy servers list like this:

mask = RobberMask("https://freeproxyupdate.com/files/txt/http.txt")

List must be in plain text format where one string equals one ip address with port or some kind of comment starting with #. Example:

## Top 50 Updated Free Proxy IP Address
## 09-29-2023 15:17 (UTC-6 Chicago)
47.88.3.19:8080
67.43.227.227:30983
91.107.247.138:4000
118.33.139.176:80
121.4.20.187:20000

Sometimes list will be very long. RobberMask checks every ip address presented so it will many time to complete this checking. Instead of that you can specify upper limit for number of checked proxy servers like that:

mask = RobberMask("https://freeproxyupdate.com/files/txt/http.txt", 10)

It will check ten ip addresses and then stop checking.

Creating instance of RobberMask takes time depending on proxy servers list size and its limit.

Stealing Strategies Classes

This group of classes are describing different algorithms of getting images for different search engines. They all have their own params, options and etc. so it was logical to separate them into different classes with one abstract parent class called StealingStrategy. Let's look at them!

  1. StealingFromYandex

    Its purpose is obvious because of name of this class. It was designed for scraping "Yandex Images".

    StealingFromYandex(
        size: YandexSizes = YandexSizes.ANY,
        orientation: YandexOrientation = YandexOrientation.ANY,
        image_type: YandexImageType = YandexImageType.ANY,
        file_type: YandexFileType = YandexFileType.ANY,
        color: YandexColor = YandexColor.ANY,
        site: str = '',
        recent: bool = False,
    )
    

    Params description:

    Parameter Description
    size selects images of one of special size groups (SMALL, MIDDLE, LARGE, WALLPAPER, ANY)
    orientation selects horizontal or vertical images (HORIZONTAL, VERTICAL, ANY)
    image_type selects images by their type (PHOTO, CLIPART, LINEART, FACE, DEMOTIVATOR, ANY)
    file_type selects images by file type (PNG, JPEG, GIF, ANY)
    color selects images by dominant color in them (COLOR, GRAY, RED, ORANGE, YELLOW, CYAN, GREEN, BLUE, VIOLET, CYAN, WHITE, BLACK, ANY)
    site specifies the site images should be from
    recent if True looks among images published in last seven days
  2. StealingFromGoogle

    Work in progress...

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

gallery-thief-1.0.1.tar.gz (9.4 kB view hashes)

Uploaded Source

Built Distribution

gallery_thief-1.0.1-py3-none-any.whl (8.6 kB view hashes)

Uploaded Python 3

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page