Simple Python package for scraping images from different search engines by prompt.
Gallery Thief
Gallery Thief, an artful liar
Cunningly steals your heart's desire.
Python library, so refined,
Your digital treasures are its prize.
YandexGPT2, 2023
Introduction
Gallery Thief is a simple web-scraping tool for collecting images from different search engines.
It isn't fast, because it tries to avoid captchas by changing the User-Agent, proxy and referer on every request. It also waits for a cool-down period between two requests to make them less repetitive and suspicious.
The first run takes a while because it loads the proxy list from the given URL. After that, you can send several requests and they will be processed much faster than the initial run.
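The disguise described above can be sketched roughly as follows. This is only an illustration, not the package's actual code: the `USER_AGENTS` and `REFERERS` pools, the `disguise` and `cool_down` helpers, and the delay bounds are all hypothetical.

```python
import random
import time

# Hypothetical pools; the real package keeps its own values.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/118.0",
]
REFERERS = ["https://www.google.com/", "https://yandex.com/"]


def disguise(proxies, rng=random):
    """Pick a fresh User-Agent, Referer and proxy for a single request."""
    headers = {
        "User-Agent": rng.choice(USER_AGENTS),
        "Referer": rng.choice(REFERERS),
    }
    proxy = rng.choice(proxies)
    # Mapping in the shape many HTTP clients accept (an assumption here).
    return headers, {"http": f"http://{proxy}"}


def cool_down(min_seconds=1.0, max_seconds=3.0, rng=random, sleep=time.sleep):
    """Sleep for a random interval so requests are not evenly spaced."""
    delay = rng.uniform(min_seconds, max_seconds)
    sleep(delay)
    return delay
```

Randomizing the delay, rather than using a fixed pause, is one way to make a series of requests look less mechanical.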
Quick Start
If you installed this package correctly, you can copy and paste this example code to test how it works:
from GalleryThief.performer import Thief
from GalleryThief.strategies import StealingFromYandex
from GalleryThief.mask import RobberMask
PROXY_SOURCE = "https://freeproxyupdate.com/files/txt/http.txt"
strategy = StealingFromYandex() # Creating strategy for getting images
mask = RobberMask(PROXY_SOURCE, 10) # Creating mask to hide behind proxies
thief = Thief(strategy, mask) # Creating thief using given strategy and mask
# Ordering thief to get one image of Pluto from yandex images
result = thief.get_images_list(['Photo of Pluto', 1])
print(result)
Guide
Gallery Thief uses several classes to retrieve images from search engines (Google, Yandex, etc.).
You should know them to get the full potential out of this little package.
Thief class
Thief is your loyal performer for mischievous deeds involving image web-scraping. You can find his class in GalleryThief.performer.
To create a new instance of Thief you must inject a StealingStrategy and a RobberMask into its constructor (we'll talk about strategies and masks later). After that you can easily give orders to your helpful minion. You put them in simple Python lists using a special format: ["Prompt text": str, number_of_images: int]. To make Thief execute these orders you need to call its only method, get_images_list. This method can accept as many orders as you wish. For example:
result = thief.get_images_list(
['Photo of Pluto', 1],
['Doctor Who', 2],
['Star Trek', 3],
['Solaris poster', 1]
)
It returns a dictionary whose keys are your prompts. Each key maps to a list of URLs of the images found using the StealingStrategy.
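To make the shape of the return value concrete, here is a small sketch that walks over such a dictionary. The `result` value is made-up sample data (the example.com URLs are placeholders, not real output), and `summarize` is a hypothetical helper.

```python
# Hypothetical result, shaped like the dictionary described above:
# prompt -> list of image URLs.
result = {
    "Photo of Pluto": ["https://example.com/pluto.jpg"],
    "Doctor Who": [
        "https://example.com/who-1.jpg",
        "https://example.com/who-2.jpg",
    ],
}


def summarize(result):
    """Return (prompt, number_of_urls) pairs in insertion order."""
    return [(prompt, len(urls)) for prompt, urls in result.items()]


for prompt, count in summarize(result):
    print(f"{prompt}: {count} image URL(s)")
```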
You can also change the strategy and mask on the fly using Thief's setters:
thief.strategy = StealingFromGoogle()
thief.mask = RobberMask(ANOTHER_PROXY_SOURCE, 42)
RobberMask Class
What kind of thief goes on a job without a proper mask to hide his identity?
This class is designed to hide one fact from search engines: that your requests are automated by a Python script. It uses different techniques, such as changing the User-Agent header, the referer and the proxy server. When you create an instance of this class you must provide the URL of a proxy server list, like this:
mask = RobberMask("https://freeproxyupdate.com/files/txt/http.txt")
The list must be in plain text format, where each line is either one IP address with a port or a comment starting with #. Example:
## Top 50 Updated Free Proxy IP Address
## 09-29-2023 15:17 (UTC-6 Chicago)
47.88.3.19:8080
67.43.227.227:30983
91.107.247.138:4000
118.33.139.176:80
121.4.20.187:20000
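A list in this format could be parsed along the following lines. This is a minimal sketch under the format rules stated above; `parse_proxy_list` is a hypothetical helper and not necessarily how RobberMask reads the file.

```python
def parse_proxy_list(text):
    """Return the 'ip:port' entries of a plain-text proxy list.

    Blank lines and lines starting with '#' are skipped, as are lines
    that do not look like 'host:port' with a numeric port.
    """
    proxies = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        host, sep, port = line.rpartition(":")
        if sep and host and port.isdigit():
            proxies.append(line)
    return proxies


sample = """## Top 50 Updated Free Proxy IP Address
## 09-29-2023 15:17 (UTC-6 Chicago)
47.88.3.19:8080
67.43.227.227:30983
"""
print(parse_proxy_list(sample))
```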
Sometimes the list will be very long. RobberMask checks every IP address in it, so the check can take a long time to complete. Instead, you can specify an upper limit for the number of proxy servers checked, like this:
mask = RobberMask("https://freeproxyupdate.com/files/txt/http.txt", 10)
It will check ten IP addresses and then stop.
Creating an instance of RobberMask takes time, depending on the size of the proxy server list and the limit.
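The effect of the limit can be illustrated with a short sketch. The `check_proxies` function and its `is_alive` callback are hypothetical names used only for this example. The sketch assumes the limit counts checked addresses, not working ones, as the description above suggests.

```python
def check_proxies(candidates, is_alive, limit=None):
    """Check at most `limit` candidates and return those that respond.

    `is_alive` is any callable that takes an 'ip:port' string and
    returns True when the proxy is usable.
    """
    working = []
    for checked, proxy in enumerate(candidates):
        if limit is not None and checked >= limit:
            break  # the limit counts checked addresses, not working ones
        if is_alive(proxy):
            working.append(proxy)
    return working


# Usage with a fake probe: pretend only proxies on port 8080 respond.
candidates = ["47.88.3.19:8080", "67.43.227.227:30983", "91.107.247.138:4000"]
print(check_proxies(candidates, lambda p: p.endswith(":8080"), limit=2))
```

Because every check may involve a network round trip, a smaller limit shortens start-up at the cost of a smaller pool of working proxies.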
Stealing Strategies Classes
These classes describe different algorithms for getting images from different search engines. They all have their own parameters, options, etc., so it was logical to separate them into different classes with one abstract parent class called StealingStrategy.
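The general idea of one abstract parent with interchangeable subclasses can be sketched like this. All names here (`SearchStrategy`, `ExampleEngineStrategy`, `Performer`, `build_url`, the example.com URL) are hypothetical stand-ins chosen for illustration; they are not the package's real interface.

```python
from abc import ABC, abstractmethod
from urllib.parse import urlencode


class SearchStrategy(ABC):
    """Hypothetical stand-in for an abstract parent strategy class."""

    @abstractmethod
    def build_url(self, prompt: str) -> str:
        """Return the search URL for one prompt."""


class ExampleEngineStrategy(SearchStrategy):
    """Strategy for an imaginary search engine with a size filter."""

    def __init__(self, size: str = "any"):
        self.size = size

    def build_url(self, prompt: str) -> str:
        query = urlencode({"q": prompt, "size": self.size})
        return "https://images.example.com/search?" + query


class Performer:
    """Delegates engine-specific work to whichever strategy it holds."""

    def __init__(self, strategy: SearchStrategy):
        self.strategy = strategy

    def plan(self, *orders):
        # Each order is [prompt, number_of_images]; only the prompt is used here.
        return {prompt: self.strategy.build_url(prompt) for prompt, _count in orders}


performer = Performer(ExampleEngineStrategy(size="large"))
print(performer.plan(["Photo of Pluto", 1]))
```

Swapping the strategy object changes which engine is queried without touching the performer, which is the point of keeping the engine-specific options in separate classes.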
Let's look at them!
- StealingFromYandex
Its purpose is obvious from the class name: it was designed for scraping "Yandex Images".
StealingFromYandex(
    size: YandexSizes = YandexSizes.ANY,
    orientation: YandexOrientation = YandexOrientation.ANY,
    image_type: YandexImageType = YandexImageType.ANY,
    file_type: YandexFileType = YandexFileType.ANY,
    color: YandexColor = YandexColor.ANY,
    site: str = '',
    recent: bool = False,
)
Params description:
Parameter | Description
---|---
size | selects images of one of special size groups (SMALL, MIDDLE, LARGE, WALLPAPER, ANY)
orientation | selects horizontal or vertical images (HORIZONTAL, VERTICAL, ANY)
image_type | selects images by their type (PHOTO, CLIPART, LINEART, FACE, DEMOTIVATOR, ANY)
file_type | selects images by file type (PNG, JPEG, GIF, ANY)
color | selects images by dominant color in them (COLOR, GRAY, RED, ORANGE, YELLOW, CYAN, GREEN, BLUE, VIOLET, WHITE, BLACK, ANY)
site | specifies the site images should be from
recent | if True, looks among images published in the last seven days
- StealingFromGoogle
StealingFromGoogle(
    size: GoogleSizes = GoogleSizes.ANY,
    image_type: GoogleImageType = GoogleImageType.ANY,
    last_time: GoogleLastTimeUsed = GoogleLastTimeUsed.ANY,
    color: GoogleColor = GoogleColor.ANY,
    license: GoogleLicense = GoogleLicense.ANY,
)
Params description:
Parameter | Description
---|---
size | selects images of one of special size groups (LARGE, MEDIUM, ICONS, ANY)
image_type | selects images by their type (CLIPART, LINEART, ANIMATED, ANY)
last_time | selects images by the period of time they were published in (DAY, WEEK, MONTH, YEAR, ANY)
color | selects images by dominant color in them (BLACK_AND_WHITE, TRANSPARENT, RED, ORANGE, YELLOW, GREEN, TEAL, BLUE, PURPLE, PINK, WHITE, GRAY, BLACK, BROWN, ANY)
license | selects images by type of license (CREATIVE_COMMONS, COMMERCIAL, ANY)