A customizable image crawler framework for downloading images and their information using multiple threads. It also includes several helper utilities for building your own image crawler.

Development Status
- 3 - Alpha
Intended Audience
- Developers
- Science/Research
Programming Language
- Python :: 3
Topic
- Internet

Project description

Image Crawler Utils

A Customizable Multi-station Image Crawler Structure

About

A customizable image crawler framework designed to download images together with their information using multiple threads.

It also provides several classes and functions to help you build your own custom image crawler.

Please follow the rules in each site's robots.txt, and use a small number of threads with a long delay between requests when crawling images. Frequent requests and heavy download traffic may get your IP address banned or your account suspended.
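As a rough illustration of checking robots.txt before crawling, the sketch below uses Python's standard urllib.robotparser. It is not part of Image Crawler Utils; the robots.txt content, the user agent name and the example.com URLs are all made up so that the snippet runs offline.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the example needs no network access.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""


def make_parser(robots_txt: str) -> RobotFileParser:
    """Build a parser from robots.txt text instead of fetching it from a site."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


robots = make_parser(ROBOTS_TXT)
user_agent = "my-image-crawler"  # hypothetical user agent name

# Ask whether each URL may be fetched, and which delay the site requests.
allowed = robots.can_fetch(user_agent, "https://example.com/posts/1")
blocked = robots.can_fetch(user_agent, "https://example.com/private/1")
delay = robots.crawl_delay(user_agent)
print(allowed, blocked, delay)
```

A crawl delay read this way could serve as a lower bound when choosing the thread_delay value described under Quick Start.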

Installing

The recommended way to install it is with pip:

pip install image-crawler-utils

Requires Python >= 3.9.

Attention!

nodriver is used to parse information from certain websites. Installing the latest version of Google Chrome first is recommended so that the crawler runs correctly.

Features (Partial)

Currently supported websites:
- Danbooru - features supported:
  - Downloading images searched by tags
- yande.re / konachan.com / konachan.net - features supported:
  - Downloading images searched by tags
- Gelbooru - features supported:
  - Downloading images searched by tags
- Safebooru - features supported:
  - Downloading images searched by tags
- Pixiv - features supported:
  - Downloading images searched by tags
  - Downloading images uploaded by a certain member
- Twitter / X - features supported:
  - Downloading images from search results
  - Downloading images uploaded by a certain user
- Logging crawler operations to the console and, optionally, to a file.
- Using rich progress bars and log messages to show the crawler's progress (Jupyter Notebook is supported).
- Saving and loading the settings of a crawler.
- Saving and loading image information for future downloading.
- Several classes and functions for designing custom image crawlers.

How to Use

Please refer to tutorials and notes for tasks for detailed instructions.

Quick Start

Image Crawler Utils provides three independent modules for an image crawler:

CrawlerSettings: Basic configuration for the downloading and debugging settings of the crawler. Every argument except station_url is optional and falls back to its default value (see tutorials) when omitted. The parameters of a CrawlerSettings are:

from image_crawler_utils import CrawlerSettings
from image_crawler_utils.configs import DebugConfig

# Each argument is shown with its default value; the expected type follows in a comment.
crawler_settings = CrawlerSettings(
    # Limits on the number of images, the total capacity and the number of pages
    image_num=None,  # int | None
    capacity=None,  # float | None
    page_num=None,  # int | None
    # Download parameters
    headers=None,  # dict | Callable | None
    proxies=None,  # dict | Callable | None
    thread_delay=5,  # float
    fail_delay=3,  # float
    randomize_delay=True,  # bool
    thread_num=5,  # int
    timeout=10,  # float | None
    max_download_time=None,  # float | None
    retry_times=5,  # int
    overwrite_images=True,  # bool
    # Which types of messages are shown on the console
    debug_config=DebugConfig(
        show_debug=False,  # bool
        show_info=True,  # bool
        show_warning=True,  # bool
        show_error=True,  # bool
        show_critical=True,  # bool
    ),
    # Logging settings
    detailed_console_log=False,  # bool
    # Extra configs for custom use (placeholder names and values):
    # extra_configs={"arg_name": config, "arg_name2": config2, ...},
)

Parser: Parses the provided arguments, visits and crawls the site, and finally returns a list of image URLs with their information. Different tasks may require different parsers. A working parser is used like this:

# import SomeParser from image_crawler_utils.stations.some_station

parser = SomeParser(crawler_settings, parser_args)
image_info_list = parser.run()


# Example
from image_crawler_utils.stations.booru import DanbooruKeywordParser

parser = DanbooruKeywordParser(
    crawler_settings=crawler_settings,
    standard_keyword_string="kuon_(utawarerumono) AND rating:safe",
)
image_info_list = parser.run()

Downloader: Downloads the images in the list generated by the parser, after filtering it with image_info_filter. The parameters of a Downloader are:

from image_crawler_utils import Downloader

# crawler_settings and image_info_list come from the two previous snippets.
downloader = Downloader(
    crawler_settings=crawler_settings,  # CrawlerSettings; defaults to CrawlerSettings()
    image_info_list=image_info_list,  # Iterable[ImageInfo]; required
    store_path='./',  # str | Iterable[str]
    image_info_filter=True,  # Callable | bool
    cookies=None,  # Cookies | list | dict | str | None; defaults to Cookies()
)
total_size, succeeded_image_list, failed_image_list, skipped_image_list = downloader.run()
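Since image_info_filter accepts a Callable, a filter function might look like the sketch below. This is an assumption, not the documented interface: it supposes the callable receives one ImageInfo-like object and returns True to keep the image, and that the object's info attribute mirrors the "info" entry of the saved JSON shown in the Examples section. The function name min_score_filter and the SimpleNamespace stand-ins are hypothetical, used so the snippet runs without the library.

```python
from types import SimpleNamespace


def min_score_filter(image_info, min_score: int = 5) -> bool:
    """Return True if the image's site score is at least min_score."""
    # ASSUMPTION: image_info.info is a dict whose nested "info" key holds the
    # site's own metadata, as in the saved image_info_list.json.
    site_info = image_info.info.get("info", {})
    return site_info.get("score", 0) >= min_score


# Stand-ins for ImageInfo objects (hypothetical, for illustration only).
high = SimpleNamespace(info={"info": {"score": 10}})
low = SimpleNamespace(info={"info": {"score": 1}})
missing = SimpleNamespace(info={})

print(min_score_filter(high), min_score_filter(low), min_score_filter(missing))
```

If the assumption holds, such a function would be passed as image_info_filter=min_score_filter; check the tutorials for the exact contract before relying on it.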

Examples

Running this example downloads the first 20 images from Danbooru that match the tag kuon_(utawarerumono) and rating:general into the "Danbooru" folder. The image information is stored in image_info_list.json in the same directory as your program. Note that you may need to set the proxies manually.

from image_crawler_utils import CrawlerSettings, Downloader, save_image_infos
from image_crawler_utils.stations.booru import DanbooruKeywordParser

crawler_settings = CrawlerSettings(
    image_num=20,
    # If you do not use system proxies, remove '#' and set this manually
    # proxies={"https": "socks5://127.0.0.1:7890"},
)

parser = DanbooruKeywordParser(
    crawler_settings=crawler_settings,
    standard_keyword_string="kuon_(utawarerumono) AND rating:general",
)
image_info_list = parser.run()
save_image_infos(image_info_list, "image_info_list")
downloader = Downloader(
    crawler_settings=crawler_settings,
    store_path='Danbooru',
    image_info_list=image_info_list,
)
downloader.run()

The information for one image in the saved image_info_list.json looks like this:

ImageInfo Structure in JSON

{
    "url": "https://cdn.donmai.us/original/cd/91/cd91f0000b9574bf142d125a1e886e5c.png",
    "name": "Danbooru 4994142 cd91f0000b9574bf142d125a1e886e5c.png",
    "info": {
        "info": {
            "id": 4994142,
            "created_at": "2021-12-21T08:02:13.706-05:00",
            "uploader_id": 772564,
            "score": 10,
            "source": "https://i.pximg.net/img-original/img/2020/08/11/12/41/43/83599609_p0.png",
            "md5": "cd91f0000b9574bf142d125a1e886e5c",
            "last_comment_bumped_at": null,
            "rating": "s",
            "image_width": 2000,
            "image_height": 2828,
            "tag_string": "1girl absurdres animal_ears black_eyes black_hair coat grabbing_own_breast hair_ornament hairband highres holding holding_mask japanese_clothes kuon_(utawarerumono) long_hair looking_at_viewer mask ponytail shirokuro_neko_(ouma_haruka) smile solo utawarerumono utawarerumono:_itsuwari_no_kamen",
            "fav_count": 10,
            "file_ext": "png",
            "last_noted_at": null,
            "parent_id": null,
            "has_children": false,
            "approver_id": null,
            "tag_count_general": 17,
            "tag_count_artist": 1,
            "tag_count_character": 1,
            "tag_count_copyright": 2,
            "file_size": 4527472,
            "up_score": 10,
            "down_score": 0,
            "is_pending": false,
            "is_flagged": false,
            "is_deleted": false,
            "tag_count": 23,
            "updated_at": "2024-07-10T12:21:31.782-04:00",
            "is_banned": false,
            "pixiv_id": 83599609,
            "last_commented_at": null,
            "has_active_children": false,
            "bit_flags": 0,
            "tag_count_meta": 2,
            "has_large": true,
            "has_visible_children": false,
            "media_asset": {
                "id": 5056745,
                "created_at": "2021-12-21T08:02:04.132-05:00",
                "updated_at": "2023-03-02T04:43:15.608-05:00",
                "md5": "cd91f0000b9574bf142d125a1e886e5c",
                "file_ext": "png",
                "file_size": 4527472,
                "image_width": 2000,
                "image_height": 2828,
                "duration": null,
                "status": "active",
                "file_key": "nxj2jBet8",
                "is_public": true,
                "pixel_hash": "5d34bcf53ddde76fd723f29aae5ebc53",
                "variants": [
                    {
                        "type": "180x180",
                        "url": "https://cdn.donmai.us/180x180/cd/91/cd91f0000b9574bf142d125a1e886e5c.jpg",
                        "width": 127,
                        "height": 180,
                        "file_ext": "jpg"
                    },
                    {
                        "type": "360x360",
                        "url": "https://cdn.donmai.us/360x360/cd/91/cd91f0000b9574bf142d125a1e886e5c.jpg",
                        "width": 255,
                        "height": 360,
                        "file_ext": "jpg"
                    },
                    {
                        "type": "720x720",
                        "url": "https://cdn.donmai.us/720x720/cd/91/cd91f0000b9574bf142d125a1e886e5c.webp",
                        "width": 509,
                        "height": 720,
                        "file_ext": "webp"
                    },
                    {
                        "type": "sample",
                        "url": "https://cdn.donmai.us/sample/cd/91/sample-cd91f0000b9574bf142d125a1e886e5c.jpg",
                        "width": 850,
                        "height": 1202,
                        "file_ext": "jpg"
                    },
                    {
                        "type": "original",
                        "url": "https://cdn.donmai.us/original/cd/91/cd91f0000b9574bf142d125a1e886e5c.png",
                        "width": 2000,
                        "height": 2828,
                        "file_ext": "png"
                    }
                ]
            },
            "tag_string_general": "1girl animal_ears black_eyes black_hair coat grabbing_own_breast hair_ornament hairband holding holding_mask japanese_clothes long_hair looking_at_viewer mask ponytail smile solo",
            "tag_string_character": "kuon_(utawarerumono)",
            "tag_string_copyright": "utawarerumono utawarerumono:_itsuwari_no_kamen",
            "tag_string_artist": "shirokuro_neko_(ouma_haruka)",
            "tag_string_meta": "absurdres highres",
            "file_url": "https://cdn.donmai.us/original/cd/91/cd91f0000b9574bf142d125a1e886e5c.png",
            "large_file_url": "https://cdn.donmai.us/sample/cd/91/sample-cd91f0000b9574bf142d125a1e886e5c.jpg",
            "preview_file_url": "https://cdn.donmai.us/180x180/cd/91/cd91f0000b9574bf142d125a1e886e5c.jpg"
        },
        "family_group": null,
        "tags": [
            "1girl",
            "absurdres",
            "animal_ears",
            "black_eyes",
            "black_hair",
            "coat",
            "grabbing_own_breast",
            "hair_ornament",
            "hairband",
            "highres",
            "holding",
            "holding_mask",
            "japanese_clothes",
            "kuon_(utawarerumono)",
            "long_hair",
            "looking_at_viewer",
            "mask",
            "ponytail",
            "shirokuro_neko_(ouma_haruka)",
            "smile",
            "solo",
            "utawarerumono",
            "utawarerumono:_itsuwari_no_kamen"
        ],
        "tags_class": {
            "1girl": "general",
            "animal_ears": "general",
            "black_eyes": "general",
            "black_hair": "general",
            "coat": "general",
            "grabbing_own_breast": "general",
            "hair_ornament": "general",
            "hairband": "general",
            "holding": "general",
            "holding_mask": "general",
            "japanese_clothes": "general",
            "long_hair": "general",
            "looking_at_viewer": "general",
            "mask": "general",
            "ponytail": "general",
            "smile": "general",
            "solo": "general",
            "kuon_(utawarerumono)": "character",
            "utawarerumono": "copyright",
            "utawarerumono:_itsuwari_no_kamen": "copyright",
            "shirokuro_neko_(ouma_haruka)": "artist",
            "absurdres": "meta",
            "highres": "meta"
        }
    },
    "backup_urls": [
        "https://i.pximg.net/img-original/img/2020/08/11/12/41/43/83599609_p0.png"
    ]
}
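The structure above can be inspected with the standard json module alone. The sketch below groups the tags of a record by class using its tags_class mapping. It works on a trimmed inline copy of the record rather than the real file, and it assumes that image_info_list.json holds a JSON list of such records, which this page does not state. The helper name tags_by_class is made up for illustration.

```python
import json
from collections import defaultdict

# Trimmed copy of the record above.
# ASSUMPTION: the saved file is a JSON list of records with this shape.
SAMPLE = """
[
  {
    "url": "https://cdn.donmai.us/original/cd/91/cd91f0000b9574bf142d125a1e886e5c.png",
    "name": "Danbooru 4994142 cd91f0000b9574bf142d125a1e886e5c.png",
    "info": {
      "info": {"id": 4994142, "score": 10},
      "tags_class": {
        "1girl": "general",
        "solo": "general",
        "kuon_(utawarerumono)": "character",
        "utawarerumono": "copyright"
      }
    },
    "backup_urls": [
      "https://i.pximg.net/img-original/img/2020/08/11/12/41/43/83599609_p0.png"
    ]
  }
]
"""


def tags_by_class(record: dict) -> dict:
    """Group a record's tags by their class (general, character, ...)."""
    grouped = defaultdict(list)
    for tag, tag_class in record["info"].get("tags_class", {}).items():
        grouped[tag_class].append(tag)
    return dict(grouped)


records = json.loads(SAMPLE)
for record in records:
    print(record["name"], record["info"]["info"]["id"])
    print(tags_by_class(record))
```

For a real file, json.loads(SAMPLE) would be replaced by json.load on the opened file, provided the top-level layout matches the assumption.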

Documentation

- Tutorials: A detailed tutorial on how to set up configurations, construct an image crawler and download images by keywords / tags from Danbooru.
- Notes for tasks: Notes and examples for every supported site and crawling task.
- Classes and Functions: Extra information about the structure of this project and its usable classes and functions.

Release history

- 0.4.6 (Apr 1, 2026)
- 0.4.5 (Jul 8, 2025)
- 0.4.4 (Jun 25, 2025)
- 0.4.3 (Jun 17, 2025)
- 0.4.2 (Jun 17, 2025)
- 0.4.1 (Jun 17, 2025)
- 0.4.0 (Jun 17, 2025)
- 0.3.2 (Apr 12, 2025)
- 0.3.1 (Apr 11, 2025)
- 0.3.0 (Apr 11, 2025)
- 0.2.6 (Apr 11, 2025)
- 0.2.5 (Apr 11, 2025)
- 0.2.4 (Apr 7, 2025)
- 0.2.3 (Mar 19, 2025)
- 0.2.2 (Feb 19, 2025)
- 0.2.0 (Jan 24, 2025), this version
- 0.1.9 (Jan 23, 2025)
- 0.1.8 (Jan 15, 2025)
- 0.1.7 (Jan 2, 2025)
- 0.1.6 (Dec 1, 2024)
- 0.1.5 (Nov 30, 2024)
- 0.1.4 (Nov 30, 2024)
- 0.1.3 (Nov 29, 2024)
- 0.1.2 (Nov 29, 2024)
- 0.1.1 (Nov 29, 2024)
- 0.1.0 (Nov 28, 2024)
- 0.0.5 (Nov 26, 2024)

Download files

Source Distribution

image_crawler_utils-0.2.0.tar.gz (76.1 kB)

Uploaded Jan 24, 2025 Source

Built Distribution

image_crawler_utils-0.2.0-py3-none-any.whl (102.3 kB)

Uploaded Jan 24, 2025 Python 3

File details

Details for the file image_crawler_utils-0.2.0.tar.gz.

File metadata

Download URL: image_crawler_utils-0.2.0.tar.gz
Upload date: Jan 24, 2025
Size: 76.1 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/4.0.2 CPython/3.12.3

File hashes

Hashes for image_crawler_utils-0.2.0.tar.gz
Algorithm	Hash digest
SHA256	`11eebf52d3d1c6f136c390cee5f8dd5548c948bbe2bf9747243d1b5def920e08`
MD5	`1fa86af71cb89dc2e1ea12a54fff4dc0`
BLAKE2b-256	`5e21abc3302040215d81bf0b2b6f36deb303d5f919260c3d36699cef985f5cd0`

File details

Details for the file image_crawler_utils-0.2.0-py3-none-any.whl.

File metadata

Download URL: image_crawler_utils-0.2.0-py3-none-any.whl
Upload date: Jan 24, 2025
Size: 102.3 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/4.0.2 CPython/3.12.3

File hashes

Hashes for image_crawler_utils-0.2.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`9e8250a359c47b66e66c4d92ef1c208495bb4ffd212479f64e0254b043ab7020`
MD5	`d08764bc5cc0c30d11f03f282465fcbd`
BLAKE2b-256	`3c27410d842e6fad5bea842a0f1ab32c8846c47683ae8cc50c5f44f57b44b010`
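
A downloaded file can be checked against the SHA256 digests listed above with Python's standard hashlib. The following is a minimal sketch, not an official PyPI tool: the helper names sha256_of and verify are made up for illustration, and it assumes each file keeps its original name on disk.

```python
import hashlib
from pathlib import Path

# SHA256 digests copied from the hash tables above.
EXPECTED_SHA256 = {
    "image_crawler_utils-0.2.0.tar.gz":
        "11eebf52d3d1c6f136c390cee5f8dd5548c948bbe2bf9747243d1b5def920e08",
    "image_crawler_utils-0.2.0-py3-none-any.whl":
        "9e8250a359c47b66e66c4d92ef1c208495bb4ffd212479f64e0254b043ab7020",
}


def sha256_of(path: Path, chunk_size: int = 1 << 16) -> str:
    """Compute the SHA256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path: Path) -> bool:
    """Return True only if the file name is known and its digest matches."""
    expected = EXPECTED_SHA256.get(path.name)
    return expected is not None and sha256_of(path) == expected
```

For example, verify(Path("image_crawler_utils-0.2.0.tar.gz")) would be called after downloading the source distribution into the current directory; files with other names are rejected rather than compared.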
