Image Scraper for Google Drive, Imgur, AsiaChan, and more.

Project description

ImageURLScraper

ImageURLScraper is a multi-site image scraper. It automatically detects which site a link points to and scrapes it, collecting only the relevant images; shortened links are automatically unshortened. When you have many links to process, each link is paired with an ID, so the returned image links can be told apart and grouped.

Currently Supported Sites:
Asiachan - Checks all previous and next pages from its current location.
Google Drive - Checks all folders and grabs the first 1000 images in each folder.
Imgur - Grabs all images in a gallery.

Installation

In a terminal, type pip install imageurlscraper.

In order to scrape images from Google Drive, credentials are needed.
Steps to add Google Drive credentials:

1. Go to https://console.developers.google.com/apis/dashboard and click + ENABLE APIS AND SERVICES at the top.
2. Search for Google Drive API, click it, and then click Enable.
3. Select a project; you will land on that project's page.
4. You will see the notice: "To use this API, you may need credentials. Click 'Create credentials' to get started." Go ahead and click Create credentials.
5. You will be asked what type of credentials you need. For the API, select Google Drive API, and select Other UI for where the API will be called from. For the data you will be accessing, select Application data.
6. Create a service account in the second field. Set its role to project owner and make sure the key type is JSON.
7. Download your credentials and rename the JSON file to credentials.json.
8. Go to the project source (if you installed via pip, type pip show imageurlscraper in a terminal to find it) and put credentials.json in the same folder as main.py, as shown in the sketch below.
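A minimal sketch for finding where credentials.json belongs, assuming a pip install (it relies only on the package's import location; the exact layout of your environment may differ):

import pathlib

import imageurlscraper

# The package's install directory; main.py and the other modules live here.
pkg_dir = pathlib.Path(imageurlscraper.__file__).parent
print(f"Copy credentials.json into: {pkg_dir}")

# Quick sanity check before running the Google Drive scraper.
print("credentials.json found:", (pkg_dir / "credentials.json").is_file())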

Sample Code

"""
This sample code links directly to the main function that automatically processes the links 
and returns back a dict with IDs and their image links. The original link will not be shown,
which is why IDs are useful.
IDs are REQUIRED input alongside their links, although they are only for classifying links.
Links can have several IDs if necessary to group them together.
"""
import imageurlscraper
import pprint
pp = pprint.PrettyPrinter(indent=4)

list_of_links = [
    # Each entry must contain an ID along with a link.
    # The ID distinguishes certain objects or people when the dict is returned.
    [0, "https://kpop.asiachan.com/222040"],
    [1, 'https://imgur.com/a/mEUURoG'],
    [2, 'https://bit.ly/36GWd2A'],
    [3, 'http://imgur.com/a/jRcrF'],
    # [999, 'https://drive.google.com/drive/folders/1uWIObdgq65-TmBcA8oJIWOnbuuR_H5PB']
    # This Google Drive folder holds a lot of media and is skipped for testing purposes,
    # but links like it are supported: every folder inside the folder is traversed.
]


scraper = imageurlscraper.main.Scraper()
all_images = scraper.run(list_of_links)  # a dict mapping each ID to its image links
pp.pprint(all_images)

Expected Output (dict)

{   0: [   'https://static.asiachan.com/Lee.Jueun.full.222040.jpg',
           ...],
    1: [   'https://i.imgur.com/RUb6Xwl.jpg',
           ...],
    2: [   ...],
    3: [   'https://i.imgur.com/ILixI73.jpg',
           ...]
}

More Samples

import imageurlscraper
scraper = imageurlscraper.main.Scraper()

shortened_link = "https://bit.ly/311n6vP"
unshortened_link = scraper.get_main_link(shortened_link)  # Expected Output -> http://google.com/


# Want to process links one by one, or don't want to use IDs?
link = "https://imgur.com/a/mEUURoG"
image_links = scraper.process_source(link)  # Expected Output -> A LIST of image links.


# Want to run from the sources directly?
images = imageurlscraper.asiachan.AsiaChan().get_all_image_links(link)  # Asiachan, expected output -> A LIST of image links.
images = imageurlscraper.googledrive.DriveScraper().get_links(link)  # Google Drive, expected output -> A LIST of image links.
images = imageurlscraper.imgur.MediaScraper().start(link)  # Imgur, expected output -> A LIST of image links.
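
Since several links can share one ID, grouping is just a matter of repeating the ID. A minimal sketch, assuming (as the docstring above states) that entries with the same ID are collected under a single key in the returned dict:

import imageurlscraper

scraper = imageurlscraper.main.Scraper()

# Two Imgur albums grouped under the same ID.
grouped_links = [
    [0, 'https://imgur.com/a/mEUURoG'],
    [0, 'http://imgur.com/a/jRcrF'],  # same ID -> same group
]

all_images = scraper.run(grouped_links)
# Under the assumption above, all images land under the single key 0.
print(len(all_images.get(0, [])))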

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

imageurlscraper-1.0.0.tar.gz (7.3 kB)

Uploaded Source

Built Distribution

imageurlscraper-1.0.0-py3-none-any.whl (11.4 kB)

Uploaded Python 3

File details

Details for the file imageurlscraper-1.0.0.tar.gz.

File metadata

  • Download URL: imageurlscraper-1.0.0.tar.gz
  • Upload date:
  • Size: 7.3 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.2.0 pkginfo/1.5.0.1 requests/2.22.0 setuptools/41.2.0 requests-toolbelt/0.9.1 tqdm/4.45.0 CPython/3.8.1

File hashes

Hashes for imageurlscraper-1.0.0.tar.gz
Algorithm Hash digest
SHA256 138e04560918d97e7e3d046bc53c98cf16afa222f9defcb908096598b3ece18e
MD5 308772d175bbf6e401711a09d5ed2053
BLAKE2b-256 9b1abde0109fc413ea85b8fbb7097b0928d46b8ca94a1f9c94984bb923ef1e0a

See more details on using hashes here.
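
To verify a download against the hashes above, the standard library is enough. A minimal sketch, assuming the sdist was downloaded into the current directory:

import hashlib

# Published SHA256 for imageurlscraper-1.0.0.tar.gz (from the table above).
EXPECTED_SHA256 = "138e04560918d97e7e3d046bc53c98cf16afa222f9defcb908096598b3ece18e"

with open("imageurlscraper-1.0.0.tar.gz", "rb") as f:
    digest = hashlib.sha256(f.read()).hexdigest()

print("hash matches:", digest == EXPECTED_SHA256)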

File details

Details for the file imageurlscraper-1.0.0-py3-none-any.whl.

File metadata

  • Download URL: imageurlscraper-1.0.0-py3-none-any.whl
  • Upload date:
  • Size: 11.4 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.2.0 pkginfo/1.5.0.1 requests/2.22.0 setuptools/41.2.0 requests-toolbelt/0.9.1 tqdm/4.45.0 CPython/3.8.1

File hashes

Hashes for imageurlscraper-1.0.0-py3-none-any.whl
Algorithm Hash digest
SHA256 5a9b349fae06a949319f2674b8d80408a9443f99533da5b154ceb1e292a18eba
MD5 bca8e492b4590f685ea0eb9ba19a8ae8
BLAKE2b-256 7b3a2241565a3aa7328f56810f63d451fbec98528c4c9ab21a2634fda1c57649

See more details on using hashes here.
