
A simple 9GAG scraper

Project description

9GAG Scraper

A simple 9GAG scraper that uses Selenium with the Chromium WebDriver. It lets you scrape images for a given search term. The application is written in Python 2.7 and uses Tkinter 8.6.

Installation

Tkinter

First, you will need to install Tkinter manually. To do so, follow the installation tutorial on tkdocs.com, making sure to use the instructions for Python 2.7.

Chromium webdriver

The scraper also uses Selenium's Chromium WebDriver. I suspect a regular Chrome WebDriver would work just the same, but I was unable to test that, since I cannot install plain Chrome on Linux (whoops). Follow the instructions from this webpage.

Pip packages

Finally, install the pip packages required for the application to run. It's recommended to create a virtual environment, so the installed packages stay separate from your global packages:

# Create a virtual environment
virtualenv ./venv

# Activate the virtual environment for your current shell
source ./venv/bin/activate

# Finally, install the packages from the requirements.txt file
pip install -r ./requirements.txt

# When you're done using the application, you can deactivate the virtual environment
deactivate

Usage

To start the application, you just need to execute the main.py script:

python ./src/main.py

This will open the GUI application, where you can define the search term you want to scrape with. Hitting the "Scrape!" button (or Enter) starts the scraping process: the webdriver fetches the required page, scrolls down the configured number of times (twice by default) and collects all images it finds. The found images are displayed as thumbnails, and you then have the opportunity to save them to your computer.
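The scroll-and-collect step described above can be sketched roughly like this. This is an illustration, not the project's actual code: the function name, the pause length and the use of a plain `<img>` tag lookup are all assumptions; `driver` is any Selenium-style object exposing `execute_script()` and `find_elements_by_tag_name()`.

```python
import time


def collect_image_urls(driver, scroll_count=2, pause=1.0):
    """Scroll the page and collect the src of every image found.

    Scrolling to the bottom triggers 9GAG's infinite scroll, which
    loads more posts; after scrolling, every <img> element's src is
    collected, deduplicated, and returned in page order.
    """
    for _ in range(scroll_count):
        # Jump to the bottom of the page so more posts get loaded
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give the page time to load the new images
    seen, urls = set(), []
    for img in driver.find_elements_by_tag_name("img"):
        src = img.get_attribute("src")
        if src and src not in seen:
            seen.add(src)
            urls.append(src)
    return urls
```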

Known issues

Cloudflare - are you a robot?!

9GAG is protected by Cloudflare or some other proxy. Sometimes I've had issues with it trying to verify that I'm human. The webdriver is obviously not human, so you cannot simply click the button to get past the verification. My workaround is to open the same page in my own Chrome browser and complete the verification step there.

Next steps

  • Get the webdriver to minimize
  • Have an entry where the user can define the download folder (otherwise, use the default)
  • Have an entry field that allows the user to select the number of times the webdriver will scroll down
  • Try to find a way to use the with keyword with the scraper and GUI classes, so the browser can be closed cleanly without the try/finally clause
  • Add CLI arguments for logging level, logging location, maybe more
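The with-keyword idea from the list above could look roughly like this. It is only a sketch: the `NineGagScraper` class exists in the project, but the `_driver` attribute name and constructor signature shown here are assumptions.

```python
class NineGagScraper(object):
    """Sketch of context-manager support for the scraper class."""

    def __init__(self, driver):
        self._driver = driver  # assumed attribute name

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        # Always close the browser, even if scraping raised an exception
        self._driver.quit()
        return False  # don't swallow exceptions
```

This would allow `with NineGagScraper(driver) as scraper: ...` and guarantee the browser is closed on exit, replacing the try/finally clause.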

Jorkata feedback

  • Error handling
  • 2x newline between imports and the rest of the code
  • newline at the end of the files
  • avoid newlines in functions
  • main.py - config_logger function - not the best name
  • import whole library instead of each small component
  • Docstrings
    • [?] // Read the PEP8 document for docstrings, get familiar with it
    • Newline after docstrings
    • If a docstring can be on a single line, it's better for it to be that way
    • Docstring everything
    • One liner description for multi-line docstrings
    • Docstrings must end with a dot
    • Imperative form for docstrings
    • Some comments that I have are useless and don't look good ("WHAT THE FAK")
  • Use constants for literals
  • if statements - continue or return, try to avoid using else
  • scrape.py -> _get_images or whatever it is called - make into a generator (yield)
  • scraper.py#79 - too many new lines (javascripty)
  • initialize fields in init even if that means that they need to be set to None - clearer definition of what the class contains as fields
  • Make into a PyPi package - python wheel
  • UNITTEST!
  • Make it python3 compatible
  • Make into MVC

Unittest

  • Test ScrapedImage somehow?
  • Test NineGagScraper somehow?
    • We'll need to mock the selenium driver somehow
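Mocking the Selenium driver could start out like this. The sketch uses the standard library's `unittest.mock` (on Python 2.7, the `mock` backport package provides the same API); the element and driver method names are the standard Selenium ones, but how `NineGagScraper` actually uses them is an assumption.

```python
import unittest
from unittest import mock


class ScraperDriverTest(unittest.TestCase):
    """Sketch: replace the real browser with a mock driver."""

    def test_collects_image_sources_from_driver(self):
        # A fake <img> element that the mocked driver will return
        fake_img = mock.Mock()
        fake_img.get_attribute.return_value = "https://img.example/1.jpg"

        driver = mock.Mock()
        driver.find_elements_by_tag_name.return_value = [fake_img]

        # Stand-in for the scraper's collection step
        srcs = [img.get_attribute("src")
                for img in driver.find_elements_by_tag_name("img")]

        driver.find_elements_by_tag_name.assert_called_once_with("img")
        self.assertEqual(srcs, ["https://img.example/1.jpg"])
```

With this in place, `NineGagScraper` could be constructed around the mock driver and tested without ever launching a browser.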

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

9gag_scraper-0.0.5.tar.gz (122.6 kB)


Built Distribution

9gag_scraper-0.0.5-py2.py3-none-any.whl (9.3 kB)


File details

Details for the file 9gag_scraper-0.0.5.tar.gz.

File metadata

  • Download URL: 9gag_scraper-0.0.5.tar.gz
  • Upload date:
  • Size: 122.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/1.15.0 pkginfo/1.8.3 requests/2.27.1 setuptools/44.1.1 requests-toolbelt/1.0.0 tqdm/4.64.1 CPython/2.7.18

File hashes

Hashes for 9gag_scraper-0.0.5.tar.gz
Algorithm Hash digest
SHA256 6b39d5b1a3e3987909e3a638d5251ca6567c40179b13f79558ab4213f6e083fb
MD5 a4e6f1869d4b5bd3616d54f3b4341109
BLAKE2b-256 bfd3d0431cb86716a33e2a8f3c269a0194d75de5c987da7ad0272dd6ea81ef09

See more details on using hashes here.

File details

Details for the file 9gag_scraper-0.0.5-py2.py3-none-any.whl.

File metadata

  • Download URL: 9gag_scraper-0.0.5-py2.py3-none-any.whl
  • Upload date:
  • Size: 9.3 kB
  • Tags: Python 2, Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/1.15.0 pkginfo/1.8.3 requests/2.27.1 setuptools/44.1.1 requests-toolbelt/1.0.0 tqdm/4.64.1 CPython/2.7.18

File hashes

Hashes for 9gag_scraper-0.0.5-py2.py3-none-any.whl
Algorithm Hash digest
SHA256 e98160587fccedabff072e362226dc8e47871b776bc66aec6fce4b0d780114c0
MD5 d06d50cee2c62dd77deb58068a587b31
BLAKE2b-256 4fa4acef8e38f0f6ef539decdcd9718dbc3db7bd7e27b82a0c35355fb919e043

See more details on using hashes here.
