Pixelripper
Package and CLI for downloading media from a webpage.
Install with:
pip install pixelripper
The pixelripper package provides a class called PixelRipper and a subclass called PixelRipperSelenium.
PixelRipper uses the requests library to fetch webpages, while PixelRipperSelenium uses a Selenium-based engine to do the same.
The Selenium engine is slower and requires more resources, but it is useful for webpages
that don't render their media content without a JavaScript engine.
It can use either the Firefox or the Chrome browser.
Note: You must have the appropriate webdriver for your machine and browser
version installed in order to use PixelRipperSelenium.
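To illustrate why a JavaScript engine can matter, here is a minimal sketch that uses only the standard library. It is not pixelripper's implementation; `MediaSrcParser`, `find_media_sources`, and both sample pages are hypothetical names and data made up for this example. The parser reads the raw HTML the way a plain HTTP fetch would see it, so media that a script would insert at runtime never shows up.

```python
from html.parser import HTMLParser


class MediaSrcParser(HTMLParser):
    """Collect src attributes from <img>, <video>, <audio>, and <source> tags."""

    MEDIA_TAGS = {"img", "video", "audio", "source"}

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # Only the markup present in the raw HTML is seen here;
        # nothing inside <script> is executed.
        if tag in self.MEDIA_TAGS:
            for name, value in attrs:
                if name == "src" and value:
                    self.sources.append(value)


def find_media_sources(html: str) -> list[str]:
    """Return the media src values found in the static HTML text."""
    parser = MediaSrcParser()
    parser.feed(html)
    return parser.sources


# A page whose image is part of the delivered HTML.
static_page = '<html><body><img src="/cat.jpg"></body></html>'

# A page whose image would only exist after a browser runs the script.
scripted_page = """
<html><body>
<div id="gallery"></div>
<script>
  const img = document.createElement("img");
  img.src = "/cat.jpg";
  document.getElementById("gallery").appendChild(img);
</script>
</body></html>
"""

print(find_media_sources(static_page))    # ['/cat.jpg']
print(find_media_sources(scripted_page))  # []
```

For pages like the second one, a browser-driven engine such as the one behind PixelRipperSelenium is the kind of tool that can see the media after the script has run.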
pixelripper can be used programmatically or from the command line.
Programmatic usage:
    from pixelripper import PixelRipper
    from pathlib import Path

    ripper = PixelRipper()
    # Scrape the page for image, video, and audio urls.
    ripper.rip("https://somewebsite.com")
    # Any content urls found will now be accessible as members of ripper.
    print(ripper.image_urls)
    print(ripper.video_urls)
    print(ripper.audio_urls)
    # All the urls found on a page can be accessed through the ripper.scraper member.
    all_urls = ripper.scraper.get_links("all")
    # The urls can also be filtered according to a list of extensions
    # with the filter_by_extensions function.
    # The following will return only .jpg and .mp3 file urls.
    urls = ripper.filter_by_extensions([".jpg", ".mp3"])
    # The content can then be downloaded.
    ripper.download_files(urls, Path.cwd()/"somewebsite")
    # Alternatively, everything in ripper.image_urls, ripper.video_urls, and ripper.audio_urls
    # can be downloaded with just a call to ripper.download_all()
    ripper.download_all(Path.cwd()/"somewebsite")
    # Separate subfolders named "images", "videos", and "audio"
    # will be created inside the "somewebsite" folder when using this function.
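For readers curious what filtering by extension involves, the following is a rough, standalone sketch of the idea. It is an assumption about the general technique, not the code behind `filter_by_extensions`; the function name `filter_urls_by_extensions` and the sample urls are hypothetical. It compares the suffix of each url's path, ignoring case and any query string.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def filter_urls_by_extensions(urls, extensions):
    """Keep only the urls whose path ends in one of the given extensions."""
    wanted = {ext.lower() for ext in extensions}
    return [
        url
        for url in urls
        # urlparse(...).path drops the query string, so "?download=1" is ignored.
        if PurePosixPath(urlparse(url).path).suffix.lower() in wanted
    ]


# Hypothetical sample data for illustration only.
sample_urls = [
    "https://somewebsite.com/images/photo.JPG",
    "https://somewebsite.com/audio/track.mp3?download=1",
    "https://somewebsite.com/video/clip.mp4",
    "https://somewebsite.com/about",
]

filtered = filter_urls_by_extensions(sample_urls, [".jpg", ".mp3"])
print(filtered)
```

Here `filtered` contains the `photo.JPG` and `track.mp3` urls only. Whether pixelripper itself matches case-insensitively or strips query strings is not stated in this README.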
Command line usage:
    >pixelripper -h
    usage: pixelripper [-h] [-s] [-nh] [-b BROWSER] [-o OUTPUT_PATH] [-eh [EXTRA_HEADERS ...]] url

    positional arguments:
      url                   The url to scrape for media.

    options:
      -h, --help            show this help message and exit
      -s, --selenium        Use selenium to get page content instead of requests.
      -nh, --no_headless    Don't use headless mode when using -s/--selenium.
      -b BROWSER, --browser BROWSER
                            The browser to use when using -s/--selenium. Can be 'firefox' or 'chrome'.
                            You must have the appropriate webdriver installed for your machine
                            and browser version in order to use the selenium engine.
      -o OUTPUT_PATH, --output_path OUTPUT_PATH
                            Output directory to save results to. If not specified, a folder with
                            the name of the webpage will be created in the current working directory.
      -eh [EXTRA_HEADERS ...], --extra_headers [EXTRA_HEADERS ...]
                            Extra headers to use when requesting files as key, value pairs.
                            Keys and values should be colon separated and pairs should be space separated.
                            e.g. -eh Referer:website.com/page Host:website.com
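The `-eh` format described above (colon separated keys and values, space separated pairs) can be turned into a headers dictionary along these lines. This is a sketch under assumptions, not pixelripper's own parsing code; `parse_extra_headers` is a hypothetical helper. It splits each pair on the first colon only, so a value that itself contains a colon stays intact.

```python
def parse_extra_headers(pairs):
    """Convert ["Key:value", ...] strings into a {"Key": "value"} dict."""
    headers = {}
    for pair in pairs:
        # partition splits on the first colon only.
        key, sep, value = pair.partition(":")
        if not sep or not key:
            raise ValueError(f"Expected KEY:VALUE, got {pair!r}")
        headers[key.strip()] = value.strip()
    return headers


# The same pairs as in the help text example.
headers = parse_extra_headers(["Referer:website.com/page", "Host:website.com"])
print(headers)  # {'Referer': 'website.com/page', 'Host': 'website.com'}
```

A dictionary in this shape is what HTTP client libraries such as requests accept for their `headers` argument.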