Skip to main content

`scrape_files` is a tool to help scrape things online to your local machine.

Project description

scrape_files is a tool to help scrape things online to your local machine. Currently, it supports scraping and converting htmls to well-formatted markdowns for easy reading as well as scraping and downloading images of various formats in a web page.

Scraping htmls to your local machine

The html parsing logic is similar to a browser's easyread extension's, which trims off all the unnecessary decorations from a web page, only keeping the title and the article content. The main difference is that the file is downloaded and converted as a pretty formatted markdown.

Also support scraping links under the <p> tag in the current page concurrently.

Terminal usage:

scrape html <url>     # specify a url for scraping
scrape html <url> -d  # specify a directory name for saving files in current folder
scrape html <url> -l  # specify a level: 1 by default for the current page; 2 for links in the current page

Scraping images to your local machine

Images are scraped and downloaded concurrently. Supported formats: jpg, png, gif, svg, jpeg, webp; defaults to all supported formats.

Terminal usage:

scrape image <url>     # specify a url for scraping
scrape image <url> -d  # specify a diretory name for saving files in current folder 
scrape image <url> -f  # specify image formats separated with space 

Installation

pip install scrape_files

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

scrape_files-0.2.0.tar.gz (8.9 kB view hashes)

Uploaded Source

Built Distribution

scrape_files-0.2.0-py3-none-any.whl (10.3 kB view hashes)

Uploaded Python 3

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page