
A library which does some basic web scraping operations


easierscrape

A library for basic web scraping.


Overview

easierscrape is a library for basic web scraping operations. Scraping code is often written and rewritten with slightly changed parameters to fit each target website. This library is an easy-to-use tool that scrapes the essentials from websites (tables, links, files, etc.). It can also generate hyperlink trees using anytree.

Details

This is a pure Python project built with modern tooling. It uses a Makefile as a command registry, with the following commands:

  • make: list available commands
  • make develop: install and build this library and its dependencies using pip
  • make build: build the library using setuptools
  • make lint: perform static analysis of this library with flake8 and black
  • make format: autoformat this library using black
  • make annotate: run type checking using mypy
  • make test: run automated tests with pytest
  • make coverage: run automated tests with pytest and collect coverage information
  • make dist: package library for distribution

Basic Usage

Install with pip: pip install easierscrape

Import Scraper from easierscrape and instantiate it with a URL, as shown below:

from easierscrape import Scraper

scraper = Scraper("https://quotes.toscrape.com/login")

From there, call methods on the scraper instance to scrape various resources.

Usage examples:

>>> scraper.parse_text()
["Quotes to Scrape", "Quotes to Scrape", "Login", "Username", "Password", "Quotes by:", "GoodReads.com", "Made with", "❤", "by", "Scrapinghub",]

>>> scraper.print_tree(1)
https://toscrape.com
├── http://books.toscrape.com
├── http://quotes.toscrape.com
├── http://quotes.toscrape.com/scroll
├── http://quotes.toscrape.com/js
├── http://quotes.toscrape.com/js-delayed
├── http://quotes.toscrape.com/tableful
├── http://quotes.toscrape.com/login
├── http://quotes.toscrape.com/search.aspx
└── http://quotes.toscrape.com/random
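
To illustrate what a depth-1 hyperlink tree like the one above represents, here is a minimal standard-library sketch. It is not easierscrape's implementation (the library uses anytree, and its internals are not shown here). The names LinkCollector and render_tree, the static sample HTML, and the /about link are all hypothetical and exist only for this example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect absolute link targets from <a> tags (hypothetical helper)."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href:
            # Resolve relative links against the page URL.
            absolute = urljoin(self.base_url, href)
            if absolute not in self.links:  # keep the first occurrence only
                self.links.append(absolute)


def render_tree(root, children):
    """Render a depth-1 tree in the box-drawing style shown above."""
    lines = [root]
    for i, child in enumerate(children):
        branch = "└── " if i == len(children) - 1 else "├── "
        lines.append(branch + child)
    return "\n".join(lines)


# Static HTML stands in for a fetched page so the sketch runs offline.
SAMPLE_HTML = """
<a href="http://books.toscrape.com">Books</a>
<a href="http://quotes.toscrape.com">Quotes</a>
<a href="/about">About</a>
"""

collector = LinkCollector("https://toscrape.com")
collector.feed(SAMPLE_HTML)
tree_text = render_tree("https://toscrape.com", collector.links)
print(tree_text)
```

A deeper tree would repeat the same link collection for each child URL, which is the part a depth argument such as the 1 passed to print_tree presumably controls.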

Command Line Usage

Once installed, easierscrape can be invoked from the command line to generate a hyperlink tree, take a screenshot, download all image, txt, and pdf files, and scrape any tables for a given URL and depth:

usage: python -m easierscrape [-h] url depth download_path

positional arguments:
  url            the url to scrape
  depth          the depth of the scrape tree
  download_path  the location to download files to

optional arguments:
  -h, --help  show this help message and exit

Usage example:

$ python -m easierscrape https://toscrape.com/ 1 example_down_path
https://toscrape.com
├── http://books.toscrape.com
├── http://quotes.toscrape.com
├── http://quotes.toscrape.com/scroll
├── http://quotes.toscrape.com/js
├── http://quotes.toscrape.com/js-delayed
├── http://quotes.toscrape.com/tableful
├── http://quotes.toscrape.com/login
├── http://quotes.toscrape.com/search.aspx
└── http://quotes.toscrape.com/random
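
For readers who want to script around this interface, the following sketch shows an argparse parser that accepts the same three positional arguments as the help text above. It is only an approximation: the function name build_parser is hypothetical, and converting depth to an integer is an assumption, since the actual parser inside easierscrape is not shown here.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented usage (hypothetical sketch)."""
    parser = argparse.ArgumentParser(prog="python -m easierscrape")
    parser.add_argument("url", help="the url to scrape")
    # Assumption: depth is parsed as an integer.
    parser.add_argument("depth", type=int, help="the depth of the scrape tree")
    parser.add_argument("download_path", help="the location to download files to")
    return parser


# Parse the same arguments used in the usage example above.
args = build_parser().parse_args(
    ["https://toscrape.com/", "1", "example_down_path"]
)
print(args.url, args.depth, args.download_path)
```

Passing an explicit list to parse_args keeps the sketch independent of the real command line, which makes it easy to try in an interactive session.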

Contributing

See CONTRIBUTING for more information.

License

This software is licensed under the Apache 2.0 license. Please see LICENSE for more information.

