Skip to main content

A lightweight python module which automates webscraping and parsing through HTML

Project description

pyautoscraper

Author: Jeet Chugh

pyautoscraper is a A lightweight module which automates webscraping and gathering HTML elements within Python 3

Features:

  • Find elements by searching for tags, attributes, classes, id's, and more
  • Parse through Cloudflare protected sites (NOT CAPTCHA)
  • Install easily with pip
  • Lightweight, only uses cloudscraper and BS4

Github Link | PyPi Link | Example Code Link

Quick and Easy Installation via PIP: pip install pyautoscraper

Import Statement: from pyautoscraper.scraper import Scraper

Dependencies: bs4, cloudscraper

Code License: MIT

Documentation

Documentation is split into 2 sections. First is the 'Part' Class and second is the 'Query' Function.

'Scraper' Class:

The 'Scraper' class takes in an input of a URL as a string, and has many methods that return specific chunks of data.

Import:

from pyautoscraper.scraper import Scraper

Instantiation:

webscraper = Scraper('URL') # Takes in url string (with https://)

another_scraper = Scraper('Second URL') # Instantiate multiple Scrapers though variables

'Scraper' Methods:

Scraper will raise a URLerror if the request is unsuccessful. Scraper will return None if no elements are found.

Scraper('url').find(tag, **attributes) --> Scraper('URL').find('h1', class_='blog-title')

returns a string containing the first HTML element that matches your parameters. To find classes, use the 'class_' keyword argument.

(<h1 class="blog-title>Title</h1>")


Scraper('url').findAll(tag, **attributes) --> Scraper('URL').findAll('p')

returns a list of strings, containing all the HTML elements that match the parameters.

([<p>first</p>, <p>second</p>, <p>third</p>])


Scraper('url').findText()

returns a string containing the text content of the HTML, with all tags and attributes stripped.

(h1 text paragraph text span text h5 text im in a div tag)


Scraper('url').findLinks()

returns a list of all http/https links in a tags within the HTML code of the page.

([https://www.google.com, https://www.github.com])


Scraper('url').findJS()

returns a list containing strings, which represent the string tags within the HTML code.

Example Dictionary:{'model':'Intel','Core Clock':'3.2Ghz','TDP':'95W','Socket':'LGA1155'}


Scraper('url').findElementByID(IDname)

returns a string containing the first HTML element that matches your IDname.

(<div id="database_div">content</div>)


Scraper('url').findElementByClass(className)

returns a string containing the first HTML element that matches your className.

(<div class="database_div">content</div>)


Scraper('url').findComments()

returns a list of strings, containing all the HTML comments within the code.

(['<!-- a comment -->','<!-- ANOTHER COMMENT -->'])


Thank you for reading the documentation. If you need an example using all these methods, go to [link]

If you have issues, report them to the github project link.

CHANGELOG:

0.0.1 (10/5/20):

  • GitHub Commit
  • Published to PyPi

0.0.2 (10/6/20):

  • Updated README
  • Fixed Bug

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

pyautoscraper-0.0.2.tar.gz (4.9 kB view hashes)

Uploaded Source

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page