
A lightweight Python module which automates web scraping and parsing of HTML

Project description

pyautoscraper

Author: Jeet Chugh

pyautoscraper is a lightweight module which automates web scraping and gathering HTML elements within Python 3.

Features:

  • Find elements by searching for tags, attributes, classes, IDs, and more
  • Parse Cloudflare-protected sites (NOT CAPTCHA-protected pages)
  • Install easily with pip
  • Lightweight, only uses cloudscraper and BS4

Github Link | PyPi Link | Example Code Link

Quick and easy installation via pip: pip install pyautoscraper

Import Statement: from pyautoscraper.scraper import Scraper

Dependencies: bs4, cloudscraper

Code License: MIT

Documentation

Documentation covers the 'Scraper' class and its methods.

'Scraper' Class:

The 'Scraper' class takes in an input of a URL as a string, and has many methods that return specific chunks of data.

Import:

from pyautoscraper.scraper import Scraper

Instantiation:

webscraper = Scraper('URL') # Takes in url string (with https://)

another_scraper = Scraper('Second URL') # Instantiate multiple Scrapers through variables

'Scraper' Methods:

Scraper raises a URLError if the request is unsuccessful, and its find methods return None if no matching elements are found.

Scraper('url').find(tag, **attributes) --> Scraper('URL').find('h1', class_='blog-title')

returns a string containing the first HTML element that matches your parameters. To match by class, use the 'class_' keyword argument, since 'class' is a reserved word in Python.

(<h1 class="blog-title">Title</h1>)
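As an illustration of what this kind of tag-and-class lookup involves, here is a stdlib-only sketch using html.parser that captures the text of the first matching element. pyautoscraper uses BS4 internally; the class name and behavior below are hypothetical, not the library's actual code:

```python
from html.parser import HTMLParser

class FirstMatchFinder(HTMLParser):
    """Capture the text of the first tag matching a name and optional class."""

    def __init__(self, tag, cls=None):
        super().__init__()
        self.tag, self.cls = tag, cls
        self.capturing = False
        self.result = None
        self.buf = []

    def handle_starttag(self, tag, attrs):
        if self.result is None and not self.capturing and tag == self.tag:
            # attrs is a list of (name, value) pairs; class may hold several names
            classes = (dict(attrs).get("class") or "").split()
            if self.cls is None or self.cls in classes:
                self.capturing = True
                self.buf = []

    def handle_data(self, data):
        if self.capturing:
            self.buf.append(data)

    def handle_endtag(self, tag):
        if self.capturing and tag == self.tag:
            self.capturing = False
            self.result = "".join(self.buf)

html = '<h1 class="blog-title">Title</h1><h1>Other</h1>'
finder = FirstMatchFinder("h1", "blog-title")
finder.feed(html)
print(finder.result)  # Title
```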


Scraper('url').findAll(tag, **attributes) --> Scraper('URL').findAll('p')

returns a list of strings, containing all the HTML elements that match the parameters.

(['<p>first</p>', '<p>second</p>', '<p>third</p>'])


Scraper('url').findText()

returns a string containing the text content of the HTML, with all tags and attributes stripped.

(h1 text paragraph text span text h5 text im in a div tag)
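Stripping tags to leave only text content can be sketched with the standard library alone; this is a rough illustration of what findText() amounts to, not pyautoscraper's actual implementation (which uses BS4):

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect only the text nodes, discarding all tags and attributes."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

    def get_text(self):
        # Join non-empty text chunks with single spaces
        return " ".join(p.strip() for p in self.parts if p.strip())

html = "<h1>h1 text</h1><p>paragraph text</p><div>im in a div tag</div>"
extractor = TextExtractor()
extractor.feed(html)
print(extractor.get_text())  # h1 text paragraph text im in a div tag
```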


Scraper('url').findLinks()

returns a list of all http/https links in a tags within the HTML code of the page.

(['https://www.google.com', 'https://www.github.com'])
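Collecting http/https hrefs from a tags is simple enough to sketch with html.parser; this hypothetical helper illustrates the idea behind findLinks() and is not the library's code:

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Gather href values of <a> tags that are absolute http/https URLs."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href", "")
            # Skip relative links, mailto:, javascript:, etc.
            if href.startswith(("http://", "https://")):
                self.links.append(href)

html = ('<a href="https://www.google.com">g</a>'
        '<a href="/relative">skip</a>'
        '<a href="https://www.github.com">gh</a>')
collector = LinkCollector()
collector.feed(html)
print(collector.links)  # ['https://www.google.com', 'https://www.github.com']
```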


Scraper('url').findJS()

returns a list of strings, each representing a script tag within the HTML code.


Scraper('url').findElementByID(IDname)

returns a string containing the first HTML element that matches your IDname.

(<div id="database_div">content</div>)
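Looking an element up by its id attribute is an attribute match rather than a tag match; the class-based lookup below works the same way. A stdlib-only sketch of the idea (hypothetical helper, not the library's code):

```python
from html.parser import HTMLParser

class IDFinder(HTMLParser):
    """Capture the text of the first tag whose id attribute matches."""

    def __init__(self, id_name):
        super().__init__()
        self.id_name = id_name
        self.capturing = False
        self.content = None
        self.buf = []

    def handle_starttag(self, tag, attrs):
        if self.content is None and dict(attrs).get("id") == self.id_name:
            self.capturing = True

    def handle_data(self, data):
        if self.capturing:
            self.buf.append(data)

    def handle_endtag(self, tag):
        if self.capturing:
            self.capturing = False
            self.content = "".join(self.buf)

html = '<div id="database_div">content</div>'
f = IDFinder("database_div")
f.feed(html)
print(f.content)  # content
```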


Scraper('url').findElementByClass(className)

returns a string containing the first HTML element that matches your className.

(<div class="database_div">content</div>)


Scraper('url').findComments()

returns a list of strings, containing all the HTML comments within the code.

(['<!-- a comment -->','<!-- ANOTHER COMMENT -->'])
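Comments are reported by html.parser through a dedicated callback, which makes collecting them a few lines; this sketch shows the idea behind findComments() and is not the library's implementation:

```python
from html.parser import HTMLParser

class CommentCollector(HTMLParser):
    """Collect every HTML comment, re-wrapped in its delimiters."""

    def __init__(self):
        super().__init__()
        self.comments = []

    def handle_comment(self, data):
        # data is the raw comment body, delimiters stripped
        self.comments.append(f"<!--{data}-->")

html = "<p>hi</p><!-- a comment --><div><!-- ANOTHER COMMENT --></div>"
c = CommentCollector()
c.feed(html)
print(c.comments)  # ['<!-- a comment -->', '<!-- ANOTHER COMMENT -->']
```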


Thank you for reading the documentation. For an example that uses all of these methods, see [link].

If you run into issues, report them on the GitHub project page.

CHANGELOG:

0.0.1 (10/5/20):

  • GitHub Commit
  • Published to PyPi

0.0.2 (10/6/20):

  • Updated README
  • Fixed Bug

