AwesomeRasam

A BeautifulSoup4 wrapper for lazy people. It lets you extract and clean HTML/XML into neat formats with very few lines of elegant code.

Installation

pip3 install awesome-rasam

Initializing

From a URL

AwesomeRasam can use requests and BeautifulSoup4 under the hood to download HTML from a URL and build a soup object from it.

from awesome_rasam import AwesomeRasam

rasam = AwesomeRasam("https://1upkd.com")
# or pass in any additional arguments you would pass to requests.get()
rasam = AwesomeRasam("https://1upkd.com",headers={"User-Agent":"Bot"})

print(rasam.get("title",">text"))

From Text

Initialize the soup under the hood with HTML/XML-formatted text. This is useful when you fetch HTML through a requests session or a headless browser.

from awesome_rasam import AwesomeRasam

html = "<html><head><title>Page Title</title></head><body>Hello</body></html>"
rasam = AwesomeRasam(html, features="html5lib")
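
For instance, a minimal sketch of initializing from a requests session (the session here is illustrative; any source of markup text works the same way):

import requests
from awesome_rasam import AwesomeRasam

session = requests.Session()
response = session.get("https://1upkd.com")

# Feed the downloaded markup straight into AwesomeRasam
rasam = AwesomeRasam(response.text, features="html5lib")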

From a BeautifulSoup4 object

from awesome_rasam import AwesomeRasam
from bs4 import BeautifulSoup

html = "<html><head><title>Page Title</title></head><body>Hello</body></html>"
soup = BeautifulSoup(html, features="html5lib")
rasam = AwesomeRasam(soup)
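
However it is initialized, the resulting rasam behaves the same. For example, picking the title text out of the soup built above:

print(rasam.get("title", ">text"))  # prints "Page Title"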

Scraping data

  • All scraping is done by providing CSS selectors to pick elements, and the attributes to pick from those elements.
  • In addition to the attributes present on an element's tag, the special attributes >text, >inner_markup, >outer_markup, and >rasam can be picked.
  • get() and get_all() methods are provided to select the first matching element and all matching elements respectively.
  • If the element is not found, or the attribute is not present, an Exception is raised. This can be prevented by passing flag=False, and an optional fallback value can be specified by passing fallback="N/A" (see the sketch after the example below).
  • A pipe argument can be passed containing a function or a list of functions to be applied to the result before it is returned.
import json
from awesome_rasam import AwesomeRasam

rasam = AwesomeRasam("https://1upkd.com/host-website-on-laptop/")
blog = {
    "page_title": rasam.get("title", ">text"),
    "heading": rasam.get("h1", ">text"),
    "author": rasam.get(".title p>b", ">text"),
    "date": rasam.get(".title p>span", ">text", 
        pipe = lambda x: x.replace("\n","").strip()),
    "links": rasam.get_all("a","href"),
    "linked_emails": list(set(rasam.get_all(
        "a[href^='mailto:']", "href", 
        pipe = lambda x: x.split("mailto:")[1]))),
    "linked_emails_are_gmail": rasam.get_all(
        "a[href^='mailto:']", "href", 
        pipe = [
          lambda x: x.split("mailto:")[1],
          lambda x: x.endswith("@gmail.com")
        ]),
    "json_ld_metadata": rasam.get(
        "script[type='application/ld+json']", ">inner_markup",
        pipe=json.loads)        
}

print(json.dumps(blog, indent=2))
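
If a selector might not match, the flag and fallback options described above keep get() and get_all() from raising. A minimal sketch (the selectors here are hypothetical, and fallback is assumed to work the same way for both methods):

from awesome_rasam import AwesomeRasam

rasam = AwesomeRasam("https://1upkd.com/host-website-on-laptop/")

# Element that may not exist: return the fallback instead of raising
subtitle = rasam.get("h2.subtitle", ">text", flag=False, fallback="N/A")

# Attribute that may be missing on some matched elements
alt_texts = rasam.get_all("img", "alt", flag=False, fallback="")

print(subtitle, alt_texts)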

Ultimate flex

import json
import random

from awesome_rasam import AwesomeRasam

def parse_blog(rasam):
    return {
        "page_title": rasam.get("title", ">text"),
        "heading": rasam.get("h1", ">text"),
        "author": rasam.get(".title p>b", ">text"),
        "date": rasam.get(".title p>span", ">text", 
            pipe = lambda x: x.replace("\n","").strip()),
        "links": rasam.get_all("a","href"),
        "linked_emails": list(set(rasam.get_all(
            "a[href^='mailto:']", "href", 
            pipe = lambda x: x.split("mailto:")[1]))),
        "linked_emails_are_gmail": rasam.get_all(
            "a[href^='mailto:']", "href", 
            pipe = [
              lambda x: x.split("mailto:")[1],
              lambda x: x.endswith("@gmail.com")
            ]),
        "json_ld_metadata": rasam.get(
            "script[type='application/ld+json']", ">inner_markup",
            pipe=json.loads)        
    }



rasam = AwesomeRasam("https://1upkd.com")
data = {
    "page_title": rasam.get("title", ">text"),
    "blogs": rasam.get_all("#blogs ~ a", "href", pipe=[
      lambda url: AwesomeRasam(
          "https://1upkd.com/"+url, 
          delay=random.randint(1,5)),
      parse_blog
    ])        
}

print(json.dumps(data, indent=2))

Note: The delay argument can be passed when initializing with a URL to delay the request by that many seconds. It can also be a function that returns the number of seconds.
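
For example, a minimal sketch of both forms:

import random
from awesome_rasam import AwesomeRasam

# Wait a fixed 2 seconds before the request is made
rasam = AwesomeRasam("https://1upkd.com", delay=2)

# Or decide the wait at request time with a callable
rasam = AwesomeRasam("https://1upkd.com", delay=lambda: random.uniform(1, 3))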
