A Beautiful Soup 4 wrapper for quickly scraping and cleaning data from the web
AwesomeRasam
A BeautifulSoup4 wrapper for lazy people. Allows you to extract and clean HTML/XML into neat formats with very few lines of elegant code.
Installation
pip3 install awesome-rasam
Initializing
From a URL
AwesomeRasam can use requests and BeautifulSoup4 under the hood to download HTML from a URL and create a soup object with it.
from awesome_rasam import AwesomeRasam
rasam = AwesomeRasam("https://1upkd.com")
# or pass in any additional arguments you would pass to requests.get()
rasam = AwesomeRasam("https://1upkd.com",headers={"User-Agent":"Bot"})
print(rasam.get("title",">text"))
From Text
Initialize the underlying soup with HTML/XML-formatted text. This is useful when you fetch the HTML yourself through a requests session or a headless browser.
from awesome_rasam import AwesomeRasam
html = "<html><head><title>Page Title</title></head><body>Hello</body></html>"
rasam = AwesomeRasam(html, features="html5lib")
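For instance, here is a minimal sketch of the requests-session case mentioned above (the URL is just the example domain used elsewhere in this README; any way of obtaining the markup string works the same):
import requests
from awesome_rasam import AwesomeRasam

# Fetch HTML with your own session (cookies, headers, retries, etc.)
session = requests.Session()
session.headers.update({"User-Agent": "Bot"})
html = session.get("https://1upkd.com").text

# Hand the markup to AwesomeRasam exactly like the literal string above
rasam = AwesomeRasam(html, features="html5lib")
print(rasam.get("title", ">text"))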
From a BeautifulSoup4 object
from awesome_rasam import AwesomeRasam
from bs4 import BeautifulSoup
html = "<html><head><title>Page Title</title></head><body>Hello</body></html>"
soup = BeautifulSoup(html, features="html5lib")
rasam = AwesomeRasam(soup)
Scraping data
- All scraping is done by providing CSS selectors to pick elements, and the attributes to pick from those elements.
- In addition to the attributes present on the element tag, the special attributes >text, >inner_markup, >outer_markup, and >rasam can be picked.
- get() and get_all() methods are provided to select the first matching element and all matching elements respectively.
- If the element is not found, or the attribute is not present, an Exception is raised. This can be prevented by passing flag=False, and an optional fallback value can be specified by passing fallback="N/A" (a short sketch of these options follows the example below).
- A pipe argument can be passed containing a function or a list of functions to be executed on the result before returning.
import json
from awesome_rasam import AwesomeRasam
rasam = AwesomeRasam("https://1upkd.com/host-website-on-laptop/")
blog = {
    "page_title": rasam.get("title", ">text"),
    "heading": rasam.get("h1", ">text"),
    "author": rasam.get(".title p>b", ">text"),
    "date": rasam.get(".title p>span", ">text",
                      pipe=lambda x: x.replace("\n", "").strip()),
    "links": rasam.get_all("a", "href"),
    "linked_emails": list(set(rasam.get_all(
        "a[href^='mailto:']", "href",
        pipe=lambda x: x.split("mailto:")[1]))),
    "linked_emails_are_gmail": rasam.get_all(
        "a[href^='mailto:']", "href",
        pipe=[
            lambda x: x.split("mailto:")[1],
            lambda x: x.endswith("@gmail.com")
        ]),
    "json_ld_metadata": rasam.get(
        "script[type='application/ld+json']", ">inner_markup",
        pipe=json.loads)
}
print(json.dumps(blog, indent=2))
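The options described in the list above that this example does not exercise are flag, fallback, and the >rasam attribute. A minimal sketch of them follows, assuming the same example domain; the h2 selector is hypothetical and may not match anything on the page, and >rasam is assumed (as its name suggests) to return a new AwesomeRasam wrapping the matched element:
from awesome_rasam import AwesomeRasam

rasam = AwesomeRasam("https://1upkd.com")

# No exception if <h2> is missing: flag=False suppresses it,
# and fallback supplies the value to return instead
subtitle = rasam.get("h2", ">text", flag=False, fallback="N/A")

# ">rasam" yields a new AwesomeRasam for the matched element,
# so the same get()/get_all() API works on just that subtree
body = rasam.get("body", ">rasam")
body_links = body.get_all("a", "href")
print(subtitle, len(body_links))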
Ultimate flex
import json
import random
from awesome_rasam import AwesomeRasam
def parse_blog(rasam):
    return {
        "page_title": rasam.get("title", ">text"),
        "heading": rasam.get("h1", ">text"),
        "author": rasam.get(".title p>b", ">text"),
        "date": rasam.get(".title p>span", ">text",
                          pipe=lambda x: x.replace("\n", "").strip()),
        "links": rasam.get_all("a", "href"),
        "linked_emails": list(set(rasam.get_all(
            "a[href^='mailto:']", "href",
            pipe=lambda x: x.split("mailto:")[1]))),
        "linked_emails_are_gmail": rasam.get_all(
            "a[href^='mailto:']", "href",
            pipe=[
                lambda x: x.split("mailto:")[1],
                lambda x: x.endswith("@gmail.com")
            ]),
        "json_ld_metadata": rasam.get(
            "script[type='application/ld+json']", ">inner_markup",
            pipe=json.loads)
    }

rasam = AwesomeRasam("https://1upkd.com")
data = {
    "page_title": rasam.get("title", ">text"),
    "blogs": rasam.get_all("#blogs ~ a", "href", pipe=[
        lambda url: AwesomeRasam(
            "https://1upkd.com/" + url,
            delay=random.randint(1, 5)),
        parse_blog
    ])
}
print(json.dumps(data, indent=2))
Note: The delay argument can be passed when initializing with a URL, to delay the request by that many seconds. It can also be a function which returns the number of seconds.
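A minimal sketch of the callable form (the 1-3 second range here is an arbitrary choice for illustration):
import random
from awesome_rasam import AwesomeRasam

# Wait a random 1-3 seconds before the request is made
rasam = AwesomeRasam("https://1upkd.com", delay=lambda: random.uniform(1, 3))
print(rasam.get("title", ">text"))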