The simple, fast, and modern web scraping library
About
gazpacho is a simple, fast, and modern web scraping library. The library is stable, actively maintained, and installs with zero dependencies.
Install
Install with pip at the command line:
pip install -U gazpacho
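To confirm the install worked, one option is to ask Python's packaging metadata which version is present. This is a minimal sketch using only the standard library (Python 3.8+); the helper name installed_version is hypothetical and not part of gazpacho.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of a package, or None if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Prints a version string if gazpacho is installed, otherwise None.
print(installed_version('gazpacho'))
```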
Quickstart
Give this a try:
from gazpacho import get, Soup
url = 'https://scrape.world/books'
html = get(url)
soup = Soup(html)
books = soup.find('div', {'class': 'book-'}, partial=True)
def parse(book):
    name = book.find('h4').text
    price = float(book.find('p').text[1:].split(' ')[0])
    return name, price
[parse(book) for book in books]
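The price expression in parse is dense, so here it is on its own as a small sketch. It assumes the price text starts with a single currency symbol and that anything after the first space can be discarded. The helper name parse_price and the sample strings are hypothetical; the text on the real page may differ.

```python
def parse_price(text):
    """Convert a price label such as '$12.99 USD' into a float."""
    # text[1:] drops the leading currency symbol;
    # split(' ')[0] keeps only the first space-separated token.
    return float(text[1:].split(' ')[0])


# Hypothetical sample labels, not taken from the scraped page.
print(parse_price('$12.99 USD'))
print(parse_price('£7.50'))
```

A label without a leading symbol, or with a thousands separator, would need extra handling before the float conversion.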
Tutorial
Import
Import gazpacho following the convention:
from gazpacho import get, Soup
get
Use the get function to download raw HTML:
url = 'https://scrape.world/soup'
html = get(url)
print(html[:50])
# '<!DOCTYPE html>\n<html lang="en">\n <head>\n <met'
Adjust get requests with optional params and headers:
get(
    url='https://httpbin.org/anything',
    params={'foo': 'bar', 'bar': 'baz'},
    headers={'User-Agent': 'gazpacho'}
)
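To see what params and headers amount to on the wire, the sketch below builds a comparable request with the standard library and inspects it without sending anything. It illustrates the general idea of query-string encoding and request headers. It is not a description of how gazpacho implements get, and the helper name build_request is hypothetical.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_request(url, params=None, headers=None):
    """Build (but do not send) a GET request with a query string and headers."""
    if params:
        # Encode the params dict as ?key=value&key=value and append it to the URL.
        url = f"{url}?{urlencode(params)}"
    return Request(url, headers=headers or {})


request = build_request(
    'https://httpbin.org/anything',
    params={'foo': 'bar', 'bar': 'baz'},
    headers={'User-Agent': 'gazpacho'},
)
print(request.full_url)
print(request.header_items())
```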
Soup
Use the Soup wrapper on raw html to enable parsing:
soup = Soup(html)
Soup objects can alternatively be initialized with the .get classmethod:
soup = Soup.get(url)
.find
Use the .find method to target and extract HTML tags:
h1 = soup.find('h1')
print(h1)
# <h1 id="firstHeading" class="firstHeading" lang="en">Soup</h1>
attrs=
Use the attrs argument to isolate tags that contain specific HTML element attributes:
soup.find('div', attrs={'class': 'section-'})
partial=
Element attributes are partially matched by default. Turn this off by setting partial to False:
soup.find('div', {'class': 'soup'}, partial=False)
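The difference between partial and exact matching can be sketched with plain dictionaries. This assumes partial matching means the queried value appears as a substring of the element's attribute value, which is consistent with the 'book-' and 'section-' examples. The function attr_matches is a hypothetical illustration, not gazpacho's own matching code.

```python
def attr_matches(query, actual, partial=True):
    """Return True if every queried attribute matches the element's attributes."""
    for key, wanted in query.items():
        value = actual.get(key)
        if value is None:
            # The element does not carry this attribute at all.
            return False
        if partial:
            # Partial: the queried text only has to appear inside the value.
            if wanted not in value:
                return False
        elif wanted != value:
            # Exact: the whole value has to be identical.
            return False
    return True


element_attrs = {'class': 'section-soup'}  # hypothetical element attributes
print(attr_matches({'class': 'section-'}, element_attrs, partial=True))
print(attr_matches({'class': 'section-'}, element_attrs, partial=False))
```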
mode=
Override the mode argument {'auto', 'first', 'all'} to guarantee return behaviour:
print(soup.find('span', mode='first'))
# <span class="navbar-toggler-icon"></span>
len(soup.find('span', mode='all'))
# 8
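When mode is left on 'auto', calling code may want one predictable shape regardless of how many tags matched. The sketch below assumes 'auto' yields None for no match, a single object for one match, and a list for several. The helper as_list is hypothetical and works on any values, so it is shown here with plain strings standing in for Soup objects.

```python
def as_list(result):
    """Normalize a find result (None, single object, or list) into a list."""
    if result is None:
        return []
    if isinstance(result, list):
        return result
    return [result]


# Plain strings stand in for Soup objects in this illustration.
print(as_list(None))
print(as_list('<span>one</span>'))
print(as_list(['<span>one</span>', '<span>two</span>']))
```

Passing mode='all' to .find is the more direct route when a list is always wanted; a helper like this is mainly useful for results that were already fetched with the default mode.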
dir()
Soup objects have html, tag, attrs, and text attributes:
dir(h1)
# ['attrs', 'find', 'get', 'html', 'strip', 'tag', 'text']
Use them accordingly:
print(h1.html)
# '<h1 id="firstHeading" class="firstHeading" lang="en">Soup</h1>'
print(h1.tag)
# h1
print(h1.attrs)
# {'id': 'firstHeading', 'class': 'firstHeading', 'lang': 'en'}
print(h1.text)
# Soup
Support
If you use gazpacho, consider adding the badge to your project README.md:
[](https://github.com/maxhumber/gazpacho)
Contribute
For feature requests or bug reports, please use GitHub Issues.
For PRs, please read the CONTRIBUTING.md document.
Source Distribution
File details
Details for the file gazpacho-1.1.tar.gz.
File metadata
- Download URL: gazpacho-1.1.tar.gz
- Upload date:
- Size: 7.9 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.2.0 pkginfo/1.5.0.1 requests/2.23.0 setuptools/50.3.0 requests-toolbelt/0.9.1 tqdm/4.46.0 CPython/3.8.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 1579c1be2de05b5ded0a97107b179d12491392fb095aeab185b283ea48cd7010 |
| MD5 | b5f3c09706b6a3c3f0963eb3e888a57e |
| BLAKE2b-256 | 1d653151b3837e9fa0fa535524c56e535f88910c10a3703487d9aead154c1339 |
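A downloaded copy of gazpacho-1.1.tar.gz can be checked against the SHA256 digest listed above. This is a minimal sketch using hashlib; the function names sha256_of and verify are hypothetical, and the local file path is whatever location the archive was saved to.

```python
import hashlib

# SHA256 digest listed for gazpacho-1.1.tar.gz.
EXPECTED_SHA256 = '1579c1be2de05b5ded0a97107b179d12491392fb095aeab185b283ea48cd7010'


def sha256_of(path, chunk_size=65536):
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, 'rb') as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b''):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, expected=EXPECTED_SHA256):
    """Return True if the file at path has the expected SHA256 digest."""
    return sha256_of(path) == expected
```

Calling verify('gazpacho-1.1.tar.gz') on an intact download should return True; a False result suggests a corrupted or different file.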