An Easy Scraper for HTML
easy-scraper-py
An easy scraping tool for HTML
Goal
Re-implementation of tanakh/easy-scraper in Python.
Install from PyPI
pip install easy-scraper-py
Usage Example
Scraping texts
<!-- Target: full or partial HTML code -->
<body>
<b>NotMe</b>
<a class=here>Here</a>
<a class=nothere>NotHere</a>
</body>
<!-- Pattern: partial HTML with variables ({ name }) -->
<a class=here>{ text }</a>
import easy_scraper
target = r"""<body>
<b>NotMe</b>
<a class=here>Here</a>
<a class=nothere>NotHere</a>
</body>
""" # newlines and spaces are all ignored.
# Matching innerText under a-tag with class="here"
pattern = "<a class=here>{ text }</a>"
easy_scraper.match(target, pattern) # [{'text': 'Here'}]
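To make the idea behind the pattern concrete, here is a minimal hand-written sketch that extracts the same text using only the standard-library html.parser. It is not how easy-scraper-py is implemented; the names TextCollector and collect_text are hypothetical and exist only for this illustration.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collect the text of every <tag> whose class attribute equals class_name."""

    def __init__(self, tag, class_name):
        super().__init__()
        self.tag = tag
        self.class_name = class_name
        self.depth = 0      # > 0 while inside a matching element
        self.buffer = []    # text pieces of the current match
        self.results = []   # one {"text": ...} dict per match

    def handle_starttag(self, tag, attrs):
        if self.depth:
            # Track nested tags of the same name so the right end tag closes the match.
            if tag == self.tag:
                self.depth += 1
        elif tag == self.tag and dict(attrs).get("class") == self.class_name:
            self.depth = 1
            self.buffer = []

    def handle_endtag(self, tag):
        if self.depth and tag == self.tag:
            self.depth -= 1
            if self.depth == 0:
                self.results.append({"text": "".join(self.buffer).strip()})

    def handle_data(self, data):
        if self.depth:
            self.buffer.append(data)


def collect_text(html, tag, class_name):
    """Hypothetical helper: return [{'text': ...}] for each matching element."""
    parser = TextCollector(tag, class_name)
    parser.feed(html)
    parser.close()
    return parser.results


target = """<body>
<b>NotMe</b>
<a class=here>Here</a>
<a class=nothere>NotHere</a>
</body>
"""
print(collect_text(target, "a", "here"))  # [{'text': 'Here'}]
```

The one-line pattern replaces all of this bookkeeping, which is the main convenience of the pattern-based approach.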
Scraping links
target = r"""
<div>
<div class=here>
<a href="link1">foo</a>
<a href="link2">bar</a>
<a>This is not a link.</a>
<div>
<a href="link3">baz</a>
</div>
</div>
<div class=nothere>
<a href="link4">bazzz</a>
</div>
</div>
"""
# Matching links (href and innerText) under div-tag with class="here"
pattern = r"""
<div class=here>
<a href="{ link }">{ text }</a>
</div>
"""
assert easy_scraper.match(target, pattern) == [
    {"link": "link1", "text": "foo"},
    {"link": "link2", "text": "bar"},
    {"link": "link3", "text": "baz"},
]
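The captured href values are returned as written in the HTML, so relative links usually need to be resolved before they can be fetched. A minimal sketch using the standard-library urllib.parse.urljoin follows; the base URL is a hypothetical placeholder, and the matches are copied from the assertion above.

```python
from urllib.parse import urljoin

# Matches copied from the assertion above.
matches = [
    {"link": "link1", "text": "foo"},
    {"link": "link2", "text": "bar"},
    {"link": "link3", "text": "baz"},
]

# Hypothetical URL of the page the HTML was fetched from.
base_url = "https://example.com/articles/"

# Build new dicts with each link resolved against the base URL.
absolute = [{**m, "link": urljoin(base_url, m["link"])} for m in matches]
for item in absolute:
    print(item["text"], item["link"])
```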
Scraping RSS (XML)
easy-scraper-py uses html.parser for parsing, so it can also parse most XML.
import easy_scraper
import urllib.request
body = urllib.request.urlopen("https://kuragebunch.com/rss/series/10834108156628842505").read().decode()
res = easy_scraper.match(body, "<item><title>{ title }</title><link>{ link }</link></item>")
for item in res[:5]:
    print(item)
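To see what html.parser itself reports for XML input, the sketch below logs the parser events for a small RSS-like fragment. It exercises only the standard library, not easy-scraper-py's internals; EventLogger and the sample fragment are hypothetical. Note that html.parser lowercases tag names, which case-sensitive XML may not expect.

```python
from html.parser import HTMLParser


class EventLogger(HTMLParser):
    """Record start tags, end tags, and non-blank text as (kind, value) tuples."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        if data.strip():
            self.events.append(("data", data.strip()))


# Hypothetical RSS-like fragment, not taken from a real feed.
xml = (
    "<item>"
    "<link>https://example.com/ep1</link>"
    "<pubDate>Mon, 01 Jan 2024</pubDate>"
    "</item>"
)

logger = EventLogger()
logger.feed(xml)
logger.close()
for event in logger.events:
    print(event)
```

In the printed events, the pubDate element appears as pubdate.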
Scraping Images
import easy_scraper
import urllib.request
url = "https://unsplash.com/s/photos/sample"
body = urllib.request.urlopen(url).read().decode()
# Matching all images
res = easy_scraper.match(body, r"<img src='{ im }' />")
print(res)
# Matching linked (under a-tag) images
res = easy_scraper.match(body, r"<a href='{ link }'><img src='{ im }' /></a>")
print(res)
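The examples above call urlopen directly and decode with the default codec. Some servers may respond differently to urllib's default User-Agent or serve a non-UTF-8 charset, so a slightly more defensive fetch can help. The following is a sketch under those assumptions; build_request, fetch, and the User-Agent string are hypothetical names chosen for illustration.

```python
import urllib.request


def build_request(url, user_agent="Mozilla/5.0 (compatible; example-scraper/0.1)"):
    """Return a Request carrying an explicit User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch(url, timeout=10):
    """Download url and decode it with the charset the server declares."""
    with urllib.request.urlopen(build_request(url), timeout=timeout) as response:
        # Fall back to UTF-8 when the response does not declare a charset.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

The returned string can then be passed to easy_scraper.match in place of body, for example body = fetch(url).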