An Easy Scraper for HTML
easy-scraper-py
An easy scraping tool for HTML
Goal
Re-implementation of tanakh/easy-scraper in Python.
Install from PyPI
pip install easy-scraper-py
Usage Example
Scraping texts
<!-- Target: full or partial HTML code -->
<body>
<b>NotMe</b>
<a class=here>Here</a>
<a class=nothere>NotHere</a>
</body>
<!-- Pattern: partial HTML with variables ({{ name }}) -->
<a class=here>{{ text }}</a>
import easy_scraper
target = r"""<body>
<b>NotMe</b>
<a class=here>Here</a>
<a class=nothere>NotHere</a>
</body>
""" # newlines and spaces are all ignored.
# Matching innerText under a-tag with class="here"
pattern = "<a class=here>{{ text }}</a>"
easy_scraper.match(target, pattern) # [{'text': 'Here'}]
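To make the {{ name }} variable syntax concrete, here is a minimal sketch that lists the variable names appearing in a pattern string. It is not how easy-scraper-py works internally; the regular expression and the helper name variable_names are assumptions made up for illustration, and only the standard library is used.

```python
import re

# Assumed placeholder grammar (not taken from the library):
# "{{", optional spaces, an identifier, optional spaces, "}}".
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def variable_names(pattern):
    """Return the placeholder names in the order they appear in a pattern."""
    return PLACEHOLDER.findall(pattern)


names = variable_names('<a href="{{ link }}">{{ text }}</a>')
print(names)
```

Each name found this way corresponds to a key in the dictionaries returned by easy_scraper.match, such as 'text' in the example above.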
Scraping links
import easy_scraper
target = r"""
<div>
<div class=here>
<a href="link1">foo</a>
<a href="link2">bar</a>
<a>This is not a link.</a>
<div>
<a href="link3">baz</a>
</div>
</div>
<div class=nothere>
<a href="link4">bazzz</a>
</div>
</div>
"""
# Matching links (href and innerText) under div-tag with class="here"
pattern = r"""
<div class=here>
<a href="{{ link }}">{{ text }}</a>
</div>
"""
assert easy_scraper.match(target, pattern) == [
{"link": "link1", "text": "foo"},
{"link": "link2", "text": "bar"},
{"link": "link3", "text": "baz"},
]
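The matched links are returned exactly as written in the HTML, so relative values such as "link1" usually need to be resolved against the page URL before they can be fetched. A minimal sketch of that step, assuming a made-up base URL (BASE_URL) and a hypothetical helper named absolutize, could look like this:

```python
from urllib.parse import urljoin

# Hypothetical URL of the page the target HTML came from (an assumption).
BASE_URL = "https://example.com/articles/"

# Same shape as the list returned by easy_scraper.match in the example above.
matches = [
    {"link": "link1", "text": "foo"},
    {"link": "link2", "text": "bar"},
    {"link": "link3", "text": "baz"},
]


def absolutize(rows, base_url, key="link"):
    """Return copies of rows with rows[key] resolved against base_url."""
    return [{**row, key: urljoin(base_url, row[key])} for row in rows]


resolved = absolutize(matches, BASE_URL)
for row in resolved:
    print(row["text"], row["link"])
```

The helper builds new dictionaries instead of modifying the match results, so the original list stays available for comparison.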
Scraping RSS (XML)
easy-scraper-py parses its input with html.parser, so it can also handle most XML, such as RSS feeds.
import easy_scraper
import urllib.request
body = urllib.request.urlopen("https://kuragebunch.com/rss/series/10834108156628842505").read().decode()
res = easy_scraper.match(body, "<item><title>{{ title }}</title><link>{{ link }}</link></item>")
for item in res[:5]:
    print(item)
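To see why an HTML parser copes with RSS, the following sketch feeds a small hand-written feed directly to the standard library's html.parser. It only illustrates the parser's behaviour and is not the code easy-scraper-py runs; the ItemCollector class and the sample feed are assumptions invented for this example.

```python
from html.parser import HTMLParser


class ItemCollector(HTMLParser):
    """Collect <title> and <link> text inside each <item> (illustrative only)."""

    def __init__(self):
        super().__init__()
        self.items = []        # one dict per completed <item>
        self._current = None   # dict being filled, or None outside <item>
        self._field = None     # name of the field currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "item":
            self._current = {}
        elif self._current is not None and tag in ("title", "link"):
            self._field = tag
            self._current[tag] = ""

    def handle_data(self, data):
        # Text may arrive in more than one chunk, so append instead of assign.
        if self._current is not None and self._field is not None:
            self._current[self._field] += data

    def handle_endtag(self, tag):
        if tag == self._field:
            self._field = None
        elif tag == "item" and self._current is not None:
            self.items.append(self._current)
            self._current = None


# Hand-written sample feed (an assumption, not real data).
rss = """<rss><channel><title>Feed</title>
<item><title>Episode 1</title><link>https://example.com/1</link></item>
<item><title>Episode 2</title><link>https://example.com/2</link></item>
</channel></rss>"""

parser = ItemCollector()
parser.feed(rss)
parser.close()
print(parser.items)
```

html.parser reports tag names in lowercase and does not validate the document, which is one reason XML that relies on case-sensitive names or other strict XML features may not round-trip exactly.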
Scraping Images
import easy_scraper
import urllib.request
url = "https://unsplash.com/s/photos/sample"
body = urllib.request.urlopen(url).read().decode()
# Matching all images
res = easy_scraper.match(body, r"<img src='{{ im }}' />")
print(res)
# Matching linked (under a-tag) images
res = easy_scraper.match(body, r"<a href='{{ link }}'><img src='{{ im }}' /></a>")
print(res)