An Easy Scraper for HTML

easy-scraper-py

An easy scraping tool for HTML

Goal

Re-implementation of tanakh/easy-scraper in Python.

Install from PyPI

   pip install easy-scraper-py

Usage Example

Scraping texts

<!-- Target: full or partial HTML code -->
<body>
    <b>NotMe</b>
    <a class=here>Here</a>
    <a class=nothere>NotHere</a>
</body>

<!-- Pattern: partial HTML with variables ({ name }) -->
<a class=here>{ text }</a>

import easy_scraper

target = r"""<body>
    <b>NotMe</b>
    <a class=here>Here</a>
    <a class=nothere>NotHere</a>
</body>
"""  # newlines and spaces are all ignored.

# Matching innerText under a-tag with class="here"
pattern = "<a class=here>{ text }</a>"

easy_scraper.match(target, pattern)  # [{'text': 'Here'}]
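
For comparison, here is a minimal sketch of what the one-line pattern above replaces when written by hand with the standard library's html.parser. This is not how easy-scraper-py is implemented; the class name HereTextCollector and its attributes are made up for this illustration.

```python
from html.parser import HTMLParser


class HereTextCollector(HTMLParser):
    """Hypothetical helper: collect the text of every <a class="here">."""

    def __init__(self):
        super().__init__()
        self.inside = False  # True while we are inside a matching <a>
        self.buffer = []     # text chunks of the current match
        self.results = []    # one dict per matched element

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; unquoted values work too.
        if tag == "a" and dict(attrs).get("class") == "here":
            self.inside = True
            self.buffer = []

    def handle_data(self, data):
        if self.inside:
            self.buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self.inside:
            self.results.append({"text": "".join(self.buffer).strip()})
            self.inside = False


target = """<body>
    <b>NotMe</b>
    <a class=here>Here</a>
    <a class=nothere>NotHere</a>
</body>
"""

collector = HereTextCollector()
collector.feed(target)
collector.close()
print(collector.results)  # [{'text': 'Here'}]
```

The hand-written version needs a new class for every shape of data, whereas the pattern approach only needs a different pattern string.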

Scraping links

target = r"""
<div>
    <div class=here>
        <a href="link1">foo</a>
        <a href="link2">bar</a>
        <a>This is not a link.</a>
        <div>
            <a href="link3">baz</a>
        </div>
    </div>
    <div class=nothere>
        <a href="link4">bazzz</a>
    </div>
</div>
"""

# Matching links (href and innerText) under div-tag with class="here"
pattern = r"""
    <div class=here>
        <a href="{ link }">{ text }</a>
    </div>
"""

assert easy_scraper.match(target, pattern) == [
    {"link": "link1", "text": "foo"},
    {"link": "link2", "text": "bar"},
    {"link": "link3", "text": "baz"},
]
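
The matched href values are returned as written in the HTML, so relative links stay relative. A small sketch of resolving them with urllib.parse.urljoin follows. The base URL is a made-up placeholder, and the match result is copied in by hand so the snippet runs without easy-scraper-py installed.

```python
from urllib.parse import urljoin

# Result of the match above, copied here so the snippet runs on its own.
matches = [
    {"link": "link1", "text": "foo"},
    {"link": "link2", "text": "bar"},
    {"link": "link3", "text": "baz"},
]

# Hypothetical address of the page the HTML was fetched from.
base_url = "https://example.com/docs/index.html"

# Build new dicts with the link resolved against the page address.
absolute = [{**m, "link": urljoin(base_url, m["link"])} for m in matches]

for m in absolute:
    print(m["text"], m["link"])
```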

Scraping RSS (XML)

easy-scraper-py uses html.parser for parsing, so it can also handle most XML documents.

import easy_scraper
import urllib.request

body = urllib.request.urlopen("https://kuragebunch.com/rss/series/10834108156628842505").read().decode()
res = easy_scraper.match(body, "<item><title>{ title }</title><link>{ link }</link></item>")
for item in res[:5]:
    print(item)
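
To see why an HTML parser copes with RSS, the sketch below feeds a made-up RSS fragment straight to the standard library's html.parser and logs what it reports. It only illustrates the parser's behaviour and says nothing about the internals of easy-scraper-py; the TagLogger class and the sample feed are assumptions made for this example.

```python
from html.parser import HTMLParser


class TagLogger(HTMLParser):
    """Hypothetical helper: record start tags, end tags and non-blank text."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        if data.strip():
            self.events.append(("data", data.strip()))


# Made-up RSS fragment for illustration only.
rss = """<item>
  <title>Episode 1</title>
  <link>https://example.com/ep1</link>
  <pubDate>Mon, 01 Jan 2024 00:00:00 GMT</pubDate>
</item>"""

logger = TagLogger()
logger.feed(rss)
logger.close()

for event in logger.events:
    print(event)
```

Note that html.parser lowercases tag names, so pubDate is reported as pubdate. It is also not a validating XML parser, which is presumably why the support is described as covering most XML rather than all of it.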

Scraping Images

import easy_scraper
import urllib.request

url = "https://unsplash.com/s/photos/sample"
body = urllib.request.urlopen(url).read().decode()

# Matching all images
res = easy_scraper.match(body, r"<img src='{ im }' />")
print(res)

# Matching linked (under a-tag) images
res = easy_scraper.match(body, r"<a href='{ link }'><img src='{ im }' /></a>")
print(res)
