
Purifier

A simple scraping library.

It allows you to easily create simple and concise scrapers, even when the input is quite messy.

Example usage

Extract titles and URLs of articles from Hacker News:

from purifier import request, html, xpath, maps, fields, one

scraper = (
    request()
    | html()
    | xpath('//a[@class="titlelink"]')
    | maps(
        fields(
            title=xpath("text()") | one(),
            url=xpath("@href") | one(),
        )
    )
)

result = scraper.scrape("https://news.ycombinator.com")
result == [
     {
         "title": "Why Is the Web So Monotonous? Google",
         "url": "https://reasonablypolymorphic.com/blog/monotonous-web/index.html",
     },
     {
         "title": "Old jokes",
         "url": "https://dynomight.net/old-jokes/",
     },
     ...
]

Tutorial

The simplest possible scraper consists of a single action:

scraper = request()
result = scraper.scrape("https://news.ycombinator.com")
result == (
    '<html lang="en" op="news"><head><meta name="referrer" content="origin">...'
)

As you can see, this scraper returns the HTTP response as a string. To do something useful with it, connect it to another scraper:

scraper = request() | html()
result == <Element html at 0x7f1be2193e00>

The | ("pipe") operator takes the output of one action and passes it to the next. The html action parses the HTML, so you can then query it with xpath:

scraper = (
    request()
    | html()
    | xpath('//a[@class="titlelink"]/text()')
)
result == [
    "C99 doesn't need function bodies, or 'VLAs are Turing complete'",
    "Quaise Energy is working to create geothermal wells",
    ...
]
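This style of composition is typically built on Python's __or__ operator. A minimal sketch of the pattern (an illustration of the idea, not purifier's actual implementation):

```python
# Minimal sketch of pipe composition via __or__.
# Illustrative only; purifier's real actions do more than this.
class Step:
    def __init__(self, fn):
        self.fn = fn

    def __or__(self, other):
        # Feed this step's output into the next step.
        return Step(lambda value: other.fn(self.fn(value)))

    def scrape(self, value):
        return self.fn(value)

pipeline = Step(str.strip) | Step(str.upper)
pipeline.scrape("  hello  ")  # → "HELLO"
```

Each `|` produces a new composed step, so pipelines stay immutable and reusable.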

Alternatively, instead of using "/text()" at the end of the XPath, you could use maps with xpath and one:

scraper = (
    request()
    | html()
    | xpath('//a[@class="titlelink"]')
    | maps(xpath('text()') | one())
)
result == [
    "Why Is the Web So Monotonous? Google",
    "Old jokes",
    ...
]
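Judging by the examples, one collapses a single-element list into its element. A rough sketch of that behavior (purifier's actual one() may handle the error case differently):

```python
# Rough sketch of what one appears to do: unwrap a one-element list.
# Illustrative only; not purifier's actual implementation.
def one_sketch(items):
    if len(items) != 1:
        raise ValueError(f"expected exactly one item, got {len(items)}")
    return items[0]

one_sketch(["Old jokes"])  # → "Old jokes"
```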

maps ("map scraper") applies a scraper to each element of its input, which makes it a powerful building block. For example, combine it with fields, and the result will look a bit different:

scraper = (
    request()
    | html()
    | xpath('//a[@class="titlelink"]')
    | maps(
        fields(title=xpath('text()') | one())
    )
)
result == [
    {"title": "Why Is the Web So Monotonous? Google"},
    {"title": "Old jokes"},
    ...
]
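Conceptually, maps works like Python's built-in map: it runs an inner scraper over every element of a list. A rough sketch of that idea (illustrative only, not the library's code):

```python
# Rough sketch of what maps does: apply an inner scraper to each item.
# Illustrative only; not purifier's actual implementation.
def maps_sketch(inner):
    def run(items):
        return [inner(item) for item in items]
    return run

maps_sketch(str.strip)(["  Old jokes  ", "  C99  "])
# → ["Old jokes", "C99"]
```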

fields constructs a dictionary, allowing you to name things and also to extract multiple different things from a single input:

scraper = (
    request()
    | html()
    | xpath('//a[@class="titlelink"]')
    | maps(
        fields(
            title=xpath('text()') | one(),
            url=xpath('@href') | one(),
        )
    )
)
result == [
     {
         "title": "Why Is the Web So Monotonous? Google",
         "url": "https://reasonablypolymorphic.com/blog/monotonous-web/index.html",
     },
     {
         "title": "Old jokes",
         "url": "https://dynomight.net/old-jokes/",
     },
     ...
]
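Conceptually, fields runs one scraper per keyword argument against the same input and collects the outputs into a dictionary. A rough sketch of that idea (illustrative, not purifier's code):

```python
# Rough sketch of what fields does: one scraper per key, same input,
# results collected into a dict. Not purifier's actual implementation.
def fields_sketch(**scrapers):
    def run(item):
        return {name: scrape(item) for name, scrape in scrapers.items()}
    return run

fields_sketch(title=str.title, length=len)("old jokes")
# → {"title": "Old Jokes", "length": 9}
```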

