
Project description

Purifier

A simple scraping library.

It allows you to easily create simple and concise scrapers, even when the input is quite messy.

Example usage

Extract titles and URLs of articles from Hacker News:

from purifier import request, html, xpath, maps, fields, one

scraper = (
    request()
    | html()
    | xpath('//a[@class="titlelink"]')
    | maps(
        fields(
            title=xpath("text()") | one(),
            url=xpath("@href") | one(),
        )
    )
)

result = scraper.scrape("https://news.ycombinator.com")
result == [
     {
         "title": "Why Is the Web So Monotonous? Google",
         "url": "https://reasonablypolymorphic.com/blog/monotonous-web/index.html",
     },
     {
         "title": "Old jokes",
         "url": "https://dynomight.net/old-jokes/",
     },
     ...
]

Tutorial

The simplest possible scraper consists of a single action:

scraper = request()
result = scraper.scrape("https://news.ycombinator.com")
result == (
    '<html lang="en" op="news"><head><meta name="referrer" content="origin">...'
)

As you can see, this scraper returns the HTTP response body as a string. To do something useful with it, connect it to another scraper:

scraper = request() | html()
result == <Element html at 0x7f1be2193e00>

The | ("pipe") operator takes the output of one action and passes it to the next one. The html action parses the HTML, so you can then query it with xpath:

scraper = (
    request()
    | html()
    | xpath('//a[@class="titlelink"]/text()')
)
result == [
    "C99 doesn't need function bodies, or 'VLAs are Turing complete'",
    "Quaise Energy is working to create geothermal wells",
    ...
]
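The pipe style above can be understood as plain function composition. Here is a minimal sketch of how such an operator might work — an illustration only, not the library's actual implementation; `Action`, `strip`, and `words` are hypothetical stand-ins:

```python
# A minimal pipe-composition sketch: each action wraps a function,
# and `|` chains them so one action's output feeds the next.
class Action:
    def __init__(self, fn):
        self.fn = fn

    def __or__(self, other):
        # Build a new action that runs self first, then other.
        return Action(lambda value: other.fn(self.fn(value)))

    def scrape(self, value):
        return self.fn(value)

# Two toy "actions": strip whitespace, then split into words.
strip = Action(str.strip)
words = Action(str.split)

scraper = strip | words
scraper.scrape("  hello scraping world  ")
# == ['hello', 'scraping', 'world']
```

Because `__or__` returns another `Action`, pipes of any length compose the same way.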

Alternatively, instead of using "/text()" at the end of the XPath, you could use maps with xpath and one:

scraper = (
    request()
    | html()
    | xpath('//a[@class="titlelink"]')
    | maps(xpath('text()') | one())
)
result == [
    "Why Is the Web So Monotonous? Google",
    "Old jokes",
    ...
]

maps ("map scraper") applies a scraper to each element of its input, which makes it a powerful building block. For example, combine it with fields, and the result will look a bit different:

scraper = (
    request()
    | html()
    | xpath('//a[@class="titlelink"]')
    | maps(
        fields(title=xpath('text()') | one())
    )
)
result == [
    {"title": "Why Is the Web So Monotonous? Google"},
    {"title": "Old jokes"},
    ...
]
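The behavior of maps is essentially that of a list comprehension over its input. A sketch under that assumption (this `maps` and `first_word` are illustrative, not the library's code):

```python
# maps(inner): apply the inner scraper to every element of a list input.
def maps(inner):
    def run(items):
        return [inner(item) for item in items]
    return run

# A toy inner scraper: take the first word of a string.
def first_word(s):
    return s.split()[0]

scraper = maps(first_word)
scraper(["Old jokes", "Why Is the Web So Monotonous?"])
# == ['Old', 'Why']
```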

fields constructs a dictionary, allowing you to name things and also to extract multiple different things from a single input:

scraper = (
    request()
    | html()
    | xpath('//a[@class="titlelink"]')
    | maps(
        fields(
            title=xpath('text()') | one(),
            url=xpath('@href') | one(),
        )
    )
)
result == [
     {
         "title": "Why Is the Web So Monotonous? Google",
         "url": "https://reasonablypolymorphic.com/blog/monotonous-web/index.html",
     },
     {
         "title": "Old jokes",
         "url": "https://dynomight.net/old-jokes/",
     },
     ...
]
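Conceptually, fields runs each named sub-scraper on the same input and collects the results into a dictionary keyed by the argument names. A minimal sketch of that idea, assuming dict-like inputs for illustration (not the library's actual implementation):

```python
# fields(**named): run every named sub-scraper on the same input
# and gather the results into one dictionary.
def fields(**named):
    def run(value):
        return {name: sub(value) for name, sub in named.items()}
    return run

record = fields(
    title=lambda item: item["t"],
    url=lambda item: item["u"],
)
record({"t": "Old jokes", "u": "https://dynomight.net/old-jokes/"})
# == {'title': 'Old jokes', 'url': 'https://dynomight.net/old-jokes/'}
```

Nesting this inside maps, as in the examples above, yields one dictionary per matched element.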

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

purifier-0.2.0.tar.gz (16.7 kB)

Uploaded Source

Built Distribution

purifier-0.2.0-py3-none-any.whl (16.4 kB)

Uploaded Python 3

File details

Details for the file purifier-0.2.0.tar.gz.

File metadata

  • Download URL: purifier-0.2.0.tar.gz
  • Upload date:
  • Size: 16.7 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.1.14 CPython/3.9.7 Linux/5.13.0-44-generic

File hashes

Hashes for purifier-0.2.0.tar.gz
Algorithm Hash digest
SHA256 e9dac03727c9ab431f6aff39e5313c7d5b83e213ae8213a37eb84fc48086a9f9
MD5 401aece38538ec1e32384c6fe9d5b8d2
BLAKE2b-256 502bc1889fd6a62f899b7b7a3449ace9fa8d336ed493d6888aefaa85aff1ee5b

See more details on using hashes here.

File details

Details for the file purifier-0.2.0-py3-none-any.whl.

File metadata

  • Download URL: purifier-0.2.0-py3-none-any.whl
  • Upload date:
  • Size: 16.4 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.1.14 CPython/3.9.7 Linux/5.13.0-44-generic

File hashes

Hashes for purifier-0.2.0-py3-none-any.whl
Algorithm Hash digest
SHA256 e38d9e834ed46e3d9a07d95a5e6a201767402a15ac913f3b70791178f4101e20
MD5 2b7067cc1f3f00b26c8bbfd3b5b8e6ea
BLAKE2b-256 8bdad16453051e818ee99716bdf43de6f6dea721b418f6509c6e011c71ca75cc

