Purifier

A simple scraping library.

It allows you to easily create simple and concise scrapers, even when the input is quite messy.

Example usage

Extract titles and URLs of articles from Hacker News:

from purifier import request, html, xpath, maps, fields, one

scraper = (
    request()
    | html()
    | xpath('//a[@class="titlelink"]')
    | maps(
        fields(
            title=xpath("text()") | one(),
            url=xpath("@href") | one(),
        )
    )
)

result = scraper.scrape("https://news.ycombinator.com")
result == [
    {
        "title": "Why Is the Web So Monotonous? Google",
        "url": "https://reasonablypolymorphic.com/blog/monotonous-web/index.html",
    },
    {
        "title": "Old jokes",
        "url": "https://dynomight.net/old-jokes/",
    },
    ...
]
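
To give a feel for how a pipeline chained with `|` can work, here is a toy sketch in plain Python. It is not Purifier's implementation and uses none of its code: `Step`, `each`, `record`, `key`, and `only` are hypothetical stand-ins invented for this illustration, loosely mirroring `maps`, `fields`, `xpath`, and `one` from the example above. It also assumes that `one()` means "exactly one result", which the example suggests but does not state.

```python
# Illustrative only: a toy re-creation of pipe-style composition.
# None of these classes or helpers come from purifier itself.
from typing import Any, Callable


class Step:
    """Wraps a one-argument function so steps can be chained with `|`."""

    def __init__(self, func: Callable[[Any], Any]) -> None:
        self.func = func

    def __or__(self, other: "Step") -> "Step":
        # `a | b` builds a new step that runs a, then feeds its result to b.
        return Step(lambda value: other.func(self.func(value)))

    def __call__(self, value: Any) -> Any:
        return self.func(value)


def each(step: Step) -> Step:
    """Apply `step` to every item of a list (hypothetical analogue of `maps`)."""
    return Step(lambda items: [step(item) for item in items])


def record(**steps: Step) -> Step:
    """Build a dict by running every named step on the same input
    (hypothetical analogue of `fields`)."""
    return Step(lambda value: {name: step(value) for name, step in steps.items()})


def key(name: str) -> Step:
    """Look up `name` in a dict and return a list of matches,
    the way a query would return zero or more results."""
    return Step(lambda mapping: [mapping[name]] if name in mapping else [])


def only() -> Step:
    """Unwrap a single-element list and fail loudly otherwise
    (assumed meaning of `one`)."""

    def unwrap(items: list) -> Any:
        if len(items) != 1:
            raise ValueError(f"expected exactly one item, got {len(items)}")
        return items[0]

    return Step(unwrap)


# Same shape as the scraper above, but running on plain dicts instead of HTML.
pipeline = each(record(title=key("text") | only(), url=key("href") | only()))

links = [{"text": "Old jokes", "href": "https://dynomight.net/old-jokes/"}]
print(pipeline(links))
# [{'title': 'Old jokes', 'url': 'https://dynomight.net/old-jokes/'}]
```

The point of the sketch is the design choice: because every step is a value that supports `|`, small extractors can be combined and nested without writing intermediate variables or loops.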

Installation

pip install purifier

Tutorial

See docs/Tutorial.md

Download files

Source Distribution

purifier-0.2.10.tar.gz (16.0 kB, source)

Built Distribution

purifier-0.2.10-py3-none-any.whl (15.9 kB, Python 3)

File details

Details for the file purifier-0.2.10.tar.gz.

File metadata

  • Download URL: purifier-0.2.10.tar.gz
  • Size: 16.0 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.1.14 CPython/3.9.7 Linux/5.13.0-44-generic

File hashes

Hashes for purifier-0.2.10.tar.gz
Algorithm    Hash digest
SHA256       620e6bf69e64848a0dca8ada734e7fd6b5acbdf1f8f65ce9fa8948694f5fbd90
MD5          772c2efacfcef2a7eb62bf901b9811ec
BLAKE2b-256  933284ab97a986b640b26e2094ae508626e6e32c8e353fa7336580039074ce18

File details

Details for the file purifier-0.2.10-py3-none-any.whl.

File metadata

  • Download URL: purifier-0.2.10-py3-none-any.whl
  • Size: 15.9 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.1.14 CPython/3.9.7 Linux/5.13.0-44-generic

File hashes

Hashes for purifier-0.2.10-py3-none-any.whl
Algorithm    Hash digest
SHA256       4c7d6309d77c0d6f5d8f42f634316deb78600cc60e1e22db31b2881371a4a89f
MD5          976c545f74fe0a2477bdccd192086862
BLAKE2b-256  f96613f7c5f2e723cbf8cab1ecd8a88c7d63cf3cc45262ca9528adba47768fab
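
The SHA256 digests listed above can be compared against a file you have downloaded. The following is a minimal sketch using only the Python standard library. The helper names `sha256_of` and `matches_published` are made up for this example, and it assumes the downloaded file keeps its original file name.

```python
import hashlib
import hmac
from pathlib import Path

# SHA256 digests copied from the file listings above.
PUBLISHED_SHA256 = {
    "purifier-0.2.10.tar.gz": (
        "620e6bf69e64848a0dca8ada734e7fd6b5acbdf1f8f65ce9fa8948694f5fbd90"
    ),
    "purifier-0.2.10-py3-none-any.whl": (
        "4c7d6309d77c0d6f5d8f42f634316deb78600cc60e1e22db31b2881371a4a89f"
    ),
}


def sha256_of(path: Path, chunk_size: int = 65536) -> str:
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published(path: Path) -> bool:
    """Check a downloaded file against the digest published for its name.

    Returns False for file names that are not in PUBLISHED_SHA256.
    """
    expected = PUBLISHED_SHA256.get(path.name)
    if expected is None:
        return False
    return hmac.compare_digest(sha256_of(path), expected)
```

For example, after fetching the archive with `pip download purifier==0.2.10 --no-deps`, calling `matches_published(Path("purifier-0.2.10-py3-none-any.whl"))` should return `True` if the file is intact.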
