Project description

Write scraping rules, get dictionaries.

scrapedict is a Python module that simplifies writing web scraping code. You describe the fields you want as a dictionary of rules, and it returns the extracted data as dictionaries. The goal is scrapers that are easy to read, adapt, and maintain.

Features

The rules dictionary is straightforward and easy to read
Once you define the rules for one item, the same rules can extract every matching item on a page
You get ✨dictionaries✨ of the data you want

Installation

$ pip install scrapedict

Usage

import scrapedict as sd
from urllib.request import urlopen

# Fetch the content from the Urban Dictionary page for "larping"
url = "https://www.urbandictionary.com/define.php?term=larping"
content = urlopen(url).read().decode()

# Define the fields to be extracted
fields = {
    "word": sd.text(".word"),
    "meaning": sd.text(".meaning"),
    "example": sd.text(".example"),
}

# Extract the data using scrapedict
item = sd.extract(fields, content)

# The result is a dictionary with the word, its meaning, and an example usage.
# Here, we perform a couple of assertions to demonstrate the expected structure and content.
assert isinstance(item, dict)
assert item["word"] == "Larping"
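
The fetch step above calls urlopen directly, which sends urllib's default user agent and decodes the body as UTF-8. Some sites may reject that user agent or serve a different charset, so the following is a minimal sketch of a fetch helper built only on the standard library. The names build_request, fetch_html, and USER_AGENT are hypothetical and not part of scrapedict.

```python
from urllib.request import Request, urlopen

# Hypothetical user agent string; replace it with one that identifies your scraper.
USER_AGENT = "scrapedict-example/0.1"


def build_request(url: str) -> Request:
    """Create a GET request that carries an explicit User-Agent header."""
    return Request(url, headers={"User-Agent": USER_AGENT})


def fetch_html(url: str, timeout: float = 10.0) -> str:
    """Download a page and decode it with the charset the server declares."""
    with urlopen(build_request(url), timeout=timeout) as response:
        # Fall back to UTF-8 when the Content-Type header names no charset.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

With this helper, content = fetch_html(url) could stand in for the urlopen(url).read().decode() line, and the rest of the example stays the same.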

The orange site example

import scrapedict as sd
from urllib.request import urlopen

# Fetch the content from the Hacker News homepage
url = "https://news.ycombinator.com/"
content = urlopen(url).read().decode()

# Define the fields to extract: title and URL for each news item
fields = {
    "title": sd.text(".titleline a"),
    "url": sd.attr(".titleline a", "href"),
}

# Use scrapedict to extract all news items as a list of dictionaries
items = sd.extract_all(".athing", fields, content)

# The result is a list of dictionaries, each containing the title and URL of a news item.
# Here, we assert that 30 items are extracted, which is the typical number of news items on the Hacker News homepage.
assert len(items) == 30
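
Because the extracted items are plain dictionaries, they can be post-processed with ordinary Python. The sketch below assumes that some href values come back relative to the site root (which may be the case for self posts on Hacker News) and resolves them against the page URL. The absolutize helper and the sample data are hypothetical and only mimic the shape of the list produced above.

```python
from urllib.parse import urljoin

BASE_URL = "https://news.ycombinator.com/"


def absolutize(items: list[dict], base_url: str = BASE_URL) -> list[dict]:
    """Return copies of the items with their "url" field made absolute."""
    return [{**item, "url": urljoin(base_url, item["url"])} for item in items]


# Hypothetical sample shaped like the output of sd.extract_all above.
sample = [
    {"title": "An external story", "url": "https://example.com/story"},
    {"title": "Ask HN: A self post", "url": "item?id=1"},
]
resolved = absolutize(sample)
```

The helper builds new dictionaries instead of mutating its input, so the raw scraped items remain available for debugging.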

Development

Dependencies are managed with Poetry.

Testing is done with Tox.

Release history

0.3.0 (this version): Nov 16, 2023
0.2.1: Nov 14, 2023
0.2.0: Nov 13, 2023
0.1.1: Oct 28, 2020

Download files

Source Distribution: scrapedict-0.3.0.tar.gz (2.4 kB), uploaded Nov 16, 2023

Built Distribution: scrapedict-0.3.0-py3-none-any.whl (2.7 kB), uploaded Nov 16, 2023, Python 3

Hashes for scrapedict-0.3.0.tar.gz

Algorithm	Hash digest
SHA256	`0ab26e1f294ece1627a5651cbc37f92d8a342c0297879f6705fabc8c620a6bbc`
MD5	`930d64862fed741b622bd73d495e88cb`
BLAKE2b-256	`f87115cfeff94c649c1f43444724fc9f60473a544dc7331f1cfb5be8f2f81030`

Hashes for scrapedict-0.3.0-py3-none-any.whl

Algorithm	Hash digest
SHA256	`d49e3aa43ed8a7a513f09f984bbd898c21087e9a05f87d76becb7c14796f1702`
MD5	`29157c33c3deff0ffbeccc7197ca8879`
BLAKE2b-256	`332a5633b943285e19b2a5ff69713b3ab2c526b40f44ebe492592254a3ab21f9`