
Scrape HTML to dictionaries


Write scraping rules, get dictionaries.

scrapedict is a Python module designed to simplify the process of writing web scraping code. The goal is to make scrapers easy to adapt and maintain, with straightforward and readable code.

Features

  • The rules dictionary is straightforward and easy to read
  • Once you define the rules for one item, you can extract multiple items
  • You get ✨dictionaries✨ of the data you want
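To get a feel for what a rules dictionary does, here is a toy re-creation of a `text`-style rule using only the standard library. This is a sketch of the concept, not scrapedict's actual implementation (scrapedict's real selectors are CSS-based and more capable):

```python
from html.parser import HTMLParser


class ClassTextParser(HTMLParser):
    """Collects the text content of the first element with a given class."""

    def __init__(self, cls):
        super().__init__()
        self.cls = cls
        self.depth = 0  # > 0 while inside a matching element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if self.depth:
            self.depth += 1
        elif self.cls in (dict(attrs).get("class") or "").split():
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def text(cls):
    """Return a rule: a function that extracts the text of the first `.cls` element."""
    def rule(html):
        parser = ClassTextParser(cls)
        parser.feed(html)
        return "".join(parser.chunks).strip()
    return rule


def extract(fields, html):
    """Apply each rule in `fields` to `html`, returning a plain dict."""
    return {name: rule(html) for name, rule in fields.items()}


html = '<div class="word">Larping</div><div class="meaning">Live action role playing</div>'
item = extract({"word": text("word"), "meaning": text("meaning")}, html)
# item is a plain dict: {"word": "Larping", "meaning": "Live action role playing"}
```

The point of the design is that a "rule" is just a callable and a result is just a dict, so the scraping logic stays declarative and easy to read.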

Installation

$ pip install scrapedict

Usage

import scrapedict as sd
from urllib.request import urlopen

# Fetch the content from the Urban Dictionary page for "larping"
url = "https://www.urbandictionary.com/define.php?term=larping"
content = urlopen(url).read().decode()

# Define the fields to be extracted
fields = {
    "word": sd.text(".word"),
    "meaning": sd.text(".meaning"),
    "example": sd.text(".example"),
}

# Extract the data using scrapedict
item = sd.extract(fields, content)

# The result is a dictionary with the word, its meaning, and an example usage.
# Here, we perform a couple of assertions to demonstrate the expected structure and content.
assert isinstance(item, dict)
assert item["word"] == "Larping"
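Because the result is a plain dict, it works directly with standard tooling. For example, it serializes with `json` out of the box (here `item` is hard-coded with placeholder values in the shape the example above produces):

```python
import json

# `item` mirrors the dictionary returned by sd.extract above; the values
# are placeholders, not real Urban Dictionary content.
item = {
    "word": "Larping",
    "meaning": "Live action role playing.",
    "example": "He spent the weekend larping in the woods.",
}

# A plain dict serializes directly -- no custom encoder needed.
serialized = json.dumps(item, indent=2)
print(serialized)

# And it round-trips back to an equal dict.
assert json.loads(serialized) == item
```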

The orange site example

import scrapedict as sd
from urllib.request import urlopen

# Fetch the content from the Hacker News homepage
url = "https://news.ycombinator.com/"
content = urlopen(url).read().decode()

# Define the fields to extract: title and URL for each news item
fields = {
    "title": sd.text(".titleline a"),
    "url": sd.attr(".titleline a", "href"),
}

# Use scrapedict to extract all news items as a list of dictionaries
items = sd.extract_all(".athing", fields, content)

# The result is a list of dictionaries, each containing the title and URL of a news item.
# Here, we assert that 30 items are extracted, which is the typical number of news items on the Hacker News homepage.
assert len(items) == 30
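Since `extract_all` returns a list of plain dicts with the same keys as the rules, the output plugs straight into the standard library. A sketch writing such a list to CSV (`items` is hard-coded here in the shape the Hacker News example produces, with placeholder values):

```python
import csv
import io

# `items` stands in for the output of sd.extract_all above: a list of
# dicts whose keys match the `fields` rules.
items = [
    {"title": "Show HN: a placeholder story", "url": "https://example.com/a"},
    {"title": "Ask HN: another placeholder", "url": "https://example.com/b"},
]

# DictWriter maps dict keys to columns, so no manual row-building is needed.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["title", "url"])
writer.writeheader()
writer.writerows(items)
print(buf.getvalue())
```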

Development

Dependencies are managed with Poetry.

Testing is done with Tox.

