Declarative, resilient, and typed web scraping with Pydantic and selector fallback chains.
topscrape
Declarative, resilient, and typed web scraping.
Define what you want — topscrape figures out how to get it, even when the site changes.
✨ Why topscrape?
Scrapers break when websites update their HTML. Fixing them means hunting down changed CSS selectors — tedious, repetitive, and always at the worst time.
❌ Standard approach — brittle
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "html.parser")
price = soup.select_one(".price")
if not price:
    price = soup.select_one(".cost")  # hand-written fallback
Manual fallback. Manual debugging. Constant maintenance.
✅ The topscrape Approach
from topscrape import ScraperModel, Field

class Product(ScraperModel):
    title: str = Field(selectors=["h1.title", "h1"])
    price: float = Field(
        selectors=[".product-price", "[data-price]", "//span[@itemprop='price']"],
        transform=lambda v: v.replace("$", "").replace(",", ""),
    )
    image: str = Field(selectors=["img.hero"], attr="src", default="")

product = Product.from_url("https://example.com/item/1")
print(product.price)
If .product-price disappears but [data-price] still works:
- topscrape returns the correct value
- Emits a Selector Drift Warning
- Keeps your scraper alive
That’s resilience by design.
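The fallback behaviour described above can be sketched with the standard library alone. This is a minimal illustration of the idea, not topscrape's implementation: `extract_first`, `DriftWarning`, and the regex patterns are hypothetical names chosen for the example, and regexes stand in for real CSS or XPath selectors.

```python
import re
import warnings
from typing import Optional, Sequence


class DriftWarning(UserWarning):
    """Hypothetical stand-in for a selector drift warning."""


def extract_first(html: str, patterns: Sequence[str], field: str) -> Optional[str]:
    """Try each regex pattern in order and return the first captured group."""
    for index, pattern in enumerate(patterns):
        match = re.search(pattern, html)
        if match is None:
            continue
        if index > 0:
            # A fallback matched: the value is usable, but the primary is stale.
            warnings.warn(
                f"Field {field!r}: primary pattern {patterns[0]!r} failed; "
                f"used fallback {pattern!r}.",
                DriftWarning,
                stacklevel=2,
            )
        return match.group(1)
    return None


html = '<span data-price="19.99">$19.99</span>'
patterns = [
    r'class="product-price">([^<]+)<',  # primary: no longer present in the page
    r'data-price="([^"]+)"',            # fallback: still matches
]

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    price = extract_first(html, patterns, field="price")

print(price)                        # 19.99
print(caught[0].category.__name__)  # DriftWarning
```

The value is still returned when the primary pattern fails, and the warning names the stale pattern so it can be fixed before every fallback is exhausted.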
🚀 Features
| Feature | Description |
|---|---|
| Declarative models | Define fields with Field(selectors=[...]) |
| Selector chains | CSS → XPath → Regex fallback |
| Drift detection | Warns before total breakage |
| Pydantic validation | Strong typing enforced |
| Transforms | Clean data before validation |
| Async ready | from_url_async() supported |
| CLI included | Quick one-off extraction |
📦 Installation
pip install topscrape
Requires Python 3.9+.
⚡ Quick Start
Basic Extraction
from topscrape import ScraperModel, Field

class Article(ScraperModel):
    title: str = Field(selectors=["h1", ".article-title"])
    author: str = Field(selectors=[".byline", "[rel='author']"], default="Unknown")
    content: str = Field(selectors=["article p", ".body-text"])

article = Article.from_html(html_string)
print(article.title)
Fetch From URL
product = Product.from_url("https://example.com/item/1")
print(product.title)
Async Usage
import asyncio

async def main():
    product = await Product.from_url_async("https://example.com/item/1")
    print(product.price)

asyncio.run(main())
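An async entry point is most useful when several pages are fetched at once. The sketch below shows one common pattern for that, using a semaphore to cap concurrency. `fetch_title` is a hypothetical stand-in for a call such as `Product.from_url_async(url)`, so the example runs without network access; `fetch_all` and `limit` are likewise illustrative names.

```python
import asyncio
from typing import List


async def fetch_title(url: str) -> str:
    """Hypothetical stand-in for a call such as Product.from_url_async(url)."""
    await asyncio.sleep(0)  # yield control, as a real network request would
    return f"title for {url}"


async def fetch_all(urls: List[str], limit: int = 5) -> List[str]:
    """Run fetch_title for every URL, with at most `limit` requests in flight."""
    semaphore = asyncio.Semaphore(limit)

    async def bounded(url: str) -> str:
        async with semaphore:
            return await fetch_title(url)

    # gather() returns results in the same order as the input awaitables.
    return await asyncio.gather(*(bounded(u) for u in urls))


urls = [f"https://example.com/item/{i}" for i in range(1, 4)]
titles = asyncio.run(fetch_all(urls))
print(titles)
```

Capping concurrency keeps a batch scrape from sending every request to the target site at once.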
Multiple Values
class Page(ScraperModel):
    tags: list[str] = Field(selectors=[".tag"], multiple=True)
    links: list[str] = Field(selectors=["nav a"], multiple=True, attr="href")
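To make the expected result of a field like `links` concrete, here is a standard-library sketch that collects the `href` of every `<a>` inside a `<nav>`. It only illustrates the "all matches, one attribute each" idea; `LinkCollector` is a hypothetical helper and does not reflect how topscrape resolves selectors internally.

```python
from html.parser import HTMLParser
from typing import List, Optional, Tuple


class LinkCollector(HTMLParser):
    """Collect the href attribute of every <a> tag inside a <nav> element."""

    def __init__(self) -> None:
        super().__init__()
        self.nav_depth = 0
        self.hrefs: List[str] = []

    def handle_starttag(self, tag: str, attrs: List[Tuple[str, Optional[str]]]) -> None:
        if tag == "nav":
            self.nav_depth += 1
        elif tag == "a" and self.nav_depth > 0:
            href = dict(attrs).get("href")
            if href is not None:
                self.hrefs.append(href)

    def handle_endtag(self, tag: str) -> None:
        if tag == "nav" and self.nav_depth > 0:
            self.nav_depth -= 1


html = """
<nav><a href="/home">Home</a><a href="/docs">Docs</a></nav>
<p><a href="/elsewhere">Not in nav</a></p>
"""

collector = LinkCollector()
collector.feed(html)
print(collector.hrefs)  # ['/home', '/docs']
```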
🛡 Drift Detection
If a fallback selector fires:
UserWarning: [Selector Drift] Field 'price':
primary selector '.product-price' failed;
used fallback '[data-price]'.
Catch programmatically:
import warnings
from topscrape import SelectorDriftWarning
with warnings.catch_warnings(record=True) as w:
    warnings.simplefilter("always")
    product = Product.from_url(url)

drifted = [x for x in w if issubclass(x.category, SelectorDriftWarning)]
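In a CI job or a scheduled health check it can be more useful to fail outright when drift occurs. The standard `warnings` module supports this by promoting a warning category to an exception. The sketch below uses a hypothetical `DriftWarning` class and `scrape` function as stand-ins so that it runs on its own; with topscrape installed, the same filter would presumably be applied to `SelectorDriftWarning`.

```python
import warnings


class DriftWarning(UserWarning):
    """Hypothetical stand-in for topscrape's SelectorDriftWarning."""


def scrape() -> str:
    """Stand-in for a scrape in which a fallback selector fired."""
    warnings.warn("[Selector Drift] Field 'price': used fallback", DriftWarning)
    return "19.99"


def scrape_strict() -> str:
    """Run scrape() but raise as soon as a drift warning is emitted."""
    with warnings.catch_warnings():
        # "error" turns matching warnings into exceptions inside this block only.
        warnings.simplefilter("error", category=DriftWarning)
        return scrape()


try:
    scrape_strict()
    drift_detected = False
except DriftWarning as exc:
    drift_detected = True
    print(f"drift detected: {exc}")
```

Because the filter is set inside `catch_warnings()`, the stricter behaviour does not leak into the rest of the program.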
🖥 CLI Usage
topscrape https://example.com "title"
topscrape https://example.com ".price" "[data-price]"
topscrape https://example.com "a.buy-link" --attr href
topscrape https://example.com "li.feature" --all
topscrape https://example.com "h1" --json
🧩 API Reference
Field
| Parameter | Description |
|---|---|
| selectors | Ordered CSS / XPath / Regex list |
| attr | Attribute to extract |
| transform | Pre-validation function |
| default | Fallback value |
| multiple | Return all matches |
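One plausible way these parameters combine is: take the first selector that matches, apply `transform`, convert to the annotated type, and fall back to `default` when nothing matches. The source does not spell out this order, so the sketch below is an assumption for illustration only; `resolve_field`, `raw_matches`, and `cast` are hypothetical names, and the list of raw matches stands in for the result of running each selector.

```python
from typing import Any, Callable, Optional, Sequence


def resolve_field(
    raw_matches: Sequence[Optional[str]],
    cast: Callable[[str], Any] = str,
    transform: Optional[Callable[[str], str]] = None,
    default: Any = None,
) -> Any:
    """Pick the first match, clean it, then convert it to the target type."""
    raw = next((m for m in raw_matches if m is not None), None)
    if raw is None:
        return default        # assumed: default applies when no selector matched
    if transform is not None:
        raw = transform(raw)  # clean-up runs before type conversion
    return cast(raw)


def strip_currency(value: str) -> str:
    """Remove the dollar sign and thousands separators."""
    return value.replace("$", "").replace(",", "")


# The first selector found nothing (None); the second returned a raw string.
price = resolve_field([None, "$1,299.50"], cast=float, transform=strip_currency)
image = resolve_field([None], default="")
print(price, repr(image))  # 1299.5 ''
```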
ScraperModel
| Method | Description |
|---|---|
| from_html | Parse raw HTML |
| from_url | Fetch & parse (sync) |
| from_url_async | Fetch & parse (async) |
| from_selector | Parse existing selector |
👨‍💻 Developer Guide: Run & Contribute via GitHub
Want to run topscrape locally or contribute improvements? Follow this streamlined workflow.
🍴 1. Fork the Repository
- Go to: https://github.com/ronaldgosso/topscrape
- Click Fork
📥 2. Clone Your Fork
git clone https://github.com/<your-username>/topscrape.git
cd topscrape
Add upstream:
git remote add upstream https://github.com/ronaldgosso/topscrape.git
Sync later with:
git fetch upstream
git merge upstream/main
🐍 3. Create Virtual Environment
python -m venv .venv
Activate:
Mac/Linux
source .venv/bin/activate
Windows
.venv\Scripts\activate
📦 4. Install in Editable Mode
pip install -e ".[dev]"
Editable mode makes source changes take effect without reinstalling the package.
🧪 5. Run Tests
pytest
No green tests, no merge.
🧹 6. Lint & Type Check
ruff check .
black .
mypy topscrape/
Clean, typed, consistent.
🌿 7. Create Feature Branch
git checkout -b feature/your-feature
Never commit directly to main.
💾 8. Commit Properly
git commit -m "feat: improve fallback logging"
Conventional commit prefixes:
- feat:
- fix:
- docs:
- refactor:
- test:
🚀 9. Push & Open Pull Request
git push origin feature/your-feature
Then open a Pull Request against the main branch of ronaldgosso/topscrape.
🧠 Development Principles
topscrape prioritizes:
- Resilience over cleverness
- Declarative design
- Type safety
- Drift transparency
Every contribution should reduce brittleness.
🏆 Contribution Standards
Pull requests must:
- Pass CI
- Include tests (if applicable)
- Maintain backward compatibility
- Follow existing style
Quality > Speed.
📄 License
MIT © Ronald Isack Gosso