smart-scraper
Intelligent web scraping for Python with automatic backend selection.
Chooses the right scraping strategy for every URL — lightweight static fetch, cloud-based JavaScript rendering, or search-powered extraction — so you do not have to think about it.
Features
- Zero-config core — works out of the box with `requests` + `beautifulsoup4`
- Automatic backend selection — static HTML, JS-heavy SPAs, and paywalled content each get the best tool
- Optional cloud backends — Firecrawl and Tavily activate when their SDK and API key are present
- Graceful degradation — always returns a `ScrapeResult`, never raises on network errors
- Clean Markdown output — navigation, footers, scripts, and styles are stripped automatically
- Typed API — full type annotations, compatible with mypy strict mode
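Because of the graceful-degradation contract, error handling needs no try/except. A minimal sketch (the hostname is a deliberately unreachable placeholder, and the exact error string will vary):

```python
from smart_scraper import scrape_url

# An unreachable host still yields a ScrapeResult instead of an exception.
result = scrape_url("https://definitely-not-a-real-host.invalid")
print(result.success)  # False
print(result.error)    # connection-error message; exact text varies
```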
Install
# Core (no API keys required)
pip install smart-scraper
# With Firecrawl (JS-heavy sites)
pip install smart-scraper[firecrawl]
# With Tavily (search + paywalled content)
pip install smart-scraper[tavily]
# Everything
pip install smart-scraper[all]
Quick start
from smart_scraper import scrape_url
# Automatic backend selection — just pass a URL
result = scrape_url("https://docs.python.org/3/library/json.html")
if result.success:
    print(result.title)    # "json — JSON encoder and decoder"
    print(result.content)  # clean Markdown text
else:
    print(result.error)
Backend comparison
| Backend | Best for | Dependencies | Free tier |
|---|---|---|---|
| `basic` | Static HTML, docs, blogs, GitHub raw files | `requests`, `beautifulsoup4` | Unlimited |
| `firecrawl` | SPAs, React/Next.js, social platforms, anti-bot sites | `firecrawl-py` + API key | 500 scrapes |
| `tavily` | Research queries, paywalled/login-walled content | `tavily-python` + API key | 1,000 credits/month |
Automatic selection logic
`scrape_url()` runs the following decision tree before making any network call:

- URL pattern match — raw file hosts (`raw.githubusercontent.com`, `pastebin.com/raw/`, `arxiv.org/abs/`) always use the basic backend, regardless of what is installed.
- Paywall domains — `wsj.com`, `ft.com`, `nytimes.com`, etc. prefer Tavily (its cached access often bypasses paywalls). Falls back to basic when Tavily is not configured.
- JS-heavy domains — `medium.com`, `substack.com`, `notion.so`, `twitter.com`, etc. prefer Firecrawl. Falls back to basic when Firecrawl is not configured.
- Unknown domains — prefer Firecrawl (highest quality), then Tavily, then basic.
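You can observe the same logic with the `BackendSelector` described under "Inspecting backend selection" below. A quick sketch of the fallback behavior (assuming `select()` returns `Backend` enum members, as in the later example):

```python
from smart_scraper.selector import BackendSelector, Backend

# Raw file hosts pick basic even when every backend is available.
full = BackendSelector(firecrawl_available=True, tavily_available=True)
print(full.select("https://raw.githubusercontent.com/user/repo/main/README.md"))
# Backend.BASIC

# Paywall domains prefer Tavily but fall back when nothing is configured.
offline = BackendSelector(firecrawl_available=False, tavily_available=False)
print(offline.select("https://wsj.com/articles/some-story"))
# Backend.BASIC
```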
Override automatic selection at any time:
# Force a specific backend
result = scrape_url("https://example.com", backend="firecrawl")
result = scrape_url("https://example.com", backend="tavily")
result = scrape_url("https://example.com", backend="basic")
# Shorthand flags
result = scrape_url("https://notion.so/page", force_firecrawl=True)
result = scrape_url("https://wsj.com/article", force_tavily=True)
API reference
`scrape_url(url, *, backend=None, force_firecrawl=False, force_tavily=False, only_main_content=True, timeout=30, firecrawl_api_key=None, tavily_api_key=None)`

Scrape a URL and return a `ScrapeResult`. Always returns — never raises.
Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| `url` | `str` | — | Fully-qualified URL |
| `backend` | `str \| Backend \| None` | `None` | Force a backend: `"basic"`, `"firecrawl"`, `"tavily"`, `"auto"` |
| `force_firecrawl` | `bool` | `False` | Shorthand for `backend="firecrawl"` |
| `force_tavily` | `bool` | `False` | Shorthand for `backend="tavily"` |
| `only_main_content` | `bool` | `True` | Strip nav/footer/sidebar (Firecrawl only) |
| `timeout` | `int` | `30` | Request timeout in seconds |
| `firecrawl_api_key` | `str \| None` | `None` | Per-call key override (reads env var otherwise) |
| `tavily_api_key` | `str \| None` | `None` | Per-call key override (reads env var otherwise) |
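The per-call key parameters let you skip environment variables entirely. A minimal sketch (the key value is a placeholder):

```python
from smart_scraper import scrape_url

# An explicit key takes precedence over FIRECRAWL_API_KEY in the environment.
result = scrape_url(
    "https://medium.com/@author/article",
    backend="firecrawl",
    firecrawl_api_key="fc-your-key-here",  # placeholder
    timeout=60,
)
```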
ScrapeResult
from dataclasses import dataclass
from typing import Optional

@dataclass
class ScrapeResult:
    url: str
    content: str              # Markdown text
    title: Optional[str]
    metadata: Optional[dict]
    source: str               # "basic" | "firecrawl" | "tavily"
    success: bool
    error: Optional[str]
`bool(result)` returns `result.success` for convenient conditional checks.
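That makes the success check from the quick start even shorter:

```python
from smart_scraper import scrape_url

result = scrape_url("https://example.com")
if result:  # same as: if result.success
    print(result.content)
else:
    print(result.error)
```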
Using backends directly
Basic backend
from smart_scraper.backends.basic import BasicBackend
backend = BasicBackend(timeout=15)
result = backend.fetch("https://example.com")
print(result.content)
Firecrawl backend
from smart_scraper.backends.firecrawl import FirecrawlBackend
backend = FirecrawlBackend() # reads FIRECRAWL_API_KEY
# Single page
result = backend.fetch("https://medium.com/@author/article", only_main_content=True)
# Crawl a site
pages = backend.crawl("https://docs.example.com", max_pages=20, max_depth=2)
# Structured extraction
schema = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "topics": {"type": "array", "items": {"type": "string"}},
    },
}
data = backend.extract_structured("https://example.com", schema=schema)
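The crawl call above returns multiple pages. Assuming each page is the same `ScrapeResult` shape used elsewhere in the API (an assumption; check the return type in your version), iterating looks like:

```python
from smart_scraper.backends.firecrawl import FirecrawlBackend

backend = FirecrawlBackend()  # reads FIRECRAWL_API_KEY
pages = backend.crawl("https://docs.example.com", max_pages=20, max_depth=2)

# Assumption: crawl() returns an iterable of ScrapeResult-like objects.
for page in pages:
    if page:  # truthy on success
        print(page.url, page.title)
```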
Tavily backend
from smart_scraper.backends.tavily import TavilyBackend
backend = TavilyBackend() # reads TAVILY_API_KEY
# Extract a specific URL
result = backend.fetch("https://wsj.com/articles/some-story")
# Search the web
response = backend.search("Python web scraping 2024", max_results=5, include_answer=True)
for item in response.results:
    print(item.title, item.url)
# Get an AI-generated answer
answer = backend.qna_search("What is the capital of France?")
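Tavily fetches are assumed here to return the same `ScrapeResult` dataclass as the other backends, so the usual checks apply. A sketch:

```python
from smart_scraper.backends.tavily import TavilyBackend

backend = TavilyBackend()  # reads TAVILY_API_KEY
result = backend.fetch("https://ft.com/content/some-article")
if result:  # truthy on success
    print(result.source)  # "tavily"
else:
    print(result.error)
```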
Inspecting backend selection
from smart_scraper.selector import BackendSelector, Backend
selector = BackendSelector(firecrawl_available=True, tavily_available=False)
chosen = selector.select("https://medium.com/@author/post")
print(chosen) # Backend.FIRECRAWL
reason = selector.explain("https://medium.com/@author/post")
print(reason) # "Backend 'firecrawl' selected: domain requires JavaScript..."
Environment variables
Copy `.env.example` to `.env` and fill in your keys:
FIRECRAWL_API_KEY=your_firecrawl_key_here
TAVILY_API_KEY=your_tavily_key_here
Keys are read at runtime via `os.environ`. Use `python-dotenv` or your preferred env-management tool to load them:
from dotenv import load_dotenv
load_dotenv()
from smart_scraper import scrape_url
result = scrape_url("https://notion.so/page")
Running tests
# Install dev dependencies
pip install smart-scraper[dev]
# Run offline tests (no network, no API keys)
pytest
# Include live network tests (requires internet)
pytest --live
# With coverage
pytest --cov=src --cov-report=term-missing
Contributing
- Fork the repo and create a feature branch.
- Install dev dependencies: `pip install -e ".[dev]"`
- Run linting: `ruff check . && mypy src/`
- Run tests: `pytest`
- Open a pull request — all CI checks must pass.
License
MIT — see LICENSE.