
smart-scraper

Intelligent web scraping for Python with automatic backend selection.

Chooses the right scraping strategy for every URL — lightweight static fetch, cloud-based JavaScript rendering, or search-powered extraction — so you do not have to think about it.

Features

  • Zero-config core — works out of the box with requests + beautifulsoup4
  • Automatic backend selection — static HTML, JS-heavy SPAs, and paywalled content each get the best tool
  • Optional cloud backends — Firecrawl and Tavily activate when their SDK and API key are present
  • Graceful degradation — always returns a ScrapeResult, never raises on network errors
  • Clean Markdown output — navigation, footers, scripts, and styles are stripped automatically
  • Typed API — full type annotations, compatible with mypy strict mode

Install

# Core (no API keys required)
pip install smart-scraper

# With Firecrawl (JS-heavy sites)
pip install smart-scraper[firecrawl]

# With Tavily (search + paywalled content)
pip install smart-scraper[tavily]

# Everything
pip install smart-scraper[all]

Quick start

from smart_scraper import scrape_url

# Automatic backend selection — just pass a URL
result = scrape_url("https://docs.python.org/3/library/json.html")

if result.success:
    print(result.title)    # "json — JSON encoder and decoder"
    print(result.content)  # clean Markdown text
else:
    print(result.error)

Backend comparison

Backend   | Best for                                              | Dependencies             | Free tier
basic     | Static HTML, docs, blogs, GitHub raw files            | requests, beautifulsoup4 | Unlimited
firecrawl | SPAs, React/Next.js, social platforms, anti-bot sites | firecrawl-py + API key   | 500 scrapes
tavily    | Research queries, paywalled/login-walled content      | tavily-python + API key  | 1,000 credits/month
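
The backend that actually ran is recorded on every result, so it is easy to confirm which row of this table applied:

from smart_scraper import scrape_url

result = scrape_url("https://docs.python.org/3/library/json.html")
print(result.source)  # "basic", "firecrawl", or "tavily"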

Automatic selection logic

scrape_url() runs the following decision tree before making any network call:

  1. URL pattern match — raw file hosts (raw.githubusercontent.com, pastebin.com/raw/, arxiv.org/abs/) always use the basic backend regardless of what is installed.
  2. Paywall domains — wsj.com, ft.com, nytimes.com, etc. prefer Tavily (its cached access often bypasses paywalls). Falls back to basic when Tavily is not configured.
  3. JS-heavy domains — medium.com, substack.com, notion.so, twitter.com, etc. prefer Firecrawl. Falls back to basic when Firecrawl is not configured.
  4. Unknown domains — prefer Firecrawl (highest quality), then Tavily, then basic. The sketch after this list illustrates the precedence.
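
A minimal sketch of that precedence, assuming abbreviated domain lists (the real rules live in smart_scraper.selector; the sets and helper below are illustrative, not the library's actual code):

from urllib.parse import urlparse

# Abbreviated, illustrative domain lists; the real matcher also inspects
# URL paths (e.g. pastebin.com/raw/, arxiv.org/abs/).
RAW_HOSTS = {"raw.githubusercontent.com"}
PAYWALL_DOMAINS = {"wsj.com", "ft.com", "nytimes.com"}
JS_HEAVY_DOMAINS = {"medium.com", "substack.com", "notion.so", "twitter.com"}

def choose_backend(url: str, firecrawl: bool, tavily: bool) -> str:
    host = urlparse(url).netloc.lower().removeprefix("www.")
    if host in RAW_HOSTS:                                # rule 1: raw files
        return "basic"
    if any(host.endswith(d) for d in PAYWALL_DOMAINS):   # rule 2: paywalls
        return "tavily" if tavily else "basic"
    if any(host.endswith(d) for d in JS_HEAVY_DOMAINS):  # rule 3: JS-heavy
        return "firecrawl" if firecrawl else "basic"
    if firecrawl:                                        # rule 4: unknown
        return "firecrawl"
    return "tavily" if tavily else "basic"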

Override automatic selection at any time:

# Force a specific backend
result = scrape_url("https://example.com", backend="firecrawl")
result = scrape_url("https://example.com", backend="tavily")
result = scrape_url("https://example.com", backend="basic")

# Shorthand flags
result = scrape_url("https://notion.so/page", force_firecrawl=True)
result = scrape_url("https://wsj.com/article", force_tavily=True)

API reference

scrape_url(url, *, backend=None, force_firecrawl=False, force_tavily=False, only_main_content=True, timeout=30, firecrawl_api_key=None, tavily_api_key=None)

Scrape a URL and return a ScrapeResult. Always returns — never raises.

Parameters

  • url (str, required) — fully-qualified URL to scrape
  • backend (str | Backend | None, default None) — force a backend: "basic", "firecrawl", "tavily", or "auto"
  • force_firecrawl (bool, default False) — shorthand for backend="firecrawl"
  • force_tavily (bool, default False) — shorthand for backend="tavily"
  • only_main_content (bool, default True) — strip nav/footer/sidebar (Firecrawl only)
  • timeout (int, default 30) — request timeout in seconds
  • firecrawl_api_key (str | None, default None) — per-call key override (reads the FIRECRAWL_API_KEY env var otherwise)
  • tavily_api_key (str | None, default None) — per-call key override (reads the TAVILY_API_KEY env var otherwise)
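
For example, a call that tightens the timeout and supplies a per-call key (the key value is a placeholder):

from smart_scraper import scrape_url

result = scrape_url(
    "https://medium.com/@author/article",
    timeout=10,
    firecrawl_api_key="your_firecrawl_key_here",
)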

ScrapeResult

from dataclasses import dataclass
from typing import Optional

@dataclass
class ScrapeResult:
    url: str
    content: str           # Markdown text
    title: Optional[str]
    metadata: Optional[dict]
    source: str            # "basic" | "firecrawl" | "tavily"
    success: bool
    error: Optional[str]

bool(result) returns result.success for convenient conditional checks.
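
For example:

from smart_scraper import scrape_url

result = scrape_url("https://example.com")
if result:  # equivalent to: if result.success
    print(result.source, result.title)
else:
    print(result.error)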

Using backends directly

Basic backend

from smart_scraper.backends.basic import BasicBackend

backend = BasicBackend(timeout=15)
result = backend.fetch("https://example.com")
print(result.content)

Firecrawl backend

from smart_scraper.backends.firecrawl import FirecrawlBackend

backend = FirecrawlBackend()  # reads FIRECRAWL_API_KEY

# Single page
result = backend.fetch("https://medium.com/@author/article", only_main_content=True)

# Crawl a site
pages = backend.crawl("https://docs.example.com", max_pages=20, max_depth=2)
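
# Iterate the crawled pages (assuming crawl() returns ScrapeResult objects;
# check the return type in your installed version before relying on this)
for page in pages:
    print(page.url, page.title)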

# Structured extraction
schema = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "topics": {"type": "array", "items": {"type": "string"}},
    },
}
data = backend.extract_structured("https://example.com", schema=schema)

Tavily backend

from smart_scraper.backends.tavily import TavilyBackend

backend = TavilyBackend()  # reads TAVILY_API_KEY

# Extract a specific URL
result = backend.fetch("https://wsj.com/articles/some-story")

# Search the web
response = backend.search("Python web scraping 2024", max_results=5, include_answer=True)
for item in response.results:
    print(item.title, item.url)

# Get an AI-generated answer
answer = backend.qna_search("What is the capital of France?")

Inspecting backend selection

from smart_scraper.selector import BackendSelector, Backend

selector = BackendSelector(firecrawl_available=True, tavily_available=False)
chosen = selector.select("https://medium.com/@author/post")
print(chosen)  # Backend.FIRECRAWL

reason = selector.explain("https://medium.com/@author/post")
print(reason)  # "Backend 'firecrawl' selected: domain requires JavaScript..."

Environment variables

Copy .env.example to .env and fill in your keys:

FIRECRAWL_API_KEY=your_firecrawl_key_here
TAVILY_API_KEY=your_tavily_key_here

Keys are read at runtime via os.environ. Use python-dotenv or your preferred env-management tool to load them:

from dotenv import load_dotenv
load_dotenv()

from smart_scraper import scrape_url
result = scrape_url("https://notion.so/page")
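
Keys can also be set on os.environ directly before calling the library:

import os
os.environ["TAVILY_API_KEY"] = "your_tavily_key_here"

from smart_scraper import scrape_url
result = scrape_url("https://wsj.com/articles/some-story")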

Running tests

# Install dev dependencies
pip install smart-scraper[dev]

# Run offline tests (no network, no API keys)
pytest

# Include live network tests (requires internet)
pytest --live

# With coverage
pytest --cov=src --cov-report=term-missing

Contributing

  1. Fork the repo and create a feature branch.
  2. Install dev dependencies: pip install -e ".[dev]"
  3. Run linting: ruff check . && mypy src/
  4. Run tests: pytest
  5. Open a pull request — all CI checks must pass.

License

MIT — see LICENSE.
