
smart-scraper

Intelligent web scraping for Python with automatic backend selection.

Chooses the right scraping strategy for every URL — lightweight static fetch, cloud-based JavaScript rendering, or search-powered extraction — so you do not have to think about it.

Features

  • Zero-config core — works out of the box with requests + beautifulsoup4
  • Automatic backend selection — static HTML, JS-heavy SPAs, and paywalled content each get the best tool
  • Optional cloud backends — Firecrawl and Tavily activate when their SDK and API key are present
  • Graceful degradation — always returns a ScrapeResult, never raises on network errors
  • Clean Markdown output — navigation, footers, scripts, and styles are stripped automatically
  • Typed API — full type annotations, compatible with mypy strict mode

Install

# Core (no API keys required)
pip install smart-scraper

# With Firecrawl (JS-heavy sites)
pip install smart-scraper[firecrawl]

# With Tavily (search + paywalled content)
pip install smart-scraper[tavily]

# Everything
pip install smart-scraper[all]

Quick start

from smart_scraper import scrape_url

# Automatic backend selection — just pass a URL
result = scrape_url("https://docs.python.org/3/library/json.html")

if result.success:
    print(result.title)    # "json — JSON encoder and decoder"
    print(result.content)  # clean Markdown text
else:
    print(result.error)

Backend comparison

Backend   | Best for                                              | Dependencies             | Free tier
basic     | Static HTML, docs, blogs, GitHub raw files            | requests, beautifulsoup4 | Unlimited
firecrawl | SPAs, React/Next.js, social platforms, anti-bot sites | firecrawl-py + API key   | 500 scrapes
tavily    | Research queries, paywalled/login-walled content      | tavily-python + API key  | 1,000 credits/month
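
The backend that actually ran is recorded on every result, so it is easy to confirm which row of this table applied:

from smart_scraper import scrape_url

result = scrape_url("https://docs.python.org/3/library/json.html")
print(result.source)  # "basic", "firecrawl", or "tavily"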

Automatic selection logic

scrape_url() runs the following decision tree before making any network call:

  1. URL pattern match — raw file hosts (raw.githubusercontent.com, pastebin.com/raw/, arxiv.org/abs/) always use the basic backend regardless of what is installed.
  2. Paywall domains — wsj.com, ft.com, nytimes.com, etc. prefer Tavily (its cached access often bypasses paywalls). Falls back to basic when Tavily is not configured.
  3. JS-heavy domains — medium.com, substack.com, notion.so, twitter.com, etc. prefer Firecrawl. Falls back to basic when Firecrawl is not configured.
  4. Unknown domains — prefer Firecrawl (highest quality), then Tavily, then basic. The sketch after this list illustrates the precedence.
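
A minimal sketch of that precedence, assuming abbreviated domain lists (the real rules live in smart_scraper.selector; the sets and helper below are illustrative, not the library's actual code):

from urllib.parse import urlparse

# Abbreviated, illustrative domain lists; the real matcher also inspects
# URL paths (e.g. pastebin.com/raw/, arxiv.org/abs/).
RAW_HOSTS = {"raw.githubusercontent.com"}
PAYWALL_DOMAINS = {"wsj.com", "ft.com", "nytimes.com"}
JS_HEAVY_DOMAINS = {"medium.com", "substack.com", "notion.so", "twitter.com"}

def choose_backend(url: str, firecrawl: bool, tavily: bool) -> str:
    host = urlparse(url).netloc.lower().removeprefix("www.")
    if host in RAW_HOSTS:                                # rule 1: raw files
        return "basic"
    if any(host.endswith(d) for d in PAYWALL_DOMAINS):   # rule 2: paywalls
        return "tavily" if tavily else "basic"
    if any(host.endswith(d) for d in JS_HEAVY_DOMAINS):  # rule 3: JS-heavy
        return "firecrawl" if firecrawl else "basic"
    if firecrawl:                                        # rule 4: unknown
        return "firecrawl"
    return "tavily" if tavily else "basic"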

Override automatic selection at any time:

# Force a specific backend
result = scrape_url("https://example.com", backend="firecrawl")
result = scrape_url("https://example.com", backend="tavily")
result = scrape_url("https://example.com", backend="basic")

# Shorthand flags
result = scrape_url("https://notion.so/page", force_firecrawl=True)
result = scrape_url("https://wsj.com/article", force_tavily=True)

API reference

scrape_url(url, *, backend=None, force_firecrawl=False, force_tavily=False, only_main_content=True, timeout=30, firecrawl_api_key=None, tavily_api_key=None)

Scrape a URL and return a ScrapeResult. Always returns — never raises.

Parameters

  • url (str, required) — fully-qualified URL to scrape
  • backend (str | Backend | None, default None) — force a backend: "basic", "firecrawl", "tavily", or "auto"
  • force_firecrawl (bool, default False) — shorthand for backend="firecrawl"
  • force_tavily (bool, default False) — shorthand for backend="tavily"
  • only_main_content (bool, default True) — strip nav/footer/sidebar (Firecrawl only)
  • timeout (int, default 30) — request timeout in seconds
  • firecrawl_api_key (str | None, default None) — per-call key override (reads the FIRECRAWL_API_KEY env var otherwise)
  • tavily_api_key (str | None, default None) — per-call key override (reads the TAVILY_API_KEY env var otherwise)
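
For example, a call that tightens the timeout and supplies a per-call key (the key value is a placeholder):

from smart_scraper import scrape_url

result = scrape_url(
    "https://medium.com/@author/article",
    timeout=10,
    firecrawl_api_key="your_firecrawl_key_here",
)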

ScrapeResult

from dataclasses import dataclass
from typing import Optional

@dataclass
class ScrapeResult:
    url: str
    content: str           # Markdown text
    title: Optional[str]
    metadata: Optional[dict]
    source: str            # "basic" | "firecrawl" | "tavily"
    success: bool
    error: Optional[str]

bool(result) returns result.success for convenient conditional checks.
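
For example:

from smart_scraper import scrape_url

result = scrape_url("https://example.com")
if result:  # equivalent to: if result.success
    print(result.source, result.title)
else:
    print(result.error)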

Using backends directly

Basic backend

from smart_scraper.backends.basic import BasicBackend

backend = BasicBackend(timeout=15)
result = backend.fetch("https://example.com")
print(result.content)

Firecrawl backend

from smart_scraper.backends.firecrawl import FirecrawlBackend

backend = FirecrawlBackend()  # reads FIRECRAWL_API_KEY

# Single page
result = backend.fetch("https://medium.com/@author/article", only_main_content=True)

# Crawl a site
pages = backend.crawl("https://docs.example.com", max_pages=20, max_depth=2)
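
# Iterate the crawled pages (assuming crawl() returns ScrapeResult objects;
# check the return type in your installed version before relying on this)
for page in pages:
    print(page.url, page.title)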

# Structured extraction
schema = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "topics": {"type": "array", "items": {"type": "string"}},
    },
}
data = backend.extract_structured("https://example.com", schema=schema)

Tavily backend

from smart_scraper.backends.tavily import TavilyBackend

backend = TavilyBackend()  # reads TAVILY_API_KEY

# Extract a specific URL
result = backend.fetch("https://wsj.com/articles/some-story")

# Search the web
response = backend.search("Python web scraping 2024", max_results=5, include_answer=True)
for item in response.results:
    print(item.title, item.url)

# Get an AI-generated answer
answer = backend.qna_search("What is the capital of France?")

Inspecting backend selection

from smart_scraper.selector import BackendSelector, Backend

selector = BackendSelector(firecrawl_available=True, tavily_available=False)
chosen = selector.select("https://medium.com/@author/post")
print(chosen)  # Backend.FIRECRAWL

reason = selector.explain("https://medium.com/@author/post")
print(reason)  # "Backend 'firecrawl' selected: domain requires JavaScript..."

Environment variables

Copy .env.example to .env and fill in your keys:

FIRECRAWL_API_KEY=your_firecrawl_key_here
TAVILY_API_KEY=your_tavily_key_here

Keys are read at runtime via os.environ. Use python-dotenv or your preferred env-management tool to load them:

from dotenv import load_dotenv
load_dotenv()

from smart_scraper import scrape_url
result = scrape_url("https://notion.so/page")
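
Keys can also be set on os.environ directly before calling the library:

import os
os.environ["TAVILY_API_KEY"] = "your_tavily_key_here"

from smart_scraper import scrape_url
result = scrape_url("https://wsj.com/articles/some-story")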

Running tests

# Install dev dependencies
pip install smart-scraper[dev]

# Run offline tests (no network, no API keys)
pytest

# Include live network tests (requires internet)
pytest --live

# With coverage
pytest --cov=src --cov-report=term-missing

Contributing

  1. Fork the repo and create a feature branch.
  2. Install dev dependencies: pip install -e ".[dev]"
  3. Run linting: ruff check . && mypy src/
  4. Run tests: pytest
  5. Open a pull request — all CI checks must pass.

License

MIT — see LICENSE.
