Extracto

AI-powered web scraper. Give it a URL and tell it what data you want — it handles the rest.

Built with Crawlee + Playwright + ScrapeGraphAI + pandas.

What it does

  • Smart extraction — describe what you want in plain English, the AI pulls exactly that
  • JavaScript rendering — handles SPAs, dynamic content, infinite scroll
  • Multi-format export — JSON, CSV, XML, SQLite, Excel, Markdown
  • Batch mode — process hundreds of URLs from a file
  • Proxy rotation — avoid IP bans with automatic proxy cycling
  • Resume/checkpoint — crash-safe crawling, picks up where it left off
  • Scheduled runs — repeat crawls on a timer (every 6h, daily, etc.)
  • REST API — deploy as a service anyone on your team can use
  • Webhook notifications — get pinged on Discord/Slack when a crawl finishes
  • robots.txt compliance — respects site rules by default
  • 5 LLM providers — Mistral, OpenAI, Groq, Gemini, Ollama (local)

Quick start

# install globally via pip
pip install extracto-scraper

# run the interactive wizard
extracto

Note: Playwright needs its browser binaries installed before your first run:

playwright install chromium

# add your API key
cp .env.example .env
# then edit .env and paste your key

# run it
python main.py "https://books.toscrape.com/" "Extract all book titles and prices"


Interactive mode

Don't want to memorize flags? Just run it with no arguments:

```bash
python main.py
```

A friendly wizard walks you through everything — URL, what to extract, output format, LLM provider, and optional advanced settings. No flags needed.

CLI reference

python main.py <url> <prompt> [options]
python main.py serve                    # start REST API

Core options

| Flag | Short | What it does | Default |
|---|---|---|---|
| `--format` | `-f` | Output format: `json`, `csv`, `xml`, `sql`, `excel`, `markdown` | `json` |
| `--depth` | `-d` | Link levels to follow (0 = single page) | `0` |
| `--scope` | `-s` | Link scope: `same_domain`, `same_directory`, `external` | `same_domain` |
| `--provider` | `-p` | LLM provider: `mistral`, `openai`, `groq`, `gemini`, `ollama` | `mistral` |
| `--model` | `-m` | Override the default model name | auto |
| `--output` | `-o` | Output directory | `output` |

Power features

| Flag | What it does |
|---|---|
| `--batch FILE` | Process multiple URLs from a text file |
| `--proxy PROXY` | Proxy URL or path to a proxy list file |
| `--rate-limit N` | Seconds between requests (polite crawling) |
| `--resume FILE` | Save/restore crawl state to a checkpoint file |
| `--schema JSON` | Enforce structured output with a JSON schema |
| `--screenshots` | Save full-page screenshots of every page |
| `--cache` | Cache rendered pages (skip re-rendering on re-runs) |
| `--sitemap` | Auto-discover pages from sitemap.xml |
| `--no-robots` | Ignore robots.txt (default: respect it) |
| `--webhook URL` | Send a completion notification (Discord, Slack, generic) |
| `--schedule INTERVAL` | Repeat interval: `6h`, `30m`, `1d` |
| `--config FILE` | Load settings from a YAML file |
| `--port N` | API server port (default: 8000) |
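
The `--webhook` flag posts a notification when a crawl finishes. For local testing, a minimal stdlib receiver works as a "generic" endpoint; note the payload fields are an assumption here, not a documented contract — inspect what your crawl actually sends.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def parse_notification(raw: bytes) -> dict:
    """Decode a JSON webhook body; tolerate empty pings."""
    return json.loads(raw or b"{}")

class WebhookHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        payload = parse_notification(self.rfile.read(length))
        print("crawl finished:", payload)
        self.send_response(200)
        self.end_headers()

if __name__ == "__main__":
    # Point --webhook at http://localhost:9000/ while testing.
    HTTPServer(("", 9000), WebhookHandler).serve_forever()
```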

Examples

# basic — scrape one page
python main.py "https://news.ycombinator.com/" "Extract all post titles and links"

# depth crawl — follow links 2 levels deep, export CSV
python main.py "https://docs.python.org/3/" "Extract function names and descriptions" -d 2 -f csv

# batch mode — scrape many URLs at once
python main.py --batch urls.txt "Extract all product names and prices" -f json

# structured output — force exact JSON shape
python main.py "https://example.com" "Get products" --schema '{"name": "str", "price": "float", "in_stock": "bool"}'

# with proxy rotation + rate limiting
python main.py "https://example.com" "Get all links" --proxy proxies.txt --rate-limit 2

# resume after crash
python main.py --batch urls.txt "Get data" --resume checkpoint.json

# use a YAML config for complex jobs
python main.py --config crawl.yaml

# API server mode
python main.py serve --port 8080

# scheduled monitoring — run every 6 hours
python main.py "https://competitor.com/pricing" "Get all prices" --schedule 6h --webhook https://hooks.slack.com/your/url

# use different LLM providers
python main.py "https://example.com" "Get contact info" -p openai
python main.py "https://example.com" "Get links" -p ollama -m llama3.2
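
The `--schema` shorthand above maps field names to primitive type names. How Extracto enforces this internally isn't shown here; a hedged sketch of a post-hoc check on extracted rows, assuming those four type names:

```python
# Assumed mapping from the shorthand type names to Python types.
TYPE_MAP = {"str": str, "int": int, "float": float, "bool": bool}

def matches_schema(row: dict, schema: dict) -> bool:
    """True if every schema field is present with the declared type."""
    return all(
        field in row and isinstance(row[field], TYPE_MAP[type_name])
        for field, type_name in schema.items()
    )
```

A check like this is useful for filtering out rows where the LLM returned, say, a price as a string instead of a number.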

YAML config

For complex or recurring jobs, use a YAML file instead of CLI flags:

# crawl.yaml
start_url: "https://books.toscrape.com/"
user_prompt: "Extract all book titles and prices"
max_depth: 1
output_format: csv
rate_limit: 1.0
proxy: "proxies.txt"
checkpoint_file: "books_checkpoint.json"

python main.py --config crawl.yaml

See crawl.example.yaml for all available options.

REST API

Start the API server:

pip install fastapi uvicorn  # one-time setup
python main.py serve

Then call it:

curl -X POST http://localhost:8000/scrape \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com", "prompt": "Extract all links"}'

Interactive docs at http://localhost:8000/docs.
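
The same call from Python, using only the stdlib. The endpoint and request fields are as documented above; the shape of the JSON response depends on your prompt:

```python
import json
import urllib.request

API_URL = "http://localhost:8000/scrape"  # default serve port

def build_request(url: str, prompt: str) -> urllib.request.Request:
    """Build a POST request for the /scrape endpoint."""
    payload = json.dumps({"url": url, "prompt": prompt}).encode()
    return urllib.request.Request(
        API_URL,
        data=payload,
        headers={"Content-Type": "application/json"},
    )

if __name__ == "__main__":
    req = build_request("https://example.com", "Extract all links")
    with urllib.request.urlopen(req) as resp:
        print(json.load(resp))
```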

Docker

docker build -t extracto .
docker run -p 8000:8000 -e MISTRAL_API_KEY=your_key extracto

Supported LLM providers

| Provider | Env variable | Default model |
|---|---|---|
| Mistral | `MISTRAL_API_KEY` | `mistral-small-latest` |
| OpenAI | `OPENAI_API_KEY` | `gpt-4o-mini` |
| Groq | `GROQ_API_KEY` | `llama-3.1-8b-instant` |
| Google Gemini | `GOOGLE_API_KEY` | `gemini-2.0-flash` |
| Ollama | none (local) | `llama3.2` |

Run python main.py --list-models to see all available models.
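
How Extracto picks a provider when you don't pass `-p` isn't documented here; one plausible sketch is "first provider whose key is set, else fall back to local Ollama":

```python
import os

# Key names from the provider table; Ollama needs no key (runs locally).
PROVIDER_KEYS = {
    "mistral": "MISTRAL_API_KEY",
    "openai": "OPENAI_API_KEY",
    "groq": "GROQ_API_KEY",
    "gemini": "GOOGLE_API_KEY",
}

def detect_provider(env=os.environ) -> str:
    """Return the first provider whose API key is set, else 'ollama'."""
    for provider, key in PROVIDER_KEYS.items():
        if env.get(key):
            return provider
    return "ollama"
```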

Architecture

main.py (CLI + scheduler)
  └─ CrawlerConfig           config.py           ← settings dataclass + YAML loader
  └─ CrawlerEngine           crawler_engine.py   ← crawl loop + batch + checkpoint
       ├─ BrowserEngine       browser_engine.py   ← stealth Playwright + proxy rotation
       ├─ AIExtractor         ai_extractor.py     ← ScrapeGraphAI + your LLM
       ├─ RobotsChecker       robots.py           ← robots.txt compliance
       ├─ PageCache           cache.py            ← file-based render cache
       ├─ SitemapDiscovery    sitemap.py          ← XML sitemap parser
       └─ SchemaLoader        schema.py           ← structured output enforcement
  └─ DataExporter             data_exporter.py    ← pandas → any format
  └─ Server                   server.py           ← FastAPI REST API
  └─ Webhooks                 webhooks.py         ← Discord/Slack notifications
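
As one example of these components, the `RobotsChecker` can be approximated with the stdlib's `urllib.robotparser` (the real robots.py and the `"ExtractoBot"` agent string here are assumptions):

```python
from urllib.robotparser import RobotFileParser

def allowed(url: str, robots_txt: str, user_agent: str = "ExtractoBot") -> bool:
    """Check a URL against already-fetched robots.txt rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```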

Project structure

├── main.py               CLI entry point + scheduler
├── config.py             settings dataclass + YAML loader
├── crawler_engine.py     crawl loop, batch mode, checkpoint/resume
├── browser_engine.py     Playwright with stealth, proxy rotation, screenshots
├── ai_extractor.py       multi-provider LLM extraction
├── data_exporter.py      export to JSON/CSV/XML/SQL/Excel/Markdown
├── robots.py             robots.txt compliance checker
├── schema.py             JSON schema → structured output
├── sitemap.py            sitemap.xml auto-discovery
├── cache.py              file-based page cache
├── server.py             FastAPI REST API
├── webhooks.py           webhook notifications
├── utils.py              Rich terminal UI helpers
├── crawl.example.yaml    example YAML config
├── urls.example.txt      example batch URL file
├── Dockerfile
├── requirements.txt
├── .env.example
├── .gitignore
└── LICENSE               MIT

Requirements

  • Python 3.10+
  • An API key for at least one LLM provider (or Ollama running locally)
  • Optional: fastapi + uvicorn for API server mode

License

MIT


Built with ❤️ by Nishal.
