# Extracto
AI-powered web scraper. Give it a URL and tell it what data you want — it handles the rest.
Built with Crawlee + Playwright + ScrapeGraphAI + pandas.
## What it does
- Smart extraction — describe what you want in plain English, the AI pulls exactly that
- JavaScript rendering — handles SPAs, dynamic content, infinite scroll
- Multi-format export — JSON, CSV, XML, SQLite, Excel, Markdown
- Batch mode — process hundreds of URLs from a file
- Proxy rotation — avoid IP bans with automatic proxy cycling
- Resume/checkpoint — crash-safe crawling, picks up where it left off
- Scheduled runs — repeat crawls on a timer (every 6h, daily, etc.)
- REST API — deploy as a service anyone on your team can use
- Webhook notifications — get pinged on Discord/Slack when a crawl finishes
- robots.txt compliance — respects site rules by default
- 5 LLM providers — Mistral, OpenAI, Groq, Gemini, Ollama (local)
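The robots.txt compliance above can be sketched with Python's standard library. This is an illustrative sketch, not the project's actual `robots.py` (the `allowed` helper and the `Extracto` user-agent string are assumptions):

```python
from urllib.robotparser import RobotFileParser

def allowed(robots_txt: str, url: str, agent: str = "Extracto") -> bool:
    """Return True if `agent` may fetch `url` under the given robots.txt rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)

rules = "User-agent: *\nDisallow: /private/\n"
allowed(rules, "https://example.com/catalog/")   # permitted
allowed(rules, "https://example.com/private/x")  # blocked by the Disallow rule
```

Passing `--no-robots` would simply skip a check like this before each fetch.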
## Quick start

```bash
# clone and set up
git clone https://github.com/nishal21/Extracto.git
cd Extracto
python -m venv .venv && .venv\Scripts\activate  # or: source .venv/bin/activate on macOS/Linux
pip install -r requirements.txt
playwright install chromium

# add your API key
cp .env.example .env
# edit .env and paste your key

# run it
python main.py "https://books.toscrape.com/" "Extract all book titles and prices"
```
## Interactive mode

Don't want to memorize flags? Just run it with no arguments:

```bash
python main.py
```

A friendly wizard walks you through everything — URL, what to extract, output format, LLM provider, and optional advanced settings. No flags needed.
## CLI reference

```bash
python main.py <url> <prompt> [options]
python main.py serve              # start the REST API
```
### Core options

| Flag | Short | What it does | Default |
|---|---|---|---|
| `--format` | `-f` | Output format: `json`, `csv`, `xml`, `sql`, `excel`, `markdown` | `json` |
| `--depth` | `-d` | Link levels to follow (`0` = single page) | `0` |
| `--scope` | `-s` | Link scope: `same_domain`, `same_directory`, `external` | `same_domain` |
| `--provider` | `-p` | LLM provider: `mistral`, `openai`, `groq`, `gemini`, `ollama` | `mistral` |
| `--model` | `-m` | Override the default model name | auto |
| `--output` | `-o` | Output directory | `output` |
### Power features

| Flag | What it does |
|---|---|
| `--batch FILE` | Process multiple URLs from a text file |
| `--proxy PROXY` | Proxy URL or path to a proxy list file |
| `--rate-limit N` | Seconds between requests (polite crawling) |
| `--resume FILE` | Save/restore crawl state to a checkpoint file |
| `--schema JSON` | Enforce structured output with a JSON schema |
| `--screenshots` | Save full-page screenshots of every page |
| `--cache` | Cache rendered pages (skip re-rendering on re-runs) |
| `--sitemap` | Auto-discover pages from `sitemap.xml` |
| `--no-robots` | Ignore `robots.txt` (default: respect it) |
| `--webhook URL` | Send a completion notification (Discord, Slack, generic) |
| `--schedule INT` | Repeat interval: `6h`, `30m`, `1d` |
| `--config FILE` | Load settings from a YAML file |
| `--port N` | API server port (default: `8000`) |
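The `--schedule` flag takes compact interval strings like `6h` or `30m`. A plausible parser for that format looks like this (a sketch; `parse_interval` is not necessarily the project's actual helper):

```python
def parse_interval(spec: str) -> int:
    """Convert a compact interval like '30m', '6h', or '1d' into seconds."""
    units = {"s": 1, "m": 60, "h": 3600, "d": 86400}
    value, unit = spec[:-1], spec[-1]
    if unit not in units or not value.isdigit():
        raise ValueError(f"bad interval: {spec!r}")
    return int(value) * units[unit]

parse_interval("6h")   # 21600 seconds
parse_interval("30m")  # 1800 seconds
```

The scheduler then just sleeps for that many seconds between runs.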
## Examples

```bash
# basic — scrape one page
python main.py "https://news.ycombinator.com/" "Extract all post titles and links"

# depth crawl — follow links 2 levels deep, export CSV
python main.py "https://docs.python.org/3/" "Extract function names and descriptions" -d 2 -f csv

# batch mode — scrape many URLs at once
python main.py --batch urls.txt "Extract all product names and prices" -f json

# structured output — force an exact JSON shape
python main.py "https://example.com" "Get products" --schema '{"name": "str", "price": "float", "in_stock": "bool"}'

# proxy rotation + rate limiting
python main.py "https://example.com" "Get all links" --proxy proxies.txt --rate-limit 2

# resume after a crash
python main.py --batch urls.txt "Get data" --resume checkpoint.json

# use a YAML config for complex jobs
python main.py --config crawl.yaml

# API server mode
python main.py serve --port 8080

# scheduled monitoring — run every 6 hours
python main.py "https://competitor.com/pricing" "Get all prices" --schedule 6h --webhook https://hooks.slack.com/your/url

# use different LLM providers
python main.py "https://example.com" "Get contact info" -p openai
python main.py "https://example.com" "Get links" -p ollama -m llama3.2
```
## YAML config

For complex or recurring jobs, use a YAML file instead of CLI flags:

```yaml
# crawl.yaml
start_url: "https://books.toscrape.com/"
user_prompt: "Extract all book titles and prices"
max_depth: 1
output_format: csv
rate_limit: 1.0
proxy: "proxies.txt"
checkpoint_file: "books_checkpoint.json"
```

```bash
python main.py --config crawl.yaml
```

See `crawl.example.yaml` for all available options.
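Under the hood, `config.py` holds a settings dataclass that YAML keys are loaded into. A minimal sketch of the idea (field names come from the example above; the real `CrawlerConfig` has more fields, and `config_from_mapping` is a hypothetical helper):

```python
from dataclasses import dataclass, fields

@dataclass
class CrawlerConfig:
    # A few fields from crawl.yaml above; the real dataclass has more.
    start_url: str = ""
    user_prompt: str = ""
    max_depth: int = 0
    output_format: str = "json"
    rate_limit: float = 0.0

def config_from_mapping(data: dict) -> CrawlerConfig:
    """Build a config from a parsed YAML mapping, ignoring unknown keys."""
    known = {f.name for f in fields(CrawlerConfig)}
    return CrawlerConfig(**{k: v for k, v in data.items() if k in known})

cfg = config_from_mapping({"start_url": "https://books.toscrape.com/", "max_depth": 1})
# cfg.output_format stays at its default, "json"
```

Unset keys simply keep their dataclass defaults, which is why a minimal YAML file is enough.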
## REST API

Start the API server:

```bash
pip install fastapi uvicorn   # one-time setup
python main.py serve
```

Then call it:

```bash
curl -X POST http://localhost:8000/scrape \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com", "prompt": "Extract all links"}'
```

Interactive docs live at http://localhost:8000/docs.
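The same call from Python, using only the standard library. The request shape mirrors the curl example; `scrape_request` is an illustrative helper, and real code should add timeouts and error handling:

```python
import json
from urllib.request import Request, urlopen

def scrape_request(base: str, url: str, prompt: str) -> Request:
    """Build the POST /scrape request shown in the curl example."""
    body = json.dumps({"url": url, "prompt": prompt}).encode()
    return Request(f"{base}/scrape", data=body,
                   headers={"Content-Type": "application/json"}, method="POST")

req = scrape_request("http://localhost:8000", "https://example.com", "Extract all links")
# with urlopen(req) as resp:          # requires the server to be running
#     print(json.load(resp))
```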
## Docker

```bash
docker build -t extracto .
docker run -p 8000:8000 -e MISTRAL_API_KEY=your_key extracto
```
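If you prefer Compose, an equivalent service definition might look like the sketch below (the service name is an assumption; swap the env var for your provider's key):

```yaml
# docker-compose.yml (sketch)
services:
  extracto:
    build: .
    ports:
      - "8000:8000"
    environment:
      MISTRAL_API_KEY: ${MISTRAL_API_KEY}
```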
## Supported LLM providers

| Provider | Env variable | Default model |
|---|---|---|
| Mistral | `MISTRAL_API_KEY` | `mistral-small-latest` |
| OpenAI | `OPENAI_API_KEY` | `gpt-4o-mini` |
| Groq | `GROQ_API_KEY` | `llama-3.1-8b-instant` |
| Google Gemini | `GOOGLE_API_KEY` | `gemini-2.0-flash` |
| Ollama | none (local) | `llama3.2` |

Run `python main.py --list-models` to see all available models.
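The provider table maps directly onto an env-var lookup. A sketch of how a tool like this can decide which providers are usable (`available_providers` is hypothetical, not the project's actual API):

```python
import os

# Provider -> API-key env var, mirroring the table above (Ollama is local, no key).
PROVIDERS = {
    "mistral": "MISTRAL_API_KEY",
    "openai": "OPENAI_API_KEY",
    "groq": "GROQ_API_KEY",
    "gemini": "GOOGLE_API_KEY",
    "ollama": None,
}

def available_providers(env=os.environ) -> list[str]:
    """Providers usable right now: keyless ones, plus any whose key is set."""
    return [p for p, var in PROVIDERS.items() if var is None or env.get(var)]

available_providers({"OPENAI_API_KEY": "sk-test"})  # ['openai', 'ollama']
```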
## Architecture

```
main.py (CLI + scheduler)
├─ CrawlerConfig        config.py          ← settings dataclass + YAML loader
├─ CrawlerEngine        crawler_engine.py  ← crawl loop + batch + checkpoint
│  ├─ BrowserEngine     browser_engine.py  ← stealth Playwright + proxy rotation
│  ├─ AIExtractor       ai_extractor.py    ← ScrapeGraphAI + your LLM
│  ├─ RobotsChecker     robots.py          ← robots.txt compliance
│  ├─ PageCache         cache.py           ← file-based render cache
│  ├─ SitemapDiscovery  sitemap.py         ← XML sitemap parser
│  └─ SchemaLoader      schema.py          ← structured output enforcement
├─ DataExporter         data_exporter.py   ← pandas → any format
├─ Server               server.py          ← FastAPI REST API
└─ Webhooks             webhooks.py        ← Discord/Slack notifications
```
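The checkpoint/resume piece of `crawler_engine.py` can be sketched in a few lines. The actual checkpoint file format is not documented here, so this is an assumed shape; the important idea is the atomic write, so a crash mid-save never corrupts the file:

```python
import json
import os

def save_checkpoint(path: str, done: list[str], pending: list[str]) -> None:
    """Persist crawl progress via write-to-temp + atomic rename."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"done": done, "pending": pending}, f)
    os.replace(tmp, path)  # atomic replacement of the old checkpoint

def load_checkpoint(path: str) -> tuple[list[str], list[str]]:
    """Resume from a previous run, or start fresh if no checkpoint exists."""
    if not os.path.exists(path):
        return [], []
    with open(path) as f:
        state = json.load(f)
    return state["done"], state["pending"]
```

On resume, the engine would skip every URL in `done` and continue from `pending`.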
## Project structure

```
├── main.py               CLI entry point + scheduler
├── config.py             settings dataclass + YAML loader
├── crawler_engine.py     crawl loop, batch mode, checkpoint/resume
├── browser_engine.py     Playwright with stealth, proxy rotation, screenshots
├── ai_extractor.py       multi-provider LLM extraction
├── data_exporter.py      export to JSON/CSV/XML/SQL/Excel/Markdown
├── robots.py             robots.txt compliance checker
├── schema.py             JSON schema → structured output
├── sitemap.py            sitemap.xml auto-discovery
├── cache.py              file-based page cache
├── server.py             FastAPI REST API
├── webhooks.py           webhook notifications
├── utils.py              Rich terminal UI helpers
├── crawl.example.yaml    example YAML config
├── urls.example.txt      example batch URL file
├── Dockerfile
├── requirements.txt
├── .env.example
├── .gitignore
└── LICENSE               MIT
```
## Requirements

- Python 3.10+
- An API key for at least one LLM provider (or Ollama running locally)
- Optional: `fastapi` + `uvicorn` for API server mode

## License

MIT

Built with ❤️ by Nishal.
## File details

### extracto_scraper-2.0.0.tar.gz (source distribution)

- Size: 31.6 kB
- Uploaded via: twine/6.2.0 CPython/3.11.9
- Uploaded using Trusted Publishing: No

| Algorithm | Hash digest |
|---|---|
| SHA256 | `59f3b1a0ae340830a38c7c4e4a9d629d51e18a04b743474d4ceb58f97af357fc` |
| MD5 | `5271d7428c11ab19caf07e1dd4fba01e` |
| BLAKE2b-256 | `db1fc090474316efa5ee22beb19f3d39eb652ff2c20932e57521234cd50a1cb3` |
### extracto_scraper-2.0.0-py3-none-any.whl (built distribution, Python 3)

- Size: 35.1 kB
- Uploaded via: twine/6.2.0 CPython/3.11.9
- Uploaded using Trusted Publishing: No

| Algorithm | Hash digest |
|---|---|
| SHA256 | `51a24792b79fd64697a42ddaabf24e8f3cfbaf45a7edba0e29ab9c83220aadf4` |
| MD5 | `019fbd806079898dd9f257fd30345f06` |
| BLAKE2b-256 | `a98077df7d5f536feefce4a99196e6564ad6e4cc34e78cd0753881f8945178d7` |