AI-powered web scraping SDK with intelligent configuration generation
ScrapAI - AI-Powered Web Scraping Made Simple
Extract data from any website or API using natural language - no coding required!
ScrapAI uses artificial intelligence to understand what data you need and automatically extracts it from websites and APIs. Just describe what you want, and ScrapAI handles the rest.
✨ What Can You Achieve?
- Extract structured data from websites and APIs using simple descriptions
- Create reusable scraping configurations for repeated data collection
- Get instant results with one-off data extraction (SmartScraper)
- Automate data pipelines with scheduled scraping configurations
- Support multiple AI services including OpenAI, Ollama, Anthropic, Grok, and more
- No manual configuration - AI discovers APIs, tests paths, and creates optimal configs automatically
🚀 Quick Start
Installation
pip install scrapai
Option 1: Direct Data Extraction (SmartScraper)
Get structured data immediately without creating configuration files:
import asyncio
from scrapai import ScrapAIClient

async def main():
    client = ScrapAIClient(
        service_name="ollama",       # or "openai", "grok", "anthropic", etc.
        service_key="your-api-key",  # not needed for local Ollama
    )
    result = await client.smartscraper(
        url="https://example.com/data",
        description="Get product name, price, and rating"
    )
    if result["success"]:
        print(result["data"])  # Structured JSON output
    await client.close()

asyncio.run(main())
Output:
{
    "product_name": "Example Product",
    "price": 29.99,
    "rating": 4.5
}
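The extracted payload is a plain dictionary, so it can be checked before it is passed downstream. The following is a minimal sketch, assuming the result has the `success`/`data` shape shown above; the `Product` dataclass and the `parse_product` helper are hypothetical names for illustration and are not part of ScrapAI.

```python
from dataclasses import dataclass


@dataclass
class Product:
    product_name: str
    price: float
    rating: float


def parse_product(result: dict) -> Product:
    """Convert a smartscraper-style result dict into a typed record.

    Assumes the {"success": bool, "data": {...}} shape from the example above.
    """
    if not result.get("success"):
        raise ValueError("extraction failed")
    data = result["data"]
    missing = {"product_name", "price", "rating"} - set(data)
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    return Product(
        product_name=str(data["product_name"]),
        price=float(data["price"]),
        rating=float(data["rating"]),
    )


# Sample result built from the output shown above.
sample = {
    "success": True,
    "data": {"product_name": "Example Product", "price": 29.99, "rating": 4.5},
}
product = parse_product(sample)
print(product)
```

Failing early on a missing field keeps a changed page layout from silently producing incomplete records.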
Option 2: Create Reusable Configuration
Generate a reusable scraping configuration for repeated data collection:
import asyncio
from scrapai import ScrapAIClient

async def main():
    client = ScrapAIClient(
        service_name="ollama",
        service_key="your-api-key",
    )
    # AI creates the configuration automatically
    result = client.add_config(
        url="https://api.example.com/metrics",
        description="Get transaction count and total volume"
    )
    config_name = result["config_name"]
    # Execute the configuration
    data = await client.execute_config(config_name)
    # Results are in structured format (same result shape as the examples below)
    if data.get("success"):
        for item in data["data"]:
            print(f"{item['name']}: {item['metric']} = {item['value']}")
    await client.close()

asyncio.run(main())
Once a configuration is created, you can run it anytime - perfect for scheduled jobs!
import asyncio
from scrapai import ScrapAIClient

# Run existing configuration (no AI needed - already configured)
async def scheduled_data_collection():
    client = ScrapAIClient(service_name="ollama", service_key="your-key")
    # Execute any existing configuration
    data = await client.execute_config("my_config_name")
    # Process the data (save to database, send alerts, etc.)
    if data.get("success"):
        for item in data["data"]:
            print(f"Collected: {item['name']} - {item['metric']} = {item['value']}")
    await client.close()

# Use with cron jobs, task schedulers, or automation tools
# This runs without AI - just executes the saved configuration
asyncio.run(scheduled_data_collection())
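When cron or another external scheduler is not available, a saved configuration can also be re-run from a small in-process loop. The following is a sketch under stated assumptions: `run_periodically` and `fake_job` are hypothetical names, and `fake_job` stands in for a coroutine such as `scheduled_data_collection` so the example runs without ScrapAI installed.

```python
import asyncio
from typing import Awaitable, Callable


async def run_periodically(
    job: Callable[[], Awaitable[None]],
    interval_seconds: float,
    max_runs: int,
) -> int:
    """Run `job` up to `max_runs` times, sleeping between runs.

    A failing run is reported and does not stop the loop.
    Returns the number of runs that completed without raising.
    """
    succeeded = 0
    for run in range(max_runs):
        try:
            await job()
            succeeded += 1
        except Exception as exc:
            print(f"run {run + 1} failed: {exc}")
        # Sleep between runs, but not after the last one.
        if run < max_runs - 1:
            await asyncio.sleep(interval_seconds)
    return succeeded


# Stand-in job; in practice this would be scheduled_data_collection.
calls = []


async def fake_job() -> None:
    calls.append(len(calls) + 1)
    if len(calls) == 2:
        raise RuntimeError("simulated failure")


completed = asyncio.run(
    run_periodically(fake_job, interval_seconds=0.01, max_runs=3)
)
print(f"{completed} of {len(calls)} runs succeeded")
```

Catching the exception inside the loop means one failed collection does not cancel the remaining runs.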
📋 Use Cases
For Data Engineers
- Rapidly create scraping configs for data pipelines
- Automate data collection from multiple sources
- Schedule recurring extractions - Run saved configurations anytime (cron jobs, task schedulers, etc.)
- No AI calls needed for execution - configs run independently
For Analysts
- Extract metrics from APIs and websites without coding
- Get structured data ready for analysis
- No need to learn XPath, CSS selectors, or API endpoints
For Developers
- Integrate intelligent scraping into applications
- Support multiple AI services with a unified API
- Handle complex pages with automatic fallback strategies
🔧 Supported AI Services
ScrapAI works with the services below and with any OpenAI-compatible endpoint:
- OpenAI - GPT-4, GPT-3.5
- Ollama - Local models (llama3, qwen, mistral, etc.)
- Anthropic - Claude models
- Grok - xAI's Grok
- Google - Gemini models
- Mistral AI - Mistral models
- Custom Services - Any OpenAI-compatible endpoint
# Using OpenAI
client = ScrapAIClient(
    service_name="openai",
    service_key="sk-...",
    service_model="gpt-4"
)

# Using Ollama (local)
client = ScrapAIClient(
    service_name="ollama",
    service_key="not-needed",  # Local Ollama doesn't need key
    service_model="llama3:latest"
)

# Using custom service
client = ScrapAIClient(
    service_name="custom",
    service_key="your-key",
    service_base_url="https://your-api.com/v1",
    service_model="your-model"
)
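To avoid hard-coding keys, the client arguments can be assembled from environment variables. This is only a sketch: the `SCRAPAI_*` variable names and the `client_kwargs_from_env` helper are hypothetical, while the keyword names (`service_name`, `service_key`, `service_model`, `service_base_url`) are the ones used in the examples above.

```python
import os


def client_kwargs_from_env(env=None) -> dict:
    """Build keyword arguments for ScrapAIClient from environment variables.

    The SCRAPAI_* variable names are hypothetical, chosen for this example.
    """
    env = os.environ if env is None else env
    kwargs = {
        "service_name": env.get("SCRAPAI_SERVICE", "ollama"),
        # Local Ollama does not need a real key, so fall back to a placeholder.
        "service_key": env.get("SCRAPAI_API_KEY", "not-needed"),
    }
    # Optional settings are only passed when they are set.
    if env.get("SCRAPAI_MODEL"):
        kwargs["service_model"] = env["SCRAPAI_MODEL"]
    if env.get("SCRAPAI_BASE_URL"):
        kwargs["service_base_url"] = env["SCRAPAI_BASE_URL"]
    return kwargs


# A plain dict is passed here instead of os.environ to keep the demo repeatable.
example = client_kwargs_from_env(
    {"SCRAPAI_SERVICE": "openai", "SCRAPAI_API_KEY": "sk-...", "SCRAPAI_MODEL": "gpt-4"}
)
print(example)
# The result would then be passed on as: client = ScrapAIClient(**example)
```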
💡 Key Features
Intelligent Resource Selection
- API-first approach - Automatically discovers and uses APIs when available
- HTML fallback - Falls back to HTML scraping if API fails
- Multiple resources - Configures automatic fallback strategies
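The API-first, HTML-fallback idea can be illustrated with a generic "try each strategy in order" loop. This is a sketch of the pattern only, not ScrapAI's internal implementation; `fetch_with_fallback`, `fetch_api`, and `fetch_html` are hypothetical names, and the two fetchers are stubs that simulate a missing API and a successful HTML scrape.

```python
import asyncio
from typing import Any, Awaitable, Callable, Sequence, Tuple

Fetcher = Callable[[str], Awaitable[Any]]


async def fetch_with_fallback(
    url: str, strategies: Sequence[Tuple[str, Fetcher]]
) -> Tuple[str, Any]:
    """Try each (name, fetcher) pair in order and return the first success."""
    errors = []
    for name, fetcher in strategies:
        try:
            return name, await fetcher(url)
        except Exception as exc:
            # Remember why this strategy failed, then try the next one.
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all strategies failed: " + "; ".join(errors))


# Stub strategies for the demo; real ones would perform network requests.
async def fetch_api(url: str) -> dict:
    raise ConnectionError("no API found")


async def fetch_html(url: str) -> dict:
    return {"title": "Example"}


used, payload = asyncio.run(
    fetch_with_fallback(
        "https://example.com/data",
        [("api", fetch_api), ("html", fetch_html)],
    )
)
print(used, payload)
```

Returning the name of the strategy that succeeded makes it easy to log how often the fallback path is taken.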
Automatic Configuration Generation
- AI analyzes URLs and discovers APIs
- Tests extraction paths before creating configs
- Iteratively refines until config works correctly
- Creates reusable configuration files
Production-Ready
- Error handling and automatic retries
- Proxy rotation support
- Browser rendering for JavaScript-heavy pages
- Structured data output with metadata
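How ScrapAI retries internally is not described here. For callers who want an additional retry layer around a call such as `execute_config`, a minimal sketch of retrying with exponential backoff follows; `with_retries` and `flaky_call` are hypothetical names, and `flaky_call` simulates two temporary failures so the example runs on its own.

```python
import asyncio
from typing import Any, Awaitable, Callable


async def with_retries(
    make_call: Callable[[], Awaitable[Any]],
    attempts: int = 3,
    base_delay: float = 0.5,
) -> Any:
    """Await make_call(), retrying on any exception with exponential backoff.

    The delay doubles after each failure: base_delay, 2 * base_delay, ...
    The last failure is re-raised to the caller.
    """
    for attempt in range(attempts):
        try:
            return await make_call()
        except Exception:
            if attempt == attempts - 1:
                raise
            await asyncio.sleep(base_delay * 2 ** attempt)


# Stand-in for a network call that fails twice, then succeeds.
attempts_seen = []


async def flaky_call() -> dict:
    attempts_seen.append(1)
    if len(attempts_seen) < 3:
        raise ConnectionError("temporary failure")
    return {"success": True, "data": []}


result = asyncio.run(with_retries(flaky_call, attempts=3, base_delay=0.01))
print(result, "after", len(attempts_seen), "attempts")
```

`make_call` is a zero-argument callable rather than a coroutine object because a coroutine can only be awaited once; each retry needs a fresh one.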
📖 Basic Usage
List Available Configurations
configs = client.list_configs()
print(configs) # ['config1', 'config2', ...]
Execute a Configuration
result = await client.execute_config("config_name")
if result["success"]:
    for item in result["data"]:
        print(f"{item['name']}: {item['metric']} = {item['value']}")
Remove a Configuration
client.remove_config("config_name")
📊 Output Format
All extractions return structured data:
[
    {
        "name": "entity_name",
        "metric": "metric_name",
        "value": 12345,
        "date": "2024-01-15T10:30:00Z",
        "config_name": "my_config"
    },
    ...
]
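Because every record carries the same keys, the output maps directly onto tabular formats. The following sketch writes records of the shape shown above to CSV using only the standard library; `records_to_csv` is a hypothetical helper, and the sample record is the placeholder from the output format above.

```python
import csv
import io

# Sample record with the fields from the output format above.
records = [
    {
        "name": "entity_name",
        "metric": "metric_name",
        "value": 12345,
        "date": "2024-01-15T10:30:00Z",
        "config_name": "my_config",
    },
]


def records_to_csv(rows: list) -> str:
    """Serialize extraction records to CSV text with a header row."""
    fields = ["name", "metric", "value", "date", "config_name"]
    buffer = io.StringIO()
    # extrasaction="ignore" drops keys that are not listed in `fields`.
    writer = csv.DictWriter(buffer, fieldnames=fields, extrasaction="ignore")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


csv_text = records_to_csv(records)
print(csv_text)
```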
🔗 Additional Resources
- GitHub Repository: https://github.com/zohaib3249/scrapai
- Issue Tracker: https://github.com/zohaib3249/scrapai/issues
- Documentation: See GitHub README for detailed architecture and examples
📄 License
MIT License - See LICENSE file for details
🤝 Contributing
Contributions are welcome! Please see the GitHub repository for contribution guidelines.
👨‍💻 About the Author
Zohaib Yousaf - Full Stack Developer & Data Engineer
Passionate about building intelligent systems that automate complex workflows. ScrapAI combines my expertise in web scraping, AI integration, and data engineering to make data extraction accessible to everyone.
- GitHub: @zohaib3249
- Email: chzohaib136@gmail.com
Version: 0.6.0
Last Updated: November 2025