AI-powered web scraping SDK with intelligent configuration generation

ScrapAI - AI-Powered Web Scraping Made Simple

Extract data from any website or API using natural language - no coding required!

ScrapAI uses artificial intelligence to understand what data you need and automatically extracts it from websites and APIs. Just describe what you want, and ScrapAI handles the rest.


✨ What Can You Achieve?

  • Extract structured data from websites and APIs using simple descriptions
  • Create reusable scraping configurations for repeated data collection
  • Get instant results with one-off data extraction (SmartScraper)
  • Automate data pipelines with scheduled scraping configurations
  • Support multiple AI services including OpenAI, Ollama, Anthropic, Grok, and more
  • No manual configuration - AI discovers APIs, tests paths, and creates optimal configs automatically

🚀 Quick Start

Installation

pip install scrapai

Option 1: Direct Data Extraction (SmartScraper)

Get structured data immediately without creating configuration files:

import asyncio
from scrapai import ScrapAIClient

async def main():
    client = ScrapAIClient(
        service_name="ollama",  # or "openai", "grok", "anthropic", etc.
        service_key="your-api-key",  # not needed for local Ollama
    )
    
    result = await client.smartscraper(
        url="https://example.com/data",
        description="Get product name, price, and rating"
    )
    
    if result["success"]:
        print(result["data"])  # Structured JSON output
    
    await client.close()

asyncio.run(main())

Output:

{
  "product_name": "Example Product",
  "price": 29.99,
  "rating": 4.5
}

Option 2: Create Reusable Configuration

Generate a reusable scraping configuration for repeated data collection:

import asyncio
from scrapai import ScrapAIClient

async def main():
    client = ScrapAIClient(
        service_name="ollama",
        service_key="your-api-key",
    )
    
    # AI creates the configuration automatically
    result = client.add_config(
        url="https://api.example.com/metrics",
        description="Get transaction count and total volume"
    )
    
    config_name = result["config_name"]
    
    # Execute the configuration
    data = await client.execute_config(config_name)
    
    # Results are in structured format under data["data"]
    if data.get("success"):
        for item in data["data"]:
            print(f"{item['name']}: {item['metric']} = {item['value']}")
    
    await client.close()

asyncio.run(main())

Once created, you can run configurations anytime - perfect for scheduled jobs!

import asyncio
from scrapai import ScrapAIClient

# Run existing configuration (no AI needed - already configured)
async def scheduled_data_collection():
    client = ScrapAIClient(service_name="ollama", service_key="your-key")
    
    # Execute any existing configuration
    data = await client.execute_config("my_config_name")
    
    # Process the data (save to database, send alerts, etc.)
    if data.get("success"):
        for item in data["data"]:
            print(f"Collected: {item['name']} - {item['metric']} = {item['value']}")
    
    await client.close()

# Use with cron jobs, task schedulers, or automation tools
# This runs without AI - just executes the saved configuration
if __name__ == "__main__":
    asyncio.run(scheduled_data_collection())
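
A saved configuration can also be re-run from a long-lived Python process instead of cron. The following is a minimal sketch of that idea and is not part of ScrapAI: `run_periodically` and `fake_job` are hypothetical names, and `fake_job` stands in for a coroutine that would call `client.execute_config(...)`.

```python
import asyncio
from typing import Awaitable, Callable


async def run_periodically(
    job: Callable[[], Awaitable[dict]],
    interval_seconds: float,
    max_runs: int,
) -> list[dict]:
    """Run `job` up to `max_runs` times, sleeping between consecutive runs."""
    results = []
    for run_index in range(max_runs):
        try:
            results.append(await job())
        except Exception as exc:
            # Record the failure and keep going so one bad run does not stop the loop.
            results.append({"success": False, "error": str(exc)})
        if run_index < max_runs - 1:
            await asyncio.sleep(interval_seconds)
    return results


async def fake_job() -> dict:
    # Hypothetical stand-in for: await client.execute_config("my_config_name")
    return {"success": True, "data": [{"name": "demo", "metric": "count", "value": 1}]}


if __name__ == "__main__":
    outcomes = asyncio.run(run_periodically(fake_job, interval_seconds=0.01, max_runs=3))
    print(sum(1 for outcome in outcomes if outcome["success"]), "successful runs")
```

For production schedules, an external scheduler such as cron is usually the simpler choice, since it survives process restarts.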

📋 Use Cases

For Data Engineers

  • Rapidly create scraping configs for data pipelines
  • Automate data collection from multiple sources
  • Schedule recurring extractions - Run saved configurations anytime (cron jobs, task schedulers, etc.)
  • No AI calls needed for execution - configs run independently

For Analysts

  • Extract metrics from APIs and websites without coding
  • Get structured data ready for analysis
  • No need to learn XPath, CSS selectors, or API endpoints

For Developers

  • Integrate intelligent scraping into applications
  • Support multiple AI services with unified API
  • Handle complex pages with automatic fallback strategies

🔧 Supported AI Services

ScrapAI works with any OpenAI-compatible API:

  • OpenAI - GPT-4, GPT-3.5
  • Ollama - Local models (llama3, qwen, mistral, etc.)
  • Anthropic - Claude models
  • Grok - xAI's Grok
  • Google - Gemini models
  • Mistral AI - Mistral models
  • Custom Services - Any OpenAI-compatible endpoint

# Using OpenAI
client = ScrapAIClient(
    service_name="openai",
    service_key="sk-...",
    service_model="gpt-4"
)

# Using Ollama (local)
client = ScrapAIClient(
    service_name="ollama",
    service_key="not-needed",  # Local Ollama doesn't need key
    service_model="llama3:latest"
)

# Using custom service
client = ScrapAIClient(
    service_name="custom",
    service_key="your-key",
    service_base_url="https://your-api.com/v1",
    service_model="your-model"
)
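
To avoid hard-coding keys, the constructor arguments shown above could be assembled from environment variables. This is a sketch under stated assumptions: the helper `client_kwargs_from_env` and the `SCRAPAI_*` variable names are hypothetical and not defined by ScrapAI. Only the keyword names are taken from the examples above.

```python
import os
from typing import Mapping, Optional


def client_kwargs_from_env(env: Optional[Mapping[str, str]] = None) -> dict:
    """Build keyword arguments for ScrapAIClient from environment variables.

    The SCRAPAI_* names are hypothetical; the returned keys (service_name,
    service_key, service_model, service_base_url) mirror the examples above.
    """
    env = os.environ if env is None else env
    kwargs = {
        "service_name": env.get("SCRAPAI_SERVICE", "ollama"),
        "service_key": env.get("SCRAPAI_KEY", "not-needed"),
    }
    # Optional settings are passed only when they are actually set.
    if env.get("SCRAPAI_MODEL"):
        kwargs["service_model"] = env["SCRAPAI_MODEL"]
    if env.get("SCRAPAI_BASE_URL"):
        kwargs["service_base_url"] = env["SCRAPAI_BASE_URL"]
    return kwargs


if __name__ == "__main__":
    print(client_kwargs_from_env({"SCRAPAI_SERVICE": "openai", "SCRAPAI_MODEL": "gpt-4"}))
```

The resulting dictionary would then be unpacked into the constructor, for example `ScrapAIClient(**client_kwargs_from_env())`.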

💡 Key Features

Intelligent Resource Selection

  • API-first approach - Automatically discovers and uses APIs when available
  • HTML fallback - Falls back to HTML scraping if API fails
  • Multiple resources - Configures automatic fallback strategies
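
The API-first, HTML-fallback idea can be illustrated with a small ordered-fallback helper. This is only a sketch of the general pattern, not ScrapAI's internal implementation; `first_successful`, `fetch_from_api`, and `fetch_from_html` are hypothetical names.

```python
from typing import Callable, Optional, Sequence


def first_successful(
    resources: Sequence[tuple[str, Callable[[], dict]]],
) -> tuple[Optional[str], Optional[dict], list[str]]:
    """Try each (label, fetch) pair in order and return the first result that works."""
    errors = []
    for label, fetch in resources:
        try:
            return label, fetch(), errors
        except Exception as exc:
            # Remember why this resource failed, then move on to the next one.
            errors.append(f"{label}: {exc}")
    return None, None, errors


def fetch_from_api() -> dict:
    # Hypothetical API resource that is currently failing.
    raise ConnectionError("API unavailable")


def fetch_from_html() -> dict:
    # Hypothetical HTML resource used as the fallback.
    return {"price": 29.99}


if __name__ == "__main__":
    source, data, errors = first_successful(
        [("api", fetch_from_api), ("html", fetch_from_html)]
    )
    print(source, data, errors)
```

Keeping the collected errors alongside the result makes it possible to log why the preferred resource was skipped.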

Automatic Configuration Generation

  • AI analyzes URLs and discovers APIs
  • Tests extraction paths before creating configs
  • Iteratively refines until config works correctly
  • Creates reusable configuration files
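
To make "tests extraction paths" concrete, here is one way a candidate path could be checked against a sample JSON response before it is saved. The dotted-path syntax and the names `extract_path` and `path_works` are assumptions made for illustration; ScrapAI's actual path format is not described here.

```python
from typing import Any


def extract_path(payload: Any, path: str) -> Any:
    """Follow a dotted path such as "data.items.0.price" through dicts and lists."""
    current = payload
    for key in path.split("."):
        if isinstance(current, list):
            current = current[int(key)]  # list segments are numeric indexes
        else:
            current = current[key]
    return current


def path_works(payload: Any, path: str) -> bool:
    """Return True if the path resolves to a non-null value in the payload."""
    try:
        return extract_path(payload, path) is not None
    except (KeyError, IndexError, ValueError, TypeError):
        return False


if __name__ == "__main__":
    sample = {"data": {"items": [{"price": 29.99}]}}
    candidates = ["data.price", "data.items.0.price"]
    print([path for path in candidates if path_works(sample, path)])
```

A refinement loop could generate candidate paths, keep only those for which `path_works` returns True, and retry with new candidates otherwise.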

Production-Ready

  • Error handling and automatic retries
  • Proxy rotation support
  • Browser rendering for JavaScript-heavy pages
  • Structured data output with metadata
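
Callers who want an extra retry layer around their own calls can use a generic exponential-backoff wrapper. The sketch below shows the general technique only and makes no claim about how ScrapAI's built-in retries work; `with_retries` and `flaky` are hypothetical names.

```python
import asyncio
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")


async def with_retries(
    operation: Callable[[], Awaitable[T]],
    attempts: int = 3,
    base_delay: float = 0.5,
) -> T:
    """Await `operation`, retrying with exponential backoff on any exception."""
    for attempt in range(attempts):
        try:
            return await operation()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            # Delay doubles each time: base_delay, 2*base_delay, 4*base_delay, ...
            await asyncio.sleep(base_delay * 2 ** attempt)
    raise ValueError("attempts must be at least 1")


calls = {"count": 0}


async def flaky() -> dict:
    # Hypothetical operation that fails twice before succeeding.
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary failure")
    return {"success": True}


if __name__ == "__main__":
    print(asyncio.run(with_retries(flaky, attempts=3, base_delay=0.01)))
```

In real use, `operation` could be a lambda wrapping a call such as `client.execute_config("config_name")`.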

📖 Basic Usage

List Available Configurations

configs = client.list_configs()
print(configs)  # ['config1', 'config2', ...]

Execute a Configuration

result = await client.execute_config("config_name")
if result["success"]:
    for item in result["data"]:
        print(f"{item['name']}: {item['metric']} = {item['value']}")

Remove a Configuration

client.remove_config("config_name")

📊 Output Format

Configuration runs return a list of records, available as result["data"] in the examples above. Each record has the following structure:

[
    {
        "name": "entity_name",
        "metric": "metric_name",
        "value": 12345,
        "date": "2024-01-15T10:30:00Z",
        "config_name": "my_config"
    },
    ...
]
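
Because every record shares the same fields, the output is easy to hand to analysis tools. As a small illustration using only the Python standard library, the sketch below writes such records to CSV text; `records_to_csv` and `FIELDS` are hypothetical names, and the field list is copied from the structure above.

```python
import csv
import io

# Field names taken from the output structure shown above.
FIELDS = ["name", "metric", "value", "date", "config_name"]


def records_to_csv(records: list[dict]) -> str:
    """Serialize records into CSV text, ignoring any unexpected extra keys."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=FIELDS, extrasaction="ignore", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


if __name__ == "__main__":
    sample = [
        {
            "name": "entity_name",
            "metric": "metric_name",
            "value": 12345,
            "date": "2024-01-15T10:30:00Z",
            "config_name": "my_config",
        }
    ]
    print(records_to_csv(sample))
```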

📄 License

MIT License - See LICENSE file for details


🤝 Contributing

Contributions are welcome! Please see the GitHub repository for contribution guidelines.


👨‍💻 About the Author

Zohaib Yousaf - Full Stack Developer & Data Engineer

Passionate about building intelligent systems that automate complex workflows. ScrapAI combines my expertise in web scraping, AI integration, and data engineering to make data extraction accessible to everyone.


Version: 0.6.0
Last Updated: November 2025
