
AI-powered web scraping SDK with intelligent configuration generation

ScrapAI - AI-Powered Web Scraping Made Simple

Extract data from any website or API using natural language - no coding required!

ScrapAI uses artificial intelligence to understand what data you need and automatically extracts it from websites and APIs. Just describe what you want, and ScrapAI handles the rest.


✨ What Can You Achieve?

  • Extract structured data from websites and APIs using simple descriptions
  • Create reusable scraping configurations for repeated data collection
  • Get instant results with one-off data extraction (SmartScraper)
  • Automate data pipelines with scheduled scraping configurations
  • Support multiple AI services including OpenAI, Ollama, Anthropic, Grok, and more
  • No manual configuration - AI discovers APIs, tests paths, and creates optimal configs automatically

🚀 Quick Start

Installation

pip install scrapai

Option 1: Direct Data Extraction (SmartScraper)

Get structured data immediately without creating configuration files:

import asyncio
from scrapai import ScrapAIClient

async def main():
    client = ScrapAIClient(
        service_name="ollama",  # or "openai", "grok", "anthropic", etc.
        service_key="your-api-key",  # not needed for local Ollama
    )
    
    result = await client.smartscraper(
        url="https://example.com/data",
        description="Get product name, price, and rating"
    )
    
    if result["success"]:
        print(result["data"])  # Structured JSON output
    
    await client.close()

asyncio.run(main())

Output:

{
  "product_name": "Example Product",
  "price": 29.99,
  "rating": 4.5
}
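
The example above calls close() only when nothing goes wrong before it. One way to guarantee cleanup is a small async context manager; this is a sketch, not part of ScrapAI. The names closing_client and FakeClient are hypothetical, and FakeClient only stands in for ScrapAIClient so the snippet runs without network access:

```python
import asyncio
from contextlib import asynccontextmanager


@asynccontextmanager
async def closing_client(client):
    """Yield the client and always await client.close() afterwards."""
    try:
        yield client
    finally:
        await client.close()


class FakeClient:
    """Hypothetical stand-in for ScrapAIClient (assumption, for illustration)."""

    def __init__(self):
        self.closed = False

    async def smartscraper(self, url, description):
        # Simulate a request that fails part-way through.
        raise RuntimeError("simulated network failure")

    async def close(self):
        self.closed = True


async def demo():
    fake = FakeClient()
    try:
        async with closing_client(fake) as client:
            await client.smartscraper(
                url="https://example.com/data",
                description="Get product name, price, and rating",
            )
    except RuntimeError:
        pass  # the error still propagates, but close() has already run
    return fake.closed
```

With a real client, the same pattern would read: async with closing_client(ScrapAIClient(...)) as client.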

Option 2: Create Reusable Configuration

Generate a reusable scraping configuration for repeated data collection:

import asyncio
from scrapai import ScrapAIClient

async def main():
    client = ScrapAIClient(
        service_name="ollama",
        service_key="your-api-key",
    )
    
    # AI creates the configuration automatically
    result = client.add_config(
        url="https://api.example.com/metrics",
        description="Get transaction count and total volume"
    )
    
    config_name = result["config_name"]
    
    # Execute the configuration
    data = await client.execute_config(config_name)
    
    # Results are in structured format: check "success", then read "data"
    if data.get("success"):
        for item in data["data"]:
            print(f"{item['name']}: {item['metric']} = {item['value']}")
    
    await client.close()

asyncio.run(main())

Once a configuration has been created, you can run it anytime, which makes it a good fit for scheduled jobs:

# Run existing configuration (no AI needed - already configured)
import asyncio
from scrapai import ScrapAIClient

async def scheduled_data_collection():
    client = ScrapAIClient(service_name="ollama", service_key="your-key")
    
    # Execute any existing configuration
    data = await client.execute_config("my_config_name")
    
    # Process the data (save to database, send alerts, etc.)
    if data.get("success"):
        for item in data["data"]:
            print(f"Collected: {item['name']} - {item['metric']} = {item['value']}")
    
    await client.close()

# Use with cron jobs, task schedulers, or automation tools
# This runs without AI - just executes the saved configuration
asyncio.run(scheduled_data_collection())
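
The "save to database" step in the scheduled job is left as a comment. A minimal sketch of that step using Python's built-in sqlite3 module follows, assuming records carry the fields shown in the Output Format section. The helper save_records and the table name metrics are hypothetical and not provided by ScrapAI:

```python
import sqlite3


def save_records(conn, records):
    """Insert records (dicts with the documented fields) and return the row count."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS metrics (
               name TEXT, metric TEXT, value REAL, date TEXT, config_name TEXT
           )"""
    )
    # Named placeholders pull each column from the matching dict key.
    conn.executemany(
        "INSERT INTO metrics VALUES (:name, :metric, :value, :date, :config_name)",
        records,
    )
    conn.commit()
    return conn.execute("SELECT COUNT(*) FROM metrics").fetchone()[0]


# Sample record copied from the documented output shape.
sample_records = [
    {
        "name": "entity_name",
        "metric": "metric_name",
        "value": 12345,
        "date": "2024-01-15T10:30:00Z",
        "config_name": "my_config",
    }
]
```

Inside the scheduled function, this could be called as save_records(sqlite3.connect("metrics.db"), data["data"]) after the success check; the file name is only an example.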

📋 Use Cases

For Data Engineers

  • Rapidly create scraping configs for data pipelines
  • Automate data collection from multiple sources
  • Schedule recurring extractions - run saved configurations anytime with cron jobs, task schedulers, and similar tools
  • No AI calls needed for execution - configs run independently

For Analysts

  • Extract metrics from APIs and websites without coding
  • Get structured data ready for analysis
  • No need to learn XPath, CSS selectors, or API endpoints

For Developers

  • Integrate intelligent scraping into applications
  • Support multiple AI services with unified API
  • Handle complex pages with automatic fallback strategies

🔧 Supported AI Services

ScrapAI works with any OpenAI-compatible API:

  • OpenAI - GPT-4, GPT-3.5
  • Ollama - Local models (llama3, qwen, mistral, etc.)
  • Anthropic - Claude models
  • Grok - xAI's Grok
  • Google - Gemini models
  • Mistral AI - Mistral models
  • Custom Services - Any OpenAI-compatible endpoint

# Using OpenAI
client = ScrapAIClient(
    service_name="openai",
    service_key="sk-...",
    service_model="gpt-4"
)

# Using Ollama (local)
client = ScrapAIClient(
    service_name="ollama",
    service_key="not-needed",  # Local Ollama doesn't need key
    service_model="llama3:latest"
)

# Using custom service
client = ScrapAIClient(
    service_name="custom",
    service_key="your-key",
    service_base_url="https://your-api.com/v1",
    service_model="your-model"
)
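
To avoid hard-coding keys when switching between services, the constructor arguments can be assembled from environment variables. This is only a sketch: the variable names SCRAPAI_SERVICE, SCRAPAI_API_KEY, SCRAPAI_MODEL, and SCRAPAI_BASE_URL are assumptions made for illustration, not names ScrapAI reads on its own.

```python
import os


def client_kwargs_from_env(env=None):
    """Build keyword arguments for ScrapAIClient from environment variables.

    The environment variable names are hypothetical (chosen for this sketch).
    """
    env = os.environ if env is None else env
    kwargs = {
        "service_name": env.get("SCRAPAI_SERVICE", "ollama"),
        "service_key": env.get("SCRAPAI_API_KEY", "not-needed"),
    }
    # Only pass optional arguments when they are actually set.
    if env.get("SCRAPAI_MODEL"):
        kwargs["service_model"] = env["SCRAPAI_MODEL"]
    if env.get("SCRAPAI_BASE_URL"):
        kwargs["service_base_url"] = env["SCRAPAI_BASE_URL"]
    return kwargs


# Usage (requires scrapai to be installed):
# client = ScrapAIClient(**client_kwargs_from_env())
```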

💡 Key Features

Intelligent Resource Selection

  • API-first approach - Automatically discovers and uses APIs when available
  • HTML fallback - Falls back to HTML scraping if API fails
  • Multiple resources - Configures automatic fallback strategies
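
The API-first, HTML-fallback idea can be pictured as trying an ordered list of strategies until one returns data. The sketch below illustrates that general pattern only; it is not ScrapAI's internal implementation, and first_successful, fetch_from_api, and fetch_from_html are hypothetical names:

```python
def first_successful(strategies):
    """Try (label, callable) pairs in order; return (label, result) for the first that works."""
    errors = []
    for label, fetch in strategies:
        try:
            result = fetch()
        except Exception as exc:  # any failure moves on to the next strategy
            errors.append(f"{label}: {exc}")
            continue
        if result:
            return label, result
        errors.append(f"{label}: empty result")
    raise RuntimeError("all strategies failed: " + "; ".join(errors))


def fetch_from_api():
    # Simulated API source that is currently unavailable.
    raise ConnectionError("API unavailable")


def fetch_from_html():
    # Simulated HTML scrape that succeeds.
    return [{"name": "entity_name", "metric": "metric_name", "value": 12345}]


label, records = first_successful([("api", fetch_from_api), ("html", fetch_from_html)])
print(label, records)
```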

Automatic Configuration Generation

  • AI analyzes URLs and discovers APIs
  • Tests extraction paths before creating configs
  • Iteratively refines until config works correctly
  • Creates reusable configuration files
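
"Testing extraction paths" can be illustrated by checking candidate paths against a sample response and keeping the first one that yields a numeric value. The dotted path syntax and the helpers resolve_path and first_working_path are assumptions for this sketch; ScrapAI's actual path format and refinement loop may differ.

```python
def resolve_path(payload, path):
    """Follow a dotted path such as 'data.stats.0.tx_count' through dicts and lists."""
    current = payload
    for part in path.split("."):
        if isinstance(current, list):
            current = current[int(part)]  # list segments are integer indexes
        else:
            current = current[part]
    return current


def first_working_path(payload, candidates):
    """Return (path, value) for the first candidate that resolves to a number, else None."""
    for path in candidates:
        try:
            value = resolve_path(payload, path)
        except (KeyError, IndexError, ValueError, TypeError):
            continue  # path does not exist in this payload
        if isinstance(value, (int, float)) and not isinstance(value, bool):
            return path, value
    return None


# Hypothetical API response used to test the candidate paths.
sample_payload = {"data": {"stats": [{"tx_count": 42, "label": "daily"}]}}
candidates = ["data.tx_count", "data.stats.0.label", "data.stats.0.tx_count"]
print(first_working_path(sample_payload, candidates))
```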

Production-Ready

  • Error handling and automatic retries
  • Proxy rotation support
  • Browser rendering for JavaScript-heavy pages
  • Structured data output with metadata
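
The retry behaviour listed above is handled inside the library. If you also want retries around a whole call in your own code, a caller-side wrapper with exponential backoff is one option. This is a sketch under assumptions: with_retries and Flaky are hypothetical names, and Flaky only simulates a call that fails twice before succeeding.

```python
import asyncio


async def with_retries(make_call, attempts=3, base_delay=0.5):
    """Await make_call() up to `attempts` times, doubling the delay after each failure."""
    for attempt in range(1, attempts + 1):
        try:
            return await make_call()
        except Exception:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            await asyncio.sleep(base_delay * 2 ** (attempt - 1))


class Flaky:
    """Simulated coroutine that fails a fixed number of times, then succeeds."""

    def __init__(self, failures):
        self.remaining = failures
        self.calls = 0

    async def __call__(self):
        self.calls += 1
        if self.remaining > 0:
            self.remaining -= 1
            raise ConnectionError("temporary failure")
        return {"success": True, "data": []}


# With a real client this could look like:
# result = await with_retries(lambda: client.execute_config("my_config_name"))
```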

📖 Basic Usage

List Available Configurations

configs = client.list_configs()
print(configs)  # ['config1', 'config2', ...]

Execute a Configuration

result = await client.execute_config("config_name")
if result["success"]:
    for item in result["data"]:
        print(f"{item['name']}: {item['metric']} = {item['value']}")

Remove a Configuration

client.remove_config("config_name")

📊 Output Format

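Executing a configuration returns records in the shape below. SmartScraper output instead mirrors the fields you describe, as in the Quick Start example.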

[
    {
        "name": "entity_name",
        "metric": "metric_name",
        "value": 12345,
        "date": "2024-01-15T10:30:00Z",
        "config_name": "my_config"
    },
    ...
]
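
Because every record shares the same fields, the list converts directly to CSV for spreadsheets or analysis tools. A minimal sketch using the standard library follows; records_to_csv is a hypothetical helper, not part of ScrapAI:

```python
import csv
import io

FIELDS = ["name", "metric", "value", "date", "config_name"]


def records_to_csv(records):
    """Render records (dicts with the documented fields) as CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=FIELDS,
        extrasaction="ignore",  # skip any extra keys a record may carry
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


example_records = [
    {
        "name": "entity_name",
        "metric": "metric_name",
        "value": 12345,
        "date": "2024-01-15T10:30:00Z",
        "config_name": "my_config",
    }
]
print(records_to_csv(example_records))
```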


📄 License

MIT License - See LICENSE file for details


🤝 Contributing

Contributions are welcome! Please see the GitHub repository for contribution guidelines.


👨‍💻 About the Author

Zohaib Yousaf - Full Stack Developer & Data Engineer

Passionate about building intelligent systems that automate complex workflows. ScrapAI combines my expertise in web scraping, AI integration, and data engineering to make data extraction accessible to everyone.


Version: 0.5.0
Last Updated: November 2025


