AI-powered web scraping SDK with intelligent configuration generation

ScrapAI - AI-Powered Web Scraping Made Simple

Extract data from websites and APIs by describing what you need in natural language, with no selectors or parsing code to write.

ScrapAI uses artificial intelligence to understand what data you need and automatically extracts it from websites and APIs. Just describe what you want, and ScrapAI handles the rest.

✨ What Can You Achieve?

Extract structured data from websites and APIs using simple descriptions
Create reusable scraping configurations for repeated data collection
Get instant results with one-off data extraction (SmartScraper)
Automate data pipelines with scheduled scraping configurations
Work with multiple AI services including OpenAI, Ollama, Anthropic, Grok, and more
No manual configuration - AI discovers APIs, tests paths, and creates optimal configs automatically

🚀 Quick Start

Installation

pip install scrapai

Option 1: Direct Data Extraction (SmartScraper)

Get structured data immediately without creating configuration files:

import asyncio
from scrapai import ScrapAIClient

async def main():
    client = ScrapAIClient(
        service_name="ollama",  # or "openai", "grok", "anthropic", etc.
        service_key="your-api-key",  # not needed for local Ollama
    )
    
    result = await client.smartscraper(
        url="https://example.com/data",
        description="Get product name, price, and rating"
    )
    
    if result["success"]:
        print(result["data"])  # Structured JSON output
    
    await client.close()

asyncio.run(main())

Output:

{
  "product_name": "Example Product",
  "price": 29.99,
  "rating": 4.5
}

Option 2: Create Reusable Configuration

Generate a reusable scraping configuration for repeated data collection:

import asyncio
from scrapai import ScrapAIClient

async def main():
    client = ScrapAIClient(
        service_name="ollama",
        service_key="your-api-key",
    )
    
    # AI creates the configuration automatically
    result = client.add_config(
        url="https://api.example.com/metrics",
        description="Get transaction count and total volume"
    )
    
    config_name = result["config_name"]
    
    # Execute the configuration
    data = await client.execute_config(config_name)
    
    # Records are under "data" when the run succeeds
    if data.get("success"):
        for item in data["data"]:
            print(f"{item['name']}: {item['metric']} = {item['value']}")
    
    await client.close()

asyncio.run(main())

Once a configuration has been created, you can run it anytime, which makes it a good fit for scheduled jobs.

import asyncio
from scrapai import ScrapAIClient

# Run an existing configuration (no AI needed - already configured)
async def scheduled_data_collection():
    client = ScrapAIClient(service_name="ollama", service_key="your-key")
    
    # Execute any existing configuration
    data = await client.execute_config("my_config_name")
    
    # Process the data (save to database, send alerts, etc.)
    if data.get("success"):
        for item in data["data"]:
            print(f"Collected: {item['name']} - {item['metric']} = {item['value']}")
    
    await client.close()

# Use with cron jobs, task schedulers, or automation tools
# This runs without AI - just executes the saved configuration
asyncio.run(scheduled_data_collection())
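
The "save to database" step mentioned in the comment could look like the following. This is a minimal sketch, assuming the result has the success/data shape used in the example and that each record carries the name, metric, value, and date fields shown under Output Format. The `store_records` helper and the `metrics` table are hypothetical and not part of ScrapAI.

```python
import sqlite3

def store_records(conn, result):
    """Insert records from an execute_config-style result; return rows written."""
    if not result.get("success"):
        return 0
    # Hypothetical table layout mirroring the documented record fields
    conn.execute(
        "CREATE TABLE IF NOT EXISTS metrics ("
        "name TEXT, metric TEXT, value REAL, date TEXT)"
    )
    rows = [
        (item["name"], item["metric"], item["value"], item.get("date"))
        for item in result.get("data", [])
    ]
    conn.executemany("INSERT INTO metrics VALUES (?, ?, ?, ?)", rows)
    conn.commit()
    return len(rows)

# Example with an in-memory database and a hand-written sample result
conn = sqlite3.connect(":memory:")
sample = {
    "success": True,
    "data": [
        {"name": "entity_name", "metric": "metric_name",
         "value": 12345, "date": "2024-01-15T10:30:00Z"},
    ],
}
written = store_records(conn, sample)
print(written)
```

In a scheduled job, the same helper could be called with the value returned by `execute_config` and a connection to a file-backed database.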

📋 Use Cases

For Data Engineers

Rapidly create scraping configs for data pipelines
Automate data collection from multiple sources
Schedule recurring extractions - Run saved configurations anytime (cron jobs, task schedulers, etc.)
No AI calls needed for execution - configs run independently

For Analysts

Extract metrics from APIs and websites without coding
Get structured data ready for analysis
No need to learn XPath, CSS selectors, or API endpoints

For Developers

Integrate intelligent scraping into applications
Support multiple AI services with a unified API
Handle complex pages with automatic fallback strategies

🔧 Supported AI Services

ScrapAI works with the services below and with any OpenAI-compatible API:

OpenAI - GPT-4, GPT-3.5
Ollama - Local models (llama3, qwen, mistral, etc.)
Anthropic - Claude models
Grok - xAI's Grok
Google - Gemini models
Mistral AI - Mistral models
Custom Services - Any OpenAI-compatible endpoint

# Using OpenAI
client = ScrapAIClient(
    service_name="openai",
    service_key="sk-...",
    service_model="gpt-4"
)

# Using Ollama (local)
client = ScrapAIClient(
    service_name="ollama",
    service_key="not-needed",  # Local Ollama doesn't need a key
    service_model="llama3:latest"
)

# Using custom service
client = ScrapAIClient(
    service_name="custom",
    service_key="your-key",
    service_base_url="https://your-api.com/v1",
    service_model="your-model"
)
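
To avoid hard-coding keys, the constructor arguments above can be assembled from environment variables. The sketch below is an illustration only: the `SCRAPAI_*` variable names and the `client_kwargs_from_env` helper are assumptions, while the keyword names (`service_name`, `service_key`, `service_model`, `service_base_url`) are the ones used in the examples above.

```python
import os

def client_kwargs_from_env(env=None):
    """Build ScrapAIClient keyword arguments from environment variables."""
    env = os.environ if env is None else env
    kwargs = {
        "service_name": env.get("SCRAPAI_SERVICE", "ollama"),
        # Local Ollama does not need a real key, so a placeholder is used.
        "service_key": env.get("SCRAPAI_API_KEY", "not-needed"),
    }
    # Only pass the optional settings when they are actually set.
    if env.get("SCRAPAI_MODEL"):
        kwargs["service_model"] = env["SCRAPAI_MODEL"]
    if env.get("SCRAPAI_BASE_URL"):
        kwargs["service_base_url"] = env["SCRAPAI_BASE_URL"]
    return kwargs

# Example using a plain dict in place of the real environment
example_kwargs = client_kwargs_from_env(
    {"SCRAPAI_SERVICE": "openai", "SCRAPAI_API_KEY": "sk-...", "SCRAPAI_MODEL": "gpt-4"}
)
print(example_kwargs)
```

The result can then be passed on as `ScrapAIClient(**client_kwargs_from_env())`.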

💡 Key Features

Intelligent Resource Selection

API-first approach - Automatically discovers and uses APIs when available
HTML fallback - Falls back to HTML scraping if the API fails
Multiple resources - Configures automatic fallback strategies
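
The API-first, HTML-fallback idea can be illustrated with a small ordered chain of strategies. This is a generic sketch of the pattern, not ScrapAI's internal implementation; `first_successful`, `fetch_from_api`, and `fetch_from_html` are hypothetical names, and the two fetchers are stand-ins that make no network calls.

```python
def first_successful(strategies):
    """Try each (label, fetch) pair in order; return the first non-empty result."""
    errors = []
    for label, fetch in strategies:
        try:
            data = fetch()
        except Exception as exc:  # a failing strategy should not stop the chain
            errors.append(f"{label}: {exc}")
            continue
        if data:
            return {"success": True, "source": label, "data": data}
        errors.append(f"{label}: empty result")
    return {"success": False, "errors": errors}

def fetch_from_api():
    # Stand-in for an API request that fails
    raise ConnectionError("API unavailable")

def fetch_from_html():
    # Stand-in for HTML scraping that succeeds
    return [{"name": "entity_name", "metric": "metric_name", "value": 12345}]

fallback_outcome = first_successful([("api", fetch_from_api), ("html", fetch_from_html)])
print(fallback_outcome["source"])
```

Keeping the strategies in an ordered list makes the preference explicit: the API is always tried first, and HTML scraping runs only when it fails or returns nothing.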

Automatic Configuration Generation

AI analyzes URLs and discovers APIs
Tests extraction paths before creating configs
Iteratively refines until the config works correctly
Creates reusable configuration files
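
As an illustration of what "testing an extraction path" can mean, the sketch below checks candidate paths against a sample API response and keeps the ones that resolve. The dotted path syntax and the `resolve_path` and `path_works` helpers are assumptions made for this example; ScrapAI's actual path format may differ.

```python
def resolve_path(payload, path):
    """Follow a dotted path such as 'data.items.0.price' through dicts and lists."""
    current = payload
    for part in path.split("."):
        if isinstance(current, list):
            current = current[int(part)]  # list segments are numeric indexes
        else:
            current = current[part]
    return current

def path_works(payload, path):
    """Return True when the path resolves to a non-null value."""
    try:
        return resolve_path(payload, path) is not None
    except (KeyError, IndexError, ValueError, TypeError):
        return False

sample_response = {"data": {"items": [{"price": 29.99, "rating": 4.5}]}}
candidates = ["data.items.0.price", "data.products.0.price", "data.items.3.price"]
working = [p for p in candidates if path_works(sample_response, p)]
print(working)
```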

Production-Ready

Error handling and automatic retries
Proxy rotation support
Browser rendering for JavaScript-heavy pages
Structured data output with metadata
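
If you want an additional retry layer around your own calls, a wrapper with exponential backoff is one common approach. The sketch below is generic and does not describe ScrapAI's built-in retry behavior; `with_retries` and `flaky` are hypothetical names, and `flaky` simulates a call that fails twice before succeeding.

```python
import asyncio

async def with_retries(operation, attempts=3, base_delay=0.5):
    """Await operation() up to `attempts` times, doubling the delay after each failure."""
    for attempt in range(attempts):
        try:
            return await operation()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            await asyncio.sleep(base_delay * 2 ** attempt)

calls = {"count": 0}

async def flaky():
    # Stand-in for a network call that fails twice before succeeding
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary failure")
    return "ok"

retry_outcome = asyncio.run(with_retries(flaky, base_delay=0))
print(retry_outcome, calls["count"])
```

Inside an async function, a configuration run could be wrapped as `await with_retries(lambda: client.execute_config("my_config_name"))`.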

📖 Basic Usage

List Available Configurations

configs = client.list_configs()
print(configs)  # ['config1', 'config2', ...]

Execute a Configuration

# Call from inside an async function, such as main() in the Quick Start
result = await client.execute_config("config_name")
if result["success"]:
    for item in result["data"]:
        print(f"{item['name']}: {item['metric']} = {item['value']}")
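
Listing and executing can be combined to run every saved configuration in one pass. The sketch below relies only on the `list_configs` and `execute_config` calls shown in this section and assumes the success/data result shape from the examples; `run_all_configs` is a hypothetical helper, and `FakeClient` is a stand-in so the code runs without ScrapAI or network access.

```python
import asyncio

async def run_all_configs(client):
    """Execute every saved configuration in turn and collect records by config name."""
    collected = {}
    for name in client.list_configs():
        result = await client.execute_config(name)
        if result.get("success"):
            collected[name] = result["data"]
    return collected

class FakeClient:
    """Stand-in client that mimics the two methods used above."""

    def list_configs(self):
        return ["config1", "config2"]

    async def execute_config(self, name):
        if name == "config2":
            return {"success": False}  # simulate a failed run
        return {"success": True, "data": [{"name": name, "metric": "m", "value": 1}]}

all_data = asyncio.run(run_all_configs(FakeClient()))
print(all_data)
```

Failed runs are skipped rather than raised here, so one broken configuration does not block the others.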

Remove a Configuration

client.remove_config("config_name")

📊 Output Format

Configuration runs return a list of structured records (the "data" field in the examples above):

[
    {
        "name": "entity_name",
        "metric": "metric_name",
        "value": 12345,
        "date": "2024-01-15T10:30:00Z",
        "config_name": "my_config"
    },
    ...
]
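
Because every record has the same fields, the list converts directly to tabular formats. The following sketch writes records to CSV text using the field names from the example above; `records_to_csv` is a hypothetical helper, not a ScrapAI function.

```python
import csv
import io

FIELDS = ["name", "metric", "value", "date", "config_name"]

def records_to_csv(records):
    """Serialize a list of record dicts to CSV text with a fixed column order."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=FIELDS, extrasaction="ignore", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(records)  # missing fields are written as empty cells
    return buffer.getvalue()

records = [
    {
        "name": "entity_name",
        "metric": "metric_name",
        "value": 12345,
        "date": "2024-01-15T10:30:00Z",
        "config_name": "my_config",
    }
]
csv_text = records_to_csv(records)
print(csv_text)
```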

🔗 Additional Resources

GitHub Repository: https://github.com/zohaib3249/scrapai
Issue Tracker: https://github.com/zohaib3249/scrapai/issues
Documentation: See GitHub README for detailed architecture and examples

📄 License

MIT License - See LICENSE file for details

🤝 Contributing

Contributions are welcome! Please see the GitHub repository for contribution guidelines.

👨‍💻 About the Author

Zohaib Yousaf - Full Stack Developer & Data Engineer

Passionate about building intelligent systems that automate complex workflows. ScrapAI combines my expertise in web scraping, AI integration, and data engineering to make data extraction accessible to everyone.

GitHub: @zohaib3249
Email: chzohaib136@gmail.com

Version: 0.6.0
Last Updated: November 2025

Release history

0.6.0 - Nov 5, 2025 (this version)
0.5.1 - Nov 4, 2025
0.5.0 - Nov 4, 2025
0.4.0 - Nov 4, 2025
0.2.1 - Nov 4, 2025

Download files

Source Distribution

scrapai-0.6.0.tar.gz (91.3 kB)

Uploaded Nov 5, 2025 Source

File details

Details for the file scrapai-0.6.0.tar.gz.

File metadata

Download URL: scrapai-0.6.0.tar.gz
Upload date: Nov 5, 2025
Size: 91.3 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for scrapai-0.6.0.tar.gz
Algorithm	Hash digest
SHA256	`03d3ea96cc8e9367ab51994e93f7a51c82ea84c44b335dbb83148eff7d714e9b`
MD5	`b7e476725980992cf568009d4ff7dda8`
BLAKE2b-256	`0a9ed912f9f6240e3501d73d588e77c8d4118de24f38e61df247bdc46a80d4be`
