AI-powered web scraping SDK with intelligent configuration generation

ScrapAI - AI-Powered Web Scraping Made Simple

Extract data from any website or API using natural language - no coding required!

ScrapAI uses artificial intelligence to understand what data you need and automatically extracts it from websites and APIs. Just describe what you want, and ScrapAI handles the rest.


✨ What Can You Achieve?

  • Extract structured data from websites and APIs using simple descriptions
  • Create reusable scraping configurations for repeated data collection
  • Get instant results with one-off data extraction (SmartScraper)
  • Automate data pipelines with scheduled scraping configurations
  • Support multiple AI services including OpenAI, Ollama, Anthropic, Grok, and more
  • No manual configuration - AI discovers APIs, tests paths, and creates optimal configs automatically

🚀 Quick Start

Installation

pip install scrapai

Option 1: Direct Data Extraction (SmartScraper)

Get structured data immediately without creating configuration files:

import asyncio
from scrapai import ScrapAIClient

async def main():
    client = ScrapAIClient(
        service_name="ollama",  # or "openai", "grok", "anthropic", etc.
        service_key="your-api-key",  # not needed for local Ollama
    )
    
    result = await client.smartscraper(
        url="https://example.com/data",
        description="Get product name, price, and rating"
    )
    
    if result["success"]:
        print(result["data"])  # Structured JSON output
    
    await client.close()

asyncio.run(main())

Output:

{
  "product_name": "Example Product",
  "price": 29.99,
  "rating": 4.5
}
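Before using the result, it can help to check it in one place. The following is a minimal sketch that assumes only the "success" and "data" keys shown above; unwrap_result is a hypothetical helper (not part of ScrapAI), and the "error" key it looks for is an assumption.

```python
from typing import Any


def unwrap_result(result: dict[str, Any]) -> Any:
    """Return result["data"] if the call succeeded, otherwise raise RuntimeError."""
    if not result.get("success"):
        # "error" is an assumed key; a generic message is used if it is absent.
        raise RuntimeError(result.get("error", "extraction failed"))
    return result["data"]


# Sample result shaped like the SmartScraper output above (not real data).
sample = {
    "success": True,
    "data": {"product_name": "Example Product", "price": 29.99, "rating": 4.5},
}
product = unwrap_result(sample)
print(product["product_name"], product["price"])
```

In a real script, the dictionary returned by client.smartscraper(...) would take the place of sample.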

Option 2: Create Reusable Configuration

Generate a reusable scraping configuration for repeated data collection:

import asyncio
from scrapai import ScrapAIClient

async def main():
    client = ScrapAIClient(
        service_name="ollama",
        service_key="your-api-key",
    )
    
    # AI creates the configuration automatically
    result = client.add_config(
        url="https://api.example.com/metrics",
        description="Get transaction count and total volume"
    )
    
    config_name = result["config_name"]
    
    # Execute the configuration
    data = await client.execute_config(config_name)
    
    # Results are in structured format under the "data" key
    if data["success"]:
        for item in data["data"]:
            print(f"{item['name']}: {item['metric']} = {item['value']}")
    
    await client.close()

asyncio.run(main())

Once created, you can run configurations anytime - perfect for scheduled jobs!

# Run existing configuration (no AI needed - already configured)
import asyncio
from scrapai import ScrapAIClient

async def scheduled_data_collection():
    client = ScrapAIClient(service_name="ollama", service_key="your-key")
    
    # Execute any existing configuration
    data = await client.execute_config("my_config_name")
    
    # Process the data (save to database, send alerts, etc.)
    if data.get("success"):
        for item in data["data"]:
            print(f"Collected: {item['name']} - {item['metric']} = {item['value']}")
    
    await client.close()

# Use with cron jobs, task schedulers, or automation tools
# This runs without AI - just executes the saved configuration
asyncio.run(scheduled_data_collection())
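The "save to database" step could look something like the sketch below. It uses only the standard library's sqlite3 module and assumes records carry the name, metric, value, date, and config_name fields described under Output Format. The save_records helper and the metrics table are hypothetical, not something ScrapAI provides.

```python
import sqlite3


def save_records(conn: sqlite3.Connection, records: list[dict]) -> int:
    """Insert records into a `metrics` table and return the number of rows written."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS metrics ("
        "name TEXT, metric TEXT, value REAL, date TEXT, config_name TEXT)"
    )
    rows = [
        (r["name"], r["metric"], r["value"], r.get("date"), r.get("config_name"))
        for r in records
    ]
    conn.executemany("INSERT INTO metrics VALUES (?, ?, ?, ?, ?)", rows)
    conn.commit()
    return len(rows)


# Illustrative record only; in a real job the list would come from data["data"].
conn = sqlite3.connect(":memory:")
written = save_records(conn, [
    {"name": "entity_name", "metric": "metric_name", "value": 12345,
     "date": "2024-01-15T10:30:00Z", "config_name": "my_config"},
])
print(f"Saved {written} record(s)")
```

Swapping ":memory:" for a file path keeps the collected rows between scheduled runs.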

📋 Use Cases

For Data Engineers

  • Rapidly create scraping configs for data pipelines
  • Automate data collection from multiple sources
  • Schedule recurring extractions - Run saved configurations anytime (cron jobs, task schedulers, etc.)
  • No AI calls needed for execution - configs run independently

For Analysts

  • Extract metrics from APIs and websites without coding
  • Get structured data ready for analysis
  • No need to learn XPath, CSS selectors, or API endpoints

For Developers

  • Integrate intelligent scraping into applications
  • Support multiple AI services with unified API
  • Handle complex pages with automatic fallback strategies

🔧 Supported AI Services

ScrapAI works with any OpenAI-compatible API:

  • OpenAI - GPT-4, GPT-3.5
  • Ollama - Local models (llama3, qwen, mistral, etc.)
  • Anthropic - Claude models
  • Grok - xAI's Grok
  • Google - Gemini models
  • Mistral AI - Mistral models
  • Custom Services - Any OpenAI-compatible endpoint

# Using OpenAI
client = ScrapAIClient(
    service_name="openai",
    service_key="sk-...",
    service_model="gpt-4"
)

# Using Ollama (local)
client = ScrapAIClient(
    service_name="ollama",
    service_key="not-needed",  # Local Ollama doesn't need key
    service_model="llama3:latest"
)

# Using custom service
client = ScrapAIClient(
    service_name="custom",
    service_key="your-key",
    service_base_url="https://your-api.com/v1",
    service_model="your-model"
)
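To avoid hard-coding keys, the constructor arguments shown above can be assembled from environment variables. This is a sketch under assumptions: the SCRAPAI_* variable names and the client_kwargs_from_env helper are hypothetical, and only the keyword names (service_name, service_key, service_model, service_base_url) come from the examples above.

```python
import os
from typing import Mapping


def client_kwargs_from_env(env: Mapping[str, str] = os.environ) -> dict[str, str]:
    """Build ScrapAIClient keyword arguments from SCRAPAI_* variables (hypothetical names)."""
    kwargs = {
        "service_name": env.get("SCRAPAI_SERVICE", "ollama"),
        # Local Ollama does not need a real key, so a placeholder is the default.
        "service_key": env.get("SCRAPAI_API_KEY", "not-needed"),
    }
    # Optional settings are passed only when explicitly configured.
    if "SCRAPAI_MODEL" in env:
        kwargs["service_model"] = env["SCRAPAI_MODEL"]
    if "SCRAPAI_BASE_URL" in env:
        kwargs["service_base_url"] = env["SCRAPAI_BASE_URL"]
    return kwargs


example = client_kwargs_from_env(
    {"SCRAPAI_SERVICE": "openai", "SCRAPAI_API_KEY": "sk-...", "SCRAPAI_MODEL": "gpt-4"}
)
print(example)
# client = ScrapAIClient(**example)  # requires scrapai to be installed
```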

💡 Key Features

Intelligent Resource Selection

  • API-first approach - Automatically discovers and uses APIs when available
  • HTML fallback - Falls back to HTML scraping if API fails
  • Multiple resources - Configures automatic fallback strategies
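The API-first, HTML-fallback idea can be pictured as an ordered list of resources that are tried until one succeeds. The sketch below is a generic illustration of that pattern, not ScrapAI's internal implementation; first_successful and both fetcher functions are hypothetical stand-ins.

```python
from typing import Any, Callable, Sequence

Fetcher = Callable[[], Any]


def first_successful(resources: Sequence[tuple[str, Fetcher]]) -> tuple[str, Any]:
    """Try each (label, fetcher) pair in order and return the first that succeeds."""
    errors: list[str] = []
    for label, fetch in resources:
        try:
            return label, fetch()
        except Exception as exc:  # any failure moves on to the next resource
            errors.append(f"{label}: {exc}")
    raise RuntimeError("all resources failed: " + "; ".join(errors))


def fetch_from_api() -> dict:
    raise ConnectionError("API unavailable")  # simulated API failure


def fetch_from_html() -> dict:
    return {"price": 29.99}  # stand-in for data parsed from HTML


source, data = first_successful([("api", fetch_from_api), ("html", fetch_from_html)])
print(source, data)  # prints: html {'price': 29.99}
```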

Automatic Configuration Generation

  • AI analyzes URLs and discovers APIs
  • Tests extraction paths before creating configs
  • Iteratively refines until config works correctly
  • Creates reusable configuration files
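To give a feel for what "testing extraction paths" might involve, here is a small sketch that checks candidate dotted paths against a sample JSON payload and keeps the ones that resolve. The path syntax, the helper names, and the sample payload are all assumptions made for illustration; ScrapAI's actual path format and testing logic may differ.

```python
from typing import Any


def extract_path(payload: Any, path: str) -> Any:
    """Follow a dotted path such as "data.items.0.count" through dicts and lists."""
    current = payload
    for part in path.split("."):
        if isinstance(current, list):
            current = current[int(part)]  # list segments are numeric indexes
        else:
            current = current[part]
    return current


def path_works(payload: Any, path: str) -> bool:
    """Return True if the path resolves to a value in the payload."""
    try:
        extract_path(payload, path)
        return True
    except (KeyError, IndexError, ValueError, TypeError):
        return False


sample = {"data": {"items": [{"count": 42}]}}
candidates = ["data.count", "data.items.0.count"]
working = [p for p in candidates if path_works(sample, p)]
print(working)  # prints: ['data.items.0.count']
```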

Production-Ready

  • Error handling and automatic retries
  • Proxy rotation support
  • Browser rendering for JavaScript-heavy pages
  • Structured data output with metadata
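Callers who want an extra layer of retries around their own calls can wrap any coroutine with exponential backoff. This is a caller-side sketch using only asyncio; with_retries and the flaky test coroutine are hypothetical and do not describe how ScrapAI retries internally.

```python
import asyncio
from typing import Any, Awaitable, Callable


async def with_retries(
    call: Callable[[], Awaitable[Any]],
    attempts: int = 3,
    base_delay: float = 0.5,
) -> Any:
    """Await call(); after a failure wait base_delay * 2**n seconds, then retry."""
    for attempt in range(attempts):
        try:
            return await call()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            await asyncio.sleep(base_delay * 2 ** attempt)


calls = {"count": 0}


async def flaky() -> dict:
    """Fail twice, then succeed, to simulate a temporary outage."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary failure")
    return {"success": True, "data": []}


outcome = asyncio.run(with_retries(flaky, base_delay=0.01))
print(outcome, calls["count"])
```

With a real client, the wrapped call could be something like lambda: client.execute_config("my_config_name").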

📖 Basic Usage

List Available Configurations

configs = client.list_configs()
print(configs)  # ['config1', 'config2', ...]

Execute a Configuration

result = await client.execute_config("config_name")
if result["success"]:
    for item in result["data"]:
        print(f"{item['name']}: {item['metric']} = {item['value']}")

Remove a Configuration

client.remove_config("config_name")

📊 Output Format

Configuration runs return structured records. The list under result["data"] has this shape:

[
    {
        "name": "entity_name",
        "metric": "metric_name",
        "value": 12345,
        "date": "2024-01-15T10:30:00Z",
        "config_name": "my_config"
    },
    ...
]
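Because every record shares the same fields, downstream processing stays simple. The sketch below totals values per metric and parses the timestamp; the two sample records are made up for illustration and follow the shape shown above.

```python
from collections import defaultdict
from datetime import datetime

# Illustrative records in the documented shape (not real data).
records = [
    {"name": "entity_a", "metric": "volume", "value": 100,
     "date": "2024-01-15T10:30:00Z", "config_name": "my_config"},
    {"name": "entity_b", "metric": "volume", "value": 250,
     "date": "2024-01-15T10:30:00Z", "config_name": "my_config"},
]

# Sum the values for each metric name.
totals: dict[str, float] = defaultdict(float)
for record in records:
    totals[record["metric"]] += record["value"]

# fromisoformat() only accepts a trailing "Z" on Python 3.11+, so normalise it first.
timestamp = datetime.fromisoformat(records[0]["date"].replace("Z", "+00:00"))

print(dict(totals), timestamp.isoformat())
```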


📄 License

MIT License - See LICENSE file for details


🤝 Contributing

Contributions are welcome! Please see the GitHub repository for contribution guidelines.


👨‍💻 About the Author

Zohaib Yousaf - Full Stack Developer & Data Engineer

Passionate about building intelligent systems that automate complex workflows. ScrapAI combines my expertise in web scraping, AI integration, and data engineering to make data extraction accessible to everyone.


Last Updated: December 2024
