
Advanced web automation framework with AI-powered agents, Chawan terminal browser integration, and geographic search targeting


Pydantic Scrape

A modern web scraping framework that combines AI-powered content extraction with intelligent workflow orchestration. Built on pydantic-ai for reliable, type-safe scraping operations.

Why Pydantic Scrape?

Web scraping is complex. You need to handle dynamic content, extract meaningful information, and orchestrate multi-step workflows. Most tools force you to choose between simple scrapers and complex frameworks with steep learning curves.

Pydantic Scrape bridges this gap by providing:

  • AI-powered extraction - Let AI understand and extract what you need instead of writing brittle selectors
  • Type-safe workflows - Structured data with validation built-in
  • Academic research focus - First-class support for papers, citations, and research workflows
  • Browser automation - Handle JavaScript, authentication, and complex interactions seamlessly

Installation

1. Install Chawan Terminal Browser (Required)

Pydantic Scrape uses the Chawan terminal browser for advanced web automation and for rendering JavaScript-heavy sites.

macOS (Homebrew):

brew install chawan

Linux (from source):

# Install Nim compiler
curl https://nim-lang.org/choosenim/init.sh -sSf | sh
# Install Chawan
git clone https://git.sr.ht/~bptato/chawan
cd chawan && make && sudo make install
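Note that choosenim installs the Nim toolchain under ~/.nimble/bin, which is usually not on PATH in a fresh shell. If make cannot find nim, export the path first (a sketch; the exact location may differ on your system):

```shell
# choosenim places the nim binary under ~/.nimble/bin; make it visible to the build
export PATH="$HOME/.nimble/bin:$PATH"
nim --version  # should now resolve before running make
```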

Verify installation:

cha --version

2. Install Pydantic Scrape

# Standard installation
pip install pydantic-scrape

# With development tools (if contributing); quote the extras for shells like zsh
pip install "pydantic-scrape[dev]"

Quick Start

Get a comprehensive research answer in under 10 lines:

import asyncio
from pydantic_scrape.graphs.search_answer import search_answer

async def research():
    result = await search_answer(
        query="latest advances in quantum computing",
        max_search_results=5
    )
    
    print(f"Found {len(result['answer']['sources'])} sources")
    print(result['answer']['answer'])

asyncio.run(research())

This searches academic sources, extracts content, and generates a structured answer with citations - all automatically.

Core Features

🔍 Smart Content Extraction

from pydantic_scrape.dependencies.fetch import fetch_url

# Automatically handles JavaScript, selects best extraction method
content = await fetch_url("https://example.com/article")
print(content.title, content.text, content.metadata)

🤖 AI-Powered Scraping

from pydantic_scrape.agents.bs4_scrape_script_agent import get_bs4_scrape_script_agent

# AI writes the scraping code for you
# (run_sync is synchronous, so no await is needed here)
agent = get_bs4_scrape_script_agent()
result = agent.run_sync(
    "Extract product prices from this e-commerce page",
    html_content=page_html,
)

📚 Academic Research

from pydantic_scrape.dependencies.openalex import OpenAlexDependency

# Search papers by topic, author, or DOI
openalex = OpenAlexDependency()
papers = await openalex.search_papers("machine learning healthcare")

📄 Document Processing

from pydantic_scrape.dependencies.document import DocumentDependency

# Extract text from PDFs, Word docs, EPUBs
doc = DocumentDependency()
content = await doc.extract_text("research_paper.pdf")

🌐 Advanced Browser Automation

from pydantic_scrape.agents.search_and_browse import SearchAndBrowseAgent

# Intelligent search + browse with memory and geographic targeting
agent = SearchAndBrowseAgent()
result = await agent.run(
    "Find 5 cabinet refacing services in North West England with contact details"
)

# Automatically handles: cookie popups, JavaScript, geographic targeting, 
# content caching, parallel browsing, and form detection

Common Use Cases

  • Literature Reviews - Automatically search, extract, and summarize academic papers
  • Market Research - Monitor competitor content, pricing, and product updates
  • News Monitoring - Track mentions, trends, and breaking news across sources
  • Content Migration - Extract structured data from legacy systems or websites
  • Research Workflows - Build custom pipelines for domain-specific content extraction

Architecture

Pydantic Scrape organizes code into three layers:

  • Dependencies (pydantic_scrape.dependencies.*) - Reusable components for specific tasks
  • Agents (pydantic_scrape.agents.*) - AI-powered workers that make decisions
  • Graphs (pydantic_scrape.graphs.*) - Orchestrate multi-step workflows

This makes it easy to compose complex workflows from simple, tested components.
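To make the layering concrete, here is a hypothetical, dependency-free sketch of the pattern. The class and function names below are illustrative only, not part of the pydantic-scrape API:

```python
# Illustrative three-layer sketch: dependency -> agent -> graph.
class FetchDependency:
    """Dependency layer: a reusable, single-purpose component."""

    def fetch(self, url: str) -> str:
        return f"<html>content of {url}</html>"


class ExtractAgent:
    """Agent layer: makes decisions over dependency output."""

    def __init__(self, dep: FetchDependency) -> None:
        self.dep = dep

    def extract(self, url: str) -> str:
        html = self.dep.fetch(url)
        # stand-in for AI-powered extraction
        return html.removeprefix("<html>content of ").removesuffix("</html>")


def scrape_graph(url: str) -> dict:
    """Graph layer: orchestrates dependencies and agents into a workflow."""
    agent = ExtractAgent(FetchDependency())
    return {"url": url, "content": agent.extract(url)}
```

Because each layer only talks to the one below it, a dependency can be reused by many agents, and a graph can swap agents without touching fetch logic.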

Configuration

1. Set API Keys

Create a .env file in your project root:

# AI Providers (choose one or more)
OPENAI_API_KEY=your_openai_key
GOOGLE_GENAI_API_KEY=your_google_key  
ANTHROPIC_API_KEY=your_anthropic_key

# Google Search (for enhanced search capabilities)
GOOGLE_SEARCH_API_KEY=your_google_search_key
GOOGLE_SEARCH_ENGINE_ID=your_search_engine_id
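Python does not read .env files by itself (python-dotenv can populate os.environ from one). A small startup check, sketched below with the standard library only, helps fail fast when no provider key is configured; the helper names are illustrative:

```python
import os

# Variable names from the .env example above
AI_KEY_VARS = ("OPENAI_API_KEY", "GOOGLE_GENAI_API_KEY", "ANTHROPIC_API_KEY")


def configured_providers() -> list[str]:
    """Return the AI provider variables that are set and non-empty."""
    return [var for var in AI_KEY_VARS if os.environ.get(var)]


def require_ai_key() -> None:
    """Raise early, with a clear message, if no provider key is available."""
    if not configured_providers():
        raise RuntimeError(
            "No AI provider key found; set at least one of: " + ", ".join(AI_KEY_VARS)
        )
```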

2. Chawan Configuration

The package includes an optimized Chawan configuration in .chawan/config.toml that provides:

  • 7x faster web automation vs default settings
  • Cookie popup handling without JavaScript overhead
  • Content caching for instant subsequent operations
  • Geographic search targeting for accurate local results

No additional Chawan setup required - works out of the box!


Contributing

We welcome contributions! The framework is designed to be extensible - add new content sources, AI agents, or workflow patterns.

See CONTRIBUTING.md for development setup and guidelines.

License

MIT License - see LICENSE for details.


Ready to build intelligent scraping workflows? Start with pip install pydantic-scrape and try the examples above.
