Advanced web automation framework with AI-powered agents, Chawan terminal browser integration, and geographic search targeting

These details have not been verified by PyPI

Project links

Project description

Pydantic Scrape

A modern web scraping framework that combines AI-powered content extraction with intelligent workflow orchestration. Built on pydantic-ai for reliable, type-safe scraping operations.

Why Pydantic Scrape?

Web scraping is complex. You need to handle dynamic content, extract meaningful information, and orchestrate multi-step workflows. Most tools force you to choose between simple scrapers and complex frameworks with steep learning curves.

Pydantic Scrape bridges this gap by providing:

AI-powered extraction - Let AI understand and extract what you need instead of writing brittle selectors
Type-safe workflows - Structured data with validation built-in
Academic research focus - First-class support for papers, citations, and research workflows
Browser automation - Handle JavaScript, authentication, and complex interactions seamlessly

Installation

1. Install Chawan Terminal Browser (Required)

Pydantic Scrape uses Chawan for advanced web automation and JavaScript-heavy sites.

Option 1: Homebrew (macOS) - Stable Release

brew install chawan

Option 2: From Source - Latest Development Version

# Install Nim compiler (if not already installed)
brew install nim  # macOS
# OR for Linux: curl https://nim-lang.org/choosenim/init.sh -sSf | sh -s -- -y

# Build latest Chawan from source
git clone https://git.sr.ht/~bptato/chawan
cd chawan && make

# Install locally (recommended for development); bash users should write to ~/.bashrc instead of ~/.zshrc
mkdir -p ~/.local/bin ~/.local/libexec
cp target/release/bin/cha ~/.local/bin/
cp -r target/release/libexec/chawan ~/.local/libexec/
echo 'export PATH="$HOME/.local/bin:$PATH"' >> ~/.zshrc
source ~/.zshrc

Verify installation:

cha --version
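
The same check can be scripted from Python before a scraping run starts. This is a minimal sketch using only the standard library; the helper name `find_chawan` is hypothetical and is not part of Pydantic Scrape.

```python
import shutil
from typing import Optional


def find_chawan(binary: str = "cha") -> Optional[str]:
    """Return the absolute path of the Chawan binary, or None if it is not on PATH."""
    return shutil.which(binary)


if __name__ == "__main__":
    path = find_chawan()
    if path is None:
        print("Chawan ('cha') was not found on PATH; see the install steps above.")
    else:
        print(f"Using Chawan at {path}")
```

Failing early with a clear message is usually easier to debug than a subprocess error in the middle of a workflow.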

2. Install Pydantic Scrape

Lean Installation (Recommended)

# Core functionality only (~50MB)
pip install pydantic-scrape

Extended Installation with Optional Features

# Quote the extras so shells such as zsh do not treat the brackets as a glob pattern

# YouTube video processing
pip install "pydantic-scrape[youtube]"

# AI services (OpenAI, Google AI)
pip install "pydantic-scrape[ai]"

# Academic research (OpenAlex, CrossRef)
pip install "pydantic-scrape[academic]"

# Document processing (PDF, Word, eBooks)
pip install "pydantic-scrape[documents]"

# Advanced content extraction
pip install "pydantic-scrape[content]"

# Everything included (~300MB)
pip install "pydantic-scrape[all]"

# Development tools (if contributing)
pip install "pydantic-scrape[dev]"

Quick Start

Get a comprehensive research answer in about 10 lines:

import asyncio
from pydantic_scrape.graphs.search_answer import search_answer

async def research():
    result = await search_answer(
        query="latest advances in quantum computing",
        max_search_results=5
    )
    
    print(f"Found {len(result['answer']['sources'])} sources")
    print(result['answer']['answer'])

asyncio.run(research())

This searches academic sources, extracts content, and generates a structured answer with citations - all automatically.
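
The Quick Start snippet indexes the result as a nested dictionary, which raises `KeyError` if a key is missing. A small helper can make that access more defensive. The sketch below assumes only the two keys used in the snippet (`result['answer']['answer']` and `result['answer']['sources']`); the shape of each source entry is not documented here, so entries are simply converted with `str()`. The name `format_answer` is hypothetical.

```python
from typing import Any, Mapping


def format_answer(result: Mapping[str, Any]) -> str:
    """Render an answer and a numbered source list from a search_answer-style dict."""
    answer = result.get("answer") or {}
    text = answer.get("answer") or "(no answer returned)"
    sources = answer.get("sources") or []
    lines = [text, "", f"Sources ({len(sources)}):"]
    # The structure of each source is an assumption; str() keeps this generic.
    lines += [f"  [{i}] {source}" for i, source in enumerate(sources, start=1)]
    return "\n".join(lines)


if __name__ == "__main__":
    # Hand-written stand-in data, not real output from the library.
    sample = {"answer": {"answer": "Example answer.", "sources": ["https://example.com/a"]}}
    print(format_answer(sample))
```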

Core Features

🔍 Smart Content Extraction

import asyncio
from pydantic_scrape.dependencies.fetch import fetch_url

async def main():
    # Automatically handles JavaScript, selects best extraction method
    content = await fetch_url("https://example.com/article")
    print(content.title, content.text, content.metadata)

asyncio.run(main())
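
How `fetch_url` chooses an extraction method is not described here. One common way to get "best method wins" behavior is a fallback chain: try strategies in order and keep the first non-empty result. The sketch below illustrates that general pattern only; it is not the library's implementation, and `extract_with_fallback`, `article_only`, and `strip_tags` are hypothetical names built on the standard library.

```python
import re
from typing import Callable, Optional, Sequence

Extractor = Callable[[str], Optional[str]]


def article_only(html: str) -> Optional[str]:
    """Prefer the contents of an <article> element, if one exists."""
    match = re.search(r"<article[^>]*>(.*?)</article>", html, flags=re.S | re.I)
    return re.sub(r"<[^>]+>", " ", match.group(1)) if match else None


def strip_tags(html: str) -> Optional[str]:
    """Crude last resort: drop every tag and keep the remaining text."""
    return re.sub(r"<[^>]+>", " ", html)


def extract_with_fallback(html: str, extractors: Sequence[Extractor]) -> Optional[str]:
    """Return the first non-empty result, trying extractors in order."""
    for extractor in extractors:
        try:
            text = extractor(html)
        except Exception:
            # A failing strategy should not prevent later ones from running.
            continue
        if text and text.strip():
            return text.strip()
    return None


if __name__ == "__main__":
    page = "<nav>menu</nav><article><p>Main story</p></article>"
    print(extract_with_fallback(page, [article_only, strip_tags]))
```

Ordering the strategies from most to least specific keeps navigation and footer text out of the result whenever a more precise extractor succeeds.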

🤖 AI-Powered Scraping

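from pydantic_scrape.agents.bs4_scrape_script_agent import get_bs4_scrape_script_agent

# AI writes the scraping code for you
agent = get_bs4_scrape_script_agent()
# run_sync() is a blocking call and must not be awaited; inside an async
# function, use `await agent.run(...)` instead.
# page_html is the HTML string of the page you fetched earlier.
result = agent.run_sync("Extract product prices from this e-commerce page",
                        html_content=page_html)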

📚 Academic Research

import asyncio
from pydantic_scrape.dependencies.openalex import OpenAlexDependency

async def main():
    # Search papers by topic, author, or DOI
    openalex = OpenAlexDependency()
    papers = await openalex.search_papers("machine learning healthcare")
    print(papers)

asyncio.run(main())

📄 Document Processing

import asyncio
from pydantic_scrape.dependencies.document import DocumentDependency

async def main():
    # Extract text from PDFs, Word docs, EPUBs
    doc = DocumentDependency()
    content = await doc.extract_text("research_paper.pdf")
    print(content)

asyncio.run(main())

🌐 Advanced Browser Automation

import asyncio
from pydantic_scrape.agents.search_and_browse import SearchAndBrowseAgent

async def main():
    # Intelligent search + browse with memory and geographic targeting
    agent = SearchAndBrowseAgent()
    result = await agent.run(
        "Find 5 cabinet refacing services in North West England with contact details"
    )
    print(result)

asyncio.run(main())

# Automatically handles: cookie popups, JavaScript, geographic targeting,
# content caching, parallel browsing, and form detection
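
Two of the behaviors listed above, content caching and parallel browsing, can be illustrated with plain `asyncio`. The sketch below shows the general technique (a semaphore to bound concurrency, plus de-duplication so each URL is fetched once); it does not describe how `SearchAndBrowseAgent` works internally. `fetch_all` and `fake_fetch` are hypothetical names, and `fake_fetch` only simulates latency.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List


async def fetch_all(
    urls: List[str],
    fetch: Callable[[str], Awaitable[str]],
    max_concurrency: int = 3,
) -> Dict[str, str]:
    """Fetch each distinct URL once, with at most max_concurrency requests in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)
    cache: Dict[str, str] = {}

    async def fetch_one(url: str) -> None:
        async with semaphore:
            cache[url] = await fetch(url)

    # dict.fromkeys de-duplicates while preserving order, so repeats cost nothing.
    await asyncio.gather(*(fetch_one(u) for u in dict.fromkeys(urls)))
    return cache


async def fake_fetch(url: str) -> str:
    """Stand-in for a real page fetch; it only simulates latency."""
    await asyncio.sleep(0.01)
    return f"<html>{url}</html>"


if __name__ == "__main__":
    urls = ["https://example.com/a", "https://example.com/a", "https://example.com/b"]
    pages = asyncio.run(fetch_all(urls, fake_fetch))
    print(sorted(pages))
```

Bounding concurrency matters for scraping in particular, since unbounded parallel requests can overload the target site or trigger rate limiting.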

Common Use Cases

Literature Reviews - Automatically search, extract, and summarize academic papers
Market Research - Monitor competitor content, pricing, and product updates
News Monitoring - Track mentions, trends, and breaking news across sources
Content Migration - Extract structured data from legacy systems or websites
Research Workflows - Build custom pipelines for domain-specific content extraction

Architecture

Pydantic Scrape organizes code into three layers:

Dependencies (pydantic_scrape.dependencies.*) - Reusable components for specific tasks
Agents (pydantic_scrape.agents.*) - AI-powered workers that make decisions
Graphs (pydantic_scrape.graphs.*) - Orchestrate multi-step workflows

This makes it easy to compose complex workflows from simple, tested components.
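
The layering can be pictured with a toy example. The sketch below uses no Pydantic Scrape code at all: `FetchDependency`, `SummaryAgent`, and `summarize_graph` are hypothetical stand-ins that only show how a dependency (one narrow task), an agent (a decision step), and a graph (orchestration) might fit together.

```python
import asyncio
from dataclasses import dataclass
from typing import List


@dataclass
class Page:
    url: str
    text: str


class FetchDependency:
    """Dependency layer: one narrow, reusable task (here, a fake fetch)."""

    async def fetch(self, url: str) -> Page:
        return Page(url=url, text=f"content of {url}")


class SummaryAgent:
    """Agent layer: makes a decision about data (here, a trivial truncation)."""

    def summarize(self, page: Page, max_chars: int = 20) -> str:
        return page.text[:max_chars]


async def summarize_graph(urls: List[str]) -> List[str]:
    """Graph layer: orchestrates the steps, fetch then summarize, for each URL."""
    fetcher, agent = FetchDependency(), SummaryAgent()
    pages = await asyncio.gather(*(fetcher.fetch(u) for u in urls))
    return [agent.summarize(p) for p in pages]


if __name__ == "__main__":
    print(asyncio.run(summarize_graph(["https://example.com"])))
```

Because each layer depends only on the one beneath it, a dependency can be tested without an agent, and an agent can be tested with a fake dependency.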

Configuration

1. Set API Keys

Create a .env file in your project root:

# AI Providers (choose one or more)
OPENAI_API_KEY=your_openai_key
GOOGLE_GENAI_API_KEY=your_google_key  
ANTHROPIC_API_KEY=your_anthropic_key

# Google Search (for enhanced search capabilities)
GOOGLE_SEARCH_API_KEY=your_google_search_key
GOOGLE_SEARCH_ENGINE_ID=your_search_engine_id
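
Before running a workflow, it can help to confirm that at least one AI provider key is actually set. Whether Pydantic Scrape loads `.env` itself is not stated here, so the sketch below parses the file format shown above with the standard library only. `parse_env` and `configured_providers` are hypothetical helpers, and the parser handles only simple `KEY=value` lines.

```python
from typing import Dict, Iterable, List

# Key names taken from the .env example above.
AI_PROVIDER_KEYS = ("OPENAI_API_KEY", "GOOGLE_GENAI_API_KEY", "ANTHROPIC_API_KEY")


def parse_env(text: str) -> Dict[str, str]:
    """Parse simple KEY=value lines, skipping blanks and # comments."""
    env: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        env[key.strip()] = value.strip().strip('"').strip("'")
    return env


def configured_providers(
    env: Dict[str, str], keys: Iterable[str] = AI_PROVIDER_KEYS
) -> List[str]:
    """Return the provider key names that have a non-empty value."""
    return [k for k in keys if env.get(k)]


if __name__ == "__main__":
    sample = "# AI Providers\nOPENAI_API_KEY=your_openai_key\nANTHROPIC_API_KEY=\n"
    providers = configured_providers(parse_env(sample))
    print(providers or "No AI provider key is configured")
```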

2. Chawan Configuration

The package includes an optimized Chawan configuration in .chawan/config.toml that provides:

7x faster web automation vs default settings
Cookie popup handling without JavaScript overhead
Content caching for instant subsequent operations
Geographic search targeting for accurate local results

No additional Chawan setup required - works out of the box!

Documentation

Installation Guide - Detailed setup instructions
Examples - Working code samples for common tasks
API Reference - Full documentation

Contributing

We welcome contributions! The framework is designed to be extensible - add new content sources, AI agents, or workflow patterns.

See CONTRIBUTING.md for development setup and guidelines.

License

MIT License - see LICENSE for details.

Ready to build intelligent scraping workflows? Start with pip install pydantic-scrape and try the examples above.

Project details

Release history

0.2.2 (this version) - Jul 31, 2025
0.2.0 - Jul 31, 2025
0.1.2 - Jul 23, 2025
0.1.0 - Jul 23, 2025

Download files

Source Distribution

pydantic_scrape-0.2.2.tar.gz (151.2 kB)

Uploaded Jul 31, 2025 Source

Built Distribution

pydantic_scrape-0.2.2-py3-none-any.whl (178.3 kB)

Uploaded Jul 31, 2025 Python 3

File details

Details for the file pydantic_scrape-0.2.2.tar.gz.

File metadata

Download URL: pydantic_scrape-0.2.2.tar.gz
Upload date: Jul 31, 2025
Size: 151.2 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.11.9

File hashes

Hashes for pydantic_scrape-0.2.2.tar.gz
Algorithm	Hash digest
SHA256	`c39c3084e12fbff32376d5150f9687680866691dfb23bfb26d7292955d9c47a7`
MD5	`ce6bca3c7eab442bddd285944881c251`
BLAKE2b-256	`22571086e99f6420debc5ec238aed2059228f33f5fe367e066b0cc8035acebea`

File details

Details for the file pydantic_scrape-0.2.2-py3-none-any.whl.

File metadata

Download URL: pydantic_scrape-0.2.2-py3-none-any.whl
Upload date: Jul 31, 2025
Size: 178.3 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.11.9

File hashes

Hashes for pydantic_scrape-0.2.2-py3-none-any.whl
Algorithm	Hash digest
SHA256	`1c5c7cdc1fabed7342dfbc90f3a4581dd7f9f2b70d68f9174a9567550e23df31`
MD5	`0e6d46288548bae9d0143f3f20f5be0f`
BLAKE2b-256	`47ca34363e55d8555706739ee6d3e8d6dec7e56086e7c21f069d1ae15522f798`
