Pydantic Scrape

Advanced web automation framework with AI-powered agents, Chawan terminal browser integration, and geographic search targeting.
A modern web scraping framework that combines AI-powered content extraction with intelligent workflow orchestration. Built on pydantic-ai for reliable, type-safe scraping operations.
Why Pydantic Scrape?
Web scraping is complex: you need to handle dynamic content, extract meaningful information, and orchestrate multi-step workflows. Most tools force you to choose between simple scrapers and complex frameworks with steep learning curves.
Pydantic Scrape bridges this gap by providing:
- AI-powered extraction - Let AI understand and extract what you need instead of writing brittle selectors
- Type-safe workflows - Structured data with validation built-in
- Academic research focus - First-class support for papers, citations, and research workflows
- Browser automation - Handle JavaScript, authentication, and complex interactions seamlessly
Installation
1. Install Chawan Terminal Browser (Required)
Pydantic Scrape uses Chawan for advanced web automation and JavaScript-heavy sites.
macOS (Homebrew):

```bash
brew install chawan
```

Linux (from source):

```bash
# Install the Nim compiler
curl https://nim-lang.org/choosenim/init.sh -sSf | sh

# Build and install Chawan
git clone https://git.sr.ht/~bptato/chawan
cd chawan && make && sudo make install
```

Verify the installation:

```bash
cha --version
```
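If you script your setup, you can check for the Chawan binary before running any scrapes. This is a minimal sketch using only standard shell tools; the `cha` binary name comes from the verification step above:

```shell
# Report whether the Chawan binary is on PATH, without aborting the shell
if command -v cha >/dev/null 2>&1; then
    status="installed"
else
    status="missing"
fi
echo "chawan: $status"
```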
2. Install Pydantic Scrape
```bash
# Standard installation
pip install pydantic-scrape

# With development tools (if contributing); quoted so shells like zsh
# don't expand the brackets
pip install "pydantic-scrape[dev]"
```
Quick Start
Get a comprehensive research answer in under 10 lines:
```python
import asyncio

from pydantic_scrape.graphs.search_answer import search_answer


async def research():
    result = await search_answer(
        query="latest advances in quantum computing",
        max_search_results=5,
    )
    print(f"Found {len(result['answer']['sources'])} sources")
    print(result["answer"]["answer"])


asyncio.run(research())
```
This searches academic sources, extracts content, and generates a structured answer with citations - all automatically.
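The full result schema isn't reproduced here; as a rough sketch, the fields accessed in the Quick Start imply a shape like the following plain dict. Only the `answer` and `sources` keys come from the example above; the per-source fields are illustrative assumptions:

```python
# Illustrative shape of a search_answer result, inferred from the fields
# accessed in the Quick Start; per-source fields are assumptions.
result = {
    "answer": {
        "answer": "A synthesized, cited summary of the topic ...",
        "sources": [
            # one entry per extracted document (fields assumed)
            {"title": "Example paper", "url": "https://example.org/paper"},
        ],
    },
}

print(f"Found {len(result['answer']['sources'])} sources")
print(result["answer"]["answer"])
```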
Core Features
🔍 Smart Content Extraction
```python
from pydantic_scrape.dependencies.fetch import fetch_url

# Automatically handles JavaScript and selects the best extraction method
content = await fetch_url("https://example.com/article")
print(content.title, content.text, content.metadata)
```
🤖 AI-Powered Scraping
```python
from pydantic_scrape.agents.bs4_scrape_script_agent import get_bs4_scrape_script_agent

# AI writes the scraping code for you
agent = get_bs4_scrape_script_agent()
result = await agent.run_sync(
    "Extract product prices from this e-commerce page",
    html_content=page_html,
)
```
📚 Academic Research
```python
from pydantic_scrape.dependencies.openalex import OpenAlexDependency

# Search papers by topic, author, or DOI
openalex = OpenAlexDependency()
papers = await openalex.search_papers("machine learning healthcare")
```
📄 Document Processing
```python
from pydantic_scrape.dependencies.document import DocumentDependency

# Extract text from PDFs, Word docs, EPUBs
doc = DocumentDependency()
content = await doc.extract_text("research_paper.pdf")
```
🌐 Advanced Browser Automation
```python
from pydantic_scrape.agents.search_and_browse import SearchAndBrowseAgent

# Intelligent search + browse with memory and geographic targeting
agent = SearchAndBrowseAgent()
result = await agent.run(
    "Find 5 cabinet refacing services in North West England with contact details"
)
# Automatically handles: cookie popups, JavaScript, geographic targeting,
# content caching, parallel browsing, and form detection
```
Common Use Cases
- Literature Reviews - Automatically search, extract, and summarize academic papers
- Market Research - Monitor competitor content, pricing, and product updates
- News Monitoring - Track mentions, trends, and breaking news across sources
- Content Migration - Extract structured data from legacy systems or websites
- Research Workflows - Build custom pipelines for domain-specific content extraction
Architecture
Pydantic Scrape organizes code into three layers:
- Dependencies (`pydantic_scrape.dependencies.*`) - Reusable components for specific tasks
- Agents (`pydantic_scrape.agents.*`) - AI-powered workers that make decisions
- Graphs (`pydantic_scrape.graphs.*`) - Orchestrate multi-step workflows
This makes it easy to compose complex workflows from simple, tested components.
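As a hedged sketch of how the three layers compose, the classes below are illustrative stand-ins written in plain Python, not the package's real API:

```python
class FetchDependency:
    """Stand-in for a pydantic_scrape.dependencies.* component: one reusable task."""

    def fetch(self, url: str) -> str:
        # A real dependency would perform the HTTP fetch
        return f"<html>content of {url}</html>"


class ExtractAgent:
    """Stand-in for a pydantic_scrape.agents.* worker: decides using its dependency."""

    def __init__(self, fetcher: FetchDependency):
        self.fetcher = fetcher

    def run(self, url: str) -> dict:
        html = self.fetcher.fetch(url)
        return {"url": url, "chars": len(html)}


def scrape_graph(urls: list[str]) -> list[dict]:
    """Stand-in for a pydantic_scrape.graphs.* workflow: orchestrates the agent."""
    agent = ExtractAgent(FetchDependency())
    return [agent.run(u) for u in urls]


results = scrape_graph(["https://example.com"])
print(results)
```

The point of the layering is that the graph never touches HTTP directly: swap the dependency and the same workflow runs against a different content source.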
Configuration
1. Set API Keys
Create a .env file in your project root:
```bash
# AI providers (choose one or more)
OPENAI_API_KEY=your_openai_key
GOOGLE_GENAI_API_KEY=your_google_key
ANTHROPIC_API_KEY=your_anthropic_key

# Google Search (for enhanced search capabilities)
GOOGLE_SEARCH_API_KEY=your_google_search_key
GOOGLE_SEARCH_ENGINE_ID=your_search_engine_id
```
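The framework reads these keys itself; purely as an illustrative preflight check (the function name here is hypothetical, not part of the package), you can verify which providers are configured before running a workflow:

```python
import os

# Provider key names from the .env example above
PROVIDER_KEYS = ["OPENAI_API_KEY", "GOOGLE_GENAI_API_KEY", "ANTHROPIC_API_KEY"]


def configured_providers() -> list[str]:
    """Return the AI provider keys that are set in the environment."""
    return [name for name in PROVIDER_KEYS if os.environ.get(name)]


print("Configured AI providers:", configured_providers() or "none")
```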
2. Chawan Configuration
The package includes an optimized Chawan configuration in `.chawan/config.toml` that provides:
- 7x faster web automation vs default settings
- Cookie popup handling without JavaScript overhead
- Content caching for instant subsequent operations
- Geographic search targeting for accurate local results
No additional Chawan setup required - works out of the box!
Documentation
- Installation Guide - Detailed setup instructions
- Examples - Working code samples for common tasks
- API Reference - Full documentation
Contributing
We welcome contributions! The framework is designed to be extensible - add new content sources, AI agents, or workflow patterns.
See CONTRIBUTING.md for development setup and guidelines.
License
MIT License - see LICENSE for details.
Ready to build intelligent scraping workflows? Start with `pip install pydantic-scrape` and try the examples above.
File details
Details for the file pydantic_scrape-0.2.0.tar.gz.

File metadata
- Download URL: pydantic_scrape-0.2.0.tar.gz
- Size: 149.4 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.11.9

File hashes

| Algorithm | Hash digest |
|---|---|
| SHA256 | 52e752069b632da629541e39cf400a48d5e8ec604e1c736dad9d3dbf93f28b22 |
| MD5 | a398398081e8e8d4487eb65408da00e6 |
| BLAKE2b-256 | b9fb6c6234392c5e275e0f7bb8defbbc99efb6c371e0f464672f6b6540907663 |
File details
Details for the file pydantic_scrape-0.2.0-py3-none-any.whl.

File metadata
- Download URL: pydantic_scrape-0.2.0-py3-none-any.whl
- Size: 177.0 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.11.9

File hashes

| Algorithm | Hash digest |
|---|---|
| SHA256 | 06d0c33fc06a229272dd308726cf0e269672d1f1ac356c2612491c5440a5d969 |
| MD5 | 32e34f4e52943a95c99f9540bdf5b38d |
| BLAKE2b-256 | 6750a54e7e173b92c97f7fd38fb3db6daff570306be07a3a6b263131b1b98081 |