
Pydantic Scrape

A modular AI-powered web scraping framework built on pydantic-ai and pydantic-graph for intelligent content extraction and research workflows.

What is Pydantic Scrape?

Pydantic Scrape is a framework for building intelligent web scraping workflows that combine:

  • AI-powered content extraction using pydantic-ai agents
  • Graph-based workflow orchestration with pydantic-graph
  • Type-safe dependency injection for modular, reusable components
  • Specialized content handlers for academic papers, articles, videos, and more

⚡ Quick Start: Search → Answer Workflow

Get comprehensive research answers in seconds with our streamlined search-to-answer pipeline:

from pydantic_scrape.graphs.search_answer import search_answer

# One line to research any topic
result = await search_answer(
    query="Ivermectin as a treatment for cancer",
    max_search_results=5
)

# Rich structured output with sources
print(f"✅ Found {result['processing_stats']['search_results']} sources")
print(f"📝 Answer: {result['answer']['answer']}")
print(f"💡 Key insights: {len(result['answer']['key_insights'])}")
print(f"📚 Sources: {len(result['answer']['sources'])}")

What it does:

  1. 🔍 Intelligent search - Finds relevant academic papers and articles
  2. 📄 Content synthesis - Combines multiple sources into comprehensive summaries
  3. 🎯 Answer generation - Creates structured answers with key insights and sources
  4. ⚡ Fast execution - Complete research workflow typically in under a minute

Core Architecture: Agents + Dependencies + Graphs

Pydantic Scrape follows a clean three-layer architecture:

🤖 Agents - AI-powered workers

# Intelligent search agent
from pydantic_scrape.agents.search import search_agent

# AI summarization agent  
from pydantic_scrape.agents.summarization import summarize_content

# Dynamic scraping agent
from pydantic_scrape.agents.bs4_scrape_script_agent import get_bs4_scrape_script_agent

🔧 Dependencies - Reusable components

# Content fetching with browser automation
from pydantic_scrape.dependencies.fetch import FetchDependency

# Academic API integrations
from pydantic_scrape.dependencies.openalex import OpenAlexDependency
from pydantic_scrape.dependencies.crossref import CrossrefDependency

# Content analysis and extraction
from pydantic_scrape.dependencies.content_analysis import ContentAnalysisDependency
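
These components are designed to be composed: a workflow declares the dependencies it needs as a typed container and receives them via injection, the same pattern the custom graph example below uses. A minimal sketch of that wiring (the zero-argument constructors are an assumption for illustration; real components may take configuration):

from dataclasses import dataclass

from pydantic_scrape.dependencies.content_analysis import ContentAnalysisDependency
from pydantic_scrape.dependencies.fetch import FetchDependency
from pydantic_scrape.dependencies.openalex import OpenAlexDependency

@dataclass
class PaperDeps:
    """Typed dependency container injected into graph nodes."""
    fetch: FetchDependency
    openalex: OpenAlexDependency
    analysis: ContentAnalysisDependency

# Hypothetical wiring; constructor arguments depend on each component
deps = PaperDeps(
    fetch=FetchDependency(),
    openalex=OpenAlexDependency(),
    analysis=ContentAnalysisDependency(),
)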

📊 Graphs - Workflow orchestration

# Fast search → answer workflow
from pydantic_scrape.graphs.search_answer import search_answer_graph

# Complete science paper extraction
from pydantic_scrape.graphs.science import science_graph

# Dynamic scraping workflows
from pydantic_scrape.graphs.dynamic_scrape import dynamic_scrape_graph

🔬 Example: AI Content Summarization

Create structured summaries from any content:

from pydantic_scrape.agents.summarization import summarize_content

# Single document
summary = await summarize_content(
    "Machine learning advances in 2024 have focused on efficiency and safety...",
    max_length=1000
)

print(f"Title: {summary.title}")
print(f"Summary: {summary.summary}")
print(f"Key findings: {summary.key_findings}")
print(f"Confidence: {summary.confidence_score}")

# Multiple documents (returns a single comprehensive summary)
combined_summary = await summarize_content([
    doc1, doc2, doc3  # List of content objects
])

🧩 Example: Custom Dependency

Build reusable components for specific content types:

from dataclasses import dataclass

from pydantic import BaseModel

class TwitterContent(BaseModel):
    tweet_text: str
    author: str
    likes: int
    retweets: int

@dataclass
class TwitterDependency:
    """Extract structured data from Twitter/X"""

    api_key: str  # @dataclass generates __init__(self, api_key) automatically

    async def extract_tweet_data(self, url: str) -> TwitterContent:
        # Custom extraction logic here
        ...
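
Once the extraction logic is implemented, a dependency like this can be exercised in isolation before wiring it into a graph. For example (a sketch using the class above; the API key and tweet URL are placeholders):

import asyncio

async def main():
    twitter = TwitterDependency(api_key="YOUR_API_KEY")
    tweet = await twitter.extract_tweet_data("https://x.com/user/status/123")
    print(f"@{tweet.author}: {tweet.tweet_text} ({tweet.likes} likes)")

asyncio.run(main())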

📈 Example: Custom Graph Workflow

Compose agents and dependencies into intelligent workflows:

from dataclasses import dataclass, field
from typing import Union

from pydantic_graph import BaseNode, End, Graph, GraphRunContext

@dataclass
class ResearchState:
    query: str
    sources_found: list = field(default_factory=list)
    summaries: list = field(default_factory=list)
    final_report: str = ""

@dataclass
class ResearchDeps:
    search: SearchDependency              # your search component
    summarizer: SummarizationDependency   # your summarization component

@dataclass
class SearchNode(BaseNode[ResearchState, ResearchDeps, dict]):
    async def run(
        self, ctx: GraphRunContext[ResearchState, ResearchDeps]
    ) -> Union["SummarizeNode", End[dict]]:
        sources = await ctx.deps.search.find_sources(ctx.state.query)
        if not sources:
            return End({"error": "No sources found"})

        ctx.state.sources_found = sources
        return SummarizeNode()

# SummarizeNode and ReportNode follow the same pattern; assemble the graph
research_graph = Graph(nodes=[SearchNode, SummarizeNode, ReportNode])
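
To execute the workflow, run the graph from its entry node with shared state and dependencies. A sketch, assuming the user-defined pieces above and pydantic-graph's standard run interface:

import asyncio

async def main():
    state = ResearchState(query="quantum computing advances")
    # Zero-argument constructors are illustrative; your components may take config
    deps = ResearchDeps(
        search=SearchDependency(),
        summarizer=SummarizationDependency(),
    )
    result = await research_graph.run(SearchNode(), state=state, deps=deps)
    # result.output carries the payload passed to End()
    print(result.output)

asyncio.run(main())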

๐Ÿ› ๏ธ Installation

Development Installation

# Clone the repository
git clone https://github.com/yourusername/pydantic-scrape.git
cd pydantic-scrape

# Install with development dependencies (using uv for speed)
uv pip install -e ".[dev]"
# or with pip
pip install -e ".[dev]"

# Set up environment variables
cp .env.example .env
# Add your API keys (OPENAI_API_KEY, etc.)

🧪 Comprehensive Testing & Validation

✅ ALL 4 CORE GRAPHS TESTED AND OPERATIONAL!

Run the complete test suite:

# Test all 4 graphs with real examples
python test_all_graphs.py

# Results: 4/4 graphs passing in ~90 seconds
# ✅ Search → Answer: Research workflow (32.9s)
# ✅ Dynamic AI Scraping: Extract from any site (12.4s)
# ✅ Complete Science Scraping: Full academic processing (20.0s)
# ✅ Search → Scrape → Answer: Advanced research pipeline (29.0s)

🎯 Framework Capabilities Demonstrated:

  • 🔍 Fast Research - Search academic sources and generate comprehensive answers
  • 🤖 AI Extraction - Dynamically extract structured data from any website using AI agents
  • 📄 Science Processing - Complete academic paper processing with metadata enrichment
  • 🔬 Deep Research - Advanced pipeline that searches, scrapes full content, and synthesizes answers

Quick Individual Tests

# Test search-answer workflow
python -c "
import asyncio
from pydantic_scrape.graphs.search_answer import search_answer

async def test():
    result = await search_answer('latest advances in quantum computing')
    print(f'Found {len(result[\"answer\"][\"sources\"])} sources')
    print(result['answer']['answer'][:200] + '...')

asyncio.run(test())
"

# Test summarization agent
python -c "
import asyncio
from pydantic_scrape.agents.summarization import summarize_content

async def test():
    summary = await summarize_content(
        'Artificial intelligence is transforming scientific research...'
    )
    print(f'Summary: {summary.summary}')

asyncio.run(test())
"

๐Ÿค Contributing - We Need Your Help!

We're building the future of intelligent web scraping and we want you to be part of it!

🎯 What We're Looking For

🤖 Agent Builders

Create specialized AI agents for:

  • Domain-specific extraction (legal docs, medical papers, financial reports)
  • Multi-modal content (image + text analysis, video transcription)
  • Real-time processing (news monitoring, social media tracking)
  • Quality assurance (fact-checking, source verification)

🔧 Dependency Developers

Build reusable components for:

  • API integrations (Google Scholar, PubMed, arXiv, GitHub, social platforms)
  • Content processors (PDF extraction, video analysis, image recognition)
  • Data enrichment (NLP analysis, metadata extraction, classification)
  • Storage & caching (vector databases, knowledge graphs, search indices)

📊 Graph Architects

Design intelligent workflows for:

  • Research pipelines (literature review, systematic analysis, meta-analysis)
  • Content monitoring (news tracking, social listening, trend analysis)
  • Knowledge extraction (entity recognition, relationship mapping, fact extraction)
  • Quality control (validation, verification, bias detection)

🚀 Getting Started as a Contributor

# 1. Fork and clone
git clone https://github.com/yourusername/pydantic-scrape.git
cd pydantic-scrape

# 2. Install development dependencies  
uv pip install -e ".[dev]"

# 3. Test current functionality
python test_search_answer.py  # Should work out of the box

# 4. Check the current structure
ls pydantic_scrape/agents/      # See existing agents
ls pydantic_scrape/dependencies/ # See existing dependencies  
ls pydantic_scrape/graphs/       # See existing graphs

# 5. Start building!

💡 Contribution Ideas

Easy wins for new contributors:

  • Add a new academic API (NASA ADS, bioRxiv, SSRN); a starter sketch follows this list
  • Create a social media dependency (Reddit, LinkedIn, Mastodon)
  • Build a specialized graph for a domain (legal research, patent analysis)
  • Add content format support (EPUB, Markdown, slides)
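
As a concrete starting point for the first idea above, a new academic API dependency can follow the same dataclass pattern as the built-in ones. A minimal sketch against bioRxiv's public details endpoint (the class, field, and method names are illustrative, not part of the framework):

from dataclasses import dataclass

import httpx
from pydantic import BaseModel

class BiorxivPaper(BaseModel):
    doi: str
    title: str
    abstract: str

@dataclass
class BiorxivDependency:
    """Fetch preprint metadata from the bioRxiv details API."""

    base_url: str = "https://api.biorxiv.org/details/biorxiv"

    async def fetch_paper(self, doi: str) -> BiorxivPaper:
        async with httpx.AsyncClient() as client:
            resp = await client.get(f"{self.base_url}/{doi}")
            resp.raise_for_status()
            # Matching records come back under the "collection" key
            record = resp.json()["collection"][0]
            return BiorxivPaper(
                doi=record["doi"],
                title=record["title"],
                abstract=record["abstract"],
            )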

Advanced challenges:

  • Multi-agent coordination for complex research tasks
  • Real-time streaming workflows with live updates
  • Advanced caching and optimization strategies
  • Cross-language content extraction and translation

🌟 Community & Support

📋 Core Dependencies

  • AI Framework: pydantic-ai - Type-safe AI agents with structured outputs
  • Workflow Engine: pydantic-graph - Graph-based workflow orchestration
  • Browser Automation: Camoufox - Undetectable browser automation
  • Content Processing: BeautifulSoup4, newspaper3k, pypdf
  • Academic APIs: Integration with OpenAlex, Crossref, arXiv

📄 License

MIT License - see LICENSE file for details.


🎉 Join Us!

Pydantic Scrape is more than a framework - it's a community building the next generation of intelligent web scraping tools. Whether you're a researcher, developer, data scientist, or domain expert, there's a place for you here.

Let's build something amazing together! 🚀
