
Pydantic Scrape

A modular AI-powered web scraping framework built on pydantic-ai and pydantic-graph for intelligent content extraction and research workflows.

What is Pydantic Scrape?

Pydantic Scrape is a framework for building intelligent web scraping workflows that combine:

  • AI-powered content extraction using pydantic-ai agents
  • Graph-based workflow orchestration with pydantic-graph
  • Type-safe dependency injection for modular, reusable components
  • Specialized content handlers for academic papers, articles, videos, and more

⚡ Quick Start: Search → Answer Workflow

Get comprehensive research answers in seconds with our streamlined search-to-answer pipeline:

from pydantic_scrape.graphs.search_answer import search_answer

# One line to research any topic
result = await search_answer(
    query="Ivermectin as a treatment for cancer",
    max_search_results=5
)

# Rich structured output with sources
print(f"✅ Found {result['processing_stats']['search_results']} sources")
print(f"📝 Answer: {result['answer']['answer']}")
print(f"💡 Key insights: {len(result['answer']['key_insights'])}")
print(f"📚 Sources: {len(result['answer']['sources'])}")

What it does:

  1. 🔍 Intelligent search - Finds relevant academic papers and articles
  2. 📄 Content synthesis - Combines multiple sources into comprehensive summaries
  3. 🎯 Answer generation - Creates structured answers with key insights and sources
  4. ⚡ Fast execution - Complete research workflow typically in under a minute

Core Architecture: Agents + Dependencies + Graphs

Pydantic Scrape follows a clean three-layer architecture:

🤖 Agents - AI-powered workers

# Intelligent search agent
from pydantic_scrape.agents.search import search_agent

# AI summarization agent  
from pydantic_scrape.agents.summarization import summarize_content

# Dynamic scraping agent
from pydantic_scrape.agents.bs4_scrape_script_agent import get_bs4_scrape_script_agent

🔧 Dependencies - Reusable components

# Content fetching with browser automation
from pydantic_scrape.dependencies.fetch import FetchDependency

# Academic API integrations
from pydantic_scrape.dependencies.openalex import OpenAlexDependency
from pydantic_scrape.dependencies.crossref import CrossrefDependency

# Content analysis and extraction
from pydantic_scrape.dependencies.content_analysis import ContentAnalysisDependency
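
These components are designed to be composed: a workflow declares the dependencies it needs as a typed container and receives them via injection, the same pattern the custom graph example below uses. A minimal sketch of that wiring (the zero-argument constructors are an assumption for illustration; real components may take configuration):

from dataclasses import dataclass

from pydantic_scrape.dependencies.content_analysis import ContentAnalysisDependency
from pydantic_scrape.dependencies.fetch import FetchDependency
from pydantic_scrape.dependencies.openalex import OpenAlexDependency

@dataclass
class PaperDeps:
    """Typed dependency container injected into graph nodes."""
    fetch: FetchDependency
    openalex: OpenAlexDependency
    analysis: ContentAnalysisDependency

# Hypothetical wiring; constructor arguments depend on each component
deps = PaperDeps(
    fetch=FetchDependency(),
    openalex=OpenAlexDependency(),
    analysis=ContentAnalysisDependency(),
)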

📊 Graphs - Workflow orchestration

# Fast search → answer workflow
from pydantic_scrape.graphs.search_answer import search_answer_graph

# Complete science paper extraction
from pydantic_scrape.graphs.science import science_graph

# Dynamic scraping workflows
from pydantic_scrape.graphs.dynamic_scrape import dynamic_scrape_graph

🔬 Example: AI Content Summarization

Create structured summaries from any content:

from pydantic_scrape.agents.summarization import summarize_content

# Single document
summary = await summarize_content(
    "Machine learning advances in 2024 have focused on efficiency and safety...",
    max_length=1000
)

print(f"Title: {summary.title}")
print(f"Summary: {summary.summary}")
print(f"Key findings: {summary.key_findings}")
print(f"Confidence: {summary.confidence_score}")

# Multiple documents (returns a single comprehensive summary)
combined_summary = await summarize_content([
    doc1, doc2, doc3  # List of content objects
])

🧩 Example: Custom Dependency

Build reusable components for specific content types:

from dataclasses import dataclass

from pydantic import BaseModel

class TwitterContent(BaseModel):
    tweet_text: str
    author: str
    likes: int
    retweets: int

@dataclass
class TwitterDependency:
    """Extract structured data from Twitter/X"""

    api_key: str  # @dataclass generates __init__(self, api_key) automatically

    async def extract_tweet_data(self, url: str) -> TwitterContent:
        # Custom extraction logic here
        ...
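
Once the extraction logic is implemented, a dependency like this can be exercised in isolation before wiring it into a graph. For example (a sketch using the class above; the API key and tweet URL are placeholders):

import asyncio

async def main():
    twitter = TwitterDependency(api_key="YOUR_API_KEY")
    tweet = await twitter.extract_tweet_data("https://x.com/user/status/123")
    print(f"@{tweet.author}: {tweet.tweet_text} ({tweet.likes} likes)")

asyncio.run(main())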

📈 Example: Custom Graph Workflow

Compose agents and dependencies into intelligent workflows:

from dataclasses import dataclass, field
from typing import Union

from pydantic_graph import BaseNode, End, Graph, GraphRunContext

@dataclass
class ResearchState:
    query: str
    sources_found: list = field(default_factory=list)
    summaries: list = field(default_factory=list)
    final_report: str = ""

@dataclass
class ResearchDeps:
    search: SearchDependency              # your search component
    summarizer: SummarizationDependency   # your summarization component

@dataclass
class SearchNode(BaseNode[ResearchState, ResearchDeps, dict]):
    async def run(
        self, ctx: GraphRunContext[ResearchState, ResearchDeps]
    ) -> Union["SummarizeNode", End[dict]]:
        sources = await ctx.deps.search.find_sources(ctx.state.query)
        if not sources:
            return End({"error": "No sources found"})

        ctx.state.sources_found = sources
        return SummarizeNode()

# SummarizeNode and ReportNode follow the same pattern; assemble the graph
research_graph = Graph(nodes=[SearchNode, SummarizeNode, ReportNode])
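
To execute the workflow, run the graph from its entry node with shared state and dependencies. A sketch, assuming the user-defined pieces above and pydantic-graph's standard run interface:

import asyncio

async def main():
    state = ResearchState(query="quantum computing advances")
    # Zero-argument constructors are illustrative; your components may take config
    deps = ResearchDeps(
        search=SearchDependency(),
        summarizer=SummarizationDependency(),
    )
    result = await research_graph.run(SearchNode(), state=state, deps=deps)
    # result.output carries the payload passed to End()
    print(result.output)

asyncio.run(main())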

๐Ÿ› ๏ธ Installation

Development Installation

# Clone the repository
git clone https://github.com/yourusername/pydantic-scrape.git
cd pydantic-scrape

# Install with development dependencies (using uv for speed)
uv pip install -e ".[dev]"
# or with pip
pip install -e ".[dev]"

# Set up environment variables
cp .env.example .env
# Add your API keys (OPENAI_API_KEY, etc.)

🧪 Comprehensive Testing & Validation

✅ ALL 4 CORE GRAPHS TESTED AND OPERATIONAL!

Run the complete test suite:

# Test all 4 graphs with real examples
python test_all_graphs.py

# Results: 4/4 graphs passing in ~90 seconds
# ✅ Search → Answer: Research workflow (32.9s)
# ✅ Dynamic AI Scraping: Extract from any site (12.4s)
# ✅ Complete Science Scraping: Full academic processing (20.0s)
# ✅ Search → Scrape → Answer: Advanced research pipeline (29.0s)

🎯 Framework Capabilities Demonstrated:

  • 🔍 Fast Research - Search academic sources and generate comprehensive answers
  • 🤖 AI Extraction - Dynamically extract structured data from any website using AI agents
  • 📄 Science Processing - Complete academic paper processing with metadata enrichment
  • 🔬 Deep Research - Advanced pipeline that searches, scrapes full content, and synthesizes answers

Quick Individual Tests

# Test search-answer workflow
python -c "
import asyncio
from pydantic_scrape.graphs.search_answer import search_answer

async def test():
    result = await search_answer('latest advances in quantum computing')
    print(f'Found {len(result[\"answer\"][\"sources\"])} sources')
    print(result['answer']['answer'][:200] + '...')

asyncio.run(test())
"

# Test summarization agent
python -c "
import asyncio
from pydantic_scrape.agents.summarization import summarize_content

async def test():
    summary = await summarize_content(
        'Artificial intelligence is transforming scientific research...'
    )
    print(f'Summary: {summary.summary}')

asyncio.run(test())
"

๐Ÿค Contributing - We Need Your Help!

We're building the future of intelligent web scraping and we want you to be part of it!

🎯 What We're Looking For

🤖 Agent Builders

Create specialized AI agents for:

  • Domain-specific extraction (legal docs, medical papers, financial reports)
  • Multi-modal content (image + text analysis, video transcription)
  • Real-time processing (news monitoring, social media tracking)
  • Quality assurance (fact-checking, source verification)

🔧 Dependency Developers

Build reusable components for:

  • API integrations (Google Scholar, PubMed, arXiv, GitHub, social platforms)
  • Content processors (PDF extraction, video analysis, image recognition)
  • Data enrichment (NLP analysis, metadata extraction, classification)
  • Storage & caching (vector databases, knowledge graphs, search indices)

📊 Graph Architects

Design intelligent workflows for:

  • Research pipelines (literature review, systematic analysis, meta-analysis)
  • Content monitoring (news tracking, social listening, trend analysis)
  • Knowledge extraction (entity recognition, relationship mapping, fact extraction)
  • Quality control (validation, verification, bias detection)

🚀 Getting Started as a Contributor

# 1. Fork and clone
git clone https://github.com/yourusername/pydantic-scrape.git
cd pydantic-scrape

# 2. Install development dependencies  
uv pip install -e ".[dev]"

# 3. Test current functionality
python test_search_answer.py  # Should work out of the box

# 4. Check the current structure
ls pydantic_scrape/agents/      # See existing agents
ls pydantic_scrape/dependencies/ # See existing dependencies  
ls pydantic_scrape/graphs/       # See existing graphs

# 5. Start building!

💡 Contribution Ideas

Easy wins for new contributors:

  • Add a new academic API (NASA ADS, bioRxiv, SSRN); a starter sketch follows this list
  • Create a social media dependency (Reddit, LinkedIn, Mastodon)
  • Build a specialized graph for a domain (legal research, patent analysis)
  • Add content format support (EPUB, Markdown, slides)
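
As a concrete starting point for the first idea above, a new academic API dependency can follow the same dataclass pattern as the built-in ones. A minimal sketch against bioRxiv's public details endpoint (the class, field, and method names are illustrative, not part of the framework):

from dataclasses import dataclass

import httpx
from pydantic import BaseModel

class BiorxivPaper(BaseModel):
    doi: str
    title: str
    abstract: str

@dataclass
class BiorxivDependency:
    """Fetch preprint metadata from the bioRxiv details API."""

    base_url: str = "https://api.biorxiv.org/details/biorxiv"

    async def fetch_paper(self, doi: str) -> BiorxivPaper:
        async with httpx.AsyncClient() as client:
            resp = await client.get(f"{self.base_url}/{doi}")
            resp.raise_for_status()
            # Matching records come back under the "collection" key
            record = resp.json()["collection"][0]
            return BiorxivPaper(
                doi=record["doi"],
                title=record["title"],
                abstract=record["abstract"],
            )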

Advanced challenges:

  • Multi-agent coordination for complex research tasks
  • Real-time streaming workflows with live updates
  • Advanced caching and optimization strategies
  • Cross-language content extraction and translation

🌟 Community & Support

📋 Core Dependencies

  • AI Framework: pydantic-ai - Type-safe AI agents with structured outputs
  • Workflow Engine: pydantic-graph - Graph-based workflow orchestration
  • Browser Automation: Camoufox - Undetectable browser automation
  • Content Processing: BeautifulSoup4, newspaper3k, pypdf
  • Academic APIs: Integration with OpenAlex, Crossref, arXiv

📄 License

MIT License - see LICENSE file for details.


🎉 Join Us!

Pydantic Scrape is more than a framework - it's a community building the next generation of intelligent web scraping tools. Whether you're a researcher, developer, data scientist, or domain expert, there's a place for you here.

Let's build something amazing together! 🚀
