
instructor-builtsimple

Structured research extraction from PubMed, ArXiv, and Wikipedia using Instructor and Pydantic.

Extract structured, validated data from research APIs using LLMs. Define your schema with Pydantic, and let Instructor handle the extraction.

Features

  • 🔬 Multi-source search: Query PubMed, ArXiv, and Wikipedia simultaneously
  • 📊 Structured output: Extract data into validated Pydantic models
  • 🧠 Research synthesis: Combine sources into comprehensive summaries
  • 📚 Topic analysis: Deep-dive into research topics with citations
  • ⚖️ Comparisons: Compare technologies, methods, or concepts
  • 🎯 Custom schemas: Define any Pydantic model for extraction

Installation

pip install instructor-builtsimple

For Anthropic Claude support:

pip install instructor-builtsimple[anthropic]

Quick Start

from instructor_builtsimple import ResearchClient

# Initialize client (uses OPENAI_API_KEY env var)
client = ResearchClient()

# Search PubMed and extract structured articles
articles = client.pubmed("CRISPR gene therapy", limit=5)
for article in articles:
    print(f"{article.title}")
    print(f"  Summary: {article.abstract_summary}")
    print(f"  Key findings: {article.key_findings}")

# Search ArXiv for ML papers
papers = client.arxiv("transformer attention mechanisms", limit=5)
for paper in papers:
    print(f"{paper.title} by {', '.join(paper.authors[:3])}")
    print(f"  Contribution: {paper.main_contribution}")

# Synthesize research from all sources
summary = client.synthesize("mRNA vaccine technology")
print(summary.executive_summary)
for finding in summary.key_findings:
    print(f"- {finding.finding} (confidence: {finding.confidence:.0%})")

Custom Extraction Schemas

Define your own Pydantic models to extract exactly what you need:

from pydantic import BaseModel, Field
from instructor_builtsimple import ResearchClient

class DrugInfo(BaseModel):
    """Custom schema for drug information extraction."""
    drug_names: list[str] = Field(description="Names of drugs mentioned")
    mechanisms: list[str] = Field(description="Mechanisms of action")
    conditions: list[str] = Field(description="Target medical conditions")
    side_effects: list[str] = Field(default_factory=list)

client = ResearchClient()

# Extract custom structured data
drug_data = client.extract(
    query="Parkinson's disease treatments",
    response_model=DrugInfo,
    sources=["pubmed"],
    limit=10,
)

print(f"Drugs: {drug_data.drug_names}")
print(f"Mechanisms: {drug_data.mechanisms}")

Research Synthesis

Combine multiple sources into comprehensive research summaries:

from instructor_builtsimple import ResearchClient

client = ResearchClient()

# Synthesize from all sources
summary = client.synthesize(
    query="quantum machine learning",
    limit=5,
    sources=["pubmed", "arxiv", "wikipedia"]
)

print(f"Executive Summary: {summary.executive_summary}")
print(f"\nKey Findings:")
for finding in summary.key_findings:
    print(f"  • {finding.finding}")
    print(f"    Confidence: {finding.confidence:.0%}")
    print(f"    Sources: {[s.identifier for s in finding.sources]}")

print(f"\nKnowledge Gaps: {summary.knowledge_gaps}")
print(f"Applications: {summary.practical_applications}")

Topic Analysis

Get deep analysis of research topics:

analysis = client.analyze("neural network interpretability")

print(f"Definition: {analysis.definition}")
print(f"Current State: {analysis.current_state}")
print(f"\nOpen Questions:")
for q in analysis.open_questions:
    print(f"  • {q}")
print(f"\nFuture Directions: {analysis.future_directions}")

Comparison Analysis

Compare technologies, methods, or concepts:

comparison = client.compare(
    items=["BERT", "GPT-4", "T5"],
    context_query="language model performance"
)

print(f"Similarities: {comparison.similarities}")
print(f"Differences: {comparison.differences}")
for item, strengths in comparison.strengths.items():
    print(f"{item} strengths: {strengths}")

Built-in Models

The package includes pre-built Pydantic models for common extraction patterns:

  • PubMedArticle: structured PubMed article with summary, findings, and methodology
  • ArxivPaper: ArXiv paper with authors, main contribution, and categories
  • WikipediaArticle: Wikipedia article with summary, key facts, and related topics
  • ResearchSummary: multi-source synthesis with key findings and citations
  • TopicAnalysis: deep topic analysis with history, current state, and future directions
  • ComparisonAnalysis: structured comparison of multiple items
  • Citation: citation reference with source, identifier, and URL
  • KeyFinding: research finding with confidence score and supporting citations
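Each of these is an ordinary Pydantic model, so extracted data is validated rather than returned as loose dicts. A minimal sketch of that behavior, using a stand-in Citation with the fields the list above describes (the real field names and import path inside the package may differ):

```python
from pydantic import BaseModel, ValidationError

# Stand-in mirroring the documented shape of the built-in Citation model
# (source, identifier, URL); the actual class ships inside the package.
class Citation(BaseModel):
    source: str
    identifier: str
    url: str = ""

# Well-formed data validates and becomes a typed object
ref = Citation(source="pubmed", identifier="38012345")
print(ref.source)  # pubmed

# Malformed data is rejected instead of silently accepted
try:
    Citation(source="arxiv")  # missing required 'identifier'
except ValidationError:
    print("validation failed")
```

The same mechanism applies to the built-in models: passing one as `response_model` (see the API Reference below) guarantees the LLM's output conforms to that schema.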

API Reference

ResearchClient

The main entry point for all operations:

from instructor_builtsimple import ResearchClient

client = ResearchClient(
    openai_client=None,      # Optional: provide your own OpenAI client
    api_config=None,         # Optional: custom API endpoints
    model="gpt-4o-mini",     # Model for extraction
)

# Source-specific searches
articles = client.pubmed(query, limit=5, response_model=None)
papers = client.arxiv(query, limit=5, response_model=None)
wiki = client.wikipedia(query, limit=5, category=None, response_model=None)

# Multi-source operations
summary = client.synthesize(query, limit=5, sources=None)
analysis = client.analyze(topic, limit=10, sources=None)
comparison = client.compare(items, context_query=None, limit=5)

# Custom extraction
result = client.extract(query, response_model, sources=None, limit=5)

Low-level API Access

For raw API access without LLM extraction:

from instructor_builtsimple.api import BuiltSimpleAPI

api = BuiltSimpleAPI()

# Raw API calls
pubmed_data = api.search_pubmed("cancer treatment", limit=10)
arxiv_data = api.search_arxiv("machine learning", limit=10)
wiki_data = api.search_wikipedia("artificial intelligence", limit=10)

# Search all sources
all_data = api.search_all("CRISPR", limit=5, sources=["pubmed", "arxiv"])

Configuration

Custom API Endpoints

from instructor_builtsimple.api import APIConfig
from instructor_builtsimple import ResearchClient

config = APIConfig(
    pubmed_url="https://pubmed.built-simple.ai",
    arxiv_url="https://arxiv.built-simple.ai",
    wikipedia_url="https://wikipedia.built-simple.ai",
    timeout=30.0,
)

client = ResearchClient(api_config=config)

Using Different Models

# Use GPT-4 for better extraction quality
client = ResearchClient(model="gpt-4o")

# Use a specific OpenAI client
from openai import OpenAI
custom_client = OpenAI(api_key="...", base_url="...")
client = ResearchClient(openai_client=custom_client)

Examples

See the examples/ directory for complete working examples:

  • basic_extraction.py - Simple extraction from each source
  • custom_extraction.py - Define custom Pydantic models
  • research_synthesis.py - Multi-source synthesis and analysis

Requirements

  • Python 3.9+
  • OpenAI API key (set OPENAI_API_KEY environment variable)
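The key is read from the environment, so set it before running any of the examples (the key value below is a placeholder):

```shell
# Make the key available to ResearchClient
export OPENAI_API_KEY="sk-..."
```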

Built-Simple Research APIs

This package uses the free Built-Simple research APIs:

  • PubMed: Biomedical and life sciences literature
  • ArXiv: Physics, mathematics, computer science preprints
  • Wikipedia: General knowledge encyclopedia

No API keys are required for the research APIs themselves; only your OpenAI key is needed for the LLM extraction step.

License

MIT License - see LICENSE for details.
