
instructor-builtsimple

Structured research extraction from PubMed, ArXiv, and Wikipedia using Instructor and Pydantic.

Extract structured, validated data from research APIs using LLMs. Define your schema with Pydantic, and let Instructor handle the extraction.

Features

  • 🔬 Multi-source search: Query PubMed, ArXiv, and Wikipedia simultaneously
  • 📊 Structured output: Extract data into validated Pydantic models
  • 🧠 Research synthesis: Combine sources into comprehensive summaries
  • 📚 Topic analysis: Deep-dive into research topics with citations
  • ⚖️ Comparisons: Compare technologies, methods, or concepts
  • 🎯 Custom schemas: Define any Pydantic model for extraction

Installation

pip install instructor-builtsimple

For Anthropic Claude support:

pip install instructor-builtsimple[anthropic]

Quick Start

from instructor_builtsimple import ResearchClient

# Initialize client (uses OPENAI_API_KEY env var)
client = ResearchClient()

# Search PubMed and extract structured articles
articles = client.pubmed("CRISPR gene therapy", limit=5)
for article in articles:
    print(f"{article.title}")
    print(f"  Summary: {article.abstract_summary}")
    print(f"  Key findings: {article.key_findings}")

# Search ArXiv for ML papers
papers = client.arxiv("transformer attention mechanisms", limit=5)
for paper in papers:
    print(f"{paper.title} by {', '.join(paper.authors[:3])}")
    print(f"  Contribution: {paper.main_contribution}")

# Synthesize research from all sources
summary = client.synthesize("mRNA vaccine technology")
print(summary.executive_summary)
for finding in summary.key_findings:
    print(f"- {finding.finding} (confidence: {finding.confidence:.0%})")

Custom Extraction Schemas

Define your own Pydantic models to extract exactly what you need:

from pydantic import BaseModel, Field
from instructor_builtsimple import ResearchClient

class DrugInfo(BaseModel):
    """Custom schema for drug information extraction."""
    drug_names: list[str] = Field(description="Names of drugs mentioned")
    mechanisms: list[str] = Field(description="Mechanisms of action")
    conditions: list[str] = Field(description="Target medical conditions")
    side_effects: list[str] = Field(default_factory=list)

client = ResearchClient()

# Extract custom structured data
drug_data = client.extract(
    query="Parkinson's disease treatments",
    response_model=DrugInfo,
    sources=["pubmed"],
    limit=10,
)

print(f"Drugs: {drug_data.drug_names}")
print(f"Mechanisms: {drug_data.mechanisms}")

Research Synthesis

Combine multiple sources into comprehensive research summaries:

from instructor_builtsimple import ResearchClient

client = ResearchClient()

# Synthesize from all sources
summary = client.synthesize(
    query="quantum machine learning",
    limit=5,
    sources=["pubmed", "arxiv", "wikipedia"]
)

print(f"Executive Summary: {summary.executive_summary}")
print(f"\nKey Findings:")
for finding in summary.key_findings:
    print(f"  • {finding.finding}")
    print(f"    Confidence: {finding.confidence:.0%}")
    print(f"    Sources: {[s.identifier for s in finding.sources]}")

print(f"\nKnowledge Gaps: {summary.knowledge_gaps}")
print(f"Applications: {summary.practical_applications}")

Topic Analysis

Get deep analysis of research topics:

analysis = client.analyze("neural network interpretability")

print(f"Definition: {analysis.definition}")
print(f"Current State: {analysis.current_state}")
print(f"\nOpen Questions:")
for q in analysis.open_questions:
    print(f"  • {q}")
print(f"\nFuture Directions: {analysis.future_directions}")

Comparison Analysis

Compare technologies, methods, or concepts:

comparison = client.compare(
    items=["BERT", "GPT-4", "T5"],
    context_query="language model performance"
)

print(f"Similarities: {comparison.similarities}")
print(f"Differences: {comparison.differences}")
for item, strengths in comparison.strengths.items():
    print(f"{item} strengths: {strengths}")

Built-in Models

The package includes pre-built Pydantic models for common extraction patterns:

  • PubMedArticle: structured PubMed article with summary, findings, and methodology
  • ArxivPaper: ArXiv paper with authors, main contribution, and categories
  • WikipediaArticle: Wikipedia article with summary, key facts, and related topics
  • ResearchSummary: multi-source synthesis with key findings and citations
  • TopicAnalysis: deep topic analysis with history, current state, and future directions
  • ComparisonAnalysis: structured comparison of multiple items
  • Citation: citation reference with source, identifier, and URL
  • KeyFinding: research finding with confidence score and supporting citations
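Each of these is an ordinary Pydantic model, so extracted data is validated rather than returned as loose dicts. A minimal sketch of that behavior, using a stand-in Citation with the fields the list above describes (the real field names and import path inside the package may differ):

```python
from pydantic import BaseModel, ValidationError

# Stand-in mirroring the documented shape of the built-in Citation model
# (source, identifier, URL); the actual class ships inside the package.
class Citation(BaseModel):
    source: str
    identifier: str
    url: str = ""

# Well-formed data validates and becomes a typed object
ref = Citation(source="pubmed", identifier="38012345")
print(ref.source)  # pubmed

# Malformed data is rejected instead of silently accepted
try:
    Citation(source="arxiv")  # missing required 'identifier'
except ValidationError:
    print("validation failed")
```

The same mechanism applies to the built-in models: passing one as `response_model` (see the API Reference below) guarantees the LLM's output conforms to that schema.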

API Reference

ResearchClient

The main entry point for all operations:

from instructor_builtsimple import ResearchClient

client = ResearchClient(
    openai_client=None,      # Optional: provide your own OpenAI client
    api_config=None,         # Optional: custom API endpoints
    model="gpt-4o-mini",     # Model for extraction
)

# Source-specific searches
articles = client.pubmed(query, limit=5, response_model=None)
papers = client.arxiv(query, limit=5, response_model=None)
wiki = client.wikipedia(query, limit=5, category=None, response_model=None)

# Multi-source operations
summary = client.synthesize(query, limit=5, sources=None)
analysis = client.analyze(topic, limit=10, sources=None)
comparison = client.compare(items, context_query=None, limit=5)

# Custom extraction
result = client.extract(query, response_model, sources=None, limit=5)

Low-level API Access

For raw API access without LLM extraction:

from instructor_builtsimple.api import BuiltSimpleAPI

api = BuiltSimpleAPI()

# Raw API calls
pubmed_data = api.search_pubmed("cancer treatment", limit=10)
arxiv_data = api.search_arxiv("machine learning", limit=10)
wiki_data = api.search_wikipedia("artificial intelligence", limit=10)

# Search all sources
all_data = api.search_all("CRISPR", limit=5, sources=["pubmed", "arxiv"])

Configuration

Custom API Endpoints

from instructor_builtsimple.api import APIConfig
from instructor_builtsimple import ResearchClient

config = APIConfig(
    pubmed_url="https://pubmed.built-simple.ai",
    arxiv_url="https://arxiv.built-simple.ai",
    wikipedia_url="https://wikipedia.built-simple.ai",
    timeout=30.0,
)

client = ResearchClient(api_config=config)

Using Different Models

# Use GPT-4 for better extraction quality
client = ResearchClient(model="gpt-4o")

# Use a specific OpenAI client
from openai import OpenAI
custom_client = OpenAI(api_key="...", base_url="...")
client = ResearchClient(openai_client=custom_client)

Examples

See the examples/ directory for complete working examples:

  • basic_extraction.py - Simple extraction from each source
  • custom_extraction.py - Define custom Pydantic models
  • research_synthesis.py - Multi-source synthesis and analysis

Requirements

  • Python 3.9+
  • OpenAI API key (set OPENAI_API_KEY environment variable)
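The key is read from the environment, so set it before running any of the examples (the key value below is a placeholder):

```shell
# Make the key available to ResearchClient
export OPENAI_API_KEY="sk-..."
```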

Built-Simple Research APIs

This package uses the free Built-Simple research APIs:

  • PubMed: Biomedical and life sciences literature
  • ArXiv: Physics, mathematics, computer science preprints
  • Wikipedia: General knowledge encyclopedia

No API keys are required for the research APIs themselves; only your OpenAI key is needed for the LLM extraction step.

License

MIT License - see LICENSE for details.
