# dspy-builtsimple

DSPy retriever modules for Built-Simple research APIs. Search millions of scientific papers from PubMed, ArXiv, and Wikipedia using GPU-accelerated semantic search.
## Features
- 🔬 PubMed: 4.5M+ biomedical articles with hybrid semantic + keyword search
- 📚 ArXiv: 2.7M+ preprints in physics, math, CS, and ML
- 📖 Wikipedia: 4.8M+ articles with GPU-accelerated embeddings
- ⚡ Fast: Sub-second search powered by FAISS on GPU
- 🔌 Native DSPy: Drop-in retriever modules for RAG pipelines
## Installation

```bash
pip install dspy-builtsimple
```
## Quick Start

### Basic Usage

```python
import dspy
from dspy_builtsimple import PubMedRM, ArxivRM, WikipediaRM

# Configure your LM
lm = dspy.LM("openai/gpt-4o-mini")
dspy.settings.configure(lm=lm)

# Use the PubMed retriever
rm = PubMedRM(k=5)
results = rm("CRISPR gene editing mechanisms")

for passage in results.passages:
    print(f"[{passage.metadata['pmid']}] {passage.metadata['title']}")
    print(passage.long_text[:200])
    print()
```
### Configure as Default RM

```python
import dspy
from dspy_builtsimple import ArxivRM

# Set ArXiv as the default retriever
rm = ArxivRM(k=5)
dspy.settings.configure(rm=rm)

# Now dspy.Retrieve will use ArXiv
retrieve = dspy.Retrieve(k=3)
results = retrieve("transformer attention mechanism")

for passage in results.passages:
    print(passage.long_text)
```
### Multi-Source Search

```python
from dspy_builtsimple import ResearchRM

# Search across all sources
rm = ResearchRM(k=9, sources=["pubmed", "arxiv", "wikipedia"])
results = rm("machine learning in drug discovery")

# Results are interleaved from each source
for passage in results.passages:
    source = passage.metadata["source"]
    title = passage.metadata["title"]
    print(f"[{source}] {title}")
```
## Building a RAG Pipeline

Here's a complete example of a research Q&A system:

```python
import dspy
from dspy_builtsimple import PubMedRM

# Configure DSPy
lm = dspy.LM("openai/gpt-4o-mini")
rm = PubMedRM(k=5)
dspy.settings.configure(lm=lm, rm=rm)

# Define the RAG signature
class ResearchQA(dspy.Signature):
    """Answer research questions using scientific literature."""

    context = dspy.InputField(desc="Retrieved scientific passages")
    question = dspy.InputField(desc="Research question to answer")
    answer = dspy.OutputField(desc="Evidence-based answer with citations")

# Build the RAG module
class ResearchRAG(dspy.Module):
    def __init__(self, num_passages=5):
        super().__init__()
        self.retrieve = dspy.Retrieve(k=num_passages)
        self.generate = dspy.ChainOfThought(ResearchQA)

    def forward(self, question):
        context = self.retrieve(question).passages
        response = self.generate(context=context, question=question)
        return dspy.Prediction(context=context, answer=response.answer)

# Use it
rag = ResearchRAG(num_passages=5)
result = rag("What are the latest advances in mRNA vaccine technology?")
print(result.answer)
```
## Retriever Reference

### PubMedRM

Search PubMed biomedical literature.

```python
from dspy_builtsimple import PubMedRM

rm = PubMedRM(
    k=5,                      # Number of passages to retrieve
    base_url="https://pubmed.built-simple.ai",
    timeout=30.0,             # Request timeout in seconds
    include_full_text=False,  # Fetch full articles (slower)
)
```

Metadata fields:

- `pmid`: PubMed ID
- `title`: Article title
- `journal`: Journal name
- `pub_year`: Publication year
- `doi`: Digital Object Identifier
- `url`: Link to PubMed
- `similarity_score`: Semantic similarity score
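As an illustration, the fields above can be combined into a compact citation string. This is a sketch, not part of the package API; the dict below stands in for `passage.metadata` with made-up values:

```python
def format_pubmed_citation(meta: dict) -> str:
    """Build a short citation from the documented PubMed metadata fields."""
    return f"{meta['title']}. {meta['journal']} ({meta['pub_year']}). PMID: {meta['pmid']}"

# Stand-in for passage.metadata (hypothetical values)
meta = {
    "pmid": "31919998",
    "title": "CRISPR-Cas9 gene editing",
    "journal": "Nature Methods",
    "pub_year": 2020,
}
print(format_pubmed_citation(meta))
# CRISPR-Cas9 gene editing. Nature Methods (2020). PMID: 31919998
```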
### ArxivRM

Search ArXiv preprints.

```python
from dspy_builtsimple import ArxivRM

rm = ArxivRM(
    k=5,
    base_url="https://arxiv.built-simple.ai",
    timeout=30.0,
)
```

Metadata fields:

- `arxiv_id`: ArXiv paper ID (e.g., "2301.12345")
- `title`: Paper title
- `authors`: Author names
- `year`: Publication year
- `url`: Link to abstract
- `pdf_url`: Direct PDF link
- `similarity_score`: Semantic similarity score
### WikipediaRM

Search Wikipedia articles.

```python
from dspy_builtsimple import WikipediaRM

rm = WikipediaRM(
    k=5,
    base_url="https://wikipedia.built-simple.ai",
    timeout=30.0,
)
```

Metadata fields:

- `id`: Internal article ID
- `title`: Article title
- `category`: Article category
- `url`: Wikipedia link
- `similarity_score`: Semantic similarity score
### ResearchRM

Search multiple sources simultaneously.

```python
from dspy_builtsimple import ResearchRM

rm = ResearchRM(
    k=9,                                       # Total passages to retrieve
    sources=["pubmed", "arxiv", "wikipedia"],  # Sources to search
    timeout=30.0,
)
```
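Since multi-source results arrive interleaved, it can be handy to regroup them by origin before display. A minimal sketch, using plain dicts in place of `passage.metadata` (the titles are hypothetical):

```python
from collections import defaultdict

def group_by_source(metadatas):
    """Bucket interleaved passage metadata back into per-source lists of titles."""
    groups = defaultdict(list)
    for meta in metadatas:
        groups[meta["source"]].append(meta["title"])
    return dict(groups)

# Hypothetical interleaved metadata, as a multi-source search might return it
metas = [
    {"source": "pubmed", "title": "Drug discovery with ML"},
    {"source": "arxiv", "title": "Graph networks for molecules"},
    {"source": "pubmed", "title": "Target identification"},
]
print(group_by_source(metas))
# {'pubmed': ['Drug discovery with ML', 'Target identification'], 'arxiv': ['Graph networks for molecules']}
```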
## Advanced Usage

### Full-Text Retrieval (PubMed)

For deeper context, fetch full article text instead of abstracts:

```python
from dspy_builtsimple import PubMedRM

rm = PubMedRM(k=3, include_full_text=True)
results = rm("COVID-19 vaccine efficacy trials")

# Full article text is now in the passages
for passage in results.passages:
    print(f"Content length: {len(passage.long_text)} chars")
    print(f"Has full text: {passage.metadata.get('has_full_text', False)}")
```
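Full articles can be much longer than an abstract and may not fit your LM's context window. The package does not document a chunking helper, so here is a generic fixed-size chunker with overlap you could apply to `passage.long_text` yourself:

```python
def chunk_text(text: str, size: int = 2000, overlap: int = 200):
    """Split long article text into overlapping chunks for prompting."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + size])
        start += size - overlap
    return chunks

chunks = chunk_text("word " * 1000, size=500, overlap=50)
print(len(chunks), len(chunks[0]))  # 12 500
```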
### Batch Queries

All retrievers support batch queries:

```python
from dspy_builtsimple import ArxivRM

rm = ArxivRM(k=3)
queries = [
    "large language models",
    "diffusion models",
    "reinforcement learning",
]
results = rm(queries)  # Returns combined results
```
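When batch queries overlap topically, the combined results can contain the same paper more than once. A small order-preserving dedup sketch over passage metadata (the IDs below are invented; `id_field` would be `pmid` for PubMed or `arxiv_id` for ArXiv):

```python
def dedupe_by_id(metadatas, id_field="arxiv_id"):
    """Keep only the first occurrence of each paper, preserving order."""
    seen = set()
    unique = []
    for meta in metadatas:
        if meta[id_field] not in seen:
            seen.add(meta[id_field])
            unique.append(meta)
    return unique

# Hypothetical combined metadata from two overlapping queries
metas = [
    {"arxiv_id": "2301.00001", "title": "LLMs"},
    {"arxiv_id": "2301.00002", "title": "Diffusion"},
    {"arxiv_id": "2301.00001", "title": "LLMs"},  # duplicate across queries
]
print(len(dedupe_by_id(metas)))  # 2
```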
### Custom Timeouts

For large result sets or slow connections:

```python
from dspy_builtsimple import PubMedRM

rm = PubMedRM(k=50, timeout=60.0)  # 60-second timeout
```
## API Information

These retrievers use the Built-Simple research APIs:
| API | Endpoint | Documents | Features |
|---|---|---|---|
| PubMed | pubmed.built-simple.ai | 4.5M+ | Hybrid search, full text |
| ArXiv | arxiv.built-simple.ai | 2.7M+ | GPU semantic search |
| Wikipedia | wikipedia.built-simple.ai | 4.8M+ | Hybrid + Elasticsearch |
All APIs are free to use with reasonable rate limits.
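How the retrievers behave when a rate limit is hit is not documented here, so a generic retry wrapper with exponential backoff is a reasonable safety net around any retriever call (the wrapped `rm(...)` call in the usage comment is illustrative):

```python
import time

def call_with_backoff(fn, attempts=4, base_delay=1.0):
    """Retry a callable with exponential backoff, e.g. for rate-limited requests."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of retries; surface the original error
            time.sleep(base_delay * (2 ** attempt))

# Usage (hypothetical): call_with_backoff(lambda: rm("quantum computing"))
```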
## Requirements

- Python 3.9+
- dspy >= 2.4.0
- httpx >= 0.25.0

## Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

## License

MIT License - see LICENSE for details.