LangChain integration for Built-Simple research APIs (PubMed & ArXiv)
langchain-builtsimple
LangChain integration for Built-Simple research APIs, providing easy access to PubMed and ArXiv scientific literature.
Features
- PubMed Retriever & Tool - Search 4.5M+ peer-reviewed biomedical articles
- ArXiv Retriever & Tool - Search 2.7M+ preprints in physics, math, CS, and ML
- Combined Retriever - Search both sources simultaneously
- RAG-ready - Documents include full metadata for citations
- Agent-compatible - Tools work with LangChain agents out of the box
What Data is Included
PubMed Documents
- page_content: Title + abstract (default) OR full article text (with include_full_text=True)
- metadata:
  - pmid - PubMed ID
  - title - Article title
  - journal - Journal name
  - pub_year - Publication year
  - doi - DOI identifier
  - url - Link to PubMed
  - has_full_text - Boolean indicating if full text was fetched
  - full_text_length - Character count when available
🔥 FULL TEXT AVAILABLE! Unlike most research APIs, Built-Simple provides complete article text:
from langchain_builtsimple import BuiltSimplePubMedRetriever

# Get full articles (15K-70K chars each!)
retriever = BuiltSimplePubMedRetriever(limit=5, include_full_text=True)
docs = retriever.invoke("cancer immunotherapy")
for doc in docs:
    print(f"Full text: {len(doc.page_content)} chars")  # ~15,000-70,000!
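Full-text documents of 15K-70K characters can be too large to pass to a model alongside several other papers. One common approach is to split page_content into overlapping character windows before indexing or prompting. The sketch below is a minimal illustration; chunk_text and its default sizes are hypothetical and not part of langchain-builtsimple.

```python
def chunk_text(text: str, chunk_size: int = 4000, overlap: int = 200) -> list[str]:
    """Split text into windows of at most chunk_size characters.

    Consecutive windows share `overlap` characters so that sentences cut at a
    boundary still appear whole in one of the two neighbouring chunks.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks
```

Calling chunk_text(doc.page_content) on a retrieved document would yield a list of strings that can each be embedded or summarised separately. LangChain's own text splitters are an alternative if sentence-aware boundaries are needed.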
ArXiv Documents
- page_content: Title + authors + abstract
- metadata:
  - arxiv_id - ArXiv ID (e.g., "2301.12345")
  - title - Paper title
  - authors - Author list
  - year - Publication year
  - url - ArXiv page link
  - pdf_url - Direct PDF link
⚠️ Abstracts only - Full PDFs are not downloaded. Use pdf_url to fetch if needed.
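If the PDF itself is needed, the pdf_url metadata field can be fetched with any HTTP client. The following is a minimal sketch using only the standard library; download_pdf is a hypothetical helper, not something the package provides, and it assumes the metadata dict carries the arxiv_id and pdf_url keys listed above.

```python
import shutil
import urllib.request
from pathlib import Path


def download_pdf(metadata: dict, dest_dir: str = ".", timeout: float = 30.0) -> Path:
    """Save the file at metadata['pdf_url'] into dest_dir and return its path."""
    pdf_url = metadata["pdf_url"]
    # Old-style ArXiv IDs contain "/" (e.g. "hep-th/9901001"), which is not
    # safe in a file name, so replace it.
    filename = metadata["arxiv_id"].replace("/", "_") + ".pdf"

    target_dir = Path(dest_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / filename

    # Stream the response to disk instead of holding the whole PDF in memory.
    with urllib.request.urlopen(pdf_url, timeout=timeout) as response:
        with open(target, "wb") as out:
            shutil.copyfileobj(response, out)
    return target
```

For a retrieved document this would be called as download_pdf(doc.metadata, dest_dir="papers"). When fetching many papers, adding a pause between requests is a reasonable courtesy to the server.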
Installation
pip install langchain-builtsimple
For development with examples:
pip install "langchain-builtsimple[dev]"
Quick Start
Basic Retrieval
from langchain_builtsimple import BuiltSimplePubMedRetriever, BuiltSimpleArxivRetriever

# Search PubMed
pubmed = BuiltSimplePubMedRetriever(limit=5)
docs = pubmed.invoke("CRISPR gene therapy")
for doc in docs:
    print(f"Title: {doc.metadata['title']}")
    print(f"Journal: {doc.metadata['journal']}")
    print(f"URL: {doc.metadata['url']}\n")

# Search ArXiv
arxiv = BuiltSimpleArxivRetriever(limit=5)
docs = arxiv.invoke("transformer neural networks")
for doc in docs:
    print(f"Title: {doc.metadata['title']}")
    print(f"Authors: {doc.metadata['authors']}")
    print(f"ArXiv ID: {doc.metadata['arxiv_id']}\n")
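Because each document carries a publication year in its metadata (pub_year for PubMed, year for ArXiv), results can be narrowed after retrieval. The sketch below shows one way to do that; filter_by_year is a hypothetical helper, and since the type of the year value is not specified here, it coerces the value to int and skips anything that cannot be parsed.

```python
def filter_by_year(docs, min_year: int, year_key: str = "pub_year"):
    """Return the documents whose metadata[year_key] is >= min_year.

    Documents with a missing or non-numeric year are dropped rather than
    guessed at.
    """
    kept = []
    for doc in docs:
        try:
            year = int(doc.metadata.get(year_key))
        except (TypeError, ValueError):
            # None (missing key) raises TypeError; text like "n/a" raises ValueError.
            continue
        if year >= min_year:
            kept.append(doc)
    return kept
```

For PubMed results this would be filter_by_year(docs, 2020); for ArXiv results, pass year_key="year".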
RAG Chain with ChatOpenAI
from langchain_builtsimple import BuiltSimplePubMedRetriever
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
# Create retriever
retriever = BuiltSimplePubMedRetriever(limit=5)
# Format documents for context
def format_docs(docs):
    return "\n\n".join(
        f"[{i+1}] {doc.metadata['title']} ({doc.metadata.get('pub_year', 'N/A')})\n{doc.page_content}"
        for i, doc in enumerate(docs)
    )
# Create RAG prompt
prompt = ChatPromptTemplate.from_template("""
Answer the question based on the following research papers.
Cite papers by number [1], [2], etc.
Papers:
{context}
Question: {question}
Answer:""")
# Build chain
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)
# Ask a question
answer = chain.invoke("What are the latest developments in CAR-T cell therapy?")
print(answer)
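The chain above returns only the model's answer, so the bracketed citation numbers are not resolvable on their own. One option is to build a matching reference list from the same documents. The sketch below is an illustration under assumptions: format_references is a hypothetical helper, and it relies only on the title, pub_year or year, and url metadata keys described earlier.

```python
def format_references(docs) -> str:
    """Build a numbered reference list that lines up with [1], [2], ... citations."""
    lines = []
    for i, doc in enumerate(docs, start=1):
        meta = doc.metadata
        title = meta.get("title", "Untitled")
        # PubMed documents use "pub_year"; ArXiv documents use "year".
        year = meta.get("pub_year") or meta.get("year") or "n.d."
        link = meta.get("url", "")
        # rstrip() removes the trailing space left when no URL is present.
        lines.append(f"[{i}] {title} ({year}). {link}".rstrip())
    return "\n".join(lines)
```

To keep the numbering consistent, the same list of documents would need to be passed to both format_docs and format_references, for example by calling retriever.invoke(question) once and reusing the result.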
Agent with Research Tools
from langchain_builtsimple import BuiltSimplePubMedTool, BuiltSimpleArxivTool
from langchain_openai import ChatOpenAI
from langchain.agents import AgentExecutor, create_tool_calling_agent
from langchain_core.prompts import ChatPromptTemplate
# Create tools
tools = [
    BuiltSimplePubMedTool(),  # For biomedical research
    BuiltSimpleArxivTool(),   # For CS/ML/physics papers
]

# Create agent
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
prompt = ChatPromptTemplate.from_messages([
    ("system", """You are a research assistant with access to scientific databases.
Use pubmed_search for medical/biological topics.
Use arxiv_search for AI/ML/physics/math topics.
Always cite your sources."""),
    ("human", "{input}"),
    ("placeholder", "{agent_scratchpad}"),
])
agent = create_tool_calling_agent(llm, tools, prompt)
executor = AgentExecutor(agent=agent, tools=tools, verbose=True)

# Run agent
response = executor.invoke({
    "input": "Find recent papers on using transformers for drug discovery"
})
print(response["output"])
API Reference
Retrievers
All retrievers inherit from langchain_core.retrievers.BaseRetriever and return List[Document].
BuiltSimplePubMedRetriever
BuiltSimplePubMedRetriever(
    base_url: str = "https://pubmed.built-simple.ai",
    limit: int = 10,
    include_full_text: bool = False,
    timeout: float = 30.0
)
Document Metadata:
- source: "pubmed"
- pmid: PubMed ID
- title: Paper title
- journal: Journal name
- pub_year: Publication year
- doi: DOI (if available)
- url: Link to PubMed page
BuiltSimpleArxivRetriever
BuiltSimpleArxivRetriever(
    base_url: str = "https://arxiv.built-simple.ai",
    limit: int = 10,
    timeout: float = 30.0
)
Document Metadata:
- source: "arxiv"
- arxiv_id: ArXiv identifier
- title: Paper title
- authors: List of author names
- year: Publication year
- url: Link to ArXiv page
BuiltSimpleResearchRetriever
Searches both PubMed and ArXiv, interleaving results.
BuiltSimpleResearchRetriever(
    pubmed_url: str = "https://pubmed.built-simple.ai",
    arxiv_url: str = "https://arxiv.built-simple.ai",
    limit_per_source: int = 5,
    timeout: float = 30.0
)
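The exact order in which the combined retriever interleaves results is not specified here. As an illustration of what interleaving two result lists can look like, the sketch below alternates items round-robin; the interleave function is hypothetical and may not match the library's internal behavior.

```python
from itertools import zip_longest


def interleave(first: list, second: list) -> list:
    """Alternate items from two lists; leftovers from the longer list come last."""
    missing = object()  # sentinel so that None remains a valid list item
    merged = []
    for a, b in zip_longest(first, second, fillvalue=missing):
        if a is not missing:
            merged.append(a)
        if b is not missing:
            merged.append(b)
    return merged
```

With three PubMed results and one ArXiv result, this ordering would place the ArXiv result second, followed by the remaining PubMed results.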
Tools
All tools inherit from langchain_core.tools.BaseTool and can be used with LangChain agents.
BuiltSimplePubMedTool
- Name: pubmed_search
- Description: Search PubMed for peer-reviewed biomedical literature
- Input: query (str), limit (int, default=5)
BuiltSimpleArxivTool
- Name: arxiv_search
- Description: Search ArXiv for preprints in physics, math, CS, ML
- Input: query (str), limit (int, default=5)
BuiltSimpleResearchTool
- Name: research_search
- Description: Search both PubMed and ArXiv simultaneously
- Input: query (str), limit (int, default=5)
Examples
See the examples/ directory for complete working examples:
- basic_retrieval.py - Simple retriever usage
- rag_chain.py - RAG chain with ChatOpenAI
- agent_with_tools.py - Agent with research tools
License
MIT License - see LICENSE for details.