LangChain integration for Built-Simple research APIs (PubMed & ArXiv)

langchain-builtsimple

LangChain integration for Built-Simple research APIs, providing easy access to PubMed and ArXiv scientific literature.

Features

  • PubMed Retriever & Tool - Search 4.5M+ peer-reviewed biomedical articles
  • ArXiv Retriever & Tool - Search 2.7M+ preprints in physics, math, CS, and ML
  • Combined Retriever - Search both sources simultaneously
  • RAG-ready - Documents include full metadata for citations
  • Agent-compatible - Tools work with LangChain agents out of the box

What Data is Included

PubMed Documents

  • page_content: Title + abstract (default) OR full article text (with include_full_text=True)
  • metadata:
    • pmid - PubMed ID
    • title - Article title
    • journal - Journal name
    • pub_year - Publication year
    • doi - DOI identifier
    • url - Link to PubMed
    • has_full_text - Boolean indicating if full text was fetched
    • full_text_length - Character count when available
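As a minimal sketch of how these metadata fields could feed a citation line, the helper below builds a one-line reference from a metadata dict. The function name `format_citation`, the output layout, and the placeholder values are assumptions for illustration; they are not part of the package.

```python
def format_citation(metadata: dict) -> str:
    """Build a one-line citation from the PubMed metadata keys listed above.

    Hypothetical helper: missing keys are skipped rather than raising.
    """
    title = metadata.get("title", "Untitled")
    parts = [title.rstrip(".") + "."]
    if metadata.get("journal"):
        parts.append(metadata["journal"].rstrip(".") + ".")
    if metadata.get("pub_year"):
        parts.append(f"{metadata['pub_year']}.")
    if metadata.get("doi"):
        parts.append(f"doi:{metadata['doi']}.")
    if metadata.get("pmid"):
        parts.append(f"PMID: {metadata['pmid']}.")
    return " ".join(parts)


# Placeholder values, not a real article.
example = {
    "pmid": "12345678",
    "title": "An example article title",
    "journal": "Example Journal",
    "pub_year": 2021,
    "doi": "10.1000/example",
}
print(format_citation(example))
```

With real results, the same helper could be called as `format_citation(doc.metadata)` for each returned document.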

🔥 FULL TEXT AVAILABLE! Unlike most research APIs, Built-Simple can return the complete article text instead of only the abstract. Pass include_full_text=True to the PubMed retriever:

from langchain_builtsimple import BuiltSimplePubMedRetriever

# Get full articles (15K-70K chars each!)
retriever = BuiltSimplePubMedRetriever(limit=5, include_full_text=True)
docs = retriever.invoke("cancer immunotherapy")

for doc in docs:
    print(f"Full text: {len(doc.page_content)} chars")  # ~15,000-70,000!
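Full-text documents of 15K-70K characters are often too long to pass to a model in one piece, so a RAG pipeline will usually split them first. The following is a minimal sketch of fixed-size chunking with overlap in plain Python; `chunk_text` and its default sizes are assumptions chosen for illustration, not something the package provides.

```python
def chunk_text(text: str, chunk_size: int = 2000, overlap: int = 200) -> list[str]:
    """Split text into chunks of at most chunk_size characters.

    Consecutive chunks share `overlap` characters so that sentences cut at a
    boundary still appear whole in one of the two neighbouring chunks.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in the range [0, chunk_size)")

    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last chunk already reaches the end of the text
    return chunks


# Stand-in for doc.page_content from a full-text document.
sample = "x" * 5000
print([len(chunk) for chunk in chunk_text(sample)])  # [2000, 2000, 1400]
```

In practice the input would be `doc.page_content`, and each chunk would carry a copy of `doc.metadata` so citations survive the split.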

ArXiv Documents

  • page_content: Title + authors + abstract
  • metadata:
    • arxiv_id - ArXiv ID (e.g., "2301.12345")
    • title - Paper title
    • authors - Author list
    • year - Publication year
    • url - ArXiv page link
    • pdf_url - Direct PDF link

⚠️ Abstracts only - Full PDFs are not downloaded. Use pdf_url to fetch if needed.
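If the PDF itself is needed, one option is to download it from pdf_url with the standard library. This is a sketch under assumptions: the helper names `pdf_path_for` and `download_pdf`, the `papers` directory, and the file naming scheme are all hypothetical and not part of the package.

```python
from pathlib import Path
from urllib.request import urlopen


def pdf_path_for(metadata: dict, dest_dir: str = "papers") -> Path:
    """Return the local path where the PDF for this document would be stored."""
    # Old-style ArXiv IDs contain a slash (e.g. "hep-th/9901001"), which is
    # not safe in a file name, so replace it.
    safe_id = metadata["arxiv_id"].replace("/", "_")
    return Path(dest_dir) / f"{safe_id}.pdf"


def download_pdf(metadata: dict, dest_dir: str = "papers", timeout: float = 30.0) -> Path:
    """Fetch metadata["pdf_url"] and write the bytes to disk."""
    target = pdf_path_for(metadata, dest_dir)
    target.parent.mkdir(parents=True, exist_ok=True)
    with urlopen(metadata["pdf_url"], timeout=timeout) as response:
        target.write_bytes(response.read())
    return target
```

A call such as `download_pdf(doc.metadata)` for each returned document would then save the files locally; keep the request rate low to stay within ArXiv's usage expectations.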

Installation

pip install langchain-builtsimple

For development with examples:

pip install "langchain-builtsimple[dev]"

Quick Start

Basic Retrieval

from langchain_builtsimple import BuiltSimplePubMedRetriever, BuiltSimpleArxivRetriever

# Search PubMed
pubmed = BuiltSimplePubMedRetriever(limit=5)
docs = pubmed.invoke("CRISPR gene therapy")

for doc in docs:
    print(f"Title: {doc.metadata['title']}")
    print(f"Journal: {doc.metadata['journal']}")
    print(f"URL: {doc.metadata['url']}\n")

# Search ArXiv
arxiv = BuiltSimpleArxivRetriever(limit=5)
docs = arxiv.invoke("transformer neural networks")

for doc in docs:
    print(f"Title: {doc.metadata['title']}")
    print(f"Authors: {doc.metadata['authors']}")
    print(f"ArXiv ID: {doc.metadata['arxiv_id']}\n")

RAG Chain with ChatOpenAI

from langchain_builtsimple import BuiltSimplePubMedRetriever
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

# Create retriever
retriever = BuiltSimplePubMedRetriever(limit=5)

# Format documents for context
def format_docs(docs):
    return "\n\n".join(
        f"[{i+1}] {doc.metadata['title']} ({doc.metadata.get('pub_year', 'N/A')})\n{doc.page_content}"
        for i, doc in enumerate(docs)
    )

# Create RAG prompt
prompt = ChatPromptTemplate.from_template("""
Answer the question based on the following research papers. 
Cite papers by number [1], [2], etc.

Papers:
{context}

Question: {question}

Answer:""")

# Build chain
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

# Ask a question
answer = chain.invoke("What are the latest developments in CAR-T cell therapy?")
print(answer)

Agent with Research Tools

from langchain_builtsimple import BuiltSimplePubMedTool, BuiltSimpleArxivTool
from langchain_openai import ChatOpenAI
from langchain.agents import AgentExecutor, create_tool_calling_agent
from langchain_core.prompts import ChatPromptTemplate

# Create tools
tools = [
    BuiltSimplePubMedTool(),  # For biomedical research
    BuiltSimpleArxivTool(),   # For CS/ML/physics papers
]

# Create agent
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

prompt = ChatPromptTemplate.from_messages([
    ("system", """You are a research assistant with access to scientific databases.
    Use pubmed_search for medical/biological topics.
    Use arxiv_search for AI/ML/physics/math topics.
    Always cite your sources."""),
    ("human", "{input}"),
    ("placeholder", "{agent_scratchpad}"),
])

agent = create_tool_calling_agent(llm, tools, prompt)
executor = AgentExecutor(agent=agent, tools=tools, verbose=True)

# Run agent
response = executor.invoke({
    "input": "Find recent papers on using transformers for drug discovery"
})
print(response["output"])

API Reference

Retrievers

All retrievers inherit from langchain_core.retrievers.BaseRetriever and return List[Document].

BuiltSimplePubMedRetriever

BuiltSimplePubMedRetriever(
    base_url: str = "https://pubmed.built-simple.ai",
    limit: int = 10,
    include_full_text: bool = False,
    timeout: float = 30.0
)

Document Metadata:

  • source: "pubmed"
  • pmid: PubMed ID
  • title: Paper title
  • journal: Journal name
  • pub_year: Publication year
  • doi: DOI (if available)
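  • url: Link to PubMed page
  • has_full_text: Boolean indicating if full text was fetched
  • full_text_length: Character count when available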

BuiltSimpleArxivRetriever

BuiltSimpleArxivRetriever(
    base_url: str = "https://arxiv.built-simple.ai",
    limit: int = 10,
    timeout: float = 30.0
)

Document Metadata:

  • source: "arxiv"
  • arxiv_id: ArXiv identifier
  • title: Paper title
  • authors: List of author names
  • year: Publication year
  • url: Link to ArXiv page
  • pdf_url: Direct PDF link

BuiltSimpleResearchRetriever

Searches both PubMed and ArXiv, interleaving results.

BuiltSimpleResearchRetriever(
    pubmed_url: str = "https://pubmed.built-simple.ai",
    arxiv_url: str = "https://arxiv.built-simple.ai",
    limit_per_source: int = 5,
    timeout: float = 30.0
)
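The exact merge order used by the combined retriever is not documented here. As an illustration of what interleaving two result lists can look like, the sketch below alternates items round-robin and appends whatever remains from the longer list. The `interleave` function and the placeholder result names are assumptions, not the package's implementation.

```python
from itertools import chain, zip_longest

# Sentinel that marks "no item" without colliding with real results.
_MISSING = object()


def interleave(first: list, second: list) -> list:
    """Alternate items from both lists, keeping leftovers from the longer one."""
    merged = chain.from_iterable(zip_longest(first, second, fillvalue=_MISSING))
    return [item for item in merged if item is not _MISSING]


# Placeholder strings standing in for Document objects.
pubmed_hits = ["pubmed-1", "pubmed-2", "pubmed-3"]
arxiv_hits = ["arxiv-1"]
print(interleave(pubmed_hits, arxiv_hits))
# ['pubmed-1', 'arxiv-1', 'pubmed-2', 'pubmed-3']
```

Round-robin ordering keeps both sources visible near the top of the list even when one source returns far more results than the other.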

Tools

All tools inherit from langchain_core.tools.BaseTool and can be used with LangChain agents.

BuiltSimplePubMedTool

  • Name: pubmed_search
  • Description: Search PubMed for peer-reviewed biomedical literature
  • Input: query (str), limit (int, default=5)

BuiltSimpleArxivTool

  • Name: arxiv_search
  • Description: Search ArXiv for preprints in physics, math, CS, ML
  • Input: query (str), limit (int, default=5)

BuiltSimpleResearchTool

  • Name: research_search
  • Description: Search both PubMed and ArXiv simultaneously
  • Input: query (str), limit (int, default=5)

Examples

See the examples/ directory for complete working examples:

  • basic_retrieval.py - Simple retriever usage
  • rag_chain.py - RAG chain with ChatOpenAI
  • agent_with_tools.py - Agent with research tools

License

MIT License - see LICENSE for details.
