Haystack integration for Built-Simple research APIs (PubMed, ArXiv)
haystack-builtsimple
Haystack integration for Built-Simple research APIs. Search PubMed and ArXiv scientific literature directly from your Haystack pipelines.
Features
- 🔬 PubMed Retriever - Hybrid search over 35M+ biomedical articles
- 📄 ArXiv Retriever - Search preprints in physics, math, CS, and more
- 🔗 Combined Retriever - Search both sources simultaneously
- 📖 Full Text Support - Optionally fetch full article text (PubMed)
- ⚡ Pipeline Ready - Drop-in components for Haystack 2.x pipelines
Installation
pip install haystack-builtsimple
Or with development dependencies:
pip install "haystack-builtsimple[dev]"
Quick Start
Basic Usage
from haystack_builtsimple import BuiltSimplePubMedRetriever, BuiltSimpleArxivRetriever
# Search PubMed
pubmed = BuiltSimplePubMedRetriever(top_k=5)
results = pubmed.run(query="CRISPR gene therapy clinical trials")
for doc in results["documents"]:
    print(f"[PMID {doc.meta['pmid']}] {doc.meta['title']}")
# Search ArXiv
arxiv = BuiltSimpleArxivRetriever(top_k=5)
results = arxiv.run(query="large language models reasoning")
for doc in results["documents"]:
    print(f"[{doc.meta['arxiv_id']}] {doc.meta['title']}")
In a Haystack Pipeline
from haystack import Pipeline
from haystack.components.builders import PromptBuilder
from haystack.components.generators import OpenAIGenerator
from haystack_builtsimple import BuiltSimplePubMedRetriever
# Create a RAG pipeline
pipeline = Pipeline()
pipeline.add_component("retriever", BuiltSimplePubMedRetriever(top_k=5))
pipeline.add_component("prompt", PromptBuilder(template="""
Based on these research papers:
{% for doc in documents %}
- {{ doc.meta.title }} (PMID: {{ doc.meta.pmid }})
{{ doc.content[:500] }}
{% endfor %}
Answer: {{ query }}
"""))
# OpenAIGenerator reads the OPENAI_API_KEY environment variable by default
pipeline.add_component("llm", OpenAIGenerator())
pipeline.connect("retriever.documents", "prompt.documents")
pipeline.connect("prompt", "llm")
# Run
result = pipeline.run({
    "retriever": {"query": "mRNA vaccine efficacy"},
    "prompt": {"query": "What factors affect mRNA vaccine efficacy?"}
})
print(result["llm"]["replies"][0])
Combined Search
Search both PubMed and ArXiv at once:
from haystack_builtsimple import BuiltSimpleCombinedRetriever
retriever = BuiltSimpleCombinedRetriever(
    top_k=10,
    merge_strategy="score",  # or "interleave", "pubmed_first", "arxiv_first"
)
results = retriever.run(query="machine learning drug discovery")
for doc in results["documents"]:
    source = doc.meta["source"]  # "pubmed" or "arxiv"
    print(f"[{source}] {doc.meta['title']}")
Components
BuiltSimplePubMedRetriever
Retrieves documents from PubMed using hybrid search (semantic + keyword).
Parameters:
| Parameter | Type | Default | Description |
|---|---|---|---|
| api_base | str | https://pubmed.built-simple.ai | API base URL |
| top_k | int | 10 | Number of documents to retrieve |
| fetch_full_text | bool | False | Fetch full article text |
| timeout | float | 30.0 | Request timeout in seconds |
Outputs:
- documents: List of Haystack Document objects
Document Metadata:
- pmid - PubMed ID
- title - Article title
- authors - Comma-separated author names
- journal - Journal name
- year - Publication year
- doi - DOI if available
- source - Always "pubmed"
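As an illustration of how these metadata fields might be used downstream, the sketch below formats a one-line citation from a plain dictionary that mirrors the keys listed above. The helper name format_citation and the sample values are hypothetical and not part of the package. In real use, the dictionary would be doc.meta from a retrieved document.

```python
def format_citation(meta: dict) -> str:
    """Build a one-line citation from PubMed-style metadata (hypothetical helper)."""
    authors = meta.get("authors") or "Unknown authors"
    year = meta.get("year") or "n.d."
    parts = [f"{authors} ({year})", meta.get("title") or "Untitled"]
    if meta.get("journal"):
        parts.append(meta["journal"])
    citation = ". ".join(parts) + f". PMID {meta.get('pmid', '?')}"
    # The DOI is documented as optional, so only append it when present.
    if meta.get("doi"):
        citation += f", doi:{meta['doi']}"
    return citation


# Placeholder values for illustration only, not a real PubMed record.
sample_meta = {
    "pmid": "00000000",
    "title": "Example article title",
    "authors": "Doe J, Roe R",
    "journal": "Example Journal",
    "year": 2024,
    "doi": None,
    "source": "pubmed",
}

print(format_citation(sample_meta))
```

Because doi is only present when available, reading it with meta.get avoids a KeyError on records that lack one.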
BuiltSimpleArxivRetriever
Retrieves documents from ArXiv.
Parameters:
| Parameter | Type | Default | Description |
|---|---|---|---|
| api_base | str | https://arxiv.built-simple.ai | API base URL |
| top_k | int | 10 | Number of documents to retrieve |
| timeout | float | 30.0 | Request timeout in seconds |
Outputs:
- documents: List of Haystack Document objects
Document Metadata:
- arxiv_id - ArXiv paper ID
- title - Paper title
- authors - Comma-separated author names
- categories - ArXiv categories
- published - Publication date
- url - Link to ArXiv abstract page
- source - Always "arxiv"
BuiltSimpleCombinedRetriever
Searches both PubMed and ArXiv, merging results.
Parameters:
| Parameter | Type | Default | Description |
|---|---|---|---|
| top_k | int | 10 | Total documents to return |
| pubmed_weight | float | 1.0 | Score weight for PubMed results |
| arxiv_weight | float | 1.0 | Score weight for ArXiv results |
| merge_strategy | str | "score" | How to merge: "score", "interleave", "pubmed_first", "arxiv_first" |
| fetch_full_text | bool | False | Fetch full text for PubMed |
| timeout | float | 30.0 | Request timeout in seconds |
Advanced Usage
Full Text Retrieval
For PubMed articles, you can fetch full text when available:
pubmed = BuiltSimplePubMedRetriever(
    top_k=3,
    fetch_full_text=True,  # Slower, but includes full text
)
Custom Merge Strategies
When using the combined retriever:
# Prioritize PubMed results
retriever = BuiltSimpleCombinedRetriever(
    merge_strategy="pubmed_first"
)
# Weight ArXiv higher
retriever = BuiltSimpleCombinedRetriever(
    merge_strategy="score",
    pubmed_weight=0.8,
    arxiv_weight=1.2
)
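To make the four strategy names more concrete, here is a minimal sketch of what each one could mean. It is an assumption about the behavior, not the package's actual implementation: the Hit class and merge_hits function are hypothetical stand-ins, and the real retriever may score, deduplicate, or break ties differently.

```python
from dataclasses import dataclass
from itertools import zip_longest


@dataclass
class Hit:
    """Hypothetical stand-in for a retrieved document."""
    title: str
    source: str
    score: float


def merge_hits(pubmed, arxiv, strategy="score", top_k=10,
               pubmed_weight=1.0, arxiv_weight=1.0):
    """Merge two ranked result lists (illustrative only)."""
    if strategy == "score":
        # Scale each score by its source weight, then sort descending.
        weighted = [(h.score * pubmed_weight, h) for h in pubmed]
        weighted += [(h.score * arxiv_weight, h) for h in arxiv]
        weighted.sort(key=lambda pair: pair[0], reverse=True)
        merged = [h for _, h in weighted]
    elif strategy == "interleave":
        # Alternate between sources, skipping gaps when one list is shorter.
        merged = [h for pair in zip_longest(pubmed, arxiv)
                  for h in pair if h is not None]
    elif strategy == "pubmed_first":
        merged = list(pubmed) + list(arxiv)
    elif strategy == "arxiv_first":
        merged = list(arxiv) + list(pubmed)
    else:
        raise ValueError(f"Unknown merge strategy: {strategy!r}")
    return merged[:top_k]


pubmed_hits = [Hit("P1", "pubmed", 0.9), Hit("P2", "pubmed", 0.5)]
arxiv_hits = [Hit("A1", "arxiv", 0.8), Hit("A2", "arxiv", 0.4)]

boosted = merge_hits(pubmed_hits, arxiv_hits, strategy="score",
                     pubmed_weight=0.8, arxiv_weight=1.2)
print([h.title for h in boosted])
```

Under this reading, the weights only matter for the "score" strategy; the positional strategies ignore scores entirely.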
Using with DocumentJoiner
For more control, use separate retrievers with Haystack's DocumentJoiner:
from haystack import Pipeline
from haystack.components.joiners import DocumentJoiner
from haystack_builtsimple import BuiltSimplePubMedRetriever, BuiltSimpleArxivRetriever
pipeline = Pipeline()
pipeline.add_component("pubmed", BuiltSimplePubMedRetriever(top_k=5))
pipeline.add_component("arxiv", BuiltSimpleArxivRetriever(top_k=5))
pipeline.add_component("joiner", DocumentJoiner())
pipeline.connect("pubmed.documents", "joiner.documents")
pipeline.connect("arxiv.documents", "joiner.documents")
result = pipeline.run({
    "pubmed": {"query": "protein folding"},
    "arxiv": {"query": "protein folding"},
})
Examples
See the examples/ directory for complete working examples:
- basic_retrieval.py - Simple standalone usage
- rag_pipeline.py - Full RAG pipeline with LLM
- combined_search.py - Multi-source search patterns
API Reference
Built-Simple APIs
This package uses Built-Simple's hosted research APIs:
- PubMed API: https://pubmed.built-simple.ai
  - POST /hybrid-search - Hybrid semantic + keyword search
  - GET /article/{pmid}/full_text - Fetch full text
- ArXiv API: https://arxiv.built-simple.ai
  - GET /api/search?q=...&limit=N - Search papers
No API key required. Rate limits apply for heavy usage.
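For readers who want to see what requests to the two GET endpoints above look like, the sketch below only builds the URLs and sends nothing. The helper names are hypothetical; the paths and the q and limit parameters come from the endpoint list above. The request body for POST /hybrid-search is not documented here, so it is left out rather than guessed.

```python
from urllib.parse import quote, urlencode

ARXIV_BASE = "https://arxiv.built-simple.ai"
PUBMED_BASE = "https://pubmed.built-simple.ai"


def arxiv_search_url(query: str, limit: int = 10, api_base: str = ARXIV_BASE) -> str:
    """Build the GET /api/search URL (hypothetical helper; no request is sent)."""
    params = urlencode({"q": query, "limit": limit})
    return f"{api_base.rstrip('/')}/api/search?{params}"


def pubmed_full_text_url(pmid, api_base: str = PUBMED_BASE) -> str:
    """Build the GET /article/{pmid}/full_text URL (hypothetical helper)."""
    return f"{api_base.rstrip('/')}/article/{quote(str(pmid), safe='')}/full_text"


print(arxiv_search_url("protein folding", limit=5))
print(pubmed_full_text_url(12345678))  # placeholder PMID for illustration
```

Encoding the query with urlencode keeps spaces and special characters from breaking the URL, which matters for free-text search strings.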
Contributing
Contributions welcome! Please:
- Fork the repository
- Create a feature branch
- Add tests for new functionality
- Submit a pull request
License
MIT License - see LICENSE for details.