Haystack integration for Built-Simple research APIs (PubMed, ArXiv)

haystack-builtsimple

Haystack integration for Built-Simple research APIs. Search PubMed and ArXiv scientific literature directly from your Haystack pipelines.

Features

  • 🔬 PubMed Retriever - Hybrid search over 35M+ biomedical articles
  • 📄 ArXiv Retriever - Search preprints in physics, math, CS, and more
  • 🔗 Combined Retriever - Search both sources simultaneously
  • 📖 Full Text Support - Optionally fetch full article text (PubMed)
  • Pipeline Ready - Drop-in components for Haystack 2.x pipelines

Installation

pip install haystack-builtsimple

Or with development dependencies:

pip install "haystack-builtsimple[dev]"

Quick Start

Basic Usage

from haystack_builtsimple import BuiltSimplePubMedRetriever, BuiltSimpleArxivRetriever

# Search PubMed
pubmed = BuiltSimplePubMedRetriever(top_k=5)
results = pubmed.run(query="CRISPR gene therapy clinical trials")
for doc in results["documents"]:
    print(f"[PMID {doc.meta['pmid']}] {doc.meta['title']}")

# Search ArXiv
arxiv = BuiltSimpleArxivRetriever(top_k=5)
results = arxiv.run(query="large language models reasoning")
for doc in results["documents"]:
    print(f"[{doc.meta['arxiv_id']}] {doc.meta['title']}")

In a Haystack Pipeline

from haystack import Pipeline
from haystack.components.builders import PromptBuilder
from haystack.components.generators import OpenAIGenerator
from haystack_builtsimple import BuiltSimplePubMedRetriever

# Create a RAG pipeline
pipeline = Pipeline()
pipeline.add_component("retriever", BuiltSimplePubMedRetriever(top_k=5))
pipeline.add_component("prompt", PromptBuilder(template="""
Based on these research papers:
{% for doc in documents %}
- {{ doc.meta.title }} (PMID: {{ doc.meta.pmid }})
  {{ doc.content[:500] }}
{% endfor %}

Answer: {{ query }}
"""))
pipeline.add_component("llm", OpenAIGenerator())

pipeline.connect("retriever.documents", "prompt.documents")
pipeline.connect("prompt", "llm")

# Run
result = pipeline.run({
    "retriever": {"query": "mRNA vaccine efficacy"},
    "prompt": {"query": "What factors affect mRNA vaccine efficacy?"}
})
print(result["llm"]["replies"][0])

Combined Search

Search both PubMed and ArXiv at once:

from haystack_builtsimple import BuiltSimpleCombinedRetriever

retriever = BuiltSimpleCombinedRetriever(
    top_k=10,
    merge_strategy="score",  # or "interleave", "pubmed_first", "arxiv_first"
)

results = retriever.run(query="machine learning drug discovery")
for doc in results["documents"]:
    source = doc.meta["source"]  # "pubmed" or "arxiv"
    print(f"[{source}] {doc.meta['title']}")

Components

BuiltSimplePubMedRetriever

Retrieves documents from PubMed using hybrid search (semantic + keyword).

Parameters:

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| api_base | str | https://pubmed.built-simple.ai | API base URL |
| top_k | int | 10 | Number of documents to retrieve |
| fetch_full_text | bool | False | Fetch full article text |
| timeout | float | 30.0 | Request timeout in seconds |

Outputs:

  • documents: List of Haystack Document objects

Document Metadata:

  • pmid - PubMed ID
  • title - Article title
  • authors - Comma-separated author names
  • journal - Journal name
  • year - Publication year
  • doi - DOI if available
  • source - Always "pubmed"

BuiltSimpleArxivRetriever

Retrieves documents from ArXiv.

Parameters:

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| api_base | str | https://arxiv.built-simple.ai | API base URL |
| top_k | int | 10 | Number of documents to retrieve |
| timeout | float | 30.0 | Request timeout in seconds |

Outputs:

  • documents: List of Haystack Document objects

Document Metadata:

  • arxiv_id - ArXiv paper ID
  • title - Paper title
  • authors - Comma-separated author names
  • categories - ArXiv categories
  • published - Publication date
  • url - Link to ArXiv abstract page
  • source - Always "arxiv"
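
The metadata fields listed for each retriever are enough to build a short citation line. Below is a minimal sketch, assuming `meta` is a plain dict shaped like `doc.meta`. `format_citation` is a hypothetical helper, not part of the package, and the sample values are placeholders rather than real records. It also assumes individual author names contain no commas.

```python
def format_citation(meta: dict) -> str:
    """Build a one-line citation from retriever metadata (hypothetical helper)."""
    title = meta.get("title") or "Untitled"
    authors = meta.get("authors") or ""
    # Authors arrive as one comma-separated string; shorten long author lists.
    names = [name.strip() for name in authors.split(",") if name.strip()]
    if len(names) > 3:
        byline = f"{names[0]} et al."
    else:
        byline = ", ".join(names) or "Unknown authors"

    if meta.get("source") == "pubmed":
        ref = f"PMID {meta.get('pmid', '?')}"
        when = meta.get("year", "n.d.")
    else:
        ref = f"arXiv:{meta.get('arxiv_id', '?')}"
        when = meta.get("published", "n.d.")
    return f"{byline} ({when}). {title}. [{ref}]"


# Placeholder metadata, not real records.
pubmed_meta = {
    "source": "pubmed", "pmid": "00000000", "title": "Example article",
    "authors": "A. Author, B. Author, C. Author, D. Author", "year": 2024,
}
arxiv_meta = {
    "source": "arxiv", "arxiv_id": "0000.00000", "title": "Example preprint",
    "authors": "E. Author", "published": "2024-01-01",
}
print(format_citation(pubmed_meta))
print(format_citation(arxiv_meta))
```

In a pipeline, the same function could be applied to `doc.meta` for each returned document.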

BuiltSimpleCombinedRetriever

Searches both PubMed and ArXiv, merging results.

Parameters:

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| top_k | int | 10 | Total documents to return |
| pubmed_weight | float | 1.0 | Score weight for PubMed results |
| arxiv_weight | float | 1.0 | Score weight for ArXiv results |
| merge_strategy | str | "score" | How to merge: "score", "interleave", "pubmed_first", "arxiv_first" |
| fetch_full_text | bool | False | Fetch full text for PubMed |
| timeout | float | 30.0 | Request timeout |
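
The exact semantics of the four merge_strategy values are not spelled out here, so the following is one plausible reading rather than the package's actual code. The sketch works on plain dicts with `id` and `score` keys; `merge_results` and the sample hits are hypothetical.

```python
from itertools import zip_longest


def merge_results(pubmed, arxiv, strategy="score", top_k=10,
                  pubmed_weight=1.0, arxiv_weight=1.0):
    """Merge two ranked result lists (illustrative, not the package's code).

    Each result is a dict with "id" and "score"; both inputs are assumed
    to be sorted best-first already.
    """
    if strategy == "score":
        # Scale each source's scores by its weight, then sort globally.
        weighted = (
            [{**d, "score": d["score"] * pubmed_weight} for d in pubmed]
            + [{**d, "score": d["score"] * arxiv_weight} for d in arxiv]
        )
        merged = sorted(weighted, key=lambda d: d["score"], reverse=True)
    elif strategy == "interleave":
        # Alternate sources: P1, A1, P2, A2, ... skipping exhausted lists.
        merged = [d for pair in zip_longest(pubmed, arxiv)
                  for d in pair if d is not None]
    elif strategy == "pubmed_first":
        merged = list(pubmed) + list(arxiv)
    elif strategy == "arxiv_first":
        merged = list(arxiv) + list(pubmed)
    else:
        raise ValueError(f"unknown merge strategy: {strategy!r}")
    return merged[:top_k]


# Made-up hits for demonstration only.
pubmed_hits = [{"id": "P1", "score": 0.9}, {"id": "P2", "score": 0.5}]
arxiv_hits = [{"id": "A1", "score": 0.8}, {"id": "A2", "score": 0.6}]

for name in ("score", "interleave", "pubmed_first", "arxiv_first"):
    ids = [d["id"] for d in merge_results(pubmed_hits, arxiv_hits, name, top_k=3)]
    print(name, ids)
```

Under this reading, the weights only matter for the "score" strategy: lowering pubmed_weight pushes PubMed hits down the global ranking, while the other three strategies ignore scores and depend only on each source's own order.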

Advanced Usage

Full Text Retrieval

For PubMed articles, you can fetch full text when available:

from haystack_builtsimple import BuiltSimplePubMedRetriever

pubmed = BuiltSimplePubMedRetriever(
    top_k=3,
    fetch_full_text=True,  # Slower, but includes full text when available
)

Custom Merge Strategies

When using the combined retriever:

from haystack_builtsimple import BuiltSimpleCombinedRetriever

# Prioritize PubMed results
pubmed_first = BuiltSimpleCombinedRetriever(
    merge_strategy="pubmed_first"
)

# Weight ArXiv higher
arxiv_weighted = BuiltSimpleCombinedRetriever(
    merge_strategy="score",
    pubmed_weight=0.8,
    arxiv_weight=1.2
)

Using with DocumentJoiner

For more control, use separate retrievers with Haystack's DocumentJoiner:

from haystack import Pipeline
from haystack.components.joiners import DocumentJoiner
from haystack_builtsimple import BuiltSimplePubMedRetriever, BuiltSimpleArxivRetriever

pipeline = Pipeline()
pipeline.add_component("pubmed", BuiltSimplePubMedRetriever(top_k=5))
pipeline.add_component("arxiv", BuiltSimpleArxivRetriever(top_k=5))
pipeline.add_component("joiner", DocumentJoiner())

pipeline.connect("pubmed.documents", "joiner.documents")
pipeline.connect("arxiv.documents", "joiner.documents")

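result = pipeline.run({
    "pubmed": {"query": "protein folding"},
    "arxiv": {"query": "protein folding"},
})

# The joined documents are returned under the joiner component's name
for doc in result["joiner"]["documents"]:
    print(f"[{doc.meta['source']}] {doc.meta['title']}")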

Examples

See the examples/ directory for complete working examples:

  • basic_retrieval.py - Simple standalone usage
  • rag_pipeline.py - Full RAG pipeline with LLM
  • combined_search.py - Multi-source search patterns

API Reference

Built-Simple APIs

This package uses Built-Simple's hosted research APIs:

  • PubMed API: https://pubmed.built-simple.ai
    • POST /hybrid-search - Hybrid semantic + keyword search
    • GET /article/{pmid}/full_text - Fetch full text
  • ArXiv API: https://arxiv.built-simple.ai
    • GET /api/search?q=...&limit=N - Search papers
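
For calling the GET endpoints directly, the URLs can be assembled with the standard library. This is a rough sketch that only builds the URLs and sends nothing; `arxiv_search_url` and `pubmed_full_text_url` are hypothetical helpers and the PMID is a placeholder. The POST /hybrid-search endpoint is left out because its request body format is not documented here.

```python
from urllib.parse import quote, urlencode

PUBMED_API = "https://pubmed.built-simple.ai"
ARXIV_API = "https://arxiv.built-simple.ai"


def arxiv_search_url(query: str, limit: int = 10) -> str:
    """URL for GET /api/search?q=...&limit=N (hypothetical helper)."""
    # urlencode percent-encodes the values and joins them with "&".
    params = urlencode({"q": query, "limit": limit})
    return f"{ARXIV_API}/api/search?{params}"


def pubmed_full_text_url(pmid: str) -> str:
    """URL for GET /article/{pmid}/full_text (hypothetical helper)."""
    # quote() with safe="" percent-encodes anything unsafe in a path segment.
    return f"{PUBMED_API}/article/{quote(str(pmid), safe='')}/full_text"


print(arxiv_search_url("protein folding", limit=5))
print(pubmed_full_text_url("12345678"))  # placeholder PMID
```

The resulting strings can be passed to any HTTP client; the response formats are not covered by this sketch.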

No API key required. Rate limits apply for heavy usage.
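
Because rate limits apply, batch jobs may want to retry failed calls with increasing delays. The sketch below shows one common pattern, exponential backoff, under loose assumptions: `with_backoff` is a hypothetical helper, and since the error raised by a rate-limited request is not documented here, it retries on any exception. The demo uses a stand-in function instead of a real retriever.

```python
import time


def with_backoff(call, retries=3, base_delay=1.0, sleep=time.sleep):
    """Call `call()` and retry on failure with exponential backoff.

    Hypothetical helper: which exception a rate-limited request raises is
    not documented here, so this sketch retries on any Exception.
    """
    for attempt in range(retries + 1):
        try:
            return call()
        except Exception:
            if attempt == retries:
                raise  # out of retries: surface the last error
            sleep(base_delay * 2 ** attempt)  # waits grow: 1x, 2x, 4x base_delay


# Demo with a stand-in function that fails twice before succeeding.
attempts = []
delays = []


def flaky_search():
    attempts.append(1)
    if len(attempts) < 3:
        raise RuntimeError("simulated rate limit")
    return ["doc"]


# Record the delays instead of actually sleeping.
result = with_backoff(flaky_search, sleep=delays.append)
print(result, delays)
```

With a real retriever, the wrapped call would be something like `lambda: pubmed.run(query="...")`, and narrowing the `except` clause to the specific error type would avoid retrying on unrelated failures.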

Contributing

Contributions welcome! Please:

  1. Fork the repository
  2. Create a feature branch
  3. Add tests for new functionality
  4. Submit a pull request

License

MIT License - see LICENSE for details.
