Skip to main content

LlamaIndex readers for Built-Simple research APIs (PubMed, ArXiv, Wikipedia)

Project description

llama-index-readers-builtsimple

LlamaIndex readers for Built-Simple research APIs, providing semantic search over scientific literature.

PyPI version License: MIT

Features

  • PubMed Reader - 4.5M+ biomedical articles with hybrid semantic/keyword search
  • ArXiv Reader - 2.7M+ preprints in physics, math, CS, and ML
  • Wikipedia Reader - Semantic search over Wikipedia articles
  • No API key required - Free tier available for all endpoints
  • Rich metadata - Full citation info for all documents

What Data is Included

PubMed Reader

Each document contains:

  • Text: Title + abstract (default) OR full article text (with include_full_text=True)
  • Metadata:
    • pmid - PubMed ID (e.g., "31041627")
    • title - Full article title
    • journal - Publication journal name
    • year - Publication year
    • doi - DOI identifier
    • doi_url - Direct DOI link
    • url - Link to PubMed page
    • has_full_text - Boolean indicating if full text was fetched
    • full_text_length - Character count of full text (when available)

🔥 FULL TEXT AVAILABLE! Unlike most research APIs that only provide abstracts, Built-Simple has full article text for millions of papers:

# Get full article text (15K-70K chars per article)
reader = BuiltSimplePubMedReader(include_full_text=True)
docs = reader.load_data("cancer immunotherapy", limit=5)

for doc in docs:
    print(f"Full text length: {len(doc.text)} chars")  # ~15,000-70,000 chars!

ArXiv Reader

Each document contains:

  • Text: Title + authors + full abstract
  • Metadata:
    • arxiv_id - ArXiv identifier (e.g., "2301.12345" or "cs/0308031")
    • title - Paper title
    • authors - Author names
    • year - Publication year
    • url - Link to ArXiv abstract page
    • pdf_url - Direct PDF download link
    • similarity_score - Semantic relevance score (0-1)

Note: Full paper PDFs are NOT downloaded—only abstracts. Use pdf_url to fetch the full PDF if needed.

Wikipedia Reader

Each document contains:

  • Text: Article title + summary/intro section
  • Metadata:
    • title - Article title
    • url - Link to Wikipedia page

Note: Only article summaries, not full articles.

Installation

pip install llama-index-readers-builtsimple

Quick Start

Basic Usage

from llama_index.readers.builtsimple import (
    BuiltSimplePubMedReader,
    BuiltSimpleArxivReader,
)

# Search PubMed for medical literature
pubmed_reader = BuiltSimplePubMedReader()
pubmed_docs = pubmed_reader.load_data("CRISPR gene therapy", limit=10)

for doc in pubmed_docs:
    print(f"Title: {doc.metadata['title']}")
    print(f"Journal: {doc.metadata['journal']}")
    print(f"Year: {doc.metadata['pub_year']}")
    print(f"URL: {doc.metadata['url']}\n")

# Search ArXiv for ML papers
arxiv_reader = BuiltSimpleArxivReader()
arxiv_docs = arxiv_reader.load_data("transformer architecture attention", limit=10)

for doc in arxiv_docs:
    print(f"Title: {doc.metadata['title']}")
    print(f"Authors: {doc.metadata['authors']}")
    print(f"ArXiv ID: {doc.metadata['arxiv_id']}\n")

Build a RAG Index

from llama_index.core import VectorStoreIndex
from llama_index.readers.builtsimple import BuiltSimplePubMedReader

# Load documents
reader = BuiltSimplePubMedReader()
documents = reader.load_data("immunotherapy cancer treatment", limit=20)

# Build index
index = VectorStoreIndex.from_documents(documents)

# Query
query_engine = index.as_query_engine()
response = query_engine.query("What are the side effects of CAR-T therapy?")
print(response)

Combine Multiple Sources

from llama_index.core import VectorStoreIndex
from llama_index.readers.builtsimple import (
    BuiltSimplePubMedReader,
    BuiltSimpleArxivReader,
)

# Load from multiple sources
pubmed = BuiltSimplePubMedReader()
arxiv = BuiltSimpleArxivReader()

# Combine documents
documents = []
documents.extend(pubmed.load_data("drug discovery machine learning", limit=10))
documents.extend(arxiv.load_data("drug discovery deep learning", limit=10))

# Build unified index
index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine()

response = query_engine.query(
    "How is machine learning being used for drug discovery?"
)
print(response)

API Reference

BuiltSimplePubMedReader

BuiltSimplePubMedReader(
    api_key: Optional[str] = None,  # Optional for higher rate limits
    timeout: int = 30,
)

def load_data(
    query: str,
    limit: int = 10,
) -> List[Document]

Document Metadata:

  • source: "builtsimple-pubmed"
  • pmid: PubMed ID
  • title: Paper title
  • journal: Journal name
  • pub_year: Publication year
  • doi: DOI identifier
  • url: Link to PubMed

BuiltSimpleArxivReader

BuiltSimpleArxivReader(
    api_key: Optional[str] = None,
    timeout: int = 30,
)

def load_data(
    query: str,
    limit: int = 10,
) -> List[Document]

Document Metadata:

  • source: "builtsimple-arxiv"
  • arxiv_id: ArXiv identifier (e.g., "2301.12345")
  • title: Paper title
  • authors: Author list
  • year: Publication year
  • url: Link to ArXiv

BuiltSimpleWikipediaReader

BuiltSimpleWikipediaReader(
    api_key: Optional[str] = None,
    timeout: int = 30,
)

def load_data(
    query: str,
    limit: int = 10,
) -> List[Document]

Document Metadata:

  • source: "builtsimple-wikipedia"
  • title: Article title
  • url: Link to Wikipedia

Rate Limits

Tier Rate Limit Notes
Free 10 req/min No API key needed
Pro 100 req/min Requires API key

Get an API key at pubmed.built-simple.ai or arxiv.built-simple.ai.

Why Built-Simple?

Unlike scraping or official APIs:

  • Pre-indexed vectors - No embedding costs, instant semantic search
  • Hybrid search - Combines BM25 + vector similarity
  • Always available - No rate limit hell from upstream providers
  • Structured data - Clean JSON responses with full metadata

Contributing

This package is part of the LlamaIndex ecosystem. To contribute:

  1. Fork the repo
  2. Create a feature branch
  3. Submit a PR to run-llama/llama_index

License

MIT License - see LICENSE for details.

Links

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

llama_index_readers_builtsimple-0.1.0.tar.gz (12.4 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

llama_index_readers_builtsimple-0.1.0-py3-none-any.whl (16.2 kB view details)

Uploaded Python 3

File details

Details for the file llama_index_readers_builtsimple-0.1.0.tar.gz.

File metadata

File hashes

Hashes for llama_index_readers_builtsimple-0.1.0.tar.gz
Algorithm Hash digest
SHA256 01e2e66076c3f3925d92a712d56f2899e214d465f8ddb86e276fe4c20d843e6e
MD5 3044814f3184c6aa6a3bbfc4f64ab3d2
BLAKE2b-256 b56d97eec0de38e83d7af72abb0f7e28d4baeaeadf2ac09a674ab301768c23da

See more details on using hashes here.

File details

Details for the file llama_index_readers_builtsimple-0.1.0-py3-none-any.whl.

File metadata

File hashes

Hashes for llama_index_readers_builtsimple-0.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 d84406aec57f519dca41766bca57fb4990b6910a795f364f68662fd4bf96b1af
MD5 f9b89728c133b549990096a07a95bb99
BLAKE2b-256 d7f5cf4a868e689e05041bd71415400a3fdf88e65961b3ce58d2c6f9381a5ef8

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page