LlamaIndex readers for Built-Simple research APIs (PubMed, ArXiv, Wikipedia)
llama-index-readers-builtsimple
LlamaIndex readers for Built-Simple research APIs, providing semantic search over scientific literature.
Features
- PubMed Reader - 4.5M+ biomedical articles with hybrid semantic/keyword search
- ArXiv Reader - 2.7M+ preprints in physics, math, CS, and ML
- Wikipedia Reader - Semantic search over Wikipedia articles
- No API key required - Free tier available for all endpoints
- Rich metadata - Full citation info for all documents
What Data is Included
PubMed Reader
Each document contains:
- Text: Title + abstract (default) OR full article text (with include_full_text=True)
- Metadata:
  - pmid - PubMed ID (e.g., "31041627")
  - title - Full article title
  - journal - Publication journal name
  - year - Publication year
  - doi - DOI identifier
  - doi_url - Direct DOI link
  - url - Link to PubMed page
  - has_full_text - Boolean indicating if full text was fetched
  - full_text_length - Character count of full text (when available)
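As a minimal sketch of how these metadata fields might be used, the helper below builds a one-line citation from a metadata dict. The function name `format_citation` and the sample values are illustrative and not part of the package. It reads the year from either `year` or `pub_year`, since both key names appear in this README.

```python
from typing import Any, Mapping


def format_citation(meta: Mapping[str, Any]) -> str:
    """Build a one-line citation from a PubMed document's metadata dict.

    Hypothetical helper: not provided by llama-index-readers-builtsimple.
    """
    # Read the year defensively: accept either "year" or "pub_year".
    year = meta.get("year") or meta.get("pub_year") or "n.d."
    title = str(meta.get("title") or "Untitled").rstrip(".")
    journal = meta.get("journal") or "Unknown journal"
    parts = [f"{title}. {journal} ({year})."]
    if meta.get("pmid"):
        parts.append(f"PMID: {meta['pmid']}.")
    # Prefer the DOI link when present, otherwise fall back to the PubMed page.
    link = meta.get("doi_url") or meta.get("url")
    if link:
        parts.append(str(link))
    return " ".join(parts)


# Placeholder metadata shaped like the field list above (values are made up).
example = {
    "pmid": "31041627",
    "title": "An example article title",
    "journal": "Example Journal",
    "year": 2019,
    "url": "https://example.org/31041627",
}
print(format_citation(example))
```

With a loaded document, the call would be `format_citation(doc.metadata)`.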
🔥 FULL TEXT AVAILABLE! Unlike most research APIs that only provide abstracts, Built-Simple has full article text for millions of papers:
from llama_index.readers.builtsimple import BuiltSimplePubMedReader
# Get full article text (15K-70K chars per article)
reader = BuiltSimplePubMedReader(include_full_text=True)
docs = reader.load_data("cancer immunotherapy", limit=5)
for doc in docs:
    print(f"Full text length: {len(doc.text)} chars")  # ~15,000-70,000 chars!
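The `has_full_text` flag suggests that some results may come back with only title and abstract even when full text is requested. That reading is an assumption, not documented behavior. A minimal sketch of separating the two cases follows. `split_by_full_text` is a hypothetical helper, and the stand-in objects only mimic the `.text` and `.metadata` attributes of loaded documents so the snippet runs without network access.

```python
from types import SimpleNamespace
from typing import Any, Iterable, List, Tuple


def split_by_full_text(docs: Iterable[Any]) -> Tuple[List[Any], List[Any]]:
    """Separate documents whose full text was fetched from abstract-only ones."""
    full, abstract_only = [], []
    for doc in docs:
        # has_full_text is listed as a boolean; treat a missing key as False.
        if doc.metadata.get("has_full_text", False):
            full.append(doc)
        else:
            abstract_only.append(doc)
    return full, abstract_only


# Stand-ins with the same .text/.metadata shape as loaded documents.
docs = [
    SimpleNamespace(text="x" * 20000, metadata={"pmid": "1", "has_full_text": True}),
    SimpleNamespace(text="Title and abstract", metadata={"pmid": "2", "has_full_text": False}),
]
full, abstract_only = split_by_full_text(docs)
print(len(full), len(abstract_only))
```

Long full-text documents can then be chunked before indexing, while abstract-only ones are usually short enough to index as-is.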
ArXiv Reader
Each document contains:
- Text: Title + authors + full abstract
- Metadata:
  - arxiv_id - ArXiv identifier (e.g., "2301.12345" or "cs/0308031")
  - title - Paper title
  - authors - Author names
  - year - Publication year
  - url - Link to ArXiv abstract page
  - pdf_url - Direct PDF download link
  - similarity_score - Semantic relevance score (0-1)
Note: Full paper PDFs are NOT downloaded; only abstracts are returned. Use pdf_url to fetch the full PDF if needed.
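Fetching the PDF yourself could look like the sketch below, which uses only the standard library. `pdf_filename` and `download_pdf` are hypothetical helpers, not part of this package. Old-style IDs such as "cs/0308031" contain a slash, so the ID is sanitized before being used as a file name.

```python
import re
import urllib.request
from pathlib import Path


def pdf_filename(arxiv_id: str) -> str:
    """Turn an ArXiv ID into a filesystem-safe file name."""
    # Replace anything other than letters, digits, dots, and hyphens.
    return re.sub(r"[^A-Za-z0-9.\-]", "_", arxiv_id) + ".pdf"


def download_pdf(metadata: dict, out_dir: str = "pdfs", timeout: int = 30) -> Path:
    """Download the PDF referenced by a document's pdf_url metadata field."""
    target = Path(out_dir) / pdf_filename(metadata["arxiv_id"])
    target.parent.mkdir(parents=True, exist_ok=True)
    with urllib.request.urlopen(metadata["pdf_url"], timeout=timeout) as resp:
        target.write_bytes(resp.read())
    return target


print(pdf_filename("2301.12345"))
print(pdf_filename("cs/0308031"))
```

For a loaded document, `download_pdf(doc.metadata)` would save the file under `pdfs/`. Be considerate of ArXiv's own usage policies when downloading many PDFs.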
Wikipedia Reader
Each document contains:
- Text: Article title + summary/intro section
- Metadata:
  - title - Article title
  - url - Link to Wikipedia page
Note: Only article summaries are returned, not full articles.
Installation
pip install llama-index-readers-builtsimple
Quick Start
Basic Usage
from llama_index.readers.builtsimple import (
    BuiltSimplePubMedReader,
    BuiltSimpleArxivReader,
)
# Search PubMed for medical literature
pubmed_reader = BuiltSimplePubMedReader()
pubmed_docs = pubmed_reader.load_data("CRISPR gene therapy", limit=10)
for doc in pubmed_docs:
    print(f"Title: {doc.metadata['title']}")
    print(f"Journal: {doc.metadata['journal']}")
    print(f"Year: {doc.metadata['pub_year']}")
    print(f"URL: {doc.metadata['url']}\n")
# Search ArXiv for ML papers
arxiv_reader = BuiltSimpleArxivReader()
arxiv_docs = arxiv_reader.load_data("transformer architecture attention", limit=10)
for doc in arxiv_docs:
    print(f"Title: {doc.metadata['title']}")
    print(f"Authors: {doc.metadata['authors']}")
    print(f"ArXiv ID: {doc.metadata['arxiv_id']}\n")
Build a RAG Index
from llama_index.core import VectorStoreIndex
from llama_index.readers.builtsimple import BuiltSimplePubMedReader
# Load documents
reader = BuiltSimplePubMedReader()
documents = reader.load_data("immunotherapy cancer treatment", limit=20)
# Build index
index = VectorStoreIndex.from_documents(documents)
# Query
query_engine = index.as_query_engine()
response = query_engine.query("What are the side effects of CAR-T therapy?")
print(response)
Combine Multiple Sources
from llama_index.core import VectorStoreIndex
from llama_index.readers.builtsimple import (
    BuiltSimplePubMedReader,
    BuiltSimpleArxivReader,
)
# Load from multiple sources
pubmed = BuiltSimplePubMedReader()
arxiv = BuiltSimpleArxivReader()
# Combine documents
documents = []
documents.extend(pubmed.load_data("drug discovery machine learning", limit=10))
documents.extend(arxiv.load_data("drug discovery deep learning", limit=10))
# Build unified index
index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine()
response = query_engine.query(
    "How is machine learning being used for drug discovery?"
)
print(response)
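When results from several readers or queries are combined, the same paper can appear more than once, for example as a preprint and as its published version. One possible way to handle this before indexing is to drop repeats by normalized title. The sketch below assumes that matching titles indicate the same work, which is a heuristic. `dedupe_by_title` is a hypothetical helper, and the stand-in objects only mimic the `.metadata` attribute of loaded documents.

```python
from types import SimpleNamespace
from typing import Any, Iterable, List


def _title_key(title: str) -> str:
    """Normalize a title: lowercase and collapse runs of whitespace."""
    return " ".join(title.lower().split())


def dedupe_by_title(docs: Iterable[Any]) -> List[Any]:
    """Keep the first document seen for each normalized title."""
    seen = set()
    unique = []
    for doc in docs:
        key = _title_key(doc.metadata.get("title", ""))
        # Documents without a title are always kept.
        if key and key in seen:
            continue
        seen.add(key)
        unique.append(doc)
    return unique


# Stand-ins shaped like documents from two different readers.
combined = [
    SimpleNamespace(metadata={"source": "builtsimple-pubmed", "title": "Deep Learning for Drug Discovery"}),
    SimpleNamespace(metadata={"source": "builtsimple-arxiv", "title": "deep learning  for drug discovery"}),
    SimpleNamespace(metadata={"source": "builtsimple-arxiv", "title": "Graph Networks for Molecules"}),
]
unique = dedupe_by_title(combined)
print(len(combined), len(unique))
```

In the example above, `documents = dedupe_by_title(documents)` would run just before `VectorStoreIndex.from_documents(documents)`.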
API Reference
BuiltSimplePubMedReader
BuiltSimplePubMedReader(
    api_key: Optional[str] = None,  # Optional for higher rate limits
    timeout: int = 30,
    include_full_text: bool = False,  # Pass by keyword; True fetches full article text
)
def load_data(
    query: str,
    limit: int = 10,
) -> List[Document]
Document Metadata:
- source: "builtsimple-pubmed"
- pmid: PubMed ID
- title: Paper title
- journal: Journal name
- pub_year: Publication year
- doi: DOI identifier
- url: Link to PubMed
BuiltSimpleArxivReader
BuiltSimpleArxivReader(
    api_key: Optional[str] = None,
    timeout: int = 30,
)
def load_data(
    query: str,
    limit: int = 10,
) -> List[Document]
Document Metadata:
- source: "builtsimple-arxiv"
- arxiv_id: ArXiv identifier (e.g., "2301.12345")
- title: Paper title
- authors: Author list
- year: Publication year
- url: Link to ArXiv
BuiltSimpleWikipediaReader
BuiltSimpleWikipediaReader(
    api_key: Optional[str] = None,
    timeout: int = 30,
)
def load_data(
    query: str,
    limit: int = 10,
) -> List[Document]
Document Metadata:
- source: "builtsimple-wikipedia"
- title: Article title
- url: Link to Wikipedia
Rate Limits
| Tier | Rate Limit | Notes |
|---|---|---|
| Free | 10 req/min | No API key needed |
| Pro | 100 req/min | Requires API key |
Get an API key at pubmed.built-simple.ai or arxiv.built-simple.ai.
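To stay under these limits when running many queries in a loop, a small client-side throttle can help. The sketch below is a sliding-window limiter built on the standard library. `RequestThrottle` is a hypothetical helper, not part of this package, and how the service actually counts requests per minute is an assumption.

```python
import time
from collections import deque
from typing import Callable


class RequestThrottle:
    """Sliding-window limiter: at most max_calls within any `period` seconds."""

    def __init__(
        self,
        max_calls: int = 10,
        period: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self.max_calls = max_calls
        self.period = period
        self._clock = clock
        self._sleep = sleep
        self._stamps = deque()  # timestamps of recent calls, oldest first

    def wait(self) -> float:
        """Block until another request is allowed; return the seconds slept."""
        now = self._clock()
        # Forget calls that have already left the window.
        while self._stamps and now - self._stamps[0] >= self.period:
            self._stamps.popleft()
        slept = 0.0
        if len(self._stamps) >= self.max_calls:
            # Sleep until the oldest call in the window expires.
            slept = self.period - (now - self._stamps[0])
            self._sleep(slept)
            self._stamps.popleft()
        self._stamps.append(self._clock())
        return slept


throttle = RequestThrottle(max_calls=10, period=60.0)  # free-tier values from the table
# Typical use, with `reader` and `queries` defined elsewhere:
# for query in queries:
#     throttle.wait()
#     docs = reader.load_data(query, limit=10)
print(throttle.wait())  # the first call does not need to sleep
```

With an API key on the Pro tier, `max_calls=100` would match the table above.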
Why Built-Simple?
Compared with scraping or calling the official APIs directly:
- Pre-indexed vectors - No embedding costs, instant semantic search
- Hybrid search - Combines BM25 + vector similarity
- Always available - No rate limit hell from upstream providers
- Structured data - Clean JSON responses with full metadata
Contributing
This package is part of the LlamaIndex ecosystem. To contribute:
- Fork the repo
- Create a feature branch
- Submit a PR to run-llama/llama_index
License
MIT License - see LICENSE for details.