Ragrails
Ragrails is a modular RAG SDK for ingesting web pages, local documents, and REST API responses, converting them into clean markdown, chunking them for retrieval, and storing embeddings in pluggable vector databases.
It is built for retrieval-augmented generation workflows that need source ingestion, markdown normalization, chunking, semantic search, vector storage, and evaluation as separate stages.
The public SDK starts with one class:
from ragrails import RagRails
rag = RagRails()
Current SDK
The ingestion, chunking, and vector storage SDK surfaces are available now.
rag.scrape(...) # web pages and websites
rag.parse(...) # local files and folders
rag.fetch(...) # REST API responses
rag.chunk(...) # markdown files to RAG chunks
rag.store(...) # chunk JSON files to a vector DB
Installation
Ragrails requires Python 3.13 or newer.
Check your system Python:
python3 --version
If it prints an older version, such as Python 3.9.6, create a Python 3.13
virtual environment for your project. A virtual environment keeps Ragrails and
its dependencies inside your project instead of installing them globally.
Recommended setup with uv:
uv venv --python 3.13 .venv
source .venv/bin/activate
uv pip install ragrails
After activation, check the environment Python:
python --version
It should print Python 3.13.x.
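The same check can be made from inside a script before importing the SDK. This is a minimal sketch; the helper name `meets_minimum` is illustrative and not part of Ragrails.

```python
import sys

# Minimum interpreter version stated in this README.
MINIMUM = (3, 13)


def meets_minimum(version_info, minimum=MINIMUM):
    """Return True when version_info is at least the minimum (major, minor)."""
    return tuple(version_info[:2]) >= minimum


# Report on the interpreter that is running this script.
current = ".".join(str(part) for part in sys.version_info[:3])
status = "OK" if meets_minimum(sys.version_info) else "too old for Ragrails"
print(f"Python {current}: {status}")
```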
Verify the install:
from ragrails import RagRails
print(RagRails)
If you already manage Python environments yourself, you can install directly:
pip install ragrails
Install extras for the stage or provider you need.
| Stage | Install |
|---|---|
| Document + API ingestion | included with pip install ragrails |
| URL ingestion | uv pip install "ragrails[url]" |
| Chunking | uv pip install "ragrails[chunk]" |
| Store in Qdrant | uv pip install "ragrails[store-qdrant]" |
| Store in Pinecone | uv pip install "ragrails[store-pinecone]" |
| Store in Weaviate | uv pip install "ragrails[store-weaviate]" |
Provider extras are also available separately:
| Provider | Install |
|---|---|
| Voyage embeddings | uv pip install "ragrails[voyage]" |
| Qdrant | uv pip install "ragrails[qdrant]" |
| Pinecone | uv pip install "ragrails[pinecone]" |
| Weaviate | uv pip install "ragrails[weaviate]" |
| OpenAI | uv pip install "ragrails[openai]" |
| Anthropic | uv pip install "ragrails[anthropic]" |
| Reranking | uv pip install "ragrails[rerank]" |
| Everything | uv pip install "ragrails[all]" |
Common workflow installs:
| Workflow | Install |
|---|---|
| Scrape URLs, chunk, store in Qdrant | uv pip install "ragrails[url,chunk,voyage,qdrant]" |
| Parse documents, chunk, store in Pinecone | uv pip install "ragrails[chunk,voyage,pinecone]" |
| Fetch APIs, chunk, store in Weaviate | uv pip install "ragrails[chunk,voyage,weaviate]" |
| Qdrant storage shortcut | uv pip install "ragrails[store-qdrant]" |
| Pinecone storage shortcut | uv pip install "ragrails[store-pinecone]" |
| Weaviate storage shortcut | uv pip install "ragrails[store-weaviate]" |
URL ingestion uses crawl4ai, which pulls in Playwright as a package dependency. You may
still need to install the browser binaries:
playwright install
Requirements
Later RAG stages use provider API keys:
export VOYAGE_API_KEY="..."
export OPENAI_API_KEY="..."
export ANTHROPIC_API_KEY="..."
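Before running a stage that calls a provider, it can help to confirm the keys are actually set. The sketch below is an assumption-laden convenience, not a Ragrails API: `missing_keys` is a hypothetical helper, and which keys a given stage needs is not specified here, so pass only the names for the providers you use.

```python
import os

# Environment variable names taken from the exports above.
PROVIDER_KEYS = ("VOYAGE_API_KEY", "OPENAI_API_KEY", "ANTHROPIC_API_KEY")


def missing_keys(required=PROVIDER_KEYS, env=None):
    """Return the names in `required` that are unset or blank in `env`."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name, "").strip()]


# Check the real environment; an empty list means every key is present.
print("Missing keys:", missing_keys())
```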
Embedding and retrieval also need a vector database. Ragrails currently ships with Qdrant, Pinecone, and Weaviate adapters behind the same vector store interface.
Vector DB Providers
Choose a vector DB provider before storing, retrieving, chatting, or running evals.
Provider names:
qdrant
pinecone
weaviate
Qdrant
Qdrant is the easiest local option while developing.
docker run -p 6333:6333 qdrant/qdrant
export VECTOR_DB_PROVIDER=qdrant
export VECTOR_DB_URL=http://localhost:6333
export VECTOR_DB_COLLECTION=rag_chunks
Pinecone
Pinecone is the managed vector DB option. Ragrails uses its own embedding model and stores dense vectors in a Pinecone serverless index.
export PINECONE_API_KEY="..."
export VECTOR_DB_PROVIDER=pinecone
export VECTOR_DB_COLLECTION=rag-chunks
Optional Pinecone settings:
export PINECONE_CLOUD=aws
export PINECONE_REGION=us-east-1
export PINECONE_NAMESPACE=
For Pinecone, VECTOR_DB_COLLECTION maps to the Pinecone index name. Use
lowercase letters, digits, and hyphens only, for example rag-chunks.
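The naming rule above can be checked before calling the SDK. This sketch encodes only the rule stated in this README; Pinecone itself may enforce further limits that are not checked here, and both function names are hypothetical helpers rather than Ragrails APIs.

```python
import re

# Rule as stated in this README: lowercase letters, digits, and hyphens only.
_INDEX_NAME = re.compile(r"[a-z0-9-]+")


def is_valid_index_name(name):
    """Return True when the whole name uses only allowed characters."""
    return _INDEX_NAME.fullmatch(name) is not None


def suggest_index_name(name):
    """Lowercase the name and replace every disallowed character with a hyphen."""
    return re.sub(r"[^a-z0-9-]", "-", name.lower())


for candidate in ("rag-chunks", "rag_chunks"):
    print(candidate, is_valid_index_name(candidate), suggest_index_name(candidate))
```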
Weaviate
Weaviate is another managed or self-hosted vector DB option. Ragrails uses its own embedding model and stores dense vectors in a Weaviate collection configured for self-provided vectors.
For local Weaviate, expose both HTTP and gRPC ports:
docker run -p 8080:8080 -p 50051:50051 cr.weaviate.io/semitechnologies/weaviate:1.36.9
export VECTOR_DB_PROVIDER=weaviate
export VECTOR_DB_URL=http://localhost:8080
export VECTOR_DB_COLLECTION=RagChunks
For Weaviate Cloud:
export WEAVIATE_API_KEY="..."
export VECTOR_DB_PROVIDER=weaviate
export VECTOR_DB_URL="https://your-cluster.weaviate.cloud"
export VECTOR_DB_COLLECTION=RagChunks
For Weaviate, VECTOR_DB_COLLECTION maps to the collection name. Use a name
that starts with an uppercase letter, for example RagChunks.
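If you drive the SDK from your own scripts, the environment variables above can be read and sanity-checked in one place. This is a sketch under stated assumptions: `VectorDBSettings` and `load_settings` are hypothetical names, and the only validation applied is what this README states (the three provider names and the Weaviate uppercase rule).

```python
import os
from dataclasses import dataclass
from typing import Optional

# Provider names listed in this README.
SUPPORTED_PROVIDERS = ("qdrant", "pinecone", "weaviate")


@dataclass
class VectorDBSettings:
    provider: str
    collection: Optional[str]
    url: Optional[str]


def load_settings(env=None):
    """Read the VECTOR_DB_* variables and reject obviously invalid values."""
    env = os.environ if env is None else env
    provider = env.get("VECTOR_DB_PROVIDER", "").strip().lower()
    if provider not in SUPPORTED_PROVIDERS:
        raise ValueError(
            f"VECTOR_DB_PROVIDER must be one of {SUPPORTED_PROVIDERS}, got {provider!r}"
        )
    collection = env.get("VECTOR_DB_COLLECTION") or None
    if provider == "weaviate" and collection and not collection[0].isupper():
        raise ValueError("Weaviate collection names should start with an uppercase letter")
    return VectorDBSettings(provider, collection, env.get("VECTOR_DB_URL") or None)


# Example with an explicit mapping instead of the real environment.
print(load_settings({
    "VECTOR_DB_PROVIDER": "weaviate",
    "VECTOR_DB_URL": "http://localhost:8080",
    "VECTOR_DB_COLLECTION": "RagChunks",
}))
```

The resulting values line up with the `vector_db`, `collection`, and `url` arguments of `store(...)`.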
After choosing a provider, store chunks with the SDK:
from ragrails import RagRails
result = RagRails().store(
    input_dir="files/output/chunks",
    vector_db="qdrant",
    collection="rag_chunks",
)
print(result.files)
print(result.chunks)
print(result.provider)
print(result.collection)
The lower-level stage runner also reads the same environment variables:
uv run python -m ragrails.pipeline.stg_03_embedder
uv run python -m ragrails.pipeline.stg_04_retriever "your query"
URL Ingestion
Scrape one exact page:
from ragrails import RagRails
result = RagRails().scrape(
    url="https://example.com/about",
    mode="each",
    output_dir="files/output/web_crawled",
)
print(result.pages)
print(result.files)
print(result.errors)
Crawl a website:
result = RagRails().scrape(
    url="https://example.com",
    mode="full",
    output_dir="files/output/web_crawled",
    max_depth=3,
    max_pages=200,
)
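To build intuition for how `max_depth` and `max_pages` bound a crawl, here is a small breadth-first sketch over an in-memory link graph. It is not Ragrails's crawler: the function `plan_crawl`, the traversal order, and the convention that the start URL sits at depth 0 are all assumptions made for illustration.

```python
from collections import deque


def plan_crawl(start, links, max_depth=3, max_pages=200):
    """Return pages in breadth-first order, honoring depth and page limits.

    `links` maps each URL to the URLs it links to (a stand-in for real fetching).
    """
    seen = {start}
    order = []
    queue = deque([(start, 0)])
    while queue and len(order) < max_pages:
        url, depth = queue.popleft()
        order.append(url)
        if depth == max_depth:
            continue  # do not follow links beyond the depth limit
        for target in links.get(url, []):
            if target not in seen:
                seen.add(target)
                queue.append((target, depth + 1))
    return order


# Hypothetical site structure used only for this demonstration.
site = {"/": ["/a", "/b"], "/a": ["/a/1"], "/b": ["/"], "/a/1": ["/a/1/x"]}
print(plan_crawl("/", site, max_depth=1))
print(plan_crawl("/", site, max_pages=2))
```

Under these assumptions a lower `max_depth` trims pages far from the start URL, while `max_pages` caps the total regardless of depth.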
Document Ingestion
Parse a folder of local documents into markdown:
from ragrails import RagRails
result = RagRails().parse(
    folder="files/input",
    output_dir="files/output/docs",
)
print(result.documents)
print(result.files)
print(result.errors)
Parse selected files with custom metadata:
result = RagRails().parse(
    files=[
        {
            "filename": "guide.pdf",
            "title": "Product Guide",
            "description": "Internal product guide.",
        }
    ],
    input_dir="files/input",
    output_dir="files/output/docs",
)
Supported discovery extensions for folders:
.csv, .docx, .epub, .html, .htm, .ipynb, .json, .md, .msg,
.pdf, .pptx, .rss, .tsv, .txt, .xls, .xlsx, .xml, .zip
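To preview which files in a folder would match the extensions above, a short standalone listing can help. This sketch does not call Ragrails; `discover` is a hypothetical helper, and whether the SDK recurses into subfolders or matches extensions case-insensitively is an assumption here.

```python
from pathlib import Path

# Extensions copied from the list above.
SUPPORTED_EXTENSIONS = {
    ".csv", ".docx", ".epub", ".html", ".htm", ".ipynb", ".json", ".md", ".msg",
    ".pdf", ".pptx", ".rss", ".tsv", ".txt", ".xls", ".xlsx", ".xml", ".zip",
}


def discover(folder):
    """Return supported files under `folder`, sorted, ignoring extension case."""
    root = Path(folder)
    return sorted(
        path for path in root.rglob("*")
        if path.is_file() and path.suffix.lower() in SUPPORTED_EXTENSIONS
    )
```

Calling `discover("files/input")` returns a list of `Path` objects you can inspect before running `parse(...)`.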
API Ingestion
Fetch a REST API response into markdown:
from ragrails import RagRails
result = RagRails().fetch(
    url="https://api.example.com/v1/products",
    title="Products",
    description="Product catalog from the API.",
    output_dir="files/output/api",
)
print(result.pages)
print(result.items)
print(result.files)
print(result.errors)
Pass request options when needed:
result = RagRails().fetch(
    url="https://api.example.com/v1/search",
    method="POST",
    headers={"Authorization": "Bearer <token>"},
    body={"query": "payments"},
    title="Search Results",
)
Chunking
Split markdown files into RAG-ready JSON chunks:
from ragrails import RagRails
result = RagRails().chunk(
    input_dir="files/output/web_crawled",
    output_dir="files/output/chunks",
)
print(result.files)
print(result.chunks)
print(result.output_files)
print(result.errors)
Chunk markdown created by any ingestion method:
result = RagRails().chunk(
    input_dir="files/output/docs",
    output_dir="files/output/chunks/docs",
    chunk_size=1200,
    chunk_overlap=150,
)
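To see how `chunk_size` and `chunk_overlap` interact, consider a plain sliding window over characters. This is only an illustration: the README does not say whether Ragrails measures size in characters or tokens, nor how it respects markdown structure, so `sliding_chunks` is a hypothetical stand-in and not the SDK's chunker.

```python
def sliding_chunks(text, chunk_size=1200, chunk_overlap=150):
    """Split text into windows of chunk_size characters that overlap by chunk_overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


sample = "x" * 3000
print([len(chunk) for chunk in sliding_chunks(sample)])
```

Each window starts `chunk_size - chunk_overlap` characters after the previous one, so a larger overlap produces more chunks for the same input.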
Preview one markdown file in memory:
chunks = RagRails().chunk_file(
    "files/output/docs/guide.md",
)
print(len(chunks))
print(chunks[0]["metadata"])
Store
Embed every chunk JSON file in a folder and store the vectors:
from ragrails import RagRails
result = RagRails().store(
    input_dir="files/output/chunks/docs",
    vector_db="qdrant",
    collection="rag_chunks",
)
print(result.files)
print(result.chunks)
print(result.errors)
Store in Pinecone:
result = RagRails().store(
    input_dir="files/output/chunks/docs",
    vector_db="pinecone",
    collection="rag-chunks",
)
Store in Weaviate:
result = RagRails().store(
    input_dir="files/output/chunks/docs",
    vector_db="weaviate",
    url="http://localhost:8080",
    collection="RagChunks",
)
Store selected chunk files from a folder:
result = RagRails().store(
    input_dir="files/output/chunks/docs",
    files=["001_overview.json", "002_auth.json"],
    vector_db="qdrant",
    collection="rag_chunks",
)
Provider-specific failures to check first:
| Provider | Check first |
|---|---|
| qdrant | Qdrant is not running, or port 6333 is not exposed. |
| pinecone | PINECONE_API_KEY is missing, or the index name uses underscores. |
| weaviate | Weaviate is not running, gRPC port 50051 is not exposed, or the collection name is invalid. |
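The port-related checks above can be automated for local setups with a plain TCP probe. This sketch uses only the standard library; `port_is_open` is a hypothetical helper, and the ports are the local defaults from the Docker commands in this README. An open port confirms only that something is listening, not that the service is healthy.

```python
import socket


def port_is_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within timeout seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Default local ports used in the Docker commands in this README.
LOCAL_PORTS = {"qdrant": [6333], "weaviate": [8080, 50051]}

for provider, ports in LOCAL_PORTS.items():
    for port in ports:
        state = "open" if port_is_open("localhost", port) else "closed"
        print(f"{provider} port {port}: {state}")
```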
Output
Ragrails writes markdown files, and chunk JSON files from the chunking stage, to the output directories you choose:
files/output/web_crawled/
files/output/docs/
files/output/api/
files/output/chunks/
By default, files include Ragrails frontmatter metadata. Disable it with
frontmatter=False when you only want the markdown body.
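If files were already written with frontmatter, the body can be separated afterwards. The sketch below assumes the frontmatter is a block delimited by `---` lines at the top of the file, which is a common convention but is not confirmed by this README; `split_frontmatter` is a hypothetical helper.

```python
def split_frontmatter(markdown):
    """Return (frontmatter, body); frontmatter is "" when the text has none."""
    lines = markdown.splitlines(keepends=True)
    if not lines or lines[0].strip() != "---":
        return "", markdown
    for index in range(1, len(lines)):
        if lines[index].strip() == "---":
            # Everything between the delimiters is metadata, the rest is body.
            return "".join(lines[1:index]), "".join(lines[index + 1:])
    return "", markdown  # no closing delimiter, so treat everything as body


meta, body = split_frontmatter("---\ntitle: Guide\n---\n# Guide\nBody text\n")
print(repr(meta))
print(repr(body))
```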
Result Types
ScrapeResult(
    pages=int,
    failed=int,
    output_dir=str,
    files=list[str],
    dlq_path=str,
    errors=list[str],
)

ParseResult(
    documents=int,
    failed=int,
    output_dir=str,
    files=list[str],
    errors=list[str],
)

ApiIngestResult(
    pages=int,
    items=int,
    failed=int,
    output_dir=str,
    files=list[str],
    errors=list[str],
)

ChunkResult(
    files=int,
    chunks=int,
    output_dir=str,
    output_files=list[str],
    failed=int,
    errors=list[str],
)

StoreResult(
    files=int,
    chunks=int,
    input_dir=str,
    provider=str,
    collection=str,
    errors=list[str],
)
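Every result type above carries an `errors` list, and most carry a `failed` count, so one helper can summarize any stage. This is a sketch: `ExampleResult` is a hypothetical stand-in used so the snippet runs without the SDK, and `report` is not a Ragrails function. It falls back to the number of errors when `failed` is absent, as with `StoreResult`.

```python
from dataclasses import dataclass, field


@dataclass
class ExampleResult:
    """Hypothetical stand-in with fields shared by the result types above."""
    failed: int = 0
    errors: list = field(default_factory=list)


def report(result, stage):
    """Return a one-line status for a stage result; tolerate a missing `failed` field."""
    failed = getattr(result, "failed", len(result.errors))
    if not failed and not result.errors:
        return f"{stage}: ok"
    return f"{stage}: {failed} failed, {len(result.errors)} error message(s)"


print(report(ExampleResult(), "parse"))
print(report(ExampleResult(failed=1, errors=["guide.pdf: unreadable"]), "parse"))
```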
API Reference
RagRails().scrape(
    url,
    *,
    mode="each",
    output_dir="files/output/web_crawled",
    frontmatter=True,
    dlq_path="files/output/dlq.json",
    max_depth=3,
    max_pages=200,
)

RagRails().parse(
    files=None,
    *,
    folder=None,
    input_dir="files/input",
    output_dir="files/output/docs",
    frontmatter=True,
)

RagRails().fetch(
    url,
    *,
    title="API Response",
    description="",
    method="GET",
    headers=None,
    params=None,
    body=None,
    pagination=None,
    max_pages=100,
    output_dir="files/output/api",
    frontmatter=True,
)

RagRails().chunk(
    *,
    input_dir="files/output/web_crawled",
    output_dir="files/output/chunks",
    chunk_size=2000,
    chunk_overlap=200,
    min_chunk_length=100,
)

RagRails().chunk_file(
    path,
    *,
    chunk_size=2000,
    chunk_overlap=200,
    min_chunk_length=100,
)

RagRails().store(
    *,
    input_dir="files/output/chunks",
    vector_db="qdrant",
    collection=None,
    url=None,
    files=None,
    batch_size=64,
    embedder="voyage",
    model="voyage-3",
)
Supported vector_db values:
qdrant
pinecone
weaviate
Status
Ingestion, chunking, and vector storage are available through the public SDK. Retrieval, chat, and eval already exist internally and will be exposed next.