A modular RAG SDK for ingesting web, document, and API sources, chunking them, and storing embeddings in pluggable vector databases.
Ragrails
Ragrails is a modular RAG SDK for turning web pages, local documents, and REST API responses into retrieval-ready knowledge bases.
Documentation: https://dev.ragrails.com
It gives you one Python interface for:
- ingesting URLs, documents, and API responses into markdown
- chunking markdown into RAG-ready JSON chunks
- embedding and storing chunks in pluggable vector databases
- building toward retrieval, chat, and evaluation workflows
from ragrails import RagRails
rag = RagRails()
Install
Ragrails requires Python 3.10 or newer. The Python that ships with the macOS developer tools is typically 3.9, which is too old. Install a supported version from python.org or via your package manager before running the install command.
pip install ragrails
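If you are unsure which interpreter `pip` will use, a quick version check before installing can save a failed install. The snippet below is a minimal sketch using only the standard library; `meets_minimum` is a hypothetical helper name, not part of Ragrails.

```python
import sys

MINIMUM = (3, 10)  # minimum version stated in the install notes


def meets_minimum(version_info=sys.version_info, minimum=MINIMUM):
    """Return True when (major, minor) is at least `minimum`."""
    return tuple(version_info[:2]) >= minimum


# Report the interpreter that is running this script.
status = "OK" if meets_minimum() else "too old for Ragrails"
print(f"Python {sys.version_info.major}.{sys.version_info.minor}: {status}")
```

Run it with the same interpreter you plan to install into, for example `python3 check_version.py`.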
Document and API ingestion are included in the base install. Install extras only for heavier stages or providers.
| Need | Install |
|---|---|
| URL ingestion | pip install "ragrails[url]" |
| Chunking | pip install "ragrails[chunk]" |
| REST API server | pip install "ragrails[server]" |
| Store in Qdrant | pip install "ragrails[store-qdrant]" |
| Store in Pinecone | pip install "ragrails[store-pinecone]" |
| Store in Weaviate | pip install "ragrails[store-weaviate]" |
| Everything | pip install "ragrails[all]" |
Provider extras are also available separately:
| Provider | Install |
|---|---|
| Voyage embeddings | pip install "ragrails[voyage]" |
| Qdrant | pip install "ragrails[qdrant]" |
| Pinecone | pip install "ragrails[pinecone]" |
| Weaviate | pip install "ragrails[weaviate]" |
| OpenAI | pip install "ragrails[openai]" |
| Anthropic | pip install "ragrails[anthropic]" |
| Reranking | pip install "ragrails[rerank]" |
Quick Start
URL to Vector DB
pip install "ragrails[url,chunk,voyage,qdrant]"
URL scraping uses Playwright through crawl4ai. Run browser setup once in the
same environment:
from ragrails import RagRails
rag = RagRails()
rag.setup_url()
Then run the pipeline:
from ragrails import RagRails
rag = RagRails()
scraped = rag.scrape(
    url="https://example.com",
    mode="full",
    output_dir="files/output/web_crawled",
)
chunks = rag.chunk(
    input_dir=scraped.output_dir,
    output_dir="files/output/chunks/web",
)
embedded = rag.embed(
    input_dir=chunks.output_dir,
    vector_db="qdrant",
    collection="rag_chunks",
)
print(embedded.chunks)
Documents to Vector DB
pip install "ragrails[chunk,voyage,qdrant]"
from ragrails import RagRails
rag = RagRails()
parsed = rag.parse(
    folder="files/input",
    output_dir="files/output/docs",
)
chunks = rag.chunk(
    input_dir=parsed.output_dir,
    output_dir="files/output/chunks/docs",
)
embedded = rag.embed(
    input_dir=chunks.output_dir,
    vector_db="qdrant",
    collection="rag_chunks",
)
print(embedded.chunks)
API to Markdown
from ragrails import RagRails
result = RagRails().fetch(
    url="https://api.example.com/v1/products",
    title="Products",
    output_dir="files/output/api",
)
print(result.files)
CLI
Ragrails ships with a CLI so you can run ingestion without writing Python.
ragrails setup-url
ragrails scrape https://example.com --mode full
ragrails scrape https://example.com/about https://example.com/pricing
ragrails parse --folder files/input
ragrails parse --files guide.pdf --files pricing.csv --input-dir files/input
ragrails fetch https://api.example.com/v1/products --title "Products"
ragrails fetch https://api.example.com/v1/products \
    --header "Authorization:Bearer <token>" \
    --header "X-Api-Key:my-key"
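The `--header` flag takes `Name:Value` strings, while `fetch()` in the SDK takes a `headers` dict. The sketch below shows one way such strings could map onto that dict, assuming the split happens at the first colon. `parse_header_args` is a hypothetical helper for illustration; the CLI's own parsing may differ.

```python
def parse_header_args(header_args):
    """Turn ["Name:Value", ...] into {"Name": "Value", ...}.

    Splits on the first colon only, so values that contain colons
    (for example URLs) stay intact.
    """
    headers = {}
    for raw in header_args:
        name, sep, value = raw.partition(":")
        if not sep or not name.strip():
            raise ValueError(f"expected 'Name:Value', got {raw!r}")
        headers[name.strip()] = value.strip()
    return headers


# The two headers from the CLI example above.
print(parse_header_args(["Authorization:Bearer <token>", "X-Api-Key:my-key"]))
```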
See the full CLI reference.
REST API
Ragrails also ships an optional REST API server for language-agnostic HTTP usage.
pip install "ragrails[server]"
ragrails-api
curl -X POST http://127.0.0.1:8000/v1/ingest/api \
-H "Content-Type: application/json" \
-d '{"url":"https://api.example.com/v1/products","title":"Products"}'
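The same request can be built from Python with only the standard library. This is a sketch that mirrors the curl call above; `build_ingest_request` and `send` are hypothetical helper names, and decoding the response as JSON is an assumption about the server's reply.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:8000"  # address used in the curl example


def build_ingest_request(api_url, title, base_url=BASE_URL):
    """Build (but do not send) a POST request for /v1/ingest/api."""
    payload = json.dumps({"url": api_url, "title": title}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/v1/ingest/api",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=30):
    """Send the request; assumes the server answers with a JSON body."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With `ragrails-api` running, `send(build_ingest_request("https://api.example.com/v1/products", "Products"))` would issue the request.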
See the full REST API reference.
SDK Stages
| Stage | Method | Output |
|---|---|---|
| URL ingestion | `rag.scrape(...)` | Markdown files |
| URL retry | `rag.retry_scrape(...)` | Retried markdown files |
| Document ingestion | `rag.parse(...)` | Markdown files |
| API ingestion | `rag.fetch(...)` | Markdown files |
| Chunking | `rag.chunk(...)` | JSON chunk files |
| Single-file chunk preview | `rag.chunk_file(...)` | In-memory chunk dictionaries |
| Embedding | `rag.embed(...)` | Embedded vectors in a vector DB |
| Vector storage | `rag.store(...)` | Alias for embedding and storing chunks |
| Retrieval | `rag.retrieve(...)` | Ranked retrieved chunks |
The usage interfaces are organized in the package under ragrails/usage/:
ragrails/usage/
    sdk/
    cli/
    server/
Hosted documentation: https://dev.ragrails.com
Repository docs:
| Usage | Overview | Ingestion | Chunking | Embedding | Storing | Retrieval |
|---|---|---|---|---|---|---|
| SDK | Overview | Ingestion | Chunking | Embedding | Storing | Retrieval |
| CLI | Overview | Ingestion | Chunking | Embedding | Storing | Retrieval |
| REST API server | Overview | Ingestion | Chunking | Embedding | Storing | Retrieval |
Ingestion
URL Ingestion
result = RagRails().scrape(
    url="https://example.com/about",
    mode="each",
    output_dir="files/output/web_crawled",
)
For full-site crawling:
result = RagRails().scrape(
    url="https://example.com",
    mode="full",
    output_dir="files/output/web_crawled",
    max_depth=3,
    max_pages=200,
)
Failed URL attempts are written to dlq.json inside the output folder by
default:
files/output/web_crawled/dlq.json
Retry failed URLs:
result = RagRails().retry_scrape(
    "files/output/web_crawled/dlq.json",
)
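`retry_scrape()` only retries entries whose attempt count is below `max_attempts`. The sketch below illustrates that selection rule. The layout of `dlq.json` is not documented here, so the code assumes a JSON list of objects with hypothetical `url` and `attempts` keys; `entries_to_retry` is an illustrative helper, not a Ragrails function.

```python
import json
from pathlib import Path


def entries_to_retry(dlq_path, max_attempts=3):
    """Return DLQ entries whose attempt count is still below max_attempts.

    ASSUMPTION: the file holds a JSON list of objects with "url" and
    "attempts" keys. The real dlq.json layout may differ.
    """
    entries = json.loads(Path(dlq_path).read_text(encoding="utf-8"))
    # Entries without an "attempts" key are treated as never attempted.
    return [entry for entry in entries if entry.get("attempts", 0) < max_attempts]
```

Under these assumptions, an entry that has already been attempted three times is skipped when `max_attempts=3`.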
Document Ingestion
result = RagRails().parse(
    folder="files/input",
    output_dir="files/output/docs",
)
Supported folder discovery extensions:
.csv, .docx, .epub, .html, .htm, .ipynb, .json, .md, .msg,
.pdf, .pptx, .rss, .tsv, .txt, .xls, .xlsx, .xml, .zip
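To preview which files in a folder would qualify before calling `parse()`, the extension list can be applied with `pathlib`. This is a sketch only: `discover_files` is a hypothetical helper, and the recursive walk and case-insensitive matching are assumptions rather than documented Ragrails behavior.

```python
from pathlib import Path

# Extensions copied from the list above.
SUPPORTED_EXTENSIONS = {
    ".csv", ".docx", ".epub", ".html", ".htm", ".ipynb", ".json", ".md", ".msg",
    ".pdf", ".pptx", ".rss", ".tsv", ".txt", ".xls", ".xlsx", ".xml", ".zip",
}


def discover_files(folder):
    """Return supported files under `folder`, sorted by path.

    ASSUMPTION: subfolders are searched and suffixes are compared
    case-insensitively.
    """
    root = Path(folder)
    return sorted(
        path
        for path in root.rglob("*")
        if path.is_file() and path.suffix.lower() in SUPPORTED_EXTENSIONS
    )
```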
API Ingestion
result = RagRails().fetch(
    url="https://api.example.com/v1/search",
    method="POST",
    headers={
        "Authorization": "Bearer <token>",
        "X-Api-Key": "my-key",
    },
    body={"query": "payments"},
    title="Search Results",
    output_dir="files/output/api",
)
Chunking
result = RagRails().chunk(
    input_dir="files/output/docs",
    output_dir="files/output/chunks/docs",
    chunk_size=2000,
    chunk_overlap=200,
)
Preview one markdown file in memory:
chunks = RagRails().chunk_file(
    "files/output/docs/guide.md",
)
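To build intuition for how `chunk_size`, `chunk_overlap`, and `min_chunk_length` interact, here is a generic sliding-window splitter. It is not Ragrails' own chunker, which may well split along markdown structure; it also assumes sizes are counted in characters, which this document does not specify. `sliding_window_chunks` is a hypothetical name.

```python
def sliding_window_chunks(text, chunk_size=2000, chunk_overlap=200, min_chunk_length=100):
    """Split text into overlapping character windows.

    ASSUMPTION: sizes are counted in characters. This is a generic
    illustration of the three parameters, not Ragrails' own splitter.
    """
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        piece = text[start:start + chunk_size]
        if len(piece) >= min_chunk_length:  # drop fragments that are too short
            chunks.append(piece)
        if start + chunk_size >= len(text):  # this window reached the end
            break
    return chunks
```

With the defaults, a 5,000-character input yields windows starting at offsets 0, 1800, and 3600, and each window repeats the last 200 characters of the previous one.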
Embedding And Vector Storage
Ragrails currently supports Qdrant, Pinecone, and Weaviate as storage providers.
Set provider credentials as needed:
export VOYAGE_API_KEY="..."
export PINECONE_API_KEY="..."
export WEAVIATE_API_KEY="..."
Qdrant local example:
docker run -p 6333:6333 qdrant/qdrant
result = RagRails().embed(
    input_dir="files/output/chunks/docs",
    vector_db="qdrant",
    url="http://localhost:6333",
    collection="rag_chunks",
)
Pinecone example:
result = RagRails().embed(
    input_dir="files/output/chunks/docs",
    vector_db="pinecone",
    collection="rag-chunks",
)
Weaviate example:
result = RagRails().embed(
    input_dir="files/output/chunks/docs",
    vector_db="weaviate",
    url="http://localhost:8080",
    collection="RagChunks",
)
Provider naming rules:
| Provider | Collection name |
|---|---|
| Qdrant | Any valid Qdrant collection name, for example rag_chunks |
| Pinecone | Lowercase letters, digits, and hyphens, for example rag-chunks |
| Weaviate | Starts with an uppercase letter, for example RagChunks |
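A name that breaks these rules will usually fail only when the provider rejects it, so a local check before calling `embed()` can be handy. The sketch below encodes only the rules in the table above; each provider may enforce further limits (length, reserved names) that are not checked here. `is_valid_collection_name` is a hypothetical helper, not part of Ragrails.

```python
import re

# Patterns encode only the rules listed in the naming table.
NAME_RULES = {
    "qdrant": re.compile(r".+"),           # any non-empty name
    "pinecone": re.compile(r"[a-z0-9-]+"),  # lowercase letters, digits, hyphens
    "weaviate": re.compile(r"[A-Z].*"),     # must start with an uppercase letter
}


def is_valid_collection_name(vector_db, name):
    """Return True when `name` satisfies the listed rule for `vector_db`."""
    try:
        pattern = NAME_RULES[vector_db]
    except KeyError:
        raise ValueError(f"unknown vector_db: {vector_db!r}") from None
    return pattern.fullmatch(name) is not None
```

For example, `rag_chunks` passes for Qdrant but fails for Pinecone because of the underscore.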
store(...) is kept as an alias for embed(...) when you prefer storage-oriented naming.
Retrieval
result = RagRails().retrieve(
    "How do payouts work?",
    vector_db="qdrant",
    collection="rag_chunks",
    top_k=10,
)
for item in result.results:
    print(item.score, item.metadata.get("title"), item.text[:200])
Result Types
ScrapeResult(
    pages=int,
    failed=int,
    output_dir=str,
    files=list[str],
    dlq_path=str,
    errors=list[str],
)
ParseResult(
    documents=int,
    failed=int,
    output_dir=str,
    files=list[str],
    errors=list[str],
)
ApiIngestResult(
    pages=int,
    items=int,
    failed=int,
    output_dir=str,
    files=list[str],
    errors=list[str],
)
ChunkResult(
    files=int,
    chunks=int,
    output_dir=str,
    output_files=list[str],
    failed=int,
    errors=list[str],
)
EmbedResult(
    files=int,
    chunks=int,
    input_dir=str,
    provider=str,
    collection=str,
    errors=list[str],
)
StoreResult(
    files=int,
    chunks=int,
    input_dir=str,
    provider=str,
    collection=str,
    errors=list[str],
)
RetrieveResult(
    query=str,
    results=list[RetrievedChunk],
)
RetrievedChunk(
    id=str,
    score=float,
    text=str,
    metadata=dict,
    rerank_score=float | None,
)
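Because `rerank_score` can be `None`, code that ranks retrieved chunks needs a fallback. The sketch below uses local dataclass stand-ins that mirror the `RetrieveResult` and `RetrievedChunk` fields listed above; they are not the Ragrails classes themselves. It assumes higher scores mean better matches and that `rerank_score` is `None` when reranking was not requested. `effective_score` and `best_first` are hypothetical helpers.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class RetrievedChunk:
    """Local stand-in with the same fields as the listing above."""
    id: str
    score: float
    text: str
    metadata: dict = field(default_factory=dict)
    rerank_score: Optional[float] = None


@dataclass
class RetrieveResult:
    """Local stand-in with the same fields as the listing above."""
    query: str
    results: list = field(default_factory=list)


def effective_score(chunk):
    """Prefer the rerank score when present, else use the vector score."""
    return chunk.score if chunk.rerank_score is None else chunk.rerank_score


def best_first(result):
    """Return chunks ordered from highest to lowest effective score."""
    return sorted(result.results, key=effective_score, reverse=True)
```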
Parameter Reference
setup_url()
| Parameter | Type | Default | Required | Description |
|---|---|---|---|---|
| `browser` | `str` | `"chromium"` | No | Playwright browser binary to install for URL scraping. |
scrape()
| Parameter | Type | Default | Required | Description |
|---|---|---|---|---|
| `url` | `str \| list[str]` | - | Yes | URL or URLs to scrape. |
| `mode` | `"each" \| "full"` | `"each"` | No | Scrape exact URLs or crawl full sites. |
| `output_dir` | `str` | `"files/output/web_crawled"` | No | Markdown output folder. |
| `frontmatter` | `bool` | `True` | No | Add source metadata to markdown files. |
| `dlq_path` | `str \| None` | `None` | No | Custom DLQ file. Defaults to `<output_dir>/dlq.json`. |
| `max_depth` | `int` | `3` | No | Crawl depth for `mode="full"`. |
| `max_pages` | `int` | `200` | No | Maximum pages per site. |
retry_scrape()
| Parameter | Type | Default | Required | Description |
|---|---|---|---|---|
| `dlq_path` | `str` | - | Yes | DLQ file to retry. |
| `mode` | `"each" \| "full"` | `"each"` | No | Retry as exact pages or full-site crawls. |
| `max_depth` | `int` | `3` | No | Crawl depth for `mode="full"`. |
| `max_pages` | `int` | `200` | No | Maximum pages per site. |
| `max_attempts` | `int` | `3` | No | Retry entries below this attempt count. |
parse()
| Parameter | Type | Default | Required | Description |
|---|---|---|---|---|
| `files` | `str \| list[str \| dict] \| None` | `None` | Conditional | Specific files to parse. |
| `folder` | `str \| None` | `None` | Conditional | Folder of supported files to parse. |
| `input_dir` | `str` | `"files/input"` | No | Base folder for files. |
| `output_dir` | `str` | `"files/output/docs"` | No | Markdown output folder. |
| `frontmatter` | `bool` | `True` | No | Add document metadata to markdown files. |
fetch()
| Parameter | Type | Default | Required | Description |
|---|---|---|---|---|
| `url` | `str` | - | Yes | API endpoint URL. |
| `title` | `str` | `"API Response"` | No | Output metadata title. |
| `description` | `str` | `""` | No | Output metadata description. |
| `method` | `str` | `"GET"` | No | HTTP method. |
| `headers` | `dict \| None` | `None` | No | Request headers. Multiple headers are supported. |
| `params` | `dict \| None` | `None` | No | Query parameters. |
| `body` | `dict \| None` | `None` | No | JSON request body. |
| `pagination` | `dict \| None` | `None` | No | Pagination configuration. |
| `max_pages` | `int` | `100` | No | Maximum API pages to fetch. |
| `output_dir` | `str` | `"files/output/api"` | No | Markdown output folder. |
| `frontmatter` | `bool` | `True` | No | Add API metadata to markdown files. |
chunk()
| Parameter | Type | Default | Required | Description |
|---|---|---|---|---|
| `input_dir` | `str` | `"files/output/web_crawled"` | No | Folder containing markdown files. |
| `output_dir` | `str` | `"files/output/chunks"` | No | JSON chunk output folder. |
| `chunk_size` | `int` | `2000` | No | Target maximum chunk size. |
| `chunk_overlap` | `int` | `200` | No | Overlap between chunks. |
| `min_chunk_length` | `int` | `100` | No | Minimum chunk length to keep. |
chunk_file()
| Parameter | Type | Default | Required | Description |
|---|---|---|---|---|
| `path` | `str` | - | Yes | Markdown file path to chunk in memory. |
| `chunk_size` | `int` | `2000` | No | Target maximum chunk size. |
| `chunk_overlap` | `int` | `200` | No | Overlap between chunks. |
| `min_chunk_length` | `int` | `100` | No | Minimum chunk length to keep. |
embed()
| Parameter | Type | Default | Required | Description |
|---|---|---|---|---|
| `input_dir` | `str` | `"files/output/chunks"` | No | Folder of chunk JSON files. |
| `vector_db` | `"qdrant" \| "pinecone" \| "weaviate"` | `"qdrant"` | No | Vector database provider. |
| `collection` | `str \| None` | `None` | No | Collection, index, or class name. |
| `url` | `str \| None` | `None` | No | Vector database URL. |
| `files` | `str \| list[str] \| None` | `None` | No | Selected chunk files to embed. |
| `batch_size` | `int` | `64` | No | Chunks per embedding/storage batch. |
| `embedder` | `str` | `"voyage"` | No | Embedding provider. |
| `model` | `str` | `"voyage-3"` | No | Embedding model name. |
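`batch_size` controls how many chunks go into each embedding/storage batch. The sketch below shows the generic batching pattern this implies; `batched` is a hypothetical helper, and how Ragrails groups chunks internally is not documented here.

```python
def batched(items, batch_size=64):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# 150 chunks with the default batch size produce three batches.
sizes = [len(batch) for batch in batched(list(range(150)))]
print(sizes)
```

A smaller `batch_size` means more round trips per run, while a larger one puts more chunks into each request.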
store()
store() accepts the same parameters as embed() and returns StoreResult.
retrieve()
| Parameter | Type | Default | Required | Description |
|---|---|---|---|---|
| `query` | `str` | - | Yes | Query text to search for. |
| `vector_db` | `"qdrant" \| "pinecone" \| "weaviate"` | `"qdrant"` | No | Vector database provider. |
| `collection` | `str \| None` | `None` | No | Collection, index, or class name. |
| `url` | `str \| None` | `None` | No | Vector database URL. |
| `top_k` | `int` | `10` | No | Number of vector search candidates. |
| `embedder` | `str` | `"voyage"` | No | Query embedding provider. |
| `model` | `str` | `"voyage-3"` | No | Query embedding model. |
| `rerank` | `bool` | `False` | No | Rerank retrieved candidates. |
| `reranker` | `str` | `"voyage"` | No | Reranker provider. |
| `reranker_model` | `str` | `"rerank-2-lite"` | No | Reranker model. |
| `rerank_top_k` | `int` | `5` | No | Number of reranked results to return. |
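The `top_k` and `rerank_top_k` parameters describe a two-stage selection: vector search produces candidates, and reranking narrows them down. The sketch below shows that pattern on plain `(chunk_id, vector_score)` pairs with a caller-supplied scoring function. It illustrates the general technique under the assumption that higher scores are better; `retrieve_then_rerank` is a hypothetical name, not Ragrails' internal implementation.

```python
def retrieve_then_rerank(candidates, rerank_fn, top_k=10, rerank_top_k=5):
    """Two-stage selection over (chunk_id, vector_score) pairs.

    Stage 1 keeps the top_k pairs by vector score; stage 2 rescores them
    with rerank_fn(chunk_id) and keeps the best rerank_top_k.
    """
    shortlist = sorted(candidates, key=lambda pair: pair[1], reverse=True)[:top_k]
    rescored = [(chunk_id, rerank_fn(chunk_id)) for chunk_id, _ in shortlist]
    return sorted(rescored, key=lambda pair: pair[1], reverse=True)[:rerank_top_k]
```

A chunk that misses the stage 1 shortlist is never reranked, so `top_k` bounds what the reranker sees.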
Status
The public SDK currently covers ingestion, chunking, embedding, vector storage, and retrieval. Chat and eval exist internally and will be exposed as public SDK surfaces later.