# Ragrails
Ragrails is a modular RAG SDK for turning web pages, local documents, and REST API responses into retrieval-ready knowledge bases.
Documentation: https://dev.ragrails.com
It gives you one Python interface for:
- ingesting URLs, documents, and API responses into markdown
- chunking markdown into RAG-ready JSON chunks
- embedding and storing chunks in pluggable vector databases
- building toward retrieval, chat, and evaluation workflows
```python
from ragrails import RagRails

rag = RagRails()
```
## Install
Ragrails requires Python 3.10 or newer.
Recommended setup:
```bash
uv venv --python 3.10 .venv
source .venv/bin/activate
uv pip install ragrails
```
If you already manage your Python environment:
```bash
pip install ragrails
```
Document and API ingestion are included in the base install. Install extras only for heavier stages or providers.
| Need | Install |
|---|---|
| URL ingestion | `uv pip install "ragrails[url]"` |
| Chunking | `uv pip install "ragrails[chunk]"` |
| Store in Qdrant | `uv pip install "ragrails[store-qdrant]"` |
| Store in Pinecone | `uv pip install "ragrails[store-pinecone]"` |
| Store in Weaviate | `uv pip install "ragrails[store-weaviate]"` |
| Everything | `uv pip install "ragrails[all]"` |
Provider extras are also available separately:
| Provider | Install |
|---|---|
| Voyage embeddings | `uv pip install "ragrails[voyage]"` |
| Qdrant | `uv pip install "ragrails[qdrant]"` |
| Pinecone | `uv pip install "ragrails[pinecone]"` |
| Weaviate | `uv pip install "ragrails[weaviate]"` |
| OpenAI | `uv pip install "ragrails[openai]"` |
| Anthropic | `uv pip install "ragrails[anthropic]"` |
| Reranking | `uv pip install "ragrails[rerank]"` |
## Quick Start
### URL to Vector DB
```bash
uv pip install "ragrails[url,chunk,voyage,qdrant]"
```
URL scraping uses Playwright through crawl4ai. Run browser setup once in the
same environment:
```python
from ragrails import RagRails

rag = RagRails()
rag.setup_url()
```
Then run the pipeline:
```python
from ragrails import RagRails

rag = RagRails()

scraped = rag.scrape(
    url="https://example.com",
    mode="full",
    output_dir="files/output/web_crawled",
)

chunks = rag.chunk(
    input_dir=scraped.output_dir,
    output_dir="files/output/chunks/web",
)

stored = rag.store(
    input_dir=chunks.output_dir,
    vector_db="qdrant",
    collection="rag_chunks",
)

print(stored.chunks)
```
### Documents to Vector DB
```bash
uv pip install "ragrails[chunk,voyage,qdrant]"
```
```python
from ragrails import RagRails

rag = RagRails()

parsed = rag.parse(
    folder="files/input",
    output_dir="files/output/docs",
)

chunks = rag.chunk(
    input_dir=parsed.output_dir,
    output_dir="files/output/chunks/docs",
)

stored = rag.store(
    input_dir=chunks.output_dir,
    vector_db="qdrant",
    collection="rag_chunks",
)

print(stored.chunks)
```
### API to Markdown
```python
from ragrails import RagRails

result = RagRails().fetch(
    url="https://api.example.com/v1/products",
    title="Products",
    output_dir="files/output/api",
)

print(result.files)
```
## SDK Stages
| Stage | Method | Output |
|---|---|---|
| URL ingestion | `rag.scrape(...)` | Markdown files |
| URL retry | `rag.retry_scrape(...)` | Retried markdown files |
| Document ingestion | `rag.parse(...)` | Markdown files |
| API ingestion | `rag.fetch(...)` | Markdown files |
| Chunking | `rag.chunk(...)` | JSON chunk files |
| Single-file chunk preview | `rag.chunk_file(...)` | In-memory chunk dictionaries |
| Vector storage | `rag.store(...)` | Embedded vectors in a vector DB |
## Ingestion
### URL Ingestion
```python
from ragrails import RagRails

result = RagRails().scrape(
    url="https://example.com/about",
    mode="each",
    output_dir="files/output/web_crawled",
)
```
For full-site crawling:
```python
from ragrails import RagRails

result = RagRails().scrape(
    url="https://example.com",
    mode="full",
    output_dir="files/output/web_crawled",
    max_depth=3,
    max_pages=200,
)
```
Failed URL attempts are written to `dlq.json` inside the output folder by default:

```
files/output/web_crawled/dlq.json
```
Retry failed URLs:
```python
from ragrails import RagRails

result = RagRails().retry_scrape(
    "files/output/web_crawled/dlq.json",
)
```
### Document Ingestion
```python
from ragrails import RagRails

result = RagRails().parse(
    folder="files/input",
    output_dir="files/output/docs",
)
```
Supported folder discovery extensions:
```
.csv, .docx, .epub, .html, .htm, .ipynb, .json, .md, .msg,
.pdf, .pptx, .rss, .tsv, .txt, .xls, .xlsx, .xml, .zip
```
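Ragrails performs this discovery internally; purely as an illustration of the filter, the same selection can be sketched with `pathlib` (the helper name here is hypothetical, not part of the SDK):

```python
from pathlib import Path

# The extensions listed above that folder discovery picks up.
SUPPORTED_EXTENSIONS = {
    ".csv", ".docx", ".epub", ".html", ".htm", ".ipynb", ".json", ".md",
    ".msg", ".pdf", ".pptx", ".rss", ".tsv", ".txt", ".xls", ".xlsx",
    ".xml", ".zip",
}

def discover_supported(folder: str) -> list[Path]:
    """Return files under `folder` with a supported extension, sorted
    for a stable processing order."""
    return sorted(
        p for p in Path(folder).rglob("*")
        if p.is_file() and p.suffix.lower() in SUPPORTED_EXTENSIONS
    )
```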
### API Ingestion
```python
from ragrails import RagRails

result = RagRails().fetch(
    url="https://api.example.com/v1/search",
    method="POST",
    headers={"Authorization": "Bearer <token>"},
    body={"query": "payments"},
    title="Search Results",
    output_dir="files/output/api",
)
```
## Chunking
```python
from ragrails import RagRails

result = RagRails().chunk(
    input_dir="files/output/docs",
    output_dir="files/output/chunks/docs",
    chunk_size=2000,
    chunk_overlap=200,
)
```
Preview one markdown file in memory:
```python
from ragrails import RagRails

chunks = RagRails().chunk_file(
    "files/output/docs/guide.md",
)
```
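To see how `chunk_size`, `chunk_overlap`, and `min_chunk_length` relate, here is a naive character-based sliding window. This is not Ragrails' actual splitter (which works on markdown structure); it only illustrates the parameter semantics:

```python
def sliding_window_chunks(
    text: str,
    chunk_size: int = 2000,
    chunk_overlap: int = 200,
    min_chunk_length: int = 100,
) -> list[str]:
    """Naive character windows: each chunk starts chunk_size - chunk_overlap
    characters after the previous one, and tail chunks shorter than
    min_chunk_length are dropped."""
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        piece = text[start : start + chunk_size]
        if len(piece) >= min_chunk_length:
            chunks.append(piece)
    return chunks
```

With the defaults, a 4,500-character document yields two full 2,000-character chunks plus a 900-character tail, each sharing 200 characters with its neighbor.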
## Vector Storage
Ragrails currently supports Qdrant, Pinecone, and Weaviate as storage providers.
Set provider credentials as needed:
```bash
export VOYAGE_API_KEY="..."
export PINECONE_API_KEY="..."
export WEAVIATE_API_KEY="..."
```
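Which keys you need depends on your embedder and store (a local Qdrant needs none). A small preflight check can fail fast before any embedding runs; the provider-to-variable mapping below is an assumption based on the variables above, not part of the SDK:

```python
import os

# Assumed mapping from provider name to its credential variable,
# derived from the environment variables listed above.
REQUIRED_ENV = {
    "voyage": "VOYAGE_API_KEY",
    "pinecone": "PINECONE_API_KEY",
    "weaviate": "WEAVIATE_API_KEY",
}

def missing_credentials(providers: list[str]) -> list[str]:
    """Return the credential variables that are unset for `providers`.
    Providers with no entry (for example a local qdrant) need nothing."""
    return [
        REQUIRED_ENV[p]
        for p in providers
        if p in REQUIRED_ENV and not os.environ.get(REQUIRED_ENV[p])
    ]
```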
Qdrant local example:
```bash
docker run -p 6333:6333 qdrant/qdrant
```
```python
from ragrails import RagRails

result = RagRails().store(
    input_dir="files/output/chunks/docs",
    vector_db="qdrant",
    url="http://localhost:6333",
    collection="rag_chunks",
)
```
Pinecone example:
```python
from ragrails import RagRails

result = RagRails().store(
    input_dir="files/output/chunks/docs",
    vector_db="pinecone",
    collection="rag-chunks",
)
```
Weaviate example:
```python
from ragrails import RagRails

result = RagRails().store(
    input_dir="files/output/chunks/docs",
    vector_db="weaviate",
    url="http://localhost:8080",
    collection="RagChunks",
)
```
Provider naming rules:
| Provider | Collection name |
|---|---|
| Qdrant | Any valid Qdrant collection name, for example `rag_chunks` |
| Pinecone | Lowercase letters, digits, and hyphens, for example `rag-chunks` |
| Weaviate | Starts with an uppercase letter, for example `RagChunks` |
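These rules can be checked before calling `store()`. A sketch encoding the table above as regexes; the patterns are approximations of each provider's rules (the real providers also enforce length limits and reserved names), not exact specifications:

```python
import re

# Approximate patterns derived from the naming rules above.
NAME_PATTERNS = {
    "qdrant": re.compile(r"^[A-Za-z0-9_-]+$"),
    "pinecone": re.compile(r"^[a-z0-9-]+$"),
    "weaviate": re.compile(r"^[A-Z][A-Za-z0-9]*$"),
}

def is_valid_collection(provider: str, name: str) -> bool:
    """Check `name` against the approximate naming rule for `provider`."""
    pattern = NAME_PATTERNS.get(provider)
    return bool(pattern and pattern.fullmatch(name))
```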
## Result Types
```python
ScrapeResult(
    pages=int,
    failed=int,
    output_dir=str,
    files=list[str],
    dlq_path=str,
    errors=list[str],
)

ParseResult(
    documents=int,
    failed=int,
    output_dir=str,
    files=list[str],
    errors=list[str],
)

ApiIngestResult(
    pages=int,
    items=int,
    failed=int,
    output_dir=str,
    files=list[str],
    errors=list[str],
)

ChunkResult(
    files=int,
    chunks=int,
    output_dir=str,
    output_files=list[str],
    failed=int,
    errors=list[str],
)

StoreResult(
    files=int,
    chunks=int,
    input_dir=str,
    provider=str,
    collection=str,
    errors=list[str],
)
```
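Every result type carries an `errors` list, so a multi-stage pipeline can collect its failures in one place with plain duck typing. A minimal sketch; the stand-in result class below exists only for illustration:

```python
from dataclasses import dataclass, field

@dataclass
class FakeResult:
    # Stand-in for any Ragrails result: all of them expose `errors`.
    errors: list[str] = field(default_factory=list)

def collect_errors(*results) -> list[str]:
    """Flatten the `errors` lists from a sequence of stage results,
    preserving stage order."""
    return [err for result in results for err in result.errors]
```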
## Parameter Reference
### setup_url()

| Parameter | Type | Default | Required | Description |
|---|---|---|---|---|
| `browser` | `str` | `"chromium"` | No | Playwright browser binary to install for URL scraping. |
### scrape()

| Parameter | Type | Default | Required | Description |
|---|---|---|---|---|
| `url` | `str \| list[str]` | - | Yes | URL or URLs to scrape. |
| `mode` | `"each" \| "full"` | `"each"` | No | Scrape exact URLs or crawl full sites. |
| `output_dir` | `str` | `"files/output/web_crawled"` | No | Markdown output folder. |
| `frontmatter` | `bool` | `True` | No | Add source metadata to markdown files. |
| `dlq_path` | `str \| None` | `None` | No | Custom DLQ file. Defaults to `<output_dir>/dlq.json`. |
| `max_depth` | `int` | `3` | No | Crawl depth for `mode="full"`. |
| `max_pages` | `int` | `200` | No | Maximum pages per site. |
### retry_scrape()

| Parameter | Type | Default | Required | Description |
|---|---|---|---|---|
| `dlq_path` | `str` | - | Yes | DLQ file to retry. |
| `mode` | `"each" \| "full"` | `"each"` | No | Retry as exact pages or full-site crawls. |
| `max_depth` | `int` | `3` | No | Crawl depth for `mode="full"`. |
| `max_pages` | `int` | `200` | No | Maximum pages per site. |
| `max_attempts` | `int` | `3` | No | Retry entries below this attempt count. |
### parse()

| Parameter | Type | Default | Required | Description |
|---|---|---|---|---|
| `files` | `str \| list[str \| dict] \| None` | `None` | Conditional | Specific files to parse. |
| `folder` | `str \| None` | `None` | Conditional | Folder of supported files to parse. |
| `input_dir` | `str` | `"files/input"` | No | Base folder for files. |
| `output_dir` | `str` | `"files/output/docs"` | No | Markdown output folder. |
| `frontmatter` | `bool` | `True` | No | Add document metadata to markdown files. |
### fetch()

| Parameter | Type | Default | Required | Description |
|---|---|---|---|---|
| `url` | `str` | - | Yes | API endpoint URL. |
| `title` | `str` | `"API Response"` | No | Output metadata title. |
| `description` | `str` | `""` | No | Output metadata description. |
| `method` | `str` | `"GET"` | No | HTTP method. |
| `headers` | `dict \| None` | `None` | No | Request headers. |
| `params` | `dict \| None` | `None` | No | Query parameters. |
| `body` | `dict \| None` | `None` | No | JSON request body. |
| `pagination` | `dict \| None` | `None` | No | Pagination configuration. |
| `max_pages` | `int` | `100` | No | Maximum API pages to fetch. |
| `output_dir` | `str` | `"files/output/api"` | No | Markdown output folder. |
| `frontmatter` | `bool` | `True` | No | Add API metadata to markdown files. |
### chunk()

| Parameter | Type | Default | Required | Description |
|---|---|---|---|---|
| `input_dir` | `str` | `"files/output/web_crawled"` | No | Folder containing markdown files. |
| `output_dir` | `str` | `"files/output/chunks"` | No | JSON chunk output folder. |
| `chunk_size` | `int` | `2000` | No | Target maximum chunk size. |
| `chunk_overlap` | `int` | `200` | No | Overlap between chunks. |
| `min_chunk_length` | `int` | `100` | No | Minimum chunk length to keep. |
### chunk_file()

| Parameter | Type | Default | Required | Description |
|---|---|---|---|---|
| `path` | `str` | - | Yes | Markdown file path to chunk in memory. |
| `chunk_size` | `int` | `2000` | No | Target maximum chunk size. |
| `chunk_overlap` | `int` | `200` | No | Overlap between chunks. |
| `min_chunk_length` | `int` | `100` | No | Minimum chunk length to keep. |
### store()

| Parameter | Type | Default | Required | Description |
|---|---|---|---|---|
| `input_dir` | `str` | `"files/output/chunks"` | No | Folder of chunk JSON files. |
| `vector_db` | `"qdrant" \| "pinecone" \| "weaviate"` | `"qdrant"` | No | Vector database provider. |
| `collection` | `str \| None` | `None` | No | Collection, index, or class name. |
| `url` | `str \| None` | `None` | No | Vector database URL. |
| `files` | `str \| list[str] \| None` | `None` | No | Selected chunk files to store. |
| `batch_size` | `int` | `64` | No | Chunks per embedding/storage batch. |
| `embedder` | `str` | `"voyage"` | No | Embedding provider. |
| `model` | `str` | `"voyage-3"` | No | Embedding model name. |
## Status
The public SDK currently covers ingestion, chunking, and vector storage. Retrieval, chat, and evaluation exist internally and will be exposed as public SDK surfaces next.