A modular RAG SDK for ingesting web, document, and API sources, chunking them, and storing embeddings in pluggable vector databases.

Ragrails

Ragrails is a modular RAG SDK for turning web pages, local documents, and REST API responses into retrieval-ready knowledge bases.

Documentation: https://dev.ragrails.com

It gives you one Python interface for:

  • ingesting URLs, documents, and API responses into markdown
  • chunking markdown into RAG-ready JSON chunks
  • embedding and storing chunks in pluggable vector databases
  • building toward retrieval, chat, and evaluation workflows

from ragrails import RagRails

rag = RagRails()

Install

Ragrails requires Python 3.10 or newer. The macOS system Python is 3.9 and will not work. Install a supported version from python.org or via your package manager before running the install command.

pip install ragrails

Document and API ingestion are included in the base install. Install extras only for heavier stages or providers.

| Need | Install |
| --- | --- |
| URL ingestion | `pip install "ragrails[url]"` |
| Chunking | `pip install "ragrails[chunk]"` |
| REST API server | `pip install "ragrails[server]"` |
| Store in Qdrant | `pip install "ragrails[store-qdrant]"` |
| Store in Pinecone | `pip install "ragrails[store-pinecone]"` |
| Store in Weaviate | `pip install "ragrails[store-weaviate]"` |
| Everything | `pip install "ragrails[all]"` |

Provider extras are also available separately:

| Provider | Install |
| --- | --- |
| Voyage embeddings | `pip install "ragrails[voyage]"` |
| Qdrant | `pip install "ragrails[qdrant]"` |
| Pinecone | `pip install "ragrails[pinecone]"` |
| Weaviate | `pip install "ragrails[weaviate]"` |
| OpenAI | `pip install "ragrails[openai]"` |
| Anthropic | `pip install "ragrails[anthropic]"` |
| Reranking | `pip install "ragrails[rerank]"` |

Quick Start

URL to Vector DB

pip install "ragrails[url,chunk,voyage,qdrant]"

URL scraping uses Playwright through crawl4ai. Run browser setup once in the same environment:

from ragrails import RagRails

rag = RagRails()
rag.setup_url()

Then run the pipeline:

from ragrails import RagRails

rag = RagRails()

scraped = rag.scrape(
    url="https://example.com",
    mode="full",
    output_dir="files/output/web_crawled",
)

chunks = rag.chunk(
    input_dir=scraped.output_dir,
    output_dir="files/output/chunks/web",
)

embedded = rag.embed(
    input_dir=chunks.output_dir,
    vector_db="qdrant",
    collection="rag_chunks",
)

print(embedded.chunks)

Documents to Vector DB

pip install "ragrails[chunk,voyage,qdrant]"

from ragrails import RagRails

rag = RagRails()

parsed = rag.parse(
    folder="files/input",
    output_dir="files/output/docs",
)

chunks = rag.chunk(
    input_dir=parsed.output_dir,
    output_dir="files/output/chunks/docs",
)

embedded = rag.embed(
    input_dir=chunks.output_dir,
    vector_db="qdrant",
    collection="rag_chunks",
)

print(embedded.chunks)

API to Markdown

from ragrails import RagRails

result = RagRails().fetch(
    url="https://api.example.com/v1/products",
    title="Products",
    output_dir="files/output/api",
)

print(result.files)

CLI

Ragrails ships with a CLI so you can run ingestion without writing Python.

ragrails setup-url
ragrails scrape https://example.com --mode full
ragrails scrape https://example.com/about https://example.com/pricing
ragrails parse --folder files/input
ragrails parse --files guide.pdf --files pricing.csv --input-dir files/input
ragrails fetch https://api.example.com/v1/products --title "Products"
ragrails fetch https://api.example.com/v1/products \
  --header "Authorization:Bearer <token>" \
  --header "X-Api-Key:my-key"
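The `--header` values above use a `Name:Value` form, while `fetch()` takes a `headers` dict. To show how the two relate, here is a minimal sketch that splits each string on its first colon. `parse_header_args` is a hypothetical helper, not part of Ragrails, and the CLI's real parsing may differ.

```python
def parse_header_args(values: list[str]) -> dict[str, str]:
    """Turn ["Name:Value", ...] strings into a headers dict (hypothetical helper)."""
    headers: dict[str, str] = {}
    for raw in values:
        # Split on the first colon only, so values that contain ":" stay intact.
        name, sep, value = raw.partition(":")
        if not sep or not name.strip():
            raise ValueError(f"expected 'Name:Value', got {raw!r}")
        headers[name.strip()] = value.strip()
    return headers


headers = parse_header_args(["Authorization:Bearer <token>", "X-Api-Key:my-key"])
print(headers)
```

The resulting dict has the same shape as the `headers` argument shown for `fetch()` in the API Ingestion section.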

See the full CLI reference.

REST API

Ragrails also ships an optional REST API server for language-agnostic HTTP usage.

pip install "ragrails[server]"
ragrails-api

curl -X POST http://127.0.0.1:8000/v1/ingest/api \
  -H "Content-Type: application/json" \
  -d '{"url":"https://api.example.com/v1/products","title":"Products"}'
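The same request can be made from Python with only the standard library. This sketch builds the POST request for the `/v1/ingest/api` endpoint used in the curl example; `build_ingest_request` is an illustrative name, and the response format is not described here, so the send step is left commented out.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:8000"  # same host and port as the curl example


def build_ingest_request(url: str, title: str) -> urllib.request.Request:
    """Build the POST request for /v1/ingest/api without sending it."""
    payload = json.dumps({"url": url, "title": title}).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/v1/ingest/api",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_ingest_request("https://api.example.com/v1/products", "Products")
print(req.get_method(), req.full_url)

# Sending requires the server started by `ragrails-api` to be running:
# with urllib.request.urlopen(req) as resp:
#     print(resp.read().decode("utf-8"))
```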

See the full REST API reference.

SDK Stages

| Stage | Method | Output |
| --- | --- | --- |
| URL ingestion | `rag.scrape(...)` | Markdown files |
| URL retry | `rag.retry_scrape(...)` | Retried markdown files |
| Document ingestion | `rag.parse(...)` | Markdown files |
| API ingestion | `rag.fetch(...)` | Markdown files |
| Chunking | `rag.chunk(...)` | JSON chunk files |
| Single-file chunk preview | `rag.chunk_file(...)` | In-memory chunk dictionaries |
| Embedding | `rag.embed(...)` | Embedded vectors in a vector DB |
| Vector storage | `rag.store(...)` | Alias for embedding and storing chunks |
| Retrieval | `rag.retrieve(...)` | Ranked retrieved chunks |

The usage interfaces are organized in the package under ragrails/usage/:

ragrails/usage/
  sdk/
  cli/
  server/

The repository docs cover each usage interface (SDK, CLI, and REST API server) with pages for Overview, Ingestion, Chunking, Embedding, Storing, and Retrieval.

Ingestion

URL Ingestion

result = RagRails().scrape(
    url="https://example.com/about",
    mode="each",
    output_dir="files/output/web_crawled",
)

For full-site crawling:

result = RagRails().scrape(
    url="https://example.com",
    mode="full",
    output_dir="files/output/web_crawled",
    max_depth=3,
    max_pages=200,
)
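To make the two limits concrete, here is a toy breadth-first walk over an in-memory link graph that stops at `max_depth` and `max_pages`. It is only a sketch of how such limits typically interact, not the crawl logic Ragrails or crawl4ai actually runs; `bounded_crawl` and the `site` graph are made up for illustration, and treating the start page as depth 0 is an assumption.

```python
from collections import deque


def bounded_crawl(links: dict[str, list[str]], start: str,
                  max_depth: int = 3, max_pages: int = 200) -> list[str]:
    """Breadth-first walk over a link graph, limited by depth and page count."""
    visited = [start]
    seen = {start}
    queue = deque([(start, 0)])  # (url, depth); the start page is depth 0 here
    while queue:
        url, depth = queue.popleft()
        if depth >= max_depth:
            continue  # do not follow links past the depth limit
        for nxt in links.get(url, []):
            if len(visited) >= max_pages:
                return visited  # page budget exhausted
            if nxt not in seen:
                seen.add(nxt)
                visited.append(nxt)
                queue.append((nxt, depth + 1))
    return visited


site = {
    "/": ["/about", "/pricing"],
    "/about": ["/team"],
    "/team": ["/team/alice"],
}
print(bounded_crawl(site, "/", max_depth=1))
print(bounded_crawl(site, "/", max_pages=2))
```

In this toy model a lower `max_depth` trims deep pages, while `max_pages` caps the total regardless of depth.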

Failed URL attempts are written to dlq.json inside the output folder by default:

files/output/web_crawled/dlq.json

Retry failed URLs:

result = RagRails().retry_scrape(
    "files/output/web_crawled/dlq.json",
)
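Because the DLQ file defaults to `<output_dir>/dlq.json`, a script can work out where to look before deciding whether a retry is needed. The helper below is a hypothetical convenience written for this example, not a Ragrails function.

```python
from pathlib import Path
from typing import Optional


def resolve_dlq_path(output_dir: str, dlq_path: Optional[str] = None) -> Path:
    """Return the explicit DLQ path, or <output_dir>/dlq.json when none is given."""
    return Path(dlq_path) if dlq_path else Path(output_dir) / "dlq.json"


dlq = resolve_dlq_path("files/output/web_crawled")
if dlq.exists():
    print(f"failed URLs recorded in {dlq}; pass this path to retry_scrape()")
else:
    print("no DLQ file found, nothing to retry")
```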

Document Ingestion

result = RagRails().parse(
    folder="files/input",
    output_dir="files/output/docs",
)

Supported folder discovery extensions:

.csv, .docx, .epub, .html, .htm, .ipynb, .json, .md, .msg,
.pdf, .pptx, .rss, .tsv, .txt, .xls, .xlsx, .xml, .zip
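Before parsing a large folder, it can help to preview which files match the extensions above. The sketch below uses only `pathlib`; `preview_discovery` is a hypothetical helper, and both the recursive search and the case-insensitive match are assumptions about how discovery might behave, not documented Ragrails behavior.

```python
from pathlib import Path

SUPPORTED_EXTENSIONS = {
    ".csv", ".docx", ".epub", ".html", ".htm", ".ipynb", ".json", ".md", ".msg",
    ".pdf", ".pptx", ".rss", ".tsv", ".txt", ".xls", ".xlsx", ".xml", ".zip",
}


def preview_discovery(folder: str) -> tuple[list[Path], list[Path]]:
    """Split files under `folder` into (supported, skipped) by extension."""
    supported, skipped = [], []
    for path in sorted(Path(folder).rglob("*")):  # recursive search is an assumption
        if not path.is_file():
            continue
        # Case-insensitive comparison is an assumption as well.
        bucket = supported if path.suffix.lower() in SUPPORTED_EXTENSIONS else skipped
        bucket.append(path)
    return supported, skipped


if Path("files/input").is_dir():
    ok, ignored = preview_discovery("files/input")
    print(f"{len(ok)} supported, {len(ignored)} skipped")
```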

API Ingestion

result = RagRails().fetch(
    url="https://api.example.com/v1/search",
    method="POST",
    headers={
        "Authorization": "Bearer <token>",
        "X-Api-Key": "my-key",
    },
    body={"query": "payments"},
    title="Search Results",
    output_dir="files/output/api",
)

Chunking

result = RagRails().chunk(
    input_dir="files/output/docs",
    output_dir="files/output/chunks/docs",
    chunk_size=2000,
    chunk_overlap=200,
)

Preview one markdown file in memory:

chunks = RagRails().chunk_file(
    "files/output/docs/guide.md",
)
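To build intuition for `chunk_size`, `chunk_overlap`, and `min_chunk_length`, here is a plain sliding-window splitter. It is a simplified stand-in, not the Ragrails chunker, which works on markdown and may split differently; measuring sizes in characters is an assumption, and `sliding_chunks` is an illustrative name.

```python
def sliding_chunks(text: str, chunk_size: int = 2000, chunk_overlap: int = 200,
                   min_chunk_length: int = 100) -> list[str]:
    """Fixed-size character windows, each starting chunk_size - chunk_overlap after the last."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        piece = text[start:start + chunk_size]
        if len(piece) >= min_chunk_length:
            chunks.append(piece)  # pieces shorter than the minimum are dropped
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


sample = "".join(str(i % 10) for i in range(5000))
pieces = sliding_chunks(sample)
print([len(p) for p in pieces])
print(pieces[0][-200:] == pieces[1][:200])  # neighbouring chunks share the overlap
```

With the defaults, each new window starts 1800 characters after the previous one, so consecutive chunks repeat 200 characters of context.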

Embedding And Vector Storage

Ragrails currently supports Qdrant, Pinecone, and Weaviate as storage providers.

Set provider credentials as needed:

export VOYAGE_API_KEY="..."
export PINECONE_API_KEY="..."
export WEAVIATE_API_KEY="..."
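A quick pre-flight check can catch a missing key before an embedding run starts. The sketch below maps provider names to the environment variables listed above; `missing_credentials` is a hypothetical helper, and whether a key is really required depends on the deployment (a local Weaviate instance, for example, may not need one).

```python
import os

# Variable names come from the export lines above; Qdrant is shown running locally without a key.
REQUIRED_ENV = {
    "voyage": "VOYAGE_API_KEY",
    "pinecone": "PINECONE_API_KEY",
    "weaviate": "WEAVIATE_API_KEY",
}


def missing_credentials(vector_db: str, embedder: str = "voyage", env=os.environ) -> list[str]:
    """Return the env var names that are unset or empty for the chosen providers."""
    needed = [REQUIRED_ENV[name] for name in (embedder, vector_db) if name in REQUIRED_ENV]
    return [var for var in needed if not env.get(var)]


missing = missing_credentials("pinecone")
if missing:
    print("set these before calling embed():", ", ".join(missing))
```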

Qdrant local example:

docker run -p 6333:6333 qdrant/qdrant

result = RagRails().embed(
    input_dir="files/output/chunks/docs",
    vector_db="qdrant",
    url="http://localhost:6333",
    collection="rag_chunks",
)

Pinecone example:

result = RagRails().embed(
    input_dir="files/output/chunks/docs",
    vector_db="pinecone",
    collection="rag-chunks",
)

Weaviate example:

result = RagRails().embed(
    input_dir="files/output/chunks/docs",
    vector_db="weaviate",
    url="http://localhost:8080",
    collection="RagChunks",
)

Provider naming rules:

| Provider | Collection name |
| --- | --- |
| Qdrant | Any valid Qdrant collection name, for example `rag_chunks` |
| Pinecone | Lowercase letters, digits, and hyphens, for example `rag-chunks` |
| Weaviate | Starts with an uppercase letter, for example `RagChunks` |
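The naming rules above can be checked locally before a run. This sketch encodes only what the table states; each provider may enforce further limits (length, reserved names) that are not modeled here, and `check_collection_name` is a hypothetical helper rather than a Ragrails API.

```python
import re

# Patterns follow the naming rules listed above and nothing more.
NAME_RULES = {
    "pinecone": re.compile(r"[a-z0-9-]+"),
    "weaviate": re.compile(r"[A-Z].*"),
}


def check_collection_name(vector_db: str, name: str) -> bool:
    """True if `name` satisfies the rule listed for `vector_db`."""
    if not name:
        return False
    rule = NAME_RULES.get(vector_db)
    # No pattern is listed for Qdrant, so any non-empty name passes this check.
    return True if rule is None else rule.fullmatch(name) is not None


for db, name in [("qdrant", "rag_chunks"), ("pinecone", "rag_chunks"), ("weaviate", "RagChunks")]:
    print(db, name, check_collection_name(db, name))
```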

store(...) is an alias for embed(...) for code that prefers storage-oriented naming. It takes the same parameters and returns a StoreResult with the same fields as EmbedResult.

Retrieval

result = RagRails().retrieve(
    "How do payouts work?",
    vector_db="qdrant",
    collection="rag_chunks",
    top_k=10,
)

for item in result.results:
    print(item.score, item.metadata.get("title"), item.text[:200])
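Retrieved chunks are usually folded into a prompt context next. Since chat is not yet part of the public SDK, the following is only one possible way to do that: it filters by score and stops at a character budget. `Chunk` is a local stand-in with the same fields as `RetrievedChunk`, and `build_context`, `min_score`, and `max_chars` are names invented for this sketch.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Chunk:
    """Local stand-in mirroring the RetrievedChunk fields."""
    id: str
    score: float
    text: str
    metadata: dict = field(default_factory=dict)
    rerank_score: Optional[float] = None


def build_context(chunks: list[Chunk], min_score: float = 0.0, max_chars: int = 4000) -> str:
    """Join chunk texts in ranked order until the character budget is used up."""
    parts, used = [], 0
    for c in chunks:  # assumes the list is already ranked best first
        if c.score < min_score:
            continue
        if used + len(c.text) > max_chars:
            break
        parts.append(f"[{c.metadata.get('title', c.id)}] {c.text}")
        used += len(c.text)
    return "\n\n".join(parts)


demo = [
    Chunk("a", 0.9, "alpha", {"title": "A"}),
    Chunk("b", 0.2, "beta"),
    Chunk("c", 0.8, "gamma"),
]
print(build_context(demo, min_score=0.5))
```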

Result Types

ScrapeResult(
    pages=int,
    failed=int,
    output_dir=str,
    files=list[str],
    dlq_path=str,
    errors=list[str],
)

ParseResult(
    documents=int,
    failed=int,
    output_dir=str,
    files=list[str],
    errors=list[str],
)

ApiIngestResult(
    pages=int,
    items=int,
    failed=int,
    output_dir=str,
    files=list[str],
    errors=list[str],
)

ChunkResult(
    files=int,
    chunks=int,
    output_dir=str,
    output_files=list[str],
    failed=int,
    errors=list[str],
)

EmbedResult(
    files=int,
    chunks=int,
    input_dir=str,
    provider=str,
    collection=str,
    errors=list[str],
)

StoreResult(
    files=int,
    chunks=int,
    input_dir=str,
    provider=str,
    collection=str,
    errors=list[str],
)

RetrieveResult(
    query=str,
    results=list[RetrievedChunk],
)

RetrievedChunk(
    id=str,
    score=float,
    text=str,
    metadata=dict,
    rerank_score=float | None,
)
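Most result types carry `errors`, and several also carry `failed`, so a pipeline can check each stage before moving on. The helper below is a small sketch of that pattern; `summarize` is a hypothetical name, and `SimpleNamespace` stands in for a real result object so the example runs without Ragrails installed.

```python
from types import SimpleNamespace


def summarize(stage: str, result) -> str:
    """One-line status built from the `failed` and `errors` fields, where present."""
    failed = getattr(result, "failed", 0)  # EmbedResult and StoreResult list no `failed` field
    errors = list(getattr(result, "errors", []))
    status = "ok" if not failed and not errors else "needs attention"
    return f"{stage}: {status} (failed={failed}, errors={len(errors)})"


# SimpleNamespace stands in for a ScrapeResult in this sketch.
fake_scrape = SimpleNamespace(pages=12, failed=2, errors=["timeout: https://example.com/x"])
print(summarize("scrape", fake_scrape))
```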

Parameter Reference

setup_url()

| Parameter | Type | Default | Required | Description |
| --- | --- | --- | --- | --- |
| `browser` | `str` | `"chromium"` | No | Playwright browser binary to install for URL scraping. |

scrape()

| Parameter | Type | Default | Required | Description |
| --- | --- | --- | --- | --- |
| `url` | `str \| list[str]` | - | Yes | URL or URLs to scrape. |
| `mode` | `"each" \| "full"` | `"each"` | No | Scrape exact URLs or crawl full sites. |
| `output_dir` | `str` | `"files/output/web_crawled"` | No | Markdown output folder. |
| `frontmatter` | `bool` | `True` | No | Add source metadata to markdown files. |
| `dlq_path` | `str \| None` | `None` | No | Custom DLQ file. Defaults to `<output_dir>/dlq.json`. |
| `max_depth` | `int` | `3` | No | Crawl depth for `mode="full"`. |
| `max_pages` | `int` | `200` | No | Maximum pages per site. |

retry_scrape()

| Parameter | Type | Default | Required | Description |
| --- | --- | --- | --- | --- |
| `dlq_path` | `str` | - | Yes | DLQ file to retry. |
| `mode` | `"each" \| "full"` | `"each"` | No | Retry as exact pages or full-site crawls. |
| `max_depth` | `int` | `3` | No | Crawl depth for `mode="full"`. |
| `max_pages` | `int` | `200` | No | Maximum pages per site. |
| `max_attempts` | `int` | `3` | No | Retry entries below this attempt count. |

parse()

| Parameter | Type | Default | Required | Description |
| --- | --- | --- | --- | --- |
| `files` | `str \| list[str \| dict] \| None` | `None` | Conditional | Specific files to parse. |
| `folder` | `str \| None` | `None` | Conditional | Folder of supported files to parse. |
| `input_dir` | `str` | `"files/input"` | No | Base folder for `files`. |
| `output_dir` | `str` | `"files/output/docs"` | No | Markdown output folder. |
| `frontmatter` | `bool` | `True` | No | Add document metadata to markdown files. |

fetch()

| Parameter | Type | Default | Required | Description |
| --- | --- | --- | --- | --- |
| `url` | `str` | - | Yes | API endpoint URL. |
| `title` | `str` | `"API Response"` | No | Output metadata title. |
| `description` | `str` | `""` | No | Output metadata description. |
| `method` | `str` | `"GET"` | No | HTTP method. |
| `headers` | `dict \| None` | `None` | No | Request headers. Multiple headers are supported. |
| `params` | `dict \| None` | `None` | No | Query parameters. |
| `body` | `dict \| None` | `None` | No | JSON request body. |
| `pagination` | `dict \| None` | `None` | No | Pagination configuration. |
| `max_pages` | `int` | `100` | No | Maximum API pages to fetch. |
| `output_dir` | `str` | `"files/output/api"` | No | Markdown output folder. |
| `frontmatter` | `bool` | `True` | No | Add API metadata to markdown files. |

chunk()

| Parameter | Type | Default | Required | Description |
| --- | --- | --- | --- | --- |
| `input_dir` | `str` | `"files/output/web_crawled"` | No | Folder containing markdown files. |
| `output_dir` | `str` | `"files/output/chunks"` | No | JSON chunk output folder. |
| `chunk_size` | `int` | `2000` | No | Target maximum chunk size. |
| `chunk_overlap` | `int` | `200` | No | Overlap between chunks. |
| `min_chunk_length` | `int` | `100` | No | Minimum chunk length to keep. |

chunk_file()

| Parameter | Type | Default | Required | Description |
| --- | --- | --- | --- | --- |
| `path` | `str` | - | Yes | Markdown file path to chunk in memory. |
| `chunk_size` | `int` | `2000` | No | Target maximum chunk size. |
| `chunk_overlap` | `int` | `200` | No | Overlap between chunks. |
| `min_chunk_length` | `int` | `100` | No | Minimum chunk length to keep. |

embed()

| Parameter | Type | Default | Required | Description |
| --- | --- | --- | --- | --- |
| `input_dir` | `str` | `"files/output/chunks"` | No | Folder of chunk JSON files. |
| `vector_db` | `"qdrant" \| "pinecone" \| "weaviate"` | `"qdrant"` | No | Vector database provider. |
| `collection` | `str \| None` | `None` | No | Collection, index, or class name. |
| `url` | `str \| None` | `None` | No | Vector database URL. |
| `files` | `str \| list[str] \| None` | `None` | No | Selected chunk files to embed. |
| `batch_size` | `int` | `64` | No | Chunks per embedding/storage batch. |
| `embedder` | `str` | `"voyage"` | No | Embedding provider. |
| `model` | `str` | `"voyage-3"` | No | Embedding model name. |
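As a rough picture of what `batch_size` controls, the sketch below splits a list of chunks into consecutive groups of at most 64. It illustrates generic batching only; how Ragrails actually groups embedding and storage calls is not documented here, and `batches` is an illustrative helper.

```python
from typing import Iterator, Sequence


def batches(items: Sequence, batch_size: int = 64) -> Iterator[Sequence]:
    """Yield consecutive slices of at most `batch_size` items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


sizes = [len(b) for b in batches(list(range(150)))]
print(sizes)
```

In this model, 150 chunks with the default batch size produce two full batches and one partial batch.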

store()

store() accepts the same parameters as embed() and returns StoreResult.

retrieve()

| Parameter | Type | Default | Required | Description |
| --- | --- | --- | --- | --- |
| `query` | `str` | - | Yes | Query text to search for. |
| `vector_db` | `"qdrant" \| "pinecone" \| "weaviate"` | `"qdrant"` | No | Vector database provider. |
| `collection` | `str \| None` | `None` | No | Collection, index, or class name. |
| `url` | `str \| None` | `None` | No | Vector database URL. |
| `top_k` | `int` | `10` | No | Number of vector search candidates. |
| `embedder` | `str` | `"voyage"` | No | Query embedding provider. |
| `model` | `str` | `"voyage-3"` | No | Query embedding model. |
| `rerank` | `bool` | `False` | No | Rerank retrieved candidates. |
| `reranker` | `str` | `"voyage"` | No | Reranker provider. |
| `reranker_model` | `str` | `"rerank-2-lite"` | No | Reranker model. |
| `rerank_top_k` | `int` | `5` | No | Number of reranked results to return. |

Status

The public SDK currently covers ingestion, chunking, embedding, vector storage, and retrieval. Chat and eval exist internally and will be exposed as public SDK surfaces later.
