
Ragrails


Ragrails is a modular RAG SDK for turning web pages, local documents, and REST API responses into retrieval-ready knowledge bases.

Documentation: https://dev.ragrails.com

It gives you one Python interface for:

  • ingesting URLs, documents, and API responses into markdown
  • chunking markdown into RAG-ready JSON chunks
  • embedding and storing chunks in pluggable vector databases
  • building toward retrieval, chat, and evaluation workflows

from ragrails import RagRails

rag = RagRails()

Install

Ragrails requires Python 3.10 or newer.

Recommended setup:

uv venv --python 3.10 .venv
source .venv/bin/activate
uv pip install ragrails

If you already manage your Python environment:

pip install ragrails

Document and API ingestion are included in the base install. Install extras only for heavier stages or providers.

| Need | Install |
| --- | --- |
| URL ingestion | uv pip install "ragrails[url]" |
| Chunking | uv pip install "ragrails[chunk]" |
| Store in Qdrant | uv pip install "ragrails[store-qdrant]" |
| Store in Pinecone | uv pip install "ragrails[store-pinecone]" |
| Store in Weaviate | uv pip install "ragrails[store-weaviate]" |
| Everything | uv pip install "ragrails[all]" |

Provider extras are also available separately:

| Provider | Install |
| --- | --- |
| Voyage embeddings | uv pip install "ragrails[voyage]" |
| Qdrant | uv pip install "ragrails[qdrant]" |
| Pinecone | uv pip install "ragrails[pinecone]" |
| Weaviate | uv pip install "ragrails[weaviate]" |
| OpenAI | uv pip install "ragrails[openai]" |
| Anthropic | uv pip install "ragrails[anthropic]" |
| Reranking | uv pip install "ragrails[rerank]" |

Quick Start

URL to Vector DB

uv pip install "ragrails[url,chunk,voyage,qdrant]"

URL scraping uses Playwright through crawl4ai. Run browser setup once in the same environment:

from ragrails import RagRails

rag = RagRails()
rag.setup_url()

Then run the pipeline:

from ragrails import RagRails

rag = RagRails()

scraped = rag.scrape(
    url="https://example.com",
    mode="full",
    output_dir="files/output/web_crawled",
)

chunks = rag.chunk(
    input_dir=scraped.output_dir,
    output_dir="files/output/chunks/web",
)

stored = rag.store(
    input_dir=chunks.output_dir,
    vector_db="qdrant",
    collection="rag_chunks",
)

print(stored.chunks)

Documents to Vector DB

uv pip install "ragrails[chunk,voyage,qdrant]"

from ragrails import RagRails

rag = RagRails()

parsed = rag.parse(
    folder="files/input",
    output_dir="files/output/docs",
)

chunks = rag.chunk(
    input_dir=parsed.output_dir,
    output_dir="files/output/chunks/docs",
)

stored = rag.store(
    input_dir=chunks.output_dir,
    vector_db="qdrant",
    collection="rag_chunks",
)

print(stored.chunks)

API to Markdown

from ragrails import RagRails

result = RagRails().fetch(
    url="https://api.example.com/v1/products",
    title="Products",
    output_dir="files/output/api",
)

print(result.files)

SDK Stages

| Stage | Method | Output |
| --- | --- | --- |
| URL ingestion | rag.scrape(...) | Markdown files |
| URL retry | rag.retry_scrape(...) | Retried markdown files |
| Document ingestion | rag.parse(...) | Markdown files |
| API ingestion | rag.fetch(...) | Markdown files |
| Chunking | rag.chunk(...) | JSON chunk files |
| Single-file chunk preview | rag.chunk_file(...) | In-memory chunk dictionaries |
| Vector storage | rag.store(...) | Embedded vectors in a vector DB |


Ingestion

URL Ingestion

result = RagRails().scrape(
    url="https://example.com/about",
    mode="each",
    output_dir="files/output/web_crawled",
)

For full-site crawling:

result = RagRails().scrape(
    url="https://example.com",
    mode="full",
    output_dir="files/output/web_crawled",
    max_depth=3,
    max_pages=200,
)

Failed URL attempts are written to dlq.json inside the output folder by default:

files/output/web_crawled/dlq.json

Retry failed URLs:

result = RagRails().retry_scrape(
    "files/output/web_crawled/dlq.json",
)

Document Ingestion

result = RagRails().parse(
    folder="files/input",
    output_dir="files/output/docs",
)

Supported file extensions for folder discovery:

.csv, .docx, .epub, .html, .htm, .ipynb, .json, .md, .msg,
.pdf, .pptx, .rss, .tsv, .txt, .xls, .xlsx, .xml, .zip
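As a rough illustration, folder discovery presumably filters files by these extensions. The sketch below is an assumption about the behavior, not Ragrails' actual implementation:

```python
from pathlib import Path

# Extension allowlist taken from the list above.
SUPPORTED = {".csv", ".docx", ".epub", ".html", ".htm", ".ipynb", ".json",
             ".md", ".msg", ".pdf", ".pptx", ".rss", ".tsv", ".txt",
             ".xls", ".xlsx", ".xml", ".zip"}

def discover(folder: str) -> list[Path]:
    """Recursively collect files whose suffix matches a supported extension."""
    return sorted(p for p in Path(folder).rglob("*")
                  if p.suffix.lower() in SUPPORTED)
```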

API Ingestion

result = RagRails().fetch(
    url="https://api.example.com/v1/search",
    method="POST",
    headers={"Authorization": "Bearer <token>"},
    body={"query": "payments"},
    title="Search Results",
    output_dir="files/output/api",
)

Chunking

result = RagRails().chunk(
    input_dir="files/output/docs",
    output_dir="files/output/chunks/docs",
    chunk_size=2000,
    chunk_overlap=200,
)

Preview one markdown file in memory:

chunks = RagRails().chunk_file(
    "files/output/docs/guide.md",
)

Vector Storage

Ragrails currently supports Qdrant, Pinecone, and Weaviate as storage providers.

Set provider credentials as needed:

export VOYAGE_API_KEY="..."
export PINECONE_API_KEY="..."
export WEAVIATE_API_KEY="..."

Qdrant local example:

docker run -p 6333:6333 qdrant/qdrant

result = RagRails().store(
    input_dir="files/output/chunks/docs",
    vector_db="qdrant",
    url="http://localhost:6333",
    collection="rag_chunks",
)

Pinecone example:

result = RagRails().store(
    input_dir="files/output/chunks/docs",
    vector_db="pinecone",
    collection="rag-chunks",
)

Weaviate example:

result = RagRails().store(
    input_dir="files/output/chunks/docs",
    vector_db="weaviate",
    url="http://localhost:8080",
    collection="RagChunks",
)

Provider naming rules:

| Provider | Collection name |
| --- | --- |
| Qdrant | Any valid Qdrant collection name, for example rag_chunks |
| Pinecone | Lowercase letters, digits, and hyphens, for example rag-chunks |
| Weaviate | Starts with an uppercase letter, for example RagChunks |
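These rules can be checked up front before calling store(). The patterns below mirror the table; whether Ragrails itself validates names this way is an assumption:

```python
import re

# Regex per provider, derived from the naming rules above (illustrative).
RULES = {
    "qdrant": re.compile(r"[A-Za-z0-9_-]+"),                   # permissive
    "pinecone": re.compile(r"[a-z0-9]([a-z0-9-]*[a-z0-9])?"),  # lowercase, digits, hyphens
    "weaviate": re.compile(r"[A-Z][A-Za-z0-9_]*"),             # leading uppercase letter
}

def collection_name_ok(provider: str, name: str) -> bool:
    """Return True if the name matches the provider's naming convention."""
    return bool(RULES[provider].fullmatch(name))
```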

Result Types

ScrapeResult(
    pages=int,
    failed=int,
    output_dir=str,
    files=list[str],
    dlq_path=str,
    errors=list[str],
)
ParseResult(
    documents=int,
    failed=int,
    output_dir=str,
    files=list[str],
    errors=list[str],
)
ApiIngestResult(
    pages=int,
    items=int,
    failed=int,
    output_dir=str,
    files=list[str],
    errors=list[str],
)
ChunkResult(
    files=int,
    chunks=int,
    output_dir=str,
    output_files=list[str],
    failed=int,
    errors=list[str],
)
StoreResult(
    files=int,
    chunks=int,
    input_dir=str,
    provider=str,
    collection=str,
    errors=list[str],
)
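The result types behave like plain dataclasses. A minimal stand-in for StoreResult (not the actual Ragrails class) shows the intended usage pattern of checking errors after a stage:

```python
from dataclasses import dataclass, field

@dataclass
class StoreResultSketch:
    """Illustrative mirror of StoreResult's documented fields."""
    files: int
    chunks: int
    input_dir: str
    provider: str
    collection: str
    errors: list[str] = field(default_factory=list)

result = StoreResultSketch(files=3, chunks=120,
                           input_dir="files/output/chunks/docs",
                           provider="qdrant", collection="rag_chunks")
if result.errors:
    # Surface partial failures before moving to the next pipeline stage.
    raise RuntimeError(result.errors)
```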

Parameter Reference

setup_url()

| Parameter | Type | Default | Required | Description |
| --- | --- | --- | --- | --- |
| browser | str | "chromium" | No | Playwright browser binary to install for URL scraping. |

scrape()

| Parameter | Type | Default | Required | Description |
| --- | --- | --- | --- | --- |
| url | str \| list[str] | - | Yes | URL or URLs to scrape. |
| mode | "each" \| "full" | "each" | No | Scrape exact URLs or crawl full sites. |
| output_dir | str | "files/output/web_crawled" | No | Markdown output folder. |
| frontmatter | bool | True | No | Add source metadata to markdown files. |
| dlq_path | str \| None | None | No | Custom DLQ file. Defaults to <output_dir>/dlq.json. |
| max_depth | int | 3 | No | Crawl depth for mode="full". |
| max_pages | int | 200 | No | Maximum pages per site. |

retry_scrape()

| Parameter | Type | Default | Required | Description |
| --- | --- | --- | --- | --- |
| dlq_path | str | - | Yes | DLQ file to retry. |
| mode | "each" \| "full" | "each" | No | Retry as exact pages or full-site crawls. |
| max_depth | int | 3 | No | Crawl depth for mode="full". |
| max_pages | int | 200 | No | Maximum pages per site. |
| max_attempts | int | 3 | No | Retry entries below this attempt count. |
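The max_attempts cutoff amounts to a simple filter over DLQ entries. The entry shape below is hypothetical (the real dlq.json format is not documented here), but the selection logic it sketches matches the description:

```python
def retryable(entries: list[dict], max_attempts: int = 3) -> list[dict]:
    """Keep DLQ entries whose attempt count is still below max_attempts.

    Assumes each entry carries an 'attempts' counter; this key name is
    an illustration, not the documented dlq.json schema.
    """
    return [e for e in entries if e.get("attempts", 0) < max_attempts]
```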

parse()

| Parameter | Type | Default | Required | Description |
| --- | --- | --- | --- | --- |
| files | str \| list[str \| dict] \| None | None | Conditional | Specific files to parse. |
| folder | str \| None | None | Conditional | Folder of supported files to parse. |
| input_dir | str | "files/input" | No | Base folder for files. |
| output_dir | str | "files/output/docs" | No | Markdown output folder. |
| frontmatter | bool | True | No | Add document metadata to markdown files. |

fetch()

| Parameter | Type | Default | Required | Description |
| --- | --- | --- | --- | --- |
| url | str | - | Yes | API endpoint URL. |
| title | str | "API Response" | No | Output metadata title. |
| description | str | "" | No | Output metadata description. |
| method | str | "GET" | No | HTTP method. |
| headers | dict \| None | None | No | Request headers. |
| params | dict \| None | None | No | Query parameters. |
| body | dict \| None | None | No | JSON request body. |
| pagination | dict \| None | None | No | Pagination configuration. |
| max_pages | int | 100 | No | Maximum API pages to fetch. |
| output_dir | str | "files/output/api" | No | Markdown output folder. |
| frontmatter | bool | True | No | Add API metadata to markdown files. |
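The pagination config schema is not documented here, but max_pages presumably caps a loop of page requests. A generic sketch of that interaction, with get_page standing in for one hypothetical API call:

```python
from typing import Callable, Optional

def fetch_pages(get_page: Callable[[int], Optional[dict]],
                max_pages: int = 100) -> list[dict]:
    """Illustrative pagination loop: request page after page until the
    source is exhausted (get_page returns None) or max_pages is hit."""
    pages = []
    for i in range(max_pages):
        page = get_page(i)
        if page is None:
            break
        pages.append(page)
    return pages
```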

chunk()

| Parameter | Type | Default | Required | Description |
| --- | --- | --- | --- | --- |
| input_dir | str | "files/output/web_crawled" | No | Folder containing markdown files. |
| output_dir | str | "files/output/chunks" | No | JSON chunk output folder. |
| chunk_size | int | 2000 | No | Target maximum chunk size. |
| chunk_overlap | int | 200 | No | Overlap between chunks. |
| min_chunk_length | int | 100 | No | Minimum chunk length to keep. |
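To see how chunk_size, chunk_overlap, and min_chunk_length interact, here is a naive character-window chunker. It is only a sketch of the parameters' meaning; Ragrails' actual chunker is presumably markdown-aware rather than character-based:

```python
def chunk_text(text: str, chunk_size: int = 2000, chunk_overlap: int = 200,
               min_chunk_length: int = 100) -> list[str]:
    """Sliding-window chunking: each chunk starts chunk_size - chunk_overlap
    characters after the previous one, so consecutive chunks share
    chunk_overlap characters. Chunks shorter than min_chunk_length are dropped."""
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        piece = text[start:start + chunk_size]
        if len(piece) >= min_chunk_length:
            chunks.append(piece)
    return chunks
```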

chunk_file()

| Parameter | Type | Default | Required | Description |
| --- | --- | --- | --- | --- |
| path | str | - | Yes | Markdown file path to chunk in memory. |
| chunk_size | int | 2000 | No | Target maximum chunk size. |
| chunk_overlap | int | 200 | No | Overlap between chunks. |
| min_chunk_length | int | 100 | No | Minimum chunk length to keep. |

store()

| Parameter | Type | Default | Required | Description |
| --- | --- | --- | --- | --- |
| input_dir | str | "files/output/chunks" | No | Folder of chunk JSON files. |
| vector_db | "qdrant" \| "pinecone" \| "weaviate" | "qdrant" | No | Vector database provider. |
| collection | str \| None | None | No | Collection, index, or class name. |
| url | str \| None | None | No | Vector database URL. |
| files | str \| list[str] \| None | None | No | Selected chunk files to store. |
| batch_size | int | 64 | No | Chunks per embedding/storage batch. |
| embedder | str | "voyage" | No | Embedding provider. |
| model | str | "voyage-3" | No | Embedding model name. |

Status

The public SDK currently covers ingestion, chunking, and vector storage. Retrieval, chat, and evaluation exist internally and will be exposed as public SDK surfaces next.
