A staged RAG pipeline for turning websites, documents, and APIs into RAG-ready markdown.

Ragrails

Ragrails is a staged RAG pipeline for turning external sources into clean markdown, then preparing that content for chunking, embedding, retrieval, chat, and evaluation.

The public SDK starts with one class:

from ragrails import RagRails

rag = RagRails()

Current SDK

The ingestion, chunking, and vector storage SDK surfaces are available now.

rag.scrape(...)  # web pages and websites
rag.parse(...)   # local files and folders
rag.fetch(...)   # REST API responses
rag.chunk(...)   # markdown files to RAG chunks
rag.store(...)   # chunk JSON files to a vector DB

Detailed SDK docs:

  1. Ingestion
  2. Chunking
  3. Embedding
  4. Retrieval

Installation

Install the lightweight SDK:

pip install ragrails

This is enough to import the SDK:

from ragrails import RagRails

Install extras for the stage or provider you need.

Stage               Install
URL ingestion       pip install "ragrails[url]"
Document ingestion  pip install "ragrails[docs]"
API ingestion       pip install "ragrails[api]"
Chunking            pip install "ragrails[chunk]"
Store in Qdrant     pip install "ragrails[store-qdrant]"
Store in Pinecone   pip install "ragrails[store-pinecone]"
Store in Weaviate   pip install "ragrails[store-weaviate]"

Provider extras are also available separately:

Provider           Install
Voyage embeddings  pip install "ragrails[voyage]"
Qdrant             pip install "ragrails[qdrant]"
Pinecone           pip install "ragrails[pinecone]"
Weaviate           pip install "ragrails[weaviate]"
OpenAI             pip install "ragrails[openai]"
Anthropic          pip install "ragrails[anthropic]"
Reranking          pip install "ragrails[rerank]"
Everything         pip install "ragrails[all]"

Common workflow installs:

Workflow                                   Install
Scrape URLs, chunk, store in Qdrant        pip install "ragrails[url,chunk,voyage,qdrant]"
Parse documents, chunk, store in Pinecone  pip install "ragrails[docs,chunk,voyage,pinecone]"
Fetch APIs, chunk, store in Weaviate       pip install "ragrails[api,chunk,voyage,weaviate]"
Qdrant storage shortcut                    pip install "ragrails[store-qdrant]"
Pinecone storage shortcut                  pip install "ragrails[store-pinecone]"
Weaviate storage shortcut                  pip install "ragrails[store-weaviate]"
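
The workflow rows follow one pattern: an ingestion extra, then chunk, an embedding provider, and a vector DB. As a rough sketch of that pattern, the helper below composes the install target from those choices. install_target is a hypothetical name and is not part of the Ragrails SDK; the extras names are the ones listed in the tables above.

```python
# Hypothetical helper, not part of the Ragrails SDK.
INGESTION_EXTRAS = {"url", "docs", "api"}
VECTOR_DB_EXTRAS = {"qdrant", "pinecone", "weaviate"}


def install_target(ingestion: str, vector_db: str, embedder: str = "voyage") -> str:
    """Build a requirement string such as 'ragrails[url,chunk,voyage,qdrant]'."""
    if ingestion not in INGESTION_EXTRAS:
        raise ValueError(f"unknown ingestion extra: {ingestion!r}")
    if vector_db not in VECTOR_DB_EXTRAS:
        raise ValueError(f"unknown vector DB extra: {vector_db!r}")
    extras = [ingestion, "chunk", embedder, vector_db]
    return f"ragrails[{','.join(extras)}]"


# Print the pip command for a documents-to-Pinecone workflow.
print(f'pip install "{install_target("docs", "pinecone")}"')
```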

URL ingestion relies on crawl4ai, which pulls in Playwright as a package dependency. You may still need to install the browser binaries:

playwright install

Requirements

Stages from embedding onward call external providers and need their API keys:

export VOYAGE_API_KEY="..."
export OPENAI_API_KEY="..."
export ANTHROPIC_API_KEY="..."

Embedding and retrieval also need a vector database. Ragrails currently ships with Qdrant, Pinecone, and Weaviate adapters behind the same vector store interface.

Vector DB Providers

Choose a vector DB provider before storing, retrieving, chatting, or running evals.

Provider names:

qdrant
pinecone
weaviate

Qdrant

Qdrant is the easiest local option while developing.

docker run -p 6333:6333 qdrant/qdrant
export VECTOR_DB_PROVIDER=qdrant
export VECTOR_DB_URL=http://localhost:6333
export VECTOR_DB_COLLECTION=rag_chunks

Pinecone

Pinecone is the managed vector DB option. Ragrails embeds chunks with its own configured embedding model and stores the resulting dense vectors in a Pinecone serverless index.

export PINECONE_API_KEY="..."
export VECTOR_DB_PROVIDER=pinecone
export VECTOR_DB_COLLECTION=rag-chunks

Optional Pinecone settings:

export PINECONE_CLOUD=aws
export PINECONE_REGION=us-east-1
export PINECONE_NAMESPACE=

For Pinecone, VECTOR_DB_COLLECTION maps to the Pinecone index name. Use lowercase letters, digits, and hyphens only, for example rag-chunks.

Weaviate

Weaviate is another managed or self-hosted vector DB option. Ragrails uses its own embedding model and stores dense vectors in a Weaviate collection configured for self-provided vectors.

For local Weaviate, expose both HTTP and gRPC ports:

docker run -p 8080:8080 -p 50051:50051 cr.weaviate.io/semitechnologies/weaviate:1.36.9
export VECTOR_DB_PROVIDER=weaviate
export VECTOR_DB_URL=http://localhost:8080
export VECTOR_DB_COLLECTION=RagChunks

For Weaviate Cloud:

export WEAVIATE_API_KEY="..."
export VECTOR_DB_PROVIDER=weaviate
export VECTOR_DB_URL="https://your-cluster.weaviate.cloud"
export VECTOR_DB_COLLECTION=RagChunks

For Weaviate, VECTOR_DB_COLLECTION maps to the collection name. Use a name that starts with an uppercase letter, for example RagChunks.
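
The three providers use different naming styles in this guide: rag_chunks for Qdrant, rag-chunks for Pinecone, and RagChunks for Weaviate. The sketch below derives all three from one base name. collection_name is a hypothetical helper, not a Ragrails function, and the underscore style for Qdrant only mirrors the example above rather than a Qdrant requirement.

```python
import re


def collection_name(base: str, provider: str) -> str:
    """Derive a provider-appropriate collection name from a free-form base name."""
    # Split on anything that is not a letter or digit.
    words = [w for w in re.split(r"[^A-Za-z0-9]+", base) if w]
    if not words:
        raise ValueError("base name must contain letters or digits")
    if provider == "qdrant":
        return "_".join(w.lower() for w in words)
    if provider == "pinecone":
        # Lowercase letters, digits, and hyphens only.
        return "-".join(w.lower() for w in words)
    if provider == "weaviate":
        name = "".join(w[0].upper() + w[1:].lower() for w in words)
        if not name[0].isupper():
            raise ValueError("Weaviate names should start with an uppercase letter")
        return name
    raise ValueError(f"unknown provider: {provider!r}")


for provider in ("qdrant", "pinecone", "weaviate"):
    print(provider, collection_name("rag chunks", provider))
```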

After choosing a provider, store chunks with the SDK:

from ragrails import RagRails

result = RagRails().store(
    input_dir="files/output/chunks",
    vector_db="qdrant",
    collection="rag_chunks",
)

print(result.files)
print(result.chunks)
print(result.provider)
print(result.collection)

The lower-level stage runners also read the same environment variables:

uv run python -m ragrails.pipeline.stg_03_embedder
uv run python -m ragrails.pipeline.stg_04_retriever "your query"
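
If you prefer to pass these settings to store() explicitly, a small mapping from the environment variables to keyword arguments is enough. This is a sketch: store_kwargs_from_env is a hypothetical helper, and the fallback to qdrant simply mirrors the documented default of the vector_db parameter.

```python
import os
from typing import Mapping


def store_kwargs_from_env(env: Mapping[str, str] = os.environ) -> dict:
    """Translate VECTOR_DB_* variables into keyword arguments for store()."""
    kwargs = {"vector_db": env.get("VECTOR_DB_PROVIDER", "qdrant")}
    # Only pass values that are actually set, so store() keeps its own defaults.
    if env.get("VECTOR_DB_COLLECTION"):
        kwargs["collection"] = env["VECTOR_DB_COLLECTION"]
    if env.get("VECTOR_DB_URL"):
        kwargs["url"] = env["VECTOR_DB_URL"]
    return kwargs


example_env = {
    "VECTOR_DB_PROVIDER": "weaviate",
    "VECTOR_DB_URL": "http://localhost:8080",
    "VECTOR_DB_COLLECTION": "RagChunks",
}
print(store_kwargs_from_env(example_env))

# Intended use (requires ragrails to be installed):
#   from ragrails import RagRails
#   RagRails().store(input_dir="files/output/chunks", **store_kwargs_from_env())
```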

URL Ingestion

Scrape one exact page:

from ragrails import RagRails

result = RagRails().scrape(
    url="https://example.com/about",
    mode="each",
    output_dir="files/output/web_crawled",
)

print(result.pages)
print(result.files)
print(result.errors)

Crawl a website:

result = RagRails().scrape(
    url="https://example.com",
    mode="full",
    output_dir="files/output/web_crawled",
    max_depth=3,
    max_pages=200,
)
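
To give a feel for what max_depth and max_pages do, here is a generic breadth-first traversal over an in-memory link graph with both limits applied. It is an illustration only: the exact semantics in Ragrails (for example, whether the start page counts as depth 0) are assumptions, and bounded_crawl and the links dictionary are made up for this sketch.

```python
from collections import deque


def bounded_crawl(links, start, max_depth=3, max_pages=200):
    """Return pages reached breadth-first within the depth and page limits.

    Assumes the start page is depth 0 and always counts as the first page.
    """
    visited = [start]
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        page, depth = queue.popleft()
        if depth >= max_depth:
            continue  # do not follow links beyond the depth limit
        for target in links.get(page, []):
            if target in seen:
                continue
            if len(visited) >= max_pages:
                return visited  # page budget exhausted
            seen.add(target)
            visited.append(target)
            queue.append((target, depth + 1))
    return visited


# Toy site: "/" links to "/a" and "/b", and so on.
links = {"/": ["/a", "/b"], "/a": ["/a/1"], "/a/1": ["/a/1/x"], "/b": []}
print(bounded_crawl(links, "/", max_depth=1))
print(bounded_crawl(links, "/", max_depth=3, max_pages=2))
```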

Document Ingestion

Parse a folder of local documents into markdown:

from ragrails import RagRails

result = RagRails().parse(
    folder="files/input",
    output_dir="files/output/docs",
)

print(result.documents)
print(result.files)
print(result.errors)

Parse selected files with custom metadata:

result = RagRails().parse(
    files=[
        {
            "filename": "guide.pdf",
            "title": "Product Guide",
            "description": "Internal product guide.",
        }
    ],
    input_dir="files/input",
    output_dir="files/output/docs",
)

When parsing a folder, Ragrails discovers files with these extensions:

.csv, .docx, .epub, .html, .htm, .ipynb, .json, .md, .msg,
.pdf, .pptx, .rss, .tsv, .txt, .xls, .xlsx, .xml, .zip
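
Before parsing a large folder, it can help to see which files would match this list. The sketch below filters a folder by the extensions above. discoverable_files is a hypothetical helper; recursing into subfolders and ignoring extension case are assumptions, not documented Ragrails behavior.

```python
from pathlib import Path

# Extensions copied from the list above.
SUPPORTED_EXTENSIONS = {
    ".csv", ".docx", ".epub", ".html", ".htm", ".ipynb", ".json", ".md", ".msg",
    ".pdf", ".pptx", ".rss", ".tsv", ".txt", ".xls", ".xlsx", ".xml", ".zip",
}


def discoverable_files(folder):
    """List files under folder whose extension is in SUPPORTED_EXTENSIONS."""
    root = Path(folder)
    return sorted(
        path.relative_to(root).as_posix()
        for path in root.rglob("*")
        if path.is_file() and path.suffix.lower() in SUPPORTED_EXTENSIONS
    )


if Path("files/input").is_dir():
    for name in discoverable_files("files/input"):
        print(name)
```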

API Ingestion

Fetch a REST API response into markdown:

from ragrails import RagRails

result = RagRails().fetch(
    url="https://api.example.com/v1/products",
    title="Products",
    description="Product catalog from the API.",
    output_dir="files/output/api",
)

print(result.pages)
print(result.items)
print(result.files)
print(result.errors)

Pass request options when needed:

result = RagRails().fetch(
    url="https://api.example.com/v1/search",
    method="POST",
    headers={"Authorization": "Bearer <token>"},
    body={"query": "payments"},
    title="Search Results",
)

Chunking

Split markdown files into RAG-ready JSON chunks:

from ragrails import RagRails

result = RagRails().chunk(
    input_dir="files/output/web_crawled",
    output_dir="files/output/chunks",
)

print(result.files)
print(result.chunks)
print(result.output_files)
print(result.errors)

Chunk markdown created by any ingestion method:

result = RagRails().chunk(
    input_dir="files/output/docs",
    output_dir="files/output/chunks/docs",
    chunk_size=1200,
    chunk_overlap=150,
)
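
To see how chunk_size and chunk_overlap interact, consider a plain sliding window: each chunk starts chunk_size minus chunk_overlap units after the previous one. The sketch below computes those windows. It is a generic illustration, not the Ragrails splitter, which may respect markdown structure and min_chunk_length; whether the units are characters or tokens is not stated here, so window_spans treats them as abstract positions.

```python
def window_spans(text_length, chunk_size=2000, chunk_overlap=200):
    """Return (start, end) spans of a fixed-size sliding window with overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    spans = []
    start = 0
    while start < text_length:
        end = min(start + chunk_size, text_length)
        spans.append((start, end))
        if end == text_length:
            break  # the last window reached the end of the text
        start += step
    return spans


# A 3000-unit document with the settings from the example above.
for span in window_spans(3000, chunk_size=1200, chunk_overlap=150):
    print(span)
```

Smaller chunk_size or larger chunk_overlap both increase the number of chunks, and so the number of vectors to embed and store.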

Preview one markdown file in memory:

chunks = RagRails().chunk_file(
    "files/output/docs/guide.md",
)

print(len(chunks))
print(chunks[0]["metadata"])

Store

Embed every chunk JSON file in a folder and store the vectors:

from ragrails import RagRails

result = RagRails().store(
    input_dir="files/output/chunks/docs",
    vector_db="qdrant",
    collection="rag_chunks",
)

print(result.files)
print(result.chunks)
print(result.errors)

Store in Pinecone:

result = RagRails().store(
    input_dir="files/output/chunks/docs",
    vector_db="pinecone",
    collection="rag-chunks",
)

Store in Weaviate:

result = RagRails().store(
    input_dir="files/output/chunks/docs",
    vector_db="weaviate",
    url="http://localhost:8080",
    collection="RagChunks",
)

Store selected chunk files from a folder:

result = RagRails().store(
    input_dir="files/output/chunks/docs",
    files=["001_overview.json", "002_auth.json"],
    vector_db="qdrant",
    collection="rag_chunks",
)

Provider-specific failures to check first:

qdrant    Qdrant is not running, or port 6333 is not exposed.
pinecone  PINECONE_API_KEY is missing, or the index name uses underscores.
weaviate  Weaviate is not running, gRPC 50051 is not exposed, or the collection name is invalid.
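
Several of these failures can be caught before calling store(). The sketch below checks the configuration-level ones and offers a simple TCP probe for the "not running" cases. config_problems and port_open are hypothetical helpers, not part of Ragrails, and the naming checks only encode the rules stated in this guide.

```python
import re
import socket
from typing import Mapping


def config_problems(provider: str, collection: str, env: Mapping[str, str]) -> list:
    """Return human-readable configuration problems for the chosen provider."""
    problems = []
    if provider == "pinecone":
        if not env.get("PINECONE_API_KEY"):
            problems.append("PINECONE_API_KEY is missing")
        if not re.fullmatch(r"[a-z0-9-]+", collection):
            problems.append("Pinecone index names allow only lowercase letters, digits, and hyphens")
    elif provider == "weaviate":
        if not collection[:1].isupper():
            problems.append("Weaviate collection names should start with an uppercase letter")
    elif provider != "qdrant":
        problems.append(f"unknown provider: {provider!r}")
    return problems


def port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


print(config_problems("pinecone", "rag_chunks", {}))
# For local services, probe the ports named above, for example:
#   port_open("localhost", 6333)   # Qdrant
#   port_open("localhost", 50051)  # Weaviate gRPC
```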

Output

The ingestion stages write markdown files, and the chunking stage writes chunk JSON files, to the output directory you choose:

files/output/web_crawled/
files/output/docs/
files/output/api/
files/output/chunks/

By default, markdown files written by scrape, parse, and fetch include Ragrails frontmatter metadata. Pass frontmatter=False to any of these methods when you only want the markdown body.
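
These directories chain together: each stage's output_dir becomes the next stage's input_dir. The sketch below wires scrape, chunk, and store in sequence using only parameters and result fields documented in this guide. ingest_site is a hypothetical wrapper; it takes the RagRails instance as an argument so it can be exercised with a stand-in object.

```python
def ingest_site(rag, url, collection="rag_chunks"):
    """Scrape a site, chunk the markdown, and store the chunks in Qdrant."""
    scraped = rag.scrape(
        url=url,
        mode="full",
        output_dir="files/output/web_crawled",
    )
    # Chunk exactly what the scrape stage wrote.
    chunked = rag.chunk(
        input_dir=scraped.output_dir,
        output_dir="files/output/chunks",
    )
    # Store exactly what the chunk stage wrote.
    stored = rag.store(
        input_dir=chunked.output_dir,
        vector_db="qdrant",
        collection=collection,
    )
    return scraped, chunked, stored


# Intended use (requires ragrails, its extras, and a running Qdrant):
#   from ragrails import RagRails
#   scraped, chunked, stored = ingest_site(RagRails(), "https://example.com")
#   print(scraped.pages, chunked.chunks, stored.chunks)
```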

Result Types

ScrapeResult(
    pages=int,
    failed=int,
    output_dir=str,
    files=list[str],
    dlq_path=str,
    errors=list[str],
)
ParseResult(
    documents=int,
    failed=int,
    output_dir=str,
    files=list[str],
    errors=list[str],
)
ApiIngestResult(
    pages=int,
    items=int,
    failed=int,
    output_dir=str,
    files=list[str],
    errors=list[str],
)
ChunkResult(
    files=int,
    chunks=int,
    output_dir=str,
    output_files=list[str],
    failed=int,
    errors=list[str],
)
StoreResult(
    files=int,
    chunks=int,
    input_dir=str,
    provider=str,
    collection=str,
    errors=list[str],
)
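
Every result type carries an errors list, and most also carry a failed count, so one small check can guard any stage. The sketch below raises when either signals a problem. ensure_ok is a hypothetical helper, and treating a non-empty errors list as a failure is an assumption about how you want to handle partial results.

```python
from types import SimpleNamespace


def ensure_ok(result, stage):
    """Raise RuntimeError if a stage result reports failures or errors."""
    # StoreResult has no 'failed' field, so fall back to 0.
    failed = getattr(result, "failed", 0)
    errors = list(getattr(result, "errors", []))
    if failed or errors:
        detail = "; ".join(errors) or "no error messages recorded"
        raise RuntimeError(f"{stage}: {failed} failed, errors: {detail}")
    return result


# Stand-in objects using the same field names as the result types above.
ok = SimpleNamespace(files=2, chunks=10, errors=[])
bad = SimpleNamespace(pages=3, failed=1, errors=["timeout on /pricing"])

ensure_ok(ok, "store")
try:
    ensure_ok(bad, "scrape")
except RuntimeError as exc:
    print(exc)
```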

API Reference

RagRails().scrape(
    url,
    *,
    mode="each",
    output_dir="files/output/web_crawled",
    frontmatter=True,
    dlq_path="files/output/dlq.json",
    max_depth=3,
    max_pages=200,
)
RagRails().parse(
    files=None,
    *,
    folder=None,
    input_dir="files/input",
    output_dir="files/output/docs",
    frontmatter=True,
)
RagRails().fetch(
    url,
    *,
    title="API Response",
    description="",
    method="GET",
    headers=None,
    params=None,
    body=None,
    pagination=None,
    max_pages=100,
    output_dir="files/output/api",
    frontmatter=True,
)
RagRails().chunk(
    *,
    input_dir="files/output/web_crawled",
    output_dir="files/output/chunks",
    chunk_size=2000,
    chunk_overlap=200,
    min_chunk_length=100,
)
RagRails().chunk_file(
    path,
    *,
    chunk_size=2000,
    chunk_overlap=200,
    min_chunk_length=100,
)
RagRails().store(
    *,
    input_dir="files/output/chunks",
    vector_db="qdrant",
    collection=None,
    url=None,
    files=None,
    batch_size=64,
    embedder="voyage",
    model="voyage-3",
)

Supported vector_db values:

qdrant
pinecone
weaviate

Status

Ingestion, chunking, and vector storage are available through the public SDK. Retrieval, chat, and eval already exist internally and will be exposed next.
