Installable RAG + MCP skills framework with a reliability-loop workflow.

rag-ai-scientist

Installable toolkit for local RAG indexing + MCP serving in scientific workflows.


rag-ai-scientist gives you:

  • a CLI to initialize and build a local vector database from your references,
  • an MCP server entrypoint for Cursor/agent integrations,
  • packaged reusable skills under rag_ai_scientist/skills/.

Installation

From source (recommended while developing)

uv venv .venv
source .venv/bin/activate
uv pip install -e .

If uv is not available, fall back to:

python3 -m venv .venv
source .venv/bin/activate
python -m pip install --upgrade pip
python -m pip install -e .

Recommended isolation: keep this in a dedicated environment (for example venvs/rag-ai-scientist) rather than reusing analysis environments such as ecalgnn311.

Verify install

python -m pip show rag-ai-scientist
rag-ai-scientist --help
python -c "import rag_ai_scientist; print(rag_ai_scientist.__version__)"

Quickstart

  1. Initialize configs/references.yaml for your analysis repo:

rag-ai-scientist init-references \
  --project-root . \
  --references-dir /path/to/references

  2. Build the local RAG database:

rag-ai-scientist setup-rag --project-root . --force

  3. Start the MCP server:

rag-ai-scientist mcp --project-root .

CLI Commands

init-references

Creates configs/references.yaml with source paths, chunking, and doc-type rules.

Useful options:

  • --references-dir: path containing .pdf/.md/.txt/.tex/.py/.rst files
  • --collection-name: collection name (default: rag-ai-scientist)
  • --chunk-size, --chunk-overlap
  • --scientific-chunk-size, --scientific-chunk-overlap
  • --force: overwrite an existing config
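
The generated file describes source paths, chunking, and doc-type rules. The exact schema is whatever init-references writes; a hypothetical sketch, with key names inferred from the options above:

```yaml
# Hypothetical sketch -- the actual keys are defined by init-references.
references_dir: /path/to/references
collection_name: rag-ai-scientist
chunking:
  chunk_size: 1000
  chunk_overlap: 200
  scientific_chunk_size: 1500
  scientific_chunk_overlap: 300
doc_types:          # extensions listed in the options above
  - .pdf
  - .md
  - .txt
  - .tex
  - .py
  - .rst
```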

setup-rag

Indexes references and writes ChromaDB to .cursor/rag_db.

Useful options:

  • --force: rebuild from scratch
  • --collection-name: override the configured collection
  • --chunk-size, --chunk-overlap: runtime overrides

mcp

Starts the stdio MCP server for Cursor or compatible MCP clients.

Cursor MCP Configuration

Example ~/.cursor/mcp.json entry:

{
  "mcpServers": {
    "rag-ai-scientist": {
      "command": "rag-ai-scientist",
      "args": ["mcp", "--project-root", "/absolute/path/to/analysis-repo"]
    }
  }
}

Running Agents With Separate Training Environments

If agents should run training/inference scripts and update configs, use two environments in parallel:

  • rag-ai-scientist environment: runs MCP server and agent logic.
  • analysis/training environment: runs model training and inference commands.

This avoids dependency conflicts while still letting agents orchestrate the full workflow for another repository.
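
The core of the pattern is launching commands with an explicit interpreter path instead of ambient shell state. A minimal sketch (the analysis-env path is hypothetical):

```python
import subprocess

def run_in_env(python_path, script_args):
    """Run a script under an explicitly chosen interpreter, e.g. the
    analysis environment's python, instead of whatever is on PATH."""
    return subprocess.run(
        [python_path, *script_args],
        capture_output=True,
        text=True,
    )

# From the MCP/agent environment, target the analysis env explicitly:
# run_in_env("/path/to/analysis-env/bin/python",
#            ["scripts/train.py", "--epochs", "1"])
```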

Recommended architecture

  1. Keep a dedicated environment for rag-ai-scientist:

cd /path/to/rag-ai-scientist-installable
uv venv .venv
source .venv/bin/activate
uv pip install -e .

  2. Keep your analysis repository and its own environment separate:
  • repo: /path/to/analysis-repo
  • env: /path/to/analysis-env (conda or venv)
  3. Start MCP from the rag-ai-scientist environment, but point it to the analysis repo:

rag-ai-scientist mcp --project-root /path/to/analysis-repo

  4. Let agents launch analysis commands explicitly inside the analysis environment (for example via conda run -p), instead of relying on ambient shell state.

Safe command wrapper for agent execution

Create a wrapper script in the analysis repo (example: /path/to/analysis-repo/scripts/run_training.sh) and let agents call only this script:

#!/usr/bin/env bash
set -euo pipefail

ANALYSIS_ENV="/path/to/analysis-env"
ANALYSIS_REPO="/path/to/analysis-repo"

cd "$ANALYSIS_REPO"
exec conda run -p "$ANALYSIS_ENV" python scripts/train.py "$@"

This gives deterministic execution and avoids accidental environment drift.

Guardrails for autonomous edits and runs

  • Restrict editable files to a whitelist (for example configs/**/*.yaml).
  • Keep one output directory per run (runs/<timestamp>_<tag>).
  • Save the exact config snapshot and command used for each run.
  • Use a lock file to prevent concurrent training launches.
  • Require human approval before expensive or long GPU jobs.
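
The run-directory, config-snapshot, and lock-file guardrails can be sketched together in one launcher script. Paths and the config name are hypothetical; flock is the util-linux utility:

```shell
#!/usr/bin/env bash
set -euo pipefail

TAG="${1:-baseline}"
RUN_DIR="runs/$(date +%Y%m%d_%H%M%S)_${TAG}"
mkdir -p "$RUN_DIR"

# Save the exact config snapshot and command used for this run.
if [ -f configs/train.yaml ]; then
  cp configs/train.yaml "$RUN_DIR/config_snapshot.yaml"
fi
echo "python scripts/train.py --config configs/train.yaml" > "$RUN_DIR/command.txt"

# Use a lock file to prevent concurrent training launches.
exec 9>"runs/.train.lock"
if ! flock -n 9; then
  echo "another training run is active" >&2
  exit 1
fi

echo "run directory: $RUN_DIR"
```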

Package Layout

rag_ai_scientist/
  cli.py                  # Installable CLI entrypoint
  mcp_server.py           # MCP server implementation
  skills/                 # Packaged reusable skills
rag/
  index_documents.py      # Indexing backend used by setup-rag
configs/
  references.example.yaml # Example indexing config

Development

python -m pip install -e .
python -m pip install build
python -m build

License

  • Open-source: AGPL-3.0-or-later (LICENSE)
  • Commercial: see LICENSE-COMMERCIAL.md

Security Notes

  • Never commit secrets (.env, API keys, tokens).
  • Keep local vector stores and credentials in gitignored paths.
  • Review indexed sources before sharing databases externally.
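
A minimal .gitignore sketch covering these points (the rag_db path matches what setup-rag writes; the .env entry is a common convention):

```
# local vector store written by setup-rag
.cursor/rag_db/
# secrets
.env
```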
