Rosetta CLI for publishing knowledge base content to RAGFlow
Rosetta CLI
Knowledge base publishing and management tools powered by RAGFlow
Overview
This directory contains the Python package for publishing knowledge base content to RAGFlow instances. The CLI supports multi-environment workflows with smart change detection and auto-metadata extraction.
Key Features
- Smart Publishing - MD5 hash-based change detection (~77% faster republishing)
- Modular Architecture - Command pattern with service layer for maintainability
- Tag-in-Title Format - [tag1][tag2] filename.ext for powerful server-side filtering
- Parse Status Tracking - Monitor document parsing progress with visual indicators
- Upsert Semantics - No duplicates, republishing updates existing documents
- Performance Timing - All commands show execution time
- Multi-Environment - Switch between local, dev, and production configs
- API Key Auth - Secure authentication via RAGFlow API keys
- Server-Side Filtering - Reduce network traffic with metadata conditions
Quick Navigation
- Complete Setup Guide: See docs/QUICKSTART.md for detailed setup instructions
- CLI Commands: See CLI Commands for all available commands
- Environment Management: See Environment Management for switching configs
Contents
rosetta-cli/
├── pyproject.toml          # Package metadata + console entrypoint
├── rosetta_cli/            # Installable Python package
│   ├── cli.py              # CLI entry point
│   ├── commands/           # Command implementations
│   ├── services/           # Shared business logic
│   ├── ims_config.py       # Configuration management
│   ├── ims_publisher.py    # Publishing orchestration
│   └── ragflow_client.py   # RAGFlow SDK wrapper
├── env.template            # Environment configuration template
├── tests/                  # CLI unit tests
└── README.md               # This file
Quick Start
Complete setup instructions are in docs/QUICKSTART.md. Here's the quick reference:
Prerequisites
- Python 3.12 (required by ragflow-sdk 0.23.1)
- RAGFlow instance (local via Docker Compose or remote)
- uvx for installed CLI usage
- Root virtual environment configured for local CLI development
Installed Usage
uvx rosetta-cli@latest version
uvx rosetta-cli@latest verify
Local Development
python3 -m venv venv
venv/bin/pip install -r requirements.txt
cp rosetta-cli/.env.dev .env
venv/bin/rosetta-cli verify
CLI Commands
All commands support --env <environment> flag to override the active environment.
Version
uvx rosetta-cli@latest version
Publishing Commands
Publish Knowledge Base Content
# Publish all instructions (only changed files)
uvx rosetta-cli@latest publish ../instructions
# Publish business context
uvx rosetta-cli@latest publish ../business
# Force republish all files (bypass change detection)
uvx rosetta-cli@latest publish ../instructions --force
# Preview changes without publishing
uvx rosetta-cli@latest publish ../instructions --dry-run
# Use different environment
uvx rosetta-cli@latest publish ../instructions --env production
Performance:
- First publish: ~10-15s per file (embedding generation + parsing)
- Subsequent publishes: Only changed files (~77% faster)
- Dry run: Preview in ~2-3s
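The skip-unchanged behavior can be pictured with a small sketch. This is not the CLI's actual code: the function names `content_hash` and `needs_publish` are hypothetical, and the only detail taken from this README is that the content hash is an MD5 digest and that --force bypasses the comparison.

```python
import hashlib
from typing import Optional


def content_hash(data: bytes) -> str:
    """Return the hex MD5 digest of a file's raw bytes."""
    return hashlib.md5(data).hexdigest()


def needs_publish(data: bytes, stored_hash: Optional[str], force: bool = False) -> bool:
    """Publish when forced, when the document is new, or when the content changed."""
    if force or stored_hash is None:
        return True
    return content_hash(data) != stored_hash


original = b"# Agents\n"
stored = content_hash(original)  # hash recorded at the previous publish

print(needs_publish(original, stored))          # False: content unchanged
print(needs_publish(b"# Agents v2\n", stored))  # True: content changed
```

Comparing digests is cheap next to uploading and re-parsing a document, which is presumably where the time saved on subsequent publishes comes from.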
What gets published:
File: /instructions/agents/r1/agents.md
Published as:
Document ID: b0ec4d56-6cc5-5bbd-9868-5d49afa2a7d8 (UUID from path)
Title: [instructions][agents][r1] agents.md
Dataset: aia-r1 (from template: aia-{release})
Tags: ["instructions", "agents", "r1"] (in metadata)
Domain: instructions (first folder)
Release: r1 (auto-detected from path)
Content Hash: abc123... (MD5 of content)
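A path-derived document ID is what makes upserts possible: the same file always maps to the same ID. The sketch below shows one way to get that property with a name-based UUID. The example ID above has version digit 5, which is consistent with `uuid.uuid5`, but the namespace and the path normalization used here are assumptions, so this code will not necessarily reproduce the CLI's real IDs.

```python
import uuid

# Assumption: the namespace is a placeholder; the CLI's real namespace is not documented here.
NAMESPACE = uuid.NAMESPACE_URL


def document_id(relative_path: str) -> str:
    """Derive a stable ID from a workspace-relative path (hypothetical helper)."""
    # Normalize separators and leading slashes so equivalent paths give one ID.
    normalized = relative_path.replace("\\", "/").lstrip("/")
    return str(uuid.uuid5(NAMESPACE, normalized))


first = document_id("/instructions/agents/r1/agents.md")
second = document_id("instructions/agents/r1/agents.md")
print(first == second)  # True: the same path always yields the same ID
```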
Trigger Document Parsing
Re-parse documents without re-uploading (useful for changing parser settings):
# Parse all unparsed documents
uvx rosetta-cli@latest parse
# Parse specific dataset
uvx rosetta-cli@latest parse --dataset aia-r1
# Force re-parse ALL documents
uvx rosetta-cli@latest parse --dataset aia-r1 --force
# Preview without parsing (dry run)
uvx rosetta-cli@latest parse --dataset aia-r1 --dry-run
List Documents
# List documents in default dataset
uvx rosetta-cli@latest list-dataset
# List specific dataset
uvx rosetta-cli@latest list-dataset --dataset aia-r1
Output shows:
- Document title (with tag prefixes)
- Document ID, file size, parse status, chunk count
- Metadata (tags, domain, release, source path)
Cleanup Dataset
# Preview cleanup without deleting
uvx rosetta-cli@latest cleanup-dataset --dataset aia-r1 --dry-run
# Cleanup documents with specific prefix
uvx rosetta-cli@latest cleanup-dataset --dataset aia-r1 --prefix "aqa-phase" --dry-run
# Cleanup documents with specific tags (space-separated)
uvx rosetta-cli@latest cleanup-dataset --dataset aia-r1 --tags "r1 agents" --dry-run
# Cleanup documents with specific tags (comma-separated)
uvx rosetta-cli@latest cleanup-dataset --dataset aia-r1 --tags "r1,agents" --dry-run
# Force cleanup without confirmation
uvx rosetta-cli@latest cleanup-dataset --dataset aia-r1 --force
# Force cleanup with prefix
uvx rosetta-cli@latest cleanup-dataset --dataset aia-r1 --prefix "aqa-phase" --force
# Force cleanup with tags
uvx rosetta-cli@latest cleanup-dataset --dataset aia-r1 --tags "r1,agents" --force
Warning: Without --prefix or --tags, this deletes ALL documents. Use --dry-run first.
Filtering Options:
- --prefix: Match documents by title prefix (e.g., "[instructions]")
- --tags: Match documents by metadata tags (e.g., "r1 agents" or "r1,agents")
- Uses OR logic: finds documents with ANY of the specified tags
- Server-side filtering for efficiency
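To make the --tags semantics concrete, here is a minimal sketch of how a space- or comma-separated tag string could be split and matched with OR logic. The helpers `parse_tags` and `matches_any` are hypothetical, and the matching is shown client-side only for illustration; the CLI itself performs the filtering on the server.

```python
import re
from typing import Iterable, List


def parse_tags(raw: str) -> List[str]:
    """Split a tag string on commas and/or whitespace, dropping empty pieces."""
    return [tag for tag in re.split(r"[,\s]+", raw.strip()) if tag]


def matches_any(doc_tags: Iterable[str], wanted: Iterable[str]) -> bool:
    """OR logic: a document matches if it carries at least one wanted tag."""
    return bool(set(doc_tags) & set(wanted))


wanted = parse_tags("r1,agents")
print(parse_tags("r1 agents") == wanted)                # True: both separators give the same tags
print(matches_any(["instructions", "r1"], wanted))      # True: shares "r1"
print(matches_any(["business", "project"], wanted))     # False: no tag in common
```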
Verification Commands
Verify Connection
uvx rosetta-cli@latest verify
# Check production environment
uvx rosetta-cli@latest verify --env production
Checks:
- API key validity
- RAGFlow server connectivity
- System health (database, Redis, document engine)
- Available datasets
Environment Management
Configuration Files
| File | Environment | Purpose |
|---|---|---|
| env.template | Template | Create new environments |
| .env | Active | Current configuration (gitignored) |
| .env.local | Local | Local RAGFlow development |
| .env.remote | Remote | Production RAGFlow instance |
Switch Environments
Method 1: Copy file to .env (recommended)
# Switch to local
cp .env.local .env
# Switch to production
cp .env.remote .env
# Check current environment
grep "ENVIRONMENT=" .env
Method 2: Use --env flag (temporary override)
uvx rosetta-cli@latest list-dataset --env local
uvx rosetta-cli@latest publish ../instructions --env production
Environment Variables
# Required
RAGFLOW_BASE_URL=http://your-ragflow-instance
RAGFLOW_API_KEY=ragflow-xxx...
ENVIRONMENT=local
# Dataset Configuration
RAGFLOW_DATASET_DEFAULT=aia
RAGFLOW_DATASET_TEMPLATE=aia-{release}
# Embedding Model (optional)
RAGFLOW_EMBEDDING_MODEL=text-embedding-3-small@OpenAI
# Chunking Configuration (optional)
RAGFLOW_CHUNK_METHOD=naive
RAGFLOW_CHUNK_TOKEN_NUM=512
RAGFLOW_DELIMITER=\n
RAGFLOW_AUTO_KEYWORDS=0
RAGFLOW_AUTO_QUESTIONS=0
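Before running a command against a new environment file, it can help to check that the three required variables are present. The sketch below is a simplified stand-in, not the CLI's own loader: `parse_env` and `missing_required` are hypothetical names, and the parser handles only plain KEY=VALUE lines and # comments (no quoting or variable expansion).

```python
from typing import Dict, List

# The variables listed under "# Required" above.
REQUIRED = ("RAGFLOW_BASE_URL", "RAGFLOW_API_KEY", "ENVIRONMENT")


def parse_env(text: str) -> Dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    values: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def missing_required(values: Dict[str, str]) -> List[str]:
    """Return the required variables that are absent or empty."""
    return [key for key in REQUIRED if not values.get(key)]


sample = """
# Required
RAGFLOW_BASE_URL=http://your-ragflow-instance
RAGFLOW_API_KEY=
ENVIRONMENT=local
"""
print(missing_required(parse_env(sample)))  # ['RAGFLOW_API_KEY']
```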
Creating New Environments
cp env.template .env.staging
nano .env.staging
uvx rosetta-cli@latest verify --env staging
Architecture
Key Components
RAGFlowClient (ragflow_client.py)
Wrapper around ragflow-sdk:
from pathlib import Path

from rosetta_cli.ragflow_client import RAGFlowClient, DocumentMetadata

client = RAGFlowClient(api_key="ragflow-xxx", base_url="http://your-ragflow-instance")

# Dataset management
client.create_dataset(name="aia-r1", description="Release 1")
client.get_dataset(name="aia-r1")
client.list_datasets()

# Document upload with change detection
client.upload_document(
    file_path=Path("agents.md"),
    metadata=DocumentMetadata(...),
    dataset_id="dataset-id",
    force=False,  # Skip if unchanged
)

# Health check
client.verify_connection()
client.get_system_health()
IMSConfig (ims_config.py)
Configuration management with smart .env discovery:
from rosetta_cli.ims_config import IMSConfig
# Auto-discover .env (searches cwd, script dir, git root)
config = IMSConfig.from_env()
# Use specific environment
config = IMSConfig.from_env(environment="production")
# Validate configuration
config.validate()
ContentPublisher (ims_publisher.py)
Publishing logic with metadata extraction:
from pathlib import Path

from rosetta_cli.ims_publisher import ContentPublisher

# client and config are the RAGFlowClient and IMSConfig objects shown above;
# workspace_root is a Path to the workspace root.
publisher = ContentPublisher(client, config, workspace_root)
results = publisher.publish(
    content_path=Path("../instructions"),
    force=False,        # False: skip unchanged files
    dry_run=False,      # True: preview without publishing
    no_parse=False,     # True: skip parsing after upload
    parse_timeout=300,  # Parse timeout in seconds
)
print(f"Published: {results.published_count}")
print(f"Skipped: {results.skipped_count}")
print(f"Failed: {results.failed_count}")
Metadata Extraction:
File: /instructions/agents/r1/bootstrap.md
Extracted:
Tags: ["instructions", "agents", "r1"]
Domain: instructions
Release: r1
Title: bootstrap.md
Content Hash: abc123... (MD5)
Document ID: uuid-from-path
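The extraction above can be approximated in a few lines. This is a sketch under stated assumptions, not the package's implementation: `extract_metadata` is a hypothetical helper, and the rule that a release is a folder named `r` followed by digits is inferred from the `r1` and `r2` examples in this README.

```python
import re
from pathlib import PurePosixPath
from typing import Any, Dict

# Assumption: release folders look like r1, r2, ...
RELEASE_PATTERN = re.compile(r"^r\d+$")


def extract_metadata(relative_path: str) -> Dict[str, Any]:
    """Derive tags, domain, release, and title from a workspace-relative path."""
    path = PurePosixPath(relative_path.lstrip("/"))
    folders = list(path.parts[:-1])  # every path component except the file name
    release = next((part for part in folders if RELEASE_PATTERN.match(part)), None)
    return {
        "tags": folders,
        "domain": folders[0] if folders else None,
        "release": release,
        "title": path.name,
    }


print(extract_metadata("/instructions/agents/r1/bootstrap.md"))
# {'tags': ['instructions', 'agents', 'r1'], 'domain': 'instructions', 'release': 'r1', 'title': 'bootstrap.md'}
```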
Tag-in-Title Format
What is Tag-in-Title?
Documents are stored with tags as prefixes for server-side filtering:
Format: [tag1][tag2][tag3] filename.ext
Examples:
[instructions][agents][r1] agents.md
[business][project] RFP.pdf
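The format is simple enough to build and parse with a pair of small functions. The following is an illustrative sketch; `build_title` and `split_title` are hypothetical helpers, not part of the package's documented API.

```python
import re
from typing import List, Tuple

# A run of [tag] groups at the start, optional whitespace, then the file name.
TITLE_RE = re.compile(r"^((?:\[[^\[\]]+\])*)\s*(.*)$")


def build_title(tags: List[str], filename: str) -> str:
    """Prefix the file name with one [tag] group per tag."""
    prefix = "".join(f"[{tag}]" for tag in tags)
    return f"{prefix} {filename}" if prefix else filename


def split_title(title: str) -> Tuple[List[str], str]:
    """Recover the tag list and the file name from a tagged title."""
    prefix, filename = TITLE_RE.match(title).groups()
    return re.findall(r"\[([^\[\]]+)\]", prefix), filename


title = build_title(["instructions", "agents", "r1"], "agents.md")
print(title)                                      # [instructions][agents][r1] agents.md
print(split_title("[business][project] RFP.pdf"))  # (['business', 'project'], 'RFP.pdf')
print(title.startswith("[instructions]"))          # True: prefix filters match on the leading tags
```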
Why Two Locations?
Tags are stored in both title and metadata:
- Title: Fast server-side keyword search
- Metadata: Precise client-side filtering with complex queries
How Tags are Generated
Tags come from folder structure only:
File: /instructions/agents/r1/bootstrap.md
Folders: instructions / agents / r1 / (file)
Tags: [instructions][agents][r1]
Using Tags for Filtering
# Delete all instruction documents
uvx rosetta-cli@latest cleanup-dataset --prefix "[instructions]"
# Delete all r1 agent documents
uvx rosetta-cli@latest cleanup-dataset --prefix "[instructions][agents][r1]"
Usage Examples
Example 1: First-Time Setup
python3 -m venv venv
venv/bin/pip install -r requirements.txt
cp rosetta-cli/env.template .env
nano .env # Add RAGFLOW_BASE_URL and RAGFLOW_API_KEY
uvx rosetta-cli@latest verify
uvx rosetta-cli@latest publish instructions
Example 2: Daily Publishing Workflow
uvx rosetta-cli@latest publish ../instructions --dry-run
uvx rosetta-cli@latest publish ../instructions
uvx rosetta-cli@latest list-dataset
Example 3: Multi-Environment Publishing
# Publish to dev
uvx rosetta-cli@latest publish ../instructions --env dev
# Verify on dev
uvx rosetta-cli@latest verify --env dev
# Publish to production
uvx rosetta-cli@latest publish ../instructions --env production
Example 4: Cleanup and Republish
# Preview deletion
uvx rosetta-cli@latest cleanup-dataset --dataset aia-r1 --dry-run
# Delete all documents
uvx rosetta-cli@latest cleanup-dataset --dataset aia-r1 --force
# Republish everything
uvx rosetta-cli@latest publish ../instructions --force
Example 5: Programmatic Usage
from pathlib import Path

from rosetta_cli.ragflow_client import RAGFlowClient, DocumentMetadata
from rosetta_cli.ims_config import IMSConfig
from rosetta_cli.ims_publisher import ContentPublisher

config = IMSConfig.from_env()
client = RAGFlowClient(
    api_key=config.api_key,
    base_url=config.base_url,
    embedding_model=config.embedding_model,
    chunk_method=config.chunk_method,
    parser_config=config.parser_config,
)
client.verify_connection()

publisher = ContentPublisher(client, config, Path("/path/to/workspace"))
results = publisher.publish(
    content_path=Path("/path/to/workspace") / "instructions",
    force=False,
    dry_run=False,
)
print(f"Published: {results.published_count}, Skipped: {results.skipped_count}")
Troubleshooting
Error: "api_key cannot be empty"
Set RAGFLOW_API_KEY in .env:
nano .env
# Add: RAGFLOW_API_KEY=ragflow-xxxxxxxxxxxxxxxxxxxx
Error: "Invalid API key or expired token"
Generate a new API key:
- Log in to RAGFlow
- Profile → API Keys → Generate New Key
- Update the .env file
Error: "Connection refused"
- Check RAGFlow is running: docker ps | grep ragflow
- Verify URL: grep RAGFLOW_BASE_URL .env
- Test: curl http://your-ragflow-instance/v1/system/healthz
Error: "Module 'ragflow_sdk' not found"
venv/bin/pip install -r requirements.txt
Error: "No .env file found"
cp rosetta-cli/env.template .env
nano .env
Parse Status Shows "FAIL"
- Check document format (PDF, MD, TXT supported)
- Re-trigger parsing: uvx rosetta-cli@latest parse --dataset aia-r1 --force
- Check RAGFlow logs: docker logs ragflow-server
Slow Publishing Performance
- Use a faster embedding model: RAGFLOW_EMBEDDING_MODEL=text-embedding-3-small@OpenAI
- Ensure change detection works (don't use --force)
- Reduce chunk size: RAGFLOW_CHUNK_TOKEN_NUM=256
Documents Not Showing Tags
Tags should appear in title with format [tag1][tag2]:
uvx rosetta-cli@latest list-dataset
# Output: 1. [instructions][agents][r1] agents.md
Performance Tips
1. Use Change Detection
# Good: Only publishes changed files (~77% faster)
uvx rosetta-cli@latest publish ../instructions
# Bad: Republishes everything
uvx rosetta-cli@latest publish ../instructions --force
2. Use Dry Run to Preview
# Preview (fast)
uvx rosetta-cli@latest publish ../instructions --dry-run
# Then publish for real
uvx rosetta-cli@latest publish ../instructions
3. Optimize Chunking
# Faster parsing
RAGFLOW_CHUNK_TOKEN_NUM=256
# Better context
RAGFLOW_CHUNK_TOKEN_NUM=1024
4. Use Selective Cleanup
# Fast: Delete specific documents
uvx rosetta-cli@latest cleanup-dataset --prefix "[instructions][agents]" --force
# Slow: Delete and republish everything
uvx rosetta-cli@latest cleanup-dataset --force
uvx rosetta-cli@latest publish ../instructions --force
5. Monitor Parse Status
uvx rosetta-cli@latest list-dataset | grep "Parse Status"
Advanced Topics
Custom Dataset Naming
The RAGFLOW_DATASET_TEMPLATE supports {release} placeholder:
RAGFLOW_DATASET_TEMPLATE=aia-{release}
# /instructions/r1/file.md → aia-r1
# /instructions/r2/file.md → aia-r2
# /instructions/file.md → aia (default)
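The mapping above amounts to a single template substitution with a fallback. A minimal sketch, assuming a hypothetical `resolve_dataset` helper that receives the values of RAGFLOW_DATASET_TEMPLATE and RAGFLOW_DATASET_DEFAULT plus the release detected from the path:

```python
from typing import Optional


def resolve_dataset(template: str, default: str, release: Optional[str]) -> str:
    """Fill {release} into the template, or fall back to the default dataset."""
    if release is None:
        return default
    return template.format(release=release)


print(resolve_dataset("aia-{release}", "aia", "r1"))  # aia-r1
print(resolve_dataset("aia-{release}", "aia", None))  # aia
```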
Supported File Types
Text files (extracted and chunked):
- Markdown (.md)
- Plain text (.txt)
Binary files (uploaded for storage):
- PDF, Excel, Word, PowerPoint
Environment File Discovery
When a command runs without an explicit configuration, the CLI searches for an environment file in this order:
- Current directory: .env.{environment} or .env
- Script directory: .env.{environment} or .env
- Git root: .env.{environment} or .env
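The search order can be sketched as a walk over candidate paths that stops at the first existing file. The helper names below are hypothetical, and one detail is an assumption: within each directory, the environment-specific file is tried before the plain .env.

```python
from pathlib import Path
from typing import Iterable, List, Optional


def candidate_env_files(search_dirs: Iterable[Path], environment: Optional[str]) -> List[Path]:
    """List candidate files in priority order across the given directories."""
    candidates: List[Path] = []
    for directory in search_dirs:
        if environment:
            # Assumption: .env.{environment} takes precedence over .env in the same directory.
            candidates.append(directory / f".env.{environment}")
        candidates.append(directory / ".env")
    return candidates


def find_env_file(search_dirs: Iterable[Path], environment: Optional[str] = None) -> Optional[Path]:
    """Return the first candidate that exists on disk, or None."""
    return next(
        (path for path in candidate_env_files(search_dirs, environment) if path.is_file()),
        None,
    )


# Usage: pass the directories in search order (cwd, script directory, git root).
print(find_env_file([Path.cwd()], environment="staging"))
```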
Related Documentation
- Complete Setup: docs/QUICKSTART.md - Comprehensive setup guide
- Architecture: docs/CONTEXT.md - System architecture
- Environment Template: env.template - Configuration options
- Requirements: requirements.txt - Python dependencies
File details
Details for the file rosetta_cli-2.0.9.tar.gz.
File metadata
- Download URL: rosetta_cli-2.0.9.tar.gz
- Upload date:
- Size: 50.0 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.13
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 626aa50742f655dfe492b25b1167f8075bb79371d2ef28c1b839ec9cee399892 |
| MD5 | 25347b847ed04fea0da2d378be22c256 |
| BLAKE2b-256 | e5095602ed705100211812e3aa1d4c36df3b618c4f68d840f2d00e1f7ac8b2e2 |
File details
Details for the file rosetta_cli-2.0.9-py3-none-any.whl.
File metadata
- Download URL: rosetta_cli-2.0.9-py3-none-any.whl
- Upload date:
- Size: 53.1 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.13
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 64f5fa0404c425b45ca9a7eb05faaa2a3bdd726af34304c30a48f4a66fd552d2 |
| MD5 | c382c5ca501b69bbca274e166d3acf24 |
| BLAKE2b-256 | b545c0f8aa7ccbca088e848abdc223a2f2026af1b09f1bb97df19a499db85a19 |