A CLI tool for creating, managing, and querying .arag files for RAG applications

aRAG CLI Tool

aRAG, or arag, is a command-line interface (CLI) tool for creating, managing, and querying a custom file type called .arag. This tool enables users to package content into a structured format, process it into a searchable corpus, generate embeddings for vector-based querying, and retrieve information efficiently. It currently supports both local and OpenAI-based embedding methods and includes features for content management, packaging, and interactive usage.

The goal of the arag file type is to provide a simple, self-contained way to build localized vector databases that are easy to use with RAG and LLMs. Imagine, for example, downloading the entire documentation for a programming language or package, generating an arag with a couple of clicks, and then dragging and dropping that file into your AI chats, giving the model the information without filling up your context window. The current plan is to add support for arag files to popular LLM chats (like chatgpt.com) via a browser extension to further increase their usefulness.

Features

  • Create .arag Files: Generate .arag directories or packaged archives with a custom structure.
  • Content Management: Add, delete, list, and clean content within .arag files.
  • Corpus Processing: Convert content into a SQLite-based searchable corpus with chunking support.
  • Embedding Generation: Index content using OpenAI or local SentenceTransformer models for vector search.
  • Vector Querying: Perform similarity searches on indexed content using a query string.
  • Packaging: Compress .arag directories into .arag archives and unpackage them as needed.
  • Interactive Mode: Open an .arag file and manage it interactively.
  • Custom VFS: Query packaged .arag files directly using a SQLite Virtual File System (VFS).

Installation

Prerequisites

  • Python (3.11 was used for development)
  • pip package manager

Steps

  1. Clone the Repository:

    git clone https://github.com/jmelovich/arag-cli.git
    cd arag-cli
    
  2. Install the Package:

    pip install .
    

    To include optional support for local embeddings (SentenceTransformer), use:

    pip install ".[local_embeddings]"
    
  3. Verify Installation:

    arag --help
    

    This should display the CLI help message with available commands.

Usage

The arag CLI provides a variety of subcommands to manage arag files. Below is an overview of the commands and their usage.

Commands

create

Create a new arag directory, spec file, or packaged .arag from a spec.

  • Create a Directory:

    arag create dir myarag /path/to/directory
    

    Creates a myarag-arag directory at the specified path. An arag directory works the same as an .arag file, but is not read-only (until it is packaged into a file). This is the principal way to create an arag.

  • Create a Spec File:

    arag create spec /path/to/example.arag-json
    

    Generates a template .arag-json file. You can modify this template to set all the settings needed to create a .arag file.

  • Create from Spec:

    arag create from-spec /path/to/spec.arag-json
    

    Builds a packaged .arag file based on the spec file. This is the easiest way to create an arag file.

content

Manage content within an .arag directory (not supported for packaged files). Content is whatever you want indexed: for now, any text files, PDFs, or DOCX files.

  • Add Content:

    arag content add myfile.txt --arag /path/to/myarag-arag
    

    Adds myfile.txt to the content folder. Directories are also supported: all files in the given directory are added recursively.

  • Delete Content:

    arag content del myfile.txt --arag /path/to/myarag-arag
    

    Removes myfile.txt from the content folder. Also works with directories.

  • List Contents:

    arag content ls --arag /path/to/myarag-arag
    

    Lists all files in the content folder.

  • Corpify Content:

    arag content corpify --arag /path/to/myarag-arag --chunk-size 8192 --force
    

    Processes content into corpus.db using the specified chunk size. The --force flag overwrites any existing corpus. The --chunk-size argument sets the size, in bytes, at which each entry (file) added to the corpus is split into separate rows; the default of 8192 is typically fine.

  • Clean Content:

    arag content clean --arag /path/to/myarag-arag
    

    Removes files from the content folder that are not present in corpus.db. This is recommended so that unused files do not waste space.
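To make the effect of --chunk-size concrete, here is a minimal sketch of fixed-size byte chunking in Python. It is not arag's actual implementation: the function name chunk_bytes is hypothetical, and the real tool may choose its split points differently (for example, to avoid cutting a multi-byte character in half).

```python
def chunk_bytes(data: bytes, chunk_size: int = 8192) -> list:
    """Split data into consecutive pieces of at most chunk_size bytes.

    Hypothetical helper for illustration; each returned piece would
    correspond to one row in the corpus.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


if __name__ == "__main__":
    document = ("lorem ipsum " * 2000).encode("utf-8")  # 24,000 bytes
    pieces = chunk_bytes(document, chunk_size=8192)
    # Show how many rows this document would occupy and their sizes
    print(len(pieces), [len(p) for p in pieces])
```

A smaller chunk size yields more, shorter rows per file, so each query hit is more focused but carries less surrounding context.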

index

Generate embeddings for the corpus.

  • Index with OpenAI:

    arag index --arag /path/to/myarag-arag --method openai --api-key YOUR_API_KEY
    

    Indexes using OpenAI embeddings. The --api-key flag is optional if you have an API key set in an environment variable called OPENAI_API_KEY.

  • Index Locally:

    arag index --arag /path/to/myarag-arag --method local
    

    Uses the default SentenceTransformer model. Pass the --model argument to choose a different model, given as a Hugging Face name such as sentence-transformers/all-MiniLM-L6-v2.
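The API key lookup described for the OpenAI method can be sketched as follows. This is an illustration only: resolve_api_key is a hypothetical name, and giving the explicit value precedence over the environment variable when both are set is an assumption, not something this README states.

```python
import os
from typing import Optional


def resolve_api_key(cli_value: Optional[str] = None) -> str:
    """Return the explicit key if given, else fall back to OPENAI_API_KEY.

    Assumption: an explicit --api-key value wins over the environment.
    """
    key = cli_value or os.environ.get("OPENAI_API_KEY")
    if not key:
        raise RuntimeError("No API key: pass --api-key or set OPENAI_API_KEY")
    return key
```

Keeping the key in the environment avoids leaving it in shell history or in files that might be shared.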

query

Search the corpus with a query string.

  • Query with Results:

    arag query "search term" --arag /path/to/myarag.arag --topk 3
    

    Returns the top 3 matching chunks along with their content. --topk defaults to 1.

  • Query with File Paths:

    arag query "search term" --arag /path/to/myarag.arag --get-file
    

    Returns just file paths instead of content.
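For readers unfamiliar with vector search, the following sketch shows the general idea behind a top-k similarity query. It assumes cosine similarity as the metric, which this README does not confirm, and uses toy two-dimensional vectors in place of real embeddings; the names cosine_similarity and top_k are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # a zero vector has no direction to compare
    return dot / (norm_a * norm_b)


def top_k(query_vec, chunks, k=1):
    """Rank (text, vector) pairs against query_vec; return the k best (score, text)."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


if __name__ == "__main__":
    # Toy stand-ins for embedded chunks
    chunks = [
        ("install guide", [1.0, 0.0]),
        ("license text", [0.0, 1.0]),
        ("setup notes", [1.0, 1.0]),
    ]
    for score, text in top_k([1.0, 0.0], chunks, k=2):
        print(f"{score:.3f}  {text}")
```

The k parameter plays the role of --topk: with k=1 only the single closest chunk is returned.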

package

Package an .arag directory into a .arag file.

  • Package Directory:

    arag package /path/to/myarag-arag --remove-original
    

    Creates myarag.arag and removes the original directory.

unpackage

Unpackage a .arag file into a directory.

  • Unpackage File:

    arag unpackage /path/to/myarag.arag --remove-original
    

    Extracts to myarag-arag and removes the original file.

open

Enter interactive mode with an .arag file or directory.

  • Open a File:

    arag open /path/to/myarag.arag
    

    Starts an interactive shell for managing the .arag.

Interactive Mode

Run arag open <path> to interact with an .arag file or directory. Inside the shell, commands are entered without the arag prefix and without an --arag argument:

> content ls
> content add myfile.txt
> query "find this" --topk 2
> close

Type quit or close to exit.
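One way to picture the relationship between an interactive command and its full CLI form is the sketch below. It is purely illustrative and makes no claim about how arag implements its shell: to_argv is a hypothetical helper, and it only covers the subcommands documented above as taking --arag (content, index, query), not package or unpackage, which take the path as a positional argument.

```python
import shlex


def to_argv(line: str, arag_path: str) -> list:
    """Expand an interactive line into the equivalent full command line.

    Hypothetical helper: prepends the arag program name and appends the
    --arag argument for the currently opened path.
    """
    tokens = shlex.split(line)  # honors quotes, so "find this" stays one token
    return ["arag", *tokens, "--arag", arag_path]


if __name__ == "__main__":
    print(to_argv('query "find this" --topk 2', "./data/myarag.arag"))
```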

Spec File Creation

Use arag create spec <destination> to generate a template .arag-json file at the destination, then modify it:

{
    "arag_name": "myarag",
    "arag_dest": "./myarag.arag",
    "content_include": ["file1.txt", "dir/docs"],
    "clean_content": true,
    "chunk_size": 8192,
    "index_method": "openai",
    "index_model": "text-embedding-3-small",
    "api_key": "YOUR_API_KEY",
    "openai_endpoint": "https://api.openai.com/v1",
    "arag_version": "0.1.0",
    "should_package": true,
    "remove_arag_dir": true
}

Run arag create from-spec <.arag-json-path> to build the .arag file.
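Before running from-spec, it can help to sanity-check an edited spec file. The sketch below compares a spec against the keys and value types shown in the template above. Whether arag itself requires every one of these keys is an assumption, and check_spec and EXPECTED_TYPES are hypothetical names, not part of the tool.

```python
import json

# Keys and value types taken from the template spec shown above.
EXPECTED_TYPES = {
    "arag_name": str,
    "arag_dest": str,
    "content_include": list,
    "clean_content": bool,
    "chunk_size": int,
    "index_method": str,
    "index_model": str,
    "api_key": str,
    "openai_endpoint": str,
    "arag_version": str,
    "should_package": bool,
    "remove_arag_dir": bool,
}


def check_spec(spec: dict) -> list:
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []
    for key, expected in EXPECTED_TYPES.items():
        if key not in spec:
            problems.append(f"missing key: {key}")
            continue
        value = spec[key]
        # bool is a subclass of int in Python, so reject it for integer fields
        wrong_bool = expected is int and isinstance(value, bool)
        if wrong_bool or not isinstance(value, expected):
            problems.append(
                f"{key}: expected {expected.__name__}, got {type(value).__name__}"
            )
    return problems


def check_spec_file(path: str) -> list:
    """Load a .arag-json file and check it."""
    with open(path, encoding="utf-8") as handle:
        return check_spec(json.load(handle))
```

For example, check_spec_file("myarag.arag-json") returns an empty list when the file still matches the template's shape.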

Examples

  1. Full Workflow:

    # Create an arag directory
    arag create dir myarag ./data
    # Open the arag directory in interactive mode
    arag open ./data/myarag-arag
    # Add content
    content add document.pdf
    # Corpify
    content corpify --clean
    # Index locally
    index --method local
    # Package the file and remove this directory
    package --remove-original
    # Open the .arag file
    arag open ./data/myarag.arag
    # Query
    query "important info"
    
  2. Using a Spec File:

    arag create spec myarag.arag-json
    
    # Edit myarag.arag-json as needed
    nano myarag.arag-json
    
    arag create from-spec myarag.arag-json
    

File Structure

An arag directory/file has the following structure:

  • content/: Stores raw files and directories.
  • content_list.txt: Lists all files in content/.
  • corpus.db: SQLite database with chunked content & vector embeddings.
  • index.json: Metadata about embeddings (method, model, etc.).

A packaged .arag file is a special ZIP archive containing these components. Only the content folder is compressed; the other files are stored uncompressed so they can be accessed directly.
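The mixed layout described above, with compressed content and uncompressed metadata, can be reproduced with Python's standard zipfile module. This is a sketch of the general technique, not arag's packaging code: build_archive is a hypothetical name, and the real archive may differ in details not covered by this README.

```python
import io
import zipfile


def build_archive(content_files: dict, stored_files: dict) -> bytes:
    """Build a ZIP in memory with per-entry compression settings.

    content_files: name -> bytes, written under content/ with DEFLATE.
    stored_files:  name -> bytes, written uncompressed (ZIP_STORED).
    """
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for name, data in content_files.items():
            archive.writestr(f"content/{name}", data,
                             compress_type=zipfile.ZIP_DEFLATED)
        for name, data in stored_files.items():
            # Stored entries keep their raw bytes contiguous inside the archive
            archive.writestr(name, data, compress_type=zipfile.ZIP_STORED)
    return buffer.getvalue()


if __name__ == "__main__":
    raw = build_archive(
        {"doc.txt": b"hello " * 1000},
        {"corpus.db": b"\x00" * 64, "index.json": b"{}"},
    )
    with zipfile.ZipFile(io.BytesIO(raw)) as archive:
        for info in archive.infolist():
            print(info.filename, info.compress_type, info.compress_size)
```

Because a stored entry is a byte-for-byte copy of the original file, a reader that knows its offset can presumably work on it in place, which fits the custom VFS feature listed earlier.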

Dependencies

  • Required:

    • apsw: SQLite with custom VFS support.
    • numpy: For vector operations.
    • openai: For OpenAI embeddings.
    • pypdf: For PDF processing.
    • spire.doc: For DOCX processing.
  • Optional:

    • sentence-transformers: For local embeddings (pip install ".[local_embeddings]").

Install additional dependencies as needed for specific file types.

Configuration

  • OpenAI API Key: Set via --api-key or the OPENAI_API_KEY environment variable.
  • Embedding Models: Default models are sentence-transformers/all-MiniLM-L6-v2 (local) and text-embedding-3-small (OpenAI). Override with --model.
  • Chunk Size: Default is 8192 bytes; adjust with --chunk-size.

Contributing

Contributions are welcome! Please:

  1. Fork the repository.
  2. Create a feature branch (git checkout -b feature/yourfeature).
  3. Commit changes (git commit -m "Add your feature").
  4. Push to the branch (git push origin feature/yourfeature).
  5. Open a pull request.

License

This project is licensed under the MIT License. See the LICENSE file for details.
