A CLI tool for creating, managing, and querying .arag files for RAG applications
aRAG CLI Tool
aRAG, or arag, is a command-line interface (CLI) tool for creating, managing, and querying a custom file type called .arag. This tool enables users to package content into a structured format, process it into a searchable corpus, generate embeddings for vector-based querying, and retrieve information efficiently. It currently supports both local and OpenAI-based embedding methods and includes features for content management, packaging, and interactive usage.
The goal of the arag file type is to provide a simple, self-contained way to create localized vector databases that are easy to use with RAG and LLMs. Imagine, for example, downloading the entire documentation for a programming language or package, generating an arag with a couple of clicks, and then dragging and dropping that file into your AI chats, giving the model the information without filling up your context window. The current plan is to add support for arag files to popular LLM chats (like chatgpt.com) via a browser extension to further increase their usefulness.
Features
- Create .arag Files: Generate .arag directories or packaged archives with a custom structure.
- Content Management: Add, delete, list, and clean content within .arag files.
- Corpus Processing: Convert content into a SQLite-based searchable corpus with chunking support.
- Embedding Generation: Index content using OpenAI or local SentenceTransformer models for vector search.
- Vector Querying: Perform similarity searches on indexed content using a query string.
- Packaging: Compress .arag directories into .arag archives and unpackage them as needed.
- Interactive Mode: Open an .arag file and manage it interactively.
- Custom VFS: Query packaged .arag files directly using a SQLite Virtual File System (VFS).
Installation
Prerequisites
- Python (3.11 was used for development)
- pip package manager
Steps
- Clone the Repository:
git clone https://github.com/jmelovich/arag-cli.git
cd arag-cli
- Install the Package:
pip install .
To include optional support for local embeddings (SentenceTransformer), use:
pip install ".[local_embeddings]"
- Verify Installation:
arag --help
This should display the CLI help message with available commands.
Usage
The arag CLI provides a variety of subcommands to manage arag files. Below is an overview of the commands and their usage.
Commands
create
Create a new arag directory, spec file, or packaged .arag from a spec.
- Create a Directory:
arag create dir myarag /path/to/directory
Creates a myarag-arag directory at the specified path. An arag directory works the same as an .arag file, but is not read-only (until packaged into a file). This is the principal way to create an arag.
- Create a Spec File:
arag create spec /path/to/example.arag-json
Generates a template .arag-json file. You can modify this template to configure all the settings needed to create a .arag file.
- Create from Spec:
arag create from-spec /path/to/spec.arag-json
Builds a packaged .arag file based on the spec file. This is the easiest way to create an arag file.
content
Manage content within an .arag directory (not supported for packaged files). Content is whatever you want indexed; for now, that means text files, PDFs, and DOCX files.
- Add Content:
arag content add myfile.txt --arag /path/to/myarag-arag
Adds myfile.txt to the content folder. Directories are also supported; all files in the given directory are added recursively.
- Delete Content:
arag content del myfile.txt --arag /path/to/myarag-arag
Removes myfile.txt from the content folder. Also works with directories.
- List Contents:
arag content ls --arag /path/to/myarag-arag
Lists all files in the content folder.
- Corpify Content:
arag content corpify --arag /path/to/myarag-arag --chunk-size 8192 --force
Processes content into corpus.db with the specified chunk size. The --force flag overwrites any existing corpus. The --chunk-size argument sets the size, in bytes, at which each file added to the corpus is split into separate rows (the default is typically fine).
- Clean Content:
arag content clean --arag /path/to/myarag-arag
Removes files from content that are not present in corpus.db. This is recommended to avoid wasting space.
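To make the --chunk-size behavior concrete, here is a minimal sketch of byte-based chunking. It is not taken from the arag source: the function name chunk_text and the rule of never splitting inside a multi-byte UTF-8 character are assumptions made for illustration.

```python
def chunk_text(text: str, chunk_size: int = 8192) -> list[str]:
    """Split text into pieces whose UTF-8 encoding is at most chunk_size bytes.

    Hypothetical helper; arag's real chunking logic may differ.
    """
    if chunk_size < 4:
        # A single UTF-8 character can take up to 4 bytes.
        raise ValueError("chunk_size must be at least 4 bytes")
    data = text.encode("utf-8")
    chunks = []
    start = 0
    while start < len(data):
        end = min(start + chunk_size, len(data))
        # If the cut lands on a continuation byte (0b10xxxxxx), back up
        # so the chunk ends on a character boundary.
        while end < len(data) and (data[end] & 0xC0) == 0x80:
            end -= 1
        chunks.append(data[start:end].decode("utf-8"))
        start = end
    return chunks
```

Each returned piece would correspond to one row in the corpus under this model; joining the pieces reproduces the original text.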
index
Generate embeddings for the corpus.
- Index with OpenAI:
arag index --arag /path/to/myarag-arag --method openai --api-key YOUR_API_KEY
Indexes using OpenAI embeddings. The --api-key flag is optional if you have an API key set in an environment variable called OPENAI_API_KEY.
- Index Locally:
arag index --arag /path/to/myarag-arag --method local
Uses the default SentenceTransformer model. Pass the --model argument to choose the model, given as a Hugging Face name such as sentence-transformers/all-MiniLM-L6-v2.
query
Search the corpus with a query string.
- Query with Results:
arag query "search term" --arag /path/to/myarag.arag --topk 3
Returns the top 3 matching chunks with their content. --topk defaults to 1.
- Query with File Paths:
arag query "search term" --arag /path/to/myarag.arag --get-file
Returns just file paths instead of content.
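The idea behind --topk can be sketched as a ranking of stored chunk embeddings by similarity to the query embedding. The sketch below assumes cosine similarity and a simple (file_path, chunk_text, embedding) row layout; neither is confirmed by this README, and the function names are hypothetical. It uses only the standard library so it runs without numpy.

```python
import heapq
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, rows, k=1):
    """Return the k best (score, file_path, chunk_text) tuples, best first.

    rows is an iterable of (file_path, chunk_text, embedding) tuples,
    an assumed layout used only for this illustration.
    """
    scored = (
        (cosine_similarity(query_vec, emb), path, text)
        for path, text, emb in rows
    )
    return heapq.nlargest(k, scored, key=lambda item: item[0])
```

With this shape, returning content versus returning file paths (as --get-file does) is just a matter of which field of each result is printed.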
package
Package an .arag directory into a .arag file.
- Package Directory:
arag package /path/to/myarag-arag --remove-original
Creates myarag.arag and removes the original directory.
unpackage
Unpackage a .arag file into a directory.
- Unpackage File:
arag unpackage /path/to/myarag.arag --remove-original
Extracts to myarag-arag and removes the original file.
open
Enter interactive mode with an .arag file or directory.
- Open a File:
arag open /path/to/myarag.arag
Starts an interactive shell for managing the .arag.
Interactive Mode
Run arag open <path> to interact with an .arag file or directory. Commands can be entered without the arag prefix or the --arag argument:
> content ls
> content add myfile.txt
> query "find this" --topk 2
> close
Type quit or close to exit.
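One way to picture interactive mode is as a thin layer that turns each typed line back into a full command for the opened path. The sketch below is an assumption about that behavior, not arag's implementation: expand_interactive_line is a hypothetical name, and the split between commands that take --arag and commands that take a positional path is inferred only from the command forms shown in this README.

```python
import shlex
from typing import Optional

EXIT_COMMANDS = {"quit", "close"}
# In this README, package and unpackage take the path positionally.
POSITIONAL_PATH_COMMANDS = {"package", "unpackage"}


def expand_interactive_line(line: str, arag_path: str) -> Optional[list]:
    """Expand an interactive line into a full argv list.

    Returns None when the user asked to exit and [] for a blank line.
    """
    tokens = shlex.split(line)
    if not tokens:
        return []
    if tokens[0] in EXIT_COMMANDS:
        return None
    if tokens[0] in POSITIONAL_PATH_COMMANDS:
        return ["arag", tokens[0], arag_path, *tokens[1:]]
    # Everything else gets the implicit --arag argument appended.
    return ["arag", *tokens, "--arag", arag_path]
```

Using shlex.split keeps quoted queries such as "find this" together as a single argument, matching how a shell would pass them.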
Spec File Creation
Use arag create spec <destination> to generate a template .arag-json file at the destination, then modify it:
{
  "arag_name": "myarag",
  "arag_dest": "./myarag.arag",
  "content_include": ["file1.txt", "dir/docs"],
  "clean_content": true,
  "chunk_size": 8192,
  "index_method": "openai",
  "index_model": "text-embedding-3-small",
  "api_key": "YOUR_API_KEY",
  "openai_endpoint": "https://api.openai.com/v1",
  "arag_version": "0.1.0",
  "should_package": true,
  "remove_arag_dir": true
}
Run arag create from-spec <.arag-json-path> to build the .arag file.
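If you would rather generate the spec from a script than edit the template by hand, something like the following sketch may help. It uses only the keys that appear in the template; the helper names build_spec and write_spec are hypothetical, and reading the key from OPENAI_API_KEY at generation time is a convenience chosen here, not documented arag behavior.

```python
import json
import os
from pathlib import Path


def build_spec(name: str, content_paths: list) -> dict:
    """Return a spec dict with the same keys as the template."""
    return {
        "arag_name": name,
        "arag_dest": f"./{name}.arag",
        "content_include": list(content_paths),
        "clean_content": True,
        "chunk_size": 8192,
        "index_method": "openai",
        "index_model": "text-embedding-3-small",
        # Taken from the environment so the key is not hard-coded in the script.
        "api_key": os.environ.get("OPENAI_API_KEY", "YOUR_API_KEY"),
        "openai_endpoint": "https://api.openai.com/v1",
        "arag_version": "0.1.0",
        "should_package": True,
        "remove_arag_dir": True,
    }


def write_spec(spec: dict, path: Path) -> None:
    """Write the spec as indented JSON."""
    path.write_text(json.dumps(spec, indent=2) + "\n", encoding="utf-8")
```

After calling write_spec(build_spec("myarag", ["file1.txt", "dir/docs"]), Path("myarag.arag-json")), the file can be passed to arag create from-spec. Note that the written file contains the API key in plain text, so treat it accordingly.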
Examples
- Full Workflow:
# Create an arag directory
arag create dir myarag ./data
# Open the arag directory in interactive mode
arag open ./data/myarag-arag
# Add content
content add document.pdf
# Corpify
content corpify --clean
# Index locally
index --method local
# Package the file and remove this directory
package --remove-original
# Open the .arag file
arag open ./data/myarag.arag
# Query
query "important info"
- Using a Spec File:
arag create spec myarag.arag-json
# Edit myarag.arag-json as needed
nano myarag.arag-json
arag create from-spec myarag.arag-json
File Structure
An arag directory/file has the following structure:
- content/: Stores raw files and directories.
- content_list.txt: Lists all files in content/.
- corpus.db: SQLite database with chunked content & vector embeddings.
- index.json: Metadata about embeddings (method, model, etc.).
A packaged .arag file is a special ZIP archive containing these components. (In a .arag file, only the content folder is compressed. The rest is stored uncompressed so it can be accessed directly.)
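The mixed-compression layout described above can be illustrated with Python's standard zipfile module, which allows a compression method to be chosen per member. This is only a sketch of the idea: the README calls the format a "special" ZIP archive, so real .arag files may differ in ways not shown here, and the function names are hypothetical.

```python
import zipfile
from pathlib import Path


def package_sketch(arag_dir: Path, dest: Path) -> None:
    """Zip arag_dir, compressing only members under content/."""
    with zipfile.ZipFile(dest, "w") as zf:
        for path in sorted(arag_dir.rglob("*")):
            if not path.is_file():
                continue
            arcname = path.relative_to(arag_dir).as_posix()
            if arcname.startswith("content/"):
                method = zipfile.ZIP_DEFLATED  # compressed
            else:
                method = zipfile.ZIP_STORED  # raw bytes, readable at an offset
            zf.write(path, arcname, compress_type=method)


def compression_by_member(archive: Path) -> dict:
    """Map each member name to 'stored' or 'deflated'."""
    names = {zipfile.ZIP_STORED: "stored", zipfile.ZIP_DEFLATED: "deflated"}
    with zipfile.ZipFile(archive) as zf:
        return {
            info.filename: names.get(info.compress_type, str(info.compress_type))
            for info in zf.infolist()
        }
```

Storing corpus.db uncompressed means its bytes sit contiguously inside the archive, which is what would allow a custom SQLite VFS to read the database in place without extracting it first.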
Dependencies
- Required:
  - apsw: SQLite with custom VFS support.
  - numpy: For vector operations.
  - openai: For OpenAI embeddings.
  - pypdf: For PDF processing.
  - spire.doc: For DOCX processing.
- Optional:
  - sentence-transformers: For local embeddings (pip install ".[local_embeddings]").
Install additional dependencies as needed for specific file types.
Configuration
- OpenAI API Key: Set via --api-key or the OPENAI_API_KEY environment variable.
- Embedding Models: Default models are sentence-transformers/all-MiniLM-L6-v2 (local) and text-embedding-3-small (OpenAI). Override with --model.
- Chunk Size: Default is 8192 bytes; adjust with --chunk-size.
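The defaults above amount to a small resolution rule: an explicit option wins, otherwise a default applies. The sketch below expresses that rule in Python. The function name resolve_index_settings is hypothetical, and the precedence of --api-key over OPENAI_API_KEY is an assumption; the README only says the flag is optional when the variable is set.

```python
import os
from typing import Optional

# Defaults as listed in the Configuration section.
DEFAULT_MODELS = {
    "local": "sentence-transformers/all-MiniLM-L6-v2",
    "openai": "text-embedding-3-small",
}


def resolve_index_settings(
    method: str,
    model: Optional[str] = None,
    api_key: Optional[str] = None,
    env: Optional[dict] = None,
) -> dict:
    """Combine explicit options with defaults and the environment."""
    env = os.environ if env is None else env
    if method not in DEFAULT_MODELS:
        raise ValueError(f"unknown index method: {method!r}")
    settings = {"method": method, "model": model or DEFAULT_MODELS[method]}
    if method == "openai":
        # Assumed precedence: explicit key first, then the environment.
        key = api_key or env.get("OPENAI_API_KEY")
        if not key:
            raise ValueError("no API key: pass one or set OPENAI_API_KEY")
        settings["api_key"] = key
    return settings
```

Passing env explicitly, as the optional parameter allows, makes the rule easy to exercise without touching the real environment.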
Contributing
Contributions are welcome! Please:
- Fork the repository.
- Create a feature branch (git checkout -b feature/yourfeature).
- Commit changes (git commit -m "Add your feature").
- Push to the branch (git push origin feature/yourfeature).
- Open a pull request.
License
This project is licensed under the MIT License. See the LICENSE file for details.