Lightweight semantic search engine for text files. Zero dependencies, JSONL storage.

Cicada Vector 🦗

A lightweight semantic search engine for text files.

Cicada Vector is a simple, zero-dependency semantic search engine and RAG database. Search any text content semantically - code, documentation, commits, or custom data.

Why does this exist?

The original Cicada is powerful because it deeply understands code structure (SCIP, ASTs). However, that power often requires heavy dependencies and longer setup times.

Cicada Vector takes a complementary path: it focuses on semantic awareness for any text content. By combining local LLM embeddings (via Ollama) with a hybrid vector and keyword database, it provides semantic and exact-match search with a small footprint and no required third-party packages.

Features

  • Lightweight: Minimal Python codebase. Zero dependencies (Standard Library only) for the core engine.
  • Instant Install: No waiting for heavy ML libraries to compile.
  • Semantic Intelligence: Understands intent. Searching for "auth" can find login logic, even if the word "auth" isn't present.
  • Hybrid Search: Combines vector-based semantic search with exact keyword matching, so exact terms are still matched while meaning is taken into account.
  • Simple RAG: A "Search Broad -> Scan Specific" pipeline that pinpoints relevant content snippets.
  • Universal: Works on code, docs, commits, configs - any text content.
  • MCP Ready: Built-in Model Context Protocol server for immediate use with AI assistants.
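
To make the hybrid search idea more concrete, here is a minimal sketch of one common way to merge a vector ranking with a keyword ranking: reciprocal rank fusion. This is an illustration only and not necessarily how HybridDB combines scores; the function name rrf_merge, the constant k=60, and the sample rankings are all assumptions.

```python
def rrf_merge(vector_ranked, keyword_ranked, k=60):
    """Merge two ranked lists of document ids with reciprocal rank fusion.

    Each list is ordered best-first. A document's fused score is the sum of
    1 / (k + rank) over the lists it appears in (rank starts at 1).
    """
    scores = {}
    for ranked in (vector_ranked, keyword_ranked):
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for deterministic output
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical rankings: the vector side prefers auth.py, the keyword side user.py
vector_hits = ["auth.py", "session.py", "user.py"]
keyword_hits = ["user.py", "auth.py"]
merged = rrf_merge(vector_hits, keyword_hits)
for doc_id, score in merged:
    print(f"[{score:.4f}] {doc_id}")
```

A document that ranks well in both lists rises to the top, while a document found by only one method still appears further down.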

Database Classes

Cicada Vector provides four database classes:

  • VectorDB: Pure semantic vector search
  • KeywordDB: Traditional keyword-based search
  • HybridDB: Combines vector + keyword search (Recommended)
  • RagDB: RAG database for file-based search with line numbers

Tools

1. cigrep (Semantic File Search)

Zero-config semantic search for any text files - code, docs, configs, anything.

cigrep "how do I handle authentication"     # Search code
cigrep "installation steps" docs/           # Search docs
cigrep "database config" .                  # Search everything

Automatically indexes changed files in the background and searches instantly.
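
Re-indexing only changed files can be sketched with content hashes. The snippet below is an assumption about how such change detection might work, not a description of cigrep's internals; file_digest, changed_files, and the manifest dictionary are hypothetical names.

```python
import hashlib
from pathlib import Path


def file_digest(path):
    """Return the SHA-256 hex digest of a file's bytes."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def changed_files(paths, manifest):
    """Return the paths whose content differs from the stored manifest.

    `manifest` maps path strings to the digest recorded at the last index run
    and is updated in place, so a second call with unchanged files returns [].
    """
    changed = []
    for path in paths:
        digest = file_digest(path)
        if manifest.get(str(path)) != digest:
            changed.append(path)
            manifest[str(path)] = digest
    return changed
```

Only the paths returned by changed_files() would need new embeddings, which keeps repeated runs cheap.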

2. cilog (Semantic Git Commit Search)

Search your git commit history semantically.

cilog "authentication bug fix"
cilog "refactor API" --limit 500
cilog "performance improvements" --since "1 month ago"

Indexes commit messages for fast semantic search. Use --no-diff (recommended) for faster indexing.

MCP Server

AI assistants can use your local knowledge base directly:

pip install 'cicada-vector[server]'
export CICADA_HYBRID_DIR=./my_db
cicada-vec-server

Tools:

  • search_vectors: Pure semantic search.
  • search_hybrid: Vector + Keyword search (Recommended).
  • search_code_context: RAG search returning file snippets with line numbers.
  • index_directory: Incrementally index a local directory into the database.

Configuration: If using uv or uvx, ensure you include the [server] extra:

uv tool install "cicada-vector[server]"

For manual configuration (e.g., in Claude Desktop or Gemini), set the command to:

uvx --from "cicada-vector[server]" cicada-vec-server

Then set the environment variable CICADA_HYBRID_DIR to your database path.
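
For clients that read an mcpServers JSON block (Claude Desktop does), the manual configuration can be generated with a few lines of Python. This is a sketch: the server key "cicada-vector" and the database path are placeholders, and other clients may expect a different layout.

```python
import json


def mcp_server_entry(db_path, name="cicada-vector"):
    """Build an `mcpServers` config entry for the cicada-vec-server command."""
    return {
        "mcpServers": {
            name: {
                "command": "uvx",
                "args": ["--from", "cicada-vector[server]", "cicada-vec-server"],
                "env": {"CICADA_HYBRID_DIR": db_path},
            }
        }
    }


# Placeholder path: replace with the directory that holds your database
config = mcp_server_entry("/path/to/my_db")
print(json.dumps(config, indent=2))
```

Paste the printed JSON into the client's configuration file, merging it with any mcpServers entries that are already there.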

Quick Start

1. Generate Embeddings

Cicada Vector requires you to provide embeddings (vectors). Use Ollama to generate them:

import json
import urllib.request

def get_embedding(text, model="nomic-embed-text"):
    """Get embedding from Ollama API"""
    url = "http://localhost:11434/api/embeddings"
    data = json.dumps({"model": model, "prompt": text}).encode('utf-8')
    
    req = urllib.request.Request(
        url,
        data=data,
        headers={'Content-Type': 'application/json'}
    )
    
    with urllib.request.urlopen(req, timeout=30) as response:
        result = json.loads(response.read().decode('utf-8'))
        return result['embedding']

2. Create Database and Add Data

from cicada_vector import HybridDB

# Initialize HybridDB (combines vector + keyword search)
db = HybridDB("./my_knowledge_base")

# Add data with embeddings
auth_text = "def login(username, password):\n    ..."
auth_vector = get_embedding(auth_text)
db.add(id="auth.py", vector=auth_vector, text=auth_text, meta={"path": "src/auth.py"})

user_text = "class User:\n    ..."
user_vector = get_embedding(user_text)
db.add(id="user.py", vector=user_vector, text=user_text, meta={"path": "src/user.py"})

# Persist to disk (optional - data is written on add, but this rewrites the file)
db.persist()

3. Search

# Generate embedding for query
query = "how to authenticate users"
query_vector = get_embedding(query)

# Hybrid search (recommended - combines vector + keyword)
results = db.search(query_text=query, query_vector=query_vector, k=5)
for doc_id, score, meta in results:
    print(f"[{score:.4f}] {doc_id}: {meta.get('path')}")

Indexing Custom Data

Cicada Vector isn't just for code files - index any text data:

from cicada_vector import HybridDB
import subprocess

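# Example: Index git commits (reuses get_embedding() helper from Quick Start)
db = HybridDB("./git_commits_db")

# Commit bodies can span several lines and may contain "|", so separate
# fields with the ASCII unit separator (%x1f) and commits with the record
# separator (%x1e) instead of relying on "|" and newlines.
result = subprocess.run(
    ["git", "log", "--format=%H%x1f%an%x1f%s%x1f%b%x1e", "-10"],
    capture_output=True, text=True, check=True
)

for record in result.stdout.split('\x1e'):
    record = record.strip()
    if not record:
        continue  # skip the empty chunk after the final record separator
    sha, author, subject, body = record.split('\x1f', 3)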
    commit_text = f"{subject}\n{body}"
    commit_vector = get_embedding(commit_text)

    db.add(
        id=sha,
        vector=commit_vector,
        text=commit_text,
        meta={"author": author, "subject": subject, "type": "commit"}
    )

# Search commits
query = "authentication bug fix"
query_vector = get_embedding(query)
results = db.search(query_text=query, query_vector=query_vector, k=5)
for sha, score, meta in results:
    print(f"[{score:.4f}] {sha[:8]} - {meta['subject']}")

Other Database Classes

VectorDB (Pure Semantic Search)

from cicada_vector import VectorDB

db = VectorDB("./my_vectors.jsonl")

# Add vectors (no text storage)
db.add(id="doc1", vector=get_embedding("some text"), meta={"path": "doc1.txt"})

# Search (reuses get_embedding() helper from Quick Start)
query_vector = get_embedding("search query")
results = db.search(query=query_vector, k=5)
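
Pure vector search usually comes down to comparing the query vector with every stored vector. The sketch below shows a brute-force cosine-similarity ranking in plain Python; it is an assumption about the general technique, not VectorDB's actual code, and cosine_similarity, top_k, and the toy vectors are illustrative.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query, vectors, k=5):
    """Rank stored vectors against the query, best match first."""
    scored = [(doc_id, cosine_similarity(query, vec)) for doc_id, vec in vectors.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy 3-dimensional vectors standing in for real embeddings
store = {"doc1": [1.0, 0.0, 0.0], "doc2": [0.0, 1.0, 0.0], "doc3": [0.7, 0.7, 0.0]}
hits = top_k([1.0, 0.1, 0.0], store, k=2)
```

Brute force is linear in the number of stored vectors, which is typically acceptable for small, local knowledge bases.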

KeywordDB (Traditional Search)

from cicada_vector import KeywordDB

db = KeywordDB("./my_keywords.jsonl")

# Add documents
db.add(id="doc1", text="some text to index")

# Search (OR search - matches any word)
results = db.search(query="search terms")
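
An OR search of this kind can be pictured as an inverted index that maps each word to the documents containing it. The class below is a hypothetical in-memory sketch, not the KeywordDB implementation; the tokenizer and the scoring rule (number of distinct query words matched) are assumptions.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9_]+", text.lower())


class TinyKeywordIndex:
    """Hypothetical in-memory inverted index for illustration."""

    def __init__(self):
        self.postings = defaultdict(set)  # token -> ids of documents containing it

    def add(self, doc_id, text):
        for token in tokenize(text):
            self.postings[token].add(doc_id)

    def search(self, query):
        """OR search: score each document by how many distinct query words it contains."""
        counts = defaultdict(int)
        for token in set(tokenize(query)):
            for doc_id in self.postings.get(token, ()):
                counts[doc_id] += 1
        # Most matched words first; ties broken by id for deterministic output
        return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


index = TinyKeywordIndex()
index.add("doc1", "some text to index")
index.add("doc2", "search terms and some text")
results = index.search("search terms")
```

A document matches if it contains any query word, and documents that contain more of the query words rank higher.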

RagDB (File-based RAG)

from cicada_vector import RagDB

db = RagDB("./my_rag_db")

# Add files (the with block closes the file once it has been read)
with open("src/auth.py", encoding="utf-8") as f:
    file_content = f.read()
# Reuse get_embedding() helper from Quick Start
file_vector = get_embedding(file_content)
db.add_file(file_path="src/auth.py", content=file_content, vector=file_vector)

# Search (returns file + line numbers)
query_vector = get_embedding("authentication")
results = db.search(query="authentication", k=3, query_vector=query_vector)
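
The "Scan Specific" half of the RAG pipeline can be sketched as a pass over one file that finds the window of lines with the most query-word hits. This is a simplified assumption about how a snippet with line numbers might be located, not RagDB's actual algorithm; best_snippet and the sample source are hypothetical.

```python
import re


def best_snippet(content, query, window=3):
    """Return (start_line, end_line, text) for the window with the most query-word hits.

    Line numbers are 1-based and inclusive. Ties go to the earliest window.
    """
    terms = set(re.findall(r"\w+", query.lower()))
    lines = content.splitlines()
    if not lines:
        return (0, 0, "")
    # Count how many distinct query words each line contains
    hits = [len(terms & set(re.findall(r"\w+", line.lower()))) for line in lines]
    best_start, best_score = 0, -1
    for start in range(max(1, len(lines) - window + 1)):
        score = sum(hits[start:start + window])
        if score > best_score:
            best_start, best_score = start, score
    end = min(best_start + window, len(lines))
    return (best_start + 1, end, "\n".join(lines[best_start:end]))


source = (
    "import os\n"
    "\n"
    "def login(user, password):\n"
    "    return check(user, password)\n"
    "\n"
    "def logout():\n"
    "    pass\n"
)
snippet = best_snippet(source, "login password", window=2)
```

In a full pipeline, the broad vector search would first pick candidate files, and a scan like this would then narrow each file down to the lines worth showing.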

Use cases:

  • Code files (semantic search across your codebase)
  • Git commits and history
  • GitHub PRs and issues
  • Documentation sites
  • Configuration files
  • Support tickets
  • Any text corpus

The Stack

  • Brains: Ollama (Recommended: nomic-embed-text)
  • Storage: JSONL (Human-readable, append-only)
  • Engine: Pure Python (with optional NumPy acceleration)
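
As an illustration of append-only JSONL storage, the sketch below writes one JSON object per line and rebuilds the current state by reading the file top to bottom. The record layout (id, vector, meta) and the rule that a later line overrides an earlier one are assumptions for this example, not the library's documented on-disk format.

```python
import json


def append_record(path, doc_id, vector, meta=None):
    """Append one record as a single JSON line; existing lines are never modified."""
    record = {"id": doc_id, "vector": vector, "meta": meta or {}}
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")


def load_records(path):
    """Read the log into a dict keyed by id; a later line overrides an earlier one."""
    records = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                record = json.loads(line)
                records[record["id"]] = record
    return records
```

Because every line is plain JSON, the file can be inspected with any text editor or standard command-line tools.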

Part of the Cicada suite. Simple, effective semantic search for text.
