# kno-sdk

Clone GitHub repos, build embeddings, store in Chroma, and search.

A Python library for cloning, indexing, and semantically searching Git repositories using embeddings (OpenAI or SBERT) and Chroma, plus a high-level `agent_query` for autonomous code agents.
## 🚀 Features

- Clone or update any Git repository with a single call
- Extract semantic code chunks via Tree-sitter grammars (functions, classes, methods, etc.)
- Fall back to line-based chunking for unsupported languages or large files
- Embed code or text with your choice of:
  - OpenAI's `text-embedding-ada-002` via `OpenAIEmbeddings`
  - A local SBERT model (e.g. `microsoft/graphcodebert-base`) via `SBERTEmbeddings`
- Persist the vector store in a `.kno/` folder using Chroma
- Auto-commit & push the embedding database back to your repo
- Fast similarity search over indexed code chunks
- Autonomous agent for code analysis via `agent_query()`
## 📦 Installation

```bash
pip install kno-sdk
```
## 🏁 Quickstart

```python
from kno_sdk import clone_and_index, search, EmbeddingMethod

# 1. Clone (or pull) and index a repository
repo_index = clone_and_index(
    repo_url="https://github.com/SyedGhazanferAnwar/NestJs-MovieApp",
    branch="master",
    embedding=EmbeddingMethod.SBERT,  # or EmbeddingMethod.OPENAI
    base_dir="repos",                 # where to clone locally
)

print("Indexed at:", repo_index.path)
print("Directory snapshot:\n", repo_index.digest)

# 2. Perform semantic search
results = search(
    repo_url="https://github.com/SyedGhazanferAnwar/NestJs-MovieApp",
    branch="master",
    embedding=EmbeddingMethod.SBERT,
    base_dir="repos",
    query="NestFactory",
    k=5,
)

for i, chunk in enumerate(results, 1):
    print(f"--- Result #{i} ---\n{chunk}\n")
```

```python
# 3. Autonomous code-analysis agent
from kno_sdk import agent_query, EmbeddingMethod, LLMProvider

result = agent_query(
    repo_url="https://github.com/WebGoat/WebGoat",
    branch="main",
    embedding=EmbeddingMethod.SBERT,
    base_dir="repos",
    llm_provider=LLMProvider.ANTHROPIC,
    llm_model="claude-3-haiku-20240307",
    llm_temperature=0.0,
    llm_max_tokens=4096,
    llm_system_prompt="You are a senior code-analysis agent.",
    prompt="Find issues, bugs and vulnerabilities in this repo, and explain each with exact code locations.",
    MODEL_API_KEY="your_api_key_here",
)
print(result)
```
## 📖 API Reference

### `clone_and_index(...) → RepoIndex`

Clone (or pull) a repository, embed its files, and persist a Chroma database in the `.kno/` folder. Finally, commit & push the `.kno/` folder back to the original repo.

```python
def clone_and_index(
    repo_url: str,
    branch: str = "main",
    embedding: EmbeddingMethod = EmbeddingMethod.SBERT,
    base_dir: str = "."
) -> RepoIndex
```

- `repo_url` — Git HTTPS/SSH URL
- `branch` — branch to clone or update (default: `main`)
- `embedding` — `EmbeddingMethod.OPENAI` or `EmbeddingMethod.SBERT`
- `base_dir` — local directory to clone into (default: current working directory)

Returns a `RepoIndex` object with:

- `path: pathlib.Path` — local clone directory
- `digest: str` — textual snapshot of the directory tree
- `vector_store: Chroma` — the Chroma collection instance
### `search(...) → List[str]`

Run a similarity search on an existing `.kno/` Chroma database.

```python
def search(
    repo_url: str,
    branch: str = "main",
    embedding: EmbeddingMethod = EmbeddingMethod.SBERT,
    query: str = "",
    k: int = 8,
    base_dir: str = "."
) -> List[str]
```

- `query` — your natural-language or code search prompt
- `k` — number of top results to return

Returns a list of the top-`k` matching code/text chunks.
### `agent_query(...) → str`

High-level agent that clones, indexes, and then iteratively uses tools (`search_code`, `read_file`, etc.) plus an LLM to fulfill your prompt.

```python
def agent_query(
    repo_url: str,
    branch: str = "main",
    embedding: EmbeddingMethod = EmbeddingMethod.SBERT,
    base_dir: str = str(Path.cwd()),
    llm_provider: LLMProvider = LLMProvider.ANTHROPIC,
    llm_model: str = "claude-3-haiku-20240307",
    llm_temperature: float = 0.0,
    llm_max_tokens: int = 4096,
    llm_system_prompt: str = "",
    prompt: str = "",
    MODEL_API_KEY: str = "",
) -> str
```

- `repo_url`, `branch`, `embedding`, `base_dir` — same as above
- `llm_provider` — `LLMProvider.OPENAI` or `LLMProvider.ANTHROPIC`
- `llm_model` — model name (e.g. `"gpt-4"` or `"claude-3-haiku-20240307"`)
- `llm_temperature`, `llm_max_tokens` — sampling parameters
- `llm_system_prompt` — initial system message for the agent
- `prompt` — your user query/task description
- `MODEL_API_KEY` — sets `OPENAI_API_KEY` or `ANTHROPIC_API_KEY`

Returns the agent's Final Answer as a string.
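The `MODEL_API_KEY` routing described above can be pictured with a small sketch; `export_model_key` is a hypothetical helper written for illustration, not part of kno-sdk's API, and the SDK's internal logic may differ:

```python
import os

# Hypothetical helper illustrating how agent_query's MODEL_API_KEY is routed
# to the provider-specific environment variable; the SDK's internals may differ.
def export_model_key(provider: str, key: str) -> str:
    """Set the env var matching the LLM provider and return its name."""
    env_name = {
        "openai": "OPENAI_API_KEY",
        "anthropic": "ANTHROPIC_API_KEY",
    }[provider.lower()]
    os.environ[env_name] = key
    return env_name

print(export_model_key("anthropic", "sk-demo"))  # ANTHROPIC_API_KEY
```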
### `EmbeddingMethod`

```python
class EmbeddingMethod(str, Enum):
    OPENAI = "OpenAIEmbeddings"
    SBERT = "SBERTEmbeddings"
```

Choose between OpenAI's hosted embeddings or a local SBERT model.
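Because the enum subclasses `str`, its members compare equal to their raw string values, which makes them easy to read from config files or CLI flags. A quick demonstration (the class body is copied from the definition above so it runs standalone):

```python
from enum import Enum

# Copied from the reference above so this snippet runs without kno-sdk installed.
class EmbeddingMethod(str, Enum):
    OPENAI = "OpenAIEmbeddings"
    SBERT = "SBERTEmbeddings"

# str-mixin members compare equal to their plain-string values...
print(EmbeddingMethod.SBERT == "SBERTEmbeddings")                     # True
# ...and can be reconstructed from those values by calling the enum.
print(EmbeddingMethod("OpenAIEmbeddings") is EmbeddingMethod.OPENAI)  # True
```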
## 🔍 How It Works

1. **Clone or pull**: uses GitPython to clone depth-1 or pull the latest changes.
2. **Directory snapshot**: builds a small "digest" of files/folders (up to ~1 K tokens).
3. **Chunk extraction**:
   - Tree-sitter for language-aware extraction of functions, classes, etc.
   - Fallback to fixed-size line chunks for unknown languages or large files.
4. **Embedding**:
   - Streams each chunk into your chosen embedding backend.
   - Respects a 16,000-token cap per chunk.
5. **Vector store**:
   - Persists embeddings in a namespaced Chroma collection under `.kno/`.
   - Only indexes files once (skips already-populated collections).
6. **Commit & push**: automatically stages, commits, and pushes `.kno/` back to your remote.
7. **Autonomous agent**:
   - RAG prompt
   - Tool calls (`search_code`, `read_file`, …)
   - Iterative LLM planning & execution
   - Stops on "Final Answer:" or max iterations
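The line-based fallback in the chunk-extraction step can be sketched as follows; `chunk_by_lines`, its window size, and its overlap are illustrative assumptions, not the SDK's actual values:

```python
def chunk_by_lines(text: str, max_lines: int = 40, overlap: int = 5) -> list[str]:
    """Split text into fixed-size, slightly overlapping windows of lines.

    Illustrative sketch of the fallback strategy described above; the SDK's
    real chunk sizes and overlap may differ.
    """
    lines = text.splitlines()
    chunks: list[str] = []
    step = max_lines - overlap
    for start in range(0, len(lines), step):
        window = lines[start:start + max_lines]
        if window:
            chunks.append("\n".join(window))
        if start + max_lines >= len(lines):
            break  # the last window already reached the end of the text
    return chunks

source = "\n".join(f"line {i}" for i in range(100))
print(len(chunk_by_lines(source)))  # 3
```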
## ⚙️ Configuration

- Skip directories: `.git`, `node_modules`, `build`, `dist`, `target`, `.vscode`, `.kno`
- Skip files: `package-lock.json`, `yarn.lock`, `.prettierignore`
- Binary extensions: common image, audio, video, archive, font, and binary file types

All of the above can be modified by forking the source and adjusting the `skip_dirs`, `skip_files`, and `BINARY_EXTS` sets.
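The skip rules above amount to a single predicate over file paths. A minimal sketch, assuming hypothetical set contents (`should_index` and the abbreviated `BINARY_EXTS` are illustrative, not the SDK's actual implementation):

```python
from pathlib import Path

# Illustrative re-creation of the skip rules listed above; the SDK's real
# sets live in its source as skip_dirs, skip_files, and BINARY_EXTS.
SKIP_DIRS = {".git", "node_modules", "build", "dist", "target", ".vscode", ".kno"}
SKIP_FILES = {"package-lock.json", "yarn.lock", ".prettierignore"}
BINARY_EXTS = {".png", ".jpg", ".gif", ".mp3", ".mp4", ".zip", ".woff", ".ttf"}  # abbreviated

def should_index(path: Path) -> bool:
    """Return True when a file passes all three skip filters."""
    if any(part in SKIP_DIRS for part in path.parts):
        return False
    if path.name in SKIP_FILES:
        return False
    return path.suffix.lower() not in BINARY_EXTS

print(should_index(Path("src/app.service.ts")))       # True
print(should_index(Path("node_modules/x/index.js")))  # False
```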
## 🔧 Dependencies
## 🤝 Contributing

1. Fork this repo
2. Create your feature branch (`git checkout -b feature/AmazingFeature`)
3. Commit your changes (`git commit -m 'Add amazing feature'`)
4. Push to the branch (`git push origin feature/AmazingFeature`)
5. Open a Pull Request

Please run `pytest` before submitting and follow the existing code style.
## Download files
### Source distribution

Details for the file `kno_sdk-1.3.2.tar.gz`:

- Size: 19.5 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.10.12

| Algorithm | Hash digest |
|---|---|
| SHA256 | `bd9a3f16e92df053663451b0303163ba5f57c748adb153cf2ce80636905aad80` |
| MD5 | `44537c13006bdd78da2f68e1c9d00c7a` |
| BLAKE2b-256 | `5d5f3e227c58be14eacfa70d4d3027d6f35a40053220a5208a9b4163a5ef3025` |
### Built distribution

Details for the file `kno_sdk-1.3.2-py3-none-any.whl`:

- Size: 21.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.10.12

| Algorithm | Hash digest |
|---|---|
| SHA256 | `aed17d90d55790d93d4f476a14e1d50c0e4f32cb3675ddac3f70fd7ea5a7f838` |
| MD5 | `928ef6d263a5a243af5e7d595dd0c70a` |
| BLAKE2b-256 | `c34a68f20a6ed2225949d455ff0124fa7133376ed0f07e06594d8aeb99932264` |