# kno-sdk

Clone GitHub repos, build embeddings, store in Chroma, and search.

A Python library for cloning, indexing, and semantically searching Git repositories using embeddings (OpenAI or SBERT) and Chroma, plus a high-level `agent_query` for autonomous code agents.
## 🚀 Features

- Clone or update any Git repository with a single call
- Extract semantic code chunks via Tree-sitter grammars (functions, classes, methods, etc.)
- Fall back to line-based chunking for unsupported languages or large files
- Embed code or text with your choice of:
  - OpenAI's `text-embedding-ada-002` via `OpenAIEmbeddings`
  - A local SBERT model (e.g. `microsoft/graphcodebert-base`) via `SBERTEmbeddings`
- Persist the vector store in a `.kno/` folder using Chroma
- Auto-commit & push the embedding database back to your repo
- Fast similarity search over indexed code chunks
- Autonomous agent for code analysis via `agent_query()`
## 📦 Installation

```bash
pip install kno-sdk
```
## 🏁 Quickstart

```python
from kno_sdk import clone_and_index, search, EmbeddingMethod

# 1. Clone (or pull) and index a repository
repo_index = clone_and_index(
    repo_url="https://github.com/SyedGhazanferAnwar/NestJs-MovieApp",
    branch="master",
    embedding=EmbeddingMethod.SBERT,  # or EmbeddingMethod.OPENAI
    base_dir="repos",                 # where to clone locally
)

print("Indexed at:", repo_index.path)
print("Directory snapshot:\n", repo_index.digest)

# 2. Perform semantic search
results = search(
    repo_url="https://github.com/SyedGhazanferAnwar/NestJs-MovieApp",
    branch="master",
    embedding=EmbeddingMethod.SBERT,
    base_dir="repos",
    query="NestFactory",
    k=5,
)

for i, chunk in enumerate(results, 1):
    print(f"--- Result #{i} ---\n{chunk}\n")
```

```python
# 3. Autonomous code-analysis agent
from kno_sdk import agent_query, EmbeddingMethod, LLMProvider

result = agent_query(
    repo_url="https://github.com/WebGoat/WebGoat",
    branch="main",
    embedding=EmbeddingMethod.SBERT,
    base_dir="repos",
    llm_provider=LLMProvider.ANTHROPIC,
    llm_model="claude-3-haiku-20240307",
    llm_temperature=0.0,
    llm_max_tokens=4096,
    llm_system_prompt="You are a senior code-analysis agent.",
    prompt="Find issues, bugs and vulnerabilities in this repo, and explain each with exact code locations.",
    MODEL_API_KEY="your_api_key_here",
)
print(result)
```
## 📖 API Reference

### `clone_and_index(...) → RepoIndex`

Clone (or pull) a repository, embed its files, and persist a Chroma database in the `.kno/` folder. Finally, commit & push the `.kno/` folder back to the original repo.

```python
def clone_and_index(
    repo_url: str,
    branch: str = "main",
    embedding: EmbeddingMethod = EmbeddingMethod.SBERT,
    base_dir: str = "."
) -> RepoIndex
```

- `repo_url` — Git HTTPS/SSH URL
- `branch` — branch to clone or update (default: `main`)
- `embedding` — `EmbeddingMethod.OPENAI` or `EmbeddingMethod.SBERT`
- `base_dir` — local directory to clone into (default: current working directory)

Returns a `RepoIndex` object with:

- `path: pathlib.Path` — local clone directory
- `digest: str` — textual snapshot of the directory tree
- `vector_store: Chroma` — the Chroma collection instance
### `search(...) → List[str]`

Run a similarity search on an existing `.kno/` Chroma database.

```python
def search(
    repo_url: str,
    branch: str = "main",
    embedding: EmbeddingMethod = EmbeddingMethod.SBERT,
    query: str = "",
    k: int = 8,
    base_dir: str = "."
) -> List[str]
```

- `query` — your natural-language or code search prompt
- `k` — number of top results to return

Returns a list of the top-`k` matching code/text chunks.
### `agent_query(...) → str`

High-level agent that clones, indexes, and then iteratively uses tools (`search_code`, `read_file`, etc.) plus an LLM to fulfill your prompt.

```python
def agent_query(
    repo_url: str,
    branch: str = "main",
    embedding: EmbeddingMethod = EmbeddingMethod.SBERT,
    base_dir: str = str(Path.cwd()),
    llm_provider: LLMProvider = LLMProvider.ANTHROPIC,
    llm_model: str = "claude-3-haiku-20240307",
    llm_temperature: float = 0.0,
    llm_max_tokens: int = 4096,
    llm_system_prompt: str = "",
    prompt: str = "",
    MODEL_API_KEY: str = "",
) -> str
```

- `repo_url`, `branch`, `embedding`, `base_dir` — same as above
- `llm_provider` — `LLMProvider.OPENAI` or `LLMProvider.ANTHROPIC`
- `llm_model` — model name (e.g. `"gpt-4"` or `"claude-3-haiku-20240307"`)
- `llm_temperature`, `llm_max_tokens` — sampling parameters
- `llm_system_prompt` — initial system message for the agent
- `prompt` — your user query/task description
- `MODEL_API_KEY` — sets `OPENAI_API_KEY` or `ANTHROPIC_API_KEY`

Returns the agent's Final Answer as a string.
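The `MODEL_API_KEY` routing described above can be pictured with a small sketch; `export_model_key` is a hypothetical helper written for illustration, not part of kno-sdk's API, and the SDK's internal logic may differ:

```python
import os

# Hypothetical helper illustrating how agent_query's MODEL_API_KEY is routed
# to the provider-specific environment variable; the SDK's internals may differ.
def export_model_key(provider: str, key: str) -> str:
    """Set the env var matching the LLM provider and return its name."""
    env_name = {
        "openai": "OPENAI_API_KEY",
        "anthropic": "ANTHROPIC_API_KEY",
    }[provider.lower()]
    os.environ[env_name] = key
    return env_name

print(export_model_key("anthropic", "sk-demo"))  # ANTHROPIC_API_KEY
```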
### `EmbeddingMethod`

```python
class EmbeddingMethod(str, Enum):
    OPENAI = "OpenAIEmbeddings"
    SBERT = "SBERTEmbeddings"
```

Choose between OpenAI's hosted embeddings or a local SBERT model.
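Because the enum subclasses `str`, its members compare equal to their raw string values, which makes them easy to read from config files or CLI flags. A quick demonstration (the class body is copied from the definition above so it runs standalone):

```python
from enum import Enum

# Copied from the reference above so this snippet runs without kno-sdk installed.
class EmbeddingMethod(str, Enum):
    OPENAI = "OpenAIEmbeddings"
    SBERT = "SBERTEmbeddings"

# str-mixin members compare equal to their plain-string values...
print(EmbeddingMethod.SBERT == "SBERTEmbeddings")                     # True
# ...and can be reconstructed from those values by calling the enum.
print(EmbeddingMethod("OpenAIEmbeddings") is EmbeddingMethod.OPENAI)  # True
```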
## 🔍 How It Works

1. **Clone or pull**: uses GitPython to clone depth-1 or pull the latest changes.
2. **Directory snapshot**: builds a small "digest" of files/folders (up to ~1 K tokens).
3. **Chunk extraction**:
   - Tree-sitter for language-aware extraction of functions, classes, etc.
   - Fallback to fixed-size line chunks for unknown languages or large files.
4. **Embedding**:
   - Streams each chunk into your chosen embedding backend.
   - Respects a 16,000-token cap per chunk.
5. **Vector store**:
   - Persists embeddings in a namespaced Chroma collection under `.kno/`.
   - Only indexes files once (skips already-populated collections).
6. **Commit & push**: automatically stages, commits, and pushes `.kno/` back to your remote.
7. **Autonomous agent**:
   - RAG prompt
   - Tool calls (`search_code`, `read_file`, …)
   - Iterative LLM planning & execution
   - Stops on "Final Answer:" or max iterations
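The line-based fallback in the chunk-extraction step can be sketched as follows; `chunk_by_lines`, its window size, and its overlap are illustrative assumptions, not the SDK's actual values:

```python
def chunk_by_lines(text: str, max_lines: int = 40, overlap: int = 5) -> list[str]:
    """Split text into fixed-size, slightly overlapping windows of lines.

    Illustrative sketch of the fallback strategy described above; the SDK's
    real chunk sizes and overlap may differ.
    """
    lines = text.splitlines()
    chunks: list[str] = []
    step = max_lines - overlap
    for start in range(0, len(lines), step):
        window = lines[start:start + max_lines]
        if window:
            chunks.append("\n".join(window))
        if start + max_lines >= len(lines):
            break  # the last window already reached the end of the text
    return chunks

source = "\n".join(f"line {i}" for i in range(100))
print(len(chunk_by_lines(source)))  # 3
```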
## ⚙️ Configuration

- Skip directories: `.git`, `node_modules`, `build`, `dist`, `target`, `.vscode`, `.kno`
- Skip files: `package-lock.json`, `yarn.lock`, `.prettierignore`
- Binary extensions: common image, audio, video, archive, font, and binary file types

All of the above can be modified by forking the source and adjusting the `skip_dirs`, `skip_files`, and `BINARY_EXTS` sets.
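The skip rules above amount to a single predicate over file paths. A minimal sketch, assuming hypothetical set contents (`should_index` and the abbreviated `BINARY_EXTS` are illustrative, not the SDK's actual implementation):

```python
from pathlib import Path

# Illustrative re-creation of the skip rules listed above; the SDK's real
# sets live in its source as skip_dirs, skip_files, and BINARY_EXTS.
SKIP_DIRS = {".git", "node_modules", "build", "dist", "target", ".vscode", ".kno"}
SKIP_FILES = {"package-lock.json", "yarn.lock", ".prettierignore"}
BINARY_EXTS = {".png", ".jpg", ".gif", ".mp3", ".mp4", ".zip", ".woff", ".ttf"}  # abbreviated

def should_index(path: Path) -> bool:
    """Return True when a file passes all three skip filters."""
    if any(part in SKIP_DIRS for part in path.parts):
        return False
    if path.name in SKIP_FILES:
        return False
    return path.suffix.lower() not in BINARY_EXTS

print(should_index(Path("src/app.service.ts")))       # True
print(should_index(Path("node_modules/x/index.js")))  # False
```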
## 🔧 Dependencies
## 🤝 Contributing

1. Fork this repo
2. Create your feature branch (`git checkout -b feature/AmazingFeature`)
3. Commit your changes (`git commit -m 'Add amazing feature'`)
4. Push to the branch (`git push origin feature/AmazingFeature`)
5. Open a Pull Request

Please run `pytest` before submitting and follow the existing code style.
## Download files
### Source distribution

Details for the file `kno_sdk-1.3.2.tar.gz`:

- Size: 19.5 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.10.12

| Algorithm | Hash digest |
|---|---|
| SHA256 | `bd9a3f16e92df053663451b0303163ba5f57c748adb153cf2ce80636905aad80` |
| MD5 | `44537c13006bdd78da2f68e1c9d00c7a` |
| BLAKE2b-256 | `5d5f3e227c58be14eacfa70d4d3027d6f35a40053220a5208a9b4163a5ef3025` |
### Built distribution

Details for the file `kno_sdk-1.3.2-py3-none-any.whl`:

- Size: 21.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.10.12

| Algorithm | Hash digest |
|---|---|
| SHA256 | `aed17d90d55790d93d4f476a14e1d50c0e4f32cb3675ddac3f70fd7ea5a7f838` |
| MD5 | `928ef6d263a5a243af5e7d595dd0c70a` |
| BLAKE2b-256 | `c34a68f20a6ed2225949d455ff0124fa7133376ed0f07e06594d8aeb99932264` |