Clone GitHub repos, build embeddings, store in Chroma, and search.

kno-sdk

A Python library for cloning, indexing, and semantically searching Git repositories using embeddings and Chroma.


🚀 Features

  • Clone or update any Git repository with a single call
  • Extract semantic code chunks via Tree-Sitter grammars (functions, classes, methods, etc.)
  • Fall back to line-based chunking for unsupported languages or large files
  • Embed code or text with your choice of:
    • OpenAI’s text-embedding-ada-002 via OpenAIEmbeddings
    • Local SBERT model (e.g. microsoft/graphcodebert-base) via SBERTEmbeddings
  • Persist vector store in a .kno/ folder using Chroma
  • Auto-commit & push the embedding database back to your repo
  • Fast similarity search over indexed code chunks

📦 Installation

pip install kno-sdk

🔧 Dependencies

🏁 Quickstart

from kno_sdk import clone_and_index, search, EmbeddingMethod

# 1. Clone (or pull) and index a repository
repo_index = clone_and_index(
    repo_url="https://github.com/SyedGhazanferAnwar/NestJs-MovieApp",
    branch="master",
    embedding=EmbeddingMethod.SBERT,      # or EmbeddingMethod.OPENAI
    base_dir="repos"                      # where to clone locally
)
print("Indexed at:", repo_index.path)
print("Directory snapshot:\n", repo_index.digest)

# 2. Perform semantic search
results = search(
    repo_url="https://github.com/SyedGhazanferAnwar/NestJs-MovieApp",
    branch="master",
    embedding=EmbeddingMethod.SBERT,
    base_dir="repos",
    query="NestFactory",
    k=5
)
for i, chunk in enumerate(results, 1):
    print(f"--- Result #{i} ---\n{chunk}\n")

📖 API Reference

clone_and_index(...) → RepoIndex

Clone (or pull) a repository, embed its files, and persist a Chroma database in the .kno/ folder of the local clone. Finally, commit and push the .kno/ folder back to the original repo.

def clone_and_index(
    repo_url: str,
    branch: str = "main",
    embedding: EmbeddingMethod = EmbeddingMethod.SBERT,
    base_dir: str = "."
) -> RepoIndex
  • repo_url — Git HTTPS/SSH URL

  • branch — branch to clone or update (default: main)

  • embedding — EmbeddingMethod.OPENAI or EmbeddingMethod.SBERT

  • base_dir — local directory to clone into (default: current working dir)

Returns a RepoIndex object with:

  • path: pathlib.Path — local clone directory

  • digest: str — textual snapshot of the directory tree

  • vector_store: Chroma — the Chroma collection instance

search(...) → List[str]

Run a similarity search on an existing .kno/ Chroma database.

def search(
    repo_url: str,
    branch: str = "main",
    embedding: EmbeddingMethod = EmbeddingMethod.SBERT,
    query: str = "",
    k: int = 8,
    base_dir: str = "."
) -> List[str]
  • query — your natural-language or code search prompt

  • k — number of top results to return

Returns a list of the top-k matching code/text chunks.
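
Because search returns plain strings, the results can be joined into a single context block, for example to paste into a prompt. The following is a minimal sketch that assumes nothing about kno-sdk beyond the List[str] return type; format_results, its max_chars parameter, and the separator format are hypothetical and not part of the library.

```python
from typing import List


def format_results(chunks: List[str], max_chars: int = 4000) -> str:
    """Join search results into one numbered context block.

    Hypothetical helper, not part of kno-sdk. It stops adding chunks
    once the running total would exceed max_chars.
    """
    parts: List[str] = []
    used = 0
    for i, chunk in enumerate(chunks, 1):
        block = f"--- Result #{i} ---\n{chunk.strip()}\n"
        if used + len(block) > max_chars:
            break
        parts.append(block)
        used += len(block)
    return "\n".join(parts)
```

Passing the list returned by search(...) to format_results yields one string whose length stays within the chosen character budget.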

EmbeddingMethod

class EmbeddingMethod(str, Enum):
    OPENAI = "OpenAIEmbeddings"
    SBERT  = "SBERTEmbeddings"

Choose between OpenAI’s hosted embeddings or a local SBERT model.
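
One way to pick a backend at runtime is to use OpenAI only when an API key is configured and fall back to the local model otherwise. This is a sketch under assumptions: pick_embedding is a hypothetical helper, the OPENAI_API_KEY variable name is assumed to be what the OpenAI backend reads, and the enum is copied locally only so the snippet runs without kno-sdk installed.

```python
import os
from enum import Enum
from typing import Mapping, Optional


class EmbeddingMethod(str, Enum):
    # Local copy of the enum shown above so the sketch runs on its own;
    # in real code, import it from kno_sdk instead.
    OPENAI = "OpenAIEmbeddings"
    SBERT = "SBERTEmbeddings"


def pick_embedding(env: Optional[Mapping[str, str]] = None) -> EmbeddingMethod:
    """Hypothetical helper: prefer OpenAI only when an API key is set."""
    env = os.environ if env is None else env
    # Assumption: the OpenAI backend reads its key from OPENAI_API_KEY.
    if env.get("OPENAI_API_KEY"):
        return EmbeddingMethod.OPENAI
    return EmbeddingMethod.SBERT
```

The returned value can be passed as the embedding argument. Use the same method for clone_and_index and search so that queries are embedded in the same vector space as the indexed chunks.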

RepoIndex

class RepoIndex:  # fields as documented above; base class not shown here
    path: pathlib.Path
    digest: str
    vector_store: Chroma
  • path — where the repository was cloned

  • vector_store — live Chroma client for further queries

  • digest — human-readable directory listing (useful for context)
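
To give a feel for what a directory digest might look like, here is a rough sketch that builds an indented listing while skipping vendored and build folders. It is not kno-sdk's implementation: build_digest, its max_entries limit, and the output format are assumptions made for illustration.

```python
from pathlib import Path
from typing import List, Set

SKIP_DIRS: Set[str] = {".git", "node_modules", "build", "dist", "target", ".vscode", ".kno"}


def build_digest(root: Path, max_entries: int = 200) -> str:
    """Return an indented listing of root, skipping SKIP_DIRS.

    Hypothetical sketch; the real digest format and size limit may differ.
    """
    lines: List[str] = []

    def walk(directory: Path, depth: int) -> None:
        # List directories first, then files, each group alphabetically.
        entries = sorted(directory.iterdir(), key=lambda p: (p.is_file(), p.name))
        for entry in entries:
            if len(lines) >= max_entries:
                return
            if entry.is_dir():
                if entry.name in SKIP_DIRS:
                    continue
                lines.append("  " * depth + entry.name + "/")
                walk(entry, depth + 1)
            else:
                lines.append("  " * depth + entry.name)

    walk(root, 0)
    return "\n".join(lines)
```

Capping the number of entries keeps the listing small enough to use as context, in the same spirit as the token limit on the real digest.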

🔍 How It Works

  1. Clone or Pull

    • Uses GitPython to make a shallow (depth-1) clone, or to pull the latest changes if the repository is already cloned.

  2. Directory Snapshot

    • Builds a small “digest” of files and folders (up to ~1K tokens).

  3. Chunk Extraction

    • Tree-sitter for language-aware extraction of functions, classes, etc.

    • Fallback to fixed-size line chunks for unknown languages or large files.

  4. Embedding

    • Streams each chunk into your chosen embedding backend.

    • Respects a 16 000-token cap per chunk.

  5. Vector Store

    • Persists embeddings in a namespaced Chroma collection under .kno/.

    • Only indexes files once (skips already-populated collections).

  6. Commit & Push

    • Automatically stages, commits, and pushes .kno/ back to your remote.
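
The line-based fallback in step 3 can be sketched as follows. This is an illustration only: chunk_by_lines and its default of 40 lines are hypothetical, since the chunk size kno-sdk actually uses is not stated here.

```python
from typing import List


def chunk_by_lines(text: str, max_lines: int = 40) -> List[str]:
    """Split text into consecutive chunks of at most max_lines lines.

    Hypothetical fallback chunker, not kno-sdk's own implementation.
    """
    if max_lines < 1:
        raise ValueError("max_lines must be at least 1")
    lines = text.splitlines()
    chunks: List[str] = []
    for start in range(0, len(lines), max_lines):
        chunk = "\n".join(lines[start:start + max_lines])
        if chunk.strip():  # drop chunks that contain only whitespace
            chunks.append(chunk)
    return chunks
```

Fixed-size chunks ignore function and class boundaries, which is why they serve only as a fallback when no Tree-sitter grammar applies.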

⚙️ Configuration

  • Skip directories: .git, node_modules, build, dist, target, .vscode, .kno

  • Skip files: package-lock.json, yarn.lock, .prettierignore

  • Binary extensions: common image, audio, video, archive, font, and binary file types

All of the above can be modified by forking the source and adjusting the skip_dirs, skip_files, and BINARY_EXTS sets.
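
The three filters can be combined in a single directory walk, roughly as sketched below. The set names follow the ones mentioned above, but iter_indexable_files is a hypothetical helper, and BINARY_EXTS here is a small illustrative subset because the full list is not given.

```python
from pathlib import Path
from typing import Iterator, Set

# skip_dirs and skip_files mirror the lists above; BINARY_EXTS is only a
# small illustrative subset, since the full set is not enumerated here.
skip_dirs: Set[str] = {".git", "node_modules", "build", "dist", "target", ".vscode", ".kno"}
skip_files: Set[str] = {"package-lock.json", "yarn.lock", ".prettierignore"}
BINARY_EXTS: Set[str] = {".png", ".jpg", ".mp3", ".mp4", ".zip", ".woff", ".exe"}


def iter_indexable_files(root: Path) -> Iterator[Path]:
    """Yield files under root that pass all three filters (hypothetical helper)."""
    for path in sorted(root.rglob("*")):
        if not path.is_file():
            continue
        rel = path.relative_to(root)
        # Skip the file if any parent directory is in skip_dirs.
        if any(part in skip_dirs for part in rel.parts[:-1]):
            continue
        if path.name in skip_files or path.suffix.lower() in BINARY_EXTS:
            continue
        yield path
```

Extending any of the sets, for example adding "venv" to skip_dirs, changes what gets indexed without touching the walk itself.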

🤝 Contributing

  1. Fork this repo

  2. Create your feature branch (git checkout -b feature/AmazingFeature)

  3. Commit your changes (git commit -m 'Add amazing feature')

  4. Push to the branch (git push origin feature/AmazingFeature)

  5. Open a Pull Request

Please run pytest before submitting and follow the existing code style.

Download files

Source Distribution

kno_sdk-1.0.0.tar.gz (7.4 kB)

Uploaded Source

Built Distribution

kno_sdk-1.0.0-py3-none-any.whl (7.6 kB)

Uploaded Python 3

File details

Details for the file kno_sdk-1.0.0.tar.gz.

File metadata

  • Download URL: kno_sdk-1.0.0.tar.gz
  • Size: 7.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.10.12

File hashes

Hashes for kno_sdk-1.0.0.tar.gz
Algorithm Hash digest
SHA256 09740e6204f50a7a4d8bbeab2194933202df420759a2b018d34ab13311bbf926
MD5 4fba598010f3e44c166189daa62d314f
BLAKE2b-256 47e035ed63f0986cc3bf506832843fcc8388e45e6d989017b76e5482d1207f28

File details

Details for the file kno_sdk-1.0.0-py3-none-any.whl.

File metadata

  • Download URL: kno_sdk-1.0.0-py3-none-any.whl
  • Size: 7.6 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.10.12

File hashes

Hashes for kno_sdk-1.0.0-py3-none-any.whl
Algorithm Hash digest
SHA256 a5aa9a7e68e7e836d4b941111585d6af05892d7f63de81660b205377f5ab3006
MD5 93cae73553faf2766fb8eafbb7256d68
BLAKE2b-256 156f163eabef2b018c0042945861a4cd281ae3c15802515dc67f414836c48883
