EmbeddedFinder

Semantic file search for your local filesystem.
Ask questions in plain English — find what you need across code, documents, images, audio, and video.

Powered by Google Gemini Embedding 2 and ChromaDB.

❯ efind

╭─ ◆ EmbeddedFinder  v0.1.0 ─────────────────────────────────────╮
│   Semantic file search powered by Gemini Embedding 2             │
│   ● 142 files  (387 chunks)  │  .embeddedfinder/db              │
╰──────────────────────────────────────────────────────────────────╯

  Type a query to search, or /help for commands.

❯ functions that validate user authentication tokens

  5 results  (0.3s)  │  "functions that validate user authentication tokens"

   1  95%   PY   auth.py  4K
        src/auth/auth.py
        ▸ def validate_token(token: str) -> bool: ...

   2  87%   PY   middleware.py  2K
        src/middleware/middleware.py
        ▸ class AuthMiddleware: def process_request(self, req)...

Why EmbeddedFinder?

Traditional search tools (grep, find, ag) match literal text or patterns. EmbeddedFinder matches on meaning: search for "error handling in payments" and it can surface files about exception catching in billing code, even if those exact words never appear.

It works on everything: source code, config files, PDFs, Word documents, images, audio, and video — all in one index.

Features

  • Natural language search — describe what you're looking for, not keywords
  • Multimodal indexing — code, text, PDFs, DOCX, images, audio, and video files
  • Interactive TUI — rich terminal UI with slash commands, progress bars, and color-coded results
  • First-run setup wizard — guided onboarding with API key validation
  • Incremental indexing — content-hashed, only re-processes changed files
  • Batch embedding — groups chunks into minimal API calls for fast indexing
  • File watching — auto-reindex when files change on disk
  • One-shot CLI — scriptable commands for CI/automation
  • Smart ranking — filename matching, file type relevance, and content-aware scoring

Quick start

Install

pip install embedded-finder

Or from source:

git clone https://github.com/vladmarian20005/EmbeddedFinder.git
cd EmbeddedFinder
pip install .

Run

efind

On first launch, a setup wizard walks you through:

  1. Enter your Google AI API key (free tier available)
  2. The key is validated and saved to ~/.config/embeddedfinder/config.json with owner-only permissions
  3. Optionally index a directory right away

That's it — start searching.

Already have a key?

# Option A: environment variable
export GOOGLE_API_KEY=your-key-here

# Option B: .env file in your project root
echo "GOOGLE_API_KEY=your-key-here" > .env

# Option C: set it interactively
efind
# then type: /key set

Usage

Interactive mode (default)

efind

Type natural language queries at the prompt:

❯ database migration scripts
❯ files that handle image resizing
❯ error handling in the payment module
❯ screenshots of the dashboard
❯ audio files with speech

Results show similarity scores, file types, paths, and content snippets — color-coded by relevance.

Slash commands

Command            Description
/index <path>      Index a directory
/reindex <path>    Re-index only changed files
/status            Show index statistics
/clear             Clear the entire index
/watch <path>      Watch a directory and auto-reindex
/key               Show current API key info
/key set           Set or change your API key
/key delete        Remove saved API key
/key show          Reveal the full API key
/help              Show available commands
/quit or Ctrl+C    Exit

CLI commands

For scripting and one-off use:

# Index a directory
efind index ./src

# Index specific file types only
efind index ./src -e .py -e .ts

# Search
efind search "authentication middleware"

# Search with options
efind search "config parsing" --top 5 --min-score 0.7

# Plain text output (no colors, good for piping)
efind search "database models" --plain

# Re-index changed files only
efind reindex ./src

# Watch for changes
efind watch ./src

# Show index stats
efind status

# Clear the index
efind clear

# Check version
efind --version
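
For automation, the one-shot commands above can be driven from a script. The following is a minimal sketch, not part of EmbeddedFinder itself: build_search_command and run_search are hypothetical helper names, and only the flags shown above (--top, --min-score, --plain) are used. The format of the plain output is not documented here, so the sketch returns it unparsed.

```python
import shutil
import subprocess


def build_search_command(query, top=None, min_score=None, plain=True):
    """Build an argv list for `efind search` (hypothetical helper).

    Only the flags documented above are emitted.
    """
    cmd = ["efind", "search", query]
    if top is not None:
        cmd += ["--top", str(top)]
    if min_score is not None:
        cmd += ["--min-score", str(min_score)]
    if plain:
        cmd.append("--plain")
    return cmd


def run_search(query, **options):
    """Run the search and return raw stdout without parsing it."""
    if shutil.which("efind") is None:
        raise RuntimeError("efind is not on PATH")
    result = subprocess.run(
        build_search_command(query, **options),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return result.stdout
```

Passing the query as a single list element avoids shell quoting problems, which matters when queries contain spaces or quotes.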

Supported file types

Category     Extensions
Code         .py .js .ts .jsx .tsx .java .c .cpp .h .hpp .go .rs .rb .php .swift .kt .scala .sh .bash .zsh .lua .pl .ex .exs .r .m .sql
Markup       .html .css .scss .less .xml .svg
Config       .json .yaml .yml .toml .ini .cfg .conf
Text         .txt .md .rst .csv
Documents    .pdf .docx
Images       .png .jpg .jpeg .gif .webp .bmp
Audio        .mp3 .wav .ogg .flac .m4a
Video        .mp4 .mov .avi .mkv .webm

Images, audio, and video are embedded natively using Gemini's multimodal capabilities — no transcription or OCR needed.

PDFs with 6 or fewer pages are embedded natively; larger PDFs use text extraction for efficiency.
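
The routing rule described above can be sketched as a small function. This is an illustration under stated assumptions, not the project's actual code: embedding_mode is a hypothetical name, the caller is assumed to supply the PDF page count, and every other file type (including .docx) is assumed to go through text extraction.

```python
from pathlib import Path

# Extension sets copied from the "Supported file types" table.
NATIVE_MEDIA = {
    ".png", ".jpg", ".jpeg", ".gif", ".webp", ".bmp",  # images
    ".mp3", ".wav", ".ogg", ".flac", ".m4a",           # audio
    ".mp4", ".mov", ".avi", ".mkv", ".webm",           # video
}
MAX_NATIVE_PDF_PAGES = 6  # threshold stated in the text above


def embedding_mode(path, pdf_pages=None):
    """Return "native" or "text" for a file (hypothetical helper)."""
    suffix = Path(path).suffix.lower()
    if suffix in NATIVE_MEDIA:
        return "native"
    if suffix == ".pdf":
        if pdf_pages is None:
            raise ValueError("pdf_pages is required for PDF files")
        return "native" if pdf_pages <= MAX_NATIVE_PDF_PAGES else "text"
    # Assumption: everything else is embedded from extracted text.
    return "text"
```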

How it works

 Directory          EmbeddedFinder                    ChromaDB
 ─────────     ─────────────────────────     ─────────────────────

  files/ ──→  1. Crawl  (skip .git, etc.)
           ──→  2. Extract  (text / bytes)
           ──→  3. Chunk   (~2000 tokens)
           ──→  4. Hash    (SHA-256 dedup)
           ──→  5. Embed   (Gemini API)   ──→  Store vectors

  query  ──→  6. Embed query              ──→  Nearest-neighbor
           ──→  7. Deduplicate by file             search
           ──→  8. Re-rank & boost        ──→  Results

  • Content hashing — files are fingerprinted with SHA-256; re-indexing skips anything unchanged
  • Batch embedding — text chunks are grouped into batches (up to 100 per API call) for throughput
  • Rate limiting — built-in token bucket limiter respects Gemini API quotas
  • Parallel processing — multi-threaded extraction and embedding with up to 4 workers
  • Smart ranking — results are boosted by filename match, file type relevance to query, content overlap, and path depth
  • Directory filtering — hidden directories (starting with .) and common non-content directories (node_modules, __pycache__, .venv, dist, build, etc.) are automatically skipped during crawling
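
The content hashing and batch embedding steps above can be sketched as follows. This is a minimal illustration, not the project's implementation: content_hash, changed_files, and batched are hypothetical names, files are modeled as an in-memory mapping of path to bytes, and the batch size of 100 is the limit stated above.

```python
import hashlib


def content_hash(data: bytes) -> str:
    """Fingerprint file contents with SHA-256."""
    return hashlib.sha256(data).hexdigest()


def changed_files(files: dict, known_hashes: dict) -> list:
    """Return paths whose content differs from the previous run.

    files maps path -> bytes; known_hashes maps path -> hex digest
    recorded when the file was last indexed.
    """
    return [
        path
        for path, data in files.items()
        if known_hashes.get(path) != content_hash(data)
    ]


def batched(items: list, size: int = 100):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]
```

With 250 pending chunks, batched produces groups of 100, 100, and 50, so three API calls would cover them instead of 250.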

Configuration

Variable                 Default               Description
GOOGLE_API_KEY           (none)                Google AI API key (required)
EMBEDDEDFINDER_DB_DIR    .embeddedfinder/db    Path to the ChromaDB database
The API key can also be stored via the setup wizard or /key set, which saves it to ~/.config/embeddedfinder/config.json with owner-only permissions.
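
As a rough sketch of what "owner-only permissions" and the EMBEDDEDFINDER_DB_DIR default mean in practice, the snippet below writes a JSON file with mode 0o600 and resolves the database directory from the environment. The function names and the "api_key" field in the usage line are assumptions for illustration; the actual layout of config.json is not documented here.

```python
import json
import os
from pathlib import Path


def save_config(path, data):
    """Write JSON so that only the owner can read or write it (POSIX)."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    # Create with 0o600 from the start so the file is never world-readable.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "w", encoding="utf-8") as fh:
        json.dump(data, fh)
    os.chmod(path, 0o600)  # also tightens a file that already existed


def resolve_db_dir(env=None):
    """Return the database path, falling back to the documented default."""
    env = os.environ if env is None else env
    return env.get("EMBEDDEDFINDER_DB_DIR", ".embeddedfinder/db")
```

For example, save_config(Path.home() / ".config/embeddedfinder/config.json", {"api_key": "your-key-here"}) would produce a file that other users on the machine cannot read.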

Project structure

embedded_finder/
├── cli.py            # Click CLI — subcommands + TUI launcher
├── tui.py            # Interactive Rich-based REPL
├── config.py         # Settings, supported extensions, env vars
├── config_store.py   # Persistent config file management
├── crawler.py        # Recursive file discovery
├── extractor.py      # Text extraction, chunking, MIME detection
├── embedder.py       # Gemini Embedding API client + batching
├── store.py          # ChromaDB vector store
├── indexer.py        # Orchestrates crawl → extract → embed → store
├── search.py         # Query embedding + nearest-neighbor search
├── ranker.py         # Result ranking, dedup, and formatting
├── rate_limiter.py   # Token bucket rate limiter
└── watcher.py        # Filesystem watcher (watchdog)

Development

# Clone and install with dev dependencies
git clone https://github.com/vladmarian20005/EmbeddedFinder.git
cd EmbeddedFinder
pip install -e ".[dev]"

# Run tests
pytest

# Run tests with coverage
pytest --cov=embedded_finder

Contributing

Contributions are welcome! Please open an issue or submit a pull request.

  1. Fork the repository
  2. Create your feature branch (git checkout -b feature/my-feature)
  3. Commit your changes (git commit -m 'Add my feature')
  4. Push to the branch (git push origin feature/my-feature)
  5. Open a pull request

License

MIT
