A natural language search engine for your personal notes, transactions and images

Khoj 🦅

Features

  • Natural: Advanced natural language understanding using Transformer-based ML models
  • Local: Your personal data stays local. All search and indexing is done on your machine*
  • Incremental: Incremental search for a fast, search-as-you-type experience
  • Pluggable: Modular architecture makes it easy to plug in new data sources, frontends and ML models
  • Multiple Sources: Search your Org-mode and Markdown notes, Beancount transactions and photos
  • Multiple Interfaces: Search using a Web Browser, Emacs or the API

Demo

https://user-images.githubusercontent.com/6413477/181664862-31565b0a-0e64-47e1-a79a-599dfc486c74.mp4

Description

Analysis

  • The results do not have any words used in the query
    • Based on the top result it seems the re-ranking model understands that Emacs is an editor?
  • The results incrementally update as the query is entered
  • The results are re-ranked, for better accuracy, once the user is idle
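The last two points can be sketched roughly as follows. This is a minimal illustration, not Khoj's frontend code: the class name `IdleReranker`, the `idle_seconds` threshold and the two stub functions are all hypothetical, and time is passed in explicitly so the behaviour is easy to follow.

```python
from typing import Callable, List, Optional


class IdleReranker:
    """Run a cheap search on every keystroke, and a slow rerank once typing pauses."""

    def __init__(
        self,
        fast_search: Callable[[str], List[str]],
        rerank: Callable[[str, List[str]], List[str]],
        idle_seconds: float = 1.0,  # hypothetical threshold, not taken from Khoj
    ) -> None:
        self.fast_search = fast_search
        self.rerank = rerank
        self.idle_seconds = idle_seconds
        self.results: List[str] = []
        self._query = ""
        self._last_keystroke: Optional[float] = None
        self._reranked = True

    def on_keystroke(self, query: str, now: float) -> List[str]:
        # Every keystroke refreshes the results with the fast search only
        self._query = query
        self._last_keystroke = now
        self._reranked = False
        self.results = self.fast_search(query)
        return self.results

    def on_tick(self, now: float) -> List[str]:
        # Called periodically; reranks at most once per pause in typing
        if self._last_keystroke is None or self._reranked:
            return self.results
        if now - self._last_keystroke >= self.idle_seconds:
            self.results = self.rerank(self._query, self.results)
            self._reranked = True
        return self.results


# Stub search functions that only record when they are called
log = []

def stub_fast_search(query: str) -> List[str]:
    log.append(("fast", query))
    return ["b", "a", "c"]

def stub_rerank(query: str, results: List[str]) -> List[str]:
    log.append(("rerank", query))
    return sorted(results)

searcher = IdleReranker(stub_fast_search, stub_rerank, idle_seconds=0.5)
searcher.on_keystroke("Set", now=0.0)
searcher.on_keystroke("Setup", now=0.2)
early = searcher.on_tick(now=0.4)    # still typing: fast results are kept
settled = searcher.on_tick(now=0.8)  # idle for 0.6 s: results are reranked
```

In this sketch the slow rerank runs once, for the final query only, no matter how many keystrokes came before it.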

Architecture

Setup

1. Clone

git clone https://github.com/debanjum/khoj && cd khoj

2. Configure

  • Required: Update docker-compose.yml to mount your images, (org-mode or markdown) notes and beancount directories
  • Optional: Edit application configuration in khoj_sample.yml

3. Run

docker-compose up -d

Note: The first run will take time. Let it run; it is most likely not hung, just generating embeddings

Use

Upgrade

docker-compose build --pull

Troubleshoot

  • Symptom: Errors out with "Killed" in error message
  • Symptom: Errors out complaining about a Tensor mismatch, null values etc.
    • Mitigation: Delete the content-type > image section from khoj_sample.yml

Miscellaneous

  • The experimental chat API endpoint uses the OpenAI API
    • It is disabled by default
    • To use it, add your openai-api-key to config.yml

Development Setup

Setup on Local Machine

Using Pip

  1. Install Dependencies

    1. Python3, Pip [Required]
    2. Virtualenv [Optional]
    3. Install Exiftool [Optional]
      sudo apt-get -y install libimage-exiftool-perl
      
  2. Install Khoj

    virtualenv -p python3 .venv && source .venv/bin/activate # Optional
    pip install khoj-assistant
    
  3. Configure

    • Configure files/directories to search in content-type section of khoj_sample.yml
    • To run application on test data, update file paths containing /data/ to tests/data/ in khoj_sample.yml
      • Example: replace /data/notes/*.org with tests/data/notes/*.org
  4. Run

    Load the ML model, generate embeddings and expose the API to query the notes, images, transactions etc. specified in the config YAML

    khoj -c=config/khoj_sample.yml -vv
    

Using Conda

  1. Install Dependencies

    1. Install Python3 [Required]
    2. Install Conda [Required]
    3. Install Exiftool [Optional]
      sudo apt-get -y install libimage-exiftool-perl
      
  2. Install Khoj

    git clone https://github.com/debanjum/khoj && cd khoj
    conda env create -f config/environment.yml
    conda activate khoj
    
  3. Configure

    • Configure files/directories to search in content-type section of khoj_sample.yml
    • To run application on test data, update file paths containing /data/ to tests/data/ in khoj_sample.yml
      • Example: replace /data/notes/*.org with tests/data/notes/*.org
  4. Run

    Load the ML model, generate embeddings and expose the API to query the notes, images, transactions etc. specified in the config YAML

    python3 -m src.main -c=config/khoj_sample.yml -vv
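
The path change described in the Configure step can also be scripted. The sketch below is one possible way to do it, not part of Khoj: the helper name `use_test_data` and the `notes:` key in the sample line are hypothetical. It rewrites only paths that start with /data/, so running it twice, or on a path such as /home/me/data/, changes nothing further.

```python
import re

# Match "/data/" only at the start of a path, i.e. not preceded by
# another path character (letter, digit, "_", ".", "/", "~" or "-")
_DATA_ROOT = re.compile(r"(?<![\w./~-])/data/")


def use_test_data(config_text: str) -> str:
    """Point every path rooted at /data/ in the config text at tests/data/."""
    return _DATA_ROOT.sub("tests/data/", config_text)


# Hypothetical config line; the real khoj_sample.yml layout may differ
sample = "notes: /data/notes/*.org\n"
converted = use_test_data(sample)
print(converted, end="")
```

To apply it to the real file, read config/khoj_sample.yml as text, pass the contents through `use_test_data` and write the result back.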
    

Upgrade On Local Machine

Using Pip

pip install --upgrade khoj-assistant

Using Conda

cd khoj
git pull origin master
conda deactivate
conda env update -f config/environment.yml
conda activate khoj

Run Unit Tests

pytest

Performance

Query performance

  • Semantic search using the bi-encoder is fairly fast at <5 ms
  • Reranking using the cross-encoder is slower, at <2 s on 15 results. Tweak top_k to trade off speed against accuracy of results
  • Applying explicit filters is currently very slow, at ~6 s, because the filters are rudimentary. Considerable speed-ups can be achieved using indexes etc.
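
The role of top_k can be illustrated with a toy retrieve-then-rerank loop. This is a sketch under stated assumptions, not Khoj's implementation: cosine similarity over hand-written 2-d vectors stands in for the bi-encoder, and `toy_reranker` stands in for the cross-encoder. The function names and data are hypothetical.

```python
import math
from typing import Callable, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve_then_rerank(
    query_vec: Sequence[float],
    entry_vecs: Sequence[Sequence[float]],
    rerank_score: Callable[[int], float],
    top_k: int,
) -> List[int]:
    """Return entry indices: cheap vector shortlist first, slow rerank second."""
    # Stage 1: rank every entry by the cheap vector similarity
    by_similarity = sorted(
        range(len(entry_vecs)),
        key=lambda i: cosine(query_vec, entry_vecs[i]),
        reverse=True,
    )
    shortlist = by_similarity[:top_k]
    # Stage 2: run the slow scorer on the shortlist only
    return sorted(shortlist, key=rerank_score, reverse=True)


# Toy data: hand-written vectors and scores, not real embeddings
entries = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.7, 0.7]]
query = [1.0, 0.2]
slow_calls = []

def toy_reranker(index: int) -> float:
    slow_calls.append(index)  # record each call to the slow scorer
    return [0.2, 0.9, 0.1, 0.5][index]

narrow = retrieve_then_rerank(query, entries, toy_reranker, top_k=2)
narrow_calls = len(slow_calls)
wide = retrieve_then_rerank(query, entries, toy_reranker, top_k=4)
wide_calls = len(slow_calls) - narrow_calls
```

The slow scorer runs once per shortlisted entry, so its cost grows with top_k. A larger top_k also lets it promote entries that the cheap first stage ranked low, as happens to entry 3 in the `wide` result.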

Indexing performance

  • Indexing is more strongly impacted by the size of the source data
  • Indexing 100K+ line corpus of notes takes 6 minutes
  • Indexing 4000+ images takes about 15 minutes and more than 8 GB of RAM
  • Once https://github.com/debanjum/khoj/issues/36 is implemented, it should only take this long on first run
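
One common way to make only the first run expensive is to cache embeddings by a hash of each entry's text. The sketch below illustrates that general idea; whether the linked issue takes this approach is not stated here. `embed_incrementally` and `toy_embed` are hypothetical names, and the toy embedding is a placeholder for a real model.

```python
import hashlib
from typing import Callable, Dict, List


def embed_incrementally(
    entries: List[str],
    embed: Callable[[str], List[float]],
    cache: Dict[str, List[float]],
) -> List[List[float]]:
    """Return one embedding per entry, calling embed only for unseen text."""
    vectors = []
    for text in entries:
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if key not in cache:
            cache[key] = embed(text)  # slow path: new or edited entries only
        vectors.append(cache[key])
    return vectors


embedded = []

def toy_embed(text: str) -> List[float]:
    embedded.append(text)      # record which entries reach the slow path
    return [float(len(text))]  # placeholder vector, not a real embedding

cache: Dict[str, List[float]] = {}
first = embed_incrementally(["note one", "note two"], toy_embed, cache)
first_run_calls = len(embedded)
second = embed_incrementally(["note one", "note two!", "note three"], toy_embed, cache)
second_run_calls = len(embedded) - first_run_calls
```

On the second run only the edited and the new entry are embedded again. The sketch never evicts cache entries for deleted notes, and a real implementation would also need to persist the cache between runs.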

Miscellaneous

  • Testing done on a Mac M1 and a >100K line corpus of notes
  • Search and indexing on a GPU have not been tested yet

Credits
