A natural language search engine for your personal notes, transactions and images

Khoj 🦅

Supported Plugins

Khoj on Obsidian Khoj on Emacs

Features

  • Natural: Advanced natural language understanding using Transformer based ML Models
  • Local: Your personal data stays local. All search, indexing is done on your machine*
  • Incremental: Incremental search for a fast, search-as-you-type experience
  • Pluggable: Modular architecture makes it easy to plug in new data sources, frontends and ML models
  • Multiple Sources: Search your Org-mode and Markdown notes, Beancount transactions and Photos
  • Multiple Interfaces: Search from your Web Browser, Emacs or Obsidian

Demos

Khoj in Obsidian

https://user-images.githubusercontent.com/6413477/210486007-36ee3407-e6aa-4185-8a26-b0bfc0a4344f.mp4

Description
  • Install Khoj via pip and start Khoj backend in non-gui mode
  • Install Khoj plugin via Community Plugins settings pane on Obsidian app
  • Check the new Khoj plugin settings
  • Let Khoj backend index the markdown files in the current Vault
  • Open Khoj plugin on Obsidian via Search button on Left Pane
  • Search "Announce plugin to folks" in the Obsidian Plugin docs
  • Jump to the search result

Khoj in Emacs, Browser

https://user-images.githubusercontent.com/6413477/184735169-92c78bf1-d827-4663-9087-a1ea194b8f4b.mp4

Description
  • Install Khoj via pip
  • Start Khoj app
  • Add this readme and khoj.el readme as org-mode for Khoj to index
  • Search "Setup editor" on the Web and Emacs. Re-rank the results for better accuracy
  • Top result is what we are looking for, the section to Install Khoj.el on Emacs
Analysis
  • The results do not have any words used in the query
    • Based on the top result it seems the re-ranking model understands that Emacs is an editor?
  • The results incrementally update as the query is entered
  • The results are re-ranked, for better accuracy, once user hits enter

Interfaces

Architecture

Setup

These are the general setup instructions for Khoj.

1. Install

pip install khoj-assistant

2. Start App

khoj

3. Configure

  1. Enable content types and point to files to search in the First Run Screen that pops up on app start
  2. Click Configure and wait. The app will download ML models and index the content for search

Use

Interfaces

Query Filters

Use structured query syntax to filter the natural language search results

  • Word Filter: Get entries that include/exclude a specified term
    • Entries that contain term_to_include: +"term_to_include"
    • Entries that contain term_to_exclude: -"term_to_exclude"
  • Date Filter: Get entries containing dates in YYYY-MM-DD format from specified date (range)
    • Entries from April 1st 1984: dt:"1984-04-01"
    • Entries after March 31st 1984: dt>="1984-04-01"
    • Entries before April 2nd 1984: dt<="1984-04-01"
  • File Filter: Get entries from a specified file
    • Entries from incoming.org file: file:"incoming.org"
  • Combined Example
    • what is the meaning of life? file:"1984.org" dt>="1984-01-01" dt<="1985-01-01" -"big" -"brother"
    • Adds all filters to the natural language query. It should return entries
      • from the file 1984.org
      • containing dates from the year 1984
      • excluding words "big" and "brother"
      • that best match the natural language query "what is the meaning of life?"
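
To make the filter syntax concrete, here is a minimal sketch that splits a raw query into its filters and the remaining natural language text. It is not Khoj's own parser: the names `ParsedQuery` and `parse_query` are hypothetical, and only the operators shown above (`+`, `-`, `dt:`, `dt>=`, `dt<=`, `file:`) are handled.

```python
import re
from dataclasses import dataclass, field

# Hypothetical helper types and names; this is NOT Khoj's own parser.
@dataclass
class ParsedQuery:
    text: str                                     # remaining natural language query
    include: list = field(default_factory=list)   # terms from +"term"
    exclude: list = field(default_factory=list)   # terms from -"term"
    files: list = field(default_factory=list)     # values from file:"name"
    dates: list = field(default_factory=list)     # (operator, "YYYY-MM-DD") pairs

FILE_RE = re.compile(r'file:"([^"]+)"')
DATE_RE = re.compile(r'dt(>=|<=|:)"(\d{4}-\d{2}-\d{2})"')
# The lookbehind keeps the pattern from matching inside another token
WORD_RE = re.compile(r'(?<!\S)([+-])"([^"]+)"')

def parse_query(raw: str) -> ParsedQuery:
    parsed = ParsedQuery(text="")
    parsed.files = FILE_RE.findall(raw)
    parsed.dates = DATE_RE.findall(raw)
    for sign, term in WORD_RE.findall(raw):
        (parsed.include if sign == "+" else parsed.exclude).append(term)
    # Strip every filter, then collapse the leftover whitespace
    rest = raw
    for pattern in (FILE_RE, DATE_RE, WORD_RE):
        rest = pattern.sub(" ", rest)
    parsed.text = " ".join(rest.split())
    return parsed
```

Running `parse_query` on the combined example separates the question from the one file filter, the two date bounds and the two excluded words.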

Upgrade

Upgrade Khoj Server

pip install --upgrade khoj-assistant
  • Note: To upgrade to the latest pre-release version of the khoj server, run the command below
    # Maps to the latest commit on the master branch
    pip install --upgrade --pre khoj-assistant
    

Upgrade Khoj on Emacs

  • Use your Emacs Package Manager to Upgrade
  • See khoj.el readme for details

Upgrade Khoj on Obsidian

  • Upgrade via the Community plugins tab on the settings pane in the Obsidian app
  • See the khoj plugin readme for details

Uninstall Khoj

  1. (Optional) Hit Ctrl-C in the terminal running the khoj server to stop it
  2. Delete the khoj directory in your home folder (i.e. ~/.khoj on Linux, Mac or C:\Users\<your-username>\.khoj on Windows)
  3. Uninstall the khoj server with pip uninstall khoj-assistant
  4. (Optional) Uninstall khoj.el or the khoj obsidian plugin in the standard way on Emacs, Obsidian

Troubleshoot

Install fails while building Tokenizer dependency

  • Details: pip install khoj-assistant fails while building the tokenizers dependency. Complains about Rust.
  • Fix: Install Rust to build the tokenizers package. For example on Mac run:
    brew install rustup
    rustup-init
    source ~/.cargo/env
    
  • Refer: Issue with Fix for more details

Search starts giving wonky results

  • Fix: Open /api/update?force=true[^2] in browser to regenerate index from scratch
  • Note: This is a fix for when you perceive that the search results have degraded, not for when you think they have always been wonky
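
The same re-index can be triggered from a script instead of the browser. The sketch below assumes the default Khoj url from the footnote and the `force=true` parameter shown above. The function names and the timeout value are illustrative assumptions, not part of Khoj.

```python
from urllib.parse import urlencode, urljoin
from urllib.request import urlopen

KHOJ_URL = "http://localhost:8000"  # default Khoj url, per the footnote

def update_url(base_url: str = KHOJ_URL, force: bool = True) -> str:
    """Build the index update URL (helper name is hypothetical)."""
    query = urlencode({"force": str(force).lower()})
    return urljoin(base_url, "/api/update") + "?" + query

def regenerate_index(base_url: str = KHOJ_URL, timeout: float = 600.0) -> int:
    """Request a full re-index and return the HTTP status code.

    The generous timeout is an assumption: regenerating from scratch can be slow.
    """
    with urlopen(update_url(base_url, force=True), timeout=timeout) as response:
        return response.status

# Usage (needs a running Khoj server):
#   regenerate_index()
```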

Khoj in Docker errors out with "Killed" in error message

Khoj errors out complaining about Tensors mismatch or null

  • Mitigation: Disable image search using the desktop GUI

Advanced Usage

Access Khoj on Mobile

  1. Set up Khoj on your personal server. This can be any always-on machine, e.g. an old computer, Raspberry Pi(?) etc.
  2. Install Tailscale on your personal server and phone
  3. Open the Khoj web interface of the server from your phone browser.
    It should be http://tailscale-ip-of-server:8000 or http://name-of-server:8000 if you've set up MagicDNS
  4. Click the Add to Homescreen button
  5. Enjoy exploring your notes, transactions and images from your phone!

Chat with Notes

Overview

  • Provides a chat interface to inquire and engage with your notes
  • Chat Types:
    • Summarize: Pulls the most relevant note from your notes and summarizes it
    • Chat: Also handles general chat. It guesses whether to give a general response or to search your notes and summarize the result.
      E.g. "How was your day?" will get a general response, but "When did I go surfing?" should get a response drawn from your notes
  • Note: Your query and top note from search result will be sent to OpenAI for processing

Use

  1. Set up your OpenAI API key in Khoj
  2. Open /chat?t=summarize[^2]
  3. Type your queries, see summarized response by Khoj from your notes

Demo

Use OpenAI Models for Search

Setup

  1. Set encoder-type, encoder and model-directory under asymmetric and/or symmetric search-type in your khoj.yml[^1]:
       asymmetric:
    -    encoder: "sentence-transformers/multi-qa-MiniLM-L6-cos-v1"
    +    encoder: text-embedding-ada-002
    +    encoder-type: src.khoj.utils.models.OpenAI
         cross-encoder: "cross-encoder/ms-marco-MiniLM-L-6-v2"
    -    encoder-type: sentence_transformers.SentenceTransformer
    -    model_directory: "~/.khoj/search/asymmetric/"
    +    model-directory: null
    
  2. Set up your OpenAI API key in Khoj
  3. Restart Khoj server to generate embeddings. It will take longer than with offline models.

Warnings

This configuration uses an online model

  • It will send all notes to OpenAI to generate embeddings
  • All queries will be sent to OpenAI when you search with Khoj
  • You will be charged by OpenAI based on the total tokens processed
  • It requires an active internet connection to search and index

Search across Different Languages

To search for notes in multiple, different languages, you can use a multi-lingual model.
For example, the paraphrase-multilingual-MiniLM-L12-v2 supports 50+ languages, has good search quality and speed. To use it:

  1. Manually update search-type > asymmetric > encoder to sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2 in your ~/.khoj/khoj.yml file for now. See diff of khoj.yml below for illustration:
 asymmetric:
- encoder: "sentence-transformers/multi-qa-MiniLM-L6-cos-v1"
+ encoder: "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2"
   cross-encoder: "cross-encoder/ms-marco-MiniLM-L-6-v2"
   model_directory: "~/.khoj/search/asymmetric/"
  2. Regenerate your content index. For example, by opening <khoj-url>/api/update?force=true
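
If you would rather script the edit than change the file by hand, the following is a rough sketch that swaps the encoder line inside the asymmetric section of the config text. It works on plain text rather than parsed YAML, so it assumes the simple indentation-based layout shown in the diff. The function name `set_asymmetric_encoder` is hypothetical.

```python
NEW_ENCODER = "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2"

def set_asymmetric_encoder(config_text: str, encoder: str = NEW_ENCODER) -> str:
    """Return config_text with the encoder under the asymmetric section replaced.

    Assumes an indentation-based layout like the diff above; not a YAML parser.
    """
    lines = config_text.splitlines()
    in_asymmetric = False
    section_indent = 0
    for i, line in enumerate(lines):
        stripped = line.strip()
        indent = len(line) - len(line.lstrip())
        if stripped == "asymmetric:":
            in_asymmetric, section_indent = True, indent
            continue
        if in_asymmetric and stripped and indent <= section_indent:
            in_asymmetric = False  # a sibling or parent key ends the section
        # "encoder:" matches neither "cross-encoder:" nor "encoder-type:"
        if in_asymmetric and stripped.startswith("encoder:"):
            lines[i] = f'{line[:indent]}encoder: "{encoder}"'
    return "\n".join(lines) + "\n"
```

Back up ~/.khoj/khoj.yml before writing the result to it, then regenerate the index as in the last step.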

Miscellaneous

Set your OpenAI API key in Khoj

If you want, Khoj can be configured to use OpenAI for search and chat.
Add your OpenAI API key to Khoj by using either of the two options below:

  • Open the Khoj desktop GUI, add your OpenAI API key and click Configure. Ensure khoj is started without the --no-gui flag. Check your system tray to see if Khoj 🦅 is minimized there.
  • Set openai-api-key field under processor.conversation section in your khoj.yml[^1] to your OpenAI API key and restart khoj:
    processor:
      conversation:
    -    openai-api-key: # "YOUR_OPENAI_API_KEY"
    +    openai-api-key: sk-aaaaaaaaaaaaaaaaaaaaaaaahhhhhhhhhhhhhhhhhhhhhhhh
        model: "text-davinci-003"
        conversation-logfile: "~/.khoj/processor/conversation/conversation_logs.json"
    

Warning: This will enable khoj to send your query and note(s) to OpenAI for processing

Beta API

Performance

Query performance

  • Semantic search using the bi-encoder is fairly fast at <50 ms
  • Reranking using the cross-encoder is slower at <2s on 15 results. Tweak top_k to trade off speed for accuracy of results
  • Filters in query (e.g. by file, word or date) usually add <20ms to query latency
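
The top_k trade-off comes from the two-stage design: a cheap similarity pass over every entry, then a costlier rerank over only the best top_k candidates. Here is a toy sketch of that shape in plain Python. The cosine function stands in for the bi-encoder comparison and `rerank_score` is a placeholder callable standing in for the cross-encoder with the query already bound. Neither is Khoj's actual code.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def search(query_vec, entries, rerank_score, top_k=15):
    """entries: list of (text, embedding) pairs. Returns texts, best first."""
    # Stage 1: cheap retrieval over precomputed embeddings, keep top_k
    candidates = sorted(
        entries, key=lambda e: cosine(query_vec, e[1]), reverse=True
    )[:top_k]
    # Stage 2: costlier scoring, applied only to the top_k candidates
    reranked = sorted(candidates, key=lambda e: rerank_score(e[0]), reverse=True)
    return [text for text, _ in reranked]
```

A smaller top_k means fewer calls to the slow scorer, but an entry the first stage ranks low can never be recovered by the rerank.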

Indexing performance

  • Indexing is more strongly impacted by the size of the source data
  • Indexing a 100K+ line corpus of notes takes about 10 minutes
  • Indexing 4000+ images takes about 15 minutes and more than 8 GB of RAM
  • Note: It should only take this long on the first run as the index is incrementally updated
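
One common way to make index updates incremental is to key each entry by a hash of its content and embed only entries not seen before. The sketch below illustrates that general idea. It is an assumption about the technique, not a description of how Khoj implements it, and `entry_key`, `update_index` and `embed` are hypothetical names.

```python
import hashlib

def entry_key(text: str) -> str:
    """Stable key derived from the entry content."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def update_index(index: dict, entries: list, embed) -> int:
    """Sync index (key -> embedding) with entries, embedding only new text.

    embed is any callable mapping text to an embedding.
    Returns the number of entries that had to be embedded.
    """
    current = {entry_key(text): text for text in entries}
    # Drop embeddings for entries that no longer exist
    for stale in set(index) - set(current):
        del index[stale]
    # Embed only the entries missing from the index
    new_keys = [key for key in current if key not in index]
    for key in new_keys:
        index[key] = embed(current[key])
    return len(new_keys)
```

With this shape, the first run pays for every entry, while later runs pay only for added or edited ones.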

Miscellaneous

  • Testing done on a Mac M1 and a >100K line corpus of notes
  • Search, indexing on a GPU has not been tested yet

Development

Visualize Codebase

Interactive Visualization

Setup

Using Pip

1. Install
git clone https://github.com/debanjum/khoj && cd khoj
python3 -m venv .venv && source .venv/bin/activate
pip install -e ".[dev]"
2. Run
  1. Start Khoj
    khoj -vv
    
  2. Configure Khoj
    • Via GUI: Add files, directories to index in the GUI window that pops up on starting Khoj, then Click Configure
    • Manually:
      • Copy the config/khoj_sample.yml to ~/.khoj/khoj.yml
      • Set input-files or input-filter in each relevant content-type section of ~/.khoj/khoj.yml
        • Set input-directories field in image content-type section
      • Delete content-type and processor sub-section(s) irrelevant for your use-case
      • Restart khoj

Note: After configuration, wait for khoj to load the ML models, generate embeddings and expose the API to query the notes, images, transactions etc. specified in the config YAML

Using Docker

1. Clone
git clone https://github.com/debanjum/khoj && cd khoj
2. Configure
  • Required: Update docker-compose.yml to mount your images, (org-mode or markdown) notes and beancount directories
  • Optional: Edit application configuration in khoj_docker.yml
3. Run
docker-compose up -d

Note: The first run will take time. Let it run; it is most likely not hung, just generating embeddings

4. Upgrade
docker-compose build --pull

Using Conda

1. Install Dependencies
2. Install Khoj
git clone https://github.com/debanjum/khoj && cd khoj
conda env create -f config/environment.yml
conda activate khoj
python3 -m pip install pyqt6  # As conda does not support pyqt6 yet
3. Configure
  • Copy the config/khoj_sample.yml to ~/.khoj/khoj.yml
  • Set input-files or input-filter in each relevant content-type section of ~/.khoj/khoj.yml
    • Set input-directories field in image content-type section
  • Delete content-type, processor sub-sections irrelevant for your use-case
4. Run
python3 -m src.khoj.main -vv

This loads the ML models, generates embeddings and exposes the API to query the notes, images, transactions etc. specified in the config YAML

5. Upgrade
cd khoj
git pull origin master
conda deactivate
conda env update -f config/environment.yml
conda activate khoj

Validate

Before Making Changes

  1. Install Git Hooks for Validation
    pre-commit install -t pre-push -t pre-commit
    
    • This ensures standard code formatting fixes and other checks run automatically on every commit and push
    • Note 1: If pre-commit didn't already get installed, install it via pip install pre-commit
    • Note 2: To run the pre-commit checks manually, use pre-commit run --hook-stage manual --all before creating a PR

Before Creating PR

  1. Run Tests

    pytest
    
  2. Run MyPy to check types

    mypy --config-file pyproject.toml
    

After Creating PR

  • Automated validation workflows run for every PR.

    Ensure any issues seen by them are fixed

  • Test the Python package created for a PR

    1. Download and extract the zipped .whl artifact generated from the pypi workflow run for the PR.
    2. Install (in your virtualenv) with pip install /path/to/download*.whl
    3. Start and use the application to see if it works fine

Credits

[^1]: Default Khoj config file @ ~/.khoj/khoj.yml

[^2]: Default Khoj url @ http://localhost:8000
