Khoj 🦅

A natural language search engine for your personal notes, transactions and images

Features

  • Natural: Advanced natural language understanding using Transformer-based ML models
  • Local: Your personal data stays local. All search and indexing is done on your machine*
  • Incremental: Incremental search for a fast, search-as-you-type experience
  • Pluggable: Modular architecture makes it easy to plug in new data sources, frontends and ML models
  • Multiple Sources: Search your Org-mode and Markdown notes, Beancount transactions and photos
  • Multiple Interfaces: Search using a web browser, Emacs or the API

Demo

https://user-images.githubusercontent.com/6413477/181664862-31565b0a-0e64-47e1-a79a-599dfc486c74.mp4

Description

Analysis

  • None of the results contain words used in the query
    • Based on the top result, it seems the re-ranking model understands that Emacs is an editor?
  • The results update incrementally as the query is entered
  • The results are re-ranked, for better accuracy, once the user is idle

Architecture

Setup

1. Install

pip install khoj-assistant

2. Configure

  • Set input-files or input-filter in each relevant content-type section of khoj_sample.yml
    • Set the input-directories field in the content-type.image section
  • Delete the content-type and processor sub-sections that are irrelevant to your use case
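
The configuration rules above can be checked mechanically before starting Khoj. The sketch below is illustrative only: a plain Python dict stands in for the parsed config YAML, and the section name `org` and the file path are hypothetical examples rather than values taken from khoj_sample.yml. Only the field names input-files, input-filter and input-directories come from the steps above, and treating input-directories as the only valid input for the image section is one reading of them.

```python
# Minimal sketch: find content-type sections that name no input data.
# The dict stands in for the parsed config YAML; the section names and
# the path below are hypothetical examples.

def missing_inputs(config):
    """Return the names of content-type sections with no input source set."""
    problems = []
    for name, section in config.get("content-type", {}).items():
        section = section or {}
        if name == "image":
            # Assumption: image sections are configured via input-directories
            has_input = bool(section.get("input-directories"))
        else:
            # Other sections need either input-files or input-filter
            has_input = bool(section.get("input-files") or section.get("input-filter"))
        if not has_input:
            problems.append(name)
    return problems


example_config = {
    "content-type": {
        "org": {"input-files": ["~/notes/todo.org"], "input-filter": None},
        "image": {"input-directories": []},  # nothing set yet
    }
}

print(missing_inputs(example_config))  # → ['image']
```

A section reported here should either be given an input source or be deleted from the config, as described in the last step above.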

3. Run

khoj -c=config/khoj_sample.yml -vv

Loads the ML model, generates embeddings and exposes an API to search the notes, images, transactions etc. specified in the config YAML
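
Once the server is running, the API can be queried over HTTP. The sketch below only builds a search URL and does not send it. It assumes the server listens on http://localhost:8000 and exposes a /search endpoint that takes q (query), t (content type) and n (number of results) parameters. None of these details are stated in this README, so verify them against the API documentation of your installed version.

```python
from urllib.parse import urlencode

# Assumption: default address of a locally running Khoj server
BASE_URL = "http://localhost:8000"


def build_search_url(query, content_type, results=5, base_url=BASE_URL):
    """Build a search URL; the endpoint path and parameter names are assumptions."""
    params = {"q": query, "t": content_type, "n": results}
    return f"{base_url}/search?{urlencode(params)}"


url = build_search_url("notes on emacs", "org")
print(url)  # → http://localhost:8000/search?q=notes+on+emacs&t=org&n=5
```

The resulting URL can be opened in a browser or fetched with any HTTP client.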

Use

Upgrade

pip install --upgrade khoj-assistant

Troubleshoot

  • Symptom: Errors out complaining about Tensors mismatch, null etc

    • Mitigation: Delete content-type > image section from khoj_sample.yml
  • Symptom: Errors out with "Killed" in error message in Docker

Miscellaneous

  • The experimental chat API endpoint uses the OpenAI API
    • It is disabled by default
    • To use it, add your openai-api-key to config.yml

Performance

Query performance

  • Semantic search using the bi-encoder is fairly fast, at <5 ms
  • Reranking using the cross-encoder is slower, at <2 s on 15 results. Tweak top_k to trade off speed against accuracy of results
  • Applying explicit filters is currently very slow, at ~6 s, because the filters are rudimentary. Considerable speed-ups can be achieved using indexes etc.
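
The role of top_k can be illustrated with a toy two-stage search: a cheap similarity score ranks every entry, then a costlier scorer re-ranks only the top_k candidates. This is a simplified sketch, not Khoj's implementation. The embeddings are hand-written toy vectors, and `word_overlap` is a hypothetical stand-in for a cross-encoder.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(query_vec, query_text, entries, slow_score, top_k=15):
    """entries is a list of (text, embedding) pairs."""
    # Stage 1: cheap similarity against every precomputed embedding
    ranked = sorted(entries, key=lambda e: cosine(query_vec, e[1]), reverse=True)
    candidates = ranked[:top_k]
    # Stage 2: costly pairwise scoring, limited to the top_k candidates
    reranked = sorted(candidates, key=lambda e: slow_score(query_text, e[0]), reverse=True)
    return [text for text, _ in reranked]


def word_overlap(query, text):
    """Hypothetical stand-in for a cross-encoder: Jaccard overlap of words."""
    q, t = set(query.lower().split()), set(text.lower().split())
    return len(q & t) / len(q | t) if q | t else 0.0


entries = [
    ("buy milk and eggs", [0.9, 0.1, 0.0]),
    ("emacs is an extensible editor", [0.1, 0.9, 0.2]),
    ("configure my text editor", [0.2, 0.8, 0.1]),
    ("pay rent on friday", [0.8, 0.0, 0.3]),
]

results = search([0.1, 0.9, 0.1], "which editor should I configure",
                 entries, word_overlap, top_k=2)
print(results)  # → ['configure my text editor', 'emacs is an extensible editor']
```

A smaller top_k means fewer calls to the slow scorer and a faster response, but a relevant entry that the first stage ranks below the cut-off is never re-ranked.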

Indexing performance

  • Indexing is more strongly impacted by the size of the source data
  • Indexing 100K+ line corpus of notes takes 6 minutes
  • Indexing 4000+ images takes about 15 minutes and more than 8 GB of RAM
  • Once https://github.com/debanjum/khoj/issues/36 is implemented, it should only take this long on first run
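
One way a later run could avoid repeating this work is to cache embeddings by content hash, so that only new or changed entries are re-embedded. The sketch below illustrates that general idea under stated assumptions. It does not describe the approach proposed in the linked issue, and `fake_embed` is a hypothetical stand-in for the ML model.

```python
import hashlib


def update_index(entries, cache, embed):
    """Embed only entries not already in the cache.

    Returns the index for the current entries ({content hash: embedding})
    and the number of embeddings computed in this run.
    """
    index, computed = {}, 0
    for text in entries:
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if key not in cache:
            cache[key] = embed(text)  # the expensive step
            computed += 1
        index[key] = cache[key]
    return index, computed


def fake_embed(text):
    # Hypothetical stand-in for an embedding model
    return [float(len(text))]


cache = {}
notes = ["first note", "second note"]
_, first_run = update_index(notes, cache, fake_embed)
_, second_run = update_index(notes + ["third note"], cache, fake_embed)
print(first_run, second_run)  # → 2 1
```

In this sketch the first run embeds everything, while the second run embeds only the newly added note.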

Miscellaneous

  • Testing done on a Mac M1 and a >100K line corpus of notes
  • Search, indexing on a GPU has not been tested yet

Development

Setup

Using Pip

1. Install

git clone https://github.com/debanjum/khoj && cd khoj
python3 -m venv .venv && source .venv/bin/activate
pip install -e .

2. Configure

  • Set input-files or input-filter in each relevant content-type section of khoj_sample.yml
    • Set the input-directories field in the image content-type section
  • Delete the content-type and processor sub-sections that are irrelevant to your use case

3. Run

khoj -c=config/khoj_sample.yml -vv

Loads the ML model, generates embeddings and exposes an API to query the notes, images, transactions etc. specified in the config YAML

4. Upgrade

# To upgrade to the latest stable release
# Maps to the latest tagged version of khoj on the master branch
pip install --upgrade khoj-assistant

# To upgrade to the latest pre-release
# Maps to the latest commit on the master branch
pip install --upgrade --pre khoj-assistant

# To upgrade to a specific development release
# Useful to test or review a PR
# Note: khoj-assistant is published to Test PyPI on creating a PR
pip install -i https://test.pypi.org/simple/ khoj-assistant==0.1.5.dev57166025766

Using Docker

1. Clone

git clone https://github.com/debanjum/khoj && cd khoj

2. Configure

  • Required: Update docker-compose.yml to mount your images, (org-mode or markdown) notes and beancount directories
  • Optional: Edit application configuration in khoj_docker.yml

3. Run

docker-compose up -d

Note: The first run will take time. Let it run; it is most likely not hung, just generating embeddings

4. Upgrade

docker-compose build --pull

Using Conda

1. Install Dependencies

  • Install Conda [Required]
  • Install Exiftool [Optional]

    sudo apt -y install libimage-exiftool-perl

2. Install Khoj

git clone https://github.com/debanjum/khoj && cd khoj
conda env create -f config/environment.yml
conda activate khoj

3. Configure

  • Set input-files or input-filter in each relevant content-type section of khoj_sample.yml
    • Set the input-directories field in the image content-type section
  • Delete the content-type and processor sub-sections that are irrelevant to your use case

4. Run

python3 -m src.main config/khoj_sample.yml -vv

Loads the ML model, generates embeddings and exposes an API to query the notes, images, transactions etc. specified in the config YAML

5. Upgrade

cd khoj
git pull origin master
conda deactivate
conda env update -f config/environment.yml
conda activate khoj

Test

pytest

Credits
