Khoj 🦅
A natural language search engine for your personal notes, transactions and images
Table of Contents
- Features
- Demo
- Architecture
- Setup
- Use
- Upgrade
- Troubleshoot
- Miscellaneous
- Development Setup
- Performance
- Credits
Features
- Natural: Advanced natural language understanding using Transformer-based ML models
- Local: Your personal data stays local. All search and indexing is done on your machine*
- Incremental: Incremental search for a fast, search-as-you-type experience
- Pluggable: Modular architecture makes it easy to plug in new data sources, frontends and ML models
- Multiple Sources: Search your Org-mode and Markdown notes, Beancount transactions and Photos
- Multiple Interfaces: Search using a Web Browser, Emacs or the API
Demo
https://user-images.githubusercontent.com/6413477/181664862-31565b0a-0e64-47e1-a79a-599dfc486c74.mp4
Description
- User searches for "Setup editor"
- The demo looks for the most relevant section in this readme and the khoj.el readme
- Top result is what we are looking for, the section to Install Khoj.el on Emacs
Analysis
- The results do not have any words used in the query
- Based on the top result it seems the re-ranking model understands that Emacs is an editor?
- The results incrementally update as the query is entered
- The results are re-ranked, for better accuracy, once user is idle
Architecture
Setup
1. Clone
git clone https://github.com/debanjum/khoj && cd khoj
2. Configure
- Required: Update docker-compose.yml to mount your images, (org-mode or markdown) notes and beancount directories
- Optional: Edit application configuration in khoj_sample.yml
3. Run
docker-compose up -d
Note: The first run will take time. Let it run; it is most likely not hung, just generating embeddings
Use
- Khoj via Web
- Go to http://localhost:8000/ or open index.html in your browser
- Khoj via Emacs
- Khoj via API
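As a rough illustration of the API route, the sketch below queries a running server from Python using only the standard library. The host and port come from the web interface address above; the `/search` path, the `q`, `t` and `n` parameter names, and the JSON response are assumptions made for this example, not details confirmed by this readme. Check your running instance for the actual endpoint before relying on it.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "http://localhost:8000"  # address from the "Khoj via Web" step


def build_search_url(query, content_type="org", count=5, base_url=BASE_URL):
    """Build a search URL.

    Assumption: the "/search" path and the q (query), t (content type)
    and n (result count) parameters are illustrative names only.
    """
    params = urlencode({"q": query, "t": content_type, "n": count})
    return f"{base_url}/search?{params}"


def search(query, **kwargs):
    """Send the query to a running server and decode the (assumed) JSON reply."""
    with urlopen(build_search_url(query, **kwargs)) as response:
        return json.load(response)


if __name__ == "__main__":
    # Only builds and prints the URL; call search() once the server is up.
    print(build_search_url("Setup editor"))
```

Keeping URL construction separate from the network call makes it easy to adjust the path or parameter names once you have confirmed them.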
Upgrade
docker-compose build --pull
Troubleshoot
- Symptom: Errors out with "Killed" in error message
- Fix: Increase RAM available to Docker Containers in Docker Settings
- Refer: StackOverflow Solution, Configure Resources on Docker for Mac
- Symptom: Errors out complaining about Tensors mismatch, null etc
- Mitigation: Delete content-type > image section from khoj_sample.yml
Miscellaneous
- The experimental chat API endpoint uses the OpenAI API
- It is disabled by default
- To use it add your openai-api-key to config.yml
Development Setup
Setup on Local Machine
Using Pip
- Install Dependencies
  - Python3, Pip [Required]
  - Virtualenv [Optional]
  - Install Exiftool [Optional]
sudo apt-get -y install libimage-exiftool-perl
- Install Khoj
virtualenv -p python3 .venv && source .venv/bin/activate # Optional
pip install khoj-assistant
- Configure
  - Configure files/directories to search in content-type section of khoj_sample.yml
  - To run application on test data, update file paths containing /data/ to tests/data/ in khoj_sample.yml
    - Example: replace /data/notes/*.org with tests/data/notes/*.org
- Run
  Load ML model, generate embeddings and expose API to query notes, images, transactions etc specified in config YAML
khoj -c=config/khoj_sample.yml -vv
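The path change described in the Configure step can also be scripted. The following is a minimal sketch, assuming the sample config is handled as plain text; the `use_test_data` helper and the excerpt it runs on are hypothetical and only illustrate the `/data/` to `tests/data/` substitution. The pattern matches `/data/` only at the start of a path, so a path already under `tests/data/` is left alone.

```python
import re

# Match "/data/" only where a path begins, so "tests/data/" is never rewritten twice.
SAMPLE_DATA_PREFIX = re.compile(r"(?<![\w./-])/data/")


def use_test_data(config_text):
    """Return config text with sample paths under /data/ pointed at tests/data/."""
    return SAMPLE_DATA_PREFIX.sub("tests/data/", config_text)


# Hypothetical excerpt: only the path values matter here, not the surrounding key.
excerpt = 'paths: ["/data/notes/*.org", "tests/data/notes/*.md"]'
print(use_test_data(excerpt))
```

Because the substitution is idempotent, running it twice on the same text gives the same result as running it once.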
Using Conda
- Install Dependencies
  - Install Python3 [Required]
  - Install Conda [Required]
  - Install Exiftool [Optional]
sudo apt-get -y install libimage-exiftool-perl
- Install Khoj
git clone https://github.com/debanjum/khoj && cd khoj
conda env create -f config/environment.yml
conda activate khoj
- Configure
  - Configure files/directories to search in content-type section of khoj_sample.yml
  - To run application on test data, update file paths containing /data/ to tests/data/ in khoj_sample.yml
    - Example: replace /data/notes/*.org with tests/data/notes/*.org
- Run
  Load ML model, generate embeddings and expose API to query notes, images, transactions etc specified in config YAML
python3 -m src.main -c=config/khoj_sample.yml -vv
Upgrade On Local Machine
Using Pip
pip install --upgrade khoj-assistant
Using Conda
cd khoj
git pull origin master
conda deactivate
conda env update -f config/environment.yml
conda activate khoj
Run Unit Tests
pytest
Performance
Query performance
- Semantic search using the bi-encoder is fairly fast at <5 ms
- Reranking using the cross-encoder is slower at <2s on 15 results. Tweak top_k to trade off speed for accuracy of results
- Applying explicit filters is very slow currently at ~6s. This is because the filters are rudimentary. Considerable speed-ups can be achieved using indexes etc
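To make the top_k trade-off concrete, here is a toy sketch of the retrieve-then-rerank pattern. It is not Khoj's implementation: `embed` and `pair_score` are simple word-based stand-ins, assumed here in place of the bi-encoder and cross-encoder models, so that the example runs without any ML dependencies. What it shows is the shape of the pipeline: a cheap similarity pass over every entry, then a costlier pairwise pass over only the top_k candidates.

```python
import math
from collections import Counter


def embed(text):
    """Toy stand-in for a bi-encoder: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def pair_score(query, entry):
    """Toy stand-in for a cross-encoder: scores the (query, entry) pair jointly."""
    entry_words = entry.lower().split()
    # Reward query words that appear early in the entry.
    return sum(1.0 / (1 + entry_words.index(w)) for w in query.lower().split() if w in entry_words)


def search(query, entries, entry_vectors, top_k=15):
    """Return entry indices, best first, after retrieving and reranking."""
    # Stage 1: cheap comparison against vectors computed once at indexing time.
    query_vector = embed(query)
    ranked = sorted(range(len(entries)), key=lambda i: cosine(query_vector, entry_vectors[i]), reverse=True)
    candidates = ranked[:top_k]
    # Stage 2: costlier pairwise scoring, run only on the top_k candidates.
    return sorted(candidates, key=lambda i: pair_score(query, entries[i]), reverse=True)


entries = [
    "Install Khoj.el on Emacs",
    "Beancount transactions for groceries",
    "Emacs notes on how to install packages",
]
entry_vectors = [embed(entry) for entry in entries]  # the "index"
results = search("install emacs", entries, entry_vectors, top_k=2)
print([entries[i] for i in results])
```

A smaller top_k means fewer pairwise scoring calls and a faster response, at the risk of dropping a relevant entry before the second stage ever sees it.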
Indexing performance
- Indexing is more strongly impacted by the size of the source data
- Indexing 100K+ line corpus of notes takes 6 minutes
- Indexing 4000+ images takes about 15 minutes and more than 8 GB of RAM
- Once https://github.com/debanjum/khoj/issues/36 is implemented, it should only take this long on first run
Miscellaneous
- Testing done on a Mac M1 and a >100K line corpus of notes
- Search, indexing on a GPU has not been tested yet
Credits
- Multi-QA MiniLM Model, All MiniLM Model for Text Search. See SBert Documentation
- OpenAI CLIP Model for Image Search. See SBert Documentation
- Charles Cave for OrgNode Parser
- Org.js to render Org-mode results on the Web interface
- Markdown-it to render Markdown results on the Web interface
- Sven Marnach for PyExifTool