Khoj 🦅
A natural language search engine for your personal notes, transactions and images
Table of Contents
- Features
- Demos
- Architecture
- Setup
- Use
- Upgrade
- Troubleshoot
- Advanced Usage
- Miscellaneous
- Performance
- Development
- Credits
Features
- Natural: Advanced natural language understanding using Transformer based ML Models
- Local: Your personal data stays local. All search, indexing is done on your machine*
- Incremental: Incremental search for a fast, search-as-you-type experience
- Pluggable: Modular architecture makes it easy to plug in new data sources, frontends and ML models
- Multiple Sources: Search your Org-mode and Markdown notes, Beancount transactions and Photos
- Multiple Interfaces: Search using a Web Browser, Emacs or the API
Demos
Khoj in Obsidian
https://user-images.githubusercontent.com/6413477/210486007-36ee3407-e6aa-4185-8a26-b0bfc0a4344f.mp4
Description
- Install Khoj via pip and start Khoj backend in non-gui mode
- Install Khoj plugin via Community Plugins settings pane on Obsidian app
- Check the new Khoj plugin settings
- Let Khoj backend index the markdown files in the current Vault
- Open Khoj plugin on Obsidian via Search button on Left Pane
- Search "Announce plugin to folks" in the Obsidian Plugin docs
- Jump to the search result
Khoj in Emacs, Browser
https://user-images.githubusercontent.com/6413477/184735169-92c78bf1-d827-4663-9087-a1ea194b8f4b.mp4
Description
- Install Khoj via pip
- Start Khoj app
- Add this readme and khoj.el readme as org-mode for Khoj to index
- Search "Setup editor" on the Web and Emacs. Re-rank the results for better accuracy
- Top result is what we are looking for, the section to Install Khoj.el on Emacs
Analysis
- The results do not have any words used in the query
- Based on the top result it seems the re-ranking model understands that Emacs is an editor?
- The results incrementally update as the query is entered
- The results are re-ranked, for better accuracy, once user hits enter
Interfaces
Architecture
Setup
These are the general setup instructions for Khoj.
Check the Khoj Obsidian Readme to set up Khoj with the Obsidian Plugin. It's simpler as it can skip the configure step below.
1. Install
pip install khoj-assistant
2. Start App
khoj
3. Configure
- Enable content types and point to files to search in the First Run Screen that pops up on app start
- Click Configure and wait. The app will download ML models and index the content for search
Use
Interfaces
- Khoj via Obsidian
- Install the Khoj Obsidian plugin
- Click the Khoj search icon 🔎 on the Ribbon or Search for Khoj: Search in the Command Palette
- Khoj via Emacs
- Khoj via Web
- Open http://localhost:8000/ via desktop interface or directly
- Khoj via API
- See the Khoj FastAPI Swagger Docs, ReDocs
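For scripted access, a client only needs to build an HTTP request against the local server. The sketch below stops at constructing the URL. The /api/search path and the q, t and n parameter names are assumptions that this README does not state, and build_search_url is a hypothetical helper, so confirm the actual routes in the Swagger docs of your running server.

```python
from urllib.parse import urlencode

# Assumption: the server listens on localhost:8000, as for the Web interface above.
BASE_URL = "http://localhost:8000"


def build_search_url(query: str, content_type: str = "markdown", max_results: int = 5) -> str:
    """Build a search URL for a locally running Khoj server.

    The "/api/search" path and the q/t/n parameter names are assumptions;
    check them against the Swagger docs before relying on them.
    """
    params = urlencode({"q": query, "t": content_type, "n": max_results})
    return f"{BASE_URL}/api/search?{params}"


if __name__ == "__main__":
    print(build_search_url("setup editor"))
```

The resulting URL can be opened in a browser or passed to any HTTP client.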
Query Filters
Use structured query syntax to filter the natural language search results
- Word Filter: Get entries that include/exclude a specified term
  - Entries that contain term_to_include: +"term_to_include"
  - Entries that do not contain term_to_exclude: -"term_to_exclude"
- Date Filter: Get entries containing dates in YYYY-MM-DD format from specified date (range)
  - Entries from April 1st 1984: dt:"1984-04-01"
  - Entries after March 31st 1984: dt>="1984-04-01"
  - Entries before April 2nd 1984: dt<="1984-04-01"
- File Filter: Get entries from a specified file
  - Entries from incoming.org file: file:"incoming.org"
- Combined Example
  - what is the meaning of life? file:"1984.org" dt>="1984-01-01" dt<="1985-01-01" -"big" -"brother"
  - Adds all filters to the natural language query. It should return entries
    - from the file 1984.org
    - containing dates from the year 1984
    - excluding words "big" and "brother"
    - that best match the natural language query "what is the meaning of life?"
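To make the filter syntax concrete, here is a minimal sketch of how such a query could be split into its filters and the remaining natural language text. It is not Khoj's own parser: ParsedQuery, parse_query and the three regular expressions are hypothetical names written only from the syntax described above.

```python
import re
from dataclasses import dataclass, field


@dataclass
class ParsedQuery:
    text: str                                    # natural language part left after removing filters
    include: list = field(default_factory=list)  # terms from +"term"
    exclude: list = field(default_factory=list)  # terms from -"term"
    dates: list = field(default_factory=list)    # (operator, "YYYY-MM-DD") pairs from dt filters
    files: list = field(default_factory=list)    # file names from file:"name"


# Hypothetical patterns for the filter syntax described above
WORD_FILTER = re.compile(r'([+-])"([^"]+)"')
DATE_FILTER = re.compile(r'dt(:|>=|<=|>|<)"(\d{4}-\d{2}-\d{2})"')
FILE_FILTER = re.compile(r'file:"([^"]+)"')


def parse_query(raw: str) -> ParsedQuery:
    parsed = ParsedQuery(text="")
    for sign, term in WORD_FILTER.findall(raw):
        (parsed.include if sign == "+" else parsed.exclude).append(term)
    parsed.dates = DATE_FILTER.findall(raw)
    parsed.files = FILE_FILTER.findall(raw)
    # Strip every filter, then collapse the leftover whitespace
    remainder = raw
    for pattern in (WORD_FILTER, DATE_FILTER, FILE_FILTER):
        remainder = pattern.sub("", remainder)
    parsed.text = " ".join(remainder.split())
    return parsed


if __name__ == "__main__":
    print(parse_query('what is the meaning of life? file:"1984.org" dt>="1984-01-01" -"big"'))
```

Separating the filters first means only the remaining text needs to go through the semantic search, while the filters narrow the candidate entries.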
Upgrade
Upgrade Khoj Server
pip install --upgrade khoj-assistant
Upgrade Khoj on Emacs
- Use your Emacs Package Manager to Upgrade
- See khoj.el readme for details
Upgrade Khoj on Obsidian
- Upgrade via the Community plugins tab on the settings pane in the Obsidian app
- See the khoj plugin readme for details
Troubleshoot
- Symptom: Errors out complaining about Tensors mismatch, null etc
  - Mitigation: Disable image search using the desktop GUI
- Symptom: Errors out with "Killed" in error message in Docker
  - Fix: Increase RAM available to Docker Containers in Docker Settings
  - Refer: StackOverflow Solution, Configure Resources on Docker for Mac
- Symptom: pip install khoj-assistant fails while building the tokenizers dependency. Complains about Rust.
  - Fix: Install Rust to build the tokenizers package. For example on Mac run:
    brew install rustup
    rustup-init
    source ~/.cargo/env
  - Refer: Issue with Fix for more details
Advanced Usage
Access Khoj on Mobile
- Set up Khoj on your personal server. This can be any always-on machine, e.g. an old computer, Raspberry Pi(?) etc
- Install Tailscale on your personal server and phone
- Open the Khoj web interface of the server from your phone browser. It should be http://tailscale-url-of-server:8000 or http://name-of-server:8000 if you've set up MagicDNS
- Click the Install/Add to Homescreen button
- Enjoy exploring your notes, transactions and images from your phone!
Miscellaneous
- The beta chat and search API endpoints use OpenAI API
- It is disabled by default
- To use it add your openai-api-key via the app configure screen
- Warning: If you use the above beta APIs, your query and top result(s) will be sent to OpenAI for processing
Performance
Query performance
- Semantic search using the bi-encoder is fairly fast at <50 ms
- Reranking using the cross-encoder is slower at <2s on 15 results. Tweak top_k to trade off speed for accuracy of results
- Filters in query (e.g. by file, word or date) usually add <20ms to query latency
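The two-stage pattern behind these numbers can be sketched as follows: a cheap similarity score ranks every entry, and a costlier pairwise scorer re-orders only the top_k candidates. The embed, cosine, cross_score and search functions are hypothetical stand-ins built on word counts, not the transformer models Khoj uses. They only show why a smaller top_k bounds the re-ranking cost.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Stand-in for a bi-encoder: a bag-of-words vector computed per text."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def cross_score(query: str, entry: str) -> float:
    """Stand-in for a cross-encoder: scores the (query, entry) pair jointly.

    It rewards query words that appear in the entry in the same order.
    """
    words, position, hits = entry.lower().split(), 0, 0
    for term in query.lower().split():
        if term in words[position:]:
            position = words.index(term, position) + 1
            hits += 1
    return hits / max(len(query.split()), 1)


def search(query: str, entries: list, top_k: int = 15, rerank: bool = False) -> list:
    # Stage 1: cheap similarity against every entry (entry vectors could be precomputed)
    q_vec = embed(query)
    ranked = sorted(entries, key=lambda e: cosine(q_vec, embed(e)), reverse=True)[:top_k]
    # Stage 2: the expensive pairwise scorer only sees the top_k candidates
    if rerank:
        ranked = sorted(ranked, key=lambda e: cross_score(query, e), reverse=True)
    return ranked


if __name__ == "__main__":
    notes = ["install khoj", "khoj install guide for emacs", "beancount ledger"]
    print(search("khoj install", notes, top_k=2))
    print(search("khoj install", notes, top_k=2, rerank=True))
```

Because stage 2 runs once per candidate, its cost grows with top_k, while stage 1 stays cheap even over the whole corpus.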
Indexing performance
- Indexing is more strongly impacted by the size of the source data
- Indexing 100K+ line corpus of notes takes about 10 minutes
- Indexing 4000+ images takes about 15 minutes and more than 8 GB of RAM
- Note: It should only take this long on the first run as the index is incrementally updated
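This README does not describe how the incremental update works. One common approach, sketched below under that caveat, is to key each entry by a content hash and embed only entries whose hash is not yet in the index. The entry_key and update_index names are hypothetical, and the embed argument stands in for whatever model produces the embeddings.

```python
import hashlib


def entry_key(entry: str) -> str:
    """Content hash used to detect entries that are already indexed."""
    return hashlib.sha256(entry.encode("utf-8")).hexdigest()


def update_index(index: dict, entries: list, embed) -> int:
    """Bring index (hash -> embedding) in line with entries.

    Returns how many entries had to be embedded on this run.
    """
    current = {entry_key(e): e for e in entries}
    # Drop embeddings for entries that no longer exist
    for stale in set(index) - set(current):
        del index[stale]
    # Embed only the entries that are new or changed
    new_keys = [key for key in current if key not in index]
    for key in new_keys:
        index[key] = embed(current[key])
    return len(new_keys)


if __name__ == "__main__":
    index = {}
    print(update_index(index, ["note one", "note two"], embed=len))                # first run embeds everything
    print(update_index(index, ["note one", "note two", "note three"], embed=len))  # later runs embed only new entries
```

With this scheme the first run pays for every entry, and later runs pay only for what changed.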
Miscellaneous
- Testing done on a Mac M1 and a >100K line corpus of notes
- Search, indexing on a GPU has not been tested yet
Development
Visualize Codebase
Setup
Using Pip
1. Install
git clone https://github.com/debanjum/khoj && cd khoj
python3 -m venv .venv && source .venv/bin/activate
pip install -e .
2. Configure
- Copy the config/khoj_sample.yml to ~/.khoj/khoj.yml
- Set input-files or input-filter in each relevant content-type section of ~/.khoj/khoj.yml
  - Set input-directories field in image content-type section
- Delete content-type and processor sub-section(s) irrelevant for your use-case
3. Run
khoj -vv
Load ML model, generate embeddings and expose API to query notes, images, transactions etc specified in config YAML
4. Upgrade
# To Upgrade To Latest Stable Release
# Maps to the latest tagged version of khoj on master branch
pip install --upgrade khoj-assistant
# To Upgrade To Latest Pre-Release
# Maps to the latest commit on the master branch
pip install --upgrade --pre khoj-assistant
# To Upgrade To Specific Development Release.
# Useful to test, review a PR.
# Note: khoj-assistant is published to Test PyPI on creating a PR
pip install -i https://test.pypi.org/simple/ khoj-assistant==0.1.5.dev57166025766
Using Docker
1. Clone
git clone https://github.com/debanjum/khoj && cd khoj
2. Configure
- Required: Update docker-compose.yml to mount your images, (org-mode or markdown) notes and beancount directories
- Optional: Edit application configuration in khoj_docker.yml
3. Run
docker-compose up -d
Note: The first run will take time. Let it run; it is most likely not hung, just generating embeddings
4. Upgrade
docker-compose build --pull
Using Conda
1. Install Dependencies
2. Install Khoj
git clone https://github.com/debanjum/khoj && cd khoj
conda env create -f config/environment.yml
conda activate khoj
python3 -m pip install pyqt6 # As conda does not support pyqt6 yet
3. Configure
- Copy the config/khoj_sample.yml to ~/.khoj/khoj.yml
- Set input-files or input-filter in each relevant content-type section of ~/.khoj/khoj.yml
  - Set input-directories field in image content-type section
- Delete content-type, processor sub-sections irrelevant for your use-case
4. Run
python3 -m src.main -vv
Load ML model, generate embeddings and expose API to query notes, images, transactions etc specified in config YAML
5. Upgrade
cd khoj
git pull origin master
conda deactivate
conda env update -f config/environment.yml
conda activate khoj
Test
pytest
Credits
- Multi-QA MiniLM Model, All MiniLM Model for Text Search. See SBert Documentation
- OpenAI CLIP Model for Image Search. See SBert Documentation
- Charles Cave for OrgNode Parser
- Org.js to render Org-mode results on the Web interface
- Markdown-it to render Markdown results on the Web interface