A personal search engine built on top of SQLite's FTS5.
Housaku (豊作 「ほうさく」)
Housaku is a personal search engine built on top of SQLite's FTS5 that lets you query your documents, books, PDFs, favorite feeds and more all in one place.
Housaku is in early development, so you can expect some incompatible changes and other minor issues when updating. Once version v1.0.0 is reached, my goal is to focus on stability and avoiding breaking changes as much as possible.
Features
- Support for multiple file formats like .txt, .md, .csv, .pdf, .epub, .docx, .xlsx and .pptx.
- Support for RSS/Atom feed parsing and indexing.
- Parallel file processing.
- Concurrent feed processing.
- Web UI.
- Modern TUI with support for theming.
- Easy-to-use CLI.
- Relevant results powered by the BM25 algorithm.
- Support for incremental updates.
Support for file formats like .odt is coming, as well as the possibility of indexing posts from Bluesky feeds and Mastodon.
Stack
- SQLite's FTS5 extension.
- SQLite.
- Starlette.
- aiohttp.
- click.
- feedparser.
- pydantic.
- pymupdf.
- rich.
- textual.
Motivation
The first reason I decided to start working on Housaku was to learn more about the basics of full-text search and how search engines operate under the hood. In fact, if you look at the commit history, you can see that initially, all the parsing, tokenization and TF/IDF calculations were handled "manually" before I opted to use SQLite's FTS5 solution due to performance.
The second and final reason was the large volume of documents I was managing. I have ~5,000 notes in Obsidian, formatted in Markdown, a couple of hundred books in my Calibre library, mainly in .epub, a significant number of PDFs, and PowerPoint presentations from my computer science degree at UNED. I also have a vast collection of RSS feeds that I have subscribed to for a long time. So, I wanted/needed an efficient and easy way to search through all of these documents without having to worry about the specifics of where each of them was located or in what format.
Installation
The recommended way of installing Housaku is by using uv:
uv tool install --python 3.13 housaku
Now you can run:
housaku --help
To upgrade, use:
uv tool upgrade housaku
# Or
uv tool upgrade housaku --reinstall
Using pipx
To install Housaku using pipx, simply run:
pipx install housaku
Just remember that Housaku requires Python >=3.13.
Via pip
You can also install Housaku using pip, but the exact command will depend on how your environment is set up. In this case, the command should look something like this:
python3 -m pip install housaku
Configuration
Before you start using Housaku, the first step is to edit the config.toml file located at your $XDG_CONFIG_HOME/housaku/config.toml. This file is generated automatically the first time you run housaku and will look something like this:
# Welcome! This is the configuration file for Housaku.
# Available themes include:
# - "dracula"
# - "textual-dark"
# - "textual-light"
# - "nord"
# - "gruvbox"
# - "catppuccin-mocha"
# - "textual-ansi"
# - "tokyo-night"
# - "monokai"
# - "flexoki"
# - "catppuccin-latte"
# - "solarized-light"
theme = "dracula"
[files]
# Directories to include for indexing.
# Example: include = ["/home/<user>/documents/notes"]
include = []
# Patterns to exclude from the indexing
# Example: exclude = ["*.tmp", "backup", "*.png"]
exclude = []
[feeds]
# List of RSS/Atom feeds to index
# Example: urls = ["https://example.com/feed", "https://anotherexample.com/rss"]
urls = []
The folder that holds the configuration file as well as the SQLite database is determined by the get_app_dir utility. You can read more about it here.
An easy way to open your config.toml file is to run the following command:
housaku config
Usage
Help
The best way to see which commands are available is to run housaku with the --help flag.
housaku --help
You can also learn more about what a specific command does by running:
housaku [command] --help
# For example:
housaku index --help
Config
The config command simply opens the config.toml file in your default editor.
housaku config
Index
After you have configured the list of directories containing the documents you want to index, as well as the list of feeds from which you want to fetch the posts, you can run:
housaku index
Filtering content
To index only your files, use the following command:
housaku index --include files
To index only your feeds:
housaku index --include feeds
You can specify both options to index files and feeds together, but this is equivalent to simply running the index command without any options.
Parallelism
You can also change the number of threads being used when indexing your files and documents:
housaku index -t 8
My recommendation is to stick with the default number of threads.
At the moment, indexing files is done in parallel using multi-threading, which makes the process faster but also introduces some complications. For example, cancelling the indexing half-way using ctrl+c will cause some threads to exit while others will continue running in the background and then fail.
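The behaviour described above is typical of thread pools in general. The following Python sketch shows the pattern with the standard concurrent.futures module. It is not Housaku's implementation: read_document is a stand-in for a real parser, and index_files and its threads parameter are hypothetical names chosen to echo the -t option.

```python
import tempfile
from concurrent.futures import ThreadPoolExecutor, as_completed
from pathlib import Path


def read_document(path: Path) -> tuple[str, int]:
    """Stand-in for a real parser: read the file and count its words."""
    text = path.read_text(encoding="utf-8", errors="ignore")
    return path.name, len(text.split())


def index_files(paths: list[Path], threads: int = 4) -> dict[str, int]:
    """Process files on a pool of worker threads and collect the results."""
    results: dict[str, int] = {}
    with ThreadPoolExecutor(max_workers=threads) as pool:
        futures = [pool.submit(read_document, p) for p in paths]
        try:
            for future in as_completed(futures):
                name, words = future.result()
                results[name] = words
        except KeyboardInterrupt:
            # Cancel tasks that have not started yet. Tasks that are
            # already running cannot be interrupted and run to completion.
            pool.shutdown(wait=False, cancel_futures=True)
            raise
    return results


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        paths = []
        for i in range(3):
            p = Path(tmp) / f"note{i}.md"
            p.write_text("word " * (i + 1), encoding="utf-8")
            paths.append(p)
        print(index_files(paths, threads=2))
```

The except branch illustrates why an interrupted run is hard to stop cleanly: queued work can be dropped, but threads that are already in the middle of a file keep going.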
Search
The search command
The simplest way to start searching your documents and posts is by using the search command:
housaku search --query "Django AND Postgres"
You can also limit the number of results by using the --limit option which, by default, is set to 10:
housaku search --query "Django AND Postgres" --limit 20
If you don't specify a query using the --query/-q option, you will be prompted to enter one.
You can learn more about the query syntax here.
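To give an idea of what a query such as "Django AND Postgres" does at the FTS5 level, here is a small self-contained Python sketch using the standard sqlite3 module. It is an illustration only: the docs table, its columns and the build_index and search helpers are hypothetical and do not reflect Housaku's actual schema. It also assumes that your Python's SQLite build has FTS5 enabled.

```python
import sqlite3


def build_index(docs: list[tuple[str, str]]) -> sqlite3.Connection:
    """Create an in-memory FTS5 table and load (title, body) pairs into it."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE VIRTUAL TABLE docs USING fts5(title, body)")
    conn.executemany("INSERT INTO docs (title, body) VALUES (?, ?)", docs)
    conn.commit()
    return conn


def search(conn: sqlite3.Connection, query: str, limit: int = 10) -> list[str]:
    """Return matching titles, best match first according to bm25()."""
    rows = conn.execute(
        "SELECT title FROM docs WHERE docs MATCH ? ORDER BY bm25(docs) LIMIT ?",
        (query, limit),
    )
    return [title for (title,) in rows]


conn = build_index(
    [
        ("Deploy notes", "Running Django with Postgres in production"),
        ("ORM tips", "Django querysets and migrations"),
        ("Database tuning", "Postgres indexes and vacuum settings"),
    ]
)
print(search(conn, "Django AND Postgres"))  # → ['Deploy notes']
print(search(conn, "Django OR Postgres", limit=2))
```

bm25() returns lower values for better matches, which is why the results are sorted in ascending order.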
Using the TUI
My favorite and recommended way to search is by using the TUI. To start it, just run:
housaku tui
To exit the TUI, press ctrl + q. To open a search result, press Enter while the result is highlighted.
Using the Web UI
Housaku also has a very simple Web UI that you can access by running:
housaku web
The default port is 4242.
This search method has some limitations. For example, you can't open results that link to your local documents.
vacuum and purge
The vacuum command is used to optimize the SQLite database by reclaiming unused space and improving performance. To run the vacuum command, simply execute:
housaku vacuum
The purge command is used to completely clear all data from the database. This command is useful when you want to reset the database to its initial state.
housaku purge
Be careful with both of these commands, since they directly affect the data held in your database.
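If you are curious about what reclaiming unused space means in SQLite, the following Python sketch shows the effect on a throwaway database. It does not use Housaku's database or code: the docs table and the vacuum_demo helper are hypothetical, and the DELETE statement is only a rough stand-in for clearing data.

```python
import os
import sqlite3
import tempfile


def vacuum_demo(db_path: str) -> tuple[int, int]:
    """Fill a table, delete its rows, and return the file size before and after VACUUM."""
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE docs (body TEXT)")
    conn.executemany(
        "INSERT INTO docs (body) VALUES (?)",
        [("x" * 1000,) for _ in range(2000)],
    )
    conn.commit()

    # Deleting rows marks their pages as free but does not shrink the file.
    conn.execute("DELETE FROM docs")
    conn.commit()
    before = os.path.getsize(db_path)

    # VACUUM rebuilds the database file and drops the free pages.
    conn.execute("VACUUM")
    after = os.path.getsize(db_path)
    conn.close()
    return before, after


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        before, after = vacuum_demo(os.path.join(tmp, "demo.db"))
        print(f"before VACUUM: {before} bytes, after VACUUM: {after} bytes")
```

This is why running vacuum after removing a lot of content can noticeably reduce the size of the database file.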
Contributing
Contributions are welcome! If you have any suggestions, feel free to open an issue.
File details
Details for the file housaku-0.7.12.tar.gz.
File metadata
- Download URL: housaku-0.7.12.tar.gz
- Upload date:
- Size: 1.5 MB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.0.1 CPython/3.12.8
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | a8c70165ca87d606eb4396564bffa81e0674f203a0db1386eee6e7d3ba3bb3b9 |
| MD5 | 5c33fb18e4971714ac9ffd035f5ecf19 |
| BLAKE2b-256 | 473df1f1a6f9fb5a492164a6703d59a40dcab322995db52308bd2874a7902d8f |
Provenance
The following attestation bundles were made for housaku-0.7.12.tar.gz:
Publisher: release.yml on dnlzrgz/housaku
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: housaku-0.7.12.tar.gz
- Subject digest: a8c70165ca87d606eb4396564bffa81e0674f203a0db1386eee6e7d3ba3bb3b9
- Sigstore transparency entry: 154522980
- Sigstore integration time:
- Permalink: dnlzrgz/housaku@94a838fbd84a4811fa02daa848abc4ea1e603d46
- Branch / Tag: refs/tags/v0.7.12
- Owner: https://github.com/dnlzrgz
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@94a838fbd84a4811fa02daa848abc4ea1e603d46
- Trigger Event: release
File details
Details for the file housaku-0.7.12-py3-none-any.whl.
File metadata
- Download URL: housaku-0.7.12-py3-none-any.whl
- Upload date:
- Size: 38.6 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.0.1 CPython/3.12.8
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 4f091c35feccb7c42db5ebc97a1386cb102188c46596e38500750807e8757238 |
| MD5 | b36720e66be7a40c5ad08cbd34f75a7b |
| BLAKE2b-256 | 523344d5cee7c47c7fa2c1cca625c53665ff9137777818b79242cda6ca53b7d1 |
Provenance
The following attestation bundles were made for housaku-0.7.12-py3-none-any.whl:
Publisher: release.yml on dnlzrgz/housaku
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: housaku-0.7.12-py3-none-any.whl
- Subject digest: 4f091c35feccb7c42db5ebc97a1386cb102188c46596e38500750807e8757238
- Sigstore transparency entry: 154522982
- Sigstore integration time:
- Permalink: dnlzrgz/housaku@94a838fbd84a4811fa02daa848abc4ea1e603d46
- Branch / Tag: refs/tags/v0.7.12
- Owner: https://github.com/dnlzrgz
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@94a838fbd84a4811fa02daa848abc4ea1e603d46
- Trigger Event: release