Horsebox
You Know, for local Search.
A versatile and autonomous command line tool for search.
Abstract
Everybody has faced, at least once, a situation where some information had to be searched for, whether in a project folder or any other place that contains information of interest.
Horsebox is a tool whose purpose is to offer such search feature (thanks to the full-text search engine library Tantivy), without any external dependencies, from the command line.
While it was built with a developer persona in mind, it can be used by anybody who is not afraid of typing a few characters in a terminal (samples are here to guide you).
Disclaimer: this tool was tested on Linux (Ubuntu, Debian) and macOS only.
TL;DR
For the ones who want to go straight to the point.
# Install uv
curl -LsSf https://astral.sh/uv/install.sh | sh
source $HOME/.local/bin/env
# Install Horsebox
uv tool install horsebox
# Alternative: install from the repository
# For the impatient users who want the latest features before they are published on PyPI
uv tool install git+https://github.com/michelcaradec/horsebox
You are ready to search.
Requirements
All the commands described in this project rely on the Python package and project manager uv.
- Install uv:
  curl -LsSf https://astral.sh/uv/install.sh | sh
- Or update it:
  uv self update
Tool Installation
For the ones who just want to use the tool.
- Install the tool:
  - From PyPI:
    uv tool install horsebox
  - From the online GitHub project:
    uv tool install git+https://github.com/michelcaradec/horsebox
- Use the tool.
Project Setup
For the ones who want to develop on the project.
Python Environment
- Clone the project:
  git clone https://github.com/michelcaradec/horsebox.git
  cd horsebox
- Create a Python virtual environment:
  uv sync
  # Install the development requirements
  uv sync --extra dev
  # Activate the environment
  source .venv/bin/activate
- Check the tool execution:
  uv run horsebox
  Alternate commands:
  - uv run hb
  - uv run ./src/horsebox/main.py
  - python ./src/horsebox/main.py
- The tool can also be installed from the local project with the command:
  uv tool install --editable .
- Use the tool.
Pre-Commit Setup
- Install the git hook scripts:
  pre-commit install
- Update the hooks to the latest version automatically:
  pre-commit autoupdate
Pre-Commit Tips
- Manually run against all the files:
  pre-commit run --all-files --show-diff-on-failure
- Bypass pre-commit when committing:
  git commit --no-verify
- Uninstall the git hook scripts:
  pre-commit uninstall
Usage
Naming Conventions
The following terms are used:
- Datasource: the place where the information will be collected from. It can be a folder, a web page, an RSS feed, etc.
- Container: the "box" containing the information. It can be a file, a web page, an RSS article, etc.
- Content: the information contained in a container. It is mostly text, but can also be a date of last update for a file.
- Collector: a working unit in charge of gathering information and converting it into searchable form.
Getting Help
To list the available commands:
hb --help
To get help for a given command (here search):
hb search --help
Rendering
For any command, the option --format specifies the output format:
- txt: text mode (default).
- json: JSON. The shortcut option --json can also be used.
Searching
The query string syntax, specified with the option --query, is the one supported by the Tantivy's query parser.
Example: search in text files (with extension .txt) under the folder demo.
hb search --from ./demo/ --pattern "*.txt" --query "better" --highlight
Options used:
- --from: folder to (recursively) index.
- --pattern: files to index.
  [!IMPORTANT] The pattern must be enclosed in quotes to prevent wildcard expansion.
- --query: search query.
- --highlight: shows the places where the result was found in the content of the files.
One result is returned, as there is only one document (i.e. container) in the index.
A different collector can be used to index line by line:
hb search --from ./demo/ --pattern "*.txt" --using fileline --query "better" --highlight --limit 5
Options used:
- --using: collector to use for indexing.
- --limit: returns a maximum number of results (default is 10).
The option --count can be added to show the total number of results found:
hb search --from ./demo/ --pattern "*.txt" --using fileline --query "better" --count
See the section samples for advanced usage.
Building An Index
Example: build an index .index-demo from the text files (with extension .txt) under the folder demo.
hb build --from ./demo/ --pattern "*.txt" --index ./.index-demo
Options used:
- --from: folder to (recursively) index.
- --pattern: files to index.
  [!IMPORTANT] The pattern must be enclosed in quotes to prevent wildcard expansion.
- --index: location where to persist the index.
By default, the collector filecontent is used.
An alternate collector can be specified with the option --using.
The option --dry-run can be used to show the items to be indexed, without creating the index.
The built index can be searched:
hb search --index ./.index-demo --query "better" --highlight
Searching on a persisted index will trigger a warning if the age of the index (i.e. the time elapsed since it was built) goes over a given threshold (which can be configured).
The index can be refreshed to contain the most up-to-date data.
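The freshness check described above can be sketched in a few lines of Python. This is an illustration only: the function name and the way the index is inspected on disk are assumptions, and the 3600-second default mirrors the HB_INDEX_EXPIRATION setting described in the Configuration section.

```python
import os
import time


def is_index_stale(index_dir, max_age_seconds=3600.0):
    """Return True if the index was last written longer ago than the threshold.

    Assumes a non-empty index directory; uses the most recent modification
    time of any file it contains as the index build time.
    """
    newest = max(
        os.path.getmtime(os.path.join(index_dir, name))
        for name in os.listdir(index_dir)
    )
    return (time.time() - newest) > max_age_seconds
```

A freshly built index passes the check; lowering the threshold (or waiting long enough) makes the same index report as stale.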
Refreshing An Index
A built index can be refreshed to contain the most up-to-date data.
Example: refresh the index .index-demo previously built.
hb refresh --index ./.index-demo
There are cases where an index can't be refreshed:
- The index was built with a version prior to 0.4.0.
- The index data source was provided by pipe (see the section Collectors Usage Matrix).
Inspecting An Index
To get technical information on an existing index:
hb inspect --index ./.index-demo
To get the most frequent keywords (option --top):
hb search --index ./.index-demo --top
Analyzing Some Text
[!NOTE] The version 0.7.0 introduced a new option --analyzer, which replaces the legacy ones (--tokenizer, --tokenizer-params, --filter and --filter-params). Even though the use of this new option is strongly recommended, the legacy options are still available with the command analyze.
The command analyze is used to play with the tokenizers and filters supported by Tantivy to index documents.
To tokenize a text:
hb analyze \
--text "Tantivy is a full-text search engine library inspired by Apache Lucene and written in Rust." \
--tokenizer whitespace
To filter a text:
hb analyze \
--text "Tantivy is a full-text search engine library inspired by Apache Lucene and written in Rust." \
--filter lowercase
Multiple examples can be found in the script usage.sh.
Concepts
Horsebox has been designed around a few concepts; understanding them will help in choosing the right usage strategy.
Collectors
A collector is in charge of gathering information from a given datasource, and returning documents to index.
It acts as a level of abstraction, which returns documents to be ingested.
Horsebox supports different types of collectors:
| Collector | Description |
|---|---|
| filename | One document per file, containing the name of the file only. |
| filecontent | One document per file, with the content of the file (default). |
| fileline | One document per line and per file. |
| rss | RSS feed, one document per article. |
| html | Collect the content of an HTML page. |
| raw | Collect ready-to-index JSON documents. |
| pdf | Collect the content of a PDF document. |
| guess | Used to identify the best collector to use. |
The collector to use is specified with the option --using.
The default collector is filecontent.
See the script usage.sh for sample commands.
Raw Collector
The collector raw can be used to collect ready-to-index JSON documents.
Each document must have the following fields [^4]:
- name (text): name of the container.
- type (text): type of the container.
- content (text): content of the container.
- path (text): full path to the content.
- size (integer): size of the content.
- date (text): date-time of the content (formatted as YYYY-mm-dd H:M:S, for example 2025-03-14 12:34:56).
The JSON file can contain either an array of JSON objects (default), or one JSON object per line (JSON Lines format).
The JSON Lines format is automatically detected from the file extension (.jsonl or .ndjson).
The option --jsonl can be used to force the detection (this is for example required when the data source is provided by pipe).
Some examples can be found with the files raw.json (array of objects) and raw.jsonl (JSON Lines).
[^4]: Run the command hb schema for a full description.
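As a sketch, documents for the raw collector can be produced with nothing but the standard library. The helper make_raw_document is hypothetical; the field names and the date format follow the list above.

```python
import json
from datetime import datetime


def make_raw_document(name, content, path):
    """Build one ready-to-index document with the fields listed above."""
    return {
        "name": name,
        "type": "note",  # free-text type of the container
        "content": content,
        "path": path,
        "size": len(content),
        # Formatted as YYYY-mm-dd H:M:S, per the field description above.
        "date": datetime(2025, 3, 14, 12, 34, 56).strftime("%Y-%m-%d %H:%M:%S"),
    }


# One JSON object per line (JSON Lines format).
docs = [make_raw_document("demo", "Tantivy is a full-text search engine library.", "/tmp/demo.txt")]
jsonl = "\n".join(json.dumps(doc) for doc in docs)
```

Such output could then be piped to the tool with --from - --using raw --jsonl, as described in the Collectors Usage Matrix.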
Guess Collector
Available starting with version 0.5.0.
The collector guess can be used to identify the best collector to use.
The detection is done on a best-effort basis from the options --from and --pattern.
An error will be returned if no collector could be guessed.
The collector guess is used by default, meaning that the option --using can be skipped.
Examples:
hb search --from "https://planetpython.org/rss20.xml" --query "some text" --using rss
# Can be simplified as (guess from the https scheme and the extension .xml)
hb search --from "https://planetpython.org/rss20.xml" --query "some text"
hb search --from ./raw.json --query "some text" --using raw
# Can be simplified as (guess from the file extension .json)
hb search --from ./raw.json --query "some text"
hb search --from ./raw.jsonl --query "some text" --using raw --jsonl
# Can be simplified as (guess from the file extension .jsonl)
hb search --from ./raw.jsonl --query "some text"
This feature is mainly for command line usage, to help reduce the number of keystrokes.
When used in a script, it is advised to explicitly set the required collector with the option --using.
Collectors Usage Matrix
The following table shows the options supported by each collector.
| Collector | Multi-Sources Mode | Single Source Mode | Pipe Support |
|---|---|---|---|
| filename | --from $folder --pattern *.xxx | - | - |
| filecontent | --from $folder --pattern *.xxx | - | --from - --using filecontent |
| fileline | --from $folder --pattern *.xxx | - | --from - --using fileline |
| rss | - | --from $feed | - |
| html | - | --from $page | - |
| raw | - | --from $json | --from - --using raw |
| pdf | --from $folder --pattern *.pdf | --from $file.pdf | - |
-: not supported.
These options are also used by the guess collector in its detection.
Collectors Simplified Patterns
Available starting with version 0.8.0.
The file system collectors use the combined options --from and --pattern to specify the folder to (recursively) scan, and the files to index.
For example, the options --from ./demo/ --pattern "*.txt" will index the files with the extension .txt located under the folder ./demo.
While this syntax makes a clear separation between the datasource and the containers, it can be long to type, especially for standard patterns.
The list of arguments can be simplified by combining both options.
Examples:
- --from ./demo/ --pattern "*.txt" can be passed as --from "./demo/*.txt".
- --from . --pattern "*.pdf" can be passed as --from "*.pdf".
[!IMPORTANT] The pattern must be enclosed in quotes to prevent wildcard expansion.
This new syntax still allows the use of the option --pattern (for example, --from "*.txt" --pattern "*.pdf" will index all the files with the extension .txt or .pdf from the current folder).
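The combined syntax amounts to splitting off the last path component when it contains a wildcard. A minimal Python sketch of the idea (not Horsebox's actual parsing code):

```python
import os


def split_source(source):
    """Split a combined --from value into (folder, pattern).

    A trailing component containing a wildcard character is treated as the
    pattern; otherwise the whole value is a folder. This mirrors the
    simplified syntax described above, not Horsebox's exact rules.
    """
    folder, leaf = os.path.split(source)
    if any(ch in leaf for ch in "*?["):
        return (folder or ".", leaf)
    return (source, None)
```

For instance, "./demo/*.txt" splits into the folder "./demo" and the pattern "*.txt", while a plain "./demo" is kept as a folder with no pattern.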
Index
The index is the place where the collected information lies. It is required to allow the search.
An index is built with the help of Tantivy (a full-text search engine library), and can be either stored in memory or persisted on disk (see the section strategies).
Strategies
Horsebox can be used in different ways to achieve the goal of searching (and hopefully finding) some information.
- One-step search:
  Index and search, with no index retention.
  This fits an unstable source of information, with frequent changes.
  hb search --from ./demo/ --pattern "*.txt" --query "better" --highlight
- Two-steps search:
  Build and persist an index, then search in the existing index.
  This fits a stable and voluminous (i.e. long to index) source of information.
  Build the index once:
  hb build --from ./demo/ --pattern "*.txt" --index ./.index-demo
  Then search it (multiple times):
  hb search --index ./.index-demo --query "better" --highlight
- All-in-one search:
  Like a two-steps search, but in one step.
  For the ones who want to do everything in a single command.
  hb search --from ./demo/ --pattern "*.txt" --index ./.index-demo --query "better" --highlight
  The use of the options --from and --index with the command search will build and persist an index, which will be immediately searched, and will also be available for future searches.
Annexes
Project Bootstrap
The project was created with the command:
# Will create a directory `horsebox`
uv init --app --package --python 3.10 horsebox
Unit Tests
The Python module doctest has been used to write some unit tests:
python -m doctest -v ./src/**/*.py
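As a reminder of how such tests look, here is a minimal, hypothetical function with a doctest (not taken from the project's sources):

```python
def tokenize(text):
    """Split a text on whitespace, the way a whitespace tokenizer would.

    >>> tokenize("full-text search")
    ['full-text', 'search']
    """
    return text.split()


if __name__ == "__main__":
    # Running the module directly executes the docstring examples.
    import doctest
    doctest.testmod()
```

The command above simply runs every such docstring example found under ./src and reports any mismatch between the expected and actual output.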
Manual Testing In Docker
Horsebox can be installed in a fresh environment to demonstrate its straightforward setup:
# From the project
docker run --interactive --tty --name horsebox --volume=$(pwd):/home/project --rm debian:stable /bin/bash
# Alternative: Docker image with OhMyZsh (for colors)
docker run --interactive --tty --name horsebox --volume=$(pwd):/home/project --rm ohmyzsh/ohmyzsh:main
# Install a few dependencies
source /home/project/demo/docker-setup.sh
# Install Horsebox
uv tool install .
Samples
The script usage.sh contains multiple sample commands:
bash ./demo/usage.sh
Advanced Searches
The query string syntax conforms to Tantivy's query parser.
- Search on multiple datasources:
  Multiple datasources can be collected to build/search an index by repeating the option --from.
  hb search \
    --from "https://www.blog.pythonlibrary.org/feed/" \
    --from "https://planetpython.org/rss20.xml" \
    --from "https://realpython.com/atom.xml?format=xml" \
    --using rss --query "duckdb" --highlight
  Source: Top 60 Python RSS Feeds.
- Search on date:
  A date must be formatted using the RFC3339 standard.
  Example: 2025-01-01T10:00:00.00Z.
  The field date must be specified, and the date must be enclosed in single quotes:
  hb search --from ./demo/raw.json --using raw --query "date:'2025-01-01T10:00:00.00Z'"
- Search on range of dates:
  Inclusive boundaries are specified with square brackets ([]):
  hb search --from ./demo/raw.json --using raw --query "date:[2025-01-01T10:00:00.00Z TO 2025-01-04T10:00:00.00Z]"
  Exclusive boundaries are specified with curly brackets ({}):
  hb search --from ./demo/raw.json --using raw --query "date:{2025-01-01T10:00:00.00Z TO 2025-01-04T10:00:00.00Z}"
  Inclusive and exclusive boundaries can be mixed:
  hb search --from ./demo/raw.json --using raw --query "date:[2025-01-01T10:00:00.00Z TO 2025-01-04T10:00:00.00Z}"
- Fuzzy search:
  Fuzzy search is not supported by the Tantivy query parser [^6].
  Horsebox comes with a simple implementation, which supports the expression of a fuzzy search on a single word.
  Example: the search engne~ will find the word "engine", as it differs by 1 change according to the Levenshtein distance measure.
  The distance can be set after the marker ~, with a maximum of 2: engne~1, engne~2.
  hb search --from ./demo/raw.json --using raw --query "engne~1"
  [!IMPORTANT] The highlight (option --highlight) will not work [^5].
- Proximity search:
  The two words to search are enclosed in single quotes, followed by the maximum distance.
  hb search --from ./demo/raw.json --using raw --query "'engine inspired'~1" --highlight
  This will find all documents where the words "engine" and "inspired" are separated by a maximum of 1 word.
- Query explanation:
  The result of a query can be explained with the help of the option --explain.
  hb search --from "./demo/*.txt" --using fileline --query "better" --explain --json --limit 2
  For each document found, a field explain will be returned, with details on why it was selected [^11].
- Sort the result:
  The result of a query can be ordered by a single field with the help of the option --sort.
  # Ascending order
  hb search --from "./demo/size/*.txt" --query "file" --sort "+size"
  # Descending order
  hb search --from "./demo/size/*.txt" --query "file" --sort "-size"
  hb search --from "./demo/size/*.txt" --query "file" --sort "size"
  The field prefix + is used for ascending order, - for descending order (used by default if the prefix is missing).
  [!IMPORTANT] This option was introduced with the version 0.10.0. It requires an existing index to be refreshed to make it work.
  Only the fields name, type, content, size and date can be used.
[^5]: See https://github.com/quickwit-oss/tantivy/issues/2576.
[^6]: Even though Tantivy implements it with FuzzyTermQuery.
[^11]: See https://docs.rs/tantivy/latest/tantivy/query/struct.Explanation.html.
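For reference, the edit distance behind the ~ marker can be computed with the classic dynamic-programming algorithm. This is a sketch of the measure, not Horsebox's own implementation:

```python
def levenshtein(a, b):
    """Compute the Levenshtein edit distance between two words."""
    # Classic dynamic-programming formulation, keeping one row at a time.
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                # deletion
                current[j - 1] + 1,             # insertion
                previous[j - 1] + (ca != cb),   # substitution
            ))
        previous = current
    return previous[-1]
```

With this measure, "engne" is at distance 1 from "engine" (one inserted letter), which is why the query engne~1 matches it.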
Using A Custom Analyzer
Available starting with version 0.7.0.
By default, the content of a container is indexed in the field content using the default text analyzer, which splits the text on every white space and punctuation [^8], removes words (a.k.a. tokens) that are longer than 40 characters [^9], and lowercases the text [^10].
While this text analyzer fits most of the cases, it may not be suitable for more specific content such as code.
The option --analyzer can be used with the commands build and search to apply a custom tokenizer and filters to the content to be indexed.
The definition of the custom analyzer is described in a JSON file.
The analyzed content will be indexed to an extra field custom.
To build an index .index-analyzer with a custom analyzer analyzer-python.json:
hb build \
--index .index-analyzer \
--from ./demo --pattern "*.py" \
--using fileline \
--analyzer ./demo/analyzer-python.json
A full set of examples can be found in the script usage.sh.
Custom Analyzer Definition
The custom analyzer definition is described in a JSON file.
It is composed of two parts:
- tokenizer: the tokenizer to use to split the content. There must be one and only one tokenizer.
- filters: the filters to use to transform and select the tokenized content. There can be zero or more filters.
{
"tokenizer": {
"$tokenize_type": {...}
},
"filters": [
{
"$filter_type": {...}
},
{
"$filter_type": {...}
}
]
}
Each object $tokenize_type and $filter_type may contain extra configuration fields.
The file analyzer-schema.json is a JSON Schema which can be used to validate any custom analyzer definition.
The site JSON Editor Online offers a playground to test it from your browser.
The Python library jsonschema provides an implementation of JSON Schema validation.
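As an illustration, a definition following the skeleton above can be generated with the standard library. The tokenizer and filter names (regex, lowercase) come from this document, but the exact configuration keys, such as pattern, are assumptions; validate the result against analyzer-schema.json:

```python
import json

# Hypothetical analyzer definition: a regex tokenizer splitting the content
# into identifier-like tokens, followed by a lowercase filter.
analyzer = {
    "tokenizer": {
        "regex": {"pattern": "[A-Za-z_][A-Za-z0-9_]*"},
    },
    "filters": [
        {"lowercase": {}},
    ],
}

with open("my-analyzer.json", "w") as fh:
    json.dump(analyzer, fh, indent=2)
```

The resulting file could then be passed to the commands build and search through the option --analyzer.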
Custom Analyzer Limitations
- When a custom analyzer is defined, the highlight is done on the field custom.
- The tokenizer regex uses the pattern syntax supported by the Regex implementation.
- The option --top is not applied on the field custom, as it requires the fast mode for aggregation, which is not compatible with the tokenizer regex.
[^8]: Using the tokenizer simple.
[^9]: Using the filter remove_long.
[^10]: Using the filter lowercase.
Configuration
Horsebox can be configured through environment variables:
| Setting | Description | Default Value |
|---|---|---|
| HB_INDEX_BATCH_SIZE | Batch size when indexing. | 1000 |
| HB_HIGHLIGHT_MAX_CHARS | Maximum number of characters to show for highlights. | 200 |
| HB_PARSER_MAX_LINE | Maximum size of a line in a container (unlimited if null). | |
| HB_PARSER_MAX_CONTENT | Maximum size of a container (unlimited if null). | |
| HB_RENDER_MAX_CONTENT | Maximum size of a document content to render (unlimited if null). | |
| HB_INDEX_EXPIRATION | Index freshness threshold (in seconds). | 3600 |
| HB_CUSTOM_STOPWORDS | Custom list of stop-words (separated by a comma). | |
| HB_STRING_NORMALIZE | Normalize strings [^7] when reading files (0=disabled, other value=enabled). | 1 |
| HB_TOP_MIN_CHARS | Minimum number of characters of a top keyword. | 1 |
To get help on configuration:
hb config
The default and current values are displayed.
[^7]: The normalization of a string consists of replacing accented characters with their non-accented equivalents, and converting Unicode-escaped characters. This is a CPU-intensive process, which may not be required for some datasources.
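The accent-stripping part of this normalization can be sketched with Python's unicodedata module (an illustration; Horsebox's actual implementation may differ):

```python
import unicodedata


def normalize(text):
    """Replace accented characters by their non-accented equivalents."""
    # Decompose each character into a base letter plus combining marks,
    # then drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))
```

For example, "café" normalizes to "cafe", which is why this step can matter when searching accented datasources with non-accented queries.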
VSCode Integration
If you use Visual Studio Code, you can integrate Horsebox using tasks.
The file tasks.json provides some sample tasks to index and search Markdown files in the current project.
Where Does This Name Come From?
I had some requirements to find a name:
- Short and easy to remember.
- Preferably a compound one, so it could be shortened at the command line to the first letters of each part.
- Connected to Tantivy, whose logo is a rider on a horse.
I then remembered the nickname of a very good friend met during my studies in Cork, Ireland: "Horsebox".
That was it: the name will be "Horsebox", with its easy-to-type shortcut "hb".
Project details
Release history
Download files
Source Distribution
Built Distribution
File details
Details for the file horsebox-0.10.0.tar.gz.
File metadata
- Download URL: horsebox-0.10.0.tar.gz
- Upload date:
- Size: 44.5 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 5ebeaa72fad63b502685802508f34af48b76606cb57f3e80cb2ca6a3e3e97a78 |
| MD5 | 1ce5b431d2bdd6cec497633c99bf9006 |
| BLAKE2b-256 | 4561783f5cfd232048b230e024ca00db71d6c6ea23a842ac881e2db9bb714a5b |
Provenance
The following attestation bundles were made for horsebox-0.10.0.tar.gz:
Publisher: python-publish.yml on michelcaradec/horsebox
Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: horsebox-0.10.0.tar.gz
- Subject digest: 5ebeaa72fad63b502685802508f34af48b76606cb57f3e80cb2ca6a3e3e97a78
- Sigstore transparency entry: 557061708
- Sigstore integration time:
- Permalink: michelcaradec/horsebox@ba7442f4f6a94c942092e430d5c595dd5609b489
- Branch / Tag: refs/tags/v0.10.0
- Owner: https://github.com/michelcaradec
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: python-publish.yml@ba7442f4f6a94c942092e430d5c595dd5609b489
- Trigger Event: release
File details
Details for the file horsebox-0.10.0-py3-none-any.whl.
File metadata
- Download URL: horsebox-0.10.0-py3-none-any.whl
- Upload date:
- Size: 58.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 6f1d3957af6ed3c4eb631b5df9b69d891df73163c3b17a95cf43f092d9e9ef29 |
| MD5 | c34bbd74c3e475b72b796d348eb7dccc |
| BLAKE2b-256 | 68cb711a6906d9508c0cb7b36c1b490a4d13c0d525cc277dab0ba795f2fabd01 |
Provenance
The following attestation bundles were made for horsebox-0.10.0-py3-none-any.whl:
Publisher: python-publish.yml on michelcaradec/horsebox
Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: horsebox-0.10.0-py3-none-any.whl
- Subject digest: 6f1d3957af6ed3c4eb631b5df9b69d891df73163c3b17a95cf43f092d9e9ef29
- Sigstore transparency entry: 557061733
- Sigstore integration time:
- Permalink: michelcaradec/horsebox@ba7442f4f6a94c942092e430d5c595dd5609b489
- Branch / Tag: refs/tags/v0.10.0
- Owner: https://github.com/michelcaradec
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: python-publish.yml@ba7442f4f6a94c942092e430d5c595dd5609b489
- Trigger Event: release