
You Know, for local Search.


Horsebox

A versatile and autonomous command line tool for search.


Abstract

Everyone has faced, at least once, a situation where some information had to be searched for, whether in a project folder or any other place containing information of interest.

Horsebox is a tool whose purpose is to offer such a search feature (thanks to the full-text search engine library Tantivy), without any external dependencies, from the command line.

While it was built with a developer persona in mind, it can be used by anybody who is not afraid of typing a few characters in a terminal (samples are here to guide you).

Disclaimer: this tool was tested on Linux (Ubuntu, Debian) and macOS only.

TL;DR

For the ones who want to go straight to the point.

# Install uv
curl -LsSf https://astral.sh/uv/install.sh | sh
source $HOME/.local/bin/env

# Install Horsebox
uv tool install horsebox

# Alternative: install from the repository
# For the impatient users who want the latest features before they are published on PyPI
uv tool install git+https://github.com/michelcaradec/horsebox

You are ready to search.

Requirements

All the commands described in this project rely on the Python package and project manager uv.

  1. Install uv:

    curl -LsSf https://astral.sh/uv/install.sh | sh
    
  2. Or update it:

    uv self update
    

Tool Installation

For the ones who just want to use the tool.

  1. Install the tool:

    • From PyPI:

      uv tool install horsebox
      
    • From the online Github project:

      uv tool install git+https://github.com/michelcaradec/horsebox
      
  2. Use the tool.

Project Setup

For the ones who want to develop on the project.

Python Environment

  1. Clone the project:

    git clone https://github.com/michelcaradec/horsebox.git
    
    cd horsebox
    
  2. Create a Python virtual environment:

    uv sync
    
    # Install the development requirements
    uv sync --extra dev
    
    # Activate the environment
    source .venv/bin/activate
    
  3. Check the tool execution:

    uv run horsebox
    

    Alternate commands:

    • uv run hb.
    • uv run ./src/horsebox/main.py.
    • python ./src/horsebox/main.py.
  4. The tool can also be installed from the local project with the command:

    uv tool install --editable .
    
  5. Use the tool.

Pre-Commit Setup

  1. Install the git hook scripts:

    pre-commit install
    
  2. Update the hooks to the latest version automatically:

    pre-commit autoupdate
    

Pre-Commit Tips

  • Manually run against all the files:

    pre-commit run --all-files --show-diff-on-failure
    
  • Bypass pre-commit when committing:

    git commit --no-verify
    
  • Un-install the git hook scripts:

    pre-commit uninstall
    

Usage

Naming Conventions

The following terms are used:

  • Datasource: the place where the information will be collected from. It can be a folder, a web page, an RSS feed, etc.
  • Container: the "box" containing the information. It can be a file, a web page, an RSS article, etc.
  • Content: the information contained in a container. It is mostly text, but can also be a date of last update for a file.
  • Collector: a working unit in charge of gathering information and converting it into searchable documents.

Getting Help

To list the available commands:

hb --help

To get help for a given command (here search):

hb search --help

Rendering

For any command, the option --format specifies the output format:

  • txt: text mode (default).
  • json: JSON. The shortcut option --json can also be used.

Searching

The query string syntax, specified with the option --query, is the one supported by Tantivy's query parser.

Example: search in text files (with extension .txt) under the folder demo.

hb search --from ./demo/ --pattern "*.txt" --query "better" --highlight

Options used:

  • --from: folder to (recursively) index.
  • --pattern: files to index.

[!IMPORTANT] The pattern must be enclosed in quotes to prevent wildcard expansion.

  • --query: search query.
  • --highlight: shows the places where the result was found in the content of the files.

One result is returned, as there is only one document (i.e. container) in the index.

A different collector can be used to index line by line:

hb search --from ./demo/ --pattern "*.txt" --using fileline --query "better" --highlight --limit 5

Options used:

  • --using: collector to use for indexing.
  • --limit: returns a maximum number of results (default is 10).

The option --count can be added to show the total number of results found:

hb search --from ./demo/ --pattern "*.txt" --using fileline --query "better" --count

See the section samples for advanced usage.

Building An Index

Example: build an index .index-demo from the text files (with extension .txt) under the folder demo.

hb build --from ./demo/ --pattern "*.txt" --index ./.index-demo

Options used:

  • --from: folder to (recursively) index.
  • --pattern: files to index.

[!IMPORTANT] The pattern must be enclosed in quotes to prevent wildcard expansion.

  • --index: location where to persist the index.

By default, the collector filecontent is used.
An alternate collector can be specified with the option --using.
The option --dry-run can be used to show the items to be indexed, without creating the index.

The built index can be searched:

hb search --index ./.index-demo --query "better" --highlight

Searching on a persisted index will trigger a warning if the age of the index (i.e. the time elapsed since it was built) goes over a given threshold (which can be configured).
The index can be refreshed to contain the most up-to-date data.
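
As an illustration, the freshness check can be approximated with a file-timestamp comparison. This is a hypothetical sketch, not Horsebox's actual check; the threshold corresponds to the HB_INDEX_EXPIRATION setting described in the Configuration section (default: 3600 seconds).

```python
import time
from pathlib import Path

def is_index_stale(index_path: str, expiration_secs: int = 3600) -> bool:
    """Return True when the index is older than the freshness threshold."""
    # Use the modification time of the index location as its build time.
    built_at = Path(index_path).stat().st_mtime
    age_secs = time.time() - built_at
    return age_secs > expiration_secs
```

When this returns True, the index should be refreshed (see the next section) before searching it.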

Refreshing An Index

A built index can be refreshed to contain the most up-to-date data.

Example: refresh the index .index-demo previously built.

hb refresh --index ./.index-demo

There are cases where an index can't be refreshed:

  • The index was built with a version prior to 0.4.0.
  • The index data source was provided by pipe (see the section Collectors Usage Matrix).

Inspecting An Index

To get technical information on an existing index:

hb inspect --index ./.index-demo

To get the most frequent keywords (option --top):

hb search --index ./.index-demo --top

Analyzing Some Text

[!NOTE] The version 0.7.0 introduced a new option --analyzer, which replaces the legacy ones (--tokenizer, --tokenizer-params, --filter and --filter-params). Even though the use of this new option is strongly recommended, the legacy options remain available with the command analyze.

The command analyze is used to play with the tokenizers and filters supported by Tantivy to index documents.

To tokenize a text:

hb analyze \
    --text "Tantivy is a full-text search engine library inspired by Apache Lucene and written in Rust." \
    --tokenizer whitespace

To filter a text:

hb analyze \
    --text "Tantivy is a full-text search engine library inspired by Apache Lucene and written in Rust." \
    --filter lowercase

Multiple examples can be found in the script usage.sh.

Concepts

Horsebox is designed around a few concepts.

Understanding them will help in choosing the right usage strategy.

Collectors

A collector is in charge of gathering information from a given datasource and returning documents to index.
It acts as a level of abstraction over the datasource.

Horsebox supports different types of collectors:

| Collector | Description |
| --- | --- |
| filename | One document per file, containing the name of the file only. |
| filecontent | One document per file, with the content of the file (default). |
| fileline | One document per line and per file. |
| rss | RSS feed, one document per article. |
| html | Collect the content of an HTML page. |
| raw | Collect ready-to-index JSON documents. |
| pdf | Collect the content of a PDF document. |
| guess | Used to identify the best collector to use. |

The collector to use is specified with the option --using.
The default collector is filecontent.

See the script usage.sh for sample commands.

Raw Collector

The collector raw can be used to collect ready-to-index JSON documents.

Each document must have the following fields [^4]:

  • name (text): name of the container.
  • type (text): type of the container.
  • content (text): content of the container.
  • path (text): full path to the content.
  • size (integer): size of the content.
  • date (text): date-time of the content (formatted as YYYY-mm-dd H:M:S, for example 2025-03-14 12:34:56).

The JSON file can contain either an array of JSON objects (default), or one JSON object per line (JSON Lines format).
The JSON Lines format is automatically detected from the file extension (.jsonl or .ndjson).
The option --jsonl can be used to force the detection (this is for example required when the data source is provided by pipe).

Some examples can be found with the files raw.json (array of objects) and raw.jsonl (JSON Lines).
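
A minimal sketch of producing such documents in Python (the field values are illustrative; run hb schema for the authoritative description):

```python
import json

# One document matching the fields expected by the raw collector.
docs = [
    {
        "name": "note-1",
        "type": "note",
        "content": "Tantivy is a full-text search engine library.",
        "path": "/tmp/notes/note-1.txt",
        "size": 45,
        "date": "2025-03-14 12:34:56",
    },
]

# Array-of-objects form (e.g. saved as raw.json):
as_array = json.dumps(docs, indent=2)

# JSON Lines form (e.g. saved as raw.jsonl): one object per line.
as_jsonl = "\n".join(json.dumps(doc) for doc in docs)
```

The JSON Lines output can be saved to a .jsonl file and searched with hb search --from ./raw.jsonl --query "engine".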

[^4]: Run the command hb schema for a full description.

Guess Collector

Disclaimer: starting with version 0.5.0.

The collector guess can be used to identify the best collector to use.
The detection is done on a best-effort basis from the options --from and --pattern.
An error will be returned if no collector could be guessed.

The collector guess is used by default, meaning that the option --using can be skipped.

Examples:

hb search --from "https://planetpython.org/rss20.xml" --query "some text" --using rss
# Can be simplified as (guess from the https scheme and the extension .xml)
hb search --from "https://planetpython.org/rss20.xml" --query "some text"
hb search --from ./raw.json --query "some text" --using raw
# Can be simplified as (guess from the file extension .json)
hb search --from ./raw.json --query "some text"
hb search --from ./raw.jsonl --query "some text" --using raw --jsonl
# Can be simplified as (guess from the file extension .jsonl)
hb search --from ./raw.jsonl --query "some text"

This feature is mainly for command line usage, to help reduce the number of keystrokes.
When used in a script, it is advised to explicitly set the required collector with the option --using.

Collectors Usage Matrix

The following table shows the options supported by each collector.

| Collector | Multi-Sources Mode | Single Source Mode | Pipe Support |
| --- | --- | --- | --- |
| filename | --from $folder --pattern *.xxx | - | - |
| filecontent | --from $folder --pattern *.xxx | - | --from - --using filecontent |
| fileline | --from $folder --pattern *.xxx | - | --from - --using fileline |
| rss | - | --from $feed | - |
| html | - | --from $page | - |
| raw | - | --from $json | --from - --using raw |
| pdf | --from $folder --pattern *.pdf | --from $file.pdf | - |

-: not supported.

These options are also used by the guess collector in its detection.

Collectors Simplified Patterns

Disclaimer: starting with version 0.8.0.

The file system collectors use the combined options --from and --pattern to specify the folder to (recursively) scan, and the files to index.

For example, the options --from ./demo/ and --pattern "*.txt" will index the files with the extension .txt located under the folder ./demo.

While this syntax makes a clear separation between the datasource and the containers, it can be long to type, especially for standard patterns.

The list of arguments can be simplified by combining both options.

Examples:

  • --from ./demo/ --pattern "*.txt" can be passed as --from "./demo/*.txt".
  • --from . --pattern "*.pdf" can be passed as --from "*.pdf".

[!IMPORTANT] The pattern must be enclosed in quotes to prevent wildcard expansion.

This new syntax still allows the use of the option --pattern (for example, --from "*.txt" --pattern "*.pdf" will index all the files with the extension .txt or .pdf from the current folder).

Index

The index is the place where the collected information lies. It is required to allow the search.

An index is built with the help of Tantivy (a full-text search engine library), and can be either stored in memory or persisted on disk (see the section strategies).

Strategies

Horsebox can be used in different ways to achieve the goal of searching (and hopefully finding) some information.

  • One-step search:
    Index and search, with no index retention.
    This fits an unstable source of information, with frequent changes.

    hb search --from ./demo/ --pattern "*.txt" --query "better" --highlight
    
  • Two-step search:
    Build and persist an index, then search in the existing index.
    This fits a stable and voluminous (i.e. long to index) source of information.

    Build the index once:

    hb build --from ./demo/ --pattern "*.txt" --index ./.index-demo
    

    Then search it (multiple times):

    hb search --index ./.index-demo --query "better" --highlight
    
  • All-in-one search:
    Like a two-step search, but in one step.
    For the ones who want to do everything in a single command.

    hb search --from ./demo/ --pattern "*.txt" --index ./.index-demo --query "better" --highlight
    

    The use of the options --from and --index with the command search will build and persist an index, which will be immediately searched, and will also be available for future searches.

Annexes

Project Bootstrap

The project was created with the command:

# Will create a directory `horsebox`
uv init --app --package --python 3.10 horsebox

Unit Tests

The Python module doctest has been used to write some unit tests:

python -m doctest -v ./src/**/*.py

Manual Testing In Docker

Horsebox can be installed in a fresh environment to demonstrate its straightforward setup:

# From the project
docker run --interactive --tty --name horsebox --volume=$(pwd):/home/project --rm debian:stable /bin/bash
# Alternative: Docker image with OhMyZsh (for colors)
docker run --interactive --tty --name horsebox --volume=$(pwd):/home/project --rm ohmyzsh/ohmyzsh:main

# Install a few dependencies
source /home/project/demo/docker-setup.sh

# Install Horsebox
uv tool install .

Samples

The script usage.sh contains multiple sample commands:

bash ./demo/usage.sh

Advanced Searches

The query string syntax conforms to Tantivy's query parser.

  • Search on multiple datasources:
    Multiple datasources can be collected to build/search an index by repeating the option --from.

    hb search \
        --from "https://www.blog.pythonlibrary.org/feed/" \
        --from "https://planetpython.org/rss20.xml" \
        --from "https://realpython.com/atom.xml?format=xml" \
        --using rss --query "duckdb" --highlight
    

    Source: Top 60 Python RSS Feeds.

  • Search on date:
    A date must be formatted using the RFC3339 standard.
    Example: 2025-01-01T10:00:00.00Z.

    The field date must be specified, and the date must be enclosed in single quotes:

    hb search --from ./demo/raw.json --using raw --query "date:'2025-01-01T10:00:00.00Z'"
    
  • Search on range of dates:
    Inclusive boundaries are specified with square brackets ([ ]):

    hb search --from ./demo/raw.json --using raw --query "date:[2025-01-01T10:00:00.00Z TO 2025-01-04T10:00:00.00Z]"
    

    Exclusive boundaries are specified with curly brackets ({ }):

    hb search --from ./demo/raw.json --using raw --query "date:{2025-01-01T10:00:00.00Z TO 2025-01-04T10:00:00.00Z}"
    

    Inclusive and exclusive boundaries can be mixed:

    hb search --from ./demo/raw.json --using raw --query "date:[2025-01-01T10:00:00.00Z TO 2025-01-04T10:00:00.00Z}"
    
  • Fuzzy search:
    The fuzzy search is not supported by the Tantivy query parser [^6].
    Horsebox comes with a simple implementation, which supports expressing a fuzzy search on a single word.
    Example: the search engne~ will find the word "engine", as it differs by one change according to the Levenshtein distance.

    The distance can be set after the marker ~, with a maximum of 2: engne~1, engne~2.

    hb search --from ./demo/raw.json --using raw --query "engne~1"
    

[!IMPORTANT] Highlighting (option --highlight) will not work with fuzzy search [^5].
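
For reference, the Levenshtein distance used above can be computed with the classic dynamic-programming algorithm. This is illustrative only, not Horsebox's implementation:

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits to turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            # Deletion, insertion, or substitution.
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]
```

With this measure, "engne" is at distance 1 from "engine" (one missing letter), which is why engne~1 matches it.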

  • Proximity search:
    The two words to search are enclosed in single quotes, followed by the maximum distance.

    hb search --from ./demo/raw.json --using raw --query "'engine inspired'~1" --highlight
    

    Will find all documents where the words "engine" and "inspired" are separated by a maximum of 1 word.

  • Query explanation:
    The result of a query can be explained with the help of the option --explain.

    hb search --from "./demo/*.txt" --using fileline --query "better" --explain --json --limit 2
    

    For each document found, a field explain will be returned, with details on why it was selected [^11].

  • Sort the result:
    The result of a query can be ordered by a single field with the help of the option --sort.

    # Ascending order
    hb search --from "./demo/size/*.txt" --query "file" --sort "+size"
    # Descending order
    hb search --from "./demo/size/*.txt" --query "file" --sort "-size"
    hb search --from "./demo/size/*.txt" --query "file" --sort "size"
    

    The field prefix + selects ascending order, - selects descending order (descending is the default when the prefix is missing).

[!IMPORTANT] This option was introduced with version 0.10.0. An existing index must be refreshed before this option can be used.
Only the fields name, type, content, size and date can be used.
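
The sort prefix convention can be sketched as follows (a hypothetical helper, not the actual implementation):

```python
def parse_sort(value: str) -> tuple[str, bool]:
    """Return (field, descending) for a --sort value."""
    if value.startswith("+"):
        return value[1:], False  # Explicit ascending order.
    if value.startswith("-"):
        return value[1:], True   # Explicit descending order.
    return value, True           # No prefix: descending by default.
```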

[^5]: See https://github.com/quickwit-oss/tantivy/issues/2576.
[^6]: Even though Tantivy implements it with FuzzyTermQuery.
[^11]: See https://docs.rs/tantivy/latest/tantivy/query/struct.Explanation.html.

Using A Custom Analyzer

Disclaimer: starting with version 0.7.0.

By default, the content of a container is indexed in the field content using the default text analyzer, which splits the text on every white space and punctuation character [^8], removes words (a.k.a. tokens) longer than 40 characters [^9], and lowercases the text [^10].
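
This default pipeline can be approximated in a few lines. This is a rough sketch; the actual Tantivy tokenizer and filters may differ in edge cases:

```python
import re

def default_analyze(text: str, max_len: int = 40) -> list[str]:
    """Approximate the default analyzer: split, remove long tokens, lowercase."""
    # Split on any run of non-alphanumeric characters.
    tokens = re.split(r"[^0-9A-Za-z]+", text)
    return [t.lower() for t in tokens if t and len(t) <= max_len]
```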

While this text analyzer fits most of the cases, it may not be suitable for more specific content such as code.

The option --analyzer can be used with the commands build and search to apply a custom tokenizer and filters to the content to be indexed.
The definition of the custom analyzer is described in a JSON file.
The analyzed content will be indexed to an extra field custom.

To build an index .index-analyzer with a custom analyzer analyzer-python.json:

hb build \
    --index .index-analyzer \
    --from ./demo --pattern "*.py" \
    --using fileline \
    --analyzer ./demo/analyzer-python.json

A full set of examples can be found in the script usage.sh.

Custom Analyzer Definition

The custom analyzer definition is described in a JSON file.

It is composed of two parts:

  • tokenizer: the tokenizer to use to split the content. There must be one and only one tokenizer.
  • filters: the filters to use to transform and select the tokenized content. There can be zero or more filters.
{
    "tokenizer": {
        "$tokenize_type": {...}
    },
    "filters": [
        {
            "$filter_type": {...}
        },
        {
            "$filter_type": {...}
        }
    ]
}

Each object $tokenize_type and $filter_type may contain extra configuration fields.

The file analyzer-schema.json is a JSON Schema which can be used to validate any custom analyzer definition.
The site JSON Editor Online proposes a playground to test it from your browser.
The Python library jsonschema proposes an implementation of JSON Schema validation.
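
The structural rules above (one and only one tokenizer, zero or more filters) can also be checked without jsonschema, with a minimal hand-written validation (illustrative sketch only; the JSON Schema file remains the reference):

```python
def check_analyzer(definition: dict) -> list[str]:
    """Return a list of structural errors in a custom analyzer definition."""
    errors = []
    tokenizer = definition.get("tokenizer")
    # There must be one and only one tokenizer.
    if not isinstance(tokenizer, dict) or len(tokenizer) != 1:
        errors.append("there must be one and only one tokenizer")
    # There can be zero or more filters, each a single-key object.
    filters = definition.get("filters", [])
    if not isinstance(filters, list) or not all(
        isinstance(f, dict) and len(f) == 1 for f in filters
    ):
        errors.append("filters must be a list of single-key objects")
    return errors
```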

Custom Analyzer Limitations

  • When a custom analyzer is defined, the highlighting is done on the field custom.
  • The tokenizer regex uses the pattern syntax supported by the Regex implementation.
  • The option --top is not applied on the field custom: aggregation requires the fast field mode, which is not compatible with the tokenizer regex.

[^8]: Using the tokenizer simple.
[^9]: Using the filter remove_long.
[^10]: Using the filter lowercase.

Configuration

Horsebox can be configured through environment variables:

| Setting | Description | Default Value |
| --- | --- | --- |
| HB_INDEX_BATCH_SIZE | Batch size when indexing. | 1000 |
| HB_HIGHLIGHT_MAX_CHARS | Maximum number of characters to show for highlights. | 200 |
| HB_PARSER_MAX_LINE | Maximum size of a line in a container (unlimited if null). | |
| HB_PARSER_MAX_CONTENT | Maximum size of a container (unlimited if null). | |
| HB_RENDER_MAX_CONTENT | Maximum size of a document content to render (unlimited if null). | |
| HB_INDEX_EXPIRATION | Index freshness threshold (in seconds). | 3600 |
| HB_CUSTOM_STOPWORDS | Custom list of stop-words (separated by a comma). | |
| HB_STRING_NORMALIZE | Normalize strings [^7] when reading files (0=disabled, other value=enabled). | 1 |
| HB_TOP_MIN_CHARS | Minimum number of characters of a top keyword. | 1 |

To get help on configuration:

hb config

The default and current values are displayed.

[^7]: The normalization of a string consists of replacing accented characters with their non-accented equivalents, and converting Unicode escaped characters. This is a CPU-intensive process, which may not be required for some datasources.
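
Accent removal of this kind can be sketched with Unicode decomposition. This is an approximation; Horsebox's normalization may cover more cases:

```python
import unicodedata

def normalize(text: str) -> str:
    """Replace accented characters with their non-accented equivalents."""
    # NFKD decomposition separates base characters from combining marks,
    # which are then dropped.
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))
```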

VSCode Integration

If you use Visual Studio Code, you can integrate Horsebox using tasks.

The file tasks.json provides some sample tasks to index and search Markdown files in the current project.

Where Does This Name Come From

I had some requirements to find a name:

  • Short and easy to remember.
  • Preferably a compound one, so it could be shortened at the command line to the first letters of each part.
  • Connected to Tantivy, whose logo is a rider on a horse.

I then remembered the nickname of a very good friend met during my studies in Cork, Ireland: "Horsebox".

That was it: the name will be "Horsebox", with its easy-to-type shortcut "hb".
