
Fast BM25 full-text search with substring matching, fuzzy search, and regex — powered by Rust


lucivy

BM25 search engine with cross-token fuzzy matching — it finds substrings, handles typos, and matches across word boundaries. Built for code search, technical docs, and as a BM25 complement to vector databases.

Install

Everything is MIT-licensed.

Language        Install
Python          pip install lucivy
Node.js         npm install lucivy
WASM (browser)  npm install @lucivy/wasm
Rust            cargo add ld-lucivy
C++             Static library via CXX bridge (build from source)

Quick start

import lucivy

index = lucivy.Index.create("./my_index", fields=[
    {"name": "title", "type": "text"},
    {"name": "body", "type": "text"},
])

index.add(1, title="Rust Programming", body="Systems programming with memory safety")
index.add(2, title="Python Guide", body="Data science and web development")
index.commit()

results = index.search("programming", highlights=True)
for r in results:
    print(r.doc_id, r.score, r.highlights)

See the language-specific READMEs for full API docs.

Query types

lucivy queries operate on stored text (cross-token). They handle multi-word phrases, substrings, separators, and special characters naturally.

contains — the workhorse query

Fuzzy substring match with separator awareness.

# Exact substring
index.search({"type": "contains", "field": "body", "value": "programming language"})

# Substring within a token: "program" matches "programming"
index.search({"type": "contains", "field": "body", "value": "program"})

# Fuzzy tolerance (default distance=1, catches typos)
index.search({"type": "contains", "field": "body", "value": "programing languag", "distance": 1})

# Strict exact: distance=0 disables fuzzy
index.search({"type": "contains", "field": "body", "value": "programming", "distance": 0})

contains + regex

Regex on stored text (cross-token).

# Matches "programming language" — the .* spans the space between tokens
index.search({"type": "contains", "field": "body", "value": "program.*language", "regex": True})

# Alternation
index.search({"type": "contains", "field": "body", "value": "python|rust", "regex": True})

contains_split

Splits the query into words; each word becomes a contains query, and the results are combined with OR.

# String query (auto contains_split across all text fields)
index.search("rust async programming")

# Explicit dict query on a specific field
index.search({"type": "contains_split", "field": "body", "value": "memory safety"})

boolean

Combine sub-queries with must (AND), should (OR), must_not (NOT).

index.search({
    "type": "boolean",
    "must": [
        {"type": "contains", "field": "body", "value": "rust"},
        {"type": "contains", "field": "body", "value": "programming"},
    ],
    "must_not": [{"type": "contains", "field": "body", "value": "javascript"}],
})

Filters on non-text fields

Non-text fields (i64, f64, u64, keyword) can be filtered via the filters key.

index.search({
    "type": "contains",
    "field": "body",
    "value": "programming",
    "filters": [
        {"field": "year", "op": "gte", "value": 2023},
    ],
})
# Supported ops: eq, ne, lt, lte, gt, gte, in, not_in, between, starts_with, contains
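As a rough sketch of what the listed ops mean (plain-Python predicates, not lucivy's implementation; the function name is hypothetical):

```python
def matches(op: str, field_value, query_value) -> bool:
    """Return True if field_value satisfies the filter op."""
    if op == "eq":          return field_value == query_value
    if op == "ne":          return field_value != query_value
    if op == "lt":          return field_value < query_value
    if op == "lte":         return field_value <= query_value
    if op == "gt":          return field_value > query_value
    if op == "gte":         return field_value >= query_value
    if op == "in":          return field_value in query_value
    if op == "not_in":      return field_value not in query_value
    if op == "between":     return query_value[0] <= field_value <= query_value[1]
    if op == "starts_with": return field_value.startswith(query_value)
    if op == "contains":    return query_value in field_value
    raise ValueError(f"unsupported op: {op}")

matches("gte", 2024, 2023)      # True: year >= 2023
matches("between", 5, (1, 10))  # True: 1 <= 5 <= 10
```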

Highlights

All query types support byte-offset highlights. Internal fields (._raw, ._ngram) are automatically filtered out.

results = index.search("rust programming", highlights=True)
for r in results:
    if r.highlights:
        for field, offsets in r.highlights.items():
            print(f"  {field}: {offsets}")  # e.g. "body": [(5, 9), (20, 31)]

Snapshots (export / import)

Export an index to a portable .luce binary blob and import it elsewhere.

index.export_snapshot_to("./backup.luce")
restored = lucivy.Index.import_snapshot_from("./backup.luce", dest_path="./restored")

What contains matches

Fuzzy mode (default):

Query                 Document                           Match?  Why
programming           "Rust programming is fun"          yes     exact token match
programing (typo)     "Rust programming is fun"          yes     fuzzy, distance=1
program               "Rust programming is fun"          yes     substring of token
programming language  "...programming language used..."  yes     cross-token with separator
c++                   "c++ and c# are popular"           yes     separator-aware
std::collections      "use std::collections::HashMap"    yes     multi-token + :: separator
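The typo row comes down to edit distance: "programing" is one edit away from "programming", so it falls within the default distance=1. A minimal Levenshtein sketch (illustration only, not lucivy's fuzzy matcher):

```python
def edit_distance(a: str, b: str) -> int:
    """Classic Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

edit_distance("programing", "programming")  # 1 -> matches at the default distance=1
```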

Regex mode (regex: true):

Pattern            Document                           Match?  Why
program.*language  "...programming language used..."  yes     cross-token regex on stored text
python|rust        "Python is versatile"              yes     alternation
v[0-9]+            "version v2.0 released"            yes     full-scan fallback (literal < 3 chars)
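Since regex mode runs over the stored text, the rows above can be reproduced with Python's re module (assuming the stored text is lowercased, per the ._raw tokenizer):

```python
import re

# The .* spans the space between tokens -- cross-token matching.
assert re.search(r"program.*language", "rust programming language used")

# Alternation matches either literal.
assert re.search(r"python|rust", "python is versatile")

# Character class plus quantifier; lucivy handles this via full-scan fallback.
assert re.search(r"v[0-9]+", "version v2.0 released")
```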

Internals

Triple-field layout

Every text field automatically gets 3 sub-fields:

Sub-field      Tokenizer             Used by
{name}         stemmed or lowercase  phrase, parse queries (recall)
{name}._raw    lowercase only        contains verification (precision)
{name}._ngram  character trigrams    contains candidate generation

This is transparent to the user — you always reference the base field name.
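The ._ngram sub-field indexes character trigrams, which is what makes substring candidate generation possible: any token containing the query substring must also contain all of its trigrams. A sketch (not lucivy's tokenizer):

```python
def trigrams(token: str) -> set[str]:
    """All overlapping character 3-grams of a token."""
    return {token[i:i + 3] for i in range(len(token) - 2)}

# A substring's trigrams are a subset of the containing token's trigrams,
# so intersecting the trigram posting lists yields candidate documents.
trigrams("program") <= trigrams("programming")  # True: candidate survives
```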

NgramContainsQuery — how contains works

  1. Candidate collection — depends on mode:
    • Fuzzy: term dictionary lookup on ._raw (O(1) via FST), falling back to trigram intersection on ._ngram if the exact term isn't found
    • Regex: trigram union on ._ngram from extracted regex literals
    • Short literals: full segment scan when literals < 3 chars
  2. Verification — read stored text, dispatch to fuzzy or regex verifier
  3. BM25 scoring — standard idf * (1 + k1) * tf / (tf + k1 * (1 - b + b * dl / avgdl))
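Step 3's tf normalization can be written out directly. The idf variant below is an assumption (the Lucene-style smoothed log; lucivy's exact idf is not stated here), and k1=1.2, b=0.75 are the customary defaults:

```python
import math

def bm25(tf: float, df: int, n_docs: int, dl: float, avgdl: float,
         k1: float = 1.2, b: float = 0.75) -> float:
    """idf * (1 + k1) * tf / (tf + k1 * (1 - b + b * dl / avgdl))"""
    idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))   # assumed idf variant
    return idf * (1 + k1) * tf / (tf + k1 * (1 - b + b * dl / avgdl))
```

Scores grow sublinearly with term frequency and shrink for terms that appear in many documents or in longer-than-average documents.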

Building from source

# Rust library tests
cargo test --lib

# Python bindings
cd bindings/python
maturin develop --release
pytest tests/ -v

# Node.js bindings
cd bindings/nodejs && npm run build
node test.mjs

# C++ bindings
cargo build -p lucivy-cpp --release

Lineage

Fork of tantivy v0.26.0 (via izihawa/tantivy).

quickwit-oss/tantivy v0.22
  -> izihawa/tantivy v0.26.0 (regex phrase queries, FST improvements)
    -> L-Defraiteur/lucivy (NgramContainsQuery, contains_split, fuzzy/regex/hybrid modes, HighlightSink, Python/Node.js/C++/WASM bindings)

License

MIT. See LICENSE.

Fork of tantivy v0.26.0, also MIT (see NOTICE).
