# lucivy

Fast BM25 full-text search with substring matching, fuzzy search, and regex — powered by Rust.

BM25 search engine with cross-token fuzzy matching — it finds substrings, handles typos, and matches across word boundaries. Built for code search, technical docs, and as a BM25 complement to vector databases.
## Install

Everything is MIT-licensed.

| Language | Install |
|---|---|
| Python | `pip install lucivy` |
| Node.js | `npm install lucivy` |
| WASM (browser) | `npm install @lucivy/wasm` |
| Rust | `cargo add ld-lucivy` |
| C++ | Static library via CXX bridge (build from source) |
## Quick start

```python
import lucivy

index = lucivy.Index.create("./my_index", fields=[
    {"name": "title", "type": "text"},
    {"name": "body", "type": "text"},
])

index.add(1, title="Rust Programming", body="Systems programming with memory safety")
index.add(2, title="Python Guide", body="Data science and web development")
index.commit()

results = index.search("programming", highlights=True)
for r in results:
    print(r.doc_id, r.score, r.highlights)
```
See the language-specific READMEs for full API docs.
## Query types

lucivy queries operate on stored text (cross-token). They handle multi-word phrases, substrings, separators, and special characters naturally.
### `contains` — the workhorse query

Fuzzy substring match with separator awareness.

```python
# Exact substring
index.search({"type": "contains", "field": "body", "value": "programming language"})

# Substring within a token: "program" matches "programming"
index.search({"type": "contains", "field": "body", "value": "program"})

# Fuzzy tolerance (default distance=1, catches typos)
index.search({"type": "contains", "field": "body", "value": "programing languag", "distance": 1})

# Strict exact: distance=0 disables fuzzy matching
index.search({"type": "contains", "field": "body", "value": "programming", "distance": 0})
```
### `contains` + regex

Regex on stored text (cross-token).

```python
# Matches "programming language" — the .* spans the space between tokens
index.search({"type": "contains", "field": "body", "value": "program.*language", "regex": True})

# Alternation
index.search({"type": "contains", "field": "body", "value": "python|rust", "regex": True})
```
### `contains_split`

Splits the query into words; each word becomes a `contains` query, and the results are combined with OR.

```python
# String query (auto contains_split across all text fields)
index.search("rust async programming")

# Explicit dict query on a specific field
index.search({"type": "contains_split", "field": "body", "value": "memory safety"})
```
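Conceptually, a `contains_split` query expands into a boolean query with one `contains` clause per word under `should`. A minimal sketch of that expansion, assuming whitespace word splitting (the `expand_split` helper is illustrative, not part of the lucivy API):

```python
def expand_split(field: str, value: str) -> dict:
    """Illustrative expansion of a contains_split query into a boolean
    query whose `should` clauses hold one `contains` per word."""
    return {
        "type": "boolean",
        "should": [
            {"type": "contains", "field": field, "value": word}
            for word in value.split()
        ],
    }

query = expand_split("body", "memory safety")
# query["should"] now holds one contains clause per word
```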
### `boolean`

Combine sub-queries with `must` (AND), `should` (OR), and `must_not` (NOT).

```python
index.search({
    "type": "boolean",
    "must": [
        {"type": "contains", "field": "body", "value": "rust"},
        {"type": "contains", "field": "body", "value": "programming"},
    ],
    "must_not": [{"type": "contains", "field": "body", "value": "javascript"}],
})
```
### Filters on non-text fields

Non-text fields (`i64`, `f64`, `u64`, `keyword`) can be filtered via the `filters` key.

```python
index.search({
    "type": "contains",
    "field": "body",
    "value": "programming",
    "filters": [
        {"field": "year", "op": "gte", "value": 2023},
    ],
})

# Supported ops: eq, ne, lt, lte, gt, gte, in, not_in, between, starts_with, contains
```
## Highlights

All query types support byte-offset highlights. Internal fields (`._raw`, `._ngram`) are automatically filtered out.

```python
results = index.search("rust programming", highlights=True)
for r in results:
    if r.highlights:
        for field, offsets in r.highlights.items():
            print(f"  {field}: {offsets}")  # e.g. "body": [(5, 9), (20, 31)]
```
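The `(start, end)` pairs can be turned into marked-up snippets by slicing the stored text. A small sketch, assuming the offsets are half-open byte ranges into the UTF-8 encoding of the field value (the `render_highlights` helper is illustrative, not part of lucivy):

```python
def render_highlights(text: str, offsets: list[tuple[int, int]]) -> str:
    """Wrap each (start, end) byte range of `text` in [ ... ] markers.

    Offsets are assumed to be half-open byte ranges into the UTF-8
    encoding of `text`, processed left to right without overlaps.
    """
    data = text.encode("utf-8")
    out, prev = [], 0
    for start, end in sorted(offsets):
        out.append(data[prev:start].decode("utf-8"))
        out.append("[" + data[start:end].decode("utf-8") + "]")
        prev = end
    out.append(data[prev:].decode("utf-8"))
    return "".join(out)

print(render_highlights("Rust programming is fun", [(5, 16)]))
# Rust [programming] is fun
```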
## Snapshots (export / import)

Export an index to a portable `.luce` binary blob, then import it elsewhere.

```python
index.export_snapshot_to("./backup.luce")
restored = lucivy.Index.import_snapshot_from("./backup.luce", dest_path="./restored")
```
## What `contains` matches

Fuzzy mode (default):

| Query | Document | Match? | Why |
|---|---|---|---|
| `programming` | "Rust programming is fun" | yes | exact token match |
| `programing` (typo) | "Rust programming is fun" | yes | fuzzy distance=1 |
| `program` | "Rust programming is fun" | yes | substring of token |
| `programming language` | "...programming language used..." | yes | cross-token with separator |
| `c++` | "c++ and c# are popular" | yes | separator-aware |
| `std::collections` | "use std::collections::HashMap" | yes | multi-token + :: separator |
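The fuzzy rows above boil down to a bounded edit-distance check: a query matches when some substring of the stored text is within the given Levenshtein distance of it. A minimal, deliberately unoptimized sketch of that idea (lucivy's actual verifier is in Rust and far more efficient; `fuzzy_substring_match` is illustrative only):

```python
def edit_distance(a: str, b: str) -> int:
    """Plain dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def fuzzy_substring_match(query: str, text: str, distance: int = 1) -> bool:
    """True if any window of `text` (lowercased) is within `distance`
    edits of `query`. Window widths range over len(query) +/- distance."""
    query, text = query.lower(), text.lower()
    for width in range(max(1, len(query) - distance), len(query) + distance + 1):
        for start in range(len(text) - width + 1):
            if edit_distance(query, text[start:start + width]) <= distance:
                return True
    return False

print(fuzzy_substring_match("programing", "Rust programming is fun"))  # True: typo still matches
```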
Regex mode (`regex: true`):

| Pattern | Document | Match? | Why |
|---|---|---|---|
| `program.*language` | "...programming language used..." | yes | cross-token regex on stored text |
| `python\|rust` | "Python is versatile" | yes | alternation |
| `v[0-9]+` | "version v2.0 released" | yes | full-scan fallback (literal < 3 chars) |
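Candidate generation for regex mode depends on pulling literal runs out of the pattern (e.g. `program` and `language` from `program.*language`) and looking up their trigrams; a pattern whose literals are all shorter than 3 characters (like `v[0-9]+`) cannot produce a trigram and falls back to a full scan. A rough sketch of literal extraction, under the simplifying assumption that every metacharacter just splits the pattern (`extract_literals` is illustrative, not lucivy's actual extractor):

```python
# '-' is included because it is a metacharacter inside character classes.
META = set(".*+?()[]{}|\\^$-")

def extract_literals(pattern: str) -> list[str]:
    """Split a regex into runs of plain literal characters, treating any
    metacharacter as a separator. A rough approximation; a real extractor
    must parse the regex properly (escapes, classes, anchors)."""
    runs, current = [], []
    for ch in pattern:
        if ch in META:
            if current:
                runs.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:
        runs.append("".join(current))
    return runs

literals = extract_literals("program.*language")
# Literals shorter than 3 chars cannot yield a trigram, forcing a full scan
usable = [lit for lit in literals if len(lit) >= 3]
print(literals, usable)
```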
## Internals

### Triple-field layout

Every text field automatically gets 3 sub-fields:

| Sub-field | Tokenizer | Used by |
|---|---|---|
| `{name}` | stemmed or lowercase | phrase, parse queries (recall) |
| `{name}._raw` | lowercase only | contains verification (precision) |
| `{name}._ngram` | character trigrams | contains candidate generation |

This is transparent to the user — you always reference the base field name.
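The `._ngram` sub-field indexes character trigrams, so a substring query can be answered by intersecting the posting lists of the query's trigrams. A minimal sketch of trigram extraction and the resulting candidate test (`trigrams` is illustrative; lucivy's actual tokenizer may handle short or padded tokens differently):

```python
def trigrams(text: str) -> list[str]:
    """All overlapping 3-character windows of the lowercased text."""
    text = text.lower()
    return [text[i:i + 3] for i in range(len(text) - 2)]

# A document containing "programming" is a candidate for the query "program"
# because every trigram of "program" also appears in the document's trigrams.
query_grams = set(trigrams("program"))
doc_grams = set(trigrams("programming"))
print(query_grams <= doc_grams)  # True: subset check drives candidate generation
```

Candidates found this way are only probable matches; the verification step then reads the stored text to confirm.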
### NgramContainsQuery — how `contains` works

- Candidate collection — depends on mode:
  - Fuzzy: term dictionary lookup on `._raw` (O(1) via FST), falling back to trigram intersection on `._ngram` if the exact term isn't found
  - Regex: trigram union on `._ngram` from extracted regex literals
  - Short literals: full segment scan when literals < 3 chars
- Verification — read stored text, dispatch to the fuzzy or regex verifier
- BM25 scoring — standard `idf * (1 + k1) * tf / (tf + k1 * (1 - b + b * dl / avgdl))`
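Plugging typical defaults (`k1 = 1.2`, `b = 0.75`) into that formula gives a quick sanity check on how length normalization behaves. A self-contained sketch of the per-term score; the idf here uses the common `log((N - df + 0.5) / (df + 0.5) + 1)` form, which is an assumption, as lucivy may use a different idf variant:

```python
import math

def bm25_term_score(tf: int, df: int, n_docs: int, dl: int, avgdl: float,
                    k1: float = 1.2, b: float = 0.75) -> float:
    """Score of one query term in one document.

    tf: term frequency in the document, df: documents containing the term,
    n_docs: total documents, dl: document length, avgdl: average document
    length. The idf variant is an assumption (Lucene-style)."""
    idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1)
    return idf * (1 + k1) * tf / (tf + k1 * (1 - b + b * dl / avgdl))

# The same term frequency scores higher in a short document than a long one:
short = bm25_term_score(tf=2, df=10, n_docs=1000, dl=50, avgdl=100)
long_ = bm25_term_score(tf=2, df=10, n_docs=1000, dl=200, avgdl=100)
print(short > long_)  # True: length normalization penalizes long documents
```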
## Building from source

```shell
# Rust library tests
cargo test --lib

# Python bindings
cd bindings/python
maturin develop --release
pytest tests/ -v

# Node.js bindings
cd bindings/nodejs && npm run build
node test.mjs

# C++ bindings
cargo build -p lucivy-cpp --release
```
## Lineage

Fork of tantivy v0.26.0 (via izihawa/tantivy).

```
quickwit-oss/tantivy v0.22
  -> izihawa/tantivy v0.26.0 (regex phrase queries, FST improvements)
  -> L-Defraiteur/lucivy (NgramContainsQuery, contains_split, fuzzy/regex/hybrid modes, HighlightSink, Python/Node.js/C++/WASM bindings)
```
## License

MIT. See LICENSE.