
Fast BM25 full-text search with substring matching, fuzzy search, and regex — powered by Rust


lucivy

BM25 search engine with cross-token fuzzy matching — it finds substrings, handles typos, and matches across word boundaries. Built for code search, technical docs, and as a BM25 complement to vector databases.

Install

Everything is MIT-licensed.

Language        Install
Python          pip install lucivy
Node.js         npm install lucivy
WASM (browser)  npm install @lucivy/wasm
Rust            cargo add ld-lucivy
C++             Static library via CXX bridge (build from source)

Quick start

import lucivy

index = lucivy.Index.create("./my_index", fields=[
    {"name": "title", "type": "text"},
    {"name": "body", "type": "text"},
])

index.add(1, title="Rust Programming", body="Systems programming with memory safety")
index.add(2, title="Python Guide", body="Data science and web development")
index.commit()

results = index.search("programming", highlights=True)
for r in results:
    print(r.doc_id, r.score, r.highlights)

See the language-specific READMEs for full API docs.

Query types

lucivy queries operate on stored text (cross-token). They handle multi-word phrases, substrings, separators, and special characters naturally.

contains — the workhorse query

Fuzzy substring match with separator awareness.

# Exact substring
index.search({"type": "contains", "field": "body", "value": "programming language"})

# Substring within a token: "program" matches "programming"
index.search({"type": "contains", "field": "body", "value": "program"})

# Fuzzy tolerance (default distance=1, catches typos)
index.search({"type": "contains", "field": "body", "value": "programing languag", "distance": 1})

# Strict exact: distance=0 disables fuzzy
index.search({"type": "contains", "field": "body", "value": "programming", "distance": 0})

contains + regex

Regex on stored text (cross-token).

# Matches "programming language" — the .* spans the space between tokens
index.search({"type": "contains", "field": "body", "value": "program.*language", "regex": True})

# Alternation
index.search({"type": "contains", "field": "body", "value": "python|rust", "regex": True})

contains_split

Splits the query into words; each word becomes a contains query, and the results are combined with OR.

# String query (auto contains_split across all text fields)
index.search("rust async programming")

# Explicit dict query on a specific field
index.search({"type": "contains_split", "field": "body", "value": "memory safety"})

boolean

Combine sub-queries with must (AND), should (OR), must_not (NOT).

index.search({
    "type": "boolean",
    "must": [
        {"type": "contains", "field": "body", "value": "rust"},
        {"type": "contains", "field": "body", "value": "programming"},
    ],
    "must_not": [{"type": "contains", "field": "body", "value": "javascript"}],
})

Filters on non-text fields

Non-text fields (i64, f64, u64, keyword) can be filtered via the filters key.

index.search({
    "type": "contains",
    "field": "body",
    "value": "programming",
    "filters": [
        {"field": "year", "op": "gte", "value": 2023},
    ],
})
# Supported ops: eq, ne, lt, lte, gt, gte, in, not_in, between, starts_with, contains
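As a rough sketch of what the listed ops mean (plain-Python predicates, not lucivy's implementation; the function name is hypothetical):

```python
def matches(op: str, field_value, query_value) -> bool:
    """Return True if field_value satisfies the filter op."""
    if op == "eq":          return field_value == query_value
    if op == "ne":          return field_value != query_value
    if op == "lt":          return field_value < query_value
    if op == "lte":         return field_value <= query_value
    if op == "gt":          return field_value > query_value
    if op == "gte":         return field_value >= query_value
    if op == "in":          return field_value in query_value
    if op == "not_in":      return field_value not in query_value
    if op == "between":     return query_value[0] <= field_value <= query_value[1]
    if op == "starts_with": return field_value.startswith(query_value)
    if op == "contains":    return query_value in field_value
    raise ValueError(f"unsupported op: {op}")

matches("gte", 2024, 2023)      # True: year >= 2023
matches("between", 5, (1, 10))  # True: 1 <= 5 <= 10
```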

Highlights

All query types support byte-offset highlights. Internal fields (._raw, ._ngram) are automatically filtered out.

results = index.search("rust programming", highlights=True)
for r in results:
    if r.highlights:
        for field, offsets in r.highlights.items():
            print(f"  {field}: {offsets}")  # e.g. "body": [(5, 9), (20, 31)]

Snapshots (export / import)

Export an index to a portable .luce binary blob and import it elsewhere.

index.export_snapshot_to("./backup.luce")
restored = lucivy.Index.import_snapshot_from("./backup.luce", dest_path="./restored")

What contains matches

Fuzzy mode (default):

Query                 Document                           Match?  Why
programming           "Rust programming is fun"          yes     exact token match
programing (typo)     "Rust programming is fun"          yes     fuzzy, distance=1
program               "Rust programming is fun"          yes     substring of token
programming language  "...programming language used..."  yes     cross-token with separator
c++                   "c++ and c# are popular"           yes     separator-aware
std::collections      "use std::collections::HashMap"    yes     multi-token + :: separator
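The typo row comes down to edit distance: "programing" is one edit away from "programming", so it falls within the default distance=1. A minimal Levenshtein sketch (illustration only, not lucivy's fuzzy matcher):

```python
def edit_distance(a: str, b: str) -> int:
    """Classic Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

edit_distance("programing", "programming")  # 1 -> matches at the default distance=1
```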

Regex mode (regex: true):

Pattern            Document                           Match?  Why
program.*language  "...programming language used..."  yes     cross-token regex on stored text
python|rust        "Python is versatile"              yes     alternation
v[0-9]+            "version v2.0 released"            yes     full-scan fallback (literal < 3 chars)
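Since regex mode runs over the stored text, the rows above can be reproduced with Python's re module (assuming the stored text is lowercased, per the ._raw tokenizer):

```python
import re

# The .* spans the space between tokens -- cross-token matching.
assert re.search(r"program.*language", "rust programming language used")

# Alternation matches either literal.
assert re.search(r"python|rust", "python is versatile")

# Character class plus quantifier; lucivy handles this via full-scan fallback.
assert re.search(r"v[0-9]+", "version v2.0 released")
```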

Internals

Triple-field layout

Every text field automatically gets 3 sub-fields:

Sub-field      Tokenizer             Used by
{name}         stemmed or lowercase  phrase, parse queries (recall)
{name}._raw    lowercase only        contains verification (precision)
{name}._ngram  character trigrams    contains candidate generation

This is transparent to the user — you always reference the base field name.
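The ._ngram sub-field indexes character trigrams, which is what makes substring candidate generation possible: any token containing the query substring must also contain all of its trigrams. A sketch (not lucivy's tokenizer):

```python
def trigrams(token: str) -> set[str]:
    """All overlapping character 3-grams of a token."""
    return {token[i:i + 3] for i in range(len(token) - 2)}

# A substring's trigrams are a subset of the containing token's trigrams,
# so intersecting the trigram posting lists yields candidate documents.
trigrams("program") <= trigrams("programming")  # True: candidate survives
```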

NgramContainsQuery — how contains works

  1. Candidate collection — depends on mode:
    • Fuzzy: term dictionary lookup on ._raw (O(1) via FST), falling back to trigram intersection on ._ngram if the exact term isn't found
    • Regex: trigram union on ._ngram from extracted regex literals
    • Short literals: full segment scan when literals < 3 chars
  2. Verification — read stored text, dispatch to fuzzy or regex verifier
  3. BM25 scoring — standard idf * (1 + k1) * tf / (tf + k1 * (1 - b + b * dl / avgdl))
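Step 3's tf normalization can be written out directly. The idf variant below is an assumption (the Lucene-style smoothed log; lucivy's exact idf is not stated here), and k1=1.2, b=0.75 are the customary defaults:

```python
import math

def bm25(tf: float, df: int, n_docs: int, dl: float, avgdl: float,
         k1: float = 1.2, b: float = 0.75) -> float:
    """idf * (1 + k1) * tf / (tf + k1 * (1 - b + b * dl / avgdl))"""
    idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))   # assumed idf variant
    return idf * (1 + k1) * tf / (tf + k1 * (1 - b + b * dl / avgdl))
```

Scores grow sublinearly with term frequency and shrink for terms that appear in many documents or in longer-than-average documents.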

Building from source

# Rust library tests
cargo test --lib

# Python bindings
cd bindings/python
maturin develop --release
pytest tests/ -v

# Node.js bindings
cd bindings/nodejs && npm run build
node test.mjs

# C++ bindings
cargo build -p lucivy-cpp --release

Lineage

Fork of tantivy v0.26.0 (via izihawa/tantivy).

quickwit-oss/tantivy v0.22
  -> izihawa/tantivy v0.26.0 (regex phrase queries, FST improvements)
    -> L-Defraiteur/lucivy (NgramContainsQuery, contains_split, fuzzy/regex/hybrid modes, HighlightSink, Python/Node.js/C++/WASM bindings)

License

MIT. See LICENSE.

Fork of tantivy v0.26.0, also MIT (see NOTICE).
