NanoFTS

A high-performance full-text search engine with a Rust core, featuring efficient indexing and search for both English and Chinese text.

Features

  • High Performance: Rust-powered core with sub-millisecond search latency
  • LSM-Tree Architecture: Scalable to billions of documents
  • Incremental Updates: Real-time document add/update/delete
  • Fuzzy Search: Intelligent fuzzy matching with configurable thresholds
  • Full CRUD: Complete document management operations
  • Result Handle: Zero-copy result handles with set operations (AND/OR/NOT)
  • NumPy Support: Direct numpy array output
  • Multilingual: Support for both English and Chinese text
  • Persistence: Disk-based storage with WAL recovery
  • Async Flush: Non-blocking flush_async() / wait_flush() for instant post-index searchability
  • LRU Cache: Built-in caching for frequently accessed terms
  • Data Import: Import from pandas, polars, arrow, parquet, CSV, JSON

Installation

pip install nanofts

Quick Start

from nanofts import create_engine

# Create a search engine
engine = create_engine(
    index_file="./index.nfts",
    track_doc_terms=True,  # Enable update/delete operations
)

# Add documents (field values must be strings)
engine.add_document(1, {"title": "Python Tutorial", "content": "Learn Python programming"})
engine.add_document(2, {"title": "Data Analysis", "content": "Process data with pandas"})
engine.flush()

# Search - returns ResultHandle object
result = engine.search("Python")
print(f"Found {result.total_hits} documents")
print(f"Document IDs: {result.to_list()}")

# Update document
engine.update_document(1, {"title": "Advanced Python Tutorial", "content": "Deep dive into Python"})

# Delete document
engine.remove_document(2)

# Compact to persist deletions
engine.compact()

Rust Usage (Rust Core)

The Rust crate name is nanofts (minimum Rust version: rustc >= 1.75). If you are building a Rust service, you can use it directly as a pure Rust full-text search library.

Add as a dependency

Add this to your project's Cargo.toml:

[dependencies]
nanofts = "0.7.0"

Optional features (see the Cargo.toml sketch after this list):

  • mimalloc: enabled by default; gives lower-latency, more consistent allocation performance
  • python: enables the PyO3/NumPy bindings (only needed if you build the Python extension)
  • simd: enables SIMD acceleration (requires nightly and packed_simd_2)
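
For example, feature selection in Cargo.toml might look like the following (a sketch using standard Cargo syntax; it assumes mimalloc is the crate's only default feature; pick one form per build):

# Default build (mimalloc enabled)
nanofts = "0.7.0"

# Opt out of the default allocator
nanofts = { version = "0.7.0", default-features = false }

# Enable SIMD acceleration (nightly toolchain required)
nanofts = { version = "0.7.0", features = ["simd"] }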

Minimal example: in-memory indexing and searching

use nanofts::{UnifiedEngine, EngineConfig, EngineResult};
use std::collections::HashMap;

fn main() -> EngineResult<()> {
    // 1) Create an in-memory engine
    let engine = UnifiedEngine::new(EngineConfig::memory_only())?;

    // 2) Add a document (field values must be String)
    let mut fields = HashMap::new();
    fields.insert("title".to_string(), "Rust Tutorial".to_string());
    fields.insert("content".to_string(), "Build a high-performance full-text search engine in Rust".to_string());
    engine.add_document(1, fields)?;

    // 3) Search
    let result = engine.search("Rust")?;
    println!("hits={}, ids={:?}", result.total_hits(), result.to_list());
    Ok(())
}

Persistence: single-file index + WAL recovery

use nanofts::{UnifiedEngine, EngineConfig, EngineResult};

fn main() -> EngineResult<()> {
    let config = EngineConfig::persistent("./index.nfts")
        .with_lazy_load(true)
        .with_cache_size(10_000);
    let engine = UnifiedEngine::new(config)?;

    // ... add/update/remove ...

    // Option A — synchronous flush (blocks until fsync completes)
    engine.flush()?;

    // Option B — asynchronous flush (returns immediately, data is searchable at once)
    engine.flush_async()?;
    let hits = engine.search("keyword")?; // ✅ correct results before wait
    engine.wait_flush()?;                 // block until durably persisted

    // Deletions become permanent only after compaction
    engine.compact()?;
    Ok(())
}

Run the built-in Rust example in this repo

cargo run --example basic_usage --release

Performance Tuning (Rust Developer Perspective)

Build and runtime knobs

  • Use release builds: cargo build --release / cargo run --release (this repo already configures lto=fat, codegen-units=1, panic=abort, strip=true for release).
  • Optimize for your CPU (optional): set RUSTFLAGS="-C target-cpu=native" when building/running on a specific machine.
  • SIMD (optional): if you enable --features simd, use nightly and validate the benefit for your workload.

Fastest ingestion formats and APIs

  • Prefer batch ingestion: it reduces per-document overhead and lets the engine use its optimized parallel paths.
  • Fastest Rust API: UnifiedEngine::add_documents_texts(doc_ids, texts) is the fastest ingestion path when you can pre-concatenate all searchable fields into a single String per document.
  • Columnar ingestion: UnifiedEngine::add_documents_columnar(doc_ids, columns) avoids constructing a HashMap per document and is a good fit for Arrow/DataFrame-style input. Both paths are sketched after this list.
  • Arrow zero-copy ingestion: if your data is already in Arrow (or can be represented as borrowed &str slices), use UnifiedEngine::add_documents_arrow_str(doc_ids, columns) (multi-column) or UnifiedEngine::add_documents_arrow_texts(doc_ids, texts) (single merged text column) to avoid String allocation/copy.
  • Batch HashMap ingestion: UnifiedEngine::add_documents(docs) is still much faster than calling add_document in a loop.
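
A minimal sketch of the two non-Arrow batch paths named above (the method names come from this list; the exact doc-id type and borrowing conventions are assumptions):

use nanofts::{UnifiedEngine, EngineConfig, EngineResult};

fn main() -> EngineResult<()> {
    let engine = UnifiedEngine::new(EngineConfig::memory_only())?;

    // Fastest path: one pre-merged text per document.
    let doc_ids: Vec<u64> = vec![1, 2];
    let texts = vec![
        "Rust Tutorial build a search engine".to_string(),
        "Data Analysis process data quickly".to_string(),
    ];
    engine.add_documents_texts(&doc_ids, &texts)?;

    // Columnar path: one Vec per field instead of a HashMap per document.
    let more_ids: Vec<u64> = vec![3, 4];
    let columns = vec![
        ("title".to_string(), vec!["Doc 3".to_string(), "Doc 4".to_string()]),
        ("content".to_string(), vec!["Content 3".to_string(), "Content 4".to_string()]),
    ];
    engine.add_documents_columnar(&more_ids, columns)?;
    Ok(())
}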

Arrow Zero-Copy API Examples

Multi-column zero-copy ingestion

use nanofts::{UnifiedEngine, EngineConfig};

let engine = UnifiedEngine::new(EngineConfig::memory_only())?;

// Simulate Arrow StringArray data (in real use, extract from Arrow)
let doc_ids = vec![1, 2, 3];
let titles = vec!["Title 1", "Title 2", "Title 3"];
let contents = vec!["Content 1", "Content 2", "Content 3"];

// Zero-copy columnar ingestion
let columns = vec![
    ("title".to_string(), titles),
    ("content".to_string(), contents),
];

engine.add_documents_arrow_str(&doc_ids, columns)?;

Single-column zero-copy ingestion (fastest for Arrow)

// Pre-merged text from Arrow (single column)
let doc_ids = vec![1, 2, 3];
let merged_texts = vec![
    "Title 1 Content 1",
    "Title 2 Content 2", 
    "Title 3 Content 3",
];

// Zero-copy single column ingestion
engine.add_documents_arrow_texts(&doc_ids, &merged_texts)?;

Real Arrow StringArray integration

// Example with real Arrow StringArray
use arrow_array::StringArray;

// Document IDs for the three rows below
let doc_ids = vec![1, 2, 3];

let title_array = StringArray::from(vec!["Title 1", "Title 2", "Title 3"]);
let content_array = StringArray::from(vec!["Content 1", "Content 2", "Content 3"]);

// Extract zero-copy string slices from Arrow
let title_slices: Vec<&str> = title_array.iter()
    .map(|s| s.unwrap_or(""))
    .collect();
let content_slices: Vec<&str> = content_array.iter()
    .map(|s| s.unwrap_or(""))
    .collect();

let columns = vec![
    ("title".to_string(), title_slices),
    ("content".to_string(), content_slices),
];

engine.add_documents_arrow_str(&doc_ids, columns)?;

Flush/compact strategy

  • flush() vs flush_async(): use flush() when you need a durability guarantee before proceeding; use flush_async() + wait_flush() when you want to start serving queries immediately after bulk ingestion without waiting for fsync.
  • flush_async() timing: returns after a fast CPU-only phase (buffer drain + in-memory merge). Data is immediately searchable. Call wait_flush() at any later point to ensure persistence.
  • flush() frequency: flushing periodically bounds WAL/memory usage, but flushing too often may increase IO amplification.
  • Deletion persistence: deletes/updates are logical until compact().
    • If you delete a lot, compact in bigger batches rather than after every small delete wave (see the sketch after this list).
  • Track doc terms only when needed: enable track_doc_terms (Python: track_doc_terms=True) only if you require update/delete support; it adds extra bookkeeping on ingestion.
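
For the bulk-delete pattern above, a minimal sketch (remove_document on the Rust engine and the u64 doc-id type are assumptions inferred from the add/update/remove comment in the persistence example; flush() and compact() appear there directly):

// Delete a whole wave of documents, then compact once at the end.
for doc_id in 1000u64..2000 {
    engine.remove_document(doc_id)?; // logical delete only
}
engine.flush()?;   // persist the logical deletes
engine.compact()?; // one compaction makes the entire wave permanent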

Large indexes and memory footprint

  • Use lazy_load when the index is large and you don't want to map everything into memory: with_lazy_load(true) / Python lazy_load=True.
  • Tune cache_size: in lazy_load mode, the cache hit rate is a major driver of latency; iterate using the counters from engine.stats(). A setup sketch follows this list.
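
A setup sketch (builder methods as in the persistence example above; treating the return value of engine.stats() as an opaque, debug-printable value is an assumption):

use nanofts::{UnifiedEngine, EngineConfig, EngineResult};

fn main() -> EngineResult<()> {
    // Large on-disk index: load postings on demand instead of mapping everything.
    let config = EngineConfig::persistent("./index.nfts")
        .with_lazy_load(true)
        .with_cache_size(50_000); // raise until the cache hit rate stops improving
    let engine = UnifiedEngine::new(config)?;

    // ... serve queries, then inspect counters such as the cache hit rate ...
    println!("{:?}", engine.stats());
    Ok(())
}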

Query-side optimization

  • Use boolean/batch APIs and set operations: prefer search_and / search_or or ResultHandle::{intersect, union, difference} to avoid repeated work (a sketch follows this list).
  • Fuzzy search is more expensive: fuzzy_search introduces extra candidate generation and edit-distance checks. Use it only when needed and tune thresholds/distances.
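
A sketch contrasting the two styles (the method names come from the list above; the argument and receiver types are assumptions):

// One boolean query instead of two searches plus a manual merge.
let both = engine.search_and(&["python", "tutorial"])?;

// Equivalent composition via result-handle set operations.
let py = engine.search("python")?;
let tut = engine.search("tutorial")?;
let merged = py.intersect(&tut);
println!("hits={}", merged.total_hits());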

Benchmarking and profiling

  • Benchmarks: use cargo bench (or your own fixed dataset) and compare A/B with realistic data scale, term distribution, and query sets.
  • CPU profiling: profile release binaries to find hot spots (tokenization, bitmap ops, IO, compression/decompression). On macOS, Instruments is usually the easiest.
  • Measure first: use engine.stats() to track search counts, cumulative time, and cache hit rate before tuning.

API Reference

Creating Engine

from nanofts import create_engine

engine = create_engine(
    index_file="./index.nfts",     # Index file path (empty string for memory-only)
    max_chinese_length=4,          # Max Chinese n-gram length
    min_term_length=2,             # Minimum term length to index
    fuzzy_threshold=0.7,           # Fuzzy search similarity threshold (0.0-1.0)
    fuzzy_max_distance=2,          # Maximum edit distance for fuzzy search
    track_doc_terms=False,         # Enable for update/delete support
    drop_if_exists=False,          # Drop existing index on creation
    lazy_load=False,               # Lazy load mode (memory efficient)
    cache_size=10000,              # LRU cache size for lazy load mode
)

Document Operations

# Add single document
engine.add_document(doc_id=1, fields={"title": "Hello", "content": "World"})

# Add multiple documents
docs = [
    (1, {"title": "Doc 1", "content": "Content 1"}),
    (2, {"title": "Doc 2", "content": "Content 2"}),
]
engine.add_documents(docs)

# Update document (requires track_doc_terms=True)
engine.update_document(1, {"title": "Updated", "content": "New content"})

# Delete single document
engine.remove_document(1)

# Delete multiple documents
engine.remove_documents([1, 2, 3])

# Flush buffer to disk (synchronous)
engine.flush()

# Flush asynchronously — data searchable immediately, disk I/O in background
engine.flush_async()
result = engine.search("keyword")  # ✅ Returns correct results right away
engine.wait_flush()                # Wait for background disk write to complete

# Compact index (applies deletions permanently)
engine.compact()

Search Operations

# Basic search - returns ResultHandle
result = engine.search("python programming")

# Get results
doc_ids = result.to_list()           # List[int]
doc_ids = result.to_numpy()          # numpy array
top_10 = result.top(10)              # Top N results
page_2 = result.page(page=2, size=10)  # Pagination

# Result properties
print(result.total_hits)             # Total match count
print(result.is_empty)               # Check if empty
print(1 in result)                   # Check if doc_id in results

# Fuzzy search (for typo tolerance)
result = engine.fuzzy_search("pythn", min_results=5)
print(result.fuzzy_used)             # True if fuzzy matching was applied

# Batch search
results = engine.search_batch(["python", "rust", "java"])

# AND search (intersection)
result = engine.search_and(["python", "tutorial"])

# OR search (union)
result = engine.search_or(["python", "rust"])

# Filter by document IDs
result = engine.filter_by_ids([1, 2, 3, 4, 5])

# Exclude specific IDs
result = engine.exclude_ids([1, 2])

Result Set Operations

# Search for different terms
python_docs = engine.search("python")
rust_docs = engine.search("rust")

# Intersection (AND)
both = python_docs.intersect(rust_docs)

# Union (OR)
either = python_docs.union(rust_docs)

# Difference (NOT)
python_only = python_docs.difference(rust_docs)

# Chained operations
result = engine.search("python").intersect(
    engine.search("tutorial")
).difference(
    engine.search("beginner")
)

Statistics

stats = engine.stats()
print(stats)
# {
#     'term_count': 1234,
#     'search_count': 100,
#     'fuzzy_search_count': 10,
#     'total_search_ns': 1234567,
#     ...
# }

Data Import

NanoFTS supports importing data from various sources:

from nanofts import create_engine

engine = create_engine("./index.nfts")

# Import from pandas DataFrame
import pandas as pd
df = pd.DataFrame({
    'id': [1, 2, 3],
    'title': ['Hello World', '全文搜索', 'Test Document'],
    'content': ['This is a test', '支持多语言', 'Another test']
})
engine.from_pandas(df, id_column='id')

# Import from Polars DataFrame
import polars as pl
df = pl.DataFrame({
    'id': [1, 2, 3],
    'title': ['Doc 1', 'Doc 2', 'Doc 3']
})
engine.from_polars(df, id_column='id')

# Import from PyArrow Table
import pyarrow as pa
table = pa.Table.from_pydict({
    'id': [1, 2, 3],
    'title': ['Arrow 1', 'Arrow 2', 'Arrow 3']
})
engine.from_arrow(table, id_column='id')

# Import from Parquet file
engine.from_parquet("documents.parquet", id_column='id')

# Import from CSV file
engine.from_csv("documents.csv", id_column='id')

# Import from JSON file
engine.from_json("documents.json", id_column='id')

# Import from JSON Lines file
engine.from_json("documents.jsonl", id_column='id', lines=True)

# Import from Python dict list
data = [
    {'id': 1, 'title': 'Hello', 'content': 'World'},
    {'id': 2, 'title': 'Test', 'content': 'Document'}
]
engine.from_dict(data, id_column='id')

Specifying Text Columns

By default, all columns except the ID column are indexed. You can specify which columns to index:

# Only index 'title' and 'content' columns, ignore 'metadata'
engine.from_pandas(df, id_column='id', text_columns=['title', 'content'])

# Same for other import methods
engine.from_csv("data.csv", id_column='id', text_columns=['title', 'content'])

CSV and JSON Options

You can pass additional options to the underlying pandas readers:

# CSV with custom delimiter
engine.from_csv("data.csv", id_column='id', sep=';', encoding='utf-8')

# JSON Lines format
engine.from_json("data.jsonl", id_column='id', lines=True)

Chinese Text Support

NanoFTS handles Chinese text using n-gram tokenization:

engine = create_engine(
    index_file="./chinese_index.nfts",
    max_chinese_length=4,  # Generate 2-, 3-, and 4-grams for Chinese
)

engine.add_document(1, {"content": "全文搜索引擎"})
engine.flush()

# Search Chinese text
result = engine.search("搜索")
print(result.to_list())  # [1]

Persistence and Recovery

# Create persistent index
engine = create_engine(index_file="./data.nfts")
engine.add_document(1, {"title": "Test"})
engine.flush()

# Close and reopen
del engine
engine = create_engine(index_file="./data.nfts")

# Data is automatically recovered
result = engine.search("Test")
print(result.to_list())  # [1]

# Important: Use compact() to persist deletions
engine.remove_document(1)
engine.compact()  # Deletions are now permanent

Memory-Only Mode

# Create in-memory engine (no persistence)
engine = create_engine(index_file="")

engine.add_document(1, {"content": "temporary data"})
# No flush needed for in-memory mode

result = engine.search("temporary")

Best Practices

For Production Use

  1. Always call compact() after bulk deletions - Deletions are only persisted after compaction
  2. Use track_doc_terms=True if you need update/delete operations
  3. Call flush() periodically to persist new documents; use flush_async() + wait_flush() when you want zero query-blocking during bulk index builds
  4. Use lazy_load=True for large indexes that don't fit in memory

Performance Tips

# Batch operations are faster
docs = [(i, {"content": f"doc {i}"}) for i in range(10000)]
engine.add_documents(docs)  # Much faster than individual add_document calls
engine.flush()

# Use batch search for multiple queries
results = engine.search_batch(["query1", "query2", "query3"])

# Use result set operations instead of multiple searches
# Good:
result = engine.search_and(["python", "tutorial"])
# Instead of:
# result = engine.search("python").intersect(engine.search("tutorial"))

Migration from Old API

If you're upgrading from the old FullTextSearch API:

# Old API (deprecated)
# from nanofts import FullTextSearch
# fts = FullTextSearch(index_dir="./index")
# fts.add_document(1, {"title": "Test"})
# results = fts.search("Test")  # Returns List[int]

# New API
from nanofts import create_engine
engine = create_engine(index_file="./index.nfts")
engine.add_document(1, {"title": "Test"})
result = engine.search("Test")
results = result.to_list()  # Returns List[int]

Key differences:

  • FullTextSearch → create_engine() function
  • index_dir → index_file (file path, not directory)
  • Search returns ResultHandle instead of List[int]
  • Call .to_list() to get document IDs
  • Use compact() to persist deletions

License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.
