High-performance full-text search engine written in Rust

NanoFTS

A high-performance full-text search engine with a Rust core, featuring efficient indexing and search for both English and Chinese text.

Features

  • High Performance: Rust-powered core with sub-millisecond search latency
  • LSM-Tree Architecture: Scalable to billions of documents
  • Incremental Updates: Real-time document add/update/delete
  • Fuzzy Search: Intelligent fuzzy matching with configurable thresholds
  • Full CRUD: Complete document management operations
  • Result Handle: Zero-copy result with set operations (AND/OR/NOT)
  • NumPy Support: Direct numpy array output
  • Multilingual: Support for both English and Chinese text
  • Persistence: Disk-based storage with WAL recovery
  • LRU Cache: Built-in caching for frequently accessed terms
  • Data Import: Import from pandas, polars, arrow, parquet, CSV, JSON

Installation

pip install nanofts

Quick Start

from nanofts import create_engine

# Create a search engine
engine = create_engine(
    index_file="./index.nfts",
    track_doc_terms=True,  # Enable update/delete operations
)

# Add documents (field values must be strings)
engine.add_document(1, {"title": "Python教程", "content": "学习Python编程"})
engine.add_document(2, {"title": "数据分析", "content": "使用pandas进行数据处理"})
engine.flush()

# Search - returns ResultHandle object
result = engine.search("Python")
print(f"Found {result.total_hits} documents")
print(f"Document IDs: {result.to_list()}")

# Update document
engine.update_document(1, {"title": "高级Python教程", "content": "深入学习Python"})

# Delete document
engine.remove_document(2)

# Compact to persist deletions
engine.compact()

API Reference

Creating Engine

from nanofts import create_engine

engine = create_engine(
    index_file="./index.nfts",     # Index file path (empty string for memory-only)
    max_chinese_length=4,          # Max Chinese n-gram length
    min_term_length=2,             # Minimum term length to index
    fuzzy_threshold=0.7,           # Fuzzy search similarity threshold (0.0-1.0)
    fuzzy_max_distance=2,          # Maximum edit distance for fuzzy search
    track_doc_terms=False,         # Enable for update/delete support
    drop_if_exists=False,          # Drop existing index on creation
    lazy_load=False,               # Lazy load mode (memory efficient)
    cache_size=10000,              # LRU cache size for lazy load mode
)
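
How fuzzy_threshold and fuzzy_max_distance interact is not spelled out here. The sketch below shows one plausible reading in plain Python: a candidate term is accepted only if its edit distance stays within the maximum and a length-normalized similarity reaches the threshold. The function names and the similarity formula are assumptions for illustration, not NanoFTS internals.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions, and substitutions."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or keep)
            ))
        previous = current
    return previous[-1]


def is_fuzzy_match(query: str, term: str,
                   threshold: float = 0.7, max_distance: int = 2) -> bool:
    """Assumed rule: distance <= max_distance AND similarity >= threshold."""
    distance = levenshtein(query, term)
    if distance > max_distance:
        return False
    longest = max(len(query), len(term)) or 1
    similarity = 1.0 - distance / longest  # assumed normalization
    return similarity >= threshold
```

Under this reading, "pythn" matches "python" (one edit, similarity about 0.83), while a two-letter term with one typo falls below the 0.7 threshold.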

Document Operations

# Add single document
engine.add_document(doc_id=1, fields={"title": "Hello", "content": "World"})

# Add multiple documents
docs = [
    (1, {"title": "Doc 1", "content": "Content 1"}),
    (2, {"title": "Doc 2", "content": "Content 2"}),
]
engine.add_documents(docs)

# Update document (requires track_doc_terms=True)
engine.update_document(1, {"title": "Updated", "content": "New content"})

# Delete single document
engine.remove_document(1)

# Delete multiple documents
engine.remove_documents([1, 2, 3])

# Flush buffer to disk
engine.flush()

# Compact index (applies deletions permanently)
engine.compact()

Search Operations

# Basic search - returns ResultHandle
result = engine.search("python programming")

# Get results
doc_ids = result.to_list()           # List[int]
doc_ids = result.to_numpy()          # numpy array
top_10 = result.top(10)              # Top N results
page_2 = result.page(page=2, size=10)  # Pagination

# Result properties
print(result.total_hits)             # Total match count
print(result.is_empty)               # Check if empty
print(1 in result)                   # Check if doc_id in results

# Fuzzy search (for typo tolerance)
result = engine.fuzzy_search("pythn", min_results=5)
print(result.fuzzy_used)             # True if fuzzy matching was applied

# Batch search
results = engine.search_batch(["python", "rust", "java"])

# AND search (intersection)
result = engine.search_and(["python", "tutorial"])

# OR search (union)
result = engine.search_or(["python", "rust"])

# Filter by document IDs
result = engine.filter_by_ids([1, 2, 3, 4, 5])

# Exclude specific IDs
result = engine.exclude_ids([1, 2])
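
If you already hold a plain list of IDs (for example from to_list()), paging can also be done by slicing. The helper below is a sketch that assumes page numbers start at 1; whether result.page() counts from 0 or 1 is not stated here, so check against your own results. The name paginate is hypothetical.

```python
def paginate(doc_ids, page, size):
    """Return one page of doc_ids, assuming 1-based page numbers."""
    if page < 1 or size < 1:
        raise ValueError("page and size must be positive")
    start = (page - 1) * size
    return doc_ids[start:start + size]


all_ids = list(range(1, 26))         # 25 made-up document IDs
second_page = paginate(all_ids, page=2, size=10)
last_page = paginate(all_ids, page=3, size=10)
```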

Result Set Operations

# Search for different terms
python_docs = engine.search("python")
rust_docs = engine.search("rust")

# Intersection (AND)
both = python_docs.intersect(rust_docs)

# Union (OR)
either = python_docs.union(rust_docs)

# Difference (NOT)
python_only = python_docs.difference(rust_docs)

# Chained operations
result = engine.search("python").intersect(
    engine.search("tutorial")
).difference(
    engine.search("beginner")
)
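
The semantics of intersect, union, and difference mirror ordinary set algebra. The sketch below models them with built-in Python sets and made-up document IDs; it does not use ResultHandle and says nothing about how NanoFTS stores results.

```python
# Made-up ID sets standing in for three search results
python_docs = {1, 2, 3, 5}
tutorial_docs = {2, 3, 4}
beginner_docs = {3}

both = python_docs & tutorial_docs         # intersect -> AND
either = python_docs | tutorial_docs       # union -> OR
python_only = python_docs - tutorial_docs  # difference -> NOT

# Same shape as the chained example: python AND tutorial, NOT beginner
chained = (python_docs & tutorial_docs) - beginner_docs
```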

Statistics

stats = engine.stats()
print(stats)
# {
#     'term_count': 1234,
#     'search_count': 100,
#     'fuzzy_search_count': 10,
#     'total_search_ns': 1234567,
#     ...
# }
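
A derived metric such as average latency can be computed from these counters. The sketch below assumes total_search_ns is the cumulative search time in nanoseconds across search_count searches, which the key names suggest but this README does not confirm. The helper name is hypothetical.

```python
def average_search_latency_us(stats):
    """Mean search latency in microseconds, or 0.0 if nothing was searched."""
    count = stats.get("search_count", 0)
    if count == 0:
        return 0.0
    return stats["total_search_ns"] / count / 1_000


# Values copied from the sample output above
sample = {"term_count": 1234, "search_count": 100,
          "fuzzy_search_count": 10, "total_search_ns": 1234567}
latency = average_search_latency_us(sample)
```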

Data Import

NanoFTS supports importing data from various sources:

from nanofts import create_engine

engine = create_engine("./index.nfts")

# Import from pandas DataFrame
import pandas as pd
df = pd.DataFrame({
    'id': [1, 2, 3],
    'title': ['Hello World', '全文搜索', 'Test Document'],
    'content': ['This is a test', '支持多语言', 'Another test']
})
engine.from_pandas(df, id_column='id')

# Import from Polars DataFrame
import polars as pl
df = pl.DataFrame({
    'id': [1, 2, 3],
    'title': ['Doc 1', 'Doc 2', 'Doc 3']
})
engine.from_polars(df, id_column='id')

# Import from PyArrow Table
import pyarrow as pa
table = pa.Table.from_pydict({
    'id': [1, 2, 3],
    'title': ['Arrow 1', 'Arrow 2', 'Arrow 3']
})
engine.from_arrow(table, id_column='id')

# Import from Parquet file
engine.from_parquet("documents.parquet", id_column='id')

# Import from CSV file
engine.from_csv("documents.csv", id_column='id')

# Import from JSON file
engine.from_json("documents.json", id_column='id')

# Import from JSON Lines file
engine.from_json("documents.jsonl", id_column='id', lines=True)

# Import from Python dict list
data = [
    {'id': 1, 'title': 'Hello', 'content': 'World'},
    {'id': 2, 'title': 'Test', 'content': 'Document'}
]
engine.from_dict(data, id_column='id')

Specifying Text Columns

By default, all columns except the ID column are indexed. You can specify which columns to index:

# Only index 'title' and 'content' columns, ignore 'metadata'
engine.from_pandas(df, id_column='id', text_columns=['title', 'content'])

# Same for other import methods
engine.from_csv("data.csv", id_column='id', text_columns=['title', 'content'])
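
For sources that none of the import methods cover, the same column selection can be done by hand before calling add_documents. The sketch below builds the (doc_id, fields) tuples shown under Document Operations and coerces values to strings, since field values must be strings. The helper rows_to_docs is hypothetical and is not how the built-in importers are implemented.

```python
def rows_to_docs(rows, id_column="id", text_columns=None):
    """Turn dict rows into (doc_id, {field: str}) tuples for add_documents()."""
    docs = []
    for row in rows:
        if text_columns is not None:
            columns = text_columns
        else:
            # Default mirrors the README: every column except the ID column
            columns = [name for name in row if name != id_column]
        fields = {name: str(row[name]) for name in columns
                  if row.get(name) is not None}
        docs.append((int(row[id_column]), fields))
    return docs


rows = [{"id": 1, "title": "Hello", "content": "World", "metadata": 42}]
selected = rows_to_docs(rows, text_columns=["title", "content"])
everything = rows_to_docs(rows)
```

The resulting list can be passed to engine.add_documents(...).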

CSV and JSON Options

You can pass additional options to the underlying pandas readers:

# CSV with custom delimiter
engine.from_csv("data.csv", id_column='id', sep=';', encoding='utf-8')

# JSON Lines format
engine.from_json("data.jsonl", id_column='id', lines=True)
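
For reference, a JSON Lines file holds one JSON object per line rather than a single top-level array. The standard-library sketch below parses such text into the list-of-dicts shape that from_dict accepts; the helper name is hypothetical.

```python
import json


def read_json_lines(text):
    """Parse JSON Lines text into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


jsonl_text = '{"id": 1, "title": "Hello"}\n{"id": 2, "title": "Test"}\n'
records = read_json_lines(jsonl_text)
```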

Chinese Text Support

NanoFTS handles Chinese text using n-gram tokenization:

engine = create_engine(
    index_file="./chinese_index.nfts",
    max_chinese_length=4,  # Generate 2,3,4-gram for Chinese
)

engine.add_document(1, {"content": "全文搜索引擎"})
engine.flush()

# Search Chinese text
result = engine.search("搜索")
print(result.to_list())  # [1]
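
To see why the query "搜索" matches, it helps to enumerate the n-grams such a tokenizer could produce. The sketch below is a plain-Python illustration of character n-grams between a minimum and maximum length; the actual NanoFTS tokenizer may differ in details, and chinese_ngrams is a hypothetical name.

```python
def chinese_ngrams(text, min_len=2, max_len=4):
    """All contiguous character n-grams with min_len <= n <= max_len."""
    grams = []
    for n in range(min_len, max_len + 1):
        for start in range(len(text) - n + 1):
            grams.append(text[start:start + n])
    return grams


grams = chinese_ngrams("全文搜索引擎")
```

For this six-character string the sketch yields five 2-grams, four 3-grams, and three 4-grams, and "搜索" is among them.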

Persistence and Recovery

# Create persistent index
engine = create_engine(index_file="./data.nfts", track_doc_terms=True)
engine.add_document(1, {"title": "Test"})
engine.flush()

# Close and reopen
del engine
engine = create_engine(index_file="./data.nfts", track_doc_terms=True)

# Data is automatically recovered
result = engine.search("Test")
print(result.to_list())  # [1]

# Important: Use compact() to persist deletions
engine.remove_document(1)
engine.compact()  # Deletions are now permanent

Memory-Only Mode

# Create in-memory engine (no persistence)
engine = create_engine(index_file="")

engine.add_document(1, {"content": "temporary data"})
# No flush needed for in-memory mode

result = engine.search("temporary")

Best Practices

For Production Use

  1. Always call compact() after bulk deletions - Deletions are only persisted after compaction
  2. Use track_doc_terms=True if you need update/delete operations
  3. Call flush() periodically to persist new documents
  4. Use lazy_load=True for large indexes that don't fit in memory

Performance Tips

# Batch operations are faster
docs = [(i, {"content": f"doc {i}"}) for i in range(10000)]
engine.add_documents(docs)  # Much faster than individual add_document calls
engine.flush()

# Use batch search for multiple queries
results = engine.search_batch(["query1", "query2", "query3"])

# Prefer a single search_and() call over intersecting separate searches
# Good:
result = engine.search_and(["python", "tutorial"])
# Instead of:
# result = engine.search("python").intersect(engine.search("tutorial"))
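
When the documents come from a large or lazy source, batching and periodic flushing can be combined in one helper. The sketch below uses only add_documents() and flush() from this README; the helper names, the batch size, and the RecordingEngine stand-in (there only so the sketch runs without an index) are assumptions.

```python
from itertools import islice


def chunked(iterable, size):
    """Yield lists of up to `size` items from any iterable."""
    iterator = iter(iterable)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def index_in_batches(engine, docs, batch_size=1000):
    """Add docs in batches, flushing after each one. Returns the doc count."""
    total = 0
    for batch in chunked(docs, batch_size):
        engine.add_documents(batch)
        engine.flush()
        total += len(batch)
    return total


class RecordingEngine:
    """Stand-in that records calls; replace with create_engine(...) in real code."""

    def __init__(self):
        self.batches = []
        self.flushes = 0

    def add_documents(self, docs):
        self.batches.append(list(docs))

    def flush(self):
        self.flushes += 1


demo = RecordingEngine()
docs = ((i, {"content": f"doc {i}"}) for i in range(2500))
indexed = index_in_batches(demo, docs, batch_size=1000)
```

Flushing after every batch trades some throughput for durability; a larger batch_size or a less frequent flush may suit bulk loads better.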

Migration from Old API

If you're upgrading from the old FullTextSearch API:

# Old API (deprecated)
# from nanofts import FullTextSearch
# fts = FullTextSearch(index_dir="./index")
# fts.add_document(1, {"title": "Test"})
# results = fts.search("Test")  # Returns List[int]

# New API
from nanofts import create_engine
engine = create_engine(index_file="./index.nfts")
engine.add_document(1, {"title": "Test"})
result = engine.search("Test")
results = result.to_list()  # Returns List[int]

Key differences:

  • FullTextSearch → create_engine() function
  • index_dir → index_file (file path, not directory)
  • Search returns ResultHandle instead of List[int]
  • Call .to_list() to get document IDs
  • Use compact() to persist deletions
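
During a gradual migration, a thin adapter can keep old call sites working while the engine underneath changes. The sketch below wraps any object that offers add_document() and a search() returning something with to_list(), as described above. LegacySearchAdapter and the stub classes are hypothetical; the stubs exist only so the sketch runs without an index.

```python
class LegacySearchAdapter:
    """Hypothetical shim: old list-returning search() on top of a new engine."""

    def __init__(self, engine):
        self._engine = engine

    def add_document(self, doc_id, fields):
        self._engine.add_document(doc_id, fields)

    def search(self, query):
        # New API returns a ResultHandle; the old API returned List[int]
        return self._engine.search(query).to_list()


# Stand-ins for illustration; use create_engine(...) in real code
class _StubResult:
    def __init__(self, ids):
        self._ids = ids

    def to_list(self):
        return list(self._ids)


class _StubEngine:
    def __init__(self):
        self._docs = {}

    def add_document(self, doc_id, fields):
        self._docs[doc_id] = fields

    def search(self, query):
        hits = [doc_id for doc_id, fields in self._docs.items()
                if any(query in value for value in fields.values())]
        return _StubResult(sorted(hits))


fts = LegacySearchAdapter(_StubEngine())
fts.add_document(1, {"title": "Test"})
legacy_results = fts.search("Test")
```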

License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.
