A lightweight full-text search engine with CJK (Chinese/Japanese/Korean) support

fulltext0 - A lightweight full-text search engine

A fast, lightweight full-text search engine library with native support for CJK (Chinese/Japanese/Korean) text indexing and search.

Features

  • CJK N-gram Tokenization: Automatically generates bigrams and unigrams for CJK characters
  • VarInt Compression: Efficient posting list compression using variable-length integer encoding
  • Memory-mapped Index: Fast query execution with mmap-based index access
  • Python ctypes Interface: Easy-to-use Python bindings with zero compilation required for basic usage
  • Cross-platform: Works on Windows, macOS, and Linux

Installation

pip install fulltext0

Quick Start

Building an Index

import fulltext0

# Build index from a corpus file (one document per line)
fulltext0.build(
    corpus_path="documents.txt",
    idx_path="my_index.idx",
    off_path="my_index.offsets"
)

Searching

import fulltext0

# Open the index
with fulltext0.Index("my_index.idx", "my_index.offsets") as idx:
    # Search for documents
    doc_ids = idx.query("system")
    print(f"Found {len(doc_ids)} documents")

    # Get the actual document text
    lines = idx.get_lines("documents.txt", doc_ids[:5])
    for line in lines:
        print(line)

Tokenization

import fulltext0

# Tokenize text (CJK characters are split into bigrams and unigrams)
tokens = fulltext0.tokenize("Hello 系統設計 world")
# Returns 'hello' and 'world' plus the CJK unigrams ('系', '統', '設', '計')
# and bigrams ('系統', '統設', '設計'); the order of tokens in the list may vary

How It Works

CJK N-gram Tokenization

For Chinese, Japanese, and Korean text, the engine generates:

  • Bigrams: Consecutive character pairs (e.g., "系統" for "系" + "統")
  • Unigrams: Individual characters (e.g., "系", "統")

This approach handles the lack of word boundaries in CJK scripts without requiring a dictionary.
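
The scheme above can be sketched in pure Python. This is a minimal illustration, not the library's implementation: the names `is_cjk` and `sketch_tokenize` are hypothetical, the Unicode ranges are an assumption about what counts as CJK, and the token order may differ from what `fulltext0.tokenize` returns.

```python
from itertools import groupby


def is_cjk(ch):
    """Assumed CJK ranges: kana, CJK Unified Ideographs, Hangul syllables."""
    cp = ord(ch)
    return (
        0x3040 <= cp <= 0x30FF
        or 0x4E00 <= cp <= 0x9FFF
        or 0xAC00 <= cp <= 0xD7AF
    )


def _kind(ch):
    # Classify each character so consecutive characters of one kind form a run.
    if is_cjk(ch):
        return "cjk"
    if ch.isalnum():
        return "word"
    return "sep"


def sketch_tokenize(text):
    tokens = []
    for kind, group in groupby(text, key=_kind):
        run = "".join(group)
        if kind == "word":
            # Non-CJK words are kept whole and lowercased.
            tokens.append(run.lower())
        elif kind == "cjk":
            # Unigrams: every character in the run.
            tokens.extend(run)
            # Bigrams: every pair of adjacent characters in the run.
            tokens.extend(run[i:i + 2] for i in range(len(run) - 1))
    return tokens


print(sketch_tokenize("Hello 系統設計 world"))
```

Because every adjacent pair is indexed, a query such as "統設" can match even though it straddles what a dictionary would treat as a word boundary.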

Inverted Index

The engine builds an inverted index that maps each token to the list of IDs of the documents containing it (its posting list). Query execution intersects the posting lists of all query tokens (AND semantics), so a document matches only if it contains every token.
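
As a rough sketch of this idea, the following builds an in-memory inverted index and answers AND queries against it. The function names and the whitespace tokenizer are stand-ins chosen for illustration; the actual engine works on its own on-disk index format, which is not shown here.

```python
from collections import defaultdict


def simple_tokenize(text):
    # Stand-in tokenizer (assumption): lowercase and split on whitespace.
    return text.lower().split()


def build_index(docs):
    """Map each token to a sorted list of the doc IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for token in simple_tokenize(text):
            index[token].add(doc_id)
    return {token: sorted(ids) for token, ids in index.items()}


def and_query(index, query):
    """Return the doc IDs that contain every token of the query."""
    tokens = simple_tokenize(query)
    if not tokens:
        return []
    # Start from the shortest posting list to keep intermediate sets small.
    postings = sorted((index.get(t, []) for t in tokens), key=len)
    result = set(postings[0])
    for posting in postings[1:]:
        result &= set(posting)
        if not result:
            break
    return sorted(result)


docs = ["system design", "operating system", "design patterns"]
index = build_index(docs)
print(and_query(index, "system"))
print(and_query(index, "system design"))
```

A token that appears in no document yields an empty posting list, so any query containing it returns no results.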

Compression

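Posting lists are compressed using VarInt (variable-length integer) delta encoding: each list stores the gaps between successive document IDs rather than the IDs themselves, and small gaps are encoded in fewer bytes. This reduces index size significantly for large document sets.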
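
To make the encoding concrete, here is a minimal sketch of delta plus VarInt encoding using the common 7-bits-per-byte, little-endian scheme with a continuation bit. The exact byte layout used by fulltext0 is not documented here, so the layout and the names `encode_postings` and `decode_postings` are assumptions. The input must be a sorted list of non-negative doc IDs.

```python
def encode_postings(doc_ids):
    """Delta-encode sorted doc IDs, then write each gap as a VarInt."""
    out = bytearray()
    prev = 0
    for doc_id in doc_ids:
        gap = doc_id - prev
        prev = doc_id
        # Emit 7 bits per byte; the high bit marks "more bytes follow".
        while gap >= 0x80:
            out.append((gap & 0x7F) | 0x80)
            gap >>= 7
        out.append(gap)
    return bytes(out)


def decode_postings(data):
    """Inverse of encode_postings: rebuild the absolute doc IDs."""
    doc_ids = []
    value = shift = prev = 0
    for byte in data:
        value |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7          # continuation bit set: keep reading this gap
        else:
            prev += value       # gap complete: add it to the running ID
            doc_ids.append(prev)
            value = shift = 0
    return doc_ids


ids = [3, 7, 300, 301, 100000]
encoded = encode_postings(ids)
print(len(encoded), "bytes instead of", 4 * len(ids), "as fixed 32-bit integers")
print(decode_postings(encoded) == ids)
```

Dense posting lists benefit most, since their gaps are small and usually fit in a single byte.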

API Reference

fulltext0.build(corpus_path, idx_path=None, off_path=None)

Build an inverted index from a corpus file.

Parameters:

  • corpus_path (str): Path to the corpus file (one document per line)
  • idx_path (str, optional): Path for the index file. Defaults to _index/data.idx
  • off_path (str, optional): Path for the offsets file. Defaults to _index/offsets.bin

Returns: int - 0 on success

fulltext0.Index(idx_path=None, off_path=None)

Open an existing index for searching.

Parameters:

  • idx_path (str, optional): Path to the index file
  • off_path (str, optional): Path to the offsets file

Methods:

  • stats() → IndexStats: Get index statistics (number of terms, documents)
  • query(query_str) → List[int]: Search for documents matching the query, returns list of doc IDs
  • get_lines(corpus_path, doc_ids) → List[str]: Retrieve the text of documents by ID
  • close(): Close the index

Context Manager: Supports with statement for automatic cleanup.

fulltext0.tokenize(text) → List[str]

Tokenize text into search tokens.

Parameters:

  • text (str): Text to tokenize

Returns: List of tokens

Performance

  • Indexes 1000 documents in under 1 second
  • Queries execute in milliseconds
  • Posting lists compress to ~27% of their original size with VarInt compression
  • O(1) term lookup using hash tables

Requirements

  • Python 3.8+
  • C compiler (gcc, clang, or MSVC on Windows)
  • Windows, macOS, or Linux

License

MIT License

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.
