
fulltext0 - A lightweight full-text search engine

A fast, lightweight full-text search engine library with native support for CJK (Chinese/Japanese/Korean) text indexing and search.

Features

  • CJK N-gram Tokenization: Automatically generates bigrams and unigrams for CJK characters
  • VarInt Compression: Efficient posting list compression using variable-length integer encoding
  • Memory-mapped Index: Fast query execution with mmap-based index access
  • Python ctypes Interface: Easy-to-use Python bindings with zero compilation required for basic usage
  • Cross-platform: Works on macOS, Linux, and other Unix-like systems

Installation

pip install fulltext0

Or build from source:

git clone https://github.com/ccccourse0/fulltext0.git
cd fulltext0
pip install .

Quick Start

Building an Index

import fulltext0

# Build index from a corpus file (one document per line)
fulltext0.build(
    corpus_path="documents.txt",
    idx_path="my_index.idx",
    off_path="my_index.offsets"
)

Searching

import fulltext0

# Open the index
with fulltext0.Index("my_index.idx", "my_index.offsets") as idx:
    # Search for documents
    doc_ids = idx.query("system")
    print(f"Found {len(doc_ids)} documents")

    # Get the actual document text
    lines = idx.get_lines("documents.txt", doc_ids[:5])
    for line in lines:
        print(line)

Tokenization

import fulltext0

# Tokenize text (CJK characters are split into bigrams and unigrams)
tokens = fulltext0.tokenize("Hello 系統設計 world")
# Yields 'hello', 'world', the CJK unigrams '系', '統', '設', '計',
# and the bigrams '系統', '統設', '設計' (token order may vary)

How It Works

CJK N-gram Tokenization

For Chinese, Japanese, and Korean text, the engine generates:

  • Bigrams: Consecutive character pairs (e.g., "系統" for "系" + "統")
  • Unigrams: Individual characters (e.g., "系", "統")

This approach handles the lack of word boundaries in CJK scripts without requiring a dictionary.
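The bigram + unigram scheme can be sketched in pure Python. This is a simplified illustration of the idea only; the library's actual tokenizer is implemented in C and may order tokens differently or detect CJK ranges more precisely:

```python
def cjk_ngrams(text):
    """Emit a unigram for each CJK character, plus a bigram with its successor."""
    tokens = []
    for i, ch in enumerate(text):
        tokens.append(ch)  # unigram: the character itself
        if i + 1 < len(text):
            tokens.append(ch + text[i + 1])  # bigram: this char + the next
    return tokens

print(cjk_ngrams("系統設計"))
# ['系', '系統', '統', '統設', '設', '設計', '計']
```

Indexing both granularities lets a single-character query still match, while bigrams keep multi-character queries precise.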

Inverted Index

The engine builds an inverted index mapping each token to the list of document IDs containing that token. Query execution performs an AND intersection across all query tokens.
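The AND intersection over sorted posting lists can be sketched with a standard two-pointer merge (an illustration of the technique, not the engine's C implementation):

```python
from functools import reduce

def intersect_two(a, b):
    """Merge-intersect two sorted doc-ID lists in O(len(a) + len(b))."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

def intersect_all(postings):
    """AND together the posting lists of every query token."""
    return reduce(intersect_two, postings)

print(intersect_all([[1, 3, 5, 7], [3, 4, 5], [5, 7]]))  # [5]
```

Intersecting the shortest lists first keeps intermediate results small, a common optimization for multi-term queries.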

Compression

Posting lists are compressed using VarInt (variable-length integer) delta encoding, reducing index size significantly for large document sets.
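The idea can be sketched as follows (an illustration of VarInt delta coding in general, not the library's exact byte format — this sketch assumes 7-bit groups, least-significant first, with a continuation bit in the high bit):

```python
def encode_postings(doc_ids):
    """Delta-encode sorted doc IDs, then VarInt-encode each gap."""
    out = bytearray()
    prev = 0
    for doc in doc_ids:
        gap = doc - prev  # gaps are small, so they fit in few bytes
        prev = doc
        while gap >= 0x80:
            out.append((gap & 0x7F) | 0x80)  # low 7 bits + continuation bit
            gap >>= 7
        out.append(gap)  # final byte, continuation bit clear
    return bytes(out)

def decode_postings(data):
    """Inverse: decode VarInt gaps and rebuild absolute doc IDs."""
    ids, cur, shift, val = [], 0, 0, 0
    for byte in data:
        val |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7  # more bytes follow for this gap
        else:
            cur += val  # gap complete; add to running doc ID
            ids.append(cur)
            val = shift = 0
    return ids

ids = [3, 130, 131, 10000]
assert decode_postings(encode_postings(ids)) == ids
```

Because doc IDs within a posting list are sorted, the gaps between them are much smaller than the IDs themselves, so most gaps encode in a single byte.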

CLI Usage

After installation, you can also use the command-line tools:

# Build index
fulltext0-index documents.txt

# Search
fulltext0-query "search terms"

API Reference

fulltext0.build(corpus_path, idx_path=None, off_path=None)

Build an inverted index from a corpus file.

Parameters:

  • corpus_path (str): Path to the corpus file (one document per line)
  • idx_path (str, optional): Path for the index file. Defaults to _index/data.idx
  • off_path (str, optional): Path for the offsets file. Defaults to _index/offsets.bin

Returns: int - 0 on success

fulltext0.Index(idx_path=None, off_path=None)

Open an existing index for searching.

Parameters:

  • idx_path (str, optional): Path to the index file
  • off_path (str, optional): Path to the offsets file

Methods:

  • stats() → IndexStats: Get index statistics (number of terms, documents)
  • query(query_str) → List[int]: Search for documents matching the query, returns list of doc IDs
  • get_lines(corpus_path, doc_ids) → List[str]: Retrieve the text of documents by ID
  • close(): Close the index

Context Manager: Supports the with statement for automatic cleanup.

fulltext0.tokenize(text) → List[str]

Tokenize text into search tokens.

Parameters:

  • text (str): Text to tokenize

Returns: List of tokens

Performance

  • Indexes 1000 documents in under 1 second
  • Query execution in milliseconds
  • ~27% of original posting list size with VarInt compression
  • O(1) term lookup using hash tables

Requirements

  • Python 3.8+
  • C compiler (gcc or clang)
  • Works on macOS, Linux, and Unix-like systems

License

MIT License

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.
