A lightweight full-text search engine with CJK (Chinese/Japanese/Korean) support

fulltext0 - A lightweight full-text search engine

A fast, lightweight full-text search engine library with native support for CJK (Chinese/Japanese/Korean) text indexing and search.

Features

  • CJK N-gram Tokenization: Automatically generates bigrams and unigrams for CJK characters
  • VarInt Compression: Efficient posting list compression using variable-length integer encoding
  • Memory-mapped Index: Fast query execution with mmap-based index access
  • Python ctypes Interface: Easy-to-use Python bindings with zero compilation required for basic usage
  • Cross-platform: Works on macOS, Linux, and other Unix-like systems

Installation

pip install fulltext0

Or build from source:

git clone https://github.com/ccccourse0/fulltext0.git
cd fulltext0
pip install .

Quick Start

Building an Index

import fulltext0

# Build index from a corpus file (one document per line)
fulltext0.build(
    corpus_path="documents.txt",
    idx_path="my_index.idx",
    off_path="my_index.offsets"
)

Searching

import fulltext0

# Open the index
with fulltext0.Index("my_index.idx", "my_index.offsets") as idx:
    # Search for documents
    doc_ids = idx.query("system")
    print(f"Found {len(doc_ids)} documents")

    # Get the actual document text
    lines = idx.get_lines("documents.txt", doc_ids[:5])
    for line in lines:
        print(line)

Tokenization

import fulltext0

# Tokenize text (CJK characters are split into bigrams and unigrams)
tokens = fulltext0.tokenize("Hello 系統設計 world")
# Returns the lowercased words plus every CJK unigram and bigram, e.g.
# ['hello', '系', '統', '設', '計', '系統', '統設', '設計', 'world'] (token order may differ)

How It Works

CJK N-gram Tokenization

For Chinese, Japanese, and Korean text, the engine generates:

  • Bigrams: Consecutive character pairs (e.g., "系統" for "系" + "統")
  • Unigrams: Individual characters (e.g., "系", "統")

This approach handles the lack of word boundaries in CJK scripts without requiring a dictionary.
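
The scheme above can be sketched in pure Python. This is an illustration only, not the library's implementation: the function names `is_cjk` and `ngram_tokenize`, the Unicode ranges treated as CJK, and the token order are all assumptions.

```python
from typing import List


def is_cjk(ch: str) -> bool:
    """Assumption: only these Unicode blocks count as CJK in this sketch."""
    cp = ord(ch)
    return (
        0x4E00 <= cp <= 0x9FFF      # CJK Unified Ideographs
        or 0x3040 <= cp <= 0x30FF   # Hiragana and Katakana
        or 0xAC00 <= cp <= 0xD7A3   # Hangul Syllables
    )


def ngram_tokenize(text: str) -> List[str]:
    """Emit lowercased words, plus unigrams and bigrams for each CJK run."""
    tokens: List[str] = []
    i, n = 0, len(text)
    while i < n:
        ch = text[i]
        if is_cjk(ch):
            # Collect the whole run of consecutive CJK characters.
            j = i
            while j < n and is_cjk(text[j]):
                j += 1
            run = text[i:j]
            tokens.extend(run)                                        # unigrams
            tokens.extend(run[k:k + 2] for k in range(len(run) - 1))  # bigrams
            i = j
        elif ch.isalnum():
            # Collect a non-CJK word and lowercase it.
            j = i
            while j < n and text[j].isalnum() and not is_cjk(text[j]):
                j += 1
            tokens.append(text[i:j].lower())
            i = j
        else:
            i += 1  # skip whitespace and punctuation
    return tokens


print(ngram_tokenize("Hello 系統設計 world"))
```

Because every adjacent pair is indexed, a query such as "設計" can match without the engine knowing that it is a word.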

Inverted Index

The engine builds an inverted index mapping each token to the list of document IDs containing that token. Query execution performs an AND intersection across all query tokens.
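
A minimal sketch of this structure and of AND-style querying follows. It uses whitespace splitting as a stand-in tokenizer, and the names `build_index` and `query_and` are hypothetical; the library's on-disk layout and internals are not shown here.

```python
from collections import defaultdict
from typing import Dict, List


def build_index(docs: List[str]) -> Dict[str, List[int]]:
    """Map each token to the sorted list of document IDs that contain it."""
    postings = defaultdict(list)
    for doc_id, doc in enumerate(docs):
        # set() ensures each document appears at most once per token.
        for token in set(doc.lower().split()):
            postings[token].append(doc_id)
    return dict(postings)


def query_and(index: Dict[str, List[int]], query: str) -> List[int]:
    """Return the IDs of documents that contain every query token."""
    tokens = query.lower().split()
    if not tokens:
        return []
    # Intersect the shortest lists first so the candidate set shrinks quickly.
    lists = sorted((index.get(t, []) for t in tokens), key=len)
    result = set(lists[0])
    for plist in lists[1:]:
        result &= set(plist)
        if not result:
            break
    return sorted(result)


docs = ["search engine design", "search system", "system design engine"]
index = build_index(docs)
print(query_and(index, "search engine"))  # documents containing both tokens
```

A token that is missing from the index yields an empty posting list, so the whole AND query returns no documents.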

Compression

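Posting lists are compressed using VarInt (variable-length integer) delta encoding: each list stores the gaps between consecutive document IDs rather than the IDs themselves, and each gap is written in as few bytes as it needs. The Performance section reports compressed lists at roughly 27% of their original size.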
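
The idea can be sketched as below. The byte layout is an assumption (little-endian base-128, with the high bit of each byte marking continuation); the format fulltext0 actually writes is not specified here, and the function names are hypothetical.

```python
from typing import List


def encode_varint(n: int) -> bytes:
    """Encode a non-negative integer in 7-bit groups, low group first."""
    out = bytearray()
    while n >= 0x80:
        out.append((n & 0x7F) | 0x80)  # high bit set: more bytes follow
        n >>= 7
    out.append(n)                      # final byte has the high bit clear
    return bytes(out)


def encode_postings(doc_ids: List[int]) -> bytes:
    """Delta-encode a sorted list of doc IDs, then VarInt-encode each gap."""
    out = bytearray()
    prev = 0
    for doc_id in doc_ids:
        out += encode_varint(doc_id - prev)
        prev = doc_id
    return bytes(out)


def decode_postings(data: bytes) -> List[int]:
    """Reverse encode_postings: read each gap and accumulate it into an ID."""
    ids: List[int] = []
    gap, shift, prev = 0, 0, 0
    for byte in data:
        gap |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7          # continuation bit set: keep reading this gap
        else:
            prev += gap
            ids.append(prev)
            gap, shift = 0, 0
    return ids


doc_ids = [3, 130, 131, 1000]
encoded = encode_postings(doc_ids)
print(len(encoded), decode_postings(encoded))
```

Small gaps fit in a single byte, which is why dense posting lists compress well under this scheme.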

CLI Usage

After installation, you can also use the command-line tools:

# Build index
fulltext0-index documents.txt

# Search
fulltext0-query "search terms"

API Reference

fulltext0.build(corpus_path, idx_path=None, off_path=None)

Build an inverted index from a corpus file.

Parameters:

  • corpus_path (str): Path to the corpus file (one document per line)
  • idx_path (str, optional): Path for the index file. Defaults to _index/data.idx
  • off_path (str, optional): Path for the offsets file. Defaults to _index/offsets.bin

Returns: int - 0 on success

fulltext0.Index(idx_path=None, off_path=None)

Open an existing index for searching.

Parameters:

  • idx_path (str, optional): Path to the index file
  • off_path (str, optional): Path to the offsets file

Methods:

  • stats() → IndexStats: Get index statistics (number of terms, documents)
  • query(query_str) → List[int]: Search for documents matching the query, returns list of doc IDs
  • get_lines(corpus_path, doc_ids) → List[str]: Retrieve the text of documents by ID
  • close(): Close the index

Context Manager: Supports the with statement, which closes the index automatically on exit.

fulltext0.tokenize(text) → List[str]

Tokenize text into search tokens.

Parameters:

  • text (str): Text to tokenize

Returns: List of tokens

Performance

  • Indexes 1000 documents in under 1 second
  • Query execution in milliseconds
  • VarInt compression shrinks posting lists to ~27% of their original size
  • O(1) term lookup using hash tables

Requirements

  • Python 3.8+
  • C compiler (gcc or clang)
  • Works on macOS, Linux, and Unix-like systems

License

MIT License

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.
