
A lightweight full-text search engine with CJK (Chinese/Japanese/Korean) support

Project description

fulltext0 - A lightweight full-text search engine

A fast, lightweight full-text search engine library with native support for CJK (Chinese/Japanese/Korean) text indexing and search.

Features

  • CJK N-gram Tokenization: Automatically generates bigrams and unigrams for CJK characters
  • VarInt Compression: Efficient posting list compression using variable-length integer encoding
  • Memory-mapped Index: Fast query execution with mmap-based index access
  • Python ctypes Interface: Easy-to-use Python bindings with zero compilation required for basic usage
  • Cross-platform: Works on Windows, macOS, and Linux

Installation

# From PyPI (once published)
pip install fulltext0

# From GitHub (latest version)
pip install git+https://github.com/ccccourse0/fulltext0.git

# Local development install
cd /path/to/fulltext0
pip install .

Note: If the fulltext0 command is not found after installation, add the user scripts directory to your PATH:

  • macOS / Linux: ~/.local/bin
  • Windows: %APPDATA%\Python\Python3x\Scripts

Or use pipx (pip install pipx) to install it in an isolated environment:

pipx install fulltext0

Command Line Interface

# Build index from a text file (one document per line)
fulltext0 index input.txt

# Query the index
fulltext0 query "框架"

# Query and show matching lines
fulltext0 query "框架" --show

# Specify custom index and corpus paths
fulltext0 index input.txt -o my_index.idx --offset my_index.offsets
fulltext0 query "框架" -i my_index.idx --offset my_index.offsets -c input.txt --show

Quick Start

Building an Index

import fulltext0

# Build index from a corpus file (one document per line)
fulltext0.build(
    corpus_path="documents.txt",
    idx_path="my_index.idx",
    off_path="my_index.offsets"
)

Searching

import fulltext0

# Open the index
with fulltext0.Index("my_index.idx", "my_index.offsets") as idx:
    # Search for documents
    doc_ids = idx.query("system")
    print(f"Found {len(doc_ids)} documents")

    # Get the actual document text
    lines = idx.get_lines("documents.txt", doc_ids[:5])
    for line in lines:
        print(line)

Tokenization

import fulltext0

# Tokenize text (CJK runs are split into bigrams and unigrams;
# ASCII words are lowercased)
tokens = fulltext0.tokenize("Hello 系統設計 world")
# Returns the bigrams (系統, 統設, 設計) and unigrams (系, 統, 設, 計)
# of the CJK run, plus 'hello' and 'world'

How It Works

CJK N-gram Tokenization

For Chinese, Japanese, and Korean text, the engine generates:

  • Bigrams: Consecutive character pairs (e.g., "系統" for "系" + "統")
  • Unigrams: Individual characters (e.g., "系", "統")

This approach handles the lack of word boundaries in CJK scripts without requiring a dictionary.
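The scheme can be sketched in a few lines of pure Python. This is an illustration of the idea, not the library's actual tokenizer (the real one is implemented in C); the character ranges used here are an assumption covering common CJK blocks:

```python
import re

# Common CJK blocks: Han, Hiragana/Katakana, Hangul (an approximation)
CJK = re.compile(r'[\u4e00-\u9fff\u3040-\u30ff\uac00-\ud7af]+')

def ngram_tokens(text: str) -> list[str]:
    """Lowercased ASCII words plus bigrams and unigrams for each CJK run."""
    tokens = []
    pos = 0
    for m in CJK.finditer(text):
        # Non-CJK segment before this run: plain word split, lowercased
        tokens += [w.lower() for w in re.findall(r'\w+', text[pos:m.start()])]
        run = m.group()
        # Bigrams: every consecutive character pair in the run
        tokens += [run[i:i + 2] for i in range(len(run) - 1)]
        # Unigrams: every individual character
        tokens += list(run)
        pos = m.end()
    tokens += [w.lower() for w in re.findall(r'\w+', text[pos:])]
    return tokens

print(ngram_tokens("Hello 系統設計 world"))
# → ['hello', '系統', '統設', '設計', '系', '統', '設', '計', 'world']
```

Indexing both bigrams and unigrams trades index size for recall: a single-character query still hits the unigram postings, while multi-character queries match via the denser bigrams.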

Inverted Index

The engine builds an inverted index mapping each token to the list of document IDs containing that token. Query execution performs an AND intersection across all query tokens.
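In Python terms, the structure and the AND query look roughly like this (a minimal in-memory sketch; the actual engine stores postings on disk in compressed form):

```python
from collections import defaultdict

def build_index(docs: list[list[str]]) -> dict[str, list[int]]:
    """Map each token to the sorted list of doc IDs containing it."""
    postings = defaultdict(set)
    for doc_id, tokens in enumerate(docs):
        for tok in tokens:
            postings[tok].add(doc_id)
    return {tok: sorted(ids) for tok, ids in postings.items()}

def query_and(index: dict[str, list[int]], tokens: list[str]) -> list[int]:
    """AND semantics: intersect the posting lists of all query tokens."""
    if not tokens:
        return []
    sets = [set(index.get(tok, ())) for tok in tokens]
    return sorted(set.intersection(*sets))

idx = build_index([["a", "b"], ["b", "c"], ["a", "b", "c"]])
print(query_and(idx, ["a", "b"]))  # → [0, 2]
```

A document matches only if every query token's posting list contains its ID, so rarer tokens shrink the candidate set quickly.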

Compression

Posting lists are compressed using VarInt (variable-length integer) delta encoding, reducing index size significantly for large document sets.
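The idea: because doc IDs in a posting list are sorted, storing the gaps between consecutive IDs yields small numbers, which VarInt then packs into as few bytes as needed (7 payload bits per byte, high bit marking continuation). A sketch of the scheme, not the library's on-disk format:

```python
def encode_postings(doc_ids: list[int]) -> bytes:
    """Delta-encode sorted doc IDs, then VarInt-encode each gap."""
    out = bytearray()
    prev = 0
    for doc_id in doc_ids:
        gap = doc_id - prev
        prev = doc_id
        # VarInt: 7 payload bits per byte, high bit set on continuation bytes
        while gap >= 0x80:
            out.append((gap & 0x7F) | 0x80)
            gap >>= 7
        out.append(gap)
    return bytes(out)

def decode_postings(data: bytes) -> list[int]:
    """Inverse: decode VarInt gaps and re-accumulate doc IDs."""
    ids, value, shift, prev = [], 0, 0, 0
    for byte in data:
        value |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
        else:
            prev += value
            ids.append(prev)
            value, shift = 0, 0
    return ids
```

For dense posting lists the gaps are mostly 1, so each entry costs one byte instead of a fixed 4 or 8, which is where the bulk of the size reduction comes from.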

API Reference

fulltext0.build(corpus_path, idx_path=None, off_path=None)

Build an inverted index from a corpus file.

Parameters:

  • corpus_path (str): Path to the corpus file (one document per line)
  • idx_path (str, optional): Path for the index file. Defaults to _index/data.idx
  • off_path (str, optional): Path for the offsets file. Defaults to _index/offsets.bin

Returns: int - 0 on success

fulltext0.Index(idx_path=None, off_path=None)

Open an existing index for searching.

Parameters:

  • idx_path (str, optional): Path to the index file
  • off_path (str, optional): Path to the offsets file

Methods:

  • stats() → IndexStats: Get index statistics (number of terms, documents)
  • query(query_str) → List[int]: Search for documents matching the query; returns a list of doc IDs
  • get_lines(corpus_path, doc_ids) → List[str]: Retrieve the text of documents by ID
  • close(): Close the index

Context Manager: Supports with statement for automatic cleanup.

fulltext0.tokenize(text) → List[str]

Tokenize text into search tokens.

Parameters:

  • text (str): Text to tokenize

Returns: List of tokens

Performance

  • Indexes 1000 documents in under 1 second
  • Query execution in milliseconds
  • VarInt compression reduces posting lists to ~27% of their original size
  • O(1) term lookup using hash tables

Requirements

  • Python 3.8+
  • C compiler (gcc, clang, or MSVC on Windows)
  • Works on Windows, macOS, and Linux

License

MIT License

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

Download files

Download the file for your platform.

Source Distribution

fulltext0-0.5.0.tar.gz (9.4 kB)

Uploaded Source

Built Distribution


fulltext0-0.5.0-py3-none-any.whl (6.9 kB)

Uploaded Python 3

File details

Details for the file fulltext0-0.5.0.tar.gz.

File metadata

  • Download URL: fulltext0-0.5.0.tar.gz
  • Upload date:
  • Size: 9.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.9

File hashes

Hashes for fulltext0-0.5.0.tar.gz
Algorithm Hash digest
SHA256 06fa0a976df9dc662613a80e4908a83307e3bd027e6b786627c068252bb2655e
MD5 235cd9aef8830f12098651161035e5ee
BLAKE2b-256 d5c8912e9f96caee60c6924eb5290510392b4f78ea483f15382330910a16c002


File details

Details for the file fulltext0-0.5.0-py3-none-any.whl.

File metadata

  • Download URL: fulltext0-0.5.0-py3-none-any.whl
  • Upload date:
  • Size: 6.9 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.9

File hashes

Hashes for fulltext0-0.5.0-py3-none-any.whl
Algorithm Hash digest
SHA256 da7c0e7df6543dfbd8a9c4a147eeb5e797ca6c5d0a3d4d6eda14a76119c7f7b8
MD5 5a56e6a426ea6be787e5f93662b5f800
BLAKE2b-256 38b916acc0932b9cde695289511dddd3bdc453ea9a767246c344b8c96052a277

