A lightweight full-text search engine with CJK (Chinese/Japanese/Korean) support

fulltext0 - A lightweight full-text search engine

A fast, lightweight full-text search engine library with native support for CJK (Chinese/Japanese/Korean) text indexing and search.

Features

CJK N-gram Tokenization: Automatically generates bigrams and unigrams for CJK characters
VarInt Compression: Efficient posting list compression using variable-length integer encoding
Memory-mapped Index: Fast query execution with mmap-based index access
Python ctypes Interface: Easy-to-use Python bindings with zero compilation required for basic usage
Cross-platform: Works on macOS, Linux, and other Unix-like systems
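
To illustrate the memory-mapped access mentioned in the feature list, here is a minimal sketch using Python's standard mmap module. The file layout (a flat array of little-endian 32-bit integers) and the helper names write_demo_file and read_value are assumptions made for this example; they are not fulltext0's actual index format or API.

```python
import mmap
import os
import struct
import tempfile


def write_demo_file(path, values):
    """Write values as consecutive little-endian uint32s (hypothetical layout)."""
    with open(path, "wb") as f:
        for v in values:
            f.write(struct.pack("<I", v))


def read_value(path, position):
    """Read the uint32 at the given position without loading the whole file."""
    with open(path, "rb") as f:
        # Length 0 maps the entire file; ACCESS_READ keeps the mapping read-only.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            (value,) = struct.unpack_from("<I", mm, position * 4)
            return value


with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.bin")
    write_demo_file(demo_path, [10, 20, 30, 40])
    third = read_value(demo_path, 2)

print(third)  # 30
```

Because the operating system pages data in on demand, a reader built this way only touches the parts of the file a query actually needs.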

Installation

pip install fulltext0

Or build from source:

git clone https://github.com/ccccourse0/fulltext0.git
cd fulltext0
pip install .

Quick Start

Building an Index

import fulltext0

# Build index from a corpus file (one document per line)
fulltext0.build(
    corpus_path="documents.txt",
    idx_path="my_index.idx",
    off_path="my_index.offsets"
)

Searching

import fulltext0

# Open the index
with fulltext0.Index("my_index.idx", "my_index.offsets") as idx:
    # Search for documents
    doc_ids = idx.query("system")
    print(f"Found {len(doc_ids)} documents")

    # Get the actual document text
    lines = idx.get_lines("documents.txt", doc_ids[:5])
    for line in lines:
        print(line)

Tokenization

import fulltext0

# Tokenize text (CJK characters are split into bigrams and unigrams)
tokens = fulltext0.tokenize("Hello 系統設計 world")
# Returns: ['hello', '系', '統設', '統', '系統', '設', '計', 'world']

How It Works

CJK N-gram Tokenization

For Chinese, Japanese, and Korean text, the engine generates:

Bigrams: Consecutive character pairs (e.g., "系統" for "系" + "統")
Unigrams: Individual characters (e.g., "系", "統")

This approach handles the lack of word boundaries in CJK scripts without requiring a dictionary.
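
The scheme described above can be sketched in pure Python. This is an illustration only, not the library's tokenizer: the names is_cjk, cjk_ngrams, and sketch_tokenize are hypothetical, the Unicode ranges are a rough approximation of "CJK", and the token order (unigrams first, then bigrams) is an assumption that may differ from what fulltext0.tokenize returns.

```python
from typing import List


def is_cjk(ch: str) -> bool:
    """Rough CJK check (assumed ranges, not exhaustive)."""
    cp = ord(ch)
    return (
        0x4E00 <= cp <= 0x9FFF      # CJK Unified Ideographs
        or 0x3040 <= cp <= 0x30FF   # Hiragana and Katakana
        or 0xAC00 <= cp <= 0xD7AF   # Hangul syllables
    )


def cjk_ngrams(run: str) -> List[str]:
    """Unigrams followed by overlapping bigrams for one run of CJK characters."""
    unigrams = list(run)
    bigrams = [run[i:i + 2] for i in range(len(run) - 1)]
    return unigrams + bigrams


def sketch_tokenize(text: str) -> List[str]:
    tokens: List[str] = []
    buf = ""             # characters of the run being collected
    buf_is_cjk = False   # whether the current run is CJK

    def flush() -> None:
        nonlocal buf
        if buf:
            tokens.extend(cjk_ngrams(buf) if buf_is_cjk else [buf.lower()])
            buf = ""

    for ch in text:
        if is_cjk(ch):
            kind = True
        elif ch.isalnum():
            kind = False
        else:
            flush()      # whitespace or punctuation ends the current run
            continue
        if buf and kind != buf_is_cjk:
            flush()      # switching between CJK and non-CJK also ends the run
        buf_is_cjk = kind
        buf += ch
    flush()
    return tokens


print(sketch_tokenize("Hello 系統設計 world"))
# ['hello', '系', '統', '設', '計', '系統', '統設', '設計', 'world']
```

Indexing both unigrams and bigrams lets a single-character query still match, while bigrams keep multi-character queries more selective.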

Inverted Index

The engine builds an inverted index mapping each token to the list of document IDs containing that token. Query execution performs an AND intersection across all query tokens.
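
As a rough sketch of that idea, the following builds a small in-memory inverted index and answers a query by intersecting posting lists. It assumes a plain whitespace tokenizer and 0-based document IDs, and the function names build_index, intersect_sorted, and and_query are hypothetical; the library's on-disk structures and ID numbering may differ.

```python
from collections import defaultdict
from typing import Dict, List


def build_index(docs: List[str]) -> Dict[str, List[int]]:
    """Map each token to the sorted list of IDs of documents containing it."""
    postings = defaultdict(list)
    for doc_id, doc in enumerate(docs):
        # set() ensures a document is recorded once per token.
        for token in set(doc.lower().split()):
            postings[token].append(doc_id)  # IDs arrive in increasing order
    return dict(postings)


def intersect_sorted(a: List[int], b: List[int]) -> List[int]:
    """Two-pointer intersection of two sorted ID lists."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


def and_query(index: Dict[str, List[int]], query: str) -> List[int]:
    """Return IDs of documents that contain every query token."""
    tokens = query.lower().split()
    if not tokens:
        return []
    lists = sorted((index.get(t, []) for t in tokens), key=len)
    result = lists[0]  # start from the shortest list to keep work small
    for other in lists[1:]:
        result = intersect_sorted(result, other)
    return result


docs = ["search engine design", "system design notes", "search system internals"]
index = build_index(docs)
print(and_query(index, "system design"))  # [1]
print(and_query(index, "search"))         # [0, 2]
```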

Compression

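Posting lists are compressed using VarInt (variable-length integer) delta encoding: each list stores the gaps between consecutive document IDs rather than the IDs themselves, and each gap is written in as few bytes as it needs. The project reports compressed posting lists at roughly 27% of their original size.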
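
A minimal sketch of VarInt delta encoding follows. The byte layout (7 data bits per byte, least-significant group first, high bit set as a continuation flag) is a common convention and an assumption here, not a confirmed description of fulltext0's file format. The names encode_postings and decode_postings are hypothetical, and the input is assumed to be a sorted list of non-negative document IDs.

```python
from typing import List


def encode_postings(doc_ids: List[int]) -> bytes:
    """Delta-encode sorted doc IDs, then write each gap as a VarInt."""
    out = bytearray()
    prev = 0
    for doc_id in doc_ids:
        gap = doc_id - prev          # small when IDs are close together
        prev = doc_id
        while gap >= 0x80:
            out.append((gap & 0x7F) | 0x80)  # low 7 bits + continuation flag
            gap >>= 7
        out.append(gap)              # final byte has the high bit clear
    return bytes(out)


def decode_postings(data: bytes) -> List[int]:
    """Inverse of encode_postings: rebuild absolute doc IDs from the gaps."""
    doc_ids = []
    prev = 0
    value = 0
    shift = 0
    for byte in data:
        value |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7               # more bytes belong to this gap
        else:
            prev += value            # gap complete: add it to the running ID
            doc_ids.append(prev)
            value = 0
            shift = 0
    return doc_ids


ids = [3, 130, 131, 100000]
encoded = encode_postings(ids)
print(len(encoded))                     # 6 bytes, versus 16 as four 32-bit integers
print(decode_postings(encoded) == ids)  # True
```

The saving depends on how dense the posting lists are: frequent tokens produce small gaps that fit in one byte, while rare tokens produce larger gaps that need more.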

CLI Usage

After installation, you can also use the command-line tools:

# Build index
fulltext0-index documents.txt

# Search
fulltext0-query "search terms"

API Reference

fulltext0.build(corpus_path, idx_path=None, off_path=None)

Build an inverted index from a corpus file.

Parameters:

corpus_path (str): Path to the corpus file (one document per line)
idx_path (str, optional): Path for the index file. Defaults to _index/data.idx
off_path (str, optional): Path for the offsets file. Defaults to _index/offsets.bin

Returns: int - 0 on success

fulltext0.Index(idx_path=None, off_path=None)

Open an existing index for searching.

Parameters:

idx_path (str, optional): Path to the index file
off_path (str, optional): Path to the offsets file

Methods:

stats() → IndexStats: Get index statistics (number of terms, documents)
query(query_str) → List[int]: Search for documents matching the query, returns list of doc IDs
get_lines(corpus_path, doc_ids) → List[str]: Retrieve the text of documents by ID
close(): Close the index

Context manager: Index supports the with statement, which closes the index automatically on exit.

fulltext0.tokenize(text) → List[str]

Tokenize text into search tokens.

Parameters:

text (str): Text to tokenize

Returns: List of tokens

Performance

Indexes 1,000 documents in under 1 second
Queries execute in milliseconds
Compressed posting lists are ~27% of their original size with VarInt compression
O(1) average-case term lookup using hash tables

Requirements

Python 3.8+
C compiler (gcc or clang)
macOS, Linux, or another Unix-like system

License

MIT License

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.


Release history

0.5.0 (Apr 27, 2026)
0.2.1 (Apr 27, 2026)
0.2.0 (Apr 27, 2026)
0.1.0 (Apr 27, 2026, this version)

