CJK Bigram Tokenizer for SQLite FTS5

sqlite-cjk-fts

A CJK bigram tokenizer for SQLite FTS5 that enables full-text search over Chinese, Japanese, and Korean text.

Features

  • Bigram tokenization: Generates overlapping bigrams for runs of CJK characters, so any two-character substring can be matched.
  • Unicode-aware: Handles Chinese/Japanese/Korean Han characters, Hiragana, Katakana, and Hangul.
  • ASCII/Latin support: Falls back to standard word boundary splitting for non-CJK text.
  • Cross-platform: Supports macOS (.dylib), Linux (.so), and Windows (.dll).
  • Auto-build: If no pre-compiled extension is found, automatically builds from source.
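
To make the bigram idea concrete, here is a minimal pure-Python sketch of the tokenization described above. It is not the package's C implementation: the function name bigram_tokens, the exact Unicode ranges, and the handling of a lone CJK character are assumptions made for illustration.

```python
import re

# Assumed character ranges; the real tokenizer may cover more blocks.
_CJK = (
    "\u3040-\u309f"  # Hiragana
    "\u30a0-\u30ff"  # Katakana
    "\u3400-\u4dbf"  # CJK Unified Ideographs Extension A
    "\u4e00-\u9fff"  # CJK Unified Ideographs
    "\uac00-\ud7a3"  # Hangul syllables
)
# Group 1 captures a run of CJK characters, group 2 an ASCII word.
_RUN = re.compile(f"([{_CJK}]+)|([A-Za-z0-9]+)")


def bigram_tokens(text):
    """Return overlapping bigrams for CJK runs and whole words for ASCII runs."""
    tokens = []
    for cjk_run, word in _RUN.findall(text):
        if word:
            tokens.append(word)
        elif len(cjk_run) == 1:
            # Assumption: a single isolated CJK character is kept as-is.
            tokens.append(cjk_run)
        else:
            tokens.extend(cjk_run[i:i + 2] for i in range(len(cjk_run) - 1))
    return tokens


print(bigram_tokens("天氣預報"))
print(bigram_tokens("SQLite 全文検索"))
```

Because every adjacent pair of characters becomes a token, a two-character query such as 天氣 corresponds to exactly one indexed token, which is why it can be found anywhere inside a longer run.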

Installation

pip install sqlite-cjk-fts

macOS Note

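Apple's pre-installed Python does not support load_extension(), so the tokenizer cannot be loaded with it. On macOS, use Homebrew Python instead. The /opt/homebrew prefix shown below applies to Apple Silicon; on Intel Macs, Homebrew installs under /usr/local.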

brew install python@3.13
/opt/homebrew/bin/python3 your_script.py

Or create a virtual environment with Homebrew Python:

/opt/homebrew/bin/python3 -m venv .venv
source .venv/bin/activate
pip install sqlite-cjk-fts
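
To find out whether a given interpreter is affected, a script can check for the extension-loading methods before importing the package. The sketch below uses only the standard library; supports_load_extension is a hypothetical helper name, not part of sqlite-cjk-fts.

```python
import sqlite3


def supports_load_extension() -> bool:
    """Return True if this Python's sqlite3 module can load SQLite extensions.

    When Python is built without loadable-extension support, the
    enable_load_extension and load_extension methods are absent from
    sqlite3.Connection.
    """
    return (
        hasattr(sqlite3.Connection, "enable_load_extension")
        and hasattr(sqlite3.Connection, "load_extension")
    )


if supports_load_extension():
    print("This interpreter can load SQLite extensions.")
else:
    print("No extension support; try a different Python build.")
```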

Linux Requirements

On Linux, ensure you have build tools installed:

# Ubuntu/Debian
apt install build-essential libsqlite3-dev

# Fedora/RHEL
dnf install gcc sqlite-devel

Then install:

pip install sqlite-cjk-fts

The package will automatically compile the C extension from source.

Windows Requirements

On Windows, install Visual Studio Build Tools or MinGW-w64, then:

pip install sqlite-cjk-fts

Quick Start

from sqlite_cjk_fts import connect, create_table, insert, search

# Open database (auto-loads or auto-builds the CJK tokenizer)
db = connect(":memory:")

# Create FTS5 table with CJK bigram tokenizer
create_table(db, "docs", ["title", "body"])

# Insert data
insert(db, "docs", ("天氣預報", "今天天氣非常好,適合出門散步"))
insert(db, "docs", ("新聞頭條", "今日股市大漲,創下歷史新高"))

# Search
results = search(db, "docs", "天氣")
print(results)
# [('天氣預報', '今天天氣非常好,適合出門散步')]

results = search(db, "docs", "今天")
print(results)
# [('天氣預報', '今天天氣非常好,適合出門散步')]

results = search(db, "docs", "股市")
print(results)
# [('新聞頭條', '今日股市大漲,創下歷史新高')]

API

sqlite_cjk_fts.connect(database, **kwargs)

Open a SQLite database with the CJK tokenizer auto-loaded.

from sqlite_cjk_fts import connect

db = connect("myapp.db")           # file-based
db = connect(":memory:")           # in-memory
db = connect(":memory:", ext_path="/custom/path/to/libcjkfts.so")  # custom extension

sqlite_cjk_fts.create_table(conn, name, columns, tokenizer="cjk_bigram")

Create an FTS5 virtual table.

from sqlite_cjk_fts import connect, create_table

db = connect(":memory:")
create_table(db, "articles", ["title", "content"])

sqlite_cjk_fts.insert(conn, table, values)

Insert a row into an FTS5 table.

from sqlite_cjk_fts import insert

insert(db, "docs", ("title here", "body text here"))
# or with named parameters:
insert(db, "docs", {"title": "title here", "body": "body text here"})
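
The package's own SQL is not shown here, but the two calling styles map naturally onto sqlite3's positional and named placeholders. The following is a sketch under that assumption; insert_row is a hypothetical stand-in, and it is exercised against a plain table so it runs without the extension.

```python
import sqlite3


def insert_row(conn, table, values):
    """Insert one row from a tuple/list (positional) or a dict (named)."""
    if isinstance(values, dict):
        # Named style: column names come from the dict keys.
        # Assumes keys are simple identifiers usable as :name placeholders.
        cols = ", ".join(f'"{c}"' for c in values)
        marks = ", ".join(f":{c}" for c in values)
        sql = f'INSERT INTO "{table}" ({cols}) VALUES ({marks})'
    else:
        # Positional style: one "?" per value, in column order.
        marks = ", ".join("?" for _ in values)
        sql = f'INSERT INTO "{table}" VALUES ({marks})'
    conn.execute(sql, values)


demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE docs (title, body)")  # plain table for the demo
insert_row(demo, "docs", ("title here", "body text here"))
insert_row(demo, "docs", {"title": "title here", "body": "body text here"})
print(demo.execute("SELECT COUNT(*) FROM docs").fetchone()[0])
```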

sqlite_cjk_fts.search(conn, table, query, columns=None, limit=0)

Search the FTS5 table.

from sqlite_cjk_fts import search

results = search(db, "docs", "天氣")           # all columns
results = search(db, "docs", "天氣", limit=10) # with limit
results = search(db, "docs", "天氣", columns=["title"])  # specific columns
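
For readers who want to see what such a call might look like in SQL, here is a rough sketch of an equivalent FTS5 MATCH query. It rests on two assumptions that the text above does not confirm: that columns selects which columns are returned, and that limit=0 means no limit. build_search_sql is a hypothetical helper, not the package's implementation.

```python
def build_search_sql(table, query, columns=None, limit=0):
    """Build an FTS5 MATCH statement and its parameters.

    Returns (sql, params) suitable for conn.execute(sql, params).
    """
    # Assumption: `columns` picks the returned columns; None means all.
    cols = ", ".join(f'"{c}"' for c in columns) if columns else "*"
    # In FTS5, matching against the table name searches every column.
    sql = f'SELECT {cols} FROM "{table}" WHERE "{table}" MATCH ?'
    params = [query]
    # Assumption: limit=0 means "no limit".
    if limit > 0:
        sql += " LIMIT ?"
        params.append(limit)
    return sql, params


print(build_search_sql("docs", "天氣"))
print(build_search_sql("docs", "天氣", columns=["title"], limit=10))
```

Passing the query and limit as bound parameters, rather than formatting them into the string, keeps user input out of the SQL text.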

sqlite_cjk_fts.build_extension(compiler=None, output_dir=None, verbose=False)

Manually build the C extension from source.

from sqlite_cjk_fts import build_extension

# Build with verbose output
ext_path = build_extension(verbose=True)
print(f"Built: {ext_path}")

sqlite_cjk_fts.get_ext_path(ext_path=None)

Get the path to the extension file.

from sqlite_cjk_fts import get_ext_path

path = get_ext_path()  # auto-detect
path = get_ext_path("/custom/path/libcjkfts.so")  # explicit path

sqlite_cjk_fts.Connection

The extended sqlite3.Connection class. Use via connect().

sqlite_cjk_fts.Tokenizer

The tokenizer name constant: "cjk_bigram".

Using with raw sqlite3

If you prefer the standard sqlite3 module, load the extension manually:

import sqlite3
from sqlite_cjk_fts import get_ext_path

db = sqlite3.connect(":memory:")
db.enable_load_extension(True)  # SQLite has no such PRAGMA; use the connection method
db.load_extension(get_ext_path())  # or explicit path
db.execute("CREATE VIRTUAL TABLE docs USING fts5(content, tokenize='cjk_bigram')")
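
When loading manually, it is common to enable extension loading only for the duration of the load and to switch it off again afterwards. The helper below sketches that pattern with the standard sqlite3 methods; load_tokenizer is a hypothetical name, and the path would normally come from get_ext_path().

```python
import sqlite3


def load_tokenizer(conn: sqlite3.Connection, ext_path: str) -> None:
    """Load a SQLite extension, then disable extension loading again."""
    conn.enable_load_extension(True)
    try:
        # Raises sqlite3.OperationalError if the file cannot be loaded.
        conn.load_extension(ext_path)
    finally:
        # Re-disable so later SQL cannot load arbitrary libraries.
        conn.enable_load_extension(False)


# Usage (path is a placeholder, so a failure is expected here):
db = sqlite3.connect(":memory:")
try:
    load_tokenizer(db, "/custom/path/to/libcjkfts.so")
except (sqlite3.OperationalError, AttributeError) as exc:
    print(f"Could not load extension: {exc}")
```

The AttributeError branch covers Python builds whose sqlite3 module lacks extension support altogether.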

Platform-Specific Notes

Platform  Extension  Build Command
macOS     .dylib     gcc -fPIC -shared cjk_tokenizer.c -o libcjkfts.dylib
Linux     .so        gcc -fPIC -shared cjk_tokenizer.c -o libcjkfts.so
Windows   .dll       cl /LD cjk_tokenizer.c /Fe:libcjkfts.dll
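
The commands in the table can be selected programmatically from sys.platform. This is only a sketch of that mapping, not the logic build_extension() actually uses; build_command and its parameters are hypothetical names.

```python
import sys


def build_command(platform=None, source="cjk_tokenizer.c", stem="libcjkfts"):
    """Return the compiler argument list from the table for a platform.

    `platform` uses sys.platform values: "darwin", "linux", "win32".
    """
    platform = sys.platform if platform is None else platform
    if platform == "win32":
        # MSVC: /LD builds a DLL, /Fe: names the output file.
        return ["cl", "/LD", source, f"/Fe:{stem}.dll"]
    suffix = ".dylib" if platform == "darwin" else ".so"
    return ["gcc", "-fPIC", "-shared", source, "-o", stem + suffix]


print(" ".join(build_command()))
```

Returning a list rather than a single string lets the result be passed directly to subprocess.run without shell quoting.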

License

MIT
