CJK Bigram Tokenizer for SQLite FTS5
sqlite-cjk-fts
CJK Bigram Tokenizer for SQLite FTS5 — enables full-text search on Chinese, Japanese, and Korean text.
Features
- Bigram tokenization: Generates overlapping bigrams for CJK characters, so any 2-character substring can be matched.
- Unicode-aware: Handles Chinese/Japanese/Korean Han characters, Hiragana, Katakana, and Hangul.
- ASCII/Latin support: Falls back to standard word boundary splitting for non-CJK text.
- Simple API: Drop-in connect() function that auto-loads the extension.
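To make the bigram idea concrete, here is a minimal pure-Python sketch of overlapping bigram generation with a word-level fallback for Latin text. It is not the extension's actual implementation: the function name `bigram_tokens`, the code point ranges, and the handling of single-character runs are all assumptions made for illustration.

```python
import re

# Assumed code point ranges; the real extension may cover more or fewer blocks.
_CJK = (
    "\u3040-\u309f"  # Hiragana
    "\u30a0-\u30ff"  # Katakana
    "\u4e00-\u9fff"  # CJK Unified Ideographs (Han)
    "\uac00-\ud7af"  # Hangul Syllables
)
# A run is either consecutive CJK characters or consecutive ASCII letters/digits.
_RUN = re.compile(rf"[{_CJK}]+|[A-Za-z0-9]+")
_IS_CJK = re.compile(rf"[{_CJK}]")


def bigram_tokens(text: str) -> list[str]:
    """Split text into overlapping CJK bigrams and whole Latin words (illustrative only)."""
    tokens: list[str] = []
    for run in _RUN.findall(text):
        if _IS_CJK.match(run):
            if len(run) == 1:
                # Assumption: a lone CJK character is emitted as its own token.
                tokens.append(run)
            else:
                # Every adjacent pair of characters becomes one token.
                tokens.extend(run[i:i + 2] for i in range(len(run) - 1))
        else:
            # Non-CJK text is kept as a whole word.
            tokens.append(run)
    return tokens


print(bigram_tokens("今天天氣非常好"))
# ['今天', '天天', '天氣', '氣非', '非常', '常好']
```

Because every adjacent character pair is indexed, a two-character query such as 天氣 corresponds to exactly one indexed token, which is why any 2-character substring can be matched.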
Installation
pip install sqlite-cjk-fts
macOS Note
Apple's pre-installed Python does not support load_extension(). On macOS, use Homebrew Python:
brew install python@3.12
/opt/homebrew/bin/python3 your_script.py
Or install in a virtual environment with a properly-built SQLite.
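A quick way to check whether a given interpreter can load extensions at all is to look for the extension-loading methods on `sqlite3.Connection`; CPython omits them when it is built without loadable-extension support. The helper name `can_load_extensions` below is hypothetical and is not part of sqlite-cjk-fts.

```python
import sqlite3


def can_load_extensions() -> bool:
    """Return True if this Python's sqlite3 module exposes extension loading."""
    # The method is absent when CPython was built without
    # --enable-loadable-sqlite-extensions.
    return hasattr(sqlite3.Connection, "enable_load_extension")


if __name__ == "__main__":
    print("extension loading available:", can_load_extensions())
```

If this prints False, switch to a different Python build (for example the Homebrew one mentioned above) before installing.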
Quick Start
from sqlite_cjk_fts import connect, create_table, insert, search
# Open database (auto-loads the CJK tokenizer)
db = connect(":memory:")
# Create FTS5 table with CJK bigram tokenizer
create_table(db, "docs", ["title", "body"])
# Insert data
insert(db, "docs", ("天氣預報", "今天天氣非常好,適合出門散步"))
insert(db, "docs", ("新聞頭條", "今日股市大漲,創下歷史新高"))
# Search
results = search(db, "docs", "天氣")
print(results)
# [('天氣預報', '今天天氣非常好,適合出門散步')]
results = search(db, "docs", "今天")
print(results)
# [('天氣預報', '今天天氣非常好,適合出門散步')]
results = search(db, "docs", "股市")
print(results)
# [('新聞頭條', '今日股市大漲,創下歷史新高')]
API
sqlite_cjk_fts.connect(database, **kwargs)
Open a SQLite database with the CJK tokenizer auto-loaded.
from sqlite_cjk_fts import connect
db = connect("myapp.db") # file-based
db = connect(":memory:") # in-memory
sqlite_cjk_fts.create_table(conn, name, columns, tokenizer="cjk_bigram")
Create an FTS5 virtual table.
from sqlite_cjk_fts import connect, create_table
db = connect(":memory:")
create_table(db, "articles", ["title", "content"])
sqlite_cjk_fts.insert(conn, table, values)
Insert a row into an FTS5 table.
from sqlite_cjk_fts import insert
insert(db, "docs", ("title here", "body text here"))
# or with named parameters:
insert(db, "docs", {"title": "title here", "body": "body text here"})
sqlite_cjk_fts.search(conn, table, query, columns=None, limit=0)
Search the FTS5 table.
results = search(db, "docs", "天氣") # all columns
results = search(db, "docs", "天氣", limit=10) # with limit
results = search(db, "docs", "天氣", columns=["title"]) # specific columns
sqlite_cjk_fts.Connection
The extended sqlite3.Connection class. Use via connect().
sqlite_cjk_fts.Tokenizer
The tokenizer name constant: "cjk_bigram".
Using with raw sqlite3
If you prefer the standard sqlite3 module, load the extension manually:
import sqlite3
db = sqlite3.connect(":memory:")
db.enable_load_extension(True)  # extension loading is off by default; SQLite has no PRAGMA for this
db.load_extension("path/to/libcjkfts.dylib") # or .so on Linux
db.execute("CREATE VIRTUAL TABLE docs USING fts5(content, tokenize='cjk_bigram')")
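When loading the extension by hand, it can be worth switching extension loading back off as soon as the tokenizer is registered. The sketch below uses only standard `sqlite3` calls; the helper name `load_tokenizer` is hypothetical, and the library path is a placeholder that must point at the real shared library on your system.

```python
import sqlite3


def load_tokenizer(db: sqlite3.Connection, path: str) -> None:
    """Load a SQLite extension from path, then disable extension loading again."""
    db.enable_load_extension(True)
    try:
        # Raises sqlite3.OperationalError if the library cannot be loaded.
        db.load_extension(path)
    finally:
        # Keep the window in which extensions can be loaded as short as possible.
        db.enable_load_extension(False)
```

Usage would look like `load_tokenizer(db, "path/to/libcjkfts.dylib")` followed by the `CREATE VIRTUAL TABLE` statement shown above.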
Japanese Support
db = connect(":memory:")
create_table(db, "docs", ["title", "body"])
insert(db, "docs", ("天気予報", "今日の東京の天気は晴れです"))
print(search(db, "docs", "天気"))
# [('天気予報', '今日の東京の天気は晴れです')]
Korean Support
db = connect(":memory:")
create_table(db, "docs", ["title", "body"])
insert(db, "docs", ("날씨", "오늘 서울의 날씨가 맑습니다"))
print(search(db, "docs", "날씨"))
# [('날씨', '오늘 서울의 날씨가 맑습니다')]
Mixed CJK + Latin
db = connect(":memory:")
create_table(db, "docs", ["title", "body"])
insert(db, "docs", ("tech article", "SQLite支援FTS5全文搜尋功能,非常powerful"))
print(search(db, "docs", "全文")) # CJK
# [('tech article', 'SQLite支援FTS5全文搜尋功能,非常powerful')]
print(search(db, "docs", "SQLite")) # ASCII
# [('tech article', 'SQLite支援FTS5全文搜尋功能,非常powerful')]
print(search(db, "docs", "powerful")) # trailing ASCII
# [('tech article', 'SQLite支援FTS5全文搜尋功能,非常powerful')]
License
MIT
File details
Details for the file sqlite_cjk_fts-0.2.0.tar.gz.
File metadata
- Download URL: sqlite_cjk_fts-0.2.0.tar.gz
- Upload date:
- Size: 11.3 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 92128242c98906893e5144e13feb0455f760073b33983198736d5a80aafb1f57 |
| MD5 | a23a6ef95cdf6d6db87274ac9917e43a |
| BLAKE2b-256 | 942f96b3676c1e089e969e3ef49362f5b76a5c84c87ff627ebcc08e2640fdf1e |
File details
Details for the file sqlite_cjk_fts-0.2.0-py3-none-any.whl.
File metadata
- Download URL: sqlite_cjk_fts-0.2.0-py3-none-any.whl
- Upload date:
- Size: 9.2 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 8f9f1ddad0a150e5e7d110f50df9d72e50757cde62c462575a312b80cb9eb2b3 |
| MD5 | 48862ad7ab87e25fe25dee14950f1c76 |
| BLAKE2b-256 | 369755dc4fa2f8f10944316113456d99601211e10b0ce05aaa4ab0d6cf8a23d8 |