CJK Bigram Tokenizer for SQLite FTS5

sqlite-cjk-fts

A CJK bigram tokenizer for SQLite FTS5 that enables full-text search on Chinese, Japanese, and Korean text.

Features

  • Bigram tokenization: Generates overlapping bigrams for CJK characters, so any two-character substring of CJK text can be matched.
  • Unicode-aware: Handles Han characters (Chinese, Japanese, Korean), Hiragana, Katakana, and Hangul.
  • ASCII/Latin support: Falls back to standard word boundary splitting for non-CJK text.
  • Simple API: Drop-in connect() function that auto-loads the extension.
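To make the bigram idea concrete, here is a minimal pure-Python sketch of overlapping bigram generation. It is an illustration only, not the extension's actual implementation: the function name cjk_bigrams is hypothetical, and the handling of a lone character is an assumption.

```python
def cjk_bigrams(text: str) -> list[str]:
    """Return overlapping two-character tokens for a run of CJK characters."""
    if len(text) < 2:
        # Assumption: a lone character is emitted as-is; the real tokenizer may differ.
        return [text] if text else []
    # Slide a window of width 2 across the run, one character at a time.
    return [text[i:i + 2] for i in range(len(text) - 1)]


tokens = cjk_bigrams("今天天氣非常好")
print(tokens)
# ['今天', '天天', '天氣', '氣非', '非常', '常好']
print("天氣" in tokens)
# True
```

Because every adjacent pair of characters becomes a token, a two-character query such as 天氣 lines up with an indexed token no matter where it occurs in the text.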

Installation

pip install sqlite-cjk-fts

macOS Note

Apple's pre-installed Python is built without SQLite extension loading, so load_extension() is unavailable. On macOS, use Homebrew Python instead:

brew install python@3.12
# Apple Silicon path; on Intel Macs Homebrew installs under /usr/local
/opt/homebrew/bin/python3 your_script.py

Alternatively, use a virtual environment created from a Python whose sqlite3 module was built with extension loading enabled.
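A quick way to tell whether a given interpreter is affected is to check for the extension-loading methods, which CPython only defines when it was compiled with loadable-extension support. The helper name supports_load_extension below is hypothetical; the attributes it checks are part of the standard sqlite3 module.

```python
import sqlite3


def supports_load_extension() -> bool:
    """Return True if this Python's sqlite3 module can load extensions."""
    # Both methods are missing from sqlite3.Connection when CPython was
    # built without loadable-extension support.
    return (
        hasattr(sqlite3.Connection, "enable_load_extension")
        and hasattr(sqlite3.Connection, "load_extension")
    )


print("SQLite library version:", sqlite3.sqlite_version)
print("Extension loading available:", supports_load_extension())
```

If the second line prints False, switch to a different Python build before installing the package.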

Quick Start

from sqlite_cjk_fts import connect, create_table, insert, search

# Open database (auto-loads the CJK tokenizer)
db = connect(":memory:")

# Create FTS5 table with CJK bigram tokenizer
create_table(db, "docs", ["title", "body"])

# Insert data
insert(db, "docs", ("天氣預報", "今天天氣非常好,適合出門散步"))
insert(db, "docs", ("新聞頭條", "今日股市大漲,創下歷史新高"))

# Search
results = search(db, "docs", "天氣")
print(results)
# [('天氣預報', '今天天氣非常好,適合出門散步')]

results = search(db, "docs", "今天")
print(results)
# [('天氣預報', '今天天氣非常好,適合出門散步')]

results = search(db, "docs", "股市")
print(results)
# [('新聞頭條', '今日股市大漲,創下歷史新高')]

API

sqlite_cjk_fts.connect(database, **kwargs)

Open a SQLite database with the CJK tokenizer auto-loaded.

from sqlite_cjk_fts import connect

db = connect("myapp.db")           # file-based
db = connect(":memory:")           # in-memory

sqlite_cjk_fts.create_table(conn, name, columns, tokenizer="cjk_bigram")

Create an FTS5 virtual table.

from sqlite_cjk_fts import connect, create_table

db = connect(":memory:")
create_table(db, "articles", ["title", "content"])

sqlite_cjk_fts.insert(conn, table, values)

Insert a row into an FTS5 table.

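from sqlite_cjk_fts import connect, create_table, insert

db = connect(":memory:")
create_table(db, "docs", ["title", "body"])

# Positional values, in column order
insert(db, "docs", ("title here", "body text here"))
# or with named parameters:
insert(db, "docs", {"title": "title here", "body": "body text here"})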

sqlite_cjk_fts.search(conn, table, query, columns=None, limit=0)

Search the FTS5 table.

results = search(db, "docs", "天氣")           # all columns
results = search(db, "docs", "天氣", limit=10) # with limit
results = search(db, "docs", "天氣", columns=["title"])  # specific columns

sqlite_cjk_fts.Connection

The extended sqlite3.Connection class. Use via connect().

sqlite_cjk_fts.Tokenizer

The tokenizer name constant: "cjk_bigram".

Using with raw sqlite3

If you prefer the standard sqlite3 module, load the extension manually:

import sqlite3

db = sqlite3.connect(":memory:")
db.enable_load_extension(True)  # extension loading is off by default
db.load_extension("path/to/libcjkfts.dylib")  # or .so on Linux
db.execute("CREATE VIRTUAL TABLE docs USING fts5(content, tokenize='cjk_bigram')")
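Once the table exists, inserting and querying is ordinary FTS5 SQL. The sketch below shows the round trip with the standard sqlite3 module. So that it runs without the extension, it defaults to SQLite's built-in unicode61 tokenizer and Latin text; passing "cjk_bigram" instead assumes the extension has already been loaded on the connection. The function name fts5_demo and the sample rows are made up for illustration.

```python
import sqlite3


def fts5_demo(tokenizer: str = "unicode61") -> list[tuple[str]]:
    """Create an FTS5 table, insert two rows, and run one MATCH query."""
    db = sqlite3.connect(":memory:")
    # The tokenizer name is interpolated into DDL, so pass trusted values only.
    db.execute(
        f"CREATE VIRTUAL TABLE docs USING fts5(content, tokenize='{tokenizer}')"
    )
    db.executemany(
        "INSERT INTO docs(content) VALUES (?)",
        [("SQLite full-text search",), ("unrelated row",)],
    )
    # MATCH against the table name searches all columns; rank orders by relevance.
    rows = db.execute(
        "SELECT content FROM docs WHERE docs MATCH ? ORDER BY rank",
        ("search",),
    ).fetchall()
    db.close()
    return rows


print(fts5_demo())
# [('SQLite full-text search',)]
```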

Japanese Support

db = connect(":memory:")
create_table(db, "docs", ["title", "body"])
insert(db, "docs", ("天気予報", "今日の東京の天気は晴れです"))

print(search(db, "docs", "天気"))
# [('天気予報', '今日の東京の天気は晴れです')]

Korean Support

db = connect(":memory:")
create_table(db, "docs", ["title", "body"])
insert(db, "docs", ("날씨", "오늘 서울의 날씨가 맑습니다"))

print(search(db, "docs", "날씨"))
# [('날씨', '오늘 서울의 날씨가 맑습니다')]

Mixed CJK + Latin

db = connect(":memory:")
create_table(db, "docs", ["title", "body"])
insert(db, "docs", ("tech article", "SQLite支援FTS5全文搜尋功能,非常powerful"))

print(search(db, "docs", "全文"))    # CJK
# [('tech article', 'SQLite支援FTS5全文搜尋功能,非常powerful')]

print(search(db, "docs", "SQLite"))  # ASCII
# [('tech article', 'SQLite支援FTS5全文搜尋功能,非常powerful')]

print(search(db, "docs", "powerful"))  # trailing ASCII
# [('tech article', 'SQLite支援FTS5全文搜尋功能,非常powerful')]
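The mixed-text behaviour above can be pictured as splitting the input into CJK runs and Latin runs, then treating each kind differently. The following is a rough pure-Python sketch under stated assumptions, not the extension's real logic: the code point ranges only approximate Han, Hiragana, Katakana, and Hangul; lowercasing Latin words is assumed; and the names CJK, RUNS, and tokenize are hypothetical.

```python
import re

# Assumption: these ranges approximate the scripts listed under Features
# (Han, Hiragana, Katakana, Hangul syllables); the real tokenizer may cover more.
CJK = "\u4e00-\u9fff\u3040-\u309f\u30a0-\u30ff\uac00-\ud7a3"
RUNS = re.compile(f"[{CJK}]+|[A-Za-z0-9]+")
CJK_CHAR = re.compile(f"[{CJK}]")


def tokenize(text: str) -> list[str]:
    """Split text into CJK bigrams and lowercased Latin words."""
    tokens = []
    for run in RUNS.findall(text):
        if CJK_CHAR.match(run):
            if len(run) == 1:
                # Assumption: a lone CJK character becomes its own token.
                tokens.append(run)
            else:
                tokens.extend(run[i:i + 2] for i in range(len(run) - 1))
        else:
            # Assumption: Latin words are case-folded.
            tokens.append(run.lower())
    return tokens


print(tokenize("SQLite支援FTS5全文"))
# ['sqlite', '支援', 'fts5', '全文']
```

Punctuation falls outside both patterns, so it acts as a separator and never ends up inside a token.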

License

MIT
