
A Python binding for tokenizers of SQLite Full Text Search

sqlitefts-python

sqlitefts-python provides bindings for the tokenizer interfaces of SQLite Full-Text Search (FTS3/4 and FTS5). It allows you to write tokenizers in Python.

SQLite provides full-text search through FTS3/FTS4 and FTS5, each with its own set of predefined tokenizers. It is easy to use and has enough functionality for many applications. Python has a built-in SQLite module, so it is easy to use and deploy. You don’t need anything else for full-text search.

However, the predefined tokenizers are not sufficient for some languages, including Japanese, and writing your own tokenizer is not easy. This module lets you write tokenizers in Python using CFFI, so you don’t need a C compiler to write your tokenizer.

It also has ranking functions based on peewee, a utility function to add FTS5 auxiliary functions, and an FTS5 auxiliary function implementation.

NOTE: All connections using this module should be explicitly closed. Due to GC behavior, the program may crash if a connection is left open when it terminates.
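
One way to guarantee that a connection is closed is to wrap it in contextlib.closing. The following is a minimal sketch that uses only the standard-library sqlite3 module; count_rows and the notes table are hypothetical names used for illustration, and the tokenizer registration step is left as a comment.

```python
import sqlite3
from contextlib import closing


def count_rows(db_path=":memory:"):
    """Open a connection, do some work, and always close it afterwards."""
    # closing() calls conn.close() on exit, even if an exception is raised.
    # Note that `with conn:` alone only manages transactions; it does not close.
    with closing(sqlite3.connect(db_path)) as conn:
        # A tokenizer would be registered on `conn` here, before it is used.
        conn.execute("CREATE TABLE notes (body TEXT)")
        conn.execute("INSERT INTO notes VALUES ('hello')")
        conn.commit()
        (n,) = conn.execute("SELECT COUNT(*) FROM notes").fetchone()
    return n
```

Calling conn.close() in a try/finally block would work equally well; the point is that the connection is closed before the interpreter shuts down.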

Sample tokenizer

There are differences between FTS3/4 and FTS5, so two different base classes are defined.

  • A tokenizer for FTS3/4 can be used with FTS5 by using FTS3TokenizerAdaptor.

  • A tokenizer for FTS5 can be used with FTS3/4 if ‘flags’ is not used.

FTS3/4:

import re
import sqlite3

import sqlitefts as fts

class SimpleTokenizer(fts.Tokenizer):
    _p = re.compile(r'\w+', re.UNICODE)

    def tokenize(self, text):
        for m in self._p.finditer(text):
            s, e = m.span()
            t = text[s:e]
            # offsets are byte positions in the UTF-8 encoded text
            l = len(t.encode('utf-8'))
            p = len(text[:s].encode('utf-8'))
            yield t, p, p + l

conn = sqlite3.connect(':memory:')
tk = fts.make_tokenizer_module(SimpleTokenizer())
fts.register_tokenizer(conn, 'simple_tokenizer', tk)

FTS5:

import re
import sqlite3

from sqlitefts import fts5

class SimpleTokenizer(fts5.FTS5Tokenizer):
    _p = re.compile(r'\w+', re.UNICODE)

    def tokenize(self, text, flags=None):
        for m in self._p.finditer(text):
            s, e = m.span()
            t = text[s:e]
            # offsets are byte positions in the UTF-8 encoded text
            l = len(t.encode('utf-8'))
            p = len(text[:s].encode('utf-8'))
            yield t, p, p + l

conn = sqlite3.connect(':memory:')
tk = fts5.make_fts5_tokenizer(SimpleTokenizer())
fts5.register_tokenizer(conn, 'simple_tokenizer', tk)
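
The sample tokenizers yield (token, start, end) tuples whose offsets count UTF-8 bytes, not characters. The sketch below reproduces just that offset arithmetic as a standalone function so it can be tried without sqlitefts installed; byte_spans is a hypothetical name and is not part of the library.

```python
import re

_WORD = re.compile(r"\w+", re.UNICODE)


def byte_spans(text):
    """Yield (token, start, end) with offsets counted in UTF-8 bytes."""
    for m in _WORD.finditer(text):
        start_char = m.start()
        token = m.group()
        # Encode the prefix to convert the character offset to a byte offset.
        start = len(text[:start_char].encode("utf-8"))
        end = start + len(token.encode("utf-8"))
        yield token, start, end


spans = list(byte_spans("héllo wörld"))
print(spans)  # [('héllo', 0, 6), ('wörld', 7, 13)]
```

The character spans for the same text would be (0, 5) and (6, 11); the byte spans are longer because é and ö each take two bytes in UTF-8.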

Requirements

  • Python 2.7, Python 3.9+, PyPy2.7, and PyPy3.10+ (older versions may work, but are not tested)

    • sqlite3 has to be dynamically linked. See GH-37.

  • CFFI

  • SQLite3 or APSW with FTS3/4 and/or FTS5 enabled (the SQLite3 shared library bundled with the OS or Python may not work; building sqlite3 from source or using a pre-compiled binary may be required)

    • SQLite 3.11.x has to be compiled with -DSQLITE_ENABLE_FTS3_TOKENIZER to enable the 2-argument fts3_tokenizer

    • SQLite versions older or newer than 3.11.x have no extra requirements
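
To see whether the SQLite library that Python is linked against meets these requirements, its version and compile options can be inspected. This is a rough sketch using only the standard library; fts_support is a hypothetical helper, and it assumes the build exposes PRAGMA compile_options (a build with compile-option diagnostics omitted would report every flag as False).

```python
import sqlite3


def fts_support():
    """Report the SQLite library version and the FTS options it was built with."""
    conn = sqlite3.connect(":memory:")
    try:
        # PRAGMA compile_options lists build flags without the SQLITE_ prefix.
        options = {row[0] for row in conn.execute("PRAGMA compile_options")}
    finally:
        conn.close()
    return {
        "version": sqlite3.sqlite_version,
        "fts3": "ENABLE_FTS3" in options or "ENABLE_FTS4" in options,
        "fts5": "ENABLE_FTS5" in options,
        "fts3_tokenizer": "ENABLE_FTS3_TOKENIZER" in options,
        # Only 3.11.x needs the extra compile flag mentioned above.
        "is_3_11": sqlite3.sqlite_version_info[:2] == (3, 11),
    }


info = fts_support()
print(info)
```

This only describes the library used by the built-in sqlite3 module; an APSW build may link a different SQLite.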

Note for APSW users:

  • FTS3 should work the same as with the built-in sqlite3 module; sqlite3 (_sqlite3) is used to access SQLite internals.

  • sqlitefts.fts5 does not support the APSW amalgamation build. See GH-14.

Licence

This software is released under the MIT License, see LICENSE.

Thanks
