A Python binding for tokenizers of SQLite Full Text Search

sqlitefts-python

sqlitefts-python provides bindings for the tokenizers of SQLite Full-Text Search (FTS3/4 and FTS5). It allows you to write tokenizers in Python.

SQLite offers full-text search through FTS3/FTS4 and FTS5, each with its own set of predefined tokenizers. It is easy to use and has enough functionality. Python has a built-in SQLite module, so it is easy to use and deploy. You don’t need anything else for full-text search.

However, the predefined tokenizers are not sufficient for some languages, including Japanese, and writing your own tokenizer is not easy. This module lets you write tokenizers in Python using CFFI, so you don’t need a C compiler to write your tokenizer.

It also provides ranking functions based on peewee, a utility function for adding FTS5 auxiliary functions, and an implementation of an FTS5 auxiliary function.

NOTE: All connections that use this module should be closed explicitly. Due to GC behavior, the program may crash if a connection is left open when it terminates.
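One way to follow the note above is to wrap the connection in contextlib.closing, so that it is closed even when an exception is raised. This is a minimal sketch using only the standard sqlite3 module; tokenizer registration is omitted, and the helper name run_query is hypothetical, not part of this package.

```python
import sqlite3
from contextlib import closing


def run_query(path, sql):
    """Open a connection, run one statement, and always close the connection."""
    # closing() calls conn.close() on exit. sqlite3's own context manager
    # only commits or rolls back; it leaves the connection open.
    with closing(sqlite3.connect(path)) as conn:
        return conn.execute(sql).fetchall()


rows = run_query(':memory:', 'SELECT 1')
```

In a real program, the tokenizer would be registered inside the with block, before any FTS table is queried.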

Sample tokenizer

FTS3/4 and FTS5 differ, so two different base classes are defined.

  • A tokenizer for FTS3/4 can be used with FTS5 by wrapping it in FTS3TokenizerAdaptor.

  • A tokenizer for FTS5 can be used with FTS3/4 if ‘flags’ is not used.

FTS3/4:

import re

import sqlitefts as fts

class SimpleTokenizer(fts.Tokenizer):
    _p = re.compile(r'\w+', re.UNICODE)

    def tokenize(self, text):
        # Yield (token, start, end); offsets are UTF-8 byte positions.
        for m in self._p.finditer(text):
            s, e = m.span()
            t = text[s:e]
            l = len(t.encode('utf-8'))
            p = len(text[:s].encode('utf-8'))
            yield t, p, p + l

# conn is an already opened SQLite connection
tk = fts.make_tokenizer_module(SimpleTokenizer())
fts.register_tokenizer(conn, 'simple_tokenizer', tk)
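The tokenize method above reports offsets in UTF-8 bytes rather than in characters, which matters as soon as the text contains non-ASCII characters. The following sketch repeats the same offset arithmetic as a standalone function so it can be tried without installing anything; the name byte_spans is hypothetical and not part of sqlitefts.

```python
import re

_WORD = re.compile(r'\w+', re.UNICODE)


def byte_spans(text):
    """Yield (token, start, end) with offsets counted in UTF-8 bytes."""
    for m in _WORD.finditer(text):
        s, e = m.span()
        token = text[s:e]
        # Byte offset of the token = encoded length of everything before it.
        start = len(text[:s].encode('utf-8'))
        yield token, start, start + len(token.encode('utf-8'))


# 'é' takes two bytes in UTF-8, so every offset after it shifts by one.
spans = list(byte_spans(u'caf\u00e9 au lait'))
# spans == [('café', 0, 5), ('au', 6, 8), ('lait', 9, 13)]
```

Character indices would give (5, 7) for 'au'; the byte offsets are (6, 8).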

FTS5:

import re

from sqlitefts import fts5

class SimpleTokenizer(fts5.FTS5Tokenizer):
    _p = re.compile(r'\w+', re.UNICODE)

    def tokenize(self, text, flags=None):
        # Yield (token, start, end); offsets are UTF-8 byte positions.
        for m in self._p.finditer(text):
            s, e = m.span()
            t = text[s:e]
            l = len(t.encode('utf-8'))
            p = len(text[:s].encode('utf-8'))
            yield t, p, p + l

# conn is an already opened SQLite connection
tk = fts5.make_fts5_tokenizer(SimpleTokenizer())
fts5.register_tokenizer(conn, 'simple_tokenizer', tk)

Requirements

  • Python 2.7, Python 3.3+, and PyPy2.7, PyPy3.2+

  • CFFI

  • SQLite3 or APSW with FTS3/4 and/or FTS5 enabled (on Windows, you may need to download and replace sqlite3.dll)

    • SQLite 3.11.x has to be compiled with -DSQLITE_ENABLE_FTS3_TOKENIZER to enable the 2-arg fts3_tokenizer.

    • SQLite 3.10.2 and older versions have no extra requirements. The 2-arg fts3_tokenizer is always available.

    • SQLite 3.12.0 and later versions have no extra requirements. The 2-arg fts3_tokenizer can be enabled dynamically.
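To see which of the cases above applies to a given environment, the version of the SQLite library linked into Python's sqlite3 module can be inspected. The sketch below only classifies a version tuple according to the list above; the function name fts3_tokenizer_requirement is hypothetical, and an APSW installation may be linked against a different SQLite library than the one reported here.

```python
import sqlite3


def fts3_tokenizer_requirement(version=sqlite3.sqlite_version_info):
    """Map an SQLite version tuple to the requirement listed above."""
    version = tuple(version[:3])
    if version <= (3, 10, 2):
        return 'always available'
    if version < (3, 12, 0):
        return 'needs -DSQLITE_ENABLE_FTS3_TOKENIZER at compile time'
    return 'can be enabled dynamically'


# Report the case for the SQLite library used by the sqlite3 module.
print(sqlite3.sqlite_version, '->', fts3_tokenizer_requirement())
```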

Note for APSW users: An APSW amalgamation build does not expose the SQLite APIs used by this module, so libsqlite3.so/sqlite3.dll is also required, even though such a build has no runtime library dependency on SQLite. An APSW local build already depends on the shared library. Details: sqlite3_db_config can be invoked via Connection.config, but it rejects SQLITE_DBCONFIG_ENABLE_FTS3_TOKENIZER, which is needed to register a new tokenizer. Tested with APSW 3.21.0-r1.

Licence

This software is released under the MIT License, see LICENSE.

Thanks
