Project description

Lucytok

Lucene's boring English tokenizers recreated for Python. Compatible with SearchArray.

Lets you configure a handful of common tokenization rules: ASCII folding, possessive removal, both versions of Porter stemming, English stopword handling, and more.

Usage

Creating a tokenizer close to Elasticsearch's default english analyzer

from lucytok import english
es_english = english("Nsp->NNN->l->sNNN->1")
tokenized = es_english("The quick brown fox jumps over the lazy døg")
print(tokenized)

Outputs

['_', 'quick', 'brown', 'fox', 'jump', 'over', '_', 'lazi', 'døg']
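The '_' entries are blanked stopwords: lucytok replaces stopwords with a placeholder rather than dropping them, which keeps the remaining tokens at their original positions (useful for position-sensitive search, such as phrase matching in SearchArray). A plain-Python illustration of the idea (not lucytok's internals, and the stopword set below is a toy stand-in):

```python
# Blanking stopwords instead of removing them preserves token positions.
STOPWORDS = {"the", "a", "an"}  # toy list, not lucytok's actual stopword set

tokens = ["the", "quick", "brown", "fox"]
blanked = [t if t not in STOPWORDS else "_" for t in tokens]

print(blanked)                 # ['_', 'quick', 'brown', 'fox']
print(blanked.index("quick"))  # 1 -- same position as in the original list
```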

Make a tokenizer with ASCII folding...

from lucytok import english
es_english_folded = english("asp->NNN->l->sNNN->1")
print(es_english_folded("The quick brown fox jumps over the lazy døg"))
['_', 'quick', 'brown', 'fox', 'jump', 'over', '_', 'lazi', 'dog']
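ASCII folding maps accented and other non-ASCII Latin characters to ASCII equivalents (ø -> o above). A rough sketch of the technique, using NFKD decomposition plus a small extra table for characters that have no Unicode decomposition (this is not lucytok's actual implementation):

```python
import unicodedata

# Characters like 'ø' and 'ß' have no NFKD decomposition, so fold them
# with an explicit table. This table is illustrative, not exhaustive.
EXTRA = str.maketrans({"ø": "o", "Ø": "O", "æ": "ae", "Æ": "AE", "ß": "ss"})

def ascii_fold(text: str) -> str:
    # Decompose accented characters, drop the combining marks, then
    # apply the extra table for non-decomposable characters.
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.translate(EXTRA)

print(ascii_fold("døg"))   # dog
print(ascii_fold("café"))  # cafe
```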

Split compounds and convert British to American spelling...

from lucytok import english
es_british = english("asp->NNN->l->scbN->1")
print(es_british("The watercolour fox jumps over the lazy døg"))
['_', 'water', 'color', 'fox', 'jump', 'over', '_', 'lazi', 'dog']
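Compound splitting and British-to-American conversion are dictionary-driven (watercolour -> water + colour -> water + color above). A toy greedy splitter over an assumed word list, with a tiny respelling table (lucytok's real dictionaries and algorithm may differ):

```python
# VOCAB and BRITISH are illustrative stand-ins, not lucytok's dictionaries.
VOCAB = {"water", "colour", "fox", "lazy"}
BRITISH = {"colour": "color"}

def split_compound(token, vocab=VOCAB):
    # Try each split point; keep the first where both halves are known words.
    for i in range(1, len(token)):
        head, tail = token[:i], token[i:]
        if head in vocab and tail in vocab:
            return [head, tail]
    return [token]

def americanize(tokens):
    # Replace British spellings with American ones where a mapping exists.
    return [BRITISH.get(t, t) for t in tokens]

print(americanize(split_compound("watercolour")))  # ['water', 'color']
```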

Spec

Create a tokenizer using the following settings (these concepts correspond to their Elasticsearch counterparts):


#  |- ASCII fold (a) or not (N)
#  ||- Standard (s) or WS tokenizer (w)
#  ||- Remove possessive suffixes (p) or not (N)
#  |||
# "NsN->NNN->N->NNNN->N"
#       |||  |  ||||  |
#       |||  |  ||||  |- Porter stemmer version 1 (1) or version 2 (2), or N/0 for none
#       |||  |  ||||- Manually convert irregular plurals (p) or not (N)
#       |||  |  |||- Convert British to American spelling (b) or not (N)
#       |||  |  ||- Split compounds (c) or not (N)
#       |||  |  |- Blank out stopwords (s) or not (N)
#       |||  |- Lowercase (l) or not (N)
#       |||- Split on letter/number transitions (n) or not (N)
#       ||- Split on case changes (c) or not (N)
#       |- Split on punctuation (p) or not (N)


# "NsN->NNN->N->NNNN->N"
#  ---
#  (tokenization)

# "NsN->NNN->N->NNNN->N"
#       ---
#       (word-splitting rules, like Lucene's WordDelimiterFilter)

# "NsN->NNN->N->NNNN->N"
#            -
#            (lowercasing or not)

# "NsN->NNN->N->NNNN->N"
#               ----
#               (dictionary-based filters: stopwords -> compounds -> British/American spelling -> irregular plurals)

# "NsN->NNN->N->NNNN->N"
#                     - stemming (porter)
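The spec string can be decoded mechanically. A hypothetical helper that turns a spec into readable settings, following the diagram above (the field names are my own labels, not part of lucytok's API):

```python
def describe(spec: str) -> dict:
    # Split the five "->"-separated groups described in the spec diagram.
    tok, splits, case, dicts, stem = spec.split("->")
    return {
        "ascii_fold": tok[0] == "a",
        "tokenizer": "standard" if tok[1] == "s" else "whitespace",
        "remove_possessive": tok[2] == "p",
        "split_punctuation": splits[0] == "p",
        "split_case_change": splits[1] == "c",
        "split_letter_number": splits[2] == "n",
        "lowercase": case == "l",
        "stopwords": dicts[0] == "s",
        "split_compounds": dicts[1] == "c",
        "british_to_american": dicts[2] == "b",
        "irregular_plurals": dicts[3] == "p",
        "porter_version": None if stem in ("N", "0") else int(stem),
    }

print(describe("Nsp->NNN->l->sNNN->1"))
```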

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

lucytok-0.1.10.tar.gz (33.5 kB)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

lucytok-0.1.10-py3-none-any.whl (33.4 kB)

Uploaded Python 3

File details

Details for the file lucytok-0.1.10.tar.gz.

File metadata

  • Download URL: lucytok-0.1.10.tar.gz
  • Upload date:
  • Size: 33.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.8.3 CPython/3.12.6 Darwin/24.1.0

File hashes

Hashes for lucytok-0.1.10.tar.gz
  • SHA256: 4dbe2b4b9c183f02d1a0f63a45f54f8bcd236cfbc32546df5c41cfe589e2b43b
  • MD5: 2c4baf202520caae0cd50c7425e2f3d4
  • BLAKE2b-256: d021a5e42a6fe8270aeccc56a9c77350bd01c9ba3a52f66ba7fb34cad1354d2b

See more details on using hashes here.

File details

Details for the file lucytok-0.1.10-py3-none-any.whl.

File metadata

  • Download URL: lucytok-0.1.10-py3-none-any.whl
  • Upload date:
  • Size: 33.4 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.8.3 CPython/3.12.6 Darwin/24.1.0

File hashes

Hashes for lucytok-0.1.10-py3-none-any.whl
  • SHA256: 6593ea27e9767bc678304bfe79aba2ffb2052860dbf7301ec4b1e3d3feda25ca
  • MD5: bf982e0127767f7508b7f679f6c7df52
  • BLAKE2b-256: 3bebf0376e4badbb5bebc5193ae275f6bdec3f77af8d2af3070ba9201d60dd39

See more details on using hashes here.
