Lucytok
Lucene's boring English tokenizers, recreated for Python. Compatible with SearchArray.
Lets you configure a handful of common tokenization rules: ASCII folding, possessive removal, both versions of Porter stemming, English stopwords, and more.
Usage
Creating a tokenizer close to Elasticsearch's default english analyzer
from lucytok import english
es_english = english("Nsp->NNN->l->sNNN->1")
tokenized = es_english("The quick brown fox jumps over the lazy døg")
print(tokenized)
Outputs
['_', 'quick', 'brown', 'fox', 'jump', 'over', '_', 'lazi', 'døg']
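The `_` entries mark positions where English stopwords were blanked out. Keeping a placeholder instead of dropping the token preserves token positions for position-aware downstream use (e.g. SearchArray). A minimal sketch of that behavior, with a tiny hypothetical stopword set (the real analyzer uses a full English stopword list):

```python
# Tiny illustrative stopword set; lucytok's actual list is larger.
STOPWORDS = {"the", "a", "an", "of", "to"}

def blank_stopwords(tokens):
    # Replace stopwords with "_" rather than deleting them, so every
    # surviving token keeps its original position in the sequence.
    return ["_" if tok.lower() in STOPWORDS else tok for tok in tokens]

print(blank_stopwords(["The", "quick", "brown", "fox"]))
# -> ['_', 'quick', 'brown', 'fox']
```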
Make a tokenizer with ASCII folding...
from lucytok import english
es_english_folded = english("asp->NNN->l->sNNN->1")
print(es_english_folded("The quick brown fox jumps over the lazy døg"))
['_', 'quick', 'brown', 'fox', 'jump', 'over', '_', 'lazi', 'dog']
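The `a` flag folds accented and other non-ASCII letters to ASCII equivalents, in the spirit of Lucene's ASCIIFoldingFilter. A rough pure-Python sketch of the idea (not lucytok's actual implementation); note that Unicode decomposition alone doesn't handle letters like `ø`, which have no combining-mark form and need an explicit mapping:

```python
import unicodedata

# Letters with no NFKD decomposition must be folded explicitly.
SPECIAL = str.maketrans({"ø": "o", "Ø": "O", "ß": "ss", "æ": "ae"})

def ascii_fold(text):
    text = text.translate(SPECIAL)
    # Decompose accented letters into base letter + combining mark,
    # then drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(ascii_fold("døg"))   # -> dog
print(ascii_fold("café"))  # -> cafe
```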
Split compounds and convert British to American spelling...
from lucytok import english
es_british = english("asp->NNN->l->scbN->1")
print(es_british("The watercolour fox jumps over the lazy døg"))
['_', 'water', 'color', 'fox', 'jump', 'over', '_', 'lazi', 'dog']
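The dictionary-based steps run in a fixed order — stopwords, then compound splitting, then British→American conversion — which is why `watercolour` comes out as `water`, `color`: the compound is split first, then each part is respelled. A sketch of that ordering with tiny hypothetical dictionaries (not lucytok's actual data):

```python
# Hypothetical one-entry dictionaries, for illustration only.
COMPOUNDS = {"watercolour": ["water", "colour"]}
BRITISH_TO_AMERICAN = {"colour": "color"}

def dictionary_steps(tokens):
    # 1) split compounds, 2) convert British spellings,
    # mirroring the fixed step order in the spec string.
    split = [part for tok in tokens for part in COMPOUNDS.get(tok, [tok])]
    return [BRITISH_TO_AMERICAN.get(tok, tok) for tok in split]

print(dictionary_steps(["watercolour", "fox"]))
# -> ['water', 'color', 'fox']
```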
Spec
Create a tokenizer using the following settings (these concepts correspond to their Elasticsearch counterparts):
#  |- ASCII fold (a) or not (N)
#  ||- Standard (s) or WS tokenizer (w)
#  |||- Remove possessive suffixes (p) or not (N)
#  |||
# "NsN->NNN->N->NNNN->N"
#       |||  |  ||||  |
#       |||  |  ||||  |- Porter stem version 1 or version 2; N or 0 for none
#       |||  |  ||||- Manually convert irregular plurals (p) or not (N)
#       |||  |  |||- Convert British to American spelling (b) or not (N)
#       |||  |  ||- Split compounds (c) or not (N)
#       |||  |  |- Blank out stopwords (s) or not (N)
#       |||  |- Lowercase (l) or not (N)
#       |||- Split on letter/number transitions (n) or not (N)
#       ||- Split on case changes (c) or not (N)
#       |- Split on punctuation (p) or not (N)
# "NsN->NNN->N->NNNN->N"
#  ---
#  (tokenization)
#
# "NsN->NNN->N->NNNN->N"
#       ---
#       (word splitting on rules, like WordDelimiterFilter in Lucene)
#
# "NsN->NNN->N->NNNN->N"
#            -
#            (lowercasing or not)
#
# "NsN->NNN->N->NNNN->N"
#               ----
#               (dictionary-based steps: stopwords -> compounds -> British/American English -> irregular plurals)
#
# "NsN->NNN->N->NNNN->N"
#                     -
#                     (stemming, Porter)
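Since the spec string is just five `->`-separated flag groups, it can be pulled apart mechanically. A hypothetical parser sketch — the field names here are mine, not lucytok's API — assuming the flag order shown above:

```python
def parse_spec(spec):
    # Five groups: tokenization, word splitting, lowercasing,
    # dictionary steps, stemming.
    tok, split, lower, dicts, stem = spec.split("->")
    return {
        "ascii_fold":          tok[0] == "a",
        "standard_tokenizer":  tok[1] == "s",   # "w" = whitespace tokenizer
        "strip_possessive":    tok[2] == "p",
        "split_punctuation":   split[0] == "p",
        "split_case_changes":  split[1] == "c",
        "split_letter_number": split[2] == "n",
        "lowercase":           lower == "l",
        "blank_stopwords":     dicts[0] == "s",
        "split_compounds":     dicts[1] == "c",
        "british_to_american": dicts[2] == "b",
        "irregular_plurals":   dicts[3] == "p",
        "porter_version":      None if stem in ("N", "0") else int(stem),
    }

print(parse_spec("Nsp->NNN->l->sNNN->1"))
```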
Download files
Source Distribution
Built Distribution
File details
Details for the file lucytok-0.1.10.tar.gz.
File metadata
- Download URL: lucytok-0.1.10.tar.gz
- Upload date:
- Size: 33.5 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/1.8.3 CPython/3.12.6 Darwin/24.1.0
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 4dbe2b4b9c183f02d1a0f63a45f54f8bcd236cfbc32546df5c41cfe589e2b43b |
| MD5 | 2c4baf202520caae0cd50c7425e2f3d4 |
| BLAKE2b-256 | d021a5e42a6fe8270aeccc56a9c77350bd01c9ba3a52f66ba7fb34cad1354d2b |
File details
Details for the file lucytok-0.1.10-py3-none-any.whl.
File metadata
- Download URL: lucytok-0.1.10-py3-none-any.whl
- Upload date:
- Size: 33.4 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/1.8.3 CPython/3.12.6 Darwin/24.1.0
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 6593ea27e9767bc678304bfe79aba2ffb2052860dbf7301ec4b1e3d3feda25ca |
| MD5 | bf982e0127767f7508b7f679f6c7df52 |
| BLAKE2b-256 | 3bebf0376e4badbb5bebc5193ae275f6bdec3f77af8d2af3070ba9201d60dd39 |