Lucytok
Lucene's boring English tokenizers, recreated for Python. Compatible with SearchArray.
Lets you configure a handful of common tokenization rules: ASCII folding, possessive removal, both versions of Porter stemming, English stopwords, and more.
Usage
Creating a tokenizer close to Elasticsearch's default english analyzer
from lucytok import english
es_english = english("Nsp->NNN->l->sNNN->1")
tokenized = es_english("The quick brown fox jumps over the lazy døg")
print(tokenized)
Outputs
['_', 'quick', 'brown', 'fox', 'jump', 'over', '_', 'lazi', 'døg']
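The `_` entries mark positions where English stopwords were blanked out. Keeping a placeholder instead of dropping the token preserves token positions for position-aware downstream use (e.g. SearchArray). A minimal sketch of that behavior, with a tiny hypothetical stopword set (the real analyzer uses a full English stopword list):

```python
# Tiny illustrative stopword set; lucytok's actual list is larger.
STOPWORDS = {"the", "a", "an", "of", "to"}

def blank_stopwords(tokens):
    # Replace stopwords with "_" rather than deleting them, so every
    # surviving token keeps its original position in the sequence.
    return ["_" if tok.lower() in STOPWORDS else tok for tok in tokens]

print(blank_stopwords(["The", "quick", "brown", "fox"]))
# -> ['_', 'quick', 'brown', 'fox']
```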
Make a tokenizer with ASCII folding...
from lucytok import english
es_english_folded = english("asp->NNN->l->sNNN->1")
print(es_english_folded("The quick brown fox jumps over the lazy døg"))
['_', 'quick', 'brown', 'fox', 'jump', 'over', '_', 'lazi', 'dog']
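The `a` flag folds accented and other non-ASCII letters to ASCII equivalents, in the spirit of Lucene's ASCIIFoldingFilter. A rough pure-Python sketch of the idea (not lucytok's actual implementation); note that Unicode decomposition alone doesn't handle letters like `ø`, which have no combining-mark form and need an explicit mapping:

```python
import unicodedata

# Letters with no NFKD decomposition must be folded explicitly.
SPECIAL = str.maketrans({"ø": "o", "Ø": "O", "ß": "ss", "æ": "ae"})

def ascii_fold(text):
    text = text.translate(SPECIAL)
    # Decompose accented letters into base letter + combining mark,
    # then drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(ascii_fold("døg"))   # -> dog
print(ascii_fold("café"))  # -> cafe
```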
Split compounds and convert British to American spelling...
from lucytok import english
es_british = english("asp->NNN->l->scbN->1")
print(es_british("The watercolour fox jumps over the lazy døg"))
['_', 'water', 'color', 'fox', 'jump', 'over', '_', 'lazi', 'dog']
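The dictionary-based steps run in a fixed order — stopwords, then compound splitting, then British→American conversion — which is why `watercolour` comes out as `water`, `color`: the compound is split first, then each part is respelled. A sketch of that ordering with tiny hypothetical dictionaries (not lucytok's actual data):

```python
# Hypothetical one-entry dictionaries, for illustration only.
COMPOUNDS = {"watercolour": ["water", "colour"]}
BRITISH_TO_AMERICAN = {"colour": "color"}

def dictionary_steps(tokens):
    # 1) split compounds, 2) convert British spellings,
    # mirroring the fixed step order in the spec string.
    split = [part for tok in tokens for part in COMPOUNDS.get(tok, [tok])]
    return [BRITISH_TO_AMERICAN.get(tok, tok) for tok in split]

print(dictionary_steps(["watercolour", "fox"]))
# -> ['water', 'color', 'fox']
```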
Spec
Create a tokenizer using the following settings (these concepts correspond to their Elasticsearch counterparts):
#  |- ASCII fold (a) or not (N)
#  ||- Standard (s) or WS tokenizer (w)
#  |||- Remove possessive suffixes (p) or not (N)
#  |||
# "NsN->NNN->N->NNNN->N"
#       |||  |  ||||  |
#       |||  |  ||||  |- Porter stem version 1 or version 2; N or 0 for none
#       |||  |  ||||- Manually convert irregular plurals (p) or not (N)
#       |||  |  |||- Convert British to American spelling (b) or not (N)
#       |||  |  ||- Split compounds (c) or not (N)
#       |||  |  |- Blank out stopwords (s) or not (N)
#       |||  |- Lowercase (l) or not (N)
#       |||- Split on letter/number transitions (n) or not (N)
#       ||- Split on case changes (c) or not (N)
#       |- Split on punctuation (p) or not (N)
# "NsN->NNN->N->NNNN->N"
#  ---
#  (tokenization)
#
# "NsN->NNN->N->NNNN->N"
#       ---
#       (word splitting on rules, like WordDelimiterFilter in Lucene)
#
# "NsN->NNN->N->NNNN->N"
#            -
#            (lowercasing or not)
#
# "NsN->NNN->N->NNNN->N"
#               ----
#               (dictionary-based steps: stopwords -> compounds -> British/American English -> irregular plurals)
#
# "NsN->NNN->N->NNNN->N"
#                     -
#                     (stemming, Porter)
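Since the spec string is just five `->`-separated flag groups, it can be pulled apart mechanically. A hypothetical parser sketch — the field names here are mine, not lucytok's API — assuming the flag order shown above:

```python
def parse_spec(spec):
    # Five groups: tokenization, word splitting, lowercasing,
    # dictionary steps, stemming.
    tok, split, lower, dicts, stem = spec.split("->")
    return {
        "ascii_fold":          tok[0] == "a",
        "standard_tokenizer":  tok[1] == "s",   # "w" = whitespace tokenizer
        "strip_possessive":    tok[2] == "p",
        "split_punctuation":   split[0] == "p",
        "split_case_changes":  split[1] == "c",
        "split_letter_number": split[2] == "n",
        "lowercase":           lower == "l",
        "blank_stopwords":     dicts[0] == "s",
        "split_compounds":     dicts[1] == "c",
        "british_to_american": dicts[2] == "b",
        "irregular_plurals":   dicts[3] == "p",
        "porter_version":      None if stem in ("N", "0") else int(stem),
    }

print(parse_spec("Nsp->NNN->l->sNNN->1"))
```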
Download files
Source Distribution
Built Distribution
File details
Details for the file lucytok-0.1.10.tar.gz.
File metadata
- Download URL: lucytok-0.1.10.tar.gz
- Upload date:
- Size: 33.5 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/1.8.3 CPython/3.12.6 Darwin/24.1.0
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 4dbe2b4b9c183f02d1a0f63a45f54f8bcd236cfbc32546df5c41cfe589e2b43b |
| MD5 | 2c4baf202520caae0cd50c7425e2f3d4 |
| BLAKE2b-256 | d021a5e42a6fe8270aeccc56a9c77350bd01c9ba3a52f66ba7fb34cad1354d2b |
File details
Details for the file lucytok-0.1.10-py3-none-any.whl.
File metadata
- Download URL: lucytok-0.1.10-py3-none-any.whl
- Upload date:
- Size: 33.4 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/1.8.3 CPython/3.12.6 Darwin/24.1.0
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 6593ea27e9767bc678304bfe79aba2ffb2052860dbf7301ec4b1e3d3feda25ca |
| MD5 | bf982e0127767f7508b7f679f6c7df52 |
| BLAKE2b-256 | 3bebf0376e4badbb5bebc5193ae275f6bdec3f77af8d2af3070ba9201d60dd39 |