shirkhan tools
shirkhan
The Python tool library for shirkhan.
Currently used in the u-open-imla project.
Usage
Install
pip install shirkhan
Upgrade
pip install --upgrade shirkhan
Uninstall
pip uninstall shirkhan
from shirkhan import decode, encode, syllabify
encode("xxxx")
decode("yyy")
word = "شىرخان"
print(syllabify(word))
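Feature examples:
Syllabification
Approach:
1. Vectorize the word: map each letter to 1 (vowel) or 0 (consonant) to build a token, e.g. 0100100
2. The analysis runs from the end of the word backwards, so reverse the token to get the retoken, e.g. 0010010
3. Group the retoken at vowel boundaries: 001 001 0
4. Insert separators into the retoken using the general syllabification rules ("preceding" and "following" are in retoken order):
- 1 consonant between two vowels: it belongs to the preceding syllable
- 2 consonants between two vowels: one belongs to the preceding syllable, one to the following syllable
- 3 consonants between two vowels: the first belongs to the preceding syllable, the last two to the following syllable
- 4 consonants between two vowels: the first belongs to the preceding syllable, the next two form their own group, and the last belongs to the following syllable [a rule shirkhan made up; there is currently no evidence for it, and it is incorrect]
- 5 consonants between two vowels: the first belongs to the preceding syllable, the next three form their own group, and the last belongs to the following syllable [a rule shirkhan made up; there is currently no evidence for it, and it is incorrect]
5. Map the separator positions in the retoken back onto the original word: i -> len(word)-i
6. Split the word at the separators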
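The steps above can be sketched in plain Python. This is only an illustration of the approach, not the code inside shirkhan: the vowel set is an assumption (the eight vowel letters of the Uyghur Arabic alphabet), the function names are hypothetical, and gaps of 0 or more than 3 consonants fall back to a single cut as a placeholder, since the 4- and 5-consonant rules are marked as incorrect above.

```python
# Assumption: the eight vowel letters of the Uyghur Arabic alphabet.
# The library's own vowel table is not shown here and may differ.
VOWELS = set("\u0627\u06d5\u0648\u06c7\u06c6\u06c8\u06d0\u0649")


def tokenize(word):
    """Step 1: map each letter to '1' (vowel) or '0' (consonant)."""
    return "".join("1" if ch in VOWELS else "0" for ch in word)


def split_points(token):
    """Steps 2-5: return sorted cut indices into the original word."""
    retoken = token[::-1]  # step 2: analyse from the end of the word
    vowels = [i for i, t in enumerate(retoken) if t == "1"]
    cuts = []
    for left, right in zip(vowels, vowels[1:]):  # step 3: gaps between vowels
        gap = right - left - 1  # consonants between the two vowels
        # Step 4: for 1 to 3 consonants, the first one stays with the group
        # on its left (retoken order). Gap 0 and gaps above 3 reuse the same
        # single cut purely as a placeholder assumption.
        cut = left + 1 + min(gap, 1)
        cuts.append(len(token) - cut)  # step 5: i -> len(word) - i
    return sorted(cuts)


def syllabify_sketch(word):
    """Step 6: slice the word at the cut points."""
    bounds = [0] + split_points(tokenize(word)) + [len(word)]
    return [word[a:b] for a, b in zip(bounds, bounds[1:])]


print(split_points("010010"))       # [3]
print(split_points("01001001001"))  # [3, 6, 9]
print(syllabify_sketch("شىرخان"))
```

Working on the 0/1 token rather than on the letters keeps the rules independent of the script; only tokenize needs to know which letters are vowels.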
from shirkhan import syllabify
print(syllabify('شىرخان'))
# output ['شىر', 'خان']
Vowel/consonant vector
from shirkhan import SWord
target_word = "شىرخان"
print(SWord(target_word).tokenize())
# output 010010
Grouping the vector
from shirkhan import SWord
target_word = "شىرخاننىڭمۇ"
sw = SWord(target_word)
gtoken = sw.get_grouped_token()
gretoken = sw.get_grouped_retoken()
print(sw.tokenize())
print(gtoken)
print(gretoken)
# 01001001001
# [['0', '1'], ['0', '0', '1'], ['0', '0', '1'], ['0', '0', '1']]
# [['1'], ['0', '0', '1'], ['0', '0', '1'], ['0', '0', '1'], ['0']]
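The grouped output above is consistent with closing a group at every vowel. A minimal sketch under that assumption follows; group_by_vowel is a hypothetical helper, not part of the shirkhan API.

```python
def group_by_vowel(token):
    """Split a 0/1 token into groups that each end with a vowel ('1')."""
    groups, current = [], []
    for t in token:
        current.append(t)
        if t == "1":  # a vowel closes the current group
            groups.append(current)
            current = []
    if current:  # trailing consonants with no closing vowel
        groups.append(current)
    return groups


token = "01001001001"
print(group_by_vowel(token))
# [['0', '1'], ['0', '0', '1'], ['0', '0', '1'], ['0', '0', '1']]
print(group_by_vowel(token[::-1]))
# [['1'], ['0', '0', '1'], ['0', '0', '1'], ['0', '0', '1'], ['0']]
```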
Original content with syllable separators
from shirkhan import SWord
target_word = "شىرخاننىڭمۇ"
sw = SWord(target_word)
print(sw.get_positional_word())
print(sw.get_positional_token())
print(sw.get_positional_retoken())
# شىرxخانxنىڭxمۇ
# 010x010x010x01
# 10x010x010x010
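The three positional strings differ only in the coordinate system of the cut points. The sketch below shows the i -> len(word)-i mapping on the token from this example; mark is a hypothetical helper, and the cut list [3, 6, 9] is read off the output above rather than computed by the library.

```python
def mark(seq, cuts, sep="x"):
    """Join the slices of seq between the given cut indices with sep."""
    bounds = [0] + sorted(cuts) + [len(seq)]
    return sep.join(seq[a:b] for a, b in zip(bounds, bounds[1:]))


token = "01001001001"
cuts = [3, 6, 9]                         # cut indices in the original word
retoken = token[::-1]
recuts = [len(token) - c for c in cuts]  # the same cuts in retoken coordinates

print(mark(token, cuts))      # 010x010x010x01
print(mark(retoken, recuts))  # 10x010x010x010
```

The mapping is its own inverse, so the same expression converts cut points in either direction.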
Generating sub-words from a word
# First approach
from shirkhan import SWord
target_word = "شىرخاننىڭمۇ"
sw = SWord(target_word)
print(sw.get_similar_words())
# Second approach
from shirkhan import SWord
target_word = "شىرخاننىڭمۇ"
print(SWord(target_word))
# output:
# ['شىر', 'شىرخان', 'شىرخاننىڭ', 'شىرخاننىڭمۇ']
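The list above looks like the running concatenation of the word's syllables. Assuming that is what happens, it can be reproduced with itertools.accumulate; prefix_words is a hypothetical name, and the library may build the list differently.

```python
from itertools import accumulate


def prefix_words(syllables):
    """Return the cumulative joins of a list of syllables."""
    # accumulate() defaults to addition, which concatenates strings.
    return list(accumulate(syllables))


print(prefix_words(["شىر", "خان", "نىڭ", "مۇ"]))
```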