shirkhan tools

shirkhan

The Python tool library for shirkhan.

Currently used by the u-open-imla project.

Usage

Install

pip install shirkhan

Upgrade

pip install --upgrade shirkhan

Uninstall

pip uninstall shirkhan

from shirkhan import decode, encode, syllabify

encode("xxxx")
decode("yyy")
word = "شىرخان"
print(syllabify(word))

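Feature examples:

Syllabification

Approach:
1. Vectorize the word: write 1 for each vowel and 0 for each consonant to produce a token, e.g. 0100100
2. The analysis runs from the end of the word, so reverse the token to get retoken: 0010010
3. Group retoken using the vowels as boundaries: 001 001 0
4. Insert separators into retoken according to the general syllabification rules:
    - 1 consonant between two vowels: it belongs to the preceding syllable
    - 2 consonants between two vowels: one belongs to the preceding syllable, one to the following syllable
    - 3 consonants between two vowels: the first belongs to the preceding syllable, the other two to the following syllable
    - 4 consonants between two vowels: the first belongs to the preceding syllable, the next two form a group, and the last belongs to the following syllable [a rule shirkhan made up; there is currently no evidence for it, and it is wrong]
    - 5 consonants between two vowels: the first belongs to the preceding syllable, the next three form a group, and the last belongs to the following syllable [a rule shirkhan made up; there is currently no evidence for it, and it is wrong]
5. Map the split-point coordinates of the separator-marked retoken back onto the original word: i -> len(word)-i
6. Split the word at the separators
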
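
The steps above can be sketched roughly as follows. This is an illustration only, not the library's implementation: `split_points` and `split_word` are hypothetical names, the token is assumed to use 1 for vowels and 0 for consonants, the rules are read in retoken order, adjacent vowels are assumed to be split between them, and clusters of four or more consonants are treated like the three-consonant case because the rules listed for them are flagged as unreliable.

```python
def split_points(token):
    """Return the indices in the original word where a new syllable starts.

    token: '1' for a vowel, '0' for a consonant (assumed convention).
    """
    retoken = token[::-1]  # step 2: analyse from the end of the word
    vowels = [i for i, t in enumerate(retoken) if t == "1"]
    cuts = []
    for a, b in zip(vowels, vowels[1:]):
        gap = b - a - 1  # consonants between two neighbouring vowels
        # Step 4: with no consonant, cut right after the vowel (assumption);
        # otherwise exactly one consonant stays with the vowel at `a`.
        cut = a + 1 if gap == 0 else a + 2
        cuts.append(len(token) - cut)  # step 5: i -> len(word) - i
    return sorted(cuts)


def split_word(word, token):
    """Step 6: cut the word at the computed positions."""
    bounds = [0, *split_points(token), len(word)]
    return [word[s:e] for s, e in zip(bounds, bounds[1:])]


print(split_points("010010"))       # -> [3]
print(split_points("01001001001"))  # -> [3, 6, 9]
```

Working on the token string keeps the sketch independent of any particular vowel inventory; `split_word` then applies the same cut positions to the letters of the word.
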
from shirkhan import syllabify

print(syllabify('شىرخان'))

# output ['شىر', 'خان']

Vowel/consonant vector

from shirkhan import SWord

target_word = "شىرخان"

print(SWord(target_word).tokenize())

# output 010010
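
For readers who want to see what such a vector encodes, here is a minimal sketch of the mapping. It does not use shirkhan: `to_token` and `UYGHUR_VOWELS` are hypothetical names, and the vowel inventory (the eight vowel letters of the Uyghur Arabic script) is an assumption rather than something taken from the library.

```python
# Assumed vowel letters (a, e, o, u, ö, ü, é, i) of the Uyghur Arabic script;
# this set is an assumption and may differ from what shirkhan uses.
UYGHUR_VOWELS = frozenset("\u0627\u06d5\u0648\u06c7\u06c6\u06c8\u06d0\u0649")


def to_token(word):
    """Map each letter to '1' (vowel) or '0' (anything else)."""
    return "".join("1" if ch in UYGHUR_VOWELS else "0" for ch in word)


# The word from the example above, spelled with explicit code points.
word = "\u0634\u0649\u0631\u062e\u0627\u0646"
print(to_token(word))  # -> 010010
```
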

Grouping the vector

from shirkhan import SWord

target_word = "شىرخاننىڭمۇ"
sw = SWord(target_word)
gtoken = sw.get_grouped_token()
gretoken = sw.get_grouped_retoken()

print(sw.tokenize())
print(gtoken)
print(gretoken)

# 01001001001
# [['0', '1'], ['0', '0', '1'], ['0', '0', '1'], ['0', '0', '1']]
# [['1'], ['0', '0', '1'], ['0', '0', '1'], ['0', '0', '1'], ['0']]
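
The grouping shown in this output closes a group at every vowel. A small sketch that reproduces it, assuming that reading of the output (`group_token` is a hypothetical helper, not part of shirkhan):

```python
def group_token(token):
    """Split a 0/1 token into groups that each end at a vowel ('1').

    Trailing consonants with no vowel after them form a final group.
    """
    groups, current = [], []
    for t in token:
        current.append(t)
        if t == "1":  # a vowel closes the current group
            groups.append(current)
            current = []
    if current:  # leftover consonants at the end
        groups.append(current)
    return groups


token = "01001001001"
print(group_token(token))
# -> [['0', '1'], ['0', '0', '1'], ['0', '0', '1'], ['0', '0', '1']]
print(group_token(token[::-1]))
# -> [['1'], ['0', '0', '1'], ['0', '0', '1'], ['0', '0', '1'], ['0']]
```
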

Syllable boundaries marked in the original content

from shirkhan import SWord

target_word = "شىرخاننىڭمۇ"
sw = SWord(target_word)
print(sw.get_positional_word())
print(sw.get_positional_token())
print(sw.get_positional_retoken())

# شىرxخانxنىڭxمۇ
# 010x010x010x01
# 10x010x010x010
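
The three outputs are the same boundaries shown on the word, the token, and the reversed token. A minimal sketch of the marking step, assuming the cut positions 3, 6 and 9 read off the output above (`mark` is a hypothetical helper, not a shirkhan function):

```python
def mark(seq, cuts, sep="x"):
    """Insert `sep` into `seq` at each index listed in `cuts`."""
    bounds = [0, *cuts, len(seq)]
    return sep.join(seq[s:e] for s, e in zip(bounds, bounds[1:]))


token = "01001001001"
marked = mark(token, [3, 6, 9])
print(marked)        # -> 010x010x010x01
print(marked[::-1])  # -> 10x010x010x010
```

Reversing the marked token gives the marked retoken, which is why the third line of output mirrors the second.
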

Generating sub-words from a word

# Option 1
from shirkhan import SWord

target_word = "شىرخاننىڭمۇ"
sw = SWord(target_word)
print(sw.get_similar_words())

# Option 2
from shirkhan import SWord

target_word = "شىرخاننىڭمۇ"
print(SWord(target_word))

# output:
# ['شىر', 'شىرخان', 'شىرخاننىڭ', 'شىرخاننىڭمۇ']
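
Judging by this output, the sub-words are the running concatenations of the word's syllables. A sketch of that idea using only the standard library (`prefixes` is a hypothetical name, and the syllable list is copied from the syllabified output shown earlier):

```python
from itertools import accumulate


def prefixes(syllables):
    """Return the cumulative concatenations of a list of syllables."""
    # accumulate() adds items by default, which concatenates strings.
    return list(accumulate(syllables))


print(prefixes(['شىر', 'خان', 'نىڭ', 'مۇ']))
# -> ['شىر', 'شىرخان', 'شىرخاننىڭ', 'شىرخاننىڭمۇ']
```
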

Download files

Source Distribution

shirkhan-0.0.28.tar.gz (23.6 kB)

Uploaded Source

Built Distribution

shirkhan-0.0.28-py3-none-any.whl (29.0 kB)

Uploaded Python 3
