
shirkhan tools


shirkhan

A Python tool library for shirkhan.

Currently used by the u-open-imla project.

Usage

Install

pip install shirkhan

Upgrade

pip install --upgrade shirkhan

Uninstall

pip uninstall shirkhan

from shirkhan import decode, encode, syllabify

encode("xxxx")
decode("yyy")
word = "شىرخان"
print(syllabify(word))

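Feature examples:

Syllabification

Approach:
1. Vectorize the word: map each letter to 0 (consonant) or 1 (vowel) to produce a token, e.g. 0100100
2. The analysis runs from back to front, so reverse the token to get the retoken: 0010010
3. Group the retoken with each vowel as a boundary: 001 001 0
4. Insert separators into the retoken according to the general syllabification rules:
    - 1 consonant between two vowels: it belongs to the preceding syllable
    - 2 consonants between two vowels: one belongs to the preceding syllable, the other to the following one
    - 3 consonants between two vowels: the first belongs to the preceding syllable, the last two to the following one
    - 4 consonants between two vowels: the first belongs to the preceding syllable, the next two form their own group, and the last belongs to the following one   [a rule shirkhan made up; there is currently no evidence to support it, and it is incorrect]
    - 5 consonants between two vowels: the first belongs to the preceding syllable, the next three form their own group, and the last belongs to the following one   [a rule shirkhan made up; there is currently no evidence to support it, and it is incorrect]

5. Map each split position in the separator-bearing retoken back onto the original word: i -> len(word) - i
6. Split the word at the separators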
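
The steps above can be sketched in plain Python. This is an illustration only, not the library's implementation: the names `VOWELS`, `tokenize`, `split_points` and `syllabify_sketch` are hypothetical, the vowel set is an assumption (the eight Uyghur vowel letters in Arabic script), and the two rules flagged as incorrect are not reproduced. Runs of 0 or of 4 and more consonants are handled by an assumed fallback.

```python
# Assumption: the eight Uyghur vowel letters in Arabic script
# (a, e, o, u, ö, ü, é, i), written as escapes to avoid look-alike glyphs.
VOWELS = frozenset("\u0627\u06d5\u0648\u06c7\u06c6\u06c8\u06d0\u0649")


def tokenize(word):
    """Step 1: '1' for a vowel, '0' for any other character."""
    return "".join("1" if ch in VOWELS else "0" for ch in word)


def split_points(token):
    """Steps 2-5: cut indices into the original word, in ascending order."""
    retoken = token[::-1]  # step 2: analyse from the back
    vowels = [i for i, bit in enumerate(retoken) if bit == "1"]
    cuts = []
    for left, right in zip(vowels, vowels[1:]):
        consonants = right - left - 1
        # Rules for 1-3 consonants: exactly one consonant stays with the
        # syllable on the left (in retoken order).
        # Assumption: 0 consonants -> cut between the two vowels;
        # 4 or more -> treated like 3.
        keep = min(consonants, 1)
        cut_in_retoken = left + 1 + keep
        cuts.append(len(token) - cut_in_retoken)  # step 5: i -> len(word) - i
    return sorted(cuts)


def syllabify_sketch(word):
    """Step 6: slice the word at the cut indices."""
    if not word:
        return []
    bounds = [0, *split_points(tokenize(word)), len(word)]
    return [word[a:b] for a, b in zip(bounds, bounds[1:])]


print(split_points("010010"))       # [3]
print(split_points("01001001001"))  # [3, 6, 9]
```

Working on the 0/1 token rather than on the letters keeps the rule logic independent of the alphabet; only `tokenize` needs to know which characters are vowels.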
from shirkhan import syllabify

print(syllabify('شىرخان'))

# output ['شىر', 'خان']

Vowel/consonant vector

from shirkhan import SWord

target_word = "شىرخان"

print(SWord(target_word).tokenize())

# output 010010

Grouping the vector

from shirkhan import SWord

target_word = "شىرخاننىڭمۇ"
sw = SWord(target_word)
gtoken = sw.get_grouped_token()
gretoken = sw.get_grouped_retoken()

print(sw.tokenize())
print(gtoken)
print(gretoken)

# 01001001001
# [['0', '1'], ['0', '0', '1'], ['0', '0', '1'], ['0', '0', '1']]
# [['1'], ['0', '0', '1'], ['0', '0', '1'], ['0', '0', '1'], ['0']]
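
The grouped output above is consistent with closing a group at every vowel and collecting any trailing consonants into a final group. A minimal sketch of that grouping follows; `group_by_vowel` is a hypothetical helper, not part of the shirkhan API.

```python
def group_by_vowel(token):
    """Split a 0/1 token into groups, closing a group at every '1' (vowel)."""
    groups, current = [], []
    for bit in token:
        current.append(bit)
        if bit == "1":
            groups.append(current)
            current = []
    if current:  # trailing consonants after the last vowel
        groups.append(current)
    return groups


token = "01001001001"
print(group_by_vowel(token))
# [['0', '1'], ['0', '0', '1'], ['0', '0', '1'], ['0', '0', '1']]
print(group_by_vowel(token[::-1]))
# [['1'], ['0', '0', '1'], ['0', '0', '1'], ['0', '0', '1'], ['0']]
```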

Original content with syllable separators

from shirkhan import SWord

target_word = "شىرخاننىڭمۇ"
sw = SWord(target_word)
print(sw.get_positional_word())
print(sw.get_positional_token())
print(sw.get_positional_retoken())

# شىرxخانxنىڭxمۇ
# 010x010x010x01
# 10x010x010x010
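
The three outputs above are related by the index mapping i -> len(word) - i. The sketch below shows that relation on the token alone; `with_separators` is a hypothetical helper and the cut indices are taken from the example output, not computed by the library.

```python
def with_separators(text, cuts, sep="x"):
    """Insert sep at each cut index (ascending, relative to the original text)."""
    bounds = [0, *cuts, len(text)]
    return sep.join(text[a:b] for a, b in zip(bounds, bounds[1:]))


token = "01001001001"
cuts = [3, 6, 9]  # split points in the original (forward) order

# The same split points seen from the reversed token: i -> len - i
recuts = [len(token) - c for c in reversed(cuts)]

print(with_separators(token, cuts))          # 010x010x010x01
print(recuts)                                # [2, 5, 8]
print(with_separators(token[::-1], recuts))  # 10x010x010x010
```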

Generating sub-words from a word

# Method 1
from shirkhan import SWord

target_word = "شىرخاننىڭمۇ"
sw = SWord(target_word)
print(sw.get_similar_words())

# Method 2
from shirkhan import SWord

target_word = "شىرخاننىڭمۇ"
print(SWord(target_word))

# output:
# ['شىر', 'شىرخان', 'شىرخاننىڭ', 'شىرخاننىڭمۇ']
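
The output above matches a running concatenation of the word's syllables. A minimal sketch of that idea, assuming the syllable list is already available; `sub_words` is a hypothetical name and may not reflect how `get_similar_words` is implemented.

```python
from itertools import accumulate


def sub_words(syllables):
    """Return every prefix of the word that ends on a syllable boundary."""
    # accumulate() with its default addition concatenates the strings cumulatively.
    return list(accumulate(syllables))


print(sub_words(["ab", "cd", "ef"]))  # ['ab', 'abcd', 'abcdef']
```

Passing the four syllables of the example word to `sub_words` would give the same kind of list as the output shown above.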

