Compute similarity scores for two texts

Project description

TextProcess Package

This is a Python library for text preprocessing. It is mainly intended for cleaning up text before natural language processing (NLP) tasks.

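Supported features

  • Convert uppercase English letters to lowercase
  • Convert Traditional Chinese to Simplified Chinese
  • Convert Simplified Chinese to Traditional Chinese
  • Convert full-width characters to half-width
  • Remove emoji (called "emotion" in the API)
  • Replace emoji with text descriptions
  • Remove control characters
  • Remove hyperlink tags and href attributes
  • Remove http hyperlinks
  • Replace long numbers with a special token
  • Filter out brackets and their contents, e.g. 【xxxxx】, (xxxxxxx), [xxxx], matching 【.*】|(.*)|\[.*\]
  • Filter out consecutive punctuation and spaces
  • Keep only Chinese characters
  • Keep Chinese and English characters
  • Keep Chinese and English characters and digits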
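To make a few of the character-level features above concrete, here is a minimal standard-library sketch. It is not the package's implementation: the function names (`to_half_width`, `strip_control_chars`, `describe_symbols`, `keep_chinese`) are hypothetical, the code targets Python 3, and the choice of Unicode categories and code point ranges is an assumption about what such features typically cover.

```python
import unicodedata


def to_half_width(text):
    """Map full-width ASCII variants and the ideographic space to half-width."""
    out = []
    for ch in text:
        code = ord(ch)
        if code == 0x3000:                # ideographic (full-width) space
            code = 0x20
        elif 0xFF01 <= code <= 0xFF5E:    # full-width '!' through '~'
            code -= 0xFEE0
        out.append(chr(code))
    return "".join(out)


def strip_control_chars(text, repl=""):
    """Replace characters in categories Cc (control) and Cf (format).

    Note that this also replaces tabs and newlines, which are Cc.
    """
    return "".join(
        repl if unicodedata.category(ch) in ("Cc", "Cf") else ch
        for ch in text
    )


def describe_symbols(text):
    """Replace 'Symbol, other' characters (which include most emoji)
    with their Unicode names in square brackets. A rough heuristic only."""
    out = []
    for ch in text:
        name = unicodedata.name(ch, "")
        if unicodedata.category(ch) == "So" and name:
            out.append("[" + name + "]")
        else:
            out.append(ch)
    return "".join(out)


def keep_chinese(text):
    """Keep only characters in the CJK Unified Ideographs block (assumed range)."""
    return "".join(ch for ch in text if "\u4e00" <= ch <= "\u9fff")


sample = "Ｈｅｌｌｏ　Ｗｏｒｌｄ！\x00我😍愛你"
print(to_half_width(sample).lower())
print(strip_control_chars(sample))
print(describe_symbols(sample))
print(keep_chinese(sample))
```

Traditional/Simplified conversion is left out of the sketch because it needs a mapping table that the standard library does not provide.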

Installation

pip install TextProcess-Ora

Usage

import TextProcess.TextProcess as tp


if __name__ == '__main__':
    test_string = u'我😍愛你中華https://<a></a>,,,,,, Hello Word 121233124234213 [sdfsd]{}【】'
    test = tp.TextProcess()
    # Convert uppercase English letters to lowercase
    print(test.strLower(test_string))
    # '我😍你中华<http://><a></a>, hello word。'

    # Convert Traditional Chinese to Simplified Chinese
    print(test.Tra2Sim(test_string, 'zh-hans'))

    # Convert Simplified Chinese to Traditional Chinese
    print(test.Tra2Sim(test_string, 'zh-hant'))

    # Convert full-width characters to half-width
    print(test.strQ2B(test_string))

    # Remove emoji
    print(test.replace_emotion(test_string, ""))

    # Replace emoji with text descriptions
    print(test.convert_emotion(test_string))

    # Remove control characters
    print(test.replace_control_character(test_string, ''))

    # Remove hyperlink tags and href attributes
    print(test.remove_ahref(test_string, ''))

    # Remove http hyperlinks
    print(test.remove_http(test_string, ''))

    # Replace long numbers with a special token
    print(test.replace_long_num(test_string, 'LONG_NUM'))

    # Filter out brackets and their contents: 【xxxxx】/(xxxxxxx)/[xxxx], matching 【.*】|(.*)|\[.*\]
    print(test.replace_brackets(test_string, ''))

    # Filter out consecutive punctuation and spaces
    print(test.remove_commas(test_string))

    # Keep only Chinese characters
    print(test.remove_not_che(test_string))

    # Keep Chinese and English characters
    print(test.keep_chi_eng(test_string, ''))

    # Keep Chinese and English characters and digits
    print(test.keep_chi_eng_num(test_string, ''))

    # All-in-one pipeline: basic filtering
    print(test.evaluate(test_string, 'OnlinePipe'))

    # All-in-one pipeline: strong filtering
    print(test.evaluate(test_string, 'OnlinePipeStrictMore'))

    # All-in-one pipeline: strictest filtering
    print(test.evaluate(test_string, 'OnlinePipeStrictMost'))
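The README does not say which steps each `evaluate` pipeline runs. As a rough illustration of the idea only, the sketch below chains a few regex-based cleaning steps into one pass. Everything in it is an assumption and not the package's code: the names (`strip_urls`, `mask_long_numbers`, `drop_bracketed`, `collapse_repeats`, `run_pipeline`, `BASIC_PIPE`), the 8-digit threshold for a "long" number, and the order of the steps.

```python
import re

# A URL runs until whitespace or a Chinese character (simplifying assumption).
URL_RE = re.compile(r"https?://[^\s\u4e00-\u9fff]*")
# Treating 8 or more consecutive digits as a "long number" is an assumption.
LONG_NUM_RE = re.compile(r"\d{8,}")
# Negated character classes keep each match inside one bracket pair.
BRACKET_RE = re.compile(r"【[^】]*】|（[^）]*）|\([^)]*\)|\[[^\]]*\]")
# A run of the same punctuation mark or whitespace character.
REPEAT_RE = re.compile(r"([,，.。!！?？;；\s])\1+")


def strip_urls(text, repl=""):
    return URL_RE.sub(repl, text)


def mask_long_numbers(text, repl="LONG_NUM"):
    return LONG_NUM_RE.sub(repl, text)


def drop_bracketed(text, repl=""):
    return BRACKET_RE.sub(repl, text)


def collapse_repeats(text):
    # Keep a single copy of each repeated punctuation or space character.
    return REPEAT_RE.sub(r"\1", text)


def run_pipeline(text, steps):
    """Apply each step in order, feeding the output of one into the next."""
    for step in steps:
        text = step(text)
    return text


# Hypothetical "basic" pipeline; the real pipelines may differ.
BASIC_PIPE = [str.lower, strip_urls, mask_long_numbers,
              drop_bracketed, collapse_repeats]

sample = "我愛你 https://example.com/a ,,,,, Hello 121233124234213 [sdfsd]【abc】"
print(run_pipeline(sample, BASIC_PIPE))
```

One design note: a greedy pattern such as 【.*】 matches from the first opening bracket to the last closing bracket on a line, so text between two separate bracket pairs is removed as well. The negated character classes in the sketch avoid that.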


Download files

Source Distribution

TextProcess_Ora-0.0.6.tar.gz (194.5 kB view details)

Uploaded Source

Built Distribution

TextProcess_Ora-0.0.6-py2.py3-none-any.whl (293.6 kB view details)

Uploaded Python 2, Python 3

File details

Details for the file TextProcess_Ora-0.0.6.tar.gz.

File metadata

  • Download URL: TextProcess_Ora-0.0.6.tar.gz
  • Upload date:
  • Size: 194.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/1.15.0 pkginfo/1.5.0.1 requests/2.22.0 setuptools/42.0.1 requests-toolbelt/0.9.1 tqdm/4.39.0 CPython/2.7.10

File hashes

Hashes for TextProcess_Ora-0.0.6.tar.gz
Algorithm Hash digest
SHA256 89036683ee870eb61a5b5b374bae635374b045aac7025efbcb18d16dad79097d
MD5 bcb5c401644939f3ae562e03d9191c0a
BLAKE2b-256 e690d78bed820e0ed591d35e091f3fdc24f723c98cf624e357620959d383aa9b

File details

Details for the file TextProcess_Ora-0.0.6-py2.py3-none-any.whl.

File metadata

  • Download URL: TextProcess_Ora-0.0.6-py2.py3-none-any.whl
  • Upload date:
  • Size: 293.6 kB
  • Tags: Python 2, Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/1.15.0 pkginfo/1.5.0.1 requests/2.22.0 setuptools/42.0.1 requests-toolbelt/0.9.1 tqdm/4.39.0 CPython/2.7.10

File hashes

Hashes for TextProcess_Ora-0.0.6-py2.py3-none-any.whl
Algorithm Hash digest
SHA256 42d367ee149b4ba4d8dc9ca01045d93c6d00760b0e75949bf6b96e48587dcd28
MD5 e0684edce0ab34694dc9932c69a682ce
BLAKE2b-256 b12d71f065bc451f2bf45f9cb995c263debf841596a5e0ab53662b1142b317f3
