# Chinese Text Character-Set Analysis and Filtering Tool

## Overview

CharsetFilter: a UTF-8 character-set analysis and filtering tool

Version: V 1.0.3

Updated: xmxoxo, 2020/6/8

GitHub: https://github.com/xmxoxo/CharsetFilter

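Description: this tool divides the UTF-8 character set into 39 subsets and analyzes the characters in a text file, reporting for each subset the total number of characters found and the number of distinct characters that occur. It can also remove or keep characters by subset, which makes it well suited to tasks such as analyzing and filtering out invisible characters in NLP work.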
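The idea of splitting code points into subsets and counting totals and distinct characters can be sketched as follows. This is a minimal illustration, not the tool's implementation: the `SUBSETS` table, its three ranges, and the names `subset_of` and `analyze` are assumptions made for the example, whereas the real tool defines 39 subsets with its own boundaries.

```python
from collections import Counter, defaultdict

# Hypothetical subset table for illustration only; the real tool defines 39
# subsets with its own names, numbers, and boundaries.
SUBSETS = [
    ("control", 0x0000, 0x001F),
    ("ascii", 0x0020, 0x007E),
    ("cjk_ideograph", 0x4E00, 0x9FFF),
]


def subset_of(ch):
    """Return the name of the subset containing ch, or 'other'."""
    code = ord(ch)
    for name, lo, hi in SUBSETS:
        if lo <= code <= hi:
            return name
    return "other"


def analyze(text):
    """Map each subset name to (total characters, distinct characters)."""
    seen = defaultdict(Counter)
    for ch in text:
        seen[subset_of(ch)][ch] += 1
    return {name: (sum(c.values()), len(c)) for name, c in seen.items()}


report = analyze("中文abc中\t")
for name, (total, kinds) in sorted(report.items()):
    print(name, total, kinds)
```

Keeping one `Counter` per subset makes both figures cheap to derive: the sum of the counts is the total, and the number of keys is the number of distinct characters.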

Note: the text file being analyzed must be UTF-8 encoded.
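Before running the analysis, it can help to confirm that the input really is UTF-8. The sketch below uses only the Python standard library; the helper name `is_utf8` is an assumption for this example and is not part of CharsetFilter.

```python
def is_utf8(data: bytes) -> bool:
    """Return True if the byte string decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


utf8_bytes = "中文".encode("utf-8")
gbk_bytes = "中文".encode("gbk")  # the same text in a different encoding

print(is_utf8(utf8_bytes))  # True
print(is_utf8(gbk_bytes))   # False
```

For a file on disk, read it in binary mode (`open(path, "rb").read()`) and pass the bytes to the helper.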

## Object usage example

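```python
# Test
# CharsetFilter is the class provided by this package; import it first
# when calling this function from another module.
def test():
    objC = CharsetFilter()
    # The backslash is escaped so the literal contains exactly one '\'.
    txt = '中大1三K┫□\\,≯ó㈥l。 ・ ・ 。 ノ ♡不ε﹣¥▽ ̄ˊˋ﹉▲āōē﹑'
    # Debugging aids: look up the subset index of individual code points.
    #s = '。 ・ ・ 。 ノ ♡'
    #a = objC.segIndex(0x25b2)
    #a = objC.segIndex(0x2EF4)
    #a = objC.segIndex(0xFFFD)
    #a = objC.segIndex(0x0006)
    #a = objC.segIndex(0xFFFE)
    #a = objC.segIndex(0xFFA1)
    #a = objC.segIndex(0x2453)
    #a = objC.segIndex(0x2580) #0x25BD 0x2580
    #for x in txt:
    #    a = objC.segIndex(ord(x))
    #    print(x, hex(ord(x)), a)

    #print('-'*40)
    # Analyze the text and print the detailed report.
    strRet = objC.charAnalyze(txt, detail=1)
    print('Character-set analysis report'.center(40, '-'))
    print(strRet)

    remove = []
    remain = [2, 36]  # keep only Chinese characters and half-width English
    rettxt = objC.txtfilter(txt, remove=remove, remain=remain)
    print('Filtered result:\n%s' % rettxt)
    print('Original length: %d, filtered length: %d' % (len(txt), len(rettxt)))
```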
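The `remove` and `remain` arguments in the example above select subsets by number. One plausible reading of how they interact is sketched below, under stated assumptions: a non-empty `remain` list acts as a whitelist and otherwise `remove` acts as a blacklist. The subset numbers 0, 2, 3, and 36 are taken from this README, but the code-point ranges and the names `subset_id` and `filter_text` are hypothetical and do not come from the tool's source.

```python
def subset_id(ch):
    """Return an illustrative subset number for a character."""
    code = ord(ch)
    if code < 0x20 or code == 0x7F:
        return 3   # control characters
    if code <= 0x7E:
        return 36  # half-width English (assumed here: printable ASCII)
    if 0x4E00 <= code <= 0x9FFF:
        return 2   # Chinese characters (assumed: CJK Unified Ideographs)
    return 0       # not recognized by this sketch


def filter_text(text, remove=(), remain=()):
    """Keep only subsets in remain if given, else drop subsets in remove."""
    if remain:
        return "".join(ch for ch in text if subset_id(ch) in remain)
    return "".join(ch for ch in text if subset_id(ch) not in remove)


sample = "中a。\x01b"
kept = filter_text(sample, remain=[2, 36])
dropped = filter_text(sample, remove=[3])
print(repr(kept))
print(repr(dropped))
```

Giving the whitelist priority keeps the two arguments from conflicting; whether CharsetFilter resolves them the same way should be checked against its source.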

## Command-line usage examples

Analyze a text file's character set and print a brief summary:

CharsetFilter --file ./111.txt 

Analyze the character set and output detailed information; the details are saved to the file xxx_report.txt:

CharsetFilter --file ./111.txt --detail 1

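Analyze the character set and filter with the defaults (removing subset 0, "not yet recognized", and subset 3, "control characters"), then save the filtered result under an automatically generated name: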

CharsetFilter --file ./111.txt --filter 1

Analyze the character set, keep only subsets 1, 2, 36, and 39, and save the filtered result (automatically named xxx_out.txt):

CharsetFilter --file ./111.txt --filter 1 --remain_charset 1 2 36 39
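To make the shape of these options concrete, here is a sketch of how flags like the ones above could be declared with Python's standard `argparse` module. It is not the tool's actual source: the defaults, types, help strings, and the `build_parser` name are assumptions, and only the flag names come from the commands shown here.

```python
import argparse


def build_parser():
    """Build a parser mirroring the flags used in the examples (assumed)."""
    parser = argparse.ArgumentParser(prog="CharsetFilter")
    parser.add_argument("--file", required=True, help="UTF-8 text file")
    # Assumed default: 0 means brief output, 1 means a detailed report.
    parser.add_argument("--detail", type=int, default=0)
    # Assumed default: 0 means analyze only, 1 means also write filtered text.
    parser.add_argument("--filter", type=int, default=0)
    # nargs="*" collects a space-separated list such as "1 2 36 39".
    parser.add_argument("--remain_charset", type=int, nargs="*", default=[])
    return parser


args = build_parser().parse_args(
    ["--file", "./111.txt", "--filter", "1",
     "--remain_charset", "1", "2", "36", "39"]
)
print(args.file, args.filter, args.remain_charset)
```

With `nargs="*"`, the subset numbers arrive as a list of integers, which matches the way `--remain_charset 1 2 36 39` is written on the command line.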

