# Chinese Text Character Set Analysis and Filtering Tool
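## Overview
CharsetFilter: a UTF-8 character set analysis and filtering tool
Version: V 1.0.3
Updated: xmxoxo, 2020/6/8
GitHub: https://github.com/xmxoxo/CharsetFilter
Description: This tool divides the UTF-8 character set into 39 subsets. It analyzes the characters in a text file and reports, for each subset, the total number of characters and the number of distinct characters that occur. It can also remove or keep selected subsets, which makes it well suited to tasks such as analyzing and filtering invisible characters in NLP work.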
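The idea of assigning each code point to a subset and counting both total occurrences and distinct characters per subset can be sketched as follows. This is a minimal illustration, not the tool's implementation: the three ranges and their names are hypothetical stand-ins for the 39 subsets, and `classify` and `analyze` are names invented for this example.

```python
from bisect import bisect_right
from collections import Counter, defaultdict

# Hypothetical, sorted, non-overlapping ranges: (start, end, name).
# Illustrative only; these are NOT the tool's 39 subsets.
RANGES = [
    (0x0000, 0x001F, "control"),
    (0x0020, 0x007E, "ascii"),
    (0x4E00, 0x9FFF, "cjk"),
]
STARTS = [start for start, _, _ in RANGES]


def classify(ch):
    """Return the name of the range containing ch, or 'other'."""
    cp = ord(ch)
    # Find the last range whose start is <= cp, then check its end.
    i = bisect_right(STARTS, cp) - 1
    if i >= 0 and cp <= RANGES[i][1]:
        return RANGES[i][2]
    return "other"


def analyze(text):
    """Map each subset name to (total characters, distinct characters)."""
    totals = Counter()
    kinds = defaultdict(set)
    for ch in text:
        name = classify(ch)
        totals[name] += 1
        kinds[name].add(ch)
    return {name: (totals[name], len(kinds[name])) for name in totals}
```

A binary search over range starts keeps the lookup cheap even when the table grows to dozens of ranges.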
Note: the text file to be analyzed must be UTF-8 encoded.
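Because the input must be UTF-8, it can be worth checking a file before analyzing it. The helper below is a small sketch of such a check; `is_utf8` is a name invented here and is not part of CharsetFilter.

```python
def is_utf8(path):
    """Return True if the file at path decodes cleanly as UTF-8."""
    with open(path, "rb") as f:
        data = f.read()
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True
```

A file saved in another encoding such as GBK will usually fail this check and should be converted first.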
## Usage as an object
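    # ASSUMPTION: this import path is a guess; the README does not state it.
    from CharsetFilter import CharsetFilter

    # Test
    def test():
        objC = CharsetFilter()
        # The backslash is escaped so the literal contains a single '\' character.
        txt = '中大1三K┫□\\,≯ó㈥l。 ・ ・ 。 ノ ♡不ε﹣¥▽ ̄ˊˋ﹉▲āōē﹑'
        #s = '。 ・ ・ 。 ノ ♡'
        # Look up the subset index of individual code points:
        #a = objC.segIndex(0x25b2)
        #a = objC.segIndex(0x2EF4)
        #a = objC.segIndex(0xFFFD)
        #a = objC.segIndex(0x0006)
        #a = objC.segIndex(0xFFFE)
        #a = objC.segIndex(0xFFA1)
        #a = objC.segIndex(0x2453)
        #a = objC.segIndex(0x2580) #0x25BD 0x2580
        #for x in txt:
        #    a = objC.segIndex(ord(x))
        #    print(x, hex(ord(x)), a)
        #print('-'*40)

        # Analyze the text and print the detailed report.
        strRet = objC.charAnalyze(txt, detail=1)
        print('Character set analysis report'.center(40, '-'))
        print(strRet)

        remove = []
        remain = [2, 36]  # keep only Chinese characters and half-width English
        rettxt = objC.txtfilter(txt, remove=remove, remain=remain)
        print('Filtered result:\n%s' % rettxt)
        print('Original length: %d, filtered length: %d' % (len(txt), len(rettxt)))

    if __name__ == '__main__':
        test()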
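To make the `remove` and `remain` arguments more concrete, here is a toy sketch of one plausible reading: when `remain` is non-empty, only characters in those subsets are kept; otherwise characters in the `remove` subsets are dropped. This reading is an assumption, not taken from the tool's source. The `seg_index` function and its four index numbers are made up for this example and do not match the tool's 39 subsets.

```python
def seg_index(cp):
    """Toy classifier: map a code point to a made-up subset index."""
    if 0x30 <= cp <= 0x39:
        return 1  # ASCII digits
    if 0x41 <= cp <= 0x5A or 0x61 <= cp <= 0x7A:
        return 2  # ASCII letters
    if 0x4E00 <= cp <= 0x9FFF:
        return 3  # CJK Unified Ideographs
    return 0      # everything else


def txt_filter(text, remove=(), remain=()):
    """Keep only `remain` subsets if given; otherwise drop `remove` subsets."""
    remain, remove = set(remain), set(remove)
    if remain:
        return "".join(ch for ch in text if seg_index(ord(ch)) in remain)
    return "".join(ch for ch in text if seg_index(ord(ch)) not in remove)
```

With these toy indices, `txt_filter("中a1,大", remain=[2, 3])` keeps the letters and ideographs and drops the digit and the full-width comma.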
## Command-line usage
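Analyze the character sets in a text file and print a summary:

    CharsetFilter --file ./111.txt

Analyze the character sets and output detailed information; the details are saved to xxx_report.txt:

    CharsetFilter --file ./111.txt --detail 1

Analyze the character sets, filter with the default settings (removing "unrecognized 0" and "control characters 3"), and save the filtered result (named automatically):

    CharsetFilter --file ./111.txt --filter 1

Analyze the character sets, keep only subsets 1, 2, 36, and 39, and save the filtered result (automatically named xxx_out.txt):

    CharsetFilter --file ./111.txt --filter 1 --remain_charset 1 2 36 39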
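For readers who want to wrap similar options in their own script, the flags used in the commands above could be parsed with `argparse` roughly as below. This is only a sketch: the tool's actual parser, its defaults, and its naming logic are not shown in this README, so the defaults of 0, the use of `nargs="*"`, and the `derived_name` helper are all assumptions.

```python
import argparse
from pathlib import Path


def build_parser():
    """Parser mirroring the flags shown above (defaults are assumed)."""
    parser = argparse.ArgumentParser(prog="CharsetFilter")
    parser.add_argument("--file", required=True, help="UTF-8 text file to analyze")
    parser.add_argument("--detail", type=int, default=0, help="1 = write a detailed report")
    parser.add_argument("--filter", type=int, default=0, help="1 = save a filtered copy")
    parser.add_argument("--remain_charset", type=int, nargs="*", default=[],
                        help="subset numbers to keep")
    return parser


def derived_name(path, suffix):
    """Build '<stem>_<suffix>.txt' next to the input file (naming is assumed)."""
    p = Path(path)
    return str(p.with_name(f"{p.stem}_{suffix}.txt"))


# Parse the last command shown above from an explicit argument list.
args = build_parser().parse_args(
    ["--file", "./111.txt", "--filter", "1", "--remain_charset", "1", "2", "36", "39"]
)
```

Under this assumed naming scheme, an input of `./111.txt` would map to `111_out.txt` for the filtered text and `111_report.txt` for the detailed report.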