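# Chinese Text Character-Set Analysis and Filtering Tool

## Overview

CharsetFilter: a UTF-8 character-set analysis and filtering tool

Version: V 1.0.3

Updated: xmxoxo 2020/6/8

GitHub: https://github.com/xmxoxo/CharsetFilter

Description: This tool divides the UTF-8 character set into 39 subsets. It analyzes the characters in a text file and reports, for each subset, the total number of characters and the number of distinct characters that occur. It can also remove or keep the characters of chosen subsets, which makes it well suited to NLP work such as analyzing and filtering out invisible characters.

Note: the text file to be analyzed must be UTF-8 encoded.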
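
To make the idea of "total count versus number of distinct characters per subset" concrete, here is a minimal standalone sketch. It does not use CharsetFilter itself: the three ranges in `SUBSETS` and the names `seg_name` and `analyze` are hypothetical and stand in for the tool's real 39 subsets, which are not reproduced here.

```python
from collections import defaultdict

# Hypothetical subsets for illustration only; the tool's real 39 subsets
# and their index numbers are not reproduced here.
SUBSETS = [
    ("control", 0x0000, 0x001F),
    ("ascii", 0x0020, 0x007E),
    ("cjk_ideograph", 0x4E00, 0x9FFF),
]


def seg_name(code_point):
    """Return the name of the first subset whose range contains code_point."""
    for name, lo, hi in SUBSETS:
        if lo <= code_point <= hi:
            return name
    return "unrecognized"


def analyze(text):
    """Map subset name -> (total occurrences, number of distinct characters)."""
    seen = defaultdict(list)
    for ch in text:
        seen[seg_name(ord(ch))].append(ch)
    return {name: (len(chars), len(set(chars))) for name, chars in seen.items()}


report = analyze("中文abc中\t")
for name, (total, distinct) in report.items():
    print(name, total, distinct)
```

In this sketch the repeated character 中 raises the total for its subset but not the distinct count, which is the distinction the description draws between the two statistics.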
## Usage example: calling the class from Python
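
    # Test
    # Assumption: the class is importable from a module of the same name;
    # the original snippet does not show the import.
    from CharsetFilter import CharsetFilter

    def test():
        objC = CharsetFilter()
        txt = '中大1三K┫□\,≯ó㈥l。 ・ ・ 。 ノ ♡不ε﹣¥▽ ̄ˊˋ﹉▲āōē﹑'
        # Debugging aids: look up the subset index of individual code points.
        # s = '。 ・ ・ 。 ノ ♡'
        # a = objC.segIndex(0x25b2)
        # a = objC.segIndex(0x2EF4)
        # a = objC.segIndex(0xFFFD)
        # a = objC.segIndex(0x0006)
        # a = objC.segIndex(0xFFFE)
        # a = objC.segIndex(0xFFA1)
        # a = objC.segIndex(0x2453)
        # a = objC.segIndex(0x2580)  # 0x25BD 0x2580
        # for x in txt:
        #     a = objC.segIndex(ord(x))
        #     print(x, hex(ord(x)), a)
        # print('-' * 40)

        # Analyze the string and print the detailed report.
        strRet = objC.charAnalyze(txt, detail=1)
        print('Character-set analysis report'.center(40, '-'))
        print(strRet)

        remove = []
        remain = [2, 36]  # keep only Chinese characters and half-width English
        rettxt = objC.txtfilter(txt, remove=remove, remain=remain)
        print('Filtered result:\n%s' % rettxt)
        print('Original length: %d, filtered length: %d' % (len(txt), len(rettxt)))
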
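
The example passes both a `remove` list and a `remain` list to `txtfilter`. How the tool combines the two is not stated here, so the sketch below only illustrates one plausible reading: a non-empty `remain` acts as a whitelist, otherwise `remove` acts as a blacklist. The subset names, the ranges, and the functions `seg_name` and `filter_text` are all hypothetical and are not the library's API.

```python
# Hypothetical subsets, keyed by name to avoid any confusion with the
# tool's real subset numbers.
SUBSETS = {
    "control": (0x0000, 0x001F),
    "ascii": (0x0020, 0x007E),
    "cjk_ideograph": (0x4E00, 0x9FFF),
}


def seg_name(code_point):
    """Return the name of the subset containing code_point."""
    for name, (lo, hi) in SUBSETS.items():
        if lo <= code_point <= hi:
            return name
    return "unrecognized"


def filter_text(text, remove=(), remain=()):
    """Assumed semantics: whitelist if remain is given, else blacklist."""
    if remain:
        return "".join(ch for ch in text if seg_name(ord(ch)) in remain)
    return "".join(ch for ch in text if seg_name(ord(ch)) not in remove)


sample = "中a\x06大♡"
kept = filter_text(sample, remain=["ascii", "cjk_ideograph"])
dropped = filter_text(sample, remove=["control"])
print(repr(kept), repr(dropped))
```

Under these assumptions the whitelist call drops both the control character and the unrecognized symbol, while the blacklist call drops only the control character.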
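## Command-line usage examples

Analyze the character set of a text file and print a brief summary:

    CharsetFilter --file ./111.txt

Analyze the character set of a text file and output detailed information; the details are saved to xxx_report.txt:

    CharsetFilter --file ./111.txt --detail 1

Analyze the character set and filter with the default settings (removes "unrecognized 0" and "control characters 3"), saving the filtered result under an automatically generated name:

    CharsetFilter --file ./111.txt --filter 1

Analyze the character set and keep only subsets 1, 2, 36 and 39, saving the filtered result (automatically named xxx_out.txt):

    CharsetFilter --file ./111.txt --filter 1 --remain_charset 1 2 36 39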
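
Because the tool expects UTF-8 input, it can be worth checking a file's encoding before running the commands above. The following is a minimal sketch using only the Python standard library; `is_utf8` is a hypothetical helper and not part of CharsetFilter, and the two temporary files exist only to demonstrate it.

```python
import tempfile
from pathlib import Path


def is_utf8(path):
    """Return True if the entire file decodes as UTF-8."""
    try:
        Path(path).read_bytes().decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# Demonstrate on two temporary files: one UTF-8, one GBK.
with tempfile.TemporaryDirectory() as tmp:
    good = Path(tmp) / "good.txt"
    bad = Path(tmp) / "bad.txt"
    good.write_bytes("中文".encode("utf-8"))
    bad.write_bytes("中文".encode("gbk"))
    good_ok = is_utf8(good)
    bad_ok = is_utf8(bad)

print(good_ok, bad_ok)
```

A file that fails this check could be re-encoded to UTF-8 first, provided its actual encoding is known.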