New Word Discovery Algorithm
pyUnit-NewWord
Builds a word lexicon from raw text through unsupervised training.
Installation
pip install pyunit-newword
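Notes
The algorithm stores its data in hash dictionaries and uses a lot of memory. Processing 100 MB of pure Chinese text requires more than 12 GB of RAM; with less, the run time becomes excessive.
Changelog
Added a model that identifies new words automatically, so parameters no longer have to be set by hand.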
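To give a rough sense of why dictionary-based storage grows so quickly, here is a minimal sketch that counts every substring up to a fixed length in a `Counter` (a hash dictionary). It is an illustration only, not pyunit-newword's internal code; the function name `count_ngrams` and the `max_len` parameter are assumptions made for this example.

```python
from collections import Counter


def count_ngrams(text, max_len=4):
    """Count every substring of length 1..max_len in a hash-based Counter.

    Hypothetical helper for illustration; not part of pyunit-newword.
    """
    counts = Counter()
    for i in range(len(text)):
        for n in range(1, max_len + 1):
            if i + n <= len(text):
                counts[text[i:i + n]] += 1
    return counts


sample = "微博数据微博数据"
counts = count_ngrams(sample, max_len=2)
# Number of distinct keys held in memory, and how often one candidate occurs
print(len(counts), counts["微博"])
```

Every distinct substring becomes a dictionary key, so the number of entries rises with both corpus size and the maximum candidate length. That is consistent with the heavy memory use described above.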
Training with manually set parameters, without the model (the text must be UTF-8 encoded)
from pyunit_newword import NewWords

if __name__ == '__main__':
    # Set the filtering thresholds by hand
    nw = NewWords(filter_cond=10, filter_free=2)
    # Load the UTF-8 training corpus
    nw.add_text(r'C:\Users\Administrator\Desktop\微博数据.txt')
    nw.analysis_data()
    # Write each discovered word on its own line
    with open('分析结果.txt', 'w', encoding='utf-8') as f:
        for word in nw.get_words():
            print(word)
            f.write(word[0] + '\n')
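The meaning of `filter_cond` and `filter_free` is not documented here. Many unsupervised new-word discovery methods filter candidates by a cohesion score (how strongly the characters stick together) and a freedom score (how varied the neighbouring characters are). The sketch below illustrates those two measures on the assumption that the parameters play a similar role. It is not the library's implementation, and the function names `cohesion`, `entropy`, and `freedom` are hypothetical.

```python
import math
from collections import Counter


def cohesion(word, counts, total):
    """Minimum pointwise mutual information over all two-way splits of word."""
    if len(word) < 2:
        return 0.0  # a single character has no internal split
    p_word = counts[word] / total
    scores = []
    for i in range(1, len(word)):
        p_left = counts[word[:i]] / total
        p_right = counts[word[i:]] / total
        scores.append(math.log(p_word / (p_left * p_right)))
    return min(scores)


def entropy(neighbors):
    """Shannon entropy (natural log) of a Counter of neighbouring characters."""
    total = sum(neighbors.values())
    return -sum(c / total * math.log(c / total) for c in neighbors.values())


def freedom(word, text):
    """Smaller of the left and right neighbour entropies of word in text."""
    left, right = Counter(), Counter()
    start = text.find(word)
    while start != -1:
        if start > 0:
            left[text[start - 1]] += 1
        end = start + len(word)
        if end < len(text):
            right[text[end]] += 1
        start = text.find(word, start + 1)
    if not left or not right:
        return 0.0
    return min(entropy(left), entropy(right))


text = "吃葡萄不吐葡萄皮不吃葡萄倒吐葡萄皮"
# Count all 1- and 2-character substrings of the sample text
counts = Counter(text[i:i + n] for n in (1, 2) for i in range(len(text) - n + 1))
print(cohesion("葡萄", counts, len(text)))
print(freedom("葡萄", text), freedom("葡", text))
```

In this toy text, "葡萄" appears with several different neighbours on both sides, while "葡" is always followed by "萄", so the fragment gets a freedom score of zero. A threshold on each score would then drop such fragments.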
Unsupervised training with the new-word model
from pyunit_newword import NewWords

if __name__ == '__main__':
    # Let the model pick the thresholds; only accuracy is set
    nw = NewWords(accuracy=0.01)
    # Load the UTF-8 training corpus
    nw.add_text(r'C:\Users\Administrator\Desktop\微博数据.txt')
    nw.analysis_data()
    # Write each discovered word on its own line
    with open('分析结果.txt', 'w', encoding='utf-8') as f:
        for word in nw.get_words():
            print(word)
            f.write(word[0] + '\n')
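The training text is expected to be UTF-8 encoded. If a corpus was saved in another encoding, it can be re-encoded before it is passed to `add_text`. The following is a minimal sketch using only the standard library; the helper name `to_utf8` and the default source encoding `gb18030` are assumptions for illustration and are not part of pyunit-newword.

```python
import tempfile
from pathlib import Path


def to_utf8(src, dst, src_encoding="gb18030"):
    """Re-encode a text file to UTF-8.

    Hypothetical helper; src_encoding must match how the file was saved.
    """
    text = Path(src).read_text(encoding=src_encoding)
    Path(dst).write_text(text, encoding="utf-8")
    return Path(dst)


# Demonstration on a temporary file, so no real corpus is touched
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "weibo_gbk.txt"
    dst = Path(tmp) / "weibo_utf8.txt"
    src.write_text("微博数据", encoding="gb18030")
    to_utf8(src, dst)
    converted = dst.read_text(encoding="utf-8")

print(converted)
```

This helper reads the whole file into memory, which is fine for small files. For a corpus of around 100 MB, reading and writing line by line would be gentler on memory.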
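Weibo data download
Partial screenshot of the crawled Weibo data (about 100 MB of plain text)
Results after training on the Weibo data
Video of the words obtained after training
Source of the algorithm implementation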
File details
Details for the file pyunit_newword-2020.2.12-py3-none-any.whl.
File metadata
- Download URL: pyunit_newword-2020.2.12-py3-none-any.whl
- Upload date:
- Size: 26.8 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.22.0 setuptools/40.8.0 requests-toolbelt/0.9.1 tqdm/4.42.1 CPython/3.7.4
File hashes
Algorithm | Hash digest
---|---
SHA256 | 1971c5d9a704960a97e6086122ca67addfb8b28cd0d79636f6fbf745729a44c1
MD5 | e7ef0b42a73686667b93592ff98ea2bd
BLAKE2b-256 | 69145fb694951f95c006e616184d467c219db4887a14c6eca7d30b0fafebdfed