New-word discovery algorithm

pyUnit-NewWord

Unsupervised construction of a word lexicon from raw text

Installation

pip install pyunit-newword

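Notes

The algorithm keeps its statistics in hash dictionaries, so it consumes a large amount of memory. Processing 100 MB of pure Chinese text needs more than 12 GB of RAM; with less, the run time becomes prohibitively long.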
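To give a feel for why dictionary-based counting is memory-hungry, the following is a minimal sketch that counts every substring up to a given length in a `Counter`. It is not the package's own code: `count_ngrams` and `max_len` are hypothetical names chosen for illustration, and the real implementation may store different statistics.

```python
from collections import Counter


def count_ngrams(text: str, max_len: int = 4) -> Counter:
    """Count every substring of length 1..max_len in `text` (illustrative helper)."""
    counts = Counter()
    for n in range(1, max_len + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts


if __name__ == '__main__':
    sample = '今天天气很好今天心情很好'
    counts = count_ngrams(sample, max_len=3)
    # The number of distinct keys grows much faster than the text itself.
    print(len(sample), 'characters ->', len(counts), 'distinct n-grams')
```

Every distinct n-gram becomes a dictionary key with its own string object, which is why the number of entries, and with it the memory footprint, grows quickly on a large corpus.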

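Changelog

Added a model that identifies new words automatically, so the parameters no longer have to be set by hand.

Training without the model (the input text must be UTF-8 encoded)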

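from pyunit_newword import NewWords

if __name__ == '__main__':
    # In this mode the filter thresholds are set by hand.
    nw = NewWords(filter_cond=10, filter_free=2)
    # Path to the UTF-8 training corpus (here, the Weibo data).
    nw.add_text(r'C:\Users\Administrator\Desktop\微博数据.txt')
    nw.analysis_data()
    # Save the results, one word per line.
    with open('分析结果.txt', 'w', encoding='utf-8') as f:
        for word in nw.get_words():
            print(word)
            f.write(word[0] + '\n')  # only the first element of each result is written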
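Because the training text has to be UTF-8, a corpus saved in another encoding needs converting first. The sketch below uses only the standard library; `to_utf8`, `demo`, and the default source encoding `gb18030` are assumptions made for illustration (text files produced on Chinese-language Windows systems are often GBK/GB18030), not part of pyunit-newword.

```python
import os
import tempfile
from pathlib import Path


def to_utf8(src: str, dst: str, src_encoding: str = 'gb18030') -> None:
    """Re-encode the text file `src` as UTF-8 and write it to `dst`."""
    text = Path(src).read_text(encoding=src_encoding)
    Path(dst).write_text(text, encoding='utf-8')


def demo() -> str:
    """Round-trip a small GB18030 file and return the UTF-8 result."""
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, 'corpus_gb.txt')
        dst = os.path.join(tmp, 'corpus_utf8.txt')
        Path(src).write_text('微博数据', encoding='gb18030')
        to_utf8(src, dst)
        return Path(dst).read_text(encoding='utf-8')


if __name__ == '__main__':
    print(demo())
```

The converted file can then be passed to `add_text` in place of the original path.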

Unsupervised training with the new-word model

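from pyunit_newword import NewWords

if __name__ == '__main__':
    # Only `accuracy` is given; no filter thresholds are set by hand.
    nw = NewWords(accuracy=0.01)
    # Path to the UTF-8 training corpus (here, the Weibo data).
    nw.add_text(r'C:\Users\Administrator\Desktop\微博数据.txt')
    nw.analysis_data()
    # Save the results, one word per line.
    with open('分析结果.txt', 'w', encoding='utf-8') as f:
        for word in nw.get_words():
            print(word)
            f.write(word[0] + '\n')  # only the first element of each result is written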

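Weibo data download

Click to download the Weibo data

Partial screenshot of the crawled Weibo data (about 100 MB of plain text)

Weibo data

Results after training on the Weibo data

5 words

Video of the words obtained after training

Word video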

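Source of the algorithm

"A Weibo new-word discovery method based on improved mutual information and adjacency entropy" (基于改进互信息和邻接熵的微博新词发现方法)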
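The two measures named in the title can be sketched as follows. This is a minimal illustration of textbook pointwise mutual information and adjacency (branching) entropy, not the improved variants from the paper and not the package's code. All function names are hypothetical, and using the text length as the normaliser for every n-gram length is a simplifying assumption.

```python
import math
from collections import Counter


def ngram_counts(text: str, max_len: int) -> Counter:
    """Count all substrings of length 1..max_len."""
    counts = Counter()
    for n in range(1, max_len + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts


def pmi(word: str, counts: Counter, total: int) -> float:
    """Lowest pointwise mutual information over all two-way splits of `word`.

    `word` must have at least two characters and all of its parts must be
    present in `counts`. A high value means the parts co-occur far more
    often than chance.
    """
    p_word = counts[word] / total
    return min(
        math.log2(p_word / ((counts[word[:i]] / total) * (counts[word[i:]] / total)))
        for i in range(1, len(word))
    )


def entropy(neighbors: Counter) -> float:
    """Shannon entropy (in bits) of a neighbour-character distribution."""
    n = sum(neighbors.values())
    if n == 0:
        return 0.0
    return -sum(c / n * math.log2(c / n) for c in neighbors.values())


def adjacency_entropy(word: str, text: str) -> float:
    """Smaller of the left and right neighbour entropies of `word` in `text`."""
    left, right = Counter(), Counter()
    start = text.find(word)
    while start != -1:
        if start > 0:
            left[text[start - 1]] += 1
        end = start + len(word)
        if end < len(text):
            right[text[end]] += 1
        start = text.find(word, start + 1)
    return min(entropy(left), entropy(right))


if __name__ == '__main__':
    text = 'abxabyabz'
    counts = ngram_counts(text, max_len=2)
    print('PMI:', pmi('ab', counts, len(text)))
    print('adjacency entropy:', adjacency_entropy('ab', text))
```

A candidate string is a plausible word when both scores are high: its characters stick together (mutual information) and it appears in varied contexts on both sides (adjacency entropy).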



Download files

Source Distributions

No source distribution files are available for this release.

Built Distribution

pyunit_newword-2020.2.12-py3-none-any.whl (26.8 kB view details)

Uploaded Python 3

File details

Details for the file pyunit_newword-2020.2.12-py3-none-any.whl.

File metadata

  • Download URL: pyunit_newword-2020.2.12-py3-none-any.whl
  • Upload date:
  • Size: 26.8 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.22.0 setuptools/40.8.0 requests-toolbelt/0.9.1 tqdm/4.42.1 CPython/3.7.4

File hashes

Hashes for pyunit_newword-2020.2.12-py3-none-any.whl
Algorithm Hash digest
SHA256 1971c5d9a704960a97e6086122ca67addfb8b28cd0d79636f6fbf745729a44c1
MD5 e7ef0b42a73686667b93592ff98ea2bd
BLAKE2b-256 69145fb694951f95c006e616184d467c219db4887a14c6eca7d30b0fafebdfed
