
NLP toolkit, including tokenization, sequence tagging, etc.

Project description

naivenlp


A toolkit of commonly used NLP utilities.

It mainly contains the modules described in the sections below.

Installation

pip install -U naivenlp

Install extra dependencies:

pip install pycorrector
pip install git+https://github.com/kpu/kenlm.git

Tokenizers

A Tokenizer splits text into tokens and also maps tokens to IDs.

The naivenlp.tokenizers module contains the following Tokenizer implementations:

  • JiebaTokenizer, a subclass of VocabBasedTokenizer; tokenization is done with jieba
  • CustomTokenizer, a subclass of VocabBasedTokenizer; a vocabulary-file-based Tokenizer that wraps a user-supplied tokenize_fn to implement arbitrary custom tokenization
  • TransformerTokenizer, a subclass of VocabBasedTokenizer; tokenization for Transformer models
  • BertTokenizer, a subclass of VocabBasedTokenizer; tokenization for BERT models

Using JiebaTokenizer

Tokenization is done with jieba.

from naivenlp.tokenizers import JiebaTokenizer

tokenizer = JiebaTokenizer(
    vocab_file='vocab.txt',
    pad_token='[PAD]',
    unk_token='[UNK]',
    bos_token='[BOS]',
    eos_token='[EOS]',
)

tokenizer.tokenize('hello world!', mode=0, hmm=True)

tokenizer.encode('hello world!', add_bos=False, add_eos=False)

Using CustomTokenizer

CustomTokenizer makes it easy to plug in a custom tokenization function.

As an example, use baidu/lac for tokenization.

pip install lac

from naivenlp.tokenizers import CustomTokenizer

from LAC import LAC

lac = LAC(mode='seg')

def lac_tokenize(text, **kwargs):
    return lac.run(text)


tokenizer = CustomTokenizer(
    vocab_file='vocab.txt',
    tokenize_fn=lac_tokenize,
    pad_token='[PAD]',
    unk_token='[UNK]',
    bos_token='[BOS]',
    eos_token='[EOS]',
)

tokenizer.tokenize('hello world!')

tokenizer.encode('hello world!', add_bos=False, add_eos=False)

Using BasicTokenizer

This tokenizer is simple to use and requires no vocabulary. It splits text on whitespace and provides the following features:

  • Splits on whitespace and special characters
  • Optionally lowercases the text
  • Optionally splits Chinese text character by character

from naivenlp.tokenizers import BasicTokenizer

tokenizer = BasicTokenizer(do_lower_case=True, tokenize_chinese_chars=True)

tokenizer.tokenize('hello world, 你好世界')

Using WordpieceTokenizer

WordPiece is a subword tokenization algorithm; see the relevant documentation for details.

WordpieceTokenizer requires a vocabulary map (token to ID).
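
The example below assumes such a map is already available in vocab. A minimal sketch for building one from a BERT-style vocab.txt (one token per line, with the line index as the ID; the load_vocab helper is illustrative and not part of naivenlp):

# Illustrative helper: build a token -> ID map from a BERT-style vocab.txt,
# where each line contains one token and the line index becomes its ID.
def load_vocab(vocab_file):
    vocab = {}
    with open(vocab_file, encoding='utf-8') as f:
        for index, line in enumerate(f):
            token = line.rstrip('\n')
            if token:
                vocab[token] = index
    return vocab

vocab = load_vocab('vocab.txt')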

from naivenlp.tokenizers import WordpieceTokenizer

tokenizer = WordpieceTokenizer(vocab=vocab, unk_token='[UNK]')

tokenizer.tokenize('hello world, 你好世界')

Using TransformerTokenizer

from naivenlp.tokenizers import TransformerTokenizer


tokenizer = TransformerTokenizer(vocab_file='vocab.txt')

tokenizer.tokenize('Hello World, 你好世界')

tokenizer.encode('Hello World, 你好世界', add_bos=False, add_eos=False)

Using BertTokenizer

from naivenlp.tokenizers import BertTokenizer


tokenizer = BertTokenizer(vocab_file='vocab.txt', cls_token='[CLS]', sep_token='[SEP]', mask_token='[MASK]')

tokenizer.tokenize('Hello World, 你好世界')

tokenizer.encode('Hello World, 你好世界', add_bos=False, add_eos=False)

Correctors

Text correction, supporting both the traditional approach of n-gram language models plus dictionaries and deep-learning based methods.

Correction with an n-gram language model and dictionary

The KenLMCorrector here is a wrapper around the shibing624/pycorrector project.

from naivenlp.correctors import KenLMCorrector

c = KenLMCorrector()
texts = [
    '软件开发工成师',
    '少先队员因该为老人让坐',
]

for text in texts:
    print(c.correct(text))

This produces the following correction results:

('软件开发工程师', [('工成师', '工程师', 4, 7)])
('少先队员应该为老人让座', [('因该', '应该', 4, 6), ('坐', '座', 10, 11)])

Deep-learning based correction

Correction is done mainly with seq2seq models, for example:

  • RNN + Attention, the traditional seq2seq model
  • Transformer models

TODO

Similarity

A collection of string similarity measures, wrapping luozhouyang/python-string-similarity.

>>> import naivenlp
>>> a = 'ACCTTTDEX'
>>> b = 'CGGTTEEXX'
>>> naivenlp.cosine_distance(a, b)
1.0
>>> naivenlp.cosine_similarity(a, b)
1.0
>>> naivenlp.jaccard_distance(a, b)
1.0
>>> naivenlp.jaccard_similarity(a, b)
0.0
>>> naivenlp.levenshtein_distance(a, b)
5
>>> naivenlp.levenshtein_distance_normalized(a, b)
0.5555555555555556
>>> naivenlp.levenshtein_similarity(a, b)
0.4444444444444444
>>> naivenlp.weighted_levenshtein_distance(a, b)
5.0
>>> naivenlp.damerau_distance(a, b)
5
>>> naivenlp.lcs_distance(a, b)
8
>>> naivenlp.lcs_length(a, b)
5
>>> naivenlp.sorense_dice_distance(a, b)
1.0
>>> naivenlp.sorense_dice_similarity(a, b)
0.0
>>> naivenlp.optimal_string_alignment_distance(a, b)
5
>>> 

Structures

Implementations of common data structures.

Currently supported:

  • Trie (prefix tree)

Using Trie

>>> import naivenlp
>>> trie = naivenlp.Trie()
>>> trie.put('上海市浦东新区')
>>> trie.show()
.
|    +----上
|    |    +----海
|    |    |    +----市
|    |    |    |    +----浦
|    |    |    |    |    +----东
|    |    |    |    |    |    +----新
|    |    |    |    |    |    |    +----区
>>> trie.put('上海市黄浦区')
>>> trie.show()
.
|    +----上
|    |    +----海
|    |    |    +----市
|    |    |    |    +----浦
|    |    |    |    |    +----东
|    |    |    |    |    |    +----新
|    |    |    |    |    |    |    +----区
|    |    |    |    +----黄
|    |    |    |    |    +----浦
|    |    |    |    |    |    +----区
>>> for r in trie.keys_with_prefix('上海市'):
...     print(r)
... 
['上', '海', '市', '浦', '东', '新', '区']
['上', '海', '市', '黄', '浦', '区']
>>> 

Utils

Common text operations:

  • naivenlp.q2b(s): convert full-width characters to half-width
  • naivenlp.b2q(s): convert half-width characters to full-width
  • naivenlp.split_sentence(s): split a long text into a list of sentences
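
A minimal usage sketch of these helpers (the input strings are illustrative):

import naivenlp

# Full-width to half-width, e.g. 'Ｈｅｌｌｏ' -> 'Hello'
naivenlp.q2b('Ｈｅｌｌｏ，ｗｏｒｌｄ！')

# Half-width to full-width
naivenlp.b2q('Hello, world!')

# Split a long text into a list of sentences
naivenlp.split_sentence('今天天气不错。我们出去走走吧!')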

Datasource

Data collection module. Currently supported:

  • Download all Sogou lexicons and save them as text files

Downloading the Sogou lexicons

from naivenlp.datasources import sogou_datasource as sg

# Download all lexicons under category_id=1 and save them to /tmp/sogou
sg.download_category(1, '/tmp/sogou')

# Download all categories and save them to /tmp/sogou
sg.download_all_category('/tmp/sogou')

# Merge all downloaded files into a single file
sg.collect('/tmp/sogou', './sogou.vocab', maxlen=6)

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

naivenlp-0.0.8.tar.gz (19.8 kB)

Uploaded Source

Built Distribution

naivenlp-0.0.8-py3-none-any.whl (29.7 kB)

Uploaded Python 3

File details

Details for the file naivenlp-0.0.8.tar.gz.

File metadata

  • Download URL: naivenlp-0.0.8.tar.gz
  • Upload date:
  • Size: 19.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.22.0 setuptools/47.1.1.post20200604 requests-toolbelt/0.9.1 tqdm/4.46.1 CPython/3.6.10

File hashes

Hashes for naivenlp-0.0.8.tar.gz:

  • SHA256: fdfcafdbcf1c95818df70f3739f375b440e98637c6d89ee9263d6a0602a7fc85
  • MD5: 9127db6a1de730664383b075defbcccc
  • BLAKE2b-256: c627a39cf0ea74efc84e415b1e694d0ac2bc65be00ed3e6d0b8dee78490e71f9

See more details on using hashes here.

File details

Details for the file naivenlp-0.0.8-py3-none-any.whl.

File metadata

  • Download URL: naivenlp-0.0.8-py3-none-any.whl
  • Upload date:
  • Size: 29.7 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.22.0 setuptools/47.1.1.post20200604 requests-toolbelt/0.9.1 tqdm/4.46.1 CPython/3.6.10

File hashes

Hashes for naivenlp-0.0.8-py3-none-any.whl:

  • SHA256: 8bc5ffd8c5fadbe115f894d78b6b5d1a326c9326899dae34605f0b91098850d8
  • MD5: 535582f24f769041997a0918f4bfa4ec
  • BLAKE2b-256: 67bb874fb6bc598444ff1ed093727ae00f0f45a4ceccd5445c593a66e3d63ef6

See more details on using hashes here.
