NLP toolkit, including tokenization, sequence tagging, etc.
naivenlp
A toolkit of common NLP utilities.
It mainly contains the following modules: Tokenizers, Correctors, Similarity, Structures, Utils, and Datasource.
Installation
pip install -U naivenlp
Install extra dependencies:
pip install pycorrector
pip install git+https://github.com/kpu/kenlm.git
Tokenizers
A Tokenizer splits text into tokens and also maps tokens to IDs.
The naivenlp.tokenizers module contains the following Tokenizer implementations:
- JiebaTokenizer, a subclass of VocabBasedTokenizer that tokenizes with jieba
- CustomTokenizer, a subclass of VocabBasedTokenizer driven by a vocabulary file; it wraps a user-supplied tokenize_fn, so any custom tokenization logic can be plugged in
- TransformerTokenizer, a subclass of VocabBasedTokenizer used to tokenize input for Transformer models
- BertTokenizer, a subclass of VocabBasedTokenizer used to tokenize input for BERT models
Using JiebaTokenizer
Tokenization is performed with jieba.
from naivenlp.tokenizers import JiebaTokenizer
tokenizer = JiebaTokenizer(
    vocab_file='vocab.txt',
    pad_token='[PAD]',
    unk_token='[UNK]',
    bos_token='[BOS]',
    eos_token='[EOS]',
)
tokenizer.tokenize('hello world!', mode=0, hmm=True)
tokenizer.encode('hello world!', add_bos=False, add_eos=False)
Using CustomTokenizer
CustomTokenizer lets you plug in your own tokenization step. The example below uses baidu/lac for segmentation.
pip install lac
from naivenlp.tokenizers import CustomTokenizer
from LAC import LAC
lac = LAC(mode='seg')
def lac_tokenize(text, **kwargs):
    return lac.run(text)

tokenizer = CustomTokenizer(
    vocab_file='vocab.txt',
    tokenize_fn=lac_tokenize,
    pad_token='[PAD]',
    unk_token='[UNK]',
    bos_token='[BOS]',
    eos_token='[EOS]',
)
tokenizer.tokenize('hello world!')
tokenizer.encode('hello world!', add_bos=False, add_eos=False)
Using BasicTokenizer
This tokenizer is simple to use and requires no vocabulary; it splits text on whitespace. It can:
- split on whitespace and special characters
- optionally lowercase the text
- optionally split Chinese text into individual characters
from naivenlp.tokenizers import BasicTokenizer
tokenizer = BasicTokenizer(do_lower_case=True, tokenize_chinese_chars=True)
tokenizer.tokenize('hello world, 你好世界')
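With lowercasing and Chinese-character splitting enabled, this would typically yield something like ['hello', 'world', ',', '你', '好', '世', '界']; the exact handling of punctuation depends on the implementation, so treat this as an illustrative expectation rather than guaranteed output.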
Using WordpieceTokenizer
WordPiece is a subword tokenization algorithm; see the relevant literature for details.
WordpieceTokenizer takes the vocabulary as a token-to-ID map rather than a file path.
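One way to obtain such a map is to load it from the same vocab.txt used by the other tokenizers. This is only a sketch and assumes the file lists one token per line:
# A minimal sketch, assuming vocab.txt contains one token per line;
# adjust the parsing if your vocabulary file uses a different format.
with open('vocab.txt', encoding='utf-8') as f:
    vocab = {token.strip(): index for index, token in enumerate(f) if token.strip()}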
from naivenlp.tokenizers import WordpieceTokenizer
tokenizer = WordpieceTokenizer(vocab=vocab, unk_token='[UNK]')
tokenizer.tokenize('hello world, 你好世界')
Using TransformerTokenizer
from naivenlp.tokenizers import TransformerTokenizer
tokenizer = TransformerTokenizer(vocab_file='vocab.txt')
tokenizer.tokenize('Hello World, 你好世界')
tokenizer.encode('Hello World, 你好世界', add_bos=False, add_eos=False)
Using BertTokenizer
from naivenlp.tokenizers import BertTokenizer
tokenizer = BertTokenizer(vocab_file='vocab.txt', cls_token='[CLS]', sep_token='[SEP]', mask_token='[MASK]')
tokenizer.tokenize('Hello World, 你好世界')
tokenizer.encode('Hello World, 你好世界', add_bos=False, add_eos=False)
Correctors
Text correction, using either a traditional n-gram language model combined with dictionaries, or deep-learning-based methods.
Correction with an n-gram language model and dictionaries
The KenLMCorrector here is a wrapper around the shibing624/pycorrector project.
from naivenlp.correctors import KenLMCorrector
c = KenLMCorrector()
texts = [
    '软件开发工成师',
    '少先队员因该为老人让坐',
]
for text in texts:
    print(c.correct(text))
The corrections produced:
('软件开发工程师', [('工成师', '工程师', 4, 7)])
('少先队员应该为老人让座', [('因该', '应该', 4, 6), ('坐', '座', 10, 11)])
Deep-learning-based correction
This approach mainly relies on seq2seq models for correction, for example:
- the classic seq2seq model (RNN + attention)
- the Transformer model
TODO
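Until that is implemented, here is a hedged sketch of how a Transformer-based seq2seq corrector could be driven with the Hugging Face transformers library. The checkpoint name is a placeholder for a model fine-tuned on correction pairs; none of this is part of naivenlp's current API.
# Hypothetical sketch: requires a seq2seq checkpoint fine-tuned for text correction.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = 'path/to/your-correction-model'  # placeholder, not a real checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

inputs = tokenizer('少先队员因该为老人让坐', return_tensors='pt')
outputs = model.generate(**inputs, max_length=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))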
Similarity
Various string similarity and distance measures, wrapping the luozhouyang/python-string-similarity project.
>>> import naivenlp
>>> a = 'ACCTTTDEX'
>>> b = 'CGGTTEEXX'
>>> naivenlp.cosine_distance(a, b)
1.0
>>> naivenlp.cosine_similarity(a, b)
1.0
>>> naivenlp.jaccard_distance(a, b)
1.0
>>> naivenlp.jaccard_similarity(a, b)
0.0
>>> naivenlp.levenshtein_distance(a, b)
5
>>> naivenlp.levenshtein_distance_normalized(a, b)
0.5555555555555556
>>> naivenlp.levenshtein_similarity(a, b)
0.4444444444444444
>>> naivenlp.weighted_levenshtein_distance(a, b)
5.0
>>> naivenlp.damerau_distance(a, b)
5
>>> naivenlp.lcs_distance(a, b)
8
>>> naivenlp.lcs_length(a, b)
5
>>> naivenlp.sorense_dice_distance(a, b)
1.0
>>> naivenlp.sorense_dice_similarity(a, b)
0.0
>>> naivenlp.optimal_string_alignment_distance(a, b)
5
>>>
Structures
Implementations of commonly used data structures.
Currently supported:
- Trie (prefix tree)
Using the Trie
>>> import naivenlp
>>> trie = naivenlp.Trie()
>>> trie.put('上海市浦东新区')
>>> trie.show()
.
| +----上
| | +----海
| | | +----市
| | | | +----浦
| | | | | +----东
| | | | | | +----新
| | | | | | | +----区
>>> trie.put('上海市黄浦区')
>>> trie.show()
.
| +----上
| | +----海
| | | +----市
| | | | +----浦
| | | | | +----东
| | | | | | +----新
| | | | | | | +----区
| | | | +----黄
| | | | | +----浦
| | | | | | +----区
>>> for r in trie.keys_with_prefix('上海市'):
... print(r)
...
['上', '海', '市', '浦', '东', '新', '区']
['上', '海', '市', '黄', '浦', '区']
>>>
Utils
Common text operations:
- naivenlp.q2b(s): convert full-width characters to half-width
- naivenlp.b2q(s): convert half-width characters to full-width
- naivenlp.split_sentence(s): split a long text into a list of sentences
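A quick usage sketch follows; the inputs are arbitrary examples and the exact outputs depend on the implementation.
import naivenlp

# Full-width to half-width, e.g. 'ｈｅｌｌｏ，ｗｏｒｌｄ！' -> 'hello,world!'
print(naivenlp.q2b('ｈｅｌｌｏ，ｗｏｒｌｄ！'))

# Half-width to full-width
print(naivenlp.b2q('hello,world!'))

# Split a long text into a list of sentences
print(naivenlp.split_sentence('今天天气不错。我们出去散步吧！'))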
Datasource
Data collection module. Currently supported:
- downloading all Sogou lexicons and saving them as text files
Downloading the Sogou lexicons
from naivenlp.datasources import sogou_datasource as sg
# Download all lexicons under category_id=1 and save them to /tmp/sogou
sg.download_category(1, '/tmp/sogou')
# Download all categories and save them to /tmp/sogou
sg.download_all_category('/tmp/sogou')
# Merge all downloaded files into a single vocabulary file
sg.collect('/tmp/sogou', './sogou.vocab', maxlen=6)