NLP toolkit, including tokenization, sequence tagging, etc.

naivenlp

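A toolkit of commonly used NLP utilities.

It mainly contains the following modules:

Tokenizers
Correctors
Similarity
Structures
Utils
Datasource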

Installation

pip install -U naivenlp

Install extra dependencies:

pip install pycorrector
pip install git+https://github.com/kpu/kenlm.git

Tokenizers

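A Tokenizer splits text into tokens and also maps tokens to IDs.

The naivenlp.tokenizers module contains the following Tokenizer implementations:

JiebaTokenizer, a subclass of VocabBasedTokenizer that tokenizes with jieba
CustomTokenizer, a subclass of VocabBasedTokenizer backed by a vocabulary file; it wraps a user-supplied tokenize_fn function so that arbitrary custom Tokenizers can be built
TransformerTokenizer, a subclass of VocabBasedTokenizer for tokenizing input to Transformer models
BertTokenizer, a subclass of VocabBasedTokenizer for tokenizing input to BERT models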
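
To make the "split, then map to IDs" idea concrete, here is a minimal sketch of a vocabulary-based tokenizer with a pluggable tokenization function. ToyVocabTokenizer and its parameters are hypothetical names invented for illustration; this is not naivenlp's VocabBasedTokenizer, whose internals are not shown on this page.

```python
# Illustrative sketch only: ToyVocabTokenizer is a hypothetical class,
# not part of naivenlp.
class ToyVocabTokenizer:
    def __init__(self, vocab, tokenize_fn=str.split, unk_token='[UNK]'):
        # vocab: list of tokens; a token's ID is its position in the list.
        self.token_to_id = {tok: i for i, tok in enumerate(vocab)}
        self.id_to_token = {i: tok for tok, i in self.token_to_id.items()}
        self.tokenize_fn = tokenize_fn
        self.unk_token = unk_token
        self.unk_id = self.token_to_id[unk_token]

    def tokenize(self, text):
        # Delegate the actual splitting to the pluggable function.
        return list(self.tokenize_fn(text))

    def encode(self, text):
        # Tokens missing from the vocabulary map to the unknown-token ID.
        return [self.token_to_id.get(tok, self.unk_id)
                for tok in self.tokenize(text)]

    def decode(self, ids):
        # IDs outside the vocabulary map back to the unknown token.
        return [self.id_to_token.get(i, self.unk_token) for i in ids]


tokenizer = ToyVocabTokenizer(['[PAD]', '[UNK]', 'hello', 'world'])
print(tokenizer.tokenize('hello brave world'))  # ['hello', 'brave', 'world']
print(tokenizer.encode('hello brave world'))    # [2, 1, 3]
```

Swapping tokenize_fn for a different callable changes how text is split without touching the ID mapping, which is the separation of concerns the list above describes.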

Using JiebaTokenizer

Tokenization is performed with jieba.

from naivenlp.tokenizers import JiebaTokenizer

tokenizer = JiebaTokenizer(
    vocab_file='vocab.txt',
    pad_token='[PAD]',
    unk_token='[UNK]',
    bos_token='[BOS]',
    eos_token='[EOS]',
)

tokenizer.tokenize('hello world!', mode=0, hmm=True)

tokenizer.encode('hello world!', add_bos=False, add_eos=False)

Using CustomTokenizer

Lets users customize the tokenization process.

For example, to tokenize with baidu/lac:

pip install lac

from naivenlp.tokenizers import CustomTokenizer

from LAC import LAC

lac = LAC(mode='seg')

def lac_tokenize(text, **kwargs):
    return lac.run(text)


tokenizer = CustomTokenizer(
    vocab_file='vocab.txt',
    tokenize_fn=lac_tokenize,
    pad_token='[PAD]',
    unk_token='[UNK]',
    bos_token='[BOS]',
    eos_token='[EOS]',
)

tokenizer.tokenize('hello world!')

tokenizer.encode('hello world!', add_bos=False, add_eos=False)

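Using BasicTokenizer

This tokenizer is simple to use and needs no vocabulary. It splits text on whitespace and provides the following features:

Splits on whitespace and special characters
Optionally converts case, depending on configuration
Optionally splits Chinese text into individual characters, depending on configuration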

from naivenlp.tokenizers import BasicTokenizer

tokenizer = BasicTokenizer(do_lower_case=True, tokenize_chinese_chars=True)

tokenizer.tokenize('hello world, 你好世界')
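
The behaviour described above can be sketched in a few lines of plain Python. This is an approximation under stated assumptions, not naivenlp's implementation: basic_tokenize, is_cjk and is_punct are hypothetical helpers, "special characters" is taken to mean Unicode punctuation, and only the basic CJK Unified Ideographs block is treated as Chinese.

```python
import unicodedata


def is_cjk(ch):
    # Assumption: only the basic CJK Unified Ideographs block is checked.
    return '\u4e00' <= ch <= '\u9fff'


def is_punct(ch):
    # Unicode general categories starting with 'P' are punctuation.
    return unicodedata.category(ch).startswith('P')


def basic_tokenize(text, do_lower_case=True, tokenize_chinese_chars=True):
    if do_lower_case:
        text = text.lower()
    tokens, current = [], []

    def flush():
        # Emit the token accumulated so far, if any.
        if current:
            tokens.append(''.join(current))
            current.clear()

    for ch in text:
        if ch.isspace():
            flush()
        elif is_punct(ch) or (tokenize_chinese_chars and is_cjk(ch)):
            # Punctuation and (optionally) Chinese characters stand alone.
            flush()
            tokens.append(ch)
        else:
            current.append(ch)
    flush()
    return tokens


print(basic_tokenize('Hello World, 你好世界'))
# ['hello', 'world', ',', '你', '好', '世', '界']
```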

Using WordpieceTokenizer

WordPiece is a subword tokenization algorithm; see the relevant documentation for details.

WordpieceTokenizer requires a vocabulary map to be passed in.

from naivenlp.tokenizers import WordpieceTokenizer

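# Illustrative vocabulary map (token -> ID is an assumed format); in practice, build it from your vocab file.
vocab = {'[UNK]': 0, 'hello': 1, 'world': 2}

tokenizer = WordpieceTokenizer(vocab=vocab, unk_token='[UNK]')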

tokenizer.tokenize('hello world, 你好世界')
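
For readers unfamiliar with WordPiece, the sketch below shows the greedy longest-match-first segmentation commonly used at inference time (as in BERT). It is a standalone illustration, not naivenlp's code: wordpiece_tokenize is a hypothetical function, the '##' continuation prefix follows the BERT convention, and the tiny vocabulary is made up.

```python
def wordpiece_tokenize(word, vocab, unk_token='[UNK]', prefix='##'):
    """Greedy longest-match-first segmentation of a single word."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                # Non-initial pieces carry the continuation prefix.
                candidate = prefix + candidate
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            # No prefix of the remainder is in the vocabulary.
            return [unk_token]
        pieces.append(piece)
        start = end
    return pieces


toy_vocab = {'[UNK]': 0, 'un': 1, '##aff': 2, '##able': 3, 'hello': 4}
print(wordpiece_tokenize('unaffable', toy_vocab))  # ['un', '##aff', '##able']
print(wordpiece_tokenize('xyz', toy_vocab))        # ['[UNK]']
```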

Using TransformerTokenizer

from naivenlp.tokenizers import TransformerTokenizer


tokenizer = TransformerTokenizer(vocab_file='vocab.txt')

tokenizer.tokenize('Hello World, 你好世界')

tokenizer.encode('Hello World, 你好世界', add_bos=False, add_eos=False)

Using BertTokenizer

from naivenlp.tokenizers import BertTokenizer


tokenizer = BertTokenizer(vocab_file='vocab.txt', cls_token='[CLS]', sep_token='[SEP]', mask_token='[MASK]')

tokenizer.tokenize('Hello World, 你好世界')

tokenizer.encode('Hello World, 你好世界', add_bos=False, add_eos=False)

Correctors

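Text correction, using either the traditional approach based on n-gram language models and dictionaries or deep-learning-based methods.

Correction with n-gram language models and dictionaries

KenLMCorrector is a wrapper around the shibing624/pycorrector project.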

from naivenlp.correctors import KenLMCorrector

c = KenLMCorrector()
texts = [
    '软件开发工成师',
    '少先队员因该为老人让坐',
]

for text in texts:
    print(c.correct(text))

This produces the following correction results:

('软件开发工程师', [('工成师', '工程师', 4, 7)])
('少先队员应该为老人让座', [('因该', '应该', 4, 6), ('坐', '座', 10, 11)])
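
Judging from the sample output above, each result pairs the corrected text with a list of (wrong, right, start, end) tuples, where end appears to be exclusive. Under that assumption, the hypothetical helper below (not part of naivenlp or pycorrector) rebuilds the corrected text from the original text and the detail list, which is a convenient way to check how the offsets are meant to be read.

```python
def apply_corrections(text, details):
    """Rebuild corrected text from (wrong, right, start, end) tuples.

    Assumes `end` is exclusive and the spans do not overlap, as in the
    sample output above.
    """
    out, cursor = [], 0
    for wrong, right, start, end in sorted(details, key=lambda d: d[2]):
        if text[start:end] != wrong:
            raise ValueError(f'span {start}:{end} does not match {wrong!r}')
        out.append(text[cursor:start])  # unchanged text before the span
        out.append(right)               # replacement for the span
        cursor = end
    out.append(text[cursor:])           # unchanged tail
    return ''.join(out)


details = [('因该', '应该', 4, 6), ('坐', '座', 10, 11)]
print(apply_corrections('少先队员因该为老人让坐', details))
# 少先队员应该为老人让座
```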

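Deep-learning-based correction

Correction is done mainly with seq2seq models, for example:

RNN + Attention, the traditional seq2seq model
Transformer model

Models are trained with the OpenNMT-tf library; see that project's documentation for training instructions.

The following example uses a Transformer model: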

>>> from naivenlp.correctors import TransformerCorrector
>>> SAVED_MODEL='/models/correction_models/transformer-step-2000' # replace with your own trained model, in SavedModel format
>>> corrector = TransformerCorrector(SAVED_MODEL)
>>> result, prob = corrector.correct('我最近每天晚上都会拧着鼻子去喝30cc的醋了。')
>>> print('result: ', result)
result:  我最近每天晚上都会拧着鼻子去喝30cc的醋。
>>> print('  prob: ', prob)
  prob:  -6.088574
>>>

Similarity

A variety of string similarity measures. This is a wrapper around luozhouyang/python-string-similarity.

>>> import naivenlp
>>> a = 'ACCTTTDEX'
>>> b = 'CGGTTEEXX'
>>> naivenlp.cosine_distance(a, b)
1.0
>>> naivenlp.cosine_similarity(a, b)
1.0
>>> naivenlp.jaccard_distance(a, b)
1.0
>>> naivenlp.jaccard_similarity(a, b)
0.0
>>> naivenlp.levenshtein_distance(a, b)
5
>>> naivenlp.levenshtein_distance_normalized(a, b)
0.5555555555555556
>>> naivenlp.levenshtein_similarity(a, b)
0.4444444444444444
>>> naivenlp.weighted_levenshtein_distance(a, b)
5.0
>>> naivenlp.damerau_distance(a, b)
5
>>> naivenlp.lcs_distance(a, b)
8
>>> naivenlp.lcs_length(a, b)
5
>>> naivenlp.sorense_dice_distance(a, b)
1.0
>>> naivenlp.sorense_dice_similarity(a, b)
0.0
>>> naivenlp.optimal_string_alignment_distance(a, b)
5
>>>
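
As a reference for how the Levenshtein figures in the session above arise, here is a minimal dynamic-programming sketch. It is independent of naivenlp; the function names simply mirror the ones in the session, and normalizing by the length of the longer string is an assumption that happens to agree with the 5 and 0.5555555555555556 values printed above.

```python
def levenshtein_distance(a, b):
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn a into b (row-by-row DP)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


def levenshtein_distance_normalized(a, b):
    # Assumption: normalize by the length of the longer string.
    longest = max(len(a), len(b))
    return levenshtein_distance(a, b) / longest if longest else 0.0


a, b = 'ACCTTTDEX', 'CGGTTEEXX'
print(levenshtein_distance(a, b))             # 5
print(levenshtein_distance_normalized(a, b))  # 0.5555555555555556
```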

Structures

Implementations of commonly used data structures.

Currently supported:

Trie (prefix tree)

Using Trie

>>> import naivenlp
>>> trie = naivenlp.Trie()
>>> trie.put('上海市浦东新区')
>>> trie.show()
.
+----上
|    +----海
|    |    +----市
|    |    |    +----浦
|    |    |    |    +----东
|    |    |    |    |    +----新
|    |    |    |    |    |    +----区
>>> trie.put('上海市黄浦区')
>>> trie.show()
.
+----上
|    +----海
|    |    +----市
|    |    |    +----浦
|    |    |    |    +----东
|    |    |    |    |    +----新
|    |    |    |    |    |    +----区
|    |    |    +----黄
|    |    |    |    +----浦
|    |    |    |    |    +----区
>>> 
>>> for r in trie.keys_with_prefix('上海市'):
...     print(r)
... 
['上', '海', '市', '浦', '东', '新', '区']
['上', '海', '市', '黄', '浦', '区']
>>>
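
For readers who want to see what sits behind put and keys_with_prefix, the following is a minimal dictionary-based trie. ToyTrie is a hypothetical class written for illustration and is not naivenlp.Trie; it yields each key as a list of characters only to mirror the output format shown above.

```python
class ToyTrie:
    END = None  # sentinel key marking the end of a stored key

    def __init__(self):
        self.root = {}

    def put(self, key):
        # Walk down the tree, creating child nodes as needed.
        node = self.root
        for ch in key:
            node = node.setdefault(ch, {})
        node[self.END] = True

    def keys_with_prefix(self, prefix):
        # Descend to the node for the prefix, then enumerate its subtree.
        node = self.root
        for ch in prefix:
            if ch not in node:
                return
            node = node[ch]
        yield from self._walk(node, list(prefix))

    def _walk(self, node, path):
        if self.END in node:
            yield list(path)
        for ch, child in node.items():
            if ch is not self.END:
                yield from self._walk(child, path + [ch])


trie = ToyTrie()
trie.put('上海市浦东新区')
trie.put('上海市黄浦区')
for key in trie.keys_with_prefix('上海市'):
    print(key)
# ['上', '海', '市', '浦', '东', '新', '区']
# ['上', '海', '市', '黄', '浦', '区']
```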

Utils

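Common text operations:

naivenlp.q2b(s) converts full-width characters to half-width
naivenlp.b2q(s) converts half-width characters to full-width
naivenlp.split_sentence(s) splits long text into a list of sentences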
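
Full-width/half-width conversion relies on a fixed Unicode offset: the full-width forms U+FF01 to U+FF5E correspond to ASCII U+0021 to U+007E shifted by 0xFEE0, and the ideographic space U+3000 corresponds to the ordinary space. The sketch below illustrates that mapping; the function names are hypothetical, and whether naivenlp's q2b and b2q handle exactly this character range is an assumption.

```python
FULLWIDTH_OFFSET = 0xFEE0  # distance between 'Ａ' (U+FF21) and 'A' (U+0041)


def fullwidth_to_halfwidth(s):
    out = []
    for ch in s:
        code = ord(ch)
        if code == 0x3000:              # ideographic space
            code = 0x20
        elif 0xFF01 <= code <= 0xFF5E:  # full-width ASCII variants
            code -= FULLWIDTH_OFFSET
        out.append(chr(code))
    return ''.join(out)


def halfwidth_to_fullwidth(s):
    out = []
    for ch in s:
        code = ord(ch)
        if code == 0x20:                # ordinary space
            code = 0x3000
        elif 0x21 <= code <= 0x7E:      # printable ASCII
            code += FULLWIDTH_OFFSET
        out.append(chr(code))
    return ''.join(out)


print(fullwidth_to_halfwidth('ＡＢＣ１２３，你好'))  # ABC123,你好
print(halfwidth_to_fullwidth('ABC123'))             # ＡＢＣ１２３
```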

Datasource

Data collection module. Currently supported:

Downloading all Sogou lexicons and saving them as text files

Downloading Sogou lexicons

from naivenlp.datasources import sogou as sg

# Download every lexicon under category_id=1 and save it to /tmp/sogou
sg.download_category(1, '/tmp/sogou')

# Download all categories and save them to /tmp/sogou
sg.download_all_category('/tmp/sogou')

# Merge all downloaded files into a single file
sg.collect('/tmp/sogou', './sogou.vocab', maxlen=6)

Release history

0.0.9 (Jul 22, 2020), this version
0.0.8 (Jul 16, 2020)
0.0.7 (Jul 16, 2020)
0.0.6 (Jul 16, 2020)
0.0.5 (Jul 11, 2020)
0.0.4 (Jul 5, 2020)
0.0.3 (Jul 1, 2020)
0.0.2 (Jun 21, 2020)
0.0.1 (Jun 10, 2020)

Download files

Source Distribution

naivenlp-0.0.9.tar.gz (25.9 kB view details)

Uploaded Jul 22, 2020 Source

Built Distribution

naivenlp-0.0.9-py3-none-any.whl (33.0 kB view details)

Uploaded Jul 22, 2020 Python 3

File details

Details for the file naivenlp-0.0.9.tar.gz.

File metadata

Download URL: naivenlp-0.0.9.tar.gz
Upload date: Jul 22, 2020
Size: 25.9 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/3.2.0 pkginfo/1.5.0.1 requests/2.24.0 setuptools/41.2.0 requests-toolbelt/0.9.1 tqdm/4.48.0 CPython/3.8.3

File hashes

Hashes for naivenlp-0.0.9.tar.gz
Algorithm	Hash digest
SHA256	`6dc7340e40b5d48fb7e750e8547e91e4cf61b98e04e56c436a17c4968b824ed3`
MD5	`4c8ecc480c09dd97a2de6f01f992f436`
BLAKE2b-256	`e75519d93059f9fe6bf7239e9f9103261c59529fc7082cc690f6bdb5a789b8a3`

File details

Details for the file naivenlp-0.0.9-py3-none-any.whl.

File metadata

Download URL: naivenlp-0.0.9-py3-none-any.whl
Upload date: Jul 22, 2020
Size: 33.0 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/3.2.0 pkginfo/1.5.0.1 requests/2.24.0 setuptools/41.2.0 requests-toolbelt/0.9.1 tqdm/4.48.0 CPython/3.8.3

File hashes

Hashes for naivenlp-0.0.9-py3-none-any.whl
Algorithm	Hash digest
SHA256	`a04e9d51c5ab04ab65faba0b9bfa78d969dfe14ad5dd0ab4266ac96e0391fc82`
MD5	`d21b94798a15737a5f71f6c8266c9f5c`
BLAKE2b-256	`f91cc58cf4f1aa446e9bfa5e3e2a2f3d01a97fa3576198755e3d168ada796f7f`
