NLP toolkit, including tokenization, sequence tagging, etc.

naivenlp

A toolkit of commonly used NLP utilities.

It mainly contains the following modules:

Tokenizers

A Tokenizer splits text into tokens and also maps tokens to IDs.

The naivenlp.tokenizers module contains the following Tokenizer implementations:

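  • JiebaTokenizer: inherits from VocabBasedTokenizer and uses jieba for tokenization
  • CustomTokenizer: inherits from VocabBasedTokenizer; a vocabulary-file-based Tokenizer that wraps a user-supplied tokenize_fn function, so you can implement arbitrary custom Tokenizers
  • TransformerTokenizer: inherits from VocabBasedTokenizer; tokenizes text for Transformer models
  • BertTokenizer: inherits from VocabBasedTokenizer; tokenizes text for BERT models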
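
To make the two roles of a Tokenizer (splitting text and mapping tokens to IDs) concrete, here is a minimal sketch in plain Python. ToyVocabTokenizer is a hypothetical class written for illustration only; it is not naivenlp's VocabBasedTokenizer, and the whitespace splitting and in-memory vocabulary are assumptions standing in for a real segmenter and a vocab file.

```python
# Minimal sketch of a vocabulary-based tokenizer (hypothetical, not naivenlp code).
class ToyVocabTokenizer:
    def __init__(self, tokens, unk_token='[UNK]'):
        # IDs are assigned in order of appearance; a real tokenizer would
        # typically read them from a vocabulary file instead.
        self.unk_token = unk_token
        self.token_to_id = {}
        for tok in [unk_token] + list(tokens):
            self.token_to_id.setdefault(tok, len(self.token_to_id))

    def tokenize(self, text):
        # Stand-in segmentation: split on whitespace.
        return text.split()

    def encode(self, text):
        # Tokens missing from the vocabulary fall back to the unknown-token ID.
        unk_id = self.token_to_id[self.unk_token]
        return [self.token_to_id.get(tok, unk_id) for tok in self.tokenize(text)]


toy = ToyVocabTokenizer(['hello', 'world'])
tokens = toy.tokenize('hello brave world')
ids = toy.encode('hello brave world')
print(tokens)  # ['hello', 'brave', 'world']
print(ids)     # [1, 0, 2]
```

The out-of-vocabulary word 'brave' maps to the ID of '[UNK]', which is the usual reason a vocabulary-based tokenizer takes an unk_token argument.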

Using JiebaTokenizer

Tokenization is performed with jieba.

from naivenlp.tokenizers import JiebaTokenizer

tokenizer = JiebaTokenizer(
    vocab_file='vocab.txt',
    pad_token='[PAD]',
    unk_token='[UNK]',
    bos_token='[BOS]',
    eos_token='[EOS]',
)

tokenizer.tokenize('hello world!', mode=0, hmm=True)

tokenizer.encode('hello world!', add_bos=False, add_eos=False)

Using CustomTokenizer

Lets users customize the tokenization process.

The example below uses baidu/lac for tokenization.

pip install lac

from naivenlp.tokenizers import CustomTokenizer

from LAC import LAC

lac = LAC(mode='seg')

def lac_tokenize(text, **kwargs):
    return lac.run(text)


tokenizer = CustomTokenizer(
    vocab_file='vocab.txt',
    tokenize_fn=lac_tokenize,
    pad_token='[PAD]',
    unk_token='[UNK]',
    bos_token='[BOS]',
    eos_token='[EOS]',
)

tokenizer.tokenize('hello world!')

tokenizer.encode('hello world!', add_bos=False, add_eos=False)
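
Judging from lac_tokenize above, a tokenize_fn takes the text plus optional keyword arguments and returns a list of tokens. Under that assumption, a dependency-free alternative could look like the sketch below. regex_tokenize is a hypothetical name, and whether CustomTokenizer passes any keyword arguments through is not confirmed by the source.

```python
import re

# Hypothetical tokenize_fn: runs of word characters, or single non-space symbols.
_TOKEN_RE = re.compile(r'\w+|[^\w\s]')


def regex_tokenize(text, **kwargs):
    # Extra keyword arguments are accepted and ignored, mirroring lac_tokenize.
    return _TOKEN_RE.findall(text)


print(regex_tokenize('hello world!'))  # ['hello', 'world', '!']
```

It would then be passed as tokenize_fn=regex_tokenize in place of lac_tokenize.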

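Using BasicTokenizer

This tokenizer is simple to use and needs no vocabulary. It splits text on whitespace and provides the following features:

  • Splits on whitespace and special characters
  • Optionally converts letter case, depending on the configuration
  • Optionally splits Chinese text into individual characters, depending on the configuration

from naivenlp.tokenizers import BasicTokenizer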

tokenizer = BasicTokenizer(do_lower_case=True, tokenize_chinese_chars=True)

tokenizer.tokenize('hello world, 你好世界')
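
The behaviour described above can be approximated in plain Python. This is a sketch under stated assumptions, not naivenlp's implementation: basic_tokenize and is_cjk are hypothetical helpers, "special characters" is taken to mean Unicode punctuation, and only the basic CJK Unified Ideographs block is treated as Chinese.

```python
import unicodedata


def is_cjk(ch):
    # Simplification: only the basic CJK Unified Ideographs block.
    return '\u4e00' <= ch <= '\u9fff'


def basic_tokenize(text, do_lower_case=True, tokenize_chinese_chars=True):
    if do_lower_case:
        text = text.lower()
    tokens, current = [], []

    def flush():
        # Emit the word accumulated so far, if any.
        if current:
            tokens.append(''.join(current))
            current.clear()

    for ch in text:
        if ch.isspace():
            flush()
        elif unicodedata.category(ch).startswith('P') or (
                tokenize_chinese_chars and is_cjk(ch)):
            # Punctuation and (optionally) Chinese characters stand alone.
            flush()
            tokens.append(ch)
        else:
            current.append(ch)
    flush()
    return tokens


print(basic_tokenize('Hello World, 你好世界'))
# ['hello', 'world', ',', '你', '好', '世', '界']
```

With tokenize_chinese_chars=False, the same input keeps '你好世界' as a single token in this sketch.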

Using WordpieceTokenizer

Wordpiece is a subword tokenization algorithm; see the relevant documentation for details.

WordpieceTokenizer must be given a vocabulary map.

from naivenlp.tokenizers import WordpieceTokenizer

# `vocab` is the vocabulary map mentioned above; it must be defined before this call.
tokenizer = WordpieceTokenizer(vocab=vocab, unk_token='[UNK]')

tokenizer.tokenize('hello world, 你好世界')
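
For readers unfamiliar with Wordpiece, the sketch below shows the greedy longest-match-first scheme popularized by BERT. It is an illustration, not naivenlp's code: wordpiece_tokenize, the '##' continuation prefix, and the toy vocabulary are assumptions, and naivenlp's WordpieceTokenizer may differ in details.

```python
def wordpiece_tokenize(word, vocab, unk_token='[UNK]', prefix='##'):
    # Repeatedly take the longest vocabulary entry that matches at `start`.
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                # Non-initial pieces carry the continuation prefix.
                candidate = prefix + candidate
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            # No piece matches here: the whole word becomes unknown.
            return [unk_token]
        pieces.append(piece)
        start = end
    return pieces


toy_vocab = {'[UNK]': 0, 'un': 1, '##aff': 2, '##able': 3, 'hello': 4}
print(wordpiece_tokenize('unaffable', toy_vocab))  # ['un', '##aff', '##able']
print(wordpiece_tokenize('hello', toy_vocab))      # ['hello']
print(wordpiece_tokenize('xyz', toy_vocab))        # ['[UNK]']
```

The function handles a single word; in practice the input is first split into words (for example by a basic tokenizer) and each word is processed separately.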

Using TransformerTokenizer

from naivenlp.tokenizers import TransformerTokenizer


tokenizer = TransformerTokenizer(vocab_file='vocab.txt')

tokenizer.tokenize('Hello World, 你好世界')

tokenizer.encode('Hello World, 你好世界', add_bos=False, add_eos=False)

Using BertTokenizer

from naivenlp.tokenizers import BertTokenizer


tokenizer = BertTokenizer(vocab_file='vocab.txt', cls_token='[CLS]', sep_token='[SEP]', mask_token='[MASK]')

tokenizer.tokenize('Hello World, 你好世界')

tokenizer.encode('Hello World, 你好世界', add_bos=False, add_eos=False)

Correctors

Similarity

A collection of string similarity measures. It is a wrapper around luozhouyang/python-string-similarity.

import naivenlp

a = 'ACCTTTDEX'
b = 'CGGTTEEXX'

naivenlp.cosine_distance(a, b)

naivenlp.cosine_similarity(a, b)

naivenlp.jaccard_distance(a, b)

naivenlp.jaccard_similarity(a, b)

naivenlp.levenshtein_distance(a, b)

naivenlp.levenshtein_distance_normalized(a, b)

naivenlp.levenshtein_similarity(a, b)

naivenlp.weighted_levenshtein_distance(a, b)

naivenlp.damerau_distance(a, b)

naivenlp.lcs_distance(a, b)

naivenlp.lcs_length(a, b)

naivenlp.sorense_dice_distance(a, b)

naivenlp.sorense_dice_similarity(a, b)

naivenlp.optimal_string_alignment_distance(a, b)
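
As a reference point for what one of these measures computes, here is a minimal pure-Python sketch of the classic unit-cost Levenshtein (edit) distance. The levenshtein function is a hypothetical standalone helper; it is not the code behind naivenlp.levenshtein_distance, which delegates to the wrapped library.

```python
def levenshtein(a, b):
    # prev[j] is the edit distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


print(levenshtein('kitten', 'sitting'))  # 3
```

Keeping only two rows of the dynamic-programming table brings memory use down to O(len(b)) while the running time stays O(len(a) * len(b)).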
