nlp kit.
Project description
sk-nlp
📦 Project Introduction (for humans)
This third-party repository is provided by the AI team of Shenzhen Mingtong Technology Co., Ltd. (深圳市名通科技股份有限公司). The team aims to provide stable, reliable, and fully featured implementations of common NLP operations.
Installation
cd your_project
pip install sk-nlp
Content
sk_nlp package
0. Count the frequencies of given words using an Aho-Corasick automaton
1. Extract tf-idf features
class sk_nlp.nlp_feature_extract.feature.CountByAC(pattern_list=[])
Bases: "object"
Count pattern strings using an Aho-Corasick automaton.
Parameters: pattern_list -- list of pattern strings to match
build_tree(pattern_list)
Build the prefix trie for the pattern strings.
Parameters:
**pattern_list** -- list of pattern strings
count(sentence)
Count occurrences of the given patterns in sentence.
Parameters:
**sentence** -- the input sentence
Returns:
word_count -- the frequency of each keyword
>>> ac = CountByAC(['杰伦的七', '周杰伦的', '七里香'])
>>> result = ac.count('周杰伦的七里香七里香')
>>> print(result)
{'周杰伦的': 1, '杰伦的七': 1, '七里香': 2}
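The doctest above can be reproduced with a stdlib-only Aho-Corasick counter. This is an illustrative sketch, not sk-nlp's implementation; the class name `ACCounter` is hypothetical:

```python
from collections import deque

class ACCounter:
    """Minimal Aho-Corasick automaton for counting pattern occurrences."""

    def __init__(self, patterns):
        self.trie = [{}]    # per-state transition table: char -> next state
        self.fail = [0]     # failure links
        self.out = [[]]     # patterns that end at each state
        for p in patterns:
            self._insert(p)
        self._build_fail_links()

    def _insert(self, pattern):
        state = 0
        for ch in pattern:
            if ch not in self.trie[state]:
                self.trie.append({})
                self.fail.append(0)
                self.out.append([])
                self.trie[state][ch] = len(self.trie) - 1
            state = self.trie[state][ch]
        self.out[state].append(pattern)

    def _build_fail_links(self):
        # BFS from the root; a state's failure link points to the longest
        # proper suffix of its path that is also a trie prefix.
        q = deque(self.trie[0].values())
        while q:
            state = q.popleft()
            for ch, nxt in self.trie[state].items():
                q.append(nxt)
                f = self.fail[state]
                while f and ch not in self.trie[f]:
                    f = self.fail[f]
                self.fail[nxt] = self.trie[f].get(ch, 0)
                # inherit outputs reachable through the failure link
                self.out[nxt] = self.out[nxt] + self.out[self.fail[nxt]]

    def count(self, sentence):
        counts, state = {}, 0
        for ch in sentence:
            while state and ch not in self.trie[state]:
                state = self.fail[state]
            state = self.trie[state].get(ch, 0)
            for p in self.out[state]:
                counts[p] = counts.get(p, 0) + 1
        return counts
```

With the doc's patterns, `ACCounter(['杰伦的七', '周杰伦的', '七里香']).count('周杰伦的七里香七里香')` yields the same counts as the doctest, including the two overlapping matches of '七里香'.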
class sk_nlp.nlp_feature_extract.feature.KeyWordExtract
Bases: "object"
Keyword extraction based on tf-idf.
get_tf_idf(sentence_list, model_file)
Load the tf-idf model; return the features for sentence_list together with the model.
Parameters:
* **sentence_list** -- list of sentences (already tokenized)
* **model_file** -- tf-idf model file
Returns:
tf_idf_model (the model instance), tfidf_feature (the tf-idf features for sentence_list)
>>> tf_idf_model, tfidf_feature = kwe.get_tf_idf(['杰伦 是 台湾 歌手', '七里香 是 杰伦 创作'], file_conf.tf_idf_file_path)
>>> print(tfidf_feature)
(0, 4) 0.6316672017376245
(0, 3) 0.4494364165239821
(0, 2) 0.6316672017376245
(1, 3) 0.4494364165239821
(1, 1) 0.6316672017376245
(1, 0) 0.6316672017376245
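The feature values printed above are consistent with a standard tf-idf using smoothed idf and l2 row normalization, where the vectorizer's default token pattern appears to drop single-character tokens such as '是'. A stdlib-only sketch of that computation (illustrative; `tf_idf` is not an sk-nlp function):

```python
import math

def tf_idf(docs):
    """docs: list of token lists. Returns (vocab, rows) using
    smoothed idf = ln((1 + n) / (1 + df)) + 1 and l2-normalized rows."""
    vocab = sorted({t for doc in docs for t in doc})
    n = len(docs)
    idf = {t: math.log((1 + n) / (1 + sum(t in doc for doc in docs))) + 1
           for t in vocab}
    rows = []
    for doc in docs:
        weights = [doc.count(t) * idf[t] for t in vocab]
        norm = math.sqrt(sum(w * w for w in weights)) or 1.0
        rows.append([w / norm for w in weights])
    return vocab, rows
```

Running it on the doc's two sentences with '是' removed reproduces the printed values, e.g. 0.6316672… for '歌手' and 0.4494364… for '杰伦' (which appears in both sentences and therefore gets the lower idf).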
get_topk_keywords(data_list, topk=200)
Return the topk keywords.
Parameters:
* **data_list** -- list of sentences (already tokenized)
* **topk** -- keep the top topk terms after ranking by tf-idf weight
Returns:
keywords
>>> keywords = kwe.get_topk_keywords(['杰伦 是 台湾 歌手', '七里香 是 杰伦 创作'], topk=1)
>>> print(keywords)
[['歌手'], ['创作']]
train_tf_idf(sentence_list, model_file, ngram_range=(1, 1))
Train a tf-idf model, save it, and return the model and features.
Parameters:
* **sentence_list** -- list of sentences (already tokenized)
* **model_file** -- file path where the tf-idf model is saved
Returns:
tf_idf_model, tfidf_feature
Sensitive-word filtering module. Three classes are implemented: NaiveFilter, BSFilter, and DFAFilter.
class sk_nlp.nlp_feature_extract.text_filter.BSFilter
Bases: "object"
Filter via breadth-first traversal.
add(keyword)
Add a sensitive word.
:param keyword: the sensitive word :return: None
filter(message, repl='*')
Filter out sensitive words.
Parameters:
* **message** -- the original input sentence
* **repl** -- the character used to replace sensitive words
Returns:
message -- the sentence with sensitive words masked
>>> f = BSFilter()
>>> question = "台湾是中国的吗"
>>> filter_question = f.filter(question)
>>> print(question, filter_question)
台湾是中国的吗 *是中国的吗
parse(path)
Load the sensitive-word vocabulary.
Parameters:
**path** -- the path is /sk-nlp/data/dirty_word.txt
Returns:
class sk_nlp.nlp_feature_extract.text_filter.DFAFilter
Bases: "object"
DFA stands for Deterministic Finite Automaton. The core of the algorithm is building a forest of tries keyed by the sensitive words.
add(keyword)
Add a sensitive word.
:param keyword: the sensitive word :return: None
detect(message)
Check whether message contains any sensitive word.
:param message: the user's input sentence :return: True/False
filter(message, repl='*')
Filter out sensitive words.
Parameters:
* **message** -- the original input sentence
* **repl** -- the character used to replace sensitive words
Returns:
message -- the sentence with sensitive words masked
>>> f = DFAFilter()
>>> question = "台湾是中国的吗"
>>> filter_question = f.filter(question)
>>> print(question, filter_question)
台湾是中国的吗 *是中国的吗
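The trie-based (DFA-style) filtering above can be sketched in a few lines of stdlib Python; `DFAFilterSketch` is illustrative, not sk-nlp's implementation. Note that in the doctest above a single `repl` character masks the whole matched word:

```python
class DFAFilterSketch:
    """Trie-based sensitive-word filter: walk the trie from each position,
    mask the longest match found, otherwise keep the character."""

    def __init__(self):
        self.root = {}

    def add(self, keyword):
        node = self.root
        for ch in keyword:
            node = node.setdefault(ch, {})
        node['__end__'] = True  # marks a complete sensitive word

    def filter(self, message, repl='*'):
        out, i = [], 0
        while i < len(message):
            node, j, match_end = self.root, i, -1
            while j < len(message) and message[j] in node:
                node = node[message[j]]
                j += 1
                if '__end__' in node:
                    match_end = j          # remember the longest match so far
            if match_end > -1:
                out.append(repl)           # one repl char per matched word
                i = match_end
            else:
                out.append(message[i])
                i += 1
        return ''.join(out)
```

With '台湾' in the word list, `filter('台湾是中国的吗')` returns `'*是中国的吗'`, matching the doctest.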
parse(path)
Load the sensitive-word vocabulary.
Parameters:
**path** -- the path is /sk-nlp/data/dirty_word.txt
Returns:
class sk_nlp.nlp_feature_extract.text_filter.NaiveFilter
Bases: "object"
The naive filtering approach: filter using a set of keywords; time complexity depends on the size of the set.
filter(message, repl='*')
Filter out sensitive words.
Parameters:
* **message** -- the original input sentence
* **repl** -- the character used to replace sensitive words
Returns:
message: the sentence with sensitive words masked
>>> f = NaiveFilter()
>>> question = "台湾是中国的吗"
>>> filter_question = f.filter(question)
>>> print(question, filter_question)
台湾是中国的吗 *是中国的吗
parse(path)
Load the sensitive-word vocabulary.
Parameters:
**path** -- the path is /sk-nlp/data/dirty_word.txt
Returns:
Word-level operations module: tokenization, stop-word removal, and synonym-lexicon conversion.
class sk_nlp.nlp_feature_extract.tokenizer.SentenceCut(is_lower=True, stopword_list=[], use_chinese_synonyms=False)
Bases: "object"
A sentence-tokenization class; currently integrates jieba segmentation.
cut_word(sentence_list)
Tokenize the given sentences.
:param sentence_list: ['我爱中国', '我是中国人']
:return: seg_lists [['我', '爱', '中国'], ['我', '是', '中国', '人']], token_count {'我': 2, '爱': 1, '中国': 2, '是': 1, '人': 1}
>>> sen_cut = SentenceCut(use_chinese_synonyms=True)
>>> seg_lists, token_count = sen_cut.cut_word(['我爱baidu', '我是中国人'])
>>> print(seg_lists, token_count)
[['我', '爱', '百度'], ['我', '是', '中国', '人']]
{'我': 2, '爱': 1, '百度': 1, '是': 1, '中国': 1, '人': 1}
load_chinese_synonyms()
Load the Chinese synonym lexicon.
Returns:
union_find (a union-find instance), word_list (the set of all words in the synonym lexicon)
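load_chinese_synonyms returns a union-find structure over the synonym lexicon; this is how '我爱baidu' in the cut_word doctest can be normalized to '我爱百度'. A minimal stdlib sketch of merging and querying synonym groups with union-find (the lexicon entries below are hypothetical examples, not the real Cilin data):

```python
class UnionFind:
    """Union-find over words; words in the same synonym group share a root."""

    def __init__(self):
        self.parent = {}

    def find(self, w):
        self.parent.setdefault(w, w)
        while self.parent[w] != w:
            self.parent[w] = self.parent[self.parent[w]]  # path halving
            w = self.parent[w]
        return w

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[rb] = ra

# Each lexicon line lists one synonym group; union every word with the first.
uf = UnionFind()
for group in [['百度', 'baidu'], ['中国', '中华']]:  # illustrative entries
    for w in group[1:]:
        uf.union(group[0], w)
```

After building, `uf.find('baidu') == uf.find('百度')` holds, so any token can be replaced by its group's canonical root word.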
class sk_nlp.nlp_feature_extract.tokenizer.StopWord(source='', define_stop_word=[])
Bases: "object"
A stop-word operations class; the stop-word lists are stored under sk-nlp/data/stopword.
load_stop_word()
Load the stop-word list selected by self.source.
Returns:
stop_word_list -- the stop-word list
merge_stop_word(define_stop_word)
Merge the user-defined stop words and the selected general-purpose stop-word list into one list.
Parameters:
**define_stop_word** -- user-supplied list of custom stop words
Returns:
stop_word_list -- the merged stop-word list
Loading the basic BERT model
class sk_nlp.nlp_feature_embedding.bert.MaskLayer(output_dim=768, **kwargs)
Bases: "keras.engine.base_layer.Layer"
Mask layer: masks out tokens whose seg_id is 0.
build(input_shape)
Create the layer's weights.
:param input_shape: Keras tensor (future input to layer) or list/tuple of Keras tensors :return:
call(x)
This is where the layer's logic lives.
# Arguments
inputs: Input tensor, or list/tuple of input tensors.
**kwargs: Additional keyword arguments.
# Returns
A tensor or list/tuple of tensors.
compute_output_shape(input_shape)
Computes the output shape of the layer.
Assumes that the layer will be built to match that input shape
provided.
# Arguments
input_shape: Shape tuple (tuple of integers)
or list of shape tuples (one per output tensor of the
layer). Shape tuples can include None for free dimensions,
instead of an integer.
# Returns
An output shape tuple.
class sk_nlp.nlp_feature_embedding.bert.ReverseMaskLayer(**kwargs)
Bases: "keras.engine.base_layer.Layer"
Reverse mask layer: masks out tokens whose seg_id is 1.
call(x)
This is where the layer's logic lives.
# Arguments
inputs: Input tensor, or list/tuple of input tensors.
**kwargs: Additional keyword arguments.
# Returns
A tensor or list/tuple of tensors.
compute_output_shape(input_shape)
Computes the output shape of the layer.
Assumes that the layer will be built to match that input shape
provided.
# Arguments
input_shape: Shape tuple (tuple of integers)
or list of shape tuples (one per output tensor of the
layer). Shape tuples can include None for free dimensions,
instead of an integer.
# Returns
An output shape tuple.
class sk_nlp.nlp_feature_embedding.bert.SepLayer(**kwargs)
Bases: "keras.engine.base_layer.Layer"
SEP mask layer: masks out the output at the [SEP] position.
call(x)
This is where the layer's logic lives.
# Arguments
inputs: Input tensor, or list/tuple of input tensors.
**kwargs: Additional keyword arguments.
# Returns
A tensor or list/tuple of tensors.
compute_output_shape(input_shape)
Computes the output shape of the layer.
Assumes that the layer will be built to match that input shape
provided.
# Arguments
input_shape: Shape tuple (tuple of integers)
or list of shape tuples (one per output tensor of the
layer). Shape tuples can include None for free dimensions,
instead of an integer.
# Returns
An output shape tuple.
sk_nlp.nlp_feature_embedding.bert.build_model_feature(origin_model, use_cls=False)
Build a new sentence-vector model.
Parameters: * origin_model -- the original model, usually BERT
* **use_cls** -- whether to use the output at the [CLS] position
Returns: model -- the new model
sk_nlp.nlp_feature_embedding.bert.encoder(model, data_list, dict_path='/machinelearn/wzh/sk_nlp/sk_nlp/model/bert/chinese_L-12_H-768_A-12/vocab.txt')
Encode sentences into sentence vectors using the sentence-vector model.
Parameters: * model -- the model
* **data_list** -- list of sentences (not tokenized)
* **dict_path** -- BERT model vocabulary file
Returns: a list of sentence vectors, one per sentence in data_list
>>> origin_model = load_bert_model()
>>> new_model = build_model_feature(origin_model)
>>> question_list = ["我爱这个伟大的世界", "欣赏世界的风景"]
>>> sen_vector_lists = encoder(new_model, question_list)
>>> print(sen_vector_lists.shape)
sk_nlp.nlp_feature_embedding.bert.load_bert_model(with_mlm=True, with_pool=False, return_keras_model=True, config_path='/machinelearn/wzh/sk_nlp/sk_nlp/model/bert/chinese_L-12_H-768_A-12/bert_config.json', checkpoint_path='/machinelearn/wzh/sk_nlp/sk_nlp/model/bert/chinese_L-12_H-768_A-12/bert_model.ckpt')
Load the BERT model.
Parameters: * with_mlm -- whether to keep the masked-language-model (MLM) head
* **with_pool** -- whether to keep the pooler output
* **return_keras_model** -- whether to return a Keras model or a TensorFlow model
* **config_path** -- path to the BERT config file
* **checkpoint_path** -- path to the BERT checkpoint
Returns:
sk_nlp.nlp_feature_embedding.bert.masked_crossentropy(y_true, y_pred)
Mask out the non-predicted positions, then compute the cross-entropy.
Parameters: * y_true -- the ground-truth labels
* **y_pred** -- the predicted labels
Returns: the loss value
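The idea behind masked_crossentropy can be sketched with plain Python lists; the real function operates on Keras tensors, and treating label 0 as the masked, non-predicted position is an assumption made here for illustration:

```python
import math

def masked_crossentropy_sketch(y_true, y_pred):
    """y_true: list of class ids, where 0 marks a position that is not
    being predicted (assumed convention). y_pred: one probability
    distribution per position. Averages cross-entropy over unmasked positions."""
    total, n = 0.0, 0
    for label, probs in zip(y_true, y_pred):
        if label == 0:              # masked out: excluded from the loss
            continue
        total += -math.log(probs[label])
        n += 1
    return total / max(n, 1)        # avoid division by zero if all masked
```

For example, with one masked and one unmasked position the loss is just the cross-entropy of the unmasked one.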
Computing various distances
sk_nlp.nlp_feature_embedding.similarity.get_distance_sim_matrix(matrix1, matrix2, metric='cosine')
Return the distances and similarities between two matrices under the chosen metric.
Parameters: * matrix1 -- sentence-vector matrix 1
* **matrix2** -- sentence-vector matrix 2
* **metric** -- one of 'braycurtis', 'canberra', 'chebyshev', 'cityblock', 'correlation', 'cosine', 'dice', 'euclidean', 'hamming', 'jaccard', 'jensenshannon', 'kulsinski', 'mahalanobis', 'matching', 'minkowski', 'rogerstanimoto', 'russellrao', 'seuclidean', 'sokalmichener', 'sokalsneath', 'sqeuclidean', 'wminkowski', 'yule'
:return:
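The metric names above match those accepted by scipy.spatial.distance.cdist. The common 'cosine' case can be sketched with the stdlib (illustrative, not sk-nlp's implementation):

```python
import math

def cosine_distance_matrix(matrix1, matrix2):
    """Pairwise cosine distances (1 - cosine similarity) between row vectors."""
    def cos_dist(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        nu = math.sqrt(sum(a * a for a in u))
        nv = math.sqrt(sum(b * b for b in v))
        return 1.0 - dot / (nu * nv)
    return [[cos_dist(u, v) for v in matrix2] for u in matrix1]
```

Identical directions give distance 0, orthogonal vectors give distance 1; subtracting the distance from 1 recovers the similarity.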
sk_nlp.nlp_feature_embedding.similarity.get_edit_distance(query_sen_list, candidate_sen_list)
Compute edit distances.
Parameters: * query_sen_list -- e.g. ['我爱中国', '美国总统特朗普']
* **candidate_sen_list** -- e.g. ['我爱地球', '美国总统拜登']
Returns:
sk_nlp.nlp_feature_embedding.similarity.get_edit_similarity(distance_matrix, norm=True)
Invert the edit-distance matrix to obtain an edit-similarity matrix, with optional normalization.
Parameters: * distance_matrix -- the distance matrix
* **norm** -- True/False
Returns:
sk_nlp.nlp_feature_embedding.similarity.get_jaccard_sim(sen_list1, sen_list2, norm=False)
Compute the Jaccard similarity.
Parameters: * sen_list1 -- [['我', '爱', '中国'], ['美国', '总统', '特朗普']]
* **sen_list2** -- [['我', '爱', '地球'], ['美国', '总统', '拜登']]
:param norm: whether to normalize the result :return:
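Both similarity measures above can be sketched with stdlib helpers: classic Levenshtein dynamic programming for edit distance, and set overlap for Jaccard (function names here are illustrative, not sk-nlp's):

```python
def edit_distance(s1, s2):
    """Levenshtein distance over characters, two-row DP."""
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, 1):
        cur = [i]
        for j, c2 in enumerate(s2, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (c1 != c2)))    # substitution
        prev = cur
    return prev[-1]

def jaccard_sim(tokens1, tokens2):
    """|A ∩ B| / |A ∪ B| over token sets."""
    a, b = set(tokens1), set(tokens2)
    return len(a & b) / len(a | b)
```

For the doc's examples, '我爱中国' vs '我爱地球' needs two substitutions, and the tokenized pair shares 2 of 4 distinct tokens, giving Jaccard similarity 0.5.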
sk_nlp.nlp_feature_embedding.similarity.match_topk(sim_matrix, topk=1, order=0)
Return the top-k (or bottom-k) entries of the similarity matrix.
Parameters: * sim_matrix --
* **topk** --
* **order** --
Returns:
sk_nlp.nlp_feature_embedding.similarity.normalization(matrix, reversed=True)
Normalize the matrix along its last dimension.
Parameters: * matrix --
* **reversed** --
Returns:
Traditional word2vec models, covering both skip-gram and CBOW. A 100-dimensional skip-gram model trained on a wiki corpus is currently included.
class sk_nlp.nlp_feature_embedding.w2v.WordEmbedding(model_file_path='/machinelearn/wzh/sk_nlp/sk_nlp/model/w2v/skip_gram_wiki2Vec.h5', embedding_dim=100)
Bases: "object"
fine_tune(new_seg_list, model_file_path)
Fine-tune the existing w2v model on additional corpora, then save the model.
Parameters:
* **new_seg_list** -- new sentences (already tokenized)
* **model_file_path** -- path where the model is saved
Returns:
>>> model = WordEmbedding()
>>> model.get_embedding()
>>> new_seg_list = [['我', '爱','中国'], ['美国', '总统', '特朗普']]
>>> model.fine_tune(new_seg_list, file_conf.ft_wiki_sg_file_path)
get_embedding()
Get information about the word-vector model.
Returns:
embedding_matrix: the word-vector matrix; index_word: index-to-word mapping; word_index: word-to-index mapping
op2model()
Since word2vec exposes too many APIs to wrap cleanly, this method provides examples of common model operations.
Returns:
train_vec(sentence_list, model_file_path, window=5, min_count=5, sg=0)
Train word vectors with w2v.
Parameters:
* **sentence_list** -- list of sentences, e.g. [['我', '爱', '中国'], ['美国', '总统', '特朗普']]
* **model_file_path** -- path where the model is saved
* **window** -- sliding-window size
* **min_count** -- minimum term frequency
* **sg** -- 0 uses CBOW, 1 uses the skip-gram model
Returns:
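The window and sg parameters can be illustrated by how skip-gram (sg=1) derives (center, context) training pairs from tokenized sentences under a sliding window. `skipgram_pairs` below is an illustrative stdlib sketch of that pair generation, not part of sk-nlp (min_count would additionally drop words rarer than the threshold before pairing):

```python
def skipgram_pairs(sentence, window=2):
    """Yield (center, context) pairs for every token whose context word
    lies within `window` positions on either side."""
    pairs = []
    for i, center in enumerate(sentence):
        lo, hi = max(0, i - window), min(len(sentence), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, sentence[j]))
    return pairs
```

For example, `skipgram_pairs(['我', '爱', '中国'], window=1)` produces the four pairs ('我','爱'), ('爱','我'), ('爱','中国'), ('中国','爱'); CBOW instead predicts each center word from the combined context.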
More Resources
- BERT pre-trained models: https://github.com/google-research/bert
- Stop-word corpora: https://github.com/goto456/stopwords
- Official Python Packaging User Guide
- The Hitchhiker's Guide to Packaging
License
This is free and unencumbered software released into the public domain. Anyone is free to copy, modify, publish, use, compile, sell, or distribute this software, either in source code form or as a compiled binary, for any purpose, commercial or non-commercial, and by any means.
File details
Details for the file sk_nlp-0.1.9-py3-none-any.whl.
File metadata
- Download URL: sk_nlp-0.1.9-py3-none-any.whl
- Upload date:
- Size: 317.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.4.1 importlib_metadata/4.5.0 pkginfo/1.7.0 requests/2.25.1 requests-toolbelt/0.9.1 tqdm/4.55.0 CPython/3.6.0
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 49af0f092876a964ae3b3092a1c9ee7a857e8f42876d6da31f04fe1d35d4b523 |
| MD5 | 79ccfc2fcb5025d61a13fd43947d1761 |
| BLAKE2b-256 | 8046ae1192c0599b76ea5aff729b9a6ed2bc717072aa021fe9481ba9261764ea |