Chinese keyword extraction using transformer-based language models
Project description
Chinese_keyBERT
Chinese_keyBERT is a minimal Chinese keyword extraction library that leverages the contextual embeddings generated by BERT models to extract relevant keywords from a given text.
Installation
pip install chinese_keybert
Get started
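from chinese_keybert import Chinese_Extractor

# Create the keyword extractor with its default settings
kw_extractor = Chinese_Extractor()

# The input is a list of documents (here, a single news snippet)
text = [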
'''
渾水創始人:七月開始調查貝殼,因為“好得難以置信” 2021年12月16日,做空機構渾水在社交媒體上公開表示,正在做空美股上市公司貝殼...
'''
]
# Extract the top 5 keywords, ranked with maximal marginal relevance (MMR)
result = kw_extractor.generate_keywords(text, top_k=5, rank_methods="mmr")
print(result)
How it works
The core idea behind Chinese_keyBERT is to use a word segmentation model to split a piece of text into smaller n-grams, then filter those n-grams by part of speech (some POS tags are not suitable as keywords). An embedding model (e.g. BERT) then encodes both the text and the filtered n-grams into embeddings. Finally, a ranking method (e.g. max sum or maximal marginal relevance) scores each n-gram by the cosine similarity between its embedding and the text embedding, and the keywords are ranked by those scores.
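The filtering and ranking steps described above can be sketched in plain Python. This is a minimal illustration, not the library's actual implementation: the function names (`filter_by_pos`, `mmr_rank`), the `KEEP_POS` tag set, the `diversity` parameter, and the three-dimensional toy vectors are all assumptions made for the example. In practice the vectors would come from an embedding model and the tags from a POS tagger.

```python
import math

# Hypothetical set of POS tags to keep; the library's real filter list may differ.
KEEP_POS = {"Na", "Nb", "VC"}

def filter_by_pos(tagged, keep=KEEP_POS):
    """Keep only the candidate words whose POS tag is in `keep`."""
    return [word for word, pos in tagged if pos in keep]

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def mmr_rank(doc_vec, cand_vecs, top_k=5, diversity=0.5):
    """Rank candidates with maximal marginal relevance (MMR).

    cand_vecs maps each candidate word to its embedding. Each step picks the
    candidate that is most similar to the document while being least similar
    to the keywords already selected.
    """
    words = list(cand_vecs)
    if not words:
        return []
    relevance = {w: cosine(doc_vec, cand_vecs[w]) for w in words}
    # Start with the single most relevant candidate.
    selected = [max(words, key=relevance.get)]
    remaining = [w for w in words if w != selected[0]]
    while remaining and len(selected) < top_k:
        def mmr_score(w):
            # Redundancy: highest similarity to any already selected keyword.
            redundancy = max(cosine(cand_vecs[w], cand_vecs[s]) for s in selected)
            return (1 - diversity) * relevance[w] - diversity * redundancy
        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy data standing in for segmenter/tagger output and model embeddings.
tagged = [("做空機構", "Na"), ("做空", "VC"), ("貝殼", "Nb"), ("正在", "D")]
toy_vecs = {
    "做空機構": [1.0, 1.0, 0.0],
    "做空": [1.0, 0.9, 0.0],
    "貝殼": [0.0, 0.2, 1.0],
    "正在": [0.5, 0.5, 0.5],
}
doc_vec = [1.0, 1.0, 0.5]

candidates = filter_by_pos(tagged)
keywords = mmr_rank(doc_vec, {w: toy_vecs[w] for w in candidates}, top_k=2)
print(keywords)
```

With these toy vectors, the second keyword chosen is 貝殼 rather than 做空, because 做空 is nearly identical to the first pick (做空機構) and is penalized as redundant. Setting `diversity=0.0` reduces the ranking to plain cosine similarity against the document.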
To-do
- Documentation
- Vectorized operations to speed up processing of multiple documents
- Support for other word segmentation, part-of-speech, and embedding models
Credit
Chinese_keyBERT was largely inspired by KeyBERT, a minimal library for embedding-based keyword extraction. Chinese_keyBERT also relies heavily on the Chinese word segmentation and POS tagging library from CKIP, as well as sentence-transformers for generating quality embeddings.
Hashes for chinese_keybert-0.1.0-py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | bedfb59adbfeb84936f0f72386a955973d3096151d609c52fa503a9c858f6407
MD5 | cd690b060f4c8e7cd0883a8d5fcf7486
BLAKE2b-256 | 0543b389f81eece163a83777cf7a537d19e91e1551a07b5941e8f42ee63c86b8