Attention-based keyword extraction with ordered semantic word weights
KeyAtten
KeyAtten: Attention-based Keyword/Keyphrase Extraction
Attention-based keyword extraction framework. Zero training, zero labeling, single forward pass. Supports Chinese and English.
Evaluated on 7 public datasets against 14 methods: +67% F1@10 over traditional baselines on Chinese news, +78% over the strongest external method on English long documents.
Default Release Path
- Default Chinese model: thenlper/gte-small-zh
- Default release method: received_attn, plus _idf variants when a corpus is available
- Default deployment path: small encoder + interpretable attention + lightweight operators
The repository still treats gte-small-zh as the lightweight default production model, but the main library now ships decoder-only causal attention adaptation. When no layer is specified for a causal model, KeyAtten automatically recommends a middle-upper layer at roughly 3/4 depth instead of falling back to the last layer. For example, Qwen/Qwen3-Embedding-0.6B defaults to the middle-upper band around layer 21 rather than layer 27.
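The layer default described above can be pictured with a minimal sketch. The function name and the exact rounding rule are assumptions made for illustration; the library may pick from a band around this value rather than a single index.

```python
def recommend_layer(num_layers: int, is_causal: bool) -> int:
    """Return a 0-based layer index for attention readout.

    Assumption: "roughly 3/4 depth" is modelled as floor(3/4 * num_layers).
    Assumption: non-causal (encoder) models read from the last layer.
    """
    if num_layers <= 0:
        raise ValueError("num_layers must be positive")
    if not is_causal:
        return num_layers - 1
    return min(num_layers - 1, (3 * num_layers) // 4)


# A 28-layer causal model, which is what the "layer 21 rather than layer 27"
# example implies with 0-based indexing.
print(recommend_layer(28, is_causal=True))   # 21
print(recommend_layer(28, is_causal=False))  # 27
```

Passing layer_index explicitly bypasses any such default, so this only matters when the argument is omitted.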
Method Categories
The README now groups the library into three public method categories:
1. Main Method: BIO Candidates + Fine-Tuned Attention Reranking
This is the current primary route.
In plain terms:
- replace the default candidate set with BIO candidates
- then use fine-tuned attention to rank them
Main entrypoints:
- KeyAttenExtractor(candidate_scoring="bio")
- CandidateSegmentAttentionExtractor
2. Standalone Methods
- Attention-series methods
- BIOExtractor
- QKLoRAExtractor
Attention-series methods include:
- cls_attn
- received_attn
- samrank
- fusion_attn
- their _idf variants
3. Baselines / Other Methods
- used for comparison, legacy experiments, or external baselines
Features
- Extracts keywords directly from pretrained model attention weights — no fine-tuning or labeling required
- Attention-IDF hybrid strategy for significant gains on long documents and corpus-aware scenarios
- Word-level semantic weight output (weight value, position index, POS tag)
- Single-layer or multi-layer attention weighted fusion
- Candidate-segment attention reranking with BIO-generated phrase candidates
- Lightweight: 22M–33M parameter models, single forward pass
Installation
pip install keyatten
The minimal install depends only on numpy, so importing the package does not pull in the full ML stack by default.
pip install "keyatten[inference,zh]" # Chinese keyword extraction
pip install "keyatten[inference,en]" # English keyword extraction
pip install "keyatten[inference,zh,lightweight]" # Chinese lightweight deployment
pip install "keyatten[full]" # All optional dependencies
Optional dependency groups:
- inference: torch>=2.0, transformers>=4.30
- lightweight: onnx>=1.16, onnxruntime>=1.18, tokenizers>=0.15
- zh: jieba>=0.42
- en: scikit-learn>=1.0, nltk>=3.8
If you call extraction APIs without the required extras installed, KeyAtten now raises a direct install hint instead of failing during import keyatten.
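One common way to get this behaviour is to defer heavy imports until an API that needs them is called. The sketch below shows that general pattern only; the helper name require and the module-to-extra mapping are hypothetical and are not taken from KeyAtten's source.

```python
import importlib

# Hypothetical mapping from module name to the extra that provides it,
# based on the optional dependency groups listed above.
EXTRA_FOR_MODULE = {
    "torch": "inference",
    "transformers": "inference",
    "onnxruntime": "lightweight",
    "jieba": "zh",
    "nltk": "en",
}


def require(module_name: str):
    """Import a module lazily, raising an ImportError with an install hint."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        extra = EXTRA_FOR_MODULE.get(module_name, "full")
        raise ImportError(
            f"{module_name} is required for this feature; "
            f'install it with: pip install "keyatten[{extra}]"'
        ) from exc
```

With this pattern, importing the package itself stays cheap, and the error only surfaces at the call site that actually needs the missing dependency.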
Quick Start
Keyword Extraction
from keyatten import KeyAttenExtractor
ext = KeyAttenExtractor(model="thenlper/gte-small-zh", language="zh")
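# Pure attention. The sample text means "Natural language processing is an important direction in artificial intelligence".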
keywords = ext.extract_keywords(
"自然语言处理是人工智能的重要方向",
method="received_attn",
)
Attention-IDF Hybrid
# Fit IDF from a corpus first
idf = ext.fit_idf(["自然语言处理是人工智能的重要方向", "关键词提取是文本挖掘任务"])
keywords = ext.extract_keywords(
"自然语言处理是人工智能的重要方向",
method="fusion_attn_idf",
idf_lookup=idf,
)
Cache And Incremental IDF
ext = KeyAttenExtractor(
model="Qwen/Qwen3-Embedding-0.6B",
language="zh",
device="cuda",
dtype="float16",
cache_enabled=True,
cache_dir="cache",
)
# fit_idf rebuilds IDF state from scratch.
idf = ext.fit_idf(["old text one", "old text two"])
# update_idf only appends document frequencies for new texts.
idf = ext.update_idf(["new text three"])
keywords = ext.extract_keywords(
"new text three",
method="fusion_attn_idf",
top_k=8,
idf_lookup=idf,
)
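The difference between fit_idf and update_idf noted in the comments above can be illustrated with a small standalone sketch. The class below is hypothetical, works on pre-tokenized documents, and assumes a smoothed IDF formula; KeyAtten's internal state and formula may differ.

```python
import math
from collections import Counter


class IdfState:
    """Toy IDF state keeping document frequencies and a document count."""

    def __init__(self):
        self.n_docs = 0
        self.doc_freq = Counter()

    def fit(self, docs):
        # Rebuild from scratch: earlier statistics are discarded.
        self.n_docs = 0
        self.doc_freq = Counter()
        return self.update(docs)

    def update(self, docs):
        # Incremental: only add document frequencies for the new documents.
        for tokens in docs:
            self.n_docs += 1
            self.doc_freq.update(set(tokens))
        return self.lookup()

    def lookup(self):
        # Assumed smoothed IDF: log((1 + N) / (1 + df)) + 1
        return {
            term: math.log((1 + self.n_docs) / (1 + df)) + 1.0
            for term, df in self.doc_freq.items()
        }


state = IdfState()
idf = state.fit([["old", "text", "one"], ["old", "text", "two"]])
idf = state.update([["new", "text", "three"]])
print(state.n_docs)  # 3
```

Because update only touches counters, it avoids re-reading the original corpus, while every IDF value still changes because the document count N grows.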
When caching is enabled, KeyAtten writes two cache layers:
- cache/keyatten_documents/: pre-IDF cache with tokenization, candidates, token_counts, and attention word scores. It can be reused when IDF changes, avoiding another model forward pass.
- cache/keyatten_keywords/: post-IDF cache with final keywords for a specific IDF fingerprint. Repeated calls with the same config and IDF can return directly.
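A rough way to think about the two layers is as two cache keys, where only the second depends on the IDF table. The key layout and function names below are assumptions made for illustration and are not KeyAtten's actual on-disk format.

```python
import hashlib
import json


def _digest(payload) -> str:
    """Stable SHA-256 hex digest of a JSON-serializable payload."""
    blob = json.dumps(payload, sort_keys=True, ensure_ascii=False).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()


def document_cache_key(text: str, config: dict) -> str:
    # Pre-IDF layer: depends on the text and extractor config only.
    return _digest({"text": text, "config": config})


def idf_fingerprint(idf: dict) -> str:
    # Changes whenever any term weight in the IDF table changes.
    return _digest(sorted(idf.items()))


def keyword_cache_key(text: str, config: dict, idf: dict, top_k: int) -> str:
    # Post-IDF layer: additionally depends on the IDF fingerprint and top_k.
    return _digest({
        "doc": document_cache_key(text, config),
        "idf": idf_fingerprint(idf),
        "top_k": top_k,
    })
```

Under this layout, updating the IDF invalidates only the keyword-level entry; the document-level entry, which holds the expensive attention scores, is still a hit.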
Word-Level Weights
weights = ext.extract_word_weights(
"自然语言处理是人工智能的重要方向",
method="received_attn",
)
for w in weights:
print(w.word, w.weight, w.pos_tag)
Batch Extraction
results = ext.extract_keywords_batch(
["文本一", "文本二", "文本三"],
method="fusion_attn",
)
External Token Input
keywords = ext.extract_keywords(
["空天信息", "系统", "优化"],
pos_tags=["n", "n", "v"],
method="received_attn",
)
Domain Dictionary
ext = KeyAttenExtractor(
model="thenlper/gte-small-zh",
language="zh",
user_dict=["空天信息", "星闪技术"],
)
keywords = ext.extract_keywords(
"空天信息系统优化方法",
method="received_attn",
)
Token-Span Candidate Scoring
ext = KeyAttenExtractor(
model="Qwen/Qwen3-Embedding-0.6B",
language="zh",
candidate_scoring="token_span",
)
# idf is the lookup returned by ext.fit_idf(...) on your corpus
keywords = ext.extract_keywords(
"水木年华被嘲讽已过气,卢庚戌回应称作品会留下来",
method="fusion_attn_idf",
idf_lookup=idf,
)
BIO Candidates Instead of Jieba Candidates
ext = KeyAttenExtractor(
model="Qwen/Qwen3-Embedding-0.6B",
language="zh",
candidate_scoring="bio",
bio_model_path="models/bio_ckipbert_extractive_ep13/bio_model_full.pt",
)
keywords = ext.extract_keywords(
"水木年华被嘲讽已过气,卢庚戌回应称作品会留下来",
method="received_attn",
)
Candidate-Segment Attention Reranking
from keyatten import CandidateSegmentAttentionExtractor
ext = CandidateSegmentAttentionExtractor(
model="Qwen/Qwen3-Embedding-0.6B",
adapter_path="models/candidate_segment_attn/qwen06_v2_2k_len1024_c30/best_adapter",
bio_model_path="models/bio_ckipbert_extractive_ep13/bio_model_full.pt",
max_candidates=30,
)
keywords = ext.extract_keywords(
"水木年华被嘲讽已过气,卢庚戌回应称作品会留下来",
random_seeds=[1, 2, 3],
)
Convenience Function
from keyatten import extract_keywords
keywords = extract_keywords(
"自然语言处理是人工智能的重要方向",
model="thenlper/gte-small-zh",
)
Attention-Series Methods
| Method | Description |
|---|---|
| cls_attn | Attention weights from the [CLS] token to each token |
| received_attn | Total attention each token receives from all tokens |
| samrank | SAMRank formula (global attention + proportional redistribution) |
| fusion_attn | Normalized product of CLS and received attention |
Each method has a corresponding _idf hybrid variant (e.g., cls_attn_idf) that multiplies attention scores with TF-IDF, suitable for corpus-aware scenarios.
The samrank formula is referenced from Kang & Shin (2023, EMNLP). The other methods (cls_attn, received_attn, fusion_attn) and all _idf hybrid strategies are original to this project.
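To make the table concrete, here is a toy sketch of cls_attn, received_attn, fusion_attn, and an _idf-style variant on a single 3x3 attention matrix. It assumes one attention matrix with rows as queries and columns as keys, position 0 as [CLS], and one token per word; the normalization and TF-IDF details are assumptions, and samrank is left out because its formula is defined in the cited paper.

```python
from collections import Counter


def cls_attn(attn):
    """Attention from the [CLS] row (position 0) to every token."""
    return list(attn[0])


def received_attn(attn):
    """Column sums: total attention each token receives from all tokens."""
    return [sum(row[k] for row in attn) for k in range(len(attn))]


def normalize(scores):
    total = sum(scores)
    return [s / total for s in scores] if total > 0 else list(scores)


def fusion_attn(attn):
    """Normalized product of normalized CLS and received attention."""
    c = normalize(cls_attn(attn))
    r = normalize(received_attn(attn))
    return normalize([ci * ri for ci, ri in zip(c, r)])


def with_tfidf(scores, words, idf):
    """_idf-style variant: multiply each score by term frequency times IDF."""
    tf = Counter(words)
    return [s * tf[w] * idf.get(w, 1.0) for s, w in zip(scores, words)]


# Toy row-stochastic attention matrix (made-up numbers).
attn = [
    [0.2, 0.5, 0.3],
    [0.1, 0.6, 0.3],
    [0.3, 0.3, 0.4],
]
words = ["[CLS]", "keyword", "extraction"]
print(received_attn(attn))
print(fusion_attn(attn))
print(with_tfidf(received_attn(attn), words, {"keyword": 2.0, "extraction": 1.5}))
```

A real pipeline would also drop special tokens and merge subword tokens into words before ranking; those steps are omitted here.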
Using Attention as a Secondary Method
received_attn is now the safest default starting point. When a corpus is available, _idf variants should be tried first; in the latest Chinese decoder-only rollup, received_attn_idf is the main CSL path and fusion_attn_idf is the main ShenCeCup path. cls_attn is still useful for high-distinctiveness tag-cloud style outputs, but it is no longer the default keyword-extraction method.
If your main metric is F1@5, the library now also exposes an optional nested-phrase de-dup post-ranking step. It only activates when top_k <= 5, filters substring/superstring duplicates such as natural language processing / natural language / language processing, and stays off by default so the @10 path is unchanged.
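The nested-phrase de-dup step can be sketched as a greedy filter over an already ranked list. The function below is a hypothetical illustration that assumes the higher-ranked phrase wins and that overlap is tested by plain substring matching; the library's actual tie-breaking may differ.

```python
def dedup_nested(ranked, top_k, enabled=True):
    """Drop phrases that are substrings or superstrings of a kept phrase.

    Only active when enabled and top_k <= 5; otherwise the ranked list is
    simply truncated to top_k.
    """
    if not enabled or top_k > 5:
        return ranked[:top_k]
    kept = []
    for phrase in ranked:
        if any(phrase in k or k in phrase for k in kept):
            continue  # nested duplicate of something already kept
        kept.append(phrase)
        if len(kept) == top_k:
            break
    return kept


ranked = [
    "natural language processing",
    "natural language",
    "language processing",
    "keyword extraction",
]
print(dedup_nested(ranked, top_k=5))
# ['natural language processing', 'keyword extraction']
```

Plain substring matching can over-match in English (for example, a short word inside a longer unrelated word), so a production version would likely compare on word boundaries.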
For raw string input, the library now also exposes an optional candidate_scoring="token_span" route. Candidate generation still follows the segmenter and POS filter, but ranking aggregates token attention directly over each candidate's character span, bypassing the previous word-level mean-of-means path.
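The token_span idea can be illustrated as follows: each candidate is located in the raw text, and the scores of the tokens overlapping its character span are aggregated directly. This is a sketch under assumptions: the function names are hypothetical, the aggregation is a plain mean, and a candidate occurring several times takes its best occurrence.

```python
def span_score(token_offsets, token_scores, span_start, span_end):
    """Mean score of tokens whose character range overlaps [span_start, span_end)."""
    picked = [
        score
        for (start, end), score in zip(token_offsets, token_scores)
        if start < span_end and end > span_start
    ]
    return sum(picked) / len(picked) if picked else 0.0


def score_candidates(text, candidates, token_offsets, token_scores):
    """Score each candidate by its best-scoring occurrence in the text."""
    results = {}
    for cand in candidates:
        best = 0.0
        pos = text.find(cand)
        while pos != -1:
            score = span_score(token_offsets, token_scores, pos, pos + len(cand))
            best = max(best, score)
            pos = text.find(cand, pos + 1)
        results[cand] = best
    return results


text = "keyword extraction"
# Subword tokens "key", "word", "extraction" with made-up attention scores.
token_offsets = [(0, 3), (3, 7), (8, 18)]
token_scores = [0.5, 0.3, 0.1]
print(score_candidates(text, ["keyword", "extraction"], token_offsets, token_scores))
```

Character offsets of this kind are what fast tokenizers typically expose as offset mappings, which is why the sketch takes them as input rather than re-tokenizing.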
For Chinese raw string input, the library also exposes candidate_scoring="bio". This replaces the default jieba/POS candidate generator with BIOExtractor candidates first, then scores those BIO candidates with attention.
For trained Chinese reranking, the library also exposes CandidateSegmentAttentionExtractor. This route uses BIOExtractor only for candidate generation, then reranks the explicit candidate list with attention over the full document + candidate segment input. If you use random candidate order, prefer multi-seed inference such as random_seeds=[1, 2, 3] to reduce order sensitivity.
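The multi-seed recommendation amounts to averaging scores over several shuffled candidate orders. The sketch below uses a hypothetical order-sensitive toy scorer in place of the real reranker, with made-up base scores, purely to show the averaging step.

```python
import random
from collections import defaultdict


def rank_with_seeds(candidates, score_in_order, seeds):
    """Average per-candidate scores across several shuffled candidate orders."""
    totals = defaultdict(float)
    for seed in seeds:
        order = list(candidates)
        random.Random(seed).shuffle(order)
        # score_in_order returns one score per candidate, aligned with order.
        for cand, score in zip(order, score_in_order(order)):
            totals[cand] += score
    return sorted(candidates, key=lambda c: totals[c] / len(seeds), reverse=True)


# Toy scorer (assumption): a fixed base score minus a small position penalty,
# mimicking a reranker whose output depends slightly on candidate order.
BASE = {"rate hike": 0.5, "basis points": 0.3, "global stocks": 0.1}


def toy_scorer(order):
    return [BASE[c] - 0.05 * i for i, c in enumerate(order)]


print(rank_with_seeds(list(BASE), toy_scorer, seeds=[1, 2, 3]))
```

Averaging does not remove order bias, but it spreads the bias across candidates so that no single candidate benefits from always appearing first.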
Practical Examples
Side-by-side comparison of cls_attn vs samrank across domains (model: gte-small-zh, top_k=6):
| Domain | Input (excerpt) | cls_attn | samrank |
|---|---|---|---|
| Tech | OpenAI released GPT-4o with multimodal input... | OpenAI, GPT, model | OpenAI, model, GPT |
| Medical | mRNA vaccine encodes spike protein... Omicron variant... | mRNA, mRNA vaccine, COVID, Omicron variant | mRNA, mRNA vaccine, COVID, COVID virus |
| Finance | Fed announces 25bp rate hike... | rate hike, basis points, global stocks, rate | rate hike, basis points, rate, global stocks |
| Sports | Messi scores hat-trick in World Cup final... lifts trophy | Messi, trophy, hat-trick, final | trophy, Messi, hat-trick, penalty |
| History | Qin Shi Huang unified six states... centralized dynasty | centralization, feudal dynasty, standardization | centralization, standardization, feudal dynasty |
| Daily | Meet at Starbucks at 3pm... business trip to Beijing | Starbucks, Beijing, business trip | meet, Beijing, chat |
cls_attn favors the most distinctive entities (Messi, Starbucks, Omicron), ideal for tag clouds and summary displays. samrank provides broader coverage, better suited for retrieval and evaluation scenarios.
Recommended Models
| Language | Model | Parameters |
|---|---|---|
| Chinese | thenlper/gte-small-zh | ~33M |
| English | sentence-transformers/all-MiniLM-L6-v2 | ~22M |
Decoder-Only Support
The main library now includes the stable decoder-only gains:
- automatic causal model detection
- default Chinese causal prefix 核心关键词、关键实体、主题: ("Core keywords, key entities, topics:")
- automatic middle-upper layer recommendation when layer_index is omitted, using a band at roughly 3/4 depth for causal models
- current recommended Chinese decoder-only combination: Qwen/Qwen3-Embedding-0.6B + fusion_attn_idf
Latest rollout details are documented in the project's internal experiment notes under docs/.
Lightweight Deployment
The recommended lightweight deployment path is gte-small-zh + ONNX Runtime. Internal validation shows that gte-small-zh can export token attention and reproduce received_attn word scores with stable numerical agreement, making it the default route for lightweight operators and deployment work.
Recommended install:
pip install "keyatten[zh,lightweight]"
Lightweight backend example:
from keyatten import KeyAttenExtractor
ext = KeyAttenExtractor(
model="/path/to/thenlper__gte-small-zh",
language="zh",
backend="onnx",
onnx_path="/path/to/attention_last.onnx",
)
keywords = ext.extract_keywords(
"自然语言处理用于关键词提取与文本分析",
method="received_attn",
)
Notes:
- model should point to a local gte-small-zh directory so KeyAtten can read tokenizer.json
- onnx_path should point to the exported attention ONNX file
- the lightweight backend currently supports a single exported attention layer, which matches the default gte-small-zh release path
- if you want to export the ONNX file yourself, install keyatten[inference,zh,lightweight] instead
Benchmark Entry
Use the single benchmark entrypoint instead of browsing scripts under benchmark/:
python -m keyatten.benchmark_cli --help
python -m keyatten.benchmark_cli keyword --root-dir "." --output-dir "outputs_smoke" --datasets csl_test --models thenlper/gte-small-zh --skip-yake --device cpu
After editable install, you can use:
keyatten-benchmark --help
keyatten-benchmark gte-onnx-probe
Main command mapping:
- keyword -> benchmark/eval/run_keyword_benchmark.py
- hidden-head -> benchmark/eval/run_hidden_head_benchmark.py
- gte-onnx-probe -> benchmark/tools/gte_onnx_probe.py
- llm-keyword -> benchmark/eval/llm_keyword_benchmark.py
Full benchmark usage notes: benchmark/README.md
Evaluation Summary
Compared against 14 methods in total, including TF-IDF, TextRank, and KeyBERT, on 7 public datasets (F1@10):
| Scenario | KeyAtten Best | Method | vs Strongest Traditional | vs Strongest External |
|---|---|---|---|---|
| Chinese News (news55) | 0.4994 | BIO Viterbi | +224% | — |
| Chinese News (ShenCeCup 1000) | 0.3292 | QK LoRA | +113% | — |
| Chinese Academic (paper_test_800) | 0.2752 | CSA (high_recall) | — | — |
| Chinese Academic (CSL, zero-shot) | 0.2106 | samrank_idf | +9% | — |
| English Long-doc (SemEval2010-fulltext) | 0.1344 | cls_attn_idf | — | +78% |
| English Long-doc (Krapivin2009-fulltext) | 0.1268 | cls_attn_idf | — | +79% |
| English Short-doc (3 datasets) | 0.1370 | fusion_attn | — | On par |
The main method (BIO candidates + fine-tuned Candidate-Segment Attention reranking) achieves F1@10 = 0.4665 on news55, a +13.7% improvement over the BIO-only clean baseline (0.3916).
Full evaluation report: EVALUATION-PUBLIC.md
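For readers unfamiliar with the metric, F1@10 can be computed as in the sketch below. It assumes exact matching after lowercasing and uses the number of returned predictions (at most k) as the precision denominator; the project's evaluation scripts may normalize phrases differently, for example with stemming.

```python
def f1_at_k(predicted, gold, k=10):
    """F1 between the top-k predicted keyphrases and the gold keyphrases."""
    top = []
    for phrase in predicted:
        phrase = phrase.strip().lower()
        if phrase and phrase not in top:
            top.append(phrase)
        if len(top) == k:
            break
    gold_set = {g.strip().lower() for g in gold if g.strip()}
    if not top or not gold_set:
        return 0.0
    hits = sum(1 for phrase in top if phrase in gold_set)
    if hits == 0:
        return 0.0
    precision = hits / len(top)
    recall = hits / len(gold_set)
    return 2 * precision * recall / (precision + recall)


# 2 of 4 predictions are correct, 2 of 3 gold phrases are found.
print(f1_at_k(["a", "b", "c", "d"], ["a", "c", "e"]))
```

A dataset-level score is then usually the mean of this value over all documents.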
API
KeyAttenExtractor
KeyAttenExtractor(
model: str, # Hugging Face model name
language: str = "zh", # "zh" or "en"
device: str = "cpu", # compute device
backend: str = "auto", # "auto" / "torch" / "onnx"
onnx_path: str | None = None, # ONNX attention file path
user_dict: str | list[str] | dict = None, # domain dictionary path / term list / term config
layer_index: int | None = None, # None = auto; causal models default to the middle-upper band at roughly 3/4 depth, -1 = explicit last layer
layer_indices: list[int] = None, # multi-layer indices
layer_weights: list[float] = None, # multi-layer weights
instruction_prefix: str | None = None, # optional prefix for causal models
is_causal_override: bool | None = None, # None=auto detect; False=force encoder-style readout; True=force decoder-style readout
dedup_nested_for_topk5: bool = False, # enable substring de-dup post-processing only when top_k<=5
candidate_scoring: str = "word", # "word" / "token_span" / "bio"
cache_enabled: bool = False, # enable disk cache
cache_dir: str | Path = "cache", # cache directory
)
| Method | Returns |
|---|---|
| extract_keywords(text, method, top_k, idf_lookup) | list[str] |
| extract_keywords_batch(texts, method, top_k, idf_lookup) | list[list[str]] |
| extract_word_weights(text, method) | list[WordWeight] |
| fit_idf(texts) | dict[str, float] |
| update_idf(texts) | dict[str, float] |
WordWeight fields: word, index, weight, pos_tag.
Notes:
- extract_keywords and extract_word_weights also accept pre-tokenized list[str]
- when external tokens are provided, pos_tags is optional; Chinese defaults to n, English defaults to eng
- user_dict accepts a dictionary file path, a term list, or mappings like {term: tag} / {term: (freq, tag)}
- extract_keywords() and extract_keywords_batch() now default to received_attn
- if layer_index is omitted for a causal model, KeyAtten automatically uses the recommended middle-upper layer at roughly 3/4 depth
- is_causal_override only overrides the attention readout mode; it does not change the underlying model architecture
- when dedup_nested_for_topk5=True, substring/superstring de-dup is applied only for top_k<=5, not for @10
- candidate_scoring="token_span" only applies to raw string input; external token input stays on the word-based ranking path
- candidate_scoring="bio" requires bio_model_path and only applies to raw string input
- fit_idf() rebuilds IDF state; update_idf() incrementally appends new documents to the current state
- when cache_enabled=True, the word candidate path caches both pre-IDF document scores and post-IDF final keywords
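The layer_indices and layer_weights parameters suggest a weighted combination of per-layer scores. The following sketch shows one plausible reading, a weighted average with weights normalized to sum to 1; the function name and the normalization are assumptions, not the library's confirmed behaviour.

```python
def fuse_layers(layer_scores, layer_indices, layer_weights=None):
    """Weighted average of per-layer word scores.

    layer_scores maps a layer index to a list of scores (one per word).
    If layer_weights is omitted, all selected layers are weighted equally.
    """
    if layer_weights is None:
        layer_weights = [1.0] * len(layer_indices)
    if len(layer_weights) != len(layer_indices):
        raise ValueError("layer_weights must match layer_indices in length")
    total = sum(layer_weights)
    if total <= 0:
        raise ValueError("layer_weights must sum to a positive value")
    n_words = len(layer_scores[layer_indices[0]])
    fused = [0.0] * n_words
    for idx, weight in zip(layer_indices, layer_weights):
        for i, score in enumerate(layer_scores[idx]):
            fused[i] += (weight / total) * score
    return fused


# Two layers with made-up scores for two words; layer 4 counts three times as much.
scores = {4: [1.0, 0.0], 5: [0.0, 1.0]}
print(fuse_layers(scores, layer_indices=[4, 5], layer_weights=[3.0, 1.0]))
# [0.75, 0.25]
```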
Citation
The samrank method in this project references the ranking formula from:
Kang, B., & Shin, H. (2023). SAMRank: Unsupervised Keyphrase Extraction using Self-Attention Map in BERT and GPT-2. EMNLP 2023. DOI: 10.18653/v1/2023.emnlp-main.630
cls_attn, received_attn, fusion_attn and all _idf hybrid strategies are original to this project.
Acknowledgments
Thanks to the LinuxDo community for their support.