Attention-based keyword extraction with ordered semantic word weights
KeyAtten
Attention-based keyword extraction framework. Zero training, zero labeling, single forward pass. Supports Chinese and English.
Evaluated on 7 public datasets against 14 methods: +67% F1@10 over the strongest traditional baseline on Chinese news, +78% over the strongest external method on English long documents.
Features
- Extracts keywords directly from pretrained model attention weights — no fine-tuning or labeling required
- Attention-IDF hybrid strategy for significant gains on long documents and corpus-aware scenarios
- Word-level semantic weight output (weight value, position index, POS tag)
- Single-layer or multi-layer attention weighted fusion
- Lightweight: 22M–33M parameter models, single forward pass
Installation
```bash
pip install keyatten
```
Dependencies: torch>=2.0 transformers>=4.30 jieba scikit-learn nltk numpy
Quick Start
Keyword Extraction
```python
from keyatten import KeyAttenExtractor

ext = KeyAttenExtractor(model="thenlper/gte-small-zh", language="zh")

# Pure attention
keywords = ext.extract_keywords(
    "自然语言处理是人工智能的重要方向",
    method="cls_attn",
)
```
Attention-IDF Hybrid
```python
# Fit IDF from a corpus first
idf = ext.fit_idf(["自然语言处理是人工智能的重要方向", "关键词提取是文本挖掘任务"])
keywords = ext.extract_keywords(
    "自然语言处理是人工智能的重要方向",
    method="samrank_idf",
    idf_lookup=idf,
)
```
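The IDF table returned by fit_idf is a plain word-to-float mapping. As a rough sketch of how such a table is typically built (smoothed document-frequency IDF in the scikit-learn convention; the library's actual smoothing is not documented here, and whitespace-split token lists stand in for jieba tokenization):

```python
import math

def fit_idf_sketch(docs):
    """Smoothed IDF over a list of pre-tokenized documents."""
    n = len(docs)
    df = {}
    for doc in docs:
        for word in set(doc):          # count each word once per document
            df[word] = df.get(word, 0) + 1
    # add-one smoothing keeps rare words finite and positive
    return {w: math.log((1 + n) / (1 + c)) + 1 for w, c in df.items()}

corpus = [["natural", "language", "processing"],
          ["keyword", "extraction", "language"]]
idf = fit_idf_sketch(corpus)
# "language" appears in both documents, so it gets the lowest IDF
```

Words shared across the corpus are down-weighted, which is exactly what the attention-IDF hybrid exploits.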
Word-Level Weights
```python
weights = ext.extract_word_weights(
    "自然语言处理是人工智能的重要方向",
    method="received_attn",
)
for w in weights:
    print(w.word, w.weight, w.pos_tag)
```
Batch Extraction
```python
results = ext.extract_keywords_batch(
    ["文本一", "文本二", "文本三"],
    method="fusion_attn",
)
```
Convenience Function
```python
from keyatten import extract_keywords

keywords = extract_keywords(
    "自然语言处理是人工智能的重要方向",
    model="thenlper/gte-small-zh",
)
```
Methods
| Method | Description |
|---|---|
| cls_attn | Attention weights from the [CLS] token to each token |
| received_attn | Total attention each token receives from all tokens |
| samrank | SAMRank formula (global attention + proportional redistribution) |
| fusion_attn | Normalized product of CLS and received attention |
Each method has a corresponding _idf hybrid variant (e.g., cls_attn_idf) that multiplies attention scores with TF-IDF, suitable for corpus-aware scenarios.
The samrank formula is referenced from Kang & Shin (2023, EMNLP). The other methods (cls_attn, received_attn, fusion_attn) and all _idf hybrid strategies are original to this project.
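Ignoring tokenization details, the three non-samrank scores can be sketched from a single head-averaged attention matrix. This is a minimal illustration under the assumption that `A[i, j]` is the attention token i pays to token j, with position 0 being [CLS] — not the library's actual code:

```python
import numpy as np

def attention_scores(attn, method="cls_attn"):
    """Token scores from a (seq, seq) head-averaged attention matrix."""
    if method == "cls_attn":
        return attn[0]               # attention [CLS] pays to each token
    if method == "received_attn":
        return attn.sum(axis=0)      # total attention each token receives
    if method == "fusion_attn":
        cls = attn[0] / attn[0].sum()
        recv = attn.sum(axis=0)
        recv = recv / recv.sum()
        return cls * recv            # normalized product of the two views
    raise ValueError(method)

# Toy 3-token attention matrix; each row sums to 1
attn = np.array([[0.2, 0.5, 0.3],
                 [0.1, 0.6, 0.3],
                 [0.4, 0.3, 0.3]])
cls = attention_scores(attn, "cls_attn")        # [CLS] row
recv = attention_scores(attn, "received_attn")  # column sums
```

An _idf hybrid would then multiply each word's score by its IDF value before ranking.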
Choosing a Method
samrank achieves the highest benchmark scores (F1@10) due to broader coverage and stronger recall. In practice, cls_attn is often more useful — it extracts the most distinctive core terms, making it ideal for tag clouds and summaries.
Practical Examples
Side-by-side comparison of cls_attn vs samrank across domains (model: gte-small-zh, top_k=6):
| Domain | Input (excerpt) | cls_attn | samrank |
|---|---|---|---|
| Tech | OpenAI released GPT-4o with multimodal input... | OpenAI, GPT, model | OpenAI, model, GPT |
| Medical | mRNA vaccine encodes spike protein... Omicron variant... | mRNA, mRNA vaccine, COVID, Omicron variant | mRNA, mRNA vaccine, COVID, COVID virus |
| Finance | Fed announces 25bp rate hike... | rate hike, basis points, global stocks, rate | rate hike, basis points, rate, global stocks |
| Sports | Messi scores hat-trick in World Cup final... lifts trophy | Messi, trophy, hat-trick, final | trophy, Messi, hat-trick, penalty |
| History | Qin Shi Huang unified six states... centralized dynasty | centralization, feudal dynasty, standardization | centralization, standardization, feudal dynasty |
| Daily | Meet at Starbucks at 3pm... business trip to Beijing | Starbucks, Beijing, business trip | meet, Beijing, chat |
cls_attn favors the most distinctive entities (Messi, Starbucks, Omicron), ideal for tag clouds and summary displays. samrank provides broader coverage, better suited for retrieval and evaluation scenarios.
Recommended Models
| Language | Model | Parameters |
|---|---|---|
| Chinese | thenlper/gte-small-zh | ~33M |
| English | sentence-transformers/all-MiniLM-L6-v2 | ~22M |
Evaluation Summary
Compared against 14 methods in total, including TF-IDF, TextRank, and KeyBERT, on 7 public datasets (F1@10):
| Scenario | KeyAtten Best | vs Strongest Traditional | vs Strongest External |
|---|---|---|---|
| Chinese News (ShenCeCup) | 0.2579 | +67% | — |
| Chinese Academic (CSL) | 0.2106 | +9% | — |
| English Long-doc (SemEval2010-fulltext) | 0.1344 | — | +78% |
| English Long-doc (Krapivin2009-fulltext) | 0.1268 | — | +79% |
| English Short-doc (3 datasets) | 0.1370 | — | On par |
Full evaluation report: EVALUATION-PUBLIC.md
API
KeyAttenExtractor
```python
KeyAttenExtractor(
    model: str,                         # Hugging Face model name
    language: str = "zh",               # "zh" or "en"
    device: str = "cpu",                # compute device
    layer_index: int = -1,              # single layer index (-1 = last layer)
    layer_indices: list[int] = None,    # multi-layer indices
    layer_weights: list[float] = None,  # multi-layer weights
    attn_merge: bool = False,           # attention-guided char merging for Chinese
    merge_threshold: float = 0.3,       # merge threshold (0.0–1.0)
)
```
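The layer_indices / layer_weights options suggest that multi-layer fusion is a weighted sum of per-layer attention maps. A minimal NumPy sketch of that idea (assumed behavior, not the actual implementation):

```python
import numpy as np

def fuse_layers(layer_attns, weights):
    """Weighted fusion of per-layer attention maps.

    layer_attns: list of (seq, seq) attention matrices, averaged over heads.
    weights:     one weight per selected layer; normalized to sum to 1.
    """
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()
    stacked = np.stack(layer_attns)                # (n_layers, seq, seq)
    return np.tensordot(weights, stacked, axes=1)  # (seq, seq)

# Toy example: two 3x3 attention maps, later layer weighted higher
a1 = np.full((3, 3), 1 / 3)
a2 = np.eye(3)
fused = fuse_layers([a1, a2], weights=[0.3, 0.7])
```

Because each input map is row-stochastic, the fused map stays row-stochastic, so downstream scoring is unchanged in scale.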
| Method | Returns |
|---|---|
| extract_keywords(text, method, top_k, idf_lookup) | list[str] |
| extract_keywords_batch(texts, method, top_k, idf_lookup) | list[list[str]] |
| extract_word_weights(text, method) | list[WordWeight] |
| fit_idf(texts) | dict[str, float] |
WordWeight fields: word, index, weight, pos_tag.
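Those four fields could be modeled as a small frozen dataclass; this is an illustrative sketch, and the library's actual WordWeight class may differ:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class WordWeight:
    word: str      # surface form of the word
    index: int     # position of the word in the input text
    weight: float  # semantic weight derived from attention
    pos_tag: str   # part-of-speech tag (e.g. jieba tags for Chinese)

w = WordWeight(word="语言", index=1, weight=0.42, pos_tag="n")
```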
Citation
The samrank method in this project references the ranking formula from:
Kang, B., & Shin, H. (2023). SAMRank: Unsupervised Keyphrase Extraction using Self-Attention Map in BERT and GPT-2. EMNLP 2023. DOI: 10.18653/v1/2023.emnlp-main.630
cls_attn, received_attn, fusion_attn and all _idf hybrid strategies are original to this project.
License
File details
Details for the file keyatten-0.1.0.tar.gz.
File metadata
- Download URL: keyatten-0.1.0.tar.gz
- Upload date:
- Size: 16.9 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 8ab4838136b69c6d1a7e1fbd07165c88dd77ad2b8f536c4a07bc261315517791 |
| MD5 | bca47b584d5801a9c25554ec6e5bf7d6 |
| BLAKE2b-256 | 0c436f8d214c758bcb9bf07b8e1e570d022e29050efba497f4f33a8c9bb41623 |
File details
Details for the file keyatten-0.1.0-py3-none-any.whl.
File metadata
- Download URL: keyatten-0.1.0-py3-none-any.whl
- Upload date:
- Size: 15.9 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 37729585f5ad7c3341d428223ea884e448571e4f6e1de6369fe12753e4cc91fd |
| MD5 | f5836276c96110fa4ad68cd228278cea |
| BLAKE2b-256 | 2720b1097b96b8c2415ec2901c6ed7fe1f65e719f57b2bcb81e810b385c883da |