
Attention-based keyword extraction with ordered semantic word weights


KeyAtten

English | 中文

Attention-based keyword extraction framework. Zero training, zero labeling, single forward pass. Supports Chinese and English.

Evaluated on 7 public datasets against 14 methods: +67% F1@10 over traditional baselines on Chinese news, +78% over the strongest external method on English long documents.

Features

  • Extracts keywords directly from pretrained model attention weights — no fine-tuning or labeling required
  • Attention-IDF hybrid strategy for significant gains on long documents and corpus-aware scenarios
  • Word-level semantic weight output (weight value, position index, POS tag)
  • Single-layer or multi-layer attention weighted fusion
  • Lightweight: 22M–33M parameter models, single forward pass

Installation

pip install .

Dependencies: torch>=2.0, transformers>=4.30, jieba, scikit-learn, nltk, numpy

Quick Start

Keyword Extraction

from keyatten import KeyAttenExtractor

ext = KeyAttenExtractor(model="thenlper/gte-small-zh", language="zh")

# Pure attention
keywords = ext.extract_keywords(
    "自然语言处理是人工智能的重要方向",
    method="cls_attn",
)

Attention-IDF Hybrid

# Fit IDF from a corpus first
idf = ext.fit_idf(["自然语言处理是人工智能的重要方向", "关键词提取是文本挖掘任务"])

keywords = ext.extract_keywords(
    "自然语言处理是人工智能的重要方向",
    method="samrank_idf",
    idf_lookup=idf,
)
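
Because fit_idf returns a plain dict[str, float] (see the API section), the IDF table can be persisted and reused across runs instead of being refit every time. A minimal sketch, assuming you fit once on your own corpus; the file name and corpus variable are illustrative:

import json

# Fit once on a representative corpus (the two-sentence list above is only a demo).
idf = ext.fit_idf(corpus_texts)  # corpus_texts: your own list of documents

# Persist the plain dict so later runs can skip refitting.
with open("idf_zh.json", "w", encoding="utf-8") as f:
    json.dump(idf, f, ensure_ascii=False)

# Reload and pass to any *_idf method via idf_lookup.
with open("idf_zh.json", encoding="utf-8") as f:
    idf = json.load(f)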

Word-Level Weights

weights = ext.extract_word_weights(
    "自然语言处理是人工智能的重要方向",
    method="received_attn",
)
for w in weights:
    print(w.word, w.weight, w.pos_tag)
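
The returned objects can be post-processed directly; for example, a short sketch (using the WordWeight fields listed in the API section) that keeps the five highest-weighted words:

top5 = sorted(weights, key=lambda w: w.weight, reverse=True)[:5]
print([w.word for w in top5])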

Batch Extraction

results = ext.extract_keywords_batch(
    ["文本一", "文本二", "文本三"],
    method="fusion_attn",
)

Convenience Function

from keyatten import extract_keywords

keywords = extract_keywords(
    "自然语言处理是人工智能的重要方向",
    model="thenlper/gte-small-zh",
)

Methods

Method        | Description
cls_attn      | Attention weights from the [CLS] token to each token
received_attn | Total attention each token receives from all tokens
samrank       | SAMRank formula (global attention + proportional redistribution)
fusion_attn   | Normalized product of CLS and received attention
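
For intuition, these scores can all be read off a single attention map. A minimal NumPy sketch, not the library's internal code, assuming attn is a head-averaged (seq_len, seq_len) matrix where row i holds the attention token i pays to every token and position 0 is [CLS]:

import numpy as np

rng = np.random.default_rng(0)
attn = rng.random((8, 8))
attn = attn / attn.sum(axis=1, keepdims=True)  # rows sum to 1, like softmax output

cls_attn = attn[0]                # attention from [CLS] to each token
received_attn = attn.sum(axis=0)  # total attention each token receives from all tokens

# fusion_attn: normalized product of the two views (one plausible normalization)
fusion = (cls_attn / cls_attn.sum()) * (received_attn / received_attn.sum())
fusion_attn = fusion / fusion.sum()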

Each method has a corresponding _idf hybrid variant (e.g., cls_attn_idf) that multiplies attention scores with TF-IDF, suitable for corpus-aware scenarios.
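
Conceptually, the hybrid score for a word is just its attention score multiplied by its (TF-)IDF weight. A rough sketch, assuming a per-word attention score dict and an idf_lookup like the one returned by fit_idf; how KeyAtten handles words missing from the IDF table may differ:

attn_scores = {"自然语言处理": 0.42, "人工智能": 0.31, "方向": 0.12}
idf_lookup = {"自然语言处理": 2.1, "人工智能": 1.7}

# Words missing from the IDF table fall back to 1.0 in this sketch.
hybrid = {w: s * idf_lookup.get(w, 1.0) for w, s in attn_scores.items()}
ranked = sorted(hybrid, key=hybrid.get, reverse=True)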

The samrank formula follows Kang & Shin (2023, EMNLP). The other methods (cls_attn, received_attn, fusion_attn) and all _idf hybrid strategies are original to this project.

Choosing a Method

samrank achieves the highest benchmark scores (F1@10) due to broader coverage and stronger recall. In practice, cls_attn is often more useful — it extracts the most distinctive core terms, making it ideal for tag clouds and summaries.

Practical Examples

Side-by-side comparison of cls_attn vs samrank across domains (model: gte-small-zh, top_k=6):

Domain  | Input (excerpt)                                           | cls_attn                                    | samrank
Tech    | OpenAI released GPT-4o with multimodal input...           | OpenAI, GPT, model                          | OpenAI, model, GPT
Medical | mRNA vaccine encodes spike protein... Omicron variant...  | mRNA, mRNA vaccine, COVID, Omicron variant  | mRNA, mRNA vaccine, COVID, COVID virus
Finance | Fed announces 25bp rate hike...                           | rate hike, basis points, global stocks, rate | rate hike, basis points, rate, global stocks
Sports  | Messi scores hat-trick in World Cup final... lifts trophy | Messi, trophy, hat-trick, final             | trophy, Messi, hat-trick, penalty
History | Qin Shi Huang unified six states... centralized dynasty   | centralization, feudal dynasty, standardization | centralization, standardization, feudal dynasty
Daily   | Meet at Starbucks at 3pm... business trip to Beijing      | Starbucks, Beijing, business trip           | meet, Beijing, chat

cls_attn favors the most distinctive entities (Messi, Starbucks, Omicron), ideal for tag clouds and summary displays. samrank provides broader coverage, better suited for retrieval and evaluation scenarios.
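
To run this kind of comparison on your own text, the two methods can be called side by side; a minimal sketch with an illustrative sentence:

text = "梅西在世界杯决赛上演帽子戏法并捧起奖杯"  # illustrative input
for method in ("cls_attn", "samrank"):
    print(method, ext.extract_keywords(text, method=method, top_k=6))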

Recommended Models

Language | Model                                  | Parameters
Chinese  | thenlper/gte-small-zh                  | ~33M
English  | sentence-transformers/all-MiniLM-L6-v2 | ~22M
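
For English documents the same interface applies; a short sketch with the recommended English model:

from keyatten import KeyAttenExtractor

en_ext = KeyAttenExtractor(
    model="sentence-transformers/all-MiniLM-L6-v2",
    language="en",
)
keywords = en_ext.extract_keywords(
    "Attention-based keyword extraction works without any fine-tuning.",
    method="cls_attn",
)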

Evaluation Summary

Compared against 14 methods in total, including TF-IDF, TextRank, and KeyBERT, on 7 public datasets (F1@10):

Scenario                                 | KeyAtten Best (F1@10) | Improvement
Chinese News (ShenCeCup)                 | 0.2579                | +67% vs strongest traditional
Chinese Academic (CSL)                   | 0.2106                | +9%
English Long-doc (SemEval2010-fulltext)  | 0.1344                | +78% vs strongest external
English Long-doc (Krapivin2009-fulltext) | 0.1268                | +79%
English Short-doc (3 datasets)           | 0.1370                | On par

Full evaluation report: EVALUATION-PUBLIC.md

API

KeyAttenExtractor

KeyAttenExtractor(
    model: str,                         # Hugging Face model name
    language: str = "zh",               # "zh" or "en"
    device: str = "cpu",                # compute device
    layer_index: int = -1,              # single layer index (-1 = last layer)
    layer_indices: list[int] = None,    # multi-layer indices
    layer_weights: list[float] = None,  # multi-layer weights
    attn_merge: bool = False,           # attention-guided char merging for Chinese
    merge_threshold: float = 0.3,       # merge threshold (0.0–1.0)
)
Method                                                   | Returns
extract_keywords(text, method, top_k, idf_lookup)        | list[str]
extract_keywords_batch(texts, method, top_k, idf_lookup) | list[list[str]]
extract_word_weights(text, method)                       | list[WordWeight]
fit_idf(texts)                                           | dict[str, float]

WordWeight fields: word, index, weight, pos_tag.
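
As a sketch of the multi-layer fusion and Chinese char-merging options above (the layer indices and weights here are illustrative choices, not recommended defaults):

ext = KeyAttenExtractor(
    model="thenlper/gte-small-zh",
    language="zh",
    layer_indices=[-1, -2],     # fuse the last two layers...
    layer_weights=[0.6, 0.4],   # ...with illustrative weights
    attn_merge=True,            # attention-guided char merging for Chinese
    merge_threshold=0.3,
)
for w in ext.extract_word_weights("自然语言处理是人工智能的重要方向", method="fusion_attn"):
    print(w.word, w.index, w.weight, w.pos_tag)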

Citation

The samrank method in this project references the ranking formula from:

Kang, B., & Shin, H. (2023). SAMRank: Unsupervised Keyphrase Extraction using Self-Attention Map in BERT and GPT-2. EMNLP 2023. DOI: 10.18653/v1/2023.emnlp-main.630

cls_attn, received_attn, fusion_attn and all _idf hybrid strategies are original to this project.

License

MIT

