
Attention-based keyword extraction with ordered semantic word weights


KeyAtten

English | 中文

Attention-based keyword extraction framework. Zero training, zero labeling, single forward pass. Supports Chinese and English.

Evaluated on 7 public datasets against 14 methods: +67% F1@10 over traditional baselines on Chinese news, +78% over the strongest external method on English long documents.

Features

  • Extracts keywords directly from pretrained model attention weights — no fine-tuning or labeling required
  • Attention-IDF hybrid strategy for significant gains on long documents and corpus-aware scenarios
  • Word-level semantic weight output (weight value, position index, POS tag)
  • Single-layer or multi-layer attention weighted fusion
  • Lightweight: 22M–33M parameter models, single forward pass

Installation

pip install .

Dependencies: torch>=2.0, transformers>=4.30, jieba, scikit-learn, nltk, numpy

Quick Start

Keyword Extraction

from keyatten import KeyAttenExtractor

ext = KeyAttenExtractor(model="thenlper/gte-small-zh", language="zh")

# Pure attention
keywords = ext.extract_keywords(
    "自然语言处理是人工智能的重要方向",
    method="cls_attn",
)

Attention-IDF Hybrid

# Fit IDF from a corpus first
idf = ext.fit_idf(["自然语言处理是人工智能的重要方向", "关键词提取是文本挖掘任务"])

keywords = ext.extract_keywords(
    "自然语言处理是人工智能的重要方向",
    method="samrank_idf",
    idf_lookup=idf,
)
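
Because fit_idf returns a plain dict[str, float] (see the API section), the IDF table can be persisted and reused across runs instead of being refit every time. A minimal sketch, assuming you fit once on your own corpus; the file name and corpus variable are illustrative:

import json

# Fit once on a representative corpus (the two-sentence list above is only a demo).
idf = ext.fit_idf(corpus_texts)  # corpus_texts: your own list of documents

# Persist the plain dict so later runs can skip refitting.
with open("idf_zh.json", "w", encoding="utf-8") as f:
    json.dump(idf, f, ensure_ascii=False)

# Reload and pass to any *_idf method via idf_lookup.
with open("idf_zh.json", encoding="utf-8") as f:
    idf = json.load(f)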

Word-Level Weights

weights = ext.extract_word_weights(
    "自然语言处理是人工智能的重要方向",
    method="received_attn",
)
for w in weights:
    print(w.word, w.weight, w.pos_tag)
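
The returned objects can be post-processed directly; for example, a short sketch (using the WordWeight fields listed in the API section) that keeps the five highest-weighted words:

top5 = sorted(weights, key=lambda w: w.weight, reverse=True)[:5]
print([w.word for w in top5])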

Batch Extraction

results = ext.extract_keywords_batch(
    ["文本一", "文本二", "文本三"],
    method="fusion_attn",
)

Convenience Function

from keyatten import extract_keywords

keywords = extract_keywords(
    "自然语言处理是人工智能的重要方向",
    model="thenlper/gte-small-zh",
)

Methods

Method        | Description
cls_attn      | Attention weights from the [CLS] token to each token
received_attn | Total attention each token receives from all tokens
samrank       | SAMRank formula (global attention + proportional redistribution)
fusion_attn   | Normalized product of CLS and received attention
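
For intuition, these scores can all be read off a single attention map. A minimal NumPy sketch, not the library's internal code, assuming attn is a head-averaged (seq_len, seq_len) matrix where row i holds the attention token i pays to every token and position 0 is [CLS]:

import numpy as np

rng = np.random.default_rng(0)
attn = rng.random((8, 8))
attn = attn / attn.sum(axis=1, keepdims=True)  # rows sum to 1, like softmax output

cls_attn = attn[0]                # attention from [CLS] to each token
received_attn = attn.sum(axis=0)  # total attention each token receives from all tokens

# fusion_attn: normalized product of the two views (one plausible normalization)
fusion = (cls_attn / cls_attn.sum()) * (received_attn / received_attn.sum())
fusion_attn = fusion / fusion.sum()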

Each method has a corresponding _idf hybrid variant (e.g., cls_attn_idf) that multiplies attention scores with TF-IDF, suitable for corpus-aware scenarios.
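
Conceptually, the hybrid score for a word is just its attention score multiplied by its (TF-)IDF weight. A rough sketch, assuming a per-word attention score dict and an idf_lookup like the one returned by fit_idf; how KeyAtten handles words missing from the IDF table may differ:

attn_scores = {"自然语言处理": 0.42, "人工智能": 0.31, "方向": 0.12}
idf_lookup = {"自然语言处理": 2.1, "人工智能": 1.7}

# Words missing from the IDF table fall back to 1.0 in this sketch.
hybrid = {w: s * idf_lookup.get(w, 1.0) for w, s in attn_scores.items()}
ranked = sorted(hybrid, key=hybrid.get, reverse=True)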

The samrank formula follows Kang & Shin (2023, EMNLP). The other methods (cls_attn, received_attn, fusion_attn) and all _idf hybrid strategies are original to this project.

Choosing a Method

samrank achieves the highest benchmark scores (F1@10) due to broader coverage and stronger recall. In practice, cls_attn is often more useful — it extracts the most distinctive core terms, making it ideal for tag clouds and summaries.

Practical Examples

Side-by-side comparison of cls_attn vs samrank across domains (model: gte-small-zh, top_k=6):

Domain  | Input (excerpt)                                           | cls_attn                                    | samrank
Tech    | OpenAI released GPT-4o with multimodal input...           | OpenAI, GPT, model                          | OpenAI, model, GPT
Medical | mRNA vaccine encodes spike protein... Omicron variant...  | mRNA, mRNA vaccine, COVID, Omicron variant  | mRNA, mRNA vaccine, COVID, COVID virus
Finance | Fed announces 25bp rate hike...                           | rate hike, basis points, global stocks, rate | rate hike, basis points, rate, global stocks
Sports  | Messi scores hat-trick in World Cup final... lifts trophy | Messi, trophy, hat-trick, final             | trophy, Messi, hat-trick, penalty
History | Qin Shi Huang unified six states... centralized dynasty   | centralization, feudal dynasty, standardization | centralization, standardization, feudal dynasty
Daily   | Meet at Starbucks at 3pm... business trip to Beijing      | Starbucks, Beijing, business trip           | meet, Beijing, chat

cls_attn favors the most distinctive entities (Messi, Starbucks, Omicron), ideal for tag clouds and summary displays. samrank provides broader coverage, better suited for retrieval and evaluation scenarios.
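
To run this kind of comparison on your own text, the two methods can be called side by side; a minimal sketch with an illustrative sentence:

text = "梅西在世界杯决赛上演帽子戏法并捧起奖杯"  # illustrative input
for method in ("cls_attn", "samrank"):
    print(method, ext.extract_keywords(text, method=method, top_k=6))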

Recommended Models

Language | Model                                  | Parameters
Chinese  | thenlper/gte-small-zh                  | ~33M
English  | sentence-transformers/all-MiniLM-L6-v2 | ~22M
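
For English documents the same interface applies; a short sketch with the recommended English model:

from keyatten import KeyAttenExtractor

en_ext = KeyAttenExtractor(
    model="sentence-transformers/all-MiniLM-L6-v2",
    language="en",
)
keywords = en_ext.extract_keywords(
    "Attention-based keyword extraction works without any fine-tuning.",
    method="cls_attn",
)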

Evaluation Summary

Compared against 14 methods in total, including TF-IDF, TextRank, and KeyBERT, on 7 public datasets (F1@10):

Scenario                                 | KeyAtten Best (F1@10) | Improvement
Chinese News (ShenCeCup)                 | 0.2579                | +67% vs strongest traditional
Chinese Academic (CSL)                   | 0.2106                | +9%
English Long-doc (SemEval2010-fulltext)  | 0.1344                | +78% vs strongest external
English Long-doc (Krapivin2009-fulltext) | 0.1268                | +79%
English Short-doc (3 datasets)           | 0.1370                | On par

Full evaluation report: EVALUATION-PUBLIC.md

API

KeyAttenExtractor

KeyAttenExtractor(
    model: str,                         # Hugging Face model name
    language: str = "zh",               # "zh" or "en"
    device: str = "cpu",                # compute device
    layer_index: int = -1,              # single layer index (-1 = last layer)
    layer_indices: list[int] = None,    # multi-layer indices
    layer_weights: list[float] = None,  # multi-layer weights
    attn_merge: bool = False,           # attention-guided char merging for Chinese
    merge_threshold: float = 0.3,       # merge threshold (0.0–1.0)
)
Method                                                   | Returns
extract_keywords(text, method, top_k, idf_lookup)        | list[str]
extract_keywords_batch(texts, method, top_k, idf_lookup) | list[list[str]]
extract_word_weights(text, method)                       | list[WordWeight]
fit_idf(texts)                                           | dict[str, float]

WordWeight fields: word, index, weight, pos_tag.
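
As a sketch of the multi-layer fusion and Chinese char-merging options above (the layer indices and weights here are illustrative choices, not recommended defaults):

ext = KeyAttenExtractor(
    model="thenlper/gte-small-zh",
    language="zh",
    layer_indices=[-1, -2],     # fuse the last two layers...
    layer_weights=[0.6, 0.4],   # ...with illustrative weights
    attn_merge=True,            # attention-guided char merging for Chinese
    merge_threshold=0.3,
)
for w in ext.extract_word_weights("自然语言处理是人工智能的重要方向", method="fusion_attn"):
    print(w.word, w.index, w.weight, w.pos_tag)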

Citation

The samrank method in this project references the ranking formula from:

Kang, B., & Shin, H. (2023). SAMRank: Unsupervised Keyphrase Extraction using Self-Attention Map in BERT and GPT-2. EMNLP 2023. DOI: 10.18653/v1/2023.emnlp-main.630

cls_attn, received_attn, fusion_attn and all _idf hybrid strategies are original to this project.

License

MIT

