
Attention-based keyword extraction with ordered semantic word weights

Project description

KeyAtten: Attention-based Keyword/Keyphrase Extraction

Attention-based keyword extraction framework. Zero training, zero labeling, single forward pass. Supports Chinese and English.

Evaluated on 7 public datasets against 14 methods: +67% F1@10 over traditional baselines on Chinese news, +78% over the strongest external method on English long documents.

Default Release Path

  • Default Chinese model: thenlper/gte-small-zh
  • Default release method: received_attn, plus _idf variants when a corpus is available
  • Default deployment path: small encoder + interpretable attention + lightweight operators

The repository still treats gte-small-zh as the default lightweight production model, but the main library now also ships decoder-only (causal) attention support. When no layer is specified for a causal model, KeyAtten automatically recommends a middle-upper layer instead of falling back to the last layer.

Features

  • Extracts keywords directly from pretrained model attention weights — no fine-tuning or labeling required
  • Attention-IDF hybrid strategy for significant gains on long documents and corpus-aware scenarios
  • Word-level semantic weight output (weight value, position index, POS tag)
  • Single-layer or multi-layer attention weighted fusion
  • Lightweight: 22M–33M parameter models, single forward pass

Installation

pip install keyatten

The minimal install depends only on numpy, so importing the package does not pull in the full ML stack by default.

pip install "keyatten[inference,zh]"   # Chinese keyword extraction
pip install "keyatten[inference,en]"   # English keyword extraction
pip install "keyatten[inference,zh,lightweight]"  # Chinese lightweight deployment
pip install "keyatten[full]"           # All optional dependencies

Optional dependency groups:

  • inference: torch>=2.0, transformers>=4.30
  • lightweight: onnx>=1.16, onnxruntime>=1.18, tokenizers>=0.15
  • zh: jieba>=0.42
  • en: scikit-learn>=1.0, nltk>=3.8

If you call extraction APIs without the required extras installed, KeyAtten now raises an error with a direct install hint instead of failing at import keyatten time.
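That behavior follows the common lazy-import pattern; a minimal sketch of the idea (the helper name and hint text are illustrative, not KeyAtten's actual internals):

```python
import importlib


def require_extra(module_name: str, extra: str):
    """Import a dependency lazily; if it is missing, raise an ImportError
    that tells the user which extra to install (illustrative pattern,
    not KeyAtten's actual code)."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"{module_name} is required for this API. "
            f'Install it with: pip install "keyatten[{extra}]"'
        ) from exc
```

Deferring the import to the first API call keeps `import keyatten` cheap while still giving an actionable message when an extra is missing.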

Quick Start

Keyword Extraction

from keyatten import KeyAttenExtractor

ext = KeyAttenExtractor(model="thenlper/gte-small-zh", language="zh")

# Pure attention
keywords = ext.extract_keywords(
    "自然语言处理是人工智能的重要方向",
    method="received_attn",
)

Attention-IDF Hybrid

# Fit IDF from a corpus first
idf = ext.fit_idf(["自然语言处理是人工智能的重要方向", "关键词提取是文本挖掘任务"])

keywords = ext.extract_keywords(
    "自然语言处理是人工智能的重要方向",
    method="fusion_attn_idf",
    idf_lookup=idf,
)

Word-Level Weights

weights = ext.extract_word_weights(
    "自然语言处理是人工智能的重要方向",
    method="received_attn",
)
for w in weights:
    print(w.word, w.weight, w.pos_tag)

Batch Extraction

results = ext.extract_keywords_batch(
    ["文本一", "文本二", "文本三"],
    method="fusion_attn",
)

External Token Input

keywords = ext.extract_keywords(
    ["空天信息", "系统", "优化"],
    pos_tags=["n", "n", "v"],
    method="received_attn",
)

Domain Dictionary

ext = KeyAttenExtractor(
    model="thenlper/gte-small-zh",
    language="zh",
    user_dict=["空天信息", "星闪技术"],
)

keywords = ext.extract_keywords(
    "空天信息系统优化方法",
    method="received_attn",
)

Token-Span Candidate Scoring

ext = KeyAttenExtractor(
    model="Qwen/Qwen3-Embedding-0.6B",
    language="zh",
    candidate_scoring="token_span",
)

# Reuse an IDF table fitted via ext.fit_idf(...), as in the hybrid example above
keywords = ext.extract_keywords(
    "水木年华被嘲讽已过气,卢庚戌回应称作品会留下来",
    method="fusion_attn_idf",
    idf_lookup=idf,
)

Convenience Function

from keyatten import extract_keywords

keywords = extract_keywords(
    "自然语言处理是人工智能的重要方向",
    model="thenlper/gte-small-zh",
)

Methods

Method | Description
------ | -----------
cls_attn | Attention weights from the [CLS] token to each token
received_attn | Total attention each token receives from all tokens
samrank | SAMRank formula (global attention + proportional redistribution)
fusion_attn | Normalized product of CLS and received attention
Each method has a corresponding _idf hybrid variant (e.g., cls_attn_idf) that multiplies attention scores with TF-IDF, suitable for corpus-aware scenarios.

The samrank formula is referenced from Kang & Shin (2023, EMNLP). The other methods (cls_attn, received_attn, fusion_attn) and all _idf hybrid strategies are original to this project.
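The idea behind received_attn and its _idf variant can be sketched in a few lines of numpy (a toy illustration of the scoring principle, not the library's internals; the attention matrix and IDF values here are made up):

```python
import numpy as np

# Toy attention matrix A for 4 tokens: A[i, j] is the attention
# that token i pays to token j (each row sums to 1).
A = np.array([
    [0.70, 0.10, 0.10, 0.10],
    [0.25, 0.25, 0.25, 0.25],
    [0.05, 0.05, 0.80, 0.10],
    [0.10, 0.20, 0.30, 0.40],
])

# received_attn: total attention each token receives = column sums.
received = A.sum(axis=0)   # [1.10, 0.60, 1.45, 0.85]

# _idf hybrid: multiply each token's attention score by its IDF,
# so frequent-but-uninformative tokens are pushed down.
idf = np.array([0.5, 3.0, 1.0, 1.0])
hybrid = received * idf    # [0.55, 1.80, 1.45, 0.85]
```

Note how the IDF reweighting changes the top-ranked token: pure received attention favors token 2, while the hybrid promotes the rarer token 1.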

Choosing a Method

received_attn is now the safest default starting point. When a corpus is available, _idf variants should be tried first; in the latest Chinese decoder-only rollup, received_attn_idf is the main CSL path and fusion_attn_idf is the main ShenCeCup path. cls_attn is still useful for high-distinctiveness tag-cloud style outputs, but it is no longer the default keyword-extraction method.

If your main metric is F1@5, the library now also exposes an optional nested-phrase de-dup post-ranking step. It only activates when top_k <= 5, filters substring/superstring duplicates such as natural language processing / natural language / language processing, and stays off by default so the @10 path is unchanged.
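A minimal sketch of that substring/superstring filter (the function is hypothetical; the library's actual post-ranking step may differ):

```python
def dedup_nested(ranked: list[str], top_k: int = 5) -> list[str]:
    """Keep a candidate only if it is neither a substring nor a
    superstring of a higher-ranked kept phrase (hypothetical helper,
    not KeyAtten's internal implementation)."""
    if top_k > 5:
        return ranked[:top_k]  # the de-dup step only activates for top_k <= 5
    kept = []
    for cand in ranked:
        if any(cand in k or k in cand for k in kept):
            continue  # nested duplicate of a higher-ranked phrase
        kept.append(cand)
        if len(kept) == top_k:
            break
    return kept


ranked = ["natural language processing", "natural language",
          "language processing", "keyword extraction", "attention"]
```

With `top_k=3`, the two nested fragments of "natural language processing" are dropped and lower-ranked distinct phrases move up; with `top_k=10` the list is returned unchanged.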

For raw string input, the library now also exposes an optional candidate_scoring="token_span" route. Candidate generation still follows the segmenter and POS filter, but ranking aggregates token attention directly over each candidate's character span, bypassing the previous word-level mean-of-means path.
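The character-span aggregation can be sketched like this (toy offsets and scores; the library's exact aggregation over spans is not shown here):

```python
import numpy as np

# Toy token-to-character alignment for a raw string: each token covers
# a (start, end) character span and carries one attention score.
token_spans = [(0, 2), (2, 4), (4, 6), (6, 8)]
token_scores = np.array([0.1, 0.4, 0.3, 0.2])


def span_score(cand_start: int, cand_end: int) -> float:
    """Sum token attention over every token overlapping the candidate's
    character span (a sketch of the token_span idea, not the library's
    exact aggregation)."""
    idx = [i for i, (s, e) in enumerate(token_spans)
           if s < cand_end and e > cand_start]
    return float(token_scores[idx].sum())
```

A candidate covering characters 2..6 overlaps tokens 1 and 2 and is scored directly from their attention, regardless of how the segmenter split the surrounding words.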

Practical Examples

Side-by-side comparison of cls_attn vs samrank across domains (model: gte-small-zh, top_k=6):

Domain | Input (excerpt) | cls_attn | samrank
------ | --------------- | -------- | -------
Tech | OpenAI released GPT-4o with multimodal input... | OpenAI, GPT, model | OpenAI, model, GPT
Medical | mRNA vaccine encodes spike protein... Omicron variant... | mRNA, mRNA vaccine, COVID, Omicron variant | mRNA, mRNA vaccine, COVID, COVID virus
Finance | Fed announces 25bp rate hike... | rate hike, basis points, global stocks, rate | rate hike, basis points, rate, global stocks
Sports | Messi scores hat-trick in World Cup final... lifts trophy | Messi, trophy, hat-trick, final | trophy, Messi, hat-trick, penalty
History | Qin Shi Huang unified six states... centralized dynasty | centralization, feudal dynasty, standardization | centralization, standardization, feudal dynasty
Daily | Meet at Starbucks at 3pm... business trip to Beijing | Starbucks, Beijing, business trip | meet, Beijing, chat

cls_attn favors the most distinctive entities (Messi, Starbucks, Omicron), ideal for tag clouds and summary displays. samrank provides broader coverage, better suited for retrieval and evaluation scenarios.

Recommended Models

Language | Model | Parameters
-------- | ----- | ----------
Chinese | thenlper/gte-small-zh | ~33M
English | sentence-transformers/all-MiniLM-L6-v2 | ~22M

Decoder-Only Support

The main library now includes stable decoder-only support:

  • automatic causal model detection
  • default Chinese causal prefix 核心关键词、关键实体、主题: ("core keywords, key entities, topics:")
  • automatic middle-upper layer recommendation when layer_index is omitted
  • current recommended Chinese decoder-only combination: Qwen/Qwen3-Embedding-0.6B + fusion_attn_idf


Lightweight Deployment

The recommended lightweight deployment path is gte-small-zh + ONNX Runtime. Internal validation shows that gte-small-zh can export token attention and reproduce received_attn word scores with stable numerical agreement, making it the default route for lightweight operators and deployment work.

Recommended install:

pip install "keyatten[zh,lightweight]"

Lightweight backend example:

from keyatten import KeyAttenExtractor

ext = KeyAttenExtractor(
    model="/path/to/thenlper__gte-small-zh",
    language="zh",
    backend="onnx",
    onnx_path="/path/to/attention_last.onnx",
)

keywords = ext.extract_keywords(
    "自然语言处理用于关键词提取与文本分析",
    method="received_attn",
)

Notes:

  • model should point to a local gte-small-zh directory so KeyAtten can read tokenizer.json
  • onnx_path should point to the exported attention ONNX file
  • the lightweight backend currently supports a single exported attention layer, which matches the default gte-small-zh release path
  • if you want to export the ONNX file yourself, install keyatten[inference,zh,lightweight] instead


Evaluation Summary

Compared against 14 methods in total, including TF-IDF, TextRank, and KeyBERT, on 7 public datasets (F1@10):

Scenario | KeyAtten Best | vs Strongest Traditional | vs Strongest External
-------- | ------------- | ------------------------ | ----------------------
Chinese News (ShenCeCup) | 0.2579 | +67% | – 
Chinese Academic (CSL) | 0.2106 | +9% | – 
English Long-doc (SemEval2010-fulltext) | 0.1344 | – | +78%
English Long-doc (Krapivin2009-fulltext) | 0.1268 | – | +79%
English Short-doc (3 datasets) | 0.1370 | – | On par

Full evaluation report: EVALUATION-PUBLIC.md

API

KeyAttenExtractor

KeyAttenExtractor(
    model: str,                         # Hugging Face model name
    language: str = "zh",               # "zh" or "en"
    device: str = "cpu",                # compute device
    backend: str = "auto",              # "auto" / "torch" / "onnx"
    onnx_path: str | None = None,       # ONNX attention file path
    user_dict: str | list[str] | dict | None = None,  # domain dictionary path / term list / term config
    layer_index: int | None = None,     # None = auto; causal models default to a middle-upper layer, -1 = explicit last layer
    layer_indices: list[int] | None = None,    # multi-layer indices
    layer_weights: list[float] | None = None,  # multi-layer weights
    attn_merge: bool = False,           # attention-guided char merging for Chinese
    merge_threshold: float = 0.3,       # merge threshold (0.0–1.0)
    instruction_prefix: str | None = None,  # optional prefix for causal models
    is_causal_override: bool | None = None,  # None=auto detect; False=force encoder-style readout; True=force decoder-style readout
    dedup_nested_for_topk5: bool = False,    # enable substring de-dup post-processing only when top_k<=5
    candidate_scoring: str = "word",   # "word" / "token_span"
)
Method | Returns
------ | -------
extract_keywords(text, method, top_k, idf_lookup) | list[str]
extract_keywords_batch(texts, method, top_k, idf_lookup) | list[list[str]]
extract_word_weights(text, method) | list[WordWeight]
fit_idf(texts) | dict[str, float]

WordWeight fields: word, index, weight, pos_tag.

Notes:

  • extract_keywords and extract_word_weights also accept pre-tokenized list[str]
  • when external tokens are provided, pos_tags is optional; Chinese defaults to n, English defaults to eng
  • user_dict accepts a dictionary file path, a term list, or mappings like {term: tag} / {term: (freq, tag)}
  • extract_keywords() and extract_keywords_batch() now default to received_attn
  • if layer_index is omitted for a causal model, KeyAtten automatically uses the recommended middle-upper layer
  • is_causal_override only overrides the attention readout mode; it does not change the underlying model architecture
  • when dedup_nested_for_topk5=True, substring/superstring de-dup is applied only for top_k<=5, not for @10
  • candidate_scoring="token_span" only applies to raw string input; external token input stays on the word-based ranking path
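fit_idf returns a word-to-IDF mapping; one common smoothed formulation looks like this (an illustrative assumption — KeyAtten's exact formula is not documented here):

```python
import math
from collections import Counter


def fit_idf_sketch(tokenized_docs: list[list[str]]) -> dict[str, float]:
    """Smoothed IDF over pre-tokenized documents:
    idf(w) = log((1 + N) / (1 + df(w))) + 1
    (scikit-learn-style smoothing; an illustrative assumption,
    not necessarily KeyAtten's exact formula)."""
    n_docs = len(tokenized_docs)
    df = Counter()
    for doc in tokenized_docs:
        df.update(set(doc))  # document frequency: count each word once per doc
    return {w: math.log((1 + n_docs) / (1 + d)) + 1 for w, d in df.items()}


idf = fit_idf_sketch([["nlp", "keyword"], ["nlp", "mining"]])
```

A word that appears in every document ("nlp") gets the minimum weight of 1.0, while rarer words score higher — exactly the property the _idf hybrid methods exploit.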

Citation

The samrank method in this project references the ranking formula from:

Kang, B., & Shin, H. (2023). SAMRank: Unsupervised Keyphrase Extraction using Self-Attention Map in BERT and GPT-2. EMNLP 2023. DOI: 10.18653/v1/2023.emnlp-main.630

cls_attn, received_attn, fusion_attn and all _idf hybrid strategies are original to this project.

License

MIT



Download files

Download the file for your platform.

Source Distribution

keyatten-0.2.0.tar.gz (31.9 kB)

Uploaded Source

Built Distribution


keyatten-0.2.0-py3-none-any.whl (28.0 kB)

Uploaded Python 3

File details

Details for the file keyatten-0.2.0.tar.gz.

File metadata

  • Download URL: keyatten-0.2.0.tar.gz
  • Size: 31.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.3

File hashes

Hashes for keyatten-0.2.0.tar.gz
Algorithm | Hash digest
--------- | -----------
SHA256 | 056d2a7a467cc819f4f44e4d7c345791b2facc457db814ea75644ec2c4e00fb4
MD5 | 5c4211a1fa9d015306c8be716565afaa
BLAKE2b-256 | 8be78a9174b346ca069f3e402ee418ee88c7cab374072452ec32e6143edae493


File details

Details for the file keyatten-0.2.0-py3-none-any.whl.

File metadata

  • Download URL: keyatten-0.2.0-py3-none-any.whl
  • Size: 28.0 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.3

File hashes

Hashes for keyatten-0.2.0-py3-none-any.whl
Algorithm | Hash digest
--------- | -----------
SHA256 | 2fe539b6d475e4be7314333077e3a1702585b296110375ac4cc193b13c8811ba
MD5 | feb64c6a4ab6263227bd2ab530387bf0
BLAKE2b-256 | 0821b68cc716cbfd9da58efdce8b85b71f69d84a97e5ee2e05df9d1053c84722

