Efficient and powerful class-specific keyword extraction.
CSKE: Class-Specific Keyword Extraction
CSKE is a high-performance Python library for iterative, class-specific keyword extraction. Unlike generic extractors, CSKE keeps the extracted keywords coherent with a user-defined class and uses clustering so that expanded keyword sets stay semantically anchored to that class.
This code was introduced in the KONVENS 2024 paper "An Improved Method for Class-specific Keyword Extraction: A Case Study in the German Business Registry".
Key Features
- Iterative Expansion: Automatically discovers new keywords by walking through your dataset starting from a small seed list (defined by you!).
- Drift Prevention: Uses clustering and filtering to weed out "semantic drift" keywords that do not fit well within your defined domain.
- Weighted Extraction: Balances the influence of the local document context against that of the global seed keywords.
- Hardware Accelerated: Built on top of PyTorch and Sentence Transformers with automatic support for CUDA.
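To make the drift-prevention idea concrete, here is a minimal sketch of one way such a filter could work. It is not CSKE's implementation: the library's clustering step is replaced by a simple similarity threshold against the centroid of the seed embeddings. The function names, the `threshold` parameter, and the 2-D toy vectors are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def centroid(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def filter_drift(candidates, seed_vectors, threshold=0.5):
    """Keep candidates whose embedding lies close to the seed centroid.

    `candidates` maps keyword -> embedding. `threshold` is a hypothetical
    knob, not a CSKE parameter.
    """
    center = centroid(seed_vectors)
    return [kw for kw, vec in candidates.items() if cosine(vec, center) >= threshold]


# Toy 2-D "embeddings", made up for illustration only.
seeds = [[1.0, 0.0], [0.9, 0.1]]
candidates = {
    "deep learning": [0.8, 0.2],  # points in the same direction as the seeds
    "tax law": [0.0, 1.0],        # nearly orthogonal to the seeds
}
kept = filter_drift(candidates, seeds)
print(kept)
```

In this toy setting, the candidate that points away from the seeds falls below the threshold and is dropped, which is the behaviour the "Drift Prevention" feature describes at a high level.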
Installation
```shell
pip install cske
```
Usage
```python
import pandas as pd
from cske import CSKE

# 1. Load your documents into a DataFrame with one text column
df = pd.DataFrame({
    "text_content": [
        "Neural networks are a subset of machine learning.",
        "The transformer architecture revolutionized NLP.",
        "Deep learning models require significant GPU resources.",
        "..."
    ]
})

# 2. Initialize the extractor, using any sentence transformer model, e.g., from Hugging Face
extractor = CSKE(embedding_model="all-MiniLM-L6-v2")

# 3. Run the pipeline
keywords = extractor.keyword_pipeline(
    starting_seed=["machine learning", "neural networks"],
    df=df,
    df_col_to_extract="text_content",
    n_iterations=3,     # number of iterations (partitions of your data)
    number_newseed=2,   # maximum number of "new" seeds per iteration
    do_filter=True      # whether to perform filtering to prevent drift
)

print(f"Expanded Keyword Set: {keywords}")
```
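The comments on `n_iterations` and `number_newseed` suggest that the data is split into partitions and that each partition can contribute a limited number of new seeds. The following sketch illustrates that loop under those assumptions. It does not use CSKE: `partition`, `toy_candidates`, and `expand` are hypothetical helpers, and the "extractor" is a trivial word-overlap stand-in for the embedding-based extraction the library performs.

```python
def partition(items, n_parts):
    """Split items into n_parts consecutive chunks of near-equal size."""
    size, extra = divmod(len(items), n_parts)
    chunks, start = [], 0
    for i in range(n_parts):
        end = start + size + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


def toy_candidates(docs, seeds):
    """Stand-in extractor: non-seed words from documents that mention a seed word."""
    seed_words = {w for s in seeds for w in s.lower().split()}
    found = []
    for doc in docs:
        words = doc.lower().rstrip(".").split()
        if seed_words.intersection(words):
            for w in words:
                if w not in seed_words and w not in found:
                    found.append(w)
    return found


def expand(docs, starting_seed, n_iterations, number_newseed):
    """Grow the seed list one partition at a time."""
    seeds = list(starting_seed)
    for chunk in partition(docs, n_iterations):
        new = toy_candidates(chunk, seeds)
        # Cap how many new seeds a single iteration may add.
        seeds.extend(new[:number_newseed])
    return seeds


docs = ["Neural networks learn.", "Networks gpu training.", "Cooking pasta quickly."]
result = expand(docs, ["neural"], n_iterations=3, number_newseed=1)
print(result)
```

Because seeds found in earlier partitions are used when scanning later ones, the keyword set can "walk" through the dataset, while unrelated documents contribute nothing.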
Key Parameters, and what they mean
| Parameter | Default | Description |
|---|---|---|
| `n_iterations` | 5 | How many rounds of expansion to perform. |
| `seed_weight` | 1.0 | Importance given to the original seed keywords. |
| `doc_weight` | 0.0 | Importance given to document context. |
| `do_filter` | True | Whether or not to apply filtering. |
| `topk` | None | If set, limits the final output to the top-k keywords. |
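As a rough illustration of how `seed_weight`, `doc_weight`, and `topk` could interact, the sketch below ranks candidates by a weighted sum of two similarity scores. The linear combination is an assumption, not a description of CSKE's internal scoring, and `combined_score`, `rank`, and the similarity values are invented for the example.

```python
def combined_score(doc_sim, seed_sim, seed_weight=1.0, doc_weight=0.0):
    """Weighted sum of seed similarity and document similarity (assumed form)."""
    return seed_weight * seed_sim + doc_weight * doc_sim


def rank(candidates, seed_weight=1.0, doc_weight=0.0, topk=None):
    """Sort keywords by combined score, optionally keeping only the top-k.

    `candidates` maps keyword -> (doc_sim, seed_sim).
    """
    ordered = sorted(
        candidates,
        key=lambda kw: combined_score(*candidates[kw], seed_weight, doc_weight),
        reverse=True,
    )
    return ordered if topk is None else ordered[:topk]


# Made-up similarity scores: (similarity to document, similarity to seeds).
candidates = {
    "gpu": (0.9, 0.2),
    "deep learning": (0.4, 0.8),
    "tax law": (0.1, 0.1),
}

seed_focused = rank(candidates, topk=2)                          # table defaults
doc_focused = rank(candidates, seed_weight=0.0, doc_weight=1.0)  # document only
print(seed_focused, doc_focused)
```

With the default weights from the table, only seed similarity matters, so keywords close to the class rank first. Shifting weight toward `doc_weight` favours terms that are prominent in the individual document instead.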
Citation
If you find CSKE useful or use it in your work, please consider citing:
@inproceedings{meisenbacher-etal-2024-improved,
title = "An Improved Method for Class-specific Keyword Extraction: A Case Study in the {G}erman Business Registry",
author = "Meisenbacher, Stephen and
Schopf, Tim and
Yan, Weixin and
Holl, Patrick and
Matthes, Florian",
editor = "Luz de Araujo, Pedro Henrique and
Baumann, Andreas and
Gromann, Dagmar and
Krenn, Brigitte and
Roth, Benjamin and
Wiegand, Michael",
booktitle = "Proceedings of the 20th Conference on Natural Language Processing (KONVENS 2024)",
month = sep,
year = "2024",
address = "Vienna, Austria",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2024.konvens-main.18/",
pages = "159--165"
}
This work makes use of the KeyBERT library, so please also consider citing it: KeyBERT