Efficient and powerful class-specific keyword extraction.

Project description

CSKE: Class-Specific Keyword Extraction

CSKE is a high-performance Python library for iterative, class-specific keyword extraction. Unlike generic extractors, CSKE keeps its results coherent with a user-defined class and uses clustering to ensure that expanded keyword sets remain semantically anchored to that class.

This code was introduced in the KONVENS 2024 paper "An Improved Method for Class-specific Keyword Extraction: A Case Study in the German Business Registry".

Key Features

  • Iterative Expansion: Automatically discovers new keywords by walking through your dataset starting from a small seed list (defined by you!).
  • Drift Prevention: Uses clustering and filtering to weed out "semantic drift" keywords that do not fit well with your defined domain.
  • Weighted Extraction: Balances the influence of the local document context against that of the global seed keywords.
  • Hardware Accelerated: Built on top of PyTorch and Sentence Transformers with automatic support for CUDA.
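
The iterative expansion and drift prevention described above can be illustrated with a small toy sketch. This is not CSKE's implementation: the function `expand_keywords`, the `min_sim` threshold, and the hand-written two-dimensional "embeddings" are all hypothetical, and a simple similarity threshold against the seed centroid stands in for the clustering step the library uses.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def centroid(vectors):
    """Component-wise mean of a list of vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def expand_keywords(seeds, partitions, embed, number_newseed=2, min_sim=0.8):
    """Toy expansion loop (hypothetical): one pass per partition of candidates.

    Candidates are compared against the centroid of the ORIGINAL seeds, so
    later iterations cannot pull the keyword set away from the target class.
    """
    anchor = centroid([embed[s] for s in seeds])
    keywords = list(seeds)
    for candidates in partitions:
        scored = [
            (cosine(embed[c], anchor), c)
            for c in candidates
            if c not in keywords
        ]
        # Drift filter: discard anything too far from the seed centroid.
        kept = [(score, c) for score, c in scored if score >= min_sim]
        kept.sort(reverse=True)
        # Add at most `number_newseed` new keywords per iteration.
        keywords.extend(c for _, c in kept[:number_newseed])
    return keywords


# Hand-written toy vectors; a real run would use a sentence embedding model.
toy_embeddings = {
    "machine learning": [1.0, 0.0],
    "neural networks": [0.9, 0.1],
    "deep learning": [0.95, 0.05],
    "transformer": [0.8, 0.3],
    "gpu": [0.6, 0.6],
    "football": [0.0, 1.0],
}

expanded = expand_keywords(
    seeds=["machine learning", "neural networks"],
    partitions=[["deep learning", "football"], ["transformer", "gpu"]],
    embed=toy_embeddings,
)
print(expanded)
```

In this toy setup, "football" and "gpu" fall below the threshold and are dropped, while "deep learning" and "transformer" are added to the seed list.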

Installation

pip install cske

Usage

import pandas as pd
from cske import CSKE

# 1. Load your documents into a DataFrame
df = pd.DataFrame({
    "text_content": [
        "Neural networks are a subset of machine learning.",
        "The transformer architecture revolutionized NLP.",
        "Deep learning models require significant GPU resources.",
        "..."
    ]
})

# 2. Initialize the extractor, using any sentence transformer model, e.g., from Hugging Face
extractor = CSKE(embedding_model="all-MiniLM-L6-v2")

# 3. Run the pipeline
keywords = extractor.keyword_pipeline(
    starting_seed=["machine learning", "neural networks"],
    df=df,
    df_col_to_extract="text_content",
    n_iterations=3,    # number of iterations (partitions of your data)
    number_newseed=2,  # maximum number of "new" seeds per iteration
    do_filter=True     # whether to perform filtering to prevent drift
)

print(f"Expanded Keyword Set: {keywords}")

Key Parameters, and what they mean

Parameter      Default   Description
n_iterations   5         How many rounds of expansion to perform.
seed_weight    1.0       Importance given to the original seed keywords.
doc_weight     0.0       Importance given to document context.
do_filter      True      Whether or not to apply filtering.
topk           None      If set, limits the final output to the top-k keywords.
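
To give a feel for how `seed_weight`, `doc_weight`, and `topk` might interact, here is a minimal sketch. It assumes a candidate's score is a weighted sum of its mean similarity to the seeds and its similarity to the document. That formula, the helper names `weighted_score` and `rank_candidates`, and the toy vectors are assumptions for illustration, not CSKE's internals.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def weighted_score(cand_vec, doc_vec, seed_vecs, seed_weight=1.0, doc_weight=0.0):
    """Assumed scoring rule: weighted sum of seed and document similarity."""
    seed_sim = sum(cosine(cand_vec, s) for s in seed_vecs) / len(seed_vecs)
    doc_sim = cosine(cand_vec, doc_vec)
    return seed_weight * seed_sim + doc_weight * doc_sim


def rank_candidates(candidates, doc_vec, seed_vecs,
                    seed_weight=1.0, doc_weight=0.0, topk=None):
    """Rank candidate names by score; keep only the best `topk` if set."""
    ranked = sorted(
        candidates,
        key=lambda name: weighted_score(
            candidates[name], doc_vec, seed_vecs, seed_weight, doc_weight
        ),
        reverse=True,
    )
    return ranked if topk is None else ranked[:topk]


# Toy vectors: "a" matches the seed, "b" matches the document, "c" is in between.
candidates = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
seed_vecs = [[1.0, 0.0]]
doc_vec = [0.0, 1.0]

seed_only = rank_candidates(candidates, doc_vec, seed_vecs)
doc_only = rank_candidates(candidates, doc_vec, seed_vecs,
                           seed_weight=0.0, doc_weight=1.0)
best_one = rank_candidates(candidates, doc_vec, seed_vecs, topk=1)
print(seed_only, doc_only, best_one)
```

With the listed defaults (`seed_weight=1.0`, `doc_weight=0.0`), the ranking in this sketch depends only on similarity to the seeds; raising `doc_weight` shifts it toward terms that fit the individual document.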

Citation

If you find CSKE useful or use it in your work, please consider citing:

@inproceedings{meisenbacher-etal-2024-improved,
    title = "An Improved Method for Class-specific Keyword Extraction: A Case Study in the {G}erman Business Registry",
    author = "Meisenbacher, Stephen  and
      Schopf, Tim  and
      Yan, Weixin  and
      Holl, Patrick  and
      Matthes, Florian",
    editor = "Luz de Araujo, Pedro Henrique  and
      Baumann, Andreas  and
      Gromann, Dagmar  and
      Krenn, Brigitte  and
      Roth, Benjamin  and
      Wiegand, Michael",
    booktitle = "Proceedings of the 20th Conference on Natural Language Processing (KONVENS 2024)",
    month = sep,
    year = "2024",
    address = "Vienna, Austria",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.konvens-main.18/",
    pages = "159--165"
}

This work makes use of the KeyBERT library, so please also consider citing it: KeyBERT

Download files

Source Distribution

cske-0.2.0.tar.gz (10.1 kB view details)

Uploaded Source

Built Distribution

cske-0.2.0-py3-none-any.whl (12.7 kB view details)

Uploaded Python 3

File details

Details for the file cske-0.2.0.tar.gz.

File metadata

  • Download URL: cske-0.2.0.tar.gz
  • Upload date:
  • Size: 10.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.10.20

File hashes

Hashes for cske-0.2.0.tar.gz
Algorithm Hash digest
SHA256 5b946cba630578bb6d4f0e77ef62359dc88d444fed729ec9c84435b93735545d
MD5 657fdc7745dd9f40b119ed1c4586c1f0
BLAKE2b-256 32c37881ff1068dda80b9f49d5d83b83a9b7c2fe7ef322fe3184338920015970
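
A downloaded file can be checked against the published SHA256 digest with Python's standard `hashlib` module. The helper names `sha256_of_file` and `matches_published_hash` below are illustrative, and the sketch assumes the archive has already been saved locally.

```python
import hashlib


def sha256_of_file(path, chunk_size=65536):
    """Return the hex SHA256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path, expected_hex):
    """Compare a file's digest with a published hex digest."""
    return sha256_of_file(path) == expected_hex.strip().lower()


# Example usage (assumes the sdist was downloaded to the working directory):
# matches_published_hash(
#     "cske-0.2.0.tar.gz",
#     "5b946cba630578bb6d4f0e77ef62359dc88d444fed729ec9c84435b93735545d",
# )
```

The same check works for the wheel by passing its file name and its SHA256 digest.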

File details

Details for the file cske-0.2.0-py3-none-any.whl.

File metadata

  • Download URL: cske-0.2.0-py3-none-any.whl
  • Upload date:
  • Size: 12.7 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.10.20

File hashes

Hashes for cske-0.2.0-py3-none-any.whl
Algorithm Hash digest
SHA256 d761d3a60b8fef91fad9b123b3b86be5e30d3fc41c74d2edbbc96b7dba815d1a
MD5 29659b06f92ab4dca47aea2034c563db
BLAKE2b-256 b37dde0f237e554171b2da8caf432a55f0a02ad57f67cfa54fc7c1de96df37b0
