Efficient and powerful class-specific keyword extraction.

CSKE: Class-Specific Keyword Extraction

CSKE is a high-performance Python library for iterative, class-specific keyword extraction. Unlike generic extractors, CSKE keeps its output coherent with a user-defined class and uses clustering to ensure that expanded keyword sets remain semantically anchored to that class.

This code was introduced in the KONVENS 2024 paper "An Improved Method for Class-specific Keyword Extraction: A Case Study in the German Business Registry".

Key Features

  • Iterative Expansion: Automatically discovers new keywords by walking through your dataset starting from a small seed list (defined by you!).
  • Drift Prevention: Uses clustering and filtering to weed out "semantic drift" keywords that do not fit well with your defined domain.
  • Weighted Extraction: Balance the influence between the local document context and the global seed keywords.
  • Hardware Accelerated: Built on top of PyTorch and Sentence Transformers with automatic support for CUDA.
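
To make the Weighted Extraction idea concrete, here is a minimal sketch of how a candidate keyword could be scored against both the seed keywords and the document it came from. This is an illustration under stated assumptions, not CSKE's actual implementation: the helper names `cosine` and `weighted_score`, the use of a single seed centroid, and the linear combination of the two similarities are all hypothetical. Only the parameter names `seed_weight` and `doc_weight` and their defaults come from this README.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def weighted_score(candidate, seed_centroid, doc_vector,
                   seed_weight=1.0, doc_weight=0.0):
    """Hypothetical scoring rule: a weighted sum of the candidate's similarity
    to the seed keywords and its similarity to the source document."""
    return (seed_weight * cosine(candidate, seed_centroid)
            + doc_weight * cosine(candidate, doc_vector))


# Toy 2-D "embeddings" standing in for real sentence-transformer vectors.
seed_centroid = [1.0, 0.0]   # direction of the target class
doc_vector = [0.0, 1.0]      # direction of one document
on_topic = [1.0, 0.0]        # candidate aligned with the class
off_topic = [0.0, 1.0]       # candidate aligned only with the document

print(weighted_score(on_topic, seed_centroid, doc_vector))
print(weighted_score(off_topic, seed_centroid, doc_vector))
print(weighted_score(off_topic, seed_centroid, doc_vector,
                     seed_weight=0.0, doc_weight=1.0))
```

With the defaults listed in this README (`seed_weight=1.0`, `doc_weight=0.0`), a rule of this shape scores candidates purely by closeness to the seeds. Raising `doc_weight` lets keywords that are salient in a document score well even when they sit further from the class.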

Installation

pip install cske

Usage

import pandas as pd
from cske import CSKE

# 1. Load your documents into a DataFrame
df = pd.DataFrame({
    "text_content": [
        "Neural networks are a subset of machine learning.",
        "The transformer architecture revolutionized NLP.",
        "Deep learning models require significant GPU resources.",
        "..."
    ]
})

# 2. Initialize the extractor with any Sentence Transformers model, e.g., one from Hugging Face
extractor = CSKE(embedding_model="all-MiniLM-L6-v2")

# 3. Run the pipeline
keywords = extractor.keyword_pipeline(
    starting_seed=["machine learning", "neural networks"],
    df=df,
    df_col_to_extract="text_content",
    n_iterations=3, # number of iterations (partitions of your data)
    number_newseed=2, # maximum number of "new" seeds per iteration
    do_filter=True  # whether to perform filtering to prevent drift
)

print(f"Expanded Keyword Set: {keywords}")
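
The comments above describe `n_iterations` as the number of partitions of your data and `number_newseed` as a cap on new seeds per iteration. The sketch below shows one way such a loop could be organized. It is not CSKE's code: `partition`, `toy_score`, and `expand_seeds` are hypothetical helpers, candidates are naive whitespace tokens, and the character-overlap score is a stand-in for embedding similarity so that the example runs without any model.

```python
def partition(items, n_parts):
    """Split items into n_parts contiguous chunks of near-equal size."""
    size, extra = divmod(len(items), n_parts)
    chunks, start = [], 0
    for i in range(n_parts):
        end = start + size + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


def toy_score(word, seeds):
    """Stand-in for embedding similarity: best Jaccard overlap of character sets."""
    chars = set(word)
    return max(len(chars & set(s)) / len(chars | set(s)) for s in seeds)


def expand_seeds(docs, starting_seed, n_iterations, number_newseed):
    """Hypothetical expansion loop: one data partition per iteration,
    at most number_newseed new seeds added each time."""
    seeds = list(starting_seed)
    for chunk in partition(docs, n_iterations):
        # Candidates are the chunk's tokens that are not already seeds.
        candidates = {w for doc in chunk for w in doc.lower().split()} - set(seeds)
        # Rank by score (descending), breaking ties alphabetically.
        ranked = sorted(candidates, key=lambda w: (-toy_score(w, seeds), w))
        seeds.extend(ranked[:number_newseed])
    return seeds


docs = ["neural nets learn", "learning machines", "gpu clusters"]
expanded = expand_seeds(docs, ["learning"], n_iterations=3, number_newseed=1)
print(expanded)
```

Because each iteration sees only one partition and scores against the seeds accumulated so far, earlier picks influence later ones. That is why a filtering step such as `do_filter` matters for keeping the set on topic.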

Key Parameters, and what they mean

Parameter      Default   Description
n_iterations   5         How many rounds of expansion to perform.
seed_weight    1.0       Importance given to the original seed keywords.
doc_weight     0.0       Importance given to document context.
do_filter      True      Whether or not to apply filtering.
topk           None      If set, limits the final output to the top-k keywords.
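
As an illustration of the `topk` behavior described above, the following sketch truncates a scored keyword set to its k best entries. It assumes keywords carry a numeric relevance score and are ranked by it, which this README does not confirm. `limit_topk` and the example scores are hypothetical and not part of the CSKE API.

```python
def limit_topk(scored, topk=None):
    """Return keywords ordered by descending score; keep only the first
    topk if topk is set, otherwise return them all."""
    ranked = sorted(scored, key=lambda kw: (-scored[kw], kw))
    return ranked if topk is None else ranked[:topk]


# Made-up scores, for illustration only.
scored = {"deep learning": 0.9, "gpu": 0.1, "transformer": 0.5}

print(limit_topk(scored))          # topk=None keeps every keyword
print(limit_topk(scored, topk=2))  # only the two best-scoring keywords
```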

Citation

If you find CSKE useful or use it in your work, please consider citing:

@inproceedings{meisenbacher-etal-2024-improved,
    title = "An Improved Method for Class-specific Keyword Extraction: A Case Study in the {G}erman Business Registry",
    author = "Meisenbacher, Stephen  and
      Schopf, Tim  and
      Yan, Weixin  and
      Holl, Patrick  and
      Matthes, Florian",
    editor = "Luz de Araujo, Pedro Henrique  and
      Baumann, Andreas  and
      Gromann, Dagmar  and
      Krenn, Brigitte  and
      Roth, Benjamin  and
      Wiegand, Michael",
    booktitle = "Proceedings of the 20th Conference on Natural Language Processing (KONVENS 2024)",
    month = sep,
    year = "2024",
    address = "Vienna, Austria",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.konvens-main.18/",
    pages = "159--165"
}

