Light/easy keyword extraction.

Development Status
- 4 - Beta
Intended Audience
- Developers
- Science/Research
License
- OSI Approved :: MIT License
Programming Language
- Python :: 3
Topic
- Scientific/Engineering

Project description

Kex

Kex is a Python library for unsupervised keyword extraction:

Easy interface for keyword extraction with a variety of algorithms
Quick benchmarking over 15 English public datasets
Custom keyword extractor implementation support

Get Started

Install via pip

pip install kex

Extract Keywords with `kex`

The built-in algorithms in kex are listed below:

FirstN: heuristic baseline to pick up first n phrases as keywords
TF: scoring by term frequency
TFIDF: scoring by TFIDF
LexSpec: scoring by lexical specificity
TextRank: Mihalcea et al., 04
SingleRank: Wan et al., 08
TopicalPageRank: Liu et al., 10
SingleTPR: Sterckx et al., 15
TopicRank: Bougouin et al., 13
PositionRank: Florescu et al., 18
TFIDFRank: SingleRank + TFIDF based population term
LexRank: SingleRank + lexical specificity based population term
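
The graph-based rankers in this list (TextRank, SingleRank, and their variants) share one core idea: run PageRank over a word co-occurrence graph. The following is a minimal sketch of that idea in plain Python, not the implementation used by kex. It skips stemming, part-of-speech filtering, and phrase construction, and the function names `cooccurrence_graph` and `pagerank` are made up for illustration. Edge weights are co-occurrence counts, as in SingleRank.

```python
from collections import defaultdict


def cooccurrence_graph(tokens, window=2):
    """Build an undirected graph linking each token to the tokens that follow it.

    `window` counts the token itself, so window=2 links adjacent tokens only.
    Edge weights are co-occurrence counts.
    """
    graph = defaultdict(lambda: defaultdict(float))
    for i, word in enumerate(tokens):
        for other in tokens[i + 1:i + window]:
            if other != word:
                graph[word][other] += 1.0
                graph[other][word] += 1.0
    return graph


def pagerank(graph, damping=0.85, n_iter=50):
    """Weighted PageRank by power iteration over a symmetric graph."""
    nodes = list(graph)
    score = {node: 1.0 / len(nodes) for node in nodes}
    for _ in range(n_iter):
        new_score = {}
        for node in nodes:
            # Each neighbour passes on its score in proportion to the edge weight.
            incoming = sum(
                score[nb] * weight / sum(graph[nb].values())
                for nb, weight in graph[node].items()
            )
            new_score[node] = (1 - damping) / len(nodes) + damping * incoming
        score = new_score
    return score


# Toy token sequence (hypothetical, already lower-cased and filtered).
tokens = ['graph', 'word', 'graph', 'score', 'graph', 'rank']
scores = pagerank(cooccurrence_graph(tokens))
for word, value in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f'{value:.3f}  {word}')
```

In a full extractor, word scores like these would then be combined into phrase scores, for instance by summing over the words of each candidate phrase.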

Basic usage:

>>> import kex
>>> model = kex.SingleRank()  # any algorithm listed above
>>> sample = '''
We propose a novel unsupervised keyphrase extraction approach that filters candidate keywords using outlier detection.
It starts by training word embeddings on the target document to capture semantic regularities among the words. It then
uses the minimum covariance determinant estimator to model the distribution of non-keyphrase word vectors, under the
assumption that these vectors come from the same distribution, indicative of their irrelevance to the semantics
expressed by the dimensions of the learned vector representation. Candidate keyphrases only consist of words that are
detected as outliers of this dominant distribution. Empirical results show that our approach outperforms state
of-the-art and recent unsupervised keyphrase extraction methods.
'''
>>> model.get_keywords(sample, n_keywords=2)
[{'stemmed': 'non-keyphras word vector',
  'pos': 'ADJ NOUN NOUN',
  'raw': ['non-keyphrase word vectors'],
  'offset': [[47, 49]],
  'count': 1,
  'score': 0.06874471825637762,
  'n_source_tokens': 112},
 {'stemmed': 'semant regular word',
  'pos': 'ADJ NOUN NOUN',
  'raw': ['semantic regularities words'],
  'offset': [[28, 32]],
  'count': 1,
  'score': 0.06001468574146248,
  'n_source_tokens': 112}]
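
Each result is a plain dictionary, so it can be post-processed with ordinary Python. As a small sketch, the snippet below turns the output above into (surface form, score) pairs. The entries are copied from the example output and trimmed to the fields used here; the helper name `to_ranked_phrases` is hypothetical and not part of kex.

```python
# Entries copied from the example output above, trimmed to the fields used here.
results = [
    {'stemmed': 'non-keyphras word vector',
     'raw': ['non-keyphrase word vectors'],
     'score': 0.06874471825637762},
    {'stemmed': 'semant regular word',
     'raw': ['semantic regularities words'],
     'score': 0.06001468574146248},
]


def to_ranked_phrases(results):
    """Return (surface form, score) pairs, highest score first."""
    ranked = sorted(results, key=lambda r: r['score'], reverse=True)
    # 'raw' is a list of surface forms; the first one is used for display.
    return [(r['raw'][0], r['score']) for r in ranked]


ranked_phrases = to_ranked_phrases(results)
for phrase, score in ranked_phrases:
    print(f'{score:.4f}  {phrase}')
```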

Compute a prior

Algorithms such as TF, TFIDF, TFIDFRank, LexSpec, LexRank, TopicalPageRank, and SingleTPR need to compute a prior distribution beforehand:

>>> import kex
>>> model = kex.SingleTPR()
>>> test_sentences = ['documentA', 'documentB', 'documentC']
>>> model.train(test_sentences, export_directory='./tmp')

Priors are cached and can be loaded on the fly:

>>> import kex
>>> model = kex.SingleTPR()
>>> model.load('./tmp')
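
The two snippets above can be combined into a load-or-train helper. This is a sketch rather than part of kex: the name `load_or_train` is hypothetical, and it assumes that `train` writes at least one file into `export_directory`, so that a non-empty directory can be treated as an existing cache.

```python
import os


def load_or_train(model, documents, cache_dir):
    """Load a cached prior from cache_dir if present, otherwise train and cache it.

    `model` is any object exposing train(documents, export_directory=...) and
    load(directory), as in the snippets above.
    """
    # Assumption: a non-empty directory means a prior was exported there earlier.
    if os.path.isdir(cache_dir) and os.listdir(cache_dir):
        model.load(cache_dir)
    else:
        model.train(documents, export_directory=cache_dir)
    return model
```

With kex installed, it would be called as `load_or_train(kex.SingleTPR(), ['documentA', 'documentB', 'documentC'], './tmp')`.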

Supported language

Currently the algorithms are available only for English, but we plan to relax this constraint soon so that other languages can be supported.

Benchmark on 15 Public Datasets

Users can fetch 15 public keyword extraction datasets via kex.get_benchmark_dataset.

>>> import kex
>>> json_line, language = kex.get_benchmark_dataset('Inspec', keep_only_valid_label=False)
>>> json_line[0]
{
    'keywords': ['kind infer', 'type check', 'overload', 'nonstrict pure function program languag', ...],
    'source': 'A static semantics for Haskell\nThis paper gives a static semantics for Haskell 98, a non-strict ...',
    'id': '1053.txt'
}
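
The gold keywords in each entry are stemmed, so they can be compared directly with the 'stemmed' field of extracted keywords. Below is a minimal sketch of a top-k precision/recall/F1 computation under that assumption. It is not the benchmark script: the function name and the predicted phrases are hypothetical, and only the four gold labels visible above are used.

```python
def precision_recall_f1(predicted, gold, k=5):
    """Score the top-k predicted stemmed phrases against gold stemmed keywords."""
    top_k = predicted[:k]
    gold_set = set(gold)
    n_correct = sum(1 for phrase in top_k if phrase in gold_set)
    precision = n_correct / len(top_k) if top_k else 0.0
    recall = n_correct / len(gold_set) if gold_set else 0.0
    f1 = 2 * precision * recall / (precision + recall) if n_correct else 0.0
    return precision, recall, f1


# Gold labels visible in the Inspec entry above (the elided ones are left out).
gold = ['kind infer', 'type check', 'overload', 'nonstrict pure function program languag']
# Hypothetical predictions, e.g. the 'stemmed' field of each get_keywords result.
predicted = ['type check', 'static semant', 'overload', 'haskel', 'paper']

precision, recall, f1 = precision_recall_f1(predicted, gold, k=5)
print(f'P@5={precision:.2f} R@5={recall:.2f} F1@5={f1:.2f}')
```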

High-level statistics for each dataset can be found here, and the benchmark results are below:

Prior distributions are computed within each dataset, and complexity is averaged over 100 trials on the Inspec dataset. To reproduce the above benchmark results, please take a look at the example script.

Implement Custom Extractor with `kex`

We provide an API to run a basic pipeline for preprocessing, by which one can implement a custom keyword extractor.

import kex

class CustomExtractor:
    """ Custom keyword extractor example: First N keywords extractor """

    def __init__(self, maximum_word_number: int = 3):
        """ First N keywords extractor """
        self.phrase_constructor = kex.PhraseConstructor(maximum_word_number=maximum_word_number)

    def get_keywords(self, document: str, n_keywords: int = 10):
        """ Get keywords

         Parameters
        ------------------
        document: str
            Text to extract keywords from.
        n_keywords: int
            Maximum number of keywords to return.

         Returns
        ------------------
        A list of dictionaries, each with the keys 'stemmed', 'pos', 'raw', 'offset', and 'count'.
        e.g. {'stemmed': 'grid comput', 'pos': 'ADJ NOUN', 'raw': ['grid computing'], 'offset': [[11, 12]], 'count': 1}
        """
        # Build candidate phrases; the stemmed token list is not needed for this baseline.
        phrase_instance, _ = self.phrase_constructor.tokenize_and_stem_and_phrase(document)
        # Order candidates by the position of their first occurrence in the document.
        sorted_phrases = sorted(phrase_instance.values(), key=lambda x: x['offset'][0][0])
        # Slicing already handles the case of fewer than n_keywords candidates.
        return sorted_phrases[:n_keywords]
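
The ranking step is the only part that changes from one custom extractor to another. As a sketch of an alternative, the function below ranks candidates by frequency and breaks ties by first occurrence. It works on dictionaries in the format described in the docstring above; the candidate data is hand-written for illustration, and keying the mapping by stemmed phrase is an assumption about what `tokenize_and_stem_and_phrase` returns.

```python
def rank_by_count(phrase_instance, n_keywords=10):
    """Rank candidate phrases by frequency, breaking ties by first occurrence."""
    ranked = sorted(
        phrase_instance.values(),
        key=lambda x: (-x['count'], x['offset'][0][0]),
    )
    return ranked[:n_keywords]


# Hand-written candidates (hypothetical) in the documented format.
candidates = {
    'grid comput': {'stemmed': 'grid comput', 'pos': 'ADJ NOUN',
                    'raw': ['grid computing'], 'offset': [[11, 12]], 'count': 1},
    'keyword extract': {'stemmed': 'keyword extract', 'pos': 'NOUN NOUN',
                        'raw': ['keyword extraction', 'keyword extraction'],
                        'offset': [[3, 4], [20, 21]], 'count': 2},
    'word vector': {'stemmed': 'word vector', 'pos': 'NOUN NOUN',
                    'raw': ['word vectors'], 'offset': [[7, 8]], 'count': 1},
}

top = rank_by_count(candidates, n_keywords=2)
print([phrase['stemmed'] for phrase in top])
```

Inside `CustomExtractor.get_keywords`, the same function could replace the `sorted(...)` call on `phrase_instance`.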

Release history

- 2.0.6: Jun 8, 2023
- 2.0.5: Feb 10, 2021
- 2.0.4: Feb 9, 2021
- 2.0.3: Feb 2, 2021
- 1.0.1: Jan 27, 2021 (this version)

Download files

Source Distribution

kex-1.0.1.tar.gz (19.6 kB)

Uploaded Jan 27, 2021 Source

File details

Details for the file kex-1.0.1.tar.gz.

File metadata

Download URL: kex-1.0.1.tar.gz
Upload date: Jan 27, 2021
Size: 19.6 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/3.3.0 pkginfo/1.7.0 requests/2.25.1 setuptools/49.2.1 requests-toolbelt/0.9.1 tqdm/4.56.0 CPython/3.9.1

File hashes

Hashes for kex-1.0.1.tar.gz
Algorithm	Hash digest
SHA256	`6ac528aefb6787e56d110c14a823bf24bc127b5f69569fc97ec1ee5c4ce69275`
MD5	`75f7311d282836f0ab98ea8a3357db4d`
BLAKE2b-256	`2cead8e39ff3555bbcad71e00f945ff5f5d299359280ccbd8b8eae8c1f3ded39`
