Light/easy keyword extraction from documents.

Kex

Kex is a Python library for unsupervised keyword extraction. It provides built-in extraction algorithms, access to 15 public benchmark datasets, and an API for implementing custom extractors.

Get Started

Install via pip

pip install kex

Extract Keywords with Kex

Kex ships with built-in algorithms including SingleRank, TF, TFIDF, TFIDFRank, LexSpec, LexRank, TopicalPageRank, and SingleTPR.

Basic usage:

>>> import kex
>>> model = kex.SingleRank()  # any algorithm listed above
>>> sample = '''
We propose a novel unsupervised keyphrase extraction approach that filters candidate keywords using outlier detection.
It starts by training word embeddings on the target document to capture semantic regularities among the words. It then
uses the minimum covariance determinant estimator to model the distribution of non-keyphrase word vectors, under the
assumption that these vectors come from the same distribution, indicative of their irrelevance to the semantics
expressed by the dimensions of the learned vector representation. Candidate keyphrases only consist of words that are
detected as outliers of this dominant distribution. Empirical results show that our approach outperforms state
of-the-art and recent unsupervised keyphrase extraction methods.
'''
>>> model.get_keywords(sample, n_keywords=2)
[{'stemmed': 'non-keyphras word vector',
  'pos': 'ADJ NOUN NOUN',
  'raw': ['non-keyphrase word vectors'],
  'offset': [[47, 49]],
  'count': 1,
  'score': 0.06874471825637762,
  'n_source_tokens': 112},
 {'stemmed': 'semant regular word',
  'pos': 'ADJ NOUN NOUN',
  'raw': ['semantic regularities words'],
  'offset': [[28, 32]],
  'count': 1,
  'score': 0.06001468574146248,
  'n_source_tokens': 112}]
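
Each returned entry is a plain dictionary, so the result can be post-processed with ordinary Python. The sketch below is not part of kex: `summarize_keywords` is a hypothetical helper name, and the input literal is copied from the example output above with some fields trimmed. It flattens the entries into `(phrase, score)` pairs.

```python
def summarize_keywords(keywords):
    """Return (surface form, rounded score) pairs from get_keywords-style dicts."""
    pairs = []
    for entry in keywords:
        # 'raw' holds the surface forms found in the text; fall back to the
        # stemmed form if the list is empty.
        surface = entry['raw'][0] if entry['raw'] else entry['stemmed']
        pairs.append((surface, round(entry['score'], 4)))
    return pairs


# Entries copied from the example output above (fields trimmed for brevity).
example_output = [
    {'stemmed': 'non-keyphras word vector',
     'raw': ['non-keyphrase word vectors'],
     'score': 0.06874471825637762},
    {'stemmed': 'semant regular word',
     'raw': ['semantic regularities words'],
     'score': 0.06001468574146248},
]

for phrase, score in summarize_keywords(example_output):
    print(f'{score:.4f}  {phrase}')
```

In real use, `example_output` would be replaced by the return value of `model.get_keywords(...)`.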

Compute a statistical prior

Algorithms such as TF, TFIDF, TFIDFRank, LexSpec, LexRank, TopicalPageRank, and SingleTPR rely on a prior distribution that must be computed beforehand from a collection of documents:

>>> import kex
>>> model = kex.SingleTPR()
>>> test_sentences = ['documentA', 'documentB', 'documentC']
>>> model.train(test_sentences, export_directory='./tmp')

Priors are cached in the export directory and can be loaded on the fly as follows:

>>> import kex
>>> model = kex.SingleTPR()
>>> model.load('./tmp')
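
The two snippets above can be combined into a "load if cached, otherwise train" pattern. The following is a minimal sketch, not a kex function: `load_or_train` is a hypothetical helper, it only calls `train(..., export_directory=...)` and `load(...)` as shown above, and it assumes that an existing directory means a usable cached prior.

```python
import os


def load_or_train(model, documents, directory):
    """Load a cached prior from `directory` if present, otherwise train one.

    `model` is assumed to expose `train(documents, export_directory=...)`
    and `load(directory)` as in the snippets above.
    Returns 'loaded' or 'trained' so the caller can log what happened.
    """
    # Assumption: the directory exists only if a prior was exported there.
    if os.path.isdir(directory):
        model.load(directory)
        return 'loaded'
    model.train(documents, export_directory=directory)
    return 'trained'
```

With kex installed, this would be called as `load_or_train(kex.SingleTPR(), test_sentences, './tmp')`.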

Supported language

Currently, the algorithms are available only for English. We plan to relax this constraint so that other languages can be supported.

Benchmark on 15 Public Datasets

Users can fetch 15 public keyword extraction datasets via kex.get_benchmark_dataset.

>>> import kex
>>> json_line, language = kex.get_benchmark_dataset('Inspec')
>>> json_line[0]
{
    'keywords': ['kind infer', 'type check', 'overload', 'nonstrict pure function program languag', ...],
    'source': 'A static semantics for Haskell\nThis paper gives a static semantics for Haskell 98, a non-strict ...',
    'id': '1053.txt'
}

Please take a look at the example script to run a benchmark on those datasets.
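
Because the gold `keywords` in each record are stemmed, they can be compared directly with the `stemmed` field of extracted keywords. The sketch below shows one simple way to score predictions, precision over the top k. It is an illustration only and not necessarily the metric used by the example script: `precision_at_k` is a hypothetical helper, and the predictions are made up.

```python
def precision_at_k(predicted, gold, k=5):
    """Fraction of the top-k predicted stemmed phrases found in the gold set."""
    top_k = predicted[:k]
    if not top_k:
        return 0.0
    gold_set = set(gold)
    hits = sum(1 for phrase in top_k if phrase in gold_set)
    # Divide by the number of predictions actually considered, which may be
    # smaller than k when fewer phrases were predicted.
    return hits / len(top_k)


# Gold keywords copied from the record above (the elided tail is left out);
# the predictions are invented for illustration.
gold = ['kind infer', 'type check', 'overload',
        'nonstrict pure function program languag']
predicted = ['type check', 'static semant', 'overload', 'haskel']

print(precision_at_k(predicted, gold, k=4))  # 2 of the 4 predictions match
```

With kex installed, `predicted` would typically be built as `[kw['stemmed'] for kw in model.get_keywords(record['source'])]`.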

Implement Custom Extractor with Kex

We provide an API to run a basic pipeline for preprocessing, by which one can implement a custom keyword extractor.

import kex

class CustomExtractor:
    """ Custom keyword extractor example: First N keywords extractor """

    def __init__(self, maximum_word_number: int = 3):
        """ First N keywords extractor """
        self.phrase_constructor = kex.PhraseConstructor(maximum_word_number=maximum_word_number)

    def get_keywords(self, document: str, n_keywords: int = 10):
        """ Get keywords

        Parameters
        ------------------
        document: str
            Text to extract keywords from.
        n_keywords: int
            Maximum number of keywords to return.

        Returns
        ------------------
        A list of dictionaries, each with the keys 'stemmed', 'pos', 'raw', 'offset', and 'count'.
        e.g. {'stemmed': 'grid comput', 'pos': 'ADJ NOUN', 'raw': ['grid computing'], 'offset': [[11, 12]], 'count': 1}
        """
        # Build candidate phrases; the stemmed tokens are not needed here.
        phrase_instance, _ = self.phrase_constructor.tokenize_and_stem_and_phrase(document)
        # Order phrases by the start offset of their first recorded occurrence.
        sorted_phrases = sorted(phrase_instance.values(), key=lambda x: x['offset'][0][0])
        # Slicing already caps the result at the list length.
        return sorted_phrases[:n_keywords]
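
To see what the sort in `get_keywords` does without running the kex preprocessing pipeline, the sketch below applies the same ordering to a hand-written phrase dictionary. The dictionary contents are invented for illustration, and keying it by stemmed phrase is an assumption. Only the shape of the values follows the docstring above.

```python
# Hand-written stand-in for the phrase dictionary returned by the
# preprocessing step; entries are invented and follow the documented shape.
phrase_instance = {
    'grid comput': {'stemmed': 'grid comput', 'pos': 'ADJ NOUN',
                    'raw': ['grid computing'], 'offset': [[11, 12]], 'count': 1},
    'keyword extract': {'stemmed': 'keyword extract', 'pos': 'NOUN NOUN',
                        'raw': ['keyword extraction', 'keyword extraction'],
                        'offset': [[3, 4], [20, 21]], 'count': 2},
    'word embed': {'stemmed': 'word embed', 'pos': 'NOUN NOUN',
                   'raw': ['word embeddings'], 'offset': [[15, 16]], 'count': 1},
}


def first_n(phrases, n_keywords):
    """Return the n phrases whose first recorded offset starts earliest."""
    ordered = sorted(phrases.values(), key=lambda x: x['offset'][0][0])
    return ordered[:n_keywords]


for phrase in first_n(phrase_instance, 2):
    print(phrase['offset'][0][0], phrase['stemmed'])
```

The phrase starting at token 3 comes before the one starting at token 11, regardless of how often each occurs.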
