KeyBERT performs keyword extraction with state-of-the-art transformer models.

Project description

KeyBERT

KeyBERT is a minimal and easy-to-use keyword extraction technique that leverages BERT embeddings to create keywords and keyphrases that are most similar to a document.

The corresponding Medium post can be found here.

Table of Contents

  1. About the Project
  2. Getting Started
    2.1. Installation
    2.2. Basic Usage

1. About the Project

Back to ToC

Although there are already many methods available for keyword generation (e.g., Rake, YAKE!, TF-IDF), I wanted to create a very basic, but powerful, method for extracting keywords and keyphrases. This is where KeyBERT comes in: it uses BERT embeddings and simple cosine similarity to find the sub-phrases in a document that are most similar to the document itself.

First, document embeddings are extracted with BERT to get a document-level representation. Then, word embeddings are extracted for N-gram words/phrases. Finally, we use cosine similarity to find the words/phrases that are the most similar to the document. The most similar words could then be identified as the words that best describe the entire document.
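The three steps above can be sketched in plain Python. This is a minimal illustration only: the tiny 3-dimensional vectors below are made-up toy numbers standing in for real BERT embeddings, and the candidate phrases are hypothetical.

```python
# Sketch of the three-step pipeline with toy vectors standing in for
# BERT embeddings (hypothetical numbers, for illustration only).
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Step 1: a document-level embedding (toy 3-d vector).
doc_emb = [0.9, 0.4, 0.1]

# Step 2: embeddings for candidate n-gram words/phrases (also toy vectors).
candidates = {
    "supervised learning": [0.8, 0.5, 0.1],
    "training data":       [0.7, 0.3, 0.2],
    "inductive bias":      [0.1, 0.2, 0.9],
}

# Step 3: rank candidates by cosine similarity to the document embedding.
ranked = sorted(candidates, key=lambda w: cosine(candidates[w], doc_emb),
                reverse=True)
top = ranked[0]
print(top)  # the phrase most similar to the document
```

In KeyBERT itself, the embeddings come from a BERT model rather than hand-written vectors, but the ranking logic is this simple.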

KeyBERT is by no means unique and was created as a quick and easy method for creating keywords and keyphrases. Although there are many great papers and solutions out there that use BERT embeddings (e.g., 1, 2, 3), I could not find a BERT-based solution that did not have to be trained from scratch and that beginners could use (correct me if I'm wrong!). Thus, the goal was a pip install keybert and at most 3 lines of code in usage.

NOTE: If you use MMR to select the candidates instead of simple cosine similarity, this repo is essentially a simplified implementation of EmbedRank with BERT-embeddings.

2. Getting Started

Back to ToC

2.1. Installation

PyTorch 1.2.0 or higher is recommended. If the install below gives an error, please install PyTorch first here.

Installation can be done using pip:

pip install keybert

2.2. Basic Usage

The most minimal example can be seen below for the extraction of keywords:

from keybert import KeyBERT

doc = """
         Supervised learning is the machine learning task of learning a function that
         maps an input to an output based on example input-output pairs.[1] It infers a
         function from labeled training data consisting of a set of training examples.[2]
         In supervised learning, each example is a pair consisting of an input object
         (typically a vector) and a desired output value (also called the supervisory signal). 
         A supervised learning algorithm analyzes the training data and produces an inferred function, 
         which can be used for mapping new examples. An optimal scenario will allow for the 
         algorithm to correctly determine the class labels for unseen instances. This requires 
         the learning algorithm to generalize from the training data to unseen situations in a 
         'reasonable' way (see inductive bias).
      """
model = KeyBERT('distilbert-base-nli-mean-tokens')
keywords = model.extract_keywords(doc)

You can set keyphrase_length to control the length of the resulting keyphrases:

>>> model.extract_keywords(doc, keyphrase_length=1, stop_words=None)
['learning', 
 'training', 
 'algorithm', 
 'class', 
 'mapping']

To extract keyphrases, simply set keyphrase_length to 2 or higher depending on the number of words you would like in the resulting keyphrases:

>>> model.extract_keywords(doc, keyphrase_length=3, stop_words=None)
['learning algorithm',
 'learning machine',
 'machine learning',
 'supervised learning',
 'learning function']

To diversify the results, we can use Maximal Marginal Relevance (MMR), which selects keywords/keyphrases based on cosine similarity while penalizing redundancy among the selected phrases:

>>> model.extract_keywords(doc, keyphrase_length=3, stop_words=None, use_mmr=True, diversity=0.7)
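To make the diversification step concrete, here is a minimal sketch of the greedy MMR selection itself. The vectors and candidate phrases are toy, made-up values (real usage relies on BERT embeddings), and the scoring follows the standard MMR formulation: balance relevance to the document against similarity to already-selected phrases.

```python
# Minimal sketch of Maximal Marginal Relevance (MMR) with toy embeddings;
# the vectors and phrases are hypothetical, for illustration only.
from math import sqrt

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(x * x for x in b)))

def mmr(doc_emb, candidates, top_n=2, diversity=0.7):
    """Greedily pick phrases similar to the document but dissimilar
    to the phrases already selected."""
    selected = []
    remaining = dict(candidates)
    while remaining and len(selected) < top_n:
        def score(phrase):
            relevance = cosine(remaining[phrase], doc_emb)
            redundancy = max(
                (cosine(remaining[phrase], candidates[s]) for s in selected),
                default=0.0,
            )
            # High diversity -> redundancy is penalized more heavily.
            return (1 - diversity) * relevance - diversity * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        del remaining[best]
    return selected

doc_emb = [0.9, 0.4, 0.1]
candidates = {
    "supervised learning": [0.8, 0.5, 0.1],
    "machine learning":    [0.8, 0.5, 0.2],  # near-duplicate of the above
    "inductive bias":      [0.1, 0.2, 0.9],
}
picked = mmr(doc_emb, candidates, top_n=2, diversity=0.7)
print(picked)
```

With a high diversity setting, the near-duplicate "machine learning" is skipped in favor of a less redundant phrase, which is exactly the behavior the diversity parameter controls.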

References

Below, you can find several resources that were used for the creation of KeyBERT, but which, most importantly, are excellent resources for creating impressive keyword extraction models:

Papers:

GitHub Repos:

MMR:
The selection of keywords/keyphrases was modelled after:

NOTE: If you find a paper or GitHub repo that has an easy-to-use implementation of BERT embeddings for keyword/keyphrase extraction, let me know! I'll make sure to add it as a reference to this repo.

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

keybert-0.1.0.tar.gz (6.9 kB view details)

Uploaded Source

Built Distribution

keybert-0.1.0-py2.py3-none-any.whl (8.1 kB view details)

Uploaded Python 2 Python 3

File details

Details for the file keybert-0.1.0.tar.gz.

File metadata

  • Download URL: keybert-0.1.0.tar.gz
  • Upload date:
  • Size: 6.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.2.0 pkginfo/1.6.1 requests/2.23.0 setuptools/41.4.0 requests-toolbelt/0.9.1 tqdm/4.48.2 CPython/3.7.4

File hashes

Hashes for keybert-0.1.0.tar.gz:

  • SHA256: 4f87877317577f55596e4d22ffd1c93a83799f9c9db5e132470959be284e0bb6
  • MD5: e9ba5aa99c00f0044a79ac7b4ea51c23
  • BLAKE2b-256: 4a6894d1fab6454200028732f1bdb71652863e35aa1839702391771cac12f398

See more details on using hashes here.

File details

Details for the file keybert-0.1.0-py2.py3-none-any.whl.

File metadata

  • Download URL: keybert-0.1.0-py2.py3-none-any.whl
  • Upload date:
  • Size: 8.1 kB
  • Tags: Python 2, Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.2.0 pkginfo/1.6.1 requests/2.23.0 setuptools/41.4.0 requests-toolbelt/0.9.1 tqdm/4.48.2 CPython/3.7.4

File hashes

Hashes for keybert-0.1.0-py2.py3-none-any.whl:

  • SHA256: 4833dee921e123c9777dc613a13b67a9c9fa3d1c691cf328aa03bdaf1278ff25
  • MD5: 795113626eed7f61d87c7e6cbd0427a7
  • BLAKE2b-256: 565fe4408383be64f156c6bbb384fa1d0385fd472342193f362670c06fe7c124

See more details on using hashes here.
