A spaCy pipeline component for extracting keywords from text using cosine similarity.

🔑 Keyword spaCy

Keyword spaCy is a spaCy pipeline component that extracts keywords from text using cosine similarity. It is based on KeyBERT: A Minimal Method for Keyphrase Extraction using BERT, a transformer-based approach to keyword extraction, and follows that methodology closely. Users can specify the range of n-gram sizes to consider and can enable a strict mode, which limits results to that range.

Installation

Install Keyword spaCy from PyPI (it requires spaCy):

pip install keyword-spacy

Then, download the en_core_web_md model:

python -m spacy download en_core_web_md

Usage

To use the keyword extractor, first create a spaCy nlp object:

import spacy
nlp = spacy.load("en_core_web_md")

Then, add the KeywordExtractor to the pipeline:

nlp.add_pipe(
    "keyword_extractor",
    last=True,
    config={"top_n": 10, "min_ngram": 3, "max_ngram": 3, "strict": True},
)

Now you can process text and extract keywords:

text = "Natural language processing is a fascinating domain of artificial intelligence. It allows computers to understand and generate human language."
doc = nlp(text)
print("Top Keywords:", doc._.keywords)

Output:

Top Keywords: ['generate human language', 'Natural language processing']

Each token that is not punctuation also receives a custom attribute, ._.keyword_value, which holds the similarity between that token and doc.vector. This may be helpful for other downstream tasks.
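
As a rough illustration of what a per-token score of this kind measures, the sketch below computes the cosine similarity between each non-punctuation token and an averaged document vector. It is not the package's code: the three-dimensional vectors, the toy_tokens list, and the helper names cosine and mean_vector are hypothetical, and averaging token vectors is only assumed here because that is how spaCy builds doc.vector by default.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def mean_vector(vectors):
    """Element-wise average of a list of equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]

# Hypothetical (text, vector, is_punct) triples; real models use hundreds of dimensions.
toy_tokens = [
    ("language", [0.9, 0.1, 0.0], False),
    ("processing", [0.8, 0.3, 0.1], False),
    (".", [0.0, 0.0, 1.0], True),  # punctuation receives no score
]

# The document vector averages over every token, punctuation included.
doc_vector = mean_vector([vec for _, vec, _ in toy_tokens])

keyword_values = {
    text: cosine(vec, doc_vector)
    for text, vec, is_punct in toy_tokens
    if not is_punct
}
for text, value in keyword_values.items():
    print(f"{text}: {value:.3f}")
```

With the component installed, the equivalent values would be read from token._.keyword_value rather than computed by hand.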

Configuration

The KeywordExtractor can be configured using the following parameters:

  • top_n: The number of top keywords to extract.
  • min_ngram: The minimum size for n-grams.
  • max_ngram: The maximum size for n-grams.
  • strict: If set to True, only n-grams within the min_ngram to max_ngram range are considered. If False, individual tokens and the specified range of n-grams are considered.
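
To make the interaction between min_ngram, max_ngram, and strict concrete, here is a minimal sketch of how candidate phrases could be enumerated under the rules described above. The function candidate_ngrams is hypothetical and written only for illustration; the package may build its candidates differently.

```python
def candidate_ngrams(tokens, min_ngram, max_ngram, strict):
    """List candidate phrases from a token list (illustrative only).

    strict=True  -> only n-grams with min_ngram <= n <= max_ngram
    strict=False -> individual tokens plus the n-grams in that range
    """
    sizes = set(range(min_ngram, max_ngram + 1))
    if not strict:
        sizes.add(1)  # also consider individual tokens
    candidates = []
    for n in sorted(sizes):
        # Slide a window of n tokens across the text.
        for start in range(len(tokens) - n + 1):
            candidates.append(" ".join(tokens[start:start + n]))
    return candidates

tokens = ["natural", "language", "processing", "is", "fascinating"]
strict_candidates = candidate_ngrams(tokens, 3, 3, strict=True)
loose_candidates = candidate_ngrams(tokens, 3, 3, strict=False)
print(strict_candidates)  # trigrams only
print(loose_candidates)   # single tokens followed by trigrams
```

Under this sketch, setting min_ngram and max_ngram to 3 with strict enabled yields only three-word phrases, which matches the shape of the output in the usage example.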

Methodology

The methodology employed by Keyword spaCy is inspired by KeyBERT. It computes the cosine similarity between each token or n-gram and the entire document to determine how relevant each term is. The terms most similar to the document are then returned as keywords.
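
The ranking step described above can be sketched as follows. This is an illustration under stated assumptions, not the package's implementation: the token vectors are made up, top_keywords is a hypothetical helper, and each phrase vector is assumed to be the average of its token vectors (spaCy's default behaviour for spans).

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def mean_vector(vectors):
    """Element-wise average of a list of equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def top_keywords(candidates, token_vectors, doc_vector, top_n):
    """Rank candidate phrases by similarity to the document and keep the best top_n."""
    scored = []
    for phrase in candidates:
        phrase_vector = mean_vector([token_vectors[w] for w in phrase.split()])
        scored.append((cosine(phrase_vector, doc_vector), phrase))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [phrase for _, phrase in scored[:top_n]]

# Hypothetical 3-dimensional token vectors for a four-token document.
token_vectors = {
    "language": [0.9, 0.1, 0.0],
    "models": [0.7, 0.3, 0.0],
    "love": [0.0, 0.2, 0.9],
}
doc_tokens = ["language", "models", "love", "language"]
doc_vector = mean_vector([token_vectors[t] for t in doc_tokens])

candidates = ["language models", "models love", "love language"]
keywords = top_keywords(candidates, token_vectors, doc_vector, top_n=2)
print("Top Keywords:", keywords)
```

Because "language" dominates the toy document, phrases containing it sit closest to the document vector and rank first.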
