DistinctKeywords

DistinctKeywords is a utility for extracting semantically distinct keywords from a document. The method is unsupervised and based on word2vec. The current implementation uses a word2vec model trained on simplewiki. A Hilbert curve acts as a locality-sensitive hash.
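
To illustrate why a Hilbert curve can serve as a locality-sensitive hash, here is a minimal sketch of the classic 2D Hilbert index. It is not the package's code: the 2D grid, the `quantize` helper, and the toy vectors are assumptions made for illustration (real word2vec vectors have many more dimensions).

```python
def hilbert_index(order: int, x: int, y: int) -> int:
    """Position of grid cell (x, y) along a Hilbert curve covering a
    2**order by 2**order grid (the standard iterative xy-to-d conversion)."""
    n = 1 << order
    d = 0
    s = n >> 1
    while s > 0:
        rx = 1 if (x & s) else 0
        ry = 1 if (y & s) else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate/flip the quadrant so the sub-curve is oriented correctly.
        if ry == 0:
            if rx == 1:
                x = n - 1 - x
                y = n - 1 - y
            x, y = y, x
        s >>= 1
    return d


def quantize(value: float, order: int, low: float = -1.0, high: float = 1.0) -> int:
    """Hypothetical helper: clip a float to [low, high] and map it to a grid coordinate."""
    n = 1 << order
    clipped = min(max(value, low), high)
    return min(n - 1, int((clipped - low) / (high - low) * n))


# Toy 2D "embeddings" (made up for this example, not real word2vec output).
toy_vectors = {
    "algorithm": (0.62, 0.55),
    "procedure": (0.60, 0.58),
    "banana": (-0.80, -0.35),
}

ORDER = 4  # 16 x 16 grid
for word, (vx, vy) in toy_vectors.items():
    code = hilbert_index(ORDER, quantize(vx, ORDER), quantize(vy, ORDER))
    # Fixed-width bit string, so that shared prefixes indicate nearby cells.
    print(word, format(code, f"0{2 * ORDER}b"))
```

Cells that are consecutive along the curve are always adjacent on the grid, which is why words with close vectors tend to share a hash prefix. The converse is not guaranteed: two neighbouring cells can occasionally sit far apart along the curve.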

Methodology

After the word2vec model is trained, each word vector is mapped onto a Hilbert curve and the results are stored as key-value pairs (every word has a Hilbert hash). For a new document, the words and phrases are cleaned and then hashed by looking them up in this dictionary. One word is then selected from each distinct hash prefix, using WordNet ranking from NLTK (rare words are prioritized). Grouping and lookup are made fast with a Trie and a SortedDict.
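
The selection step can be sketched roughly as follows. This is an illustration under stated assumptions, not the package's implementation: `WORD_HASH` and `COMMONNESS` are hypothetical hand-written tables, a plain dictionary stands in for the Trie and SortedDict, and a simple commonness score stands in for the WordNet ranking.

```python
from collections import defaultdict

# Hypothetical precomputed dictionary: word -> Hilbert hash as a fixed-width bit string.
WORD_HASH = {
    "algorithm": "00101101",
    "procedure": "00101110",
    "vector": "01100010",
    "matrix": "01100111",
    "bias": "11010001",
}

# Hypothetical commonness scores (higher means more common); a stand-in for WordNet ranking.
COMMONNESS = {"algorithm": 5, "procedure": 9, "vector": 4, "matrix": 6, "bias": 7}


def select_distinct(tokens, prefix_len=4):
    """Group tokens by the first prefix_len bits of their hash and keep
    the rarest token of each group."""
    groups = defaultdict(list)
    for token in tokens:
        code = WORD_HASH.get(token)
        if code is None:
            continue  # word not in the dictionary: skipped in this sketch
        groups[code[:prefix_len]].append(token)
    # Iterate over prefixes in sorted order so the result is deterministic.
    return [
        min(words, key=lambda w: COMMONNESS[w])
        for _, words in sorted(groups.items())
    ]


tokens = ["algorithm", "procedure", "vector", "matrix", "bias", "the"]
print(select_distinct(tokens))
```

A shorter `prefix_len` merges more words into each group and yields fewer, more distinct keywords. A longer one keeps near-synonyms apart.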

Installation dependencies

NLTK and spaCy must be installed, and the spaCy model en_core_web_sm must be available before use.
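
One way to check these prerequisites before importing the package is sketched below. The `missing_dependencies` helper is hypothetical and not part of DistinctKeywords. It only checks that the three packages are importable, and the install hints in the comments are the usual commands for these libraries rather than documented requirements of this project.

```python
import importlib.util

# Import names to look for. en_core_web_sm is installed as a regular Python package.
REQUIRED = ("nltk", "spacy", "en_core_web_sm")


def missing_dependencies(names=REQUIRED):
    """Return the import names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_dependencies()
if missing:
    # Typical fixes (assumed, adjust to your environment):
    #   pip install nltk spacy
    #   python -m spacy download en_core_web_sm
    # The WordNet data itself is usually fetched with nltk.download("wordnet").
    print("Missing:", ", ".join(missing))
else:
    print("All dependencies found.")
```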

Benchmarks

The method is currently tested against the KPTimes test dataset (20,000 articles). It achieves a recall of 30% against the manually assigned keywords in the dataset.
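
For readers who want to run a comparable measurement, recall can be computed along these lines. This is a sketch under assumptions: the matching rule (case-insensitive exact match) and the averaging over documents are guesses, since the benchmark's exact protocol is not described here.

```python
def keyword_recall(predicted, gold):
    """Fraction of gold keywords that also appear among the predicted keywords.
    Matching is a case-insensitive exact match (an assumption of this sketch)."""
    gold_set = {k.strip().lower() for k in gold}
    if not gold_set:
        return 0.0
    predicted_set = {k.strip().lower() for k in predicted}
    return len(gold_set & predicted_set) / len(gold_set)


def mean_recall(pairs):
    """Average per-document recall over (predicted, gold) pairs."""
    scores = [keyword_recall(predicted, gold) for predicted, gold in pairs]
    return sum(scores) / len(scores) if scores else 0.0


# Made-up example data, not taken from KPTimes.
example = [
    (["machine learning", "bias", "task"], ["Machine Learning", "supervised learning"]),
    (["vector", "training"], ["training"]),
]
print(mean_recall(example))
```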

Usage

from keywords import DistinctKeywords

doc = """ Supervised learning is the machine learning task of learning a function that maps an input to an output based on example input-output pairs. It infers a function from labeled training data consisting of a set of training examples. In supervised learning, each example is a pair consisting of an input object (typically a vector) and a desired output value (also called the supervisory signal). A supervised learning algorithm analyzes the training data and produces an inferred function, which can be used for mapping new examples. An optimal scenario will allow for the algorithm to correctly determine the class labels for unseen instances. This requires the learning algorithm to generalize from the training data to unseen situations in a 'reasonable' way (see inductive bias). """

Initialize the class and extract the keywords

distinct_keywords = DistinctKeywords()

distinct_keywords.get_keywords(doc)

Output

['machine learning', 'pairs', 'mapping', 'vector', 'typically', 'supervised', 'bias', 'supervisory', 'task', 'algorithm', 'unseen', 'training']

