DistinctKeywords
This is a utility for extracting semantically distinct keywords from a document. The method is unsupervised and based on word2vec. The current implementation uses a word2vec model trained on simplewiki. A Hilbert curve acts as a locality-sensitive hash.
Methodology
After the word2vec model is created, each word is mapped onto a Hilbert curve and the results are stored as key-value pairs (every word has a Hilbert hash). For a new document, the words and phrases are cleaned and then hashed using this dictionary. One word is then selected from each distinct hash prefix using WordNet ranking from NLTK (rare words are prioritized). Grouping and lookup are made fast using a Trie and a SortedDict.
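The grouping idea can be illustrated with a minimal sketch. It is not the package's code: it assumes a toy 2D embedding already quantized to a 16x16 grid (real word2vec vectors have far more dimensions), and the names `hilbert_index`, `hilbert_hash`, `pick_distinct`, the coordinates, and the frequency table are all hypothetical. A plain frequency count stands in for the WordNet ranking, and a `defaultdict` stands in for the Trie and SortedDict.

```python
from collections import defaultdict


def hilbert_index(x, y, order):
    """Position of grid cell (x, y) along a 2D Hilbert curve of the given order."""
    n = 1 << order          # grid is n x n
    d = 0
    s = n >> 1
    while s > 0:
        rx = 1 if (x & s) else 0
        ry = 1 if (y & s) else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate/flip the quadrant so the next level is in canonical orientation.
        if ry == 0:
            if rx == 1:
                x = n - 1 - x
                y = n - 1 - y
            x, y = y, x
        s >>= 1
    return d


def hilbert_hash(x, y, order):
    """Fixed-width binary string; nearby cells tend to share a prefix."""
    return format(hilbert_index(x, y, order), "0{}b".format(2 * order))


def pick_distinct(words, coords, freq, order=4, prefix_len=4):
    """Group words by hash prefix and keep the rarest word of each group."""
    groups = defaultdict(list)
    for word in words:
        x, y = coords[word]
        groups[hilbert_hash(x, y, order)[:prefix_len]].append(word)
    return sorted(min(group, key=lambda w: freq[w]) for group in groups.values())


# Hypothetical toy data: grid coordinates and corpus frequencies.
coords = {
    "algorithm": (1, 1), "function": (2, 3),   # close together
    "vector": (12, 13), "matrix": (13, 12),    # close together
    "bias": (14, 1),
}
freq = {"algorithm": 50, "function": 90, "vector": 30, "matrix": 40, "bias": 10}

distinct = pick_distinct(list(coords), coords, freq)
print(distinct)
```

With these toy values, words that sit close together on the grid fall under the same 4-bit prefix, so only the rarer word of each cluster is kept.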
Installation dependencies
NLTK and spaCy, with the en_core_web_sm model, must be installed and loaded before use.
Benchmarks
This is currently tested against the KPTimes test dataset (20,000 articles). A recall of 30% is achieved when the extracted keywords are compared against the manually assigned keywords in the dataset.
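For reference, recall here means the fraction of the manual keywords that the extractor also produced. The sketch below shows one way to compute it for a single article. It assumes exact matching after lowercasing, which may differ from the matching scheme used for the reported score, and the `recall` helper and the gold keywords are hypothetical.

```python
def recall(predicted, gold):
    """Fraction of gold keywords that also appear among the predicted ones."""
    gold_set = {k.lower().strip() for k in gold}
    if not gold_set:
        return 0.0
    pred_set = {k.lower().strip() for k in predicted}
    return len(gold_set & pred_set) / len(gold_set)


# Hypothetical example: two of the four gold keywords are recovered.
predicted = ["machine learning", "pairs", "algorithm", "training"]
gold = ["Machine Learning", "supervised learning", "algorithm", "training data"]
score = recall(predicted, gold)
print(score)
```

A dataset-level figure would then be an average of this per-article value, or a ratio of pooled counts; which of the two was used is not stated here.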
Usage
from keywords import DistinctKeywords
doc = """ Supervised learning is the machine learning task of learning a function that maps an input to an output based on example input-output pairs. It infers a function from labeled training data consisting of a set of training examples. In supervised learning, each example is a pair consisting of an input object (typically a vector) and a desired output value (also called the supervisory signal). A supervised learning algorithm analyzes the training data and produces an inferred function, which can be used for mapping new examples. An optimal scenario will allow for the algorithm to correctly determine the class labels for unseen instances. This requires the learning algorithm to generalize from the training data to unseen situations in a 'reasonable' way (see inductive bias). """
Initialize the class
distinct_keywords = DistinctKeywords()
distinct_keywords.get_keywords(doc)
Output
['machine learning', 'pairs', 'mapping', 'vector', 'typically', 'supervised', 'bias', 'supervisory', 'task', 'algorithm', 'unseen', 'training']