To generate a word2vec model, but using multi-word keywords instead of single words.

keywords2vec

A simple and fast way to generate a word2vec model, with multi-word keywords instead of single words.

Example result

Finding similar keywords for "obesity"

index	term
0	overweight
1	obese
2	physical inactivity
3	excess weight
4	obese adults
5	high bmi
6	obese adults
7	obese people
8	obesity-related outcomes
9	obesity among children
10	poor sleep quality
11	ssbs
12	obese populations
13	cardiometabolic risk
14	abdominal obesity

Install

pip install keywords2vec

How to use

Let's download some example data:

data_filepath = "epistemonikos_data_sample.tsv.gz"

!wget "https://s3.amazonaws.com/episte-labs/epistemonikos_data_sample.tsv.gz" -O "{data_filepath}"
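
The `!wget` line above is notebook shell syntax. Outside a notebook, a minimal sketch using only the Python standard library could look like the following. The helper name `download_sample` is our own illustration and is not part of keywords2vec.

```python
import urllib.request
from pathlib import Path

# URL and file name copied from the example above.
DATA_URL = "https://s3.amazonaws.com/episte-labs/epistemonikos_data_sample.tsv.gz"


def download_sample(url=DATA_URL, dest="epistemonikos_data_sample.tsv.gz"):
    """Download the sample file unless it already exists, and return its path.

    Hypothetical helper for illustration only.
    """
    path = Path(dest)
    if not path.exists():
        # Only hit the network when the file is not already on disk.
        urllib.request.urlretrieve(url, path)
    return path


# Uncomment to fetch the file (requires network access):
# data_filepath = str(download_sample())
```

Skipping the download when the file already exists keeps repeated runs cheap.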

We create the model. If you need the vectors, take a look here

labels, tree = similars_tree(data_filepath)

processing file: epistemonikos_data_sample.tsv.gz

Then we can get the most similar keywords:

get_similars(tree, labels, "obesity")

['obesity',
 'overweight',
 'obese',
 'physical inactivity',
 'excess weight',
 'high bmi',
 'obese adults',
 'obese people',
 'obesity-related outcomes',
 'obesity among children',
 'poor sleep quality',
 'ssbs',
 'obese populations',
 'cardiometabolic risk',
 'abdominal obesity']

get_similars(tree, labels, "heart failure")

['heart failure',
 'hf',
 'chf',
 'chronic heart failure',
 'reduced ejection fraction',
 'unstable angina',
 'peripheral vascular disease',
 'peripheral arterial disease',
 'angina',
 'congestive heart failure',
 'left ventricular systolic dysfunction',
 'acute coronary syndrome',
 'heart failure patients',
 'acute myocardial infarction',
 'left ventricular dysfunction']

Motivation

The idea started at Epistemonikos (www.epistemonikos.org), a database of scientific articles for people making decisions concerning clinical or health-policy questions. In this context, the scientific and health language is complex. You can easily find keywords like:

asthma
heart failure
medial compartment knee osteoarthritis
preserved left ventricular systolic function
non-selective non-steroidal anti-inflammatory drugs

We tried several approaches to find those keywords, such as n-grams, n-grams + tf-idf, and entity identification, among others, but none of them gave really good results.

Our approach

We found that tokenizing on stopwords and non-word characters was really useful for "finding" the keywords. For example:

input: "Timing of replacement therapy for acute renal failure after cardiac surgery"
output: [ "timing", "replacement therapy", "acute renal failure", "cardiac surgery" ]

So we basically split the text when we find:

a stopword
a non-word character (/ , ! ? . etc.), except for - and '

That's it.
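
The splitting rule above can be sketched in a few lines of standard-library Python. This is only an illustration of the idea, not the package's actual implementation: the stopword list is a tiny made-up sample, and the function name `stopword_tokenize` is our own.

```python
import re

# Tiny illustrative stopword list (an assumption; real lists are much larger).
STOPWORDS = {"a", "an", "and", "after", "for", "in", "of", "on", "the", "to", "with"}

# A word is a run of word characters, hyphens or apostrophes, so "-" and "'"
# never split a keyword. Any other non-space character is a separator.
TOKEN_RE = re.compile(r"(?P<word>[\w'-]+)|[^\w\s'-]")


def stopword_tokenize(text, stopwords=STOPWORDS):
    """Split text into keywords at every stopword or non-word character."""
    keywords, current = [], []
    for match in TOKEN_RE.finditer(text.lower()):
        word = match.group("word")
        if word and word not in stopwords:
            current.append(word)  # extend the keyword being built
        elif current:
            keywords.append(" ".join(current))  # a boundary closes the keyword
            current = []
    if current:
        keywords.append(" ".join(current))
    return keywords


print(stopword_tokenize(
    "Timing of replacement therapy for acute renal failure after cardiac surgery"
))
# ['timing', 'replacement therapy', 'acute renal failure', 'cardiac surgery']
```

Because the hyphen is treated as part of a word, a term such as "non-steroidal anti-inflammatory drugs" stays together as one keyword.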

However, there were some problems with keywords that contain stopwords, like:

Vitamin A
Hepatitis A
Web of Science

So we decided to add another method (nltk with some grammar definitions) to cover most of these cases. To use it, pass the parameter keywords_w_stopwords=True. This method is approximately 20x slower.
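
To make the problem concrete, the sketch below shows how plain stopword splitting breaks "Vitamin A" and "Hepatitis A". It then applies a much simpler workaround than the nltk grammar method described above: a hand-written list of protected phrases. The list, the toy stopwords and both function names are assumptions made for illustration, and this is not how keywords_w_stopwords=True works.

```python
import re

STOPWORDS = {"a", "an", "and", "for", "in", "of", "the", "with"}  # toy list (assumption)
# Hypothetical allowlist of keywords that legitimately contain stopwords.
PROTECTED = ["vitamin a", "hepatitis a", "web of science"]
TOKEN_RE = re.compile(r"(?P<word>[\w'-]+)|[^\w\s'-]")


def split_on_stopwords(text):
    """Plain splitting: every stopword or non-word character ends a keyword."""
    keywords, current = [], []
    for match in TOKEN_RE.finditer(text.lower()):
        word = match.group("word")
        if word and word not in STOPWORDS:
            current.append(word)
        elif current:
            keywords.append(" ".join(current))
            current = []
    if current:
        keywords.append(" ".join(current))
    return keywords


def split_with_protected(text, protected=PROTECTED):
    """Glue protected phrases with underscores, split, then restore the spaces."""
    lowered = text.lower()
    for phrase in protected:
        pattern = r"\b" + re.escape(phrase) + r"\b"
        lowered = re.sub(pattern, phrase.replace(" ", "_"), lowered)
    return [keyword.replace("_", " ") for keyword in split_on_stopwords(lowered)]


sentence = "Vitamin A supplementation for children with hepatitis A"
print(split_on_stopwords(sentence))
# ['vitamin', 'supplementation', 'children', 'hepatitis']
print(split_with_protected(sentence))
# ['vitamin a supplementation', 'children', 'hepatitis a']
```

An allowlist only covers phrases known in advance, and it would also rewrite genuine underscores in the input, which is why a grammar-based method is the more general option.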

References

This seems to be an old idea (2004):

Mihalcea, Rada, and Paul Tarau. "Textrank: Bringing order into text." Proceedings of the 2004 conference on empirical methods in natural language processing. 2004.

Reading an implementation of TextRank, I realized it used stopwords to split the text and build the graph. Then I thought of using that as the tokenizer for word2vec.

As pointed out by @deliprao in this Twitter thread, it's also used by RAKE (2010):

Rose, Stuart & Engel, Dave & Cramer, Nick & Cowley, Wendy. (2010). Automatic Keyword Extraction from Individual Documents. 10.1002/9780470689646.ch1.

As noted by @astent in the Twitter thread, this concept is called chinking (chunking by exclusion) https://www.nltk.org/book/ch07.html#Chinking

Multi-lingual

We worked on an implementation that can be used in multiple languages. Of course, not all languages are suitable for this approach. We have tried it with good results in English, Spanish and Portuguese.

Try it online

You can try it here (takes time to load, lowercase only, doesn't work on mobile yet). It's an MVP :)

These embeddings were created using 827,341 titles/abstracts from the @epistemonikos database, with keywords that repeat at least 10 times. The total vocab is 349,080 keywords (a really manageable number).

Vocab size

One of the main benefits of this method is the size of the vocabulary. For example, using keywords that repeat at least 10 times in the Epistemonikos dataset (827,341 titles/abstracts), we got the following vocab sizes:

ngrams	keywords	comp (vs. stopword tokenizer)
1	127,824	36%
1,2	1,360,550	388%
1-3	3,204,099	914%
1-4	4,461,930	1,272%
1-5	5,133,619	1,464%
stopword tokenizer	350,529	100%

For more information about the comparison, take a look at the analyze folder.
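
The comp column in the table above can be reproduced from the raw counts: each value is the n-gram vocabulary size as a percentage of the stopword tokenizer's vocabulary. The sketch below assumes the percentages are truncated to whole numbers, which matches every row of the table. The function name `comp_percent` is our own.

```python
# Vocabulary sizes copied from the table above.
STOPWORD_TOKENIZER_VOCAB = 350_529
NGRAM_VOCAB = {
    "1": 127_824,
    "1,2": 1_360_550,
    "1-3": 3_204_099,
    "1-4": 4_461_930,
    "1-5": 5_133_619,
}


def comp_percent(vocab_size, baseline=STOPWORD_TOKENIZER_VOCAB):
    """Vocabulary size as a truncated whole-number percentage of the baseline."""
    return vocab_size * 100 // baseline


for ngrams, size in NGRAM_VOCAB.items():
    print(f"{ngrams}\t{size:,}\t{comp_percent(size):,}%")
# 1	127,824	36%
# 1,2	1,360,550	388%
# 1-3	3,204,099	914%
# 1-4	4,461,930	1,272%
# 1-5	5,133,619	1,464%
```

Read this way, unigrams alone give about a third of the stopword tokenizer's vocabulary, while adding bigrams already makes the vocabulary almost four times larger.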

Credits

This project has been created using nbdev

Release history

This version

0.1.0

Feb 26, 2020
