Finding topics in a set of documents in an unsupervised fashion.
Project description
To install:

pip3 install topicfinder

Topic Finder works in a three-part process:
- Tokenizing the document set. This applies the default wildgram analysis to tokenize the documents. Documents must be passed in the structure below: a list of dictionaries. docID is optional (if it is not included, the id is the document's index in the list).
from topicfinder import getTokenLists
ret = getTokenLists([{"text": "boo", "docID": "1"}])
- Sorting the entire token set by unique tokens and their frequencies.
from topicfinder import getTokenLists, getPhrases
ret = getPhrases(getTokenLists([{"text": "boo", "docID": "1"}]))
- Grouping the phrases by similarity, with some lemmatization and fuzzy matching. Any word (or lemmatized word) shorter than 3 letters is considered impossible to group. You can also set a similarity threshold (1.0 is an exact match); the default is (for now) 0.75.
from topicfinder import getTokenLists, getPhrases, groupPhrasesBySimilarity
ret = groupPhrasesBySimilarity(getPhrases(getTokenLists([{"text": "boo", "docID": "1"}])))
ret = groupPhrasesBySimilarity(getPhrases(getTokenLists([{"text": "boo", "docID": "1"}])),threshold=0.8)
## ret is a list of dictionaries of the following form,
## sorted in descending order by the size of the "tokens" array:
## {"phrases": [list of phrases], "tokens": [list of unique tokens from the documents]}
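To picture what the grouping step does, here is a rough, self-contained sketch. This is not topicfinder's actual implementation: `similarity` and `group_tokens` are hypothetical stand-ins, and difflib's ratio is just one possible fuzzy-matching measure. It does mirror the documented rules, though: a configurable threshold where 1.0 is an exact match, words under 3 letters left ungrouped, and output sorted descending by group size.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    # Ratio in [0, 1]; 1.0 is an exact match, mirroring the threshold semantics.
    return SequenceMatcher(None, a, b).ratio()

def group_tokens(tokens, threshold=0.75):
    # Hypothetical sketch of similarity grouping. Tokens shorter than 3
    # characters are treated as ungroupable singletons, as topicfinder does
    # for words (or lemmatized words) under 3 letters.
    groups = []
    for tok in tokens:
        if len(tok) >= 3:
            for grp in groups:
                if len(grp[0]) >= 3 and similarity(tok, grp[0]) >= threshold:
                    grp.append(tok)
                    break
            else:
                groups.append([tok])
        else:
            groups.append([tok])
    # Descending by group size, like topicfinder's output ordering.
    return sorted(groups, key=len, reverse=True)
```

With the default 0.75 threshold, "cough" and "coughs" (ratio about 0.91) land in one group, while the two-letter "ed" stays a singleton.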
You can also just call:
from topicfinder import synopsis
ret = synopsis([{"text": "boo", "docID": "1"}], threshold=0.8)
This will do steps 1-3 in one go. I split the process up because, with medical data, you sometimes need to run and store intermediate results given the size of the dataset, set up a map-reduce situation, or be extra careful about PHI.
For anyone worried about PHI -- ain't none of this connected to the internet. It's all local, baby. You can also check the code if you need to.
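Once you have the result, you only need the documented output shape to work with it. A minimal sketch of pulling out top topics (the `groups` data below is made up for illustration, and `top_topics` is a hypothetical helper, not part of topicfinder):

```python
# Illustrative only: 'groups' mimics the documented output shape of
# groupPhrasesBySimilarity / synopsis (the phrases and counts are made up).
groups = [
    {"phrases": ["chest pain", "chest pains"], "tokens": ["chest pain", "chest pains"]},
    {"phrases": ["fever"], "tokens": ["fever"]},
]

def top_topics(groups, n=5):
    # Groups arrive sorted descending by len(tokens); take the first phrase
    # of each group as a representative label for the topic.
    return [(g["phrases"][0], len(g["tokens"])) for g in groups[:n]]

print(top_topics(groups))  # [('chest pain', 2), ('fever', 1)]
```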
Future work will include:
- applying an annotation normalization function (e.g. rolling a set of synonyms up into one term)
- providing some default normalization functions (e.g. querying UMLS) -- note that these might need an internet connection to work
- dealing with acronyms and short words better
- dealing with numbers and time durations better
Source distribution: topicfinder-0.0.4.tar.gz (3.4 kB)
Hashes for topicfinder-0.0.4-py3-none-any.whl

Algorithm | Hash digest
---|---
SHA256 | 25517c37f2c72b66021e70dac25a5a899a791981d720ec68d2d83e4796163059
MD5 | 24a18f9acb70212b0af327af6d529ef6
BLAKE2b-256 | df55c8d9846fb0840df79fa03b6d3ef9fc3c20fca50803b64efbda9a14713dee