Finding topics in a set of documents in an unsupervised fashion.
Project description
To install:
pip3 install topicfinder
Topic Finder works in a three-part process:
- Tokenizing the document set. This applies the default wildgram analysis to tokenize the documents. The documents must be a list of dictionaries, in the structure below. docID is an optional key (if it is not included, the ID is the document's index in the list).
from topicfinder import getTokenLists
ret = getTokenLists([{"text": "boo", "docID": "1"}])
- Sorting the entire token set by its unique tokens and their frequencies.
from topicfinder import getTokenLists, getPhrases
ret = getPhrases(getTokenLists([{"text": "boo", "docID": "1"}]))
- Grouping the phrases by similarity, with some lemmatization and fuzzy matching.
from topicfinder import getTokenLists, getPhrases, groupPhrasesBySimilarity
ret = groupPhrasesBySimilarity(getPhrases(getTokenLists([{"text": "boo", "docID": "1"}])))
## ret is a list of dictionaries in the following style:
## {"phrases": [list of phrases], "tokens": [list of unique tokens from the documents]}
## The list is sorted in descending order by the size of the tokens list.
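As a rough sketch of the data shapes involved, the helpers below build the input list from plain strings and pick the largest groups out of a result. `to_documents` and `top_topics` are hypothetical names, not part of topicfinder, and `sample_result` is made-up data that only mirrors the documented `{"phrases": [...], "tokens": [...]}` shape (the real token entries may look different).

```python
from typing import Dict, List, Optional


def to_documents(texts: List[str], ids: Optional[List[str]] = None) -> List[Dict[str, str]]:
    """Wrap raw strings in the list-of-dictionaries input structure.

    docID is only added when ids are supplied. Otherwise it is left out and,
    per the description above, the index in the list serves as the id.
    """
    if ids is None:
        return [{"text": text} for text in texts]
    if len(ids) != len(texts):
        raise ValueError("ids and texts must have the same length")
    return [{"text": text, "docID": doc_id} for text, doc_id in zip(texts, ids)]


def top_topics(groups: List[dict], n: int = 5) -> List[dict]:
    """Return the n groups with the most tokens.

    The documented result is already sorted this way; sorting again keeps the
    helper safe on results that were stored and reloaded out of order.
    """
    return sorted(groups, key=lambda g: len(g["tokens"]), reverse=True)[:n]


# Made-up data in the documented shape, for illustration only.
# The token entries are placeholders, not real topicfinder output.
sample_result = [
    {"phrases": ["fever"], "tokens": ["t1"]},
    {"phrases": ["chest pain", "chest pains"], "tokens": ["t2", "t3", "t4"]},
]

docs = to_documents(["patient reports chest pain", "no fever"])
best = top_topics(sample_result, n=1)
print(docs)
print(best[0]["phrases"])
```

The list returned by `to_documents` is what you would pass to getTokenLists or synopsis.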
You can also just call:
from topicfinder import synopsis
ret = synopsis([{"text": "boo", "docID": "1"}])
This will do steps 1-3 in one go. I split it up because with medical data you sometimes need to run each step and store its output separately, whether because of the size of the dataset, because you want a map-reduce setup, or because you need to be extra careful about PHI.
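If you do run the steps separately, one way to store the output of each step is a small caching wrapper like the sketch below. `run_stage` is a hypothetical helper, not part of topicfinder, and it assumes that whatever each step returns can be pickled, which I have not verified for topicfinder's own return values.

```python
import pickle
from pathlib import Path
from typing import Any, Callable


def run_stage(stage: Callable[[Any], Any], data: Any, cache_path: Path) -> Any:
    """Run one pipeline stage, or reload its stored output if it already exists."""
    if cache_path.exists():
        # Only unpickle files you wrote yourself; pickle can execute code on load.
        with cache_path.open("rb") as fh:
            return pickle.load(fh)
    result = stage(data)
    with cache_path.open("wb") as fh:
        pickle.dump(result, fh)
    return result


# Intended use with topicfinder (assumes the package is installed):
#   from topicfinder import getTokenLists, getPhrases, groupPhrasesBySimilarity
#   tokens = run_stage(getTokenLists, docs, Path("tokens.pkl"))
#   phrases = run_stage(getPhrases, tokens, Path("phrases.pkl"))
#   groups = run_stage(groupPhrasesBySimilarity, phrases, Path("groups.pkl"))
```

Keep in mind that the cache files contain whatever the documents contain, so if the text includes PHI they need the same handling as the source data.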
For anyone worrying about PHI -- ain't none of this connected to the internet. It's all local, baby. You can also check the code if you need to.
Future work will include
- being able to apply an annotation normalization function (e.g. rolling up a set of synonyms)
- making some default normalization functions (e.g. querying UMLS). Note that this might need an internet connection to work.
- dealing with acronyms better
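To make the synonym roll-up idea concrete, here is one possible shape for it, purely as an illustration. Nothing below exists in topicfinder today: `make_synonym_normalizer`, `roll_up`, and the synonym table are all hypothetical.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List


def make_synonym_normalizer(synonyms: Dict[str, Iterable[str]]) -> Callable[[str], str]:
    """Build a function that maps any known variant to its canonical phrase."""
    lookup: Dict[str, str] = {}
    for canonical, variants in synonyms.items():
        lookup[canonical.lower()] = canonical
        for variant in variants:
            lookup[variant.lower()] = canonical

    def normalize(phrase: str) -> str:
        # Unknown phrases pass through unchanged.
        return lookup.get(phrase.strip().lower(), phrase)

    return normalize


def roll_up(phrases: Iterable[str], normalize: Callable[[str], str]) -> Dict[str, List[str]]:
    """Group phrases under their normalized form."""
    rolled: Dict[str, List[str]] = defaultdict(list)
    for phrase in phrases:
        rolled[normalize(phrase)].append(phrase)
    return dict(rolled)


# Hypothetical synonym table: canonical phrase -> variants.
normalize = make_synonym_normalizer({"myocardial infarction": ["heart attack", "MI"]})
print(roll_up(["heart attack", "MI", "fever"], normalize))
```

A plain case-insensitive lookup like this would not handle acronyms that are ambiguous in context, which is part of why acronym handling is listed separately.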