Finding topics in a set of documents in an unsupervised fashion.
To install:
pip3 install topicfinder
Topic Finder works in a three-part process:
- Tokenizing the document set. This applies the default wildgram analysis to tokenize each document. Documents must be passed in the structure below: a list of dictionaries. docID is optional (if it is not included, the ID is the document's index in the list).
from topicfinder import getTokenLists
ret = getTokenLists([{"text": "boo", "docID": "1"}])
- Sorting the entire token set by unique token and frequency.
from topicfinder import getTokenLists, getPhrases
ret = getPhrases(getTokenLists([{"text": "boo", "docID": "1"}]))
- Grouping the phrases by similarity, with some lemmatization and fuzzy matching. Any word or lemmatized word shorter than 3 letters is treated as impossible to group.
from topicfinder import getTokenLists, getPhrases, groupPhrasesBySimilarity
ret = groupPhrasesBySimilarity(getPhrases(getTokenLists([{"text": "boo", "docID": "1"}])))
## ret is a list of dictionaries of the following form:
## {"phrases": [list of phrases], "tokens": [list of unique tokens from the documents]}
## sorted in descending order by the size of the tokens array
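To give a feel for what "grouping by similarity" can mean, here is a minimal sketch in plain Python. It does not call topicfinder and is not its implementation: it swaps in the standard library's `difflib.SequenceMatcher` for the fuzzy matching, skips lemmatization entirely, and uses a similarity threshold of 0.8 that is an assumption chosen only for illustration. The names `similar` and `group_by_similarity` are hypothetical. The one rule taken from the text above is that words shorter than 3 letters are never grouped.

```python
from difflib import SequenceMatcher

MIN_LENGTH = 3   # words shorter than this are never grouped (rule from the text)
THRESHOLD = 0.8  # assumed similarity cut-off, for illustration only


def similar(left, right):
    """Return True when two phrases are close enough to share a group."""
    return SequenceMatcher(None, left, right).ratio() >= THRESHOLD


def group_by_similarity(phrases):
    """Greedily place each phrase in the first group whose first member is similar."""
    groups = []
    for phrase in phrases:
        if len(phrase) >= MIN_LENGTH:
            for group in groups:
                # Short words sit alone in their group and are never compared.
                if len(group[0]) >= MIN_LENGTH and similar(phrase, group[0]):
                    group.append(phrase)
                    break
            else:
                # No similar group was found, so start a new one.
                groups.append([phrase])
        else:
            groups.append([phrase])
    # Largest groups first; this ordering is a choice made for the sketch.
    return sorted(groups, key=len, reverse=True)


groups = group_by_similarity(["cough", "of", "fever", "coughs"])
```

Here "cough" and "coughs" end up in one group, while "fever" and the two-letter "of" each stay on their own. A real grouping step would also compare lemmatized forms, which this sketch leaves out.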
You can also just call:
from topicfinder import synopsis
ret = synopsis([{"text": "boo", "docID": "1"}])
This will do steps 1-3 in one go. I split it up because with medical data you sometimes need to run a step and store the intermediate results (given the size of the dataset), run things as a map-reduce job, or be extra careful about PHI.
For anyone worrying about PHI -- ain't none of this connected to the internet. It's all local, baby. You can also check the code if you need to.
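The map-reduce situation mentioned above could look roughly like the sketch below: process the documents in batches, keep a partial frequency count per batch (which could be stored off between runs), and merge the partial counts at the end. This is an illustration under stated assumptions, not topicfinder code. The helpers `batches` and `count_tokens` are hypothetical, the sample documents are made up, and plain whitespace splitting stands in for the real wildgram tokenization.

```python
from collections import Counter
from functools import reduce


def batches(documents, size):
    """Yield consecutive slices of the document list, each at most `size` long."""
    for start in range(0, len(documents), size):
        yield documents[start:start + size]


def count_tokens(batch):
    """Map step: count tokens in one batch (whitespace split as a stand-in tokenizer)."""
    counts = Counter()
    for doc in batch:
        counts.update(doc["text"].lower().split())
    return counts


documents = [
    {"text": "chest pain", "docID": "1"},
    {"text": "chest tightness", "docID": "2"},
    {"text": "back pain", "docID": "3"},
]

# Map: one partial count per batch. Each partial result could be saved to disk here.
partial_counts = [count_tokens(batch) for batch in batches(documents, size=2)]

# Reduce: merge the partial counts into overall token frequencies.
total_counts = reduce(lambda left, right: left + right, partial_counts, Counter())
```

Because the partial counts are merged by simple addition, the batches can be processed in any order, or on different machines, without changing the totals.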
Future work will include:
- being able to apply an annotation normalization function (e.g. a set of synonyms rolling up)
- making some default normalization functions (e.g. querying UMLS); note that these might need an internet connection to work
- dealing with acronyms and short words better
- dealing with numbers and time durations better
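As a rough picture of the first item, a synonym roll-up normalization function might look like the sketch below. None of this exists in topicfinder today: the `SYNONYMS` table, `normalize`, and `roll_up` are hypothetical names, and the synonym entries are made-up examples.

```python
# Hypothetical synonym table: each variant maps to the label it rolls up to.
SYNONYMS = {
    "heart attack": "myocardial infarction",
    "mi": "myocardial infarction",
}


def normalize(phrase, synonyms=SYNONYMS):
    """Lower-case and trim a phrase, then replace it with its roll-up label if one exists."""
    key = phrase.strip().lower()
    return synonyms.get(key, key)


def roll_up(phrases):
    """Group the original phrases under their normalized label."""
    rolled = {}
    for phrase in phrases:
        rolled.setdefault(normalize(phrase), []).append(phrase)
    return rolled


rolled = roll_up(["Heart attack", "MI", "fever"])
```

In this example "Heart attack" and "MI" both land under "myocardial infarction", and "fever" keeps its own label. A function like this runs entirely locally; a UMLS-backed version would be the one that might need a connection.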