Finding topics in a set of documents in an unsupervised fashion.

Topic Finder (unsupervised) works as a three-part process.

To install:

pip3 install topicfinder
  1. tokenizing the document set. This applies the default wildgram analysis to tokenize the document set. Documents must be in the structure below: a list of dictionaries. docID is an optional parameter (if not included, the id is the document's index in the list).
from topicfinder import getTokenLists
ret = getTokenLists([{"text": "boo", "docID": "1"}])
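
Since docID is optional, it can be left out entirely, in which case each document's id defaults to its index in the list:

from topicfinder import getTokenLists

# docID omitted: ids default to 0 and 1 (the list indices)
ret = getTokenLists([{"text": "boo"}, {"text": "boo hiss"}])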
  2. sorting the entire token set by the unique tokens and their frequencies.
from topicfinder import getTokenLists, getPhrases
ret = getPhrases(getTokenLists([{"text": "boo", "docID": "1"}]))
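
getPhrases returns the sorted token set; if you want to see the exact dictionary shape it uses, the simplest thing is to print a few entries (assuming, as the examples above suggest, that the return value is a plain list):

from topicfinder import getTokenLists, getPhrases

phrases = getPhrases(getTokenLists([{"text": "boo boo hiss", "docID": "1"}]))
# peek at the first few entries to see the dictionary shape in your install
print(phrases[:5])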
  3. grouping the phrases by similarity, with some lemmatization and fuzzy matching. Any word or lemmatized word shorter than 3 letters is considered impossible to group. You can also set a similarity threshold (1.0 is an exact match); the default is (for now) 0.75. If a token has a custom tokenType generated by wildgram (i.e. anything other than "token" or "noise"), tokens are grouped on that type. For example, with default settings synopsis automatically handles grouping numbers together, as well as negations.
from topicfinder import getTokenLists, getPhrases, groupPhrasesBySimilarity
ret = groupPhrasesBySimilarity(getPhrases(getTokenLists([{"text": "boo", "docID": "1"}])))

## or with a custom similarity threshold:
ret = groupPhrasesBySimilarity(getPhrases(getTokenLists([{"text": "boo", "docID": "1"}])), threshold=0.8)

## ret is a sorted list of dictionaries of the following style:
## {"phrases": [#list of phrases#], "tokens": [#list of unique tokens from the documents#]}
## sorted in descending order by the size of the tokens array
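
Because the list is already sorted, the biggest groups come first. For example, to print each group's size alongside its phrases:

from topicfinder import getTokenLists, getPhrases, groupPhrasesBySimilarity

docs = [{"text": "boo", "docID": "1"}, {"text": "boos and hisses", "docID": "2"}]
groups = groupPhrasesBySimilarity(getPhrases(getTokenLists(docs)))
for group in groups:
    # "tokens" and "phrases" are the keys described above
    print(len(group["tokens"]), group["phrases"])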

You can also just call:

from topicfinder import synopsis
ret = synopsis([{"text": "boo", "docID": "1"}], threshold=0.8)

This will do steps 1-3 in one go. I split the steps up because with medical data, sometimes you need to run a step and store the data off given the size of the dataset, do a map-reduce situation, or be extra careful about PHI.
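
For example, here is a minimal sketch of running the steps separately and storing the intermediate token lists off to disk between them (the JSON round-trip is my assumption; it only works if the intermediate structures are JSON-serializable):

import json
from topicfinder import getTokenLists, getPhrases, groupPhrasesBySimilarity

# step 1: tokenize, then store the intermediate result off
tokens = getTokenLists([{"text": "boo", "docID": "1"}])
with open("tokens.json", "w") as f:
    json.dump(tokens, f)

# later, possibly on another worker: reload and run steps 2-3
with open("tokens.json") as f:
    tokens = json.load(f)
ret = groupPhrasesBySimilarity(getPhrases(tokens), threshold=0.8)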

For anyone worrying about PHI -- ain't none of this connected to the internet. It's all local, baby. You can also check the code if you need to.

Topic Finder can also apply known categories through the topicfinder function.

from topicfinder import topicfinder
ret = topicfinder({"text": "boo", "docID": "1"}, [{"unit": "TEST", "value": "BOO", "token": "boo", "frequency": 10}], threshold=0.8, normalizationCutoff=2)

topicfinder accepts a single doc, in the form of a dictionary where the text parameter is what gets analyzed. It also accepts a list of topics as the second parameter, where each topic is a dictionary with the following keys:

  * unit - the unit in reference (e.g. UMLS, ICD-10, Custom)
  * value - the value (e.g. the code)
  * token - the example phrase that needs to match
  * frequency - either arbitrary or real, based on past data. If arbitrary, topics with a higher frequency get checked first.
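
For example, a small hand-built topic list (the units and codes here are made up for illustration):

from topicfinder import topicfinder

topics = [
    {"unit": "ICD-10", "value": "R07.9", "token": "chest pain", "frequency": 100},
    {"unit": "Custom", "value": "FEVER", "token": "fever", "frequency": 50},
]
ret = topicfinder({"text": "fever and chest pain", "docID": "1"}, topics, threshold=0.8)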

It works like this:

  1. for each token created by wildgram,
  2. for every topic, find the topic with the highest frequency whose token example is similar to the token snippet above the threshold, and assign that topic. The normalization cutoff is a simple lemmatization measure that shortens the longer comparison string if its length is less than the cutoff times the length of the shorter string.
  3. keep going until all tokens are assigned.

Tokens that have custom tokenTypes or are noise (e.g. "negation", "noise") are not checked.

It returns a list of tokens like the one generated by wildgram, but each token dictionary is assigned a topic key whose value is the matched topic (if matched) or an empty dictionary (if not matched).
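
In other words, the matching loop behaves roughly like the sketch below. This is not the actual implementation: the token dictionary keys are assumptions based on the description above, and the difflib ratio stands in for the package's internal fuzzy match and normalization cutoff.

from difflib import SequenceMatcher

def similarity(a, b):
    # stand-in for topicfinder's internal fuzzy match (an assumption)
    return SequenceMatcher(None, a, b).ratio()

def assign_topics(tokens, topics, threshold=0.75):
    # topics with a higher frequency get checked first
    topics = sorted(topics, key=lambda t: t["frequency"], reverse=True)
    for tok in tokens:
        tok["topic"] = {}
        # skip noise, negations, and other custom tokenTypes
        if tok.get("tokenType") != "token":
            continue
        for topic in topics:
            if similarity(tok["token"], topic["token"]) >= threshold:
                tok["topic"] = topic  # first hit is the highest-frequency match
                break
    return tokens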

Future work will include:

  1. being able to apply an annotation normalization function (e.g. a set of synonyms rolling up)
  2. making some default normalization functions (e.g. querying UMLS, etc.) -- note that this might need to be connected to the internet to work.
  3. dealing with acronyms and short words better
  4. dealing with numbers and time durations better
