
Finding topics in a set of documents in an unsupervised fashion.

Project description

To install:

pip3 install topicfinder

Topic Finder works in a three-part process:
  1. Tokenizing the document set. This applies the default wildgram analysis to each document. Documents must be in the structure below, a list of dictionaries. docID is optional (if it is not included, the ID is the document's index in the list).
from topicfinder import getTokenLists
ret = getTokenLists([{"text": "boo", "docID": "1"}])
  2. Sorting the entire token set by the unique tokens and their frequencies.
from topicfinder import getTokenLists, getPhrases
ret = getPhrases(getTokenLists([{"text": "boo", "docID": "1"}]))
  3. Grouping the phrases by similarity, with some lemmatization and fuzzy matching. Any word or lemmatized word shorter than 3 letters is considered impossible to group.
from topicfinder import getTokenLists, getPhrases, groupPhrasesBySimilarity
ret = groupPhrasesBySimilarity(getPhrases(getTokenLists([{"text": "boo", "docID": "1"}])))

# ret is a list of dictionaries of the following form:
# {"phrases": [<list of phrases>], "tokens": [<list of unique tokens from the documents>]}
# sorted in descending order by the size of the "tokens" list
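
To make the three steps concrete, here is a minimal standard-library sketch of the same idea. It is not topicfinder's implementation: whitespace splitting stands in for the wildgram analysis, `difflib` stands in for the fuzzy matching, there is no lemmatization, and the function names and the 0.8 threshold are assumptions made up for illustration.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def token_lists(docs):
    """Step 1 stand-in: lowercase whitespace split; docID defaults to the list index."""
    return [
        {"docID": doc.get("docID", i), "tokens": doc["text"].lower().split()}
        for i, doc in enumerate(docs)
    ]


def phrases_by_frequency(lists):
    """Step 2 stand-in: unique tokens with the docIDs they occur in, most frequent first."""
    occurrences = defaultdict(list)
    for entry in lists:
        for tok in entry["tokens"]:
            occurrences[tok].append(entry["docID"])
    return sorted(occurrences.items(), key=lambda kv: len(kv[1]), reverse=True)


def group_by_similarity(phrases, threshold=0.8, min_len=3):
    """Step 3 stand-in: greedy grouping against the first phrase of each group."""
    groups = []
    for phrase, doc_ids in phrases:
        target = None
        if len(phrase) >= min_len:  # shorter words are never merged into a group
            for group in groups:
                head = group["phrases"][0]
                similar = SequenceMatcher(None, phrase, head).ratio() >= threshold
                if len(head) >= min_len and similar:
                    target = group
                    break
        if target is None:
            target = {"phrases": [], "tokens": []}
            groups.append(target)
        target["phrases"].append(phrase)
        target["tokens"].extend((doc_id, phrase) for doc_id in doc_ids)
    # Largest groups first, mirroring the ordering described above.
    return sorted(groups, key=lambda g: len(g["tokens"]), reverse=True)


docs = [{"text": "fever and cough", "docID": "a"}, {"text": "fevers cough no"}]
groups = group_by_similarity(phrases_by_frequency(token_lists(docs)))
```

In this sketch "fever" and "fevers" end up in one group, while the two-letter word "no" stays on its own. The real package may store different data in "tokens"; the tuples used here are only a placeholder.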

You can also just call:

from topicfinder import synopsis
ret = synopsis([{"text": "boo", "docID": "1"}])

This will do steps 1-3 in one go. I split it up because with medical data you sometimes need to run one step and store the results before moving on (given the size of the dataset), run it as a map-reduce job, or be extra careful about PHI.
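
One way to run a step and store its results off is to process the documents in batches and write each batch's output to disk. The sketch below assumes that the step's return value is JSON-serializable, which has not been checked for topicfinder's functions; `batches`, `run_in_batches`, and the file naming are hypothetical helpers, not part of the package.

```python
import json
from pathlib import Path


def batches(docs, size):
    """Yield consecutive slices of at most `size` documents."""
    for start in range(0, len(docs), size):
        yield docs[start:start + size]


def run_in_batches(docs, step, out_dir, size=1000):
    """Apply `step` to each batch and write each result to its own JSON file."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for n, batch in enumerate(batches(docs, size)):
        path = out_dir / f"batch_{n:05d}.json"
        # Assumption: step(batch) returns something json.dumps can handle.
        path.write_text(json.dumps(step(batch)))
        paths.append(path)
    return paths
```

If this is used with `getTokenLists` as the `step`, give every document an explicit docID first. Otherwise the default ID (the index in the list) restarts at 0 in every batch.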

For anyone worrying about PHI -- ain't none of this connected to the internet. It's all local, baby. You can also check the code if you need to.

Future work will include

  1. being able to apply an annotation normalization function (e.g. a set of synonyms rolling up)
  2. making some default normalization functions (e.g. querying UMLS, etc.); note that this might need to be connected to the internet to work.
  3. dealing with acronyms and short words better
  4. dealing with numbers and time durations better
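
As a rough illustration of the first item, a synonym roll-up could be a plain lookup from each variant to a canonical term. This is only a sketch of the idea: topicfinder does not currently accept such a function, and `make_synonym_normalizer` and the example synonyms are hypothetical.

```python
def make_synonym_normalizer(synonyms):
    """Return a function mapping each known variant to its canonical term."""
    lookup = {
        variant.lower(): canonical
        for canonical, variants in synonyms.items()
        for variant in variants
    }

    def normalize(phrase):
        # Unknown phrases pass through unchanged.
        return lookup.get(phrase.lower(), phrase)

    return normalize


# Illustrative synonym set, not taken from any real vocabulary.
normalize = make_synonym_normalizer({"myocardial infarction": ["heart attack", "MI"]})
```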


