
Finding topics in a set of documents in an unsupervised fashion.

Project description

Topic Finder (unsupervised) works in a three-part process.

To install:

pip3 install topicfinder
  1. Tokenizing the document set. This applies the default wildgram analysis to tokenize the document set. Documents must be passed in the structure below: a list of dictionaries. docID is optional as a parameter (if not included, the ID is the document's index in the list).
from topicfinder import getTokenLists
ret = getTokenLists([{"text": "boo", "docID": "1"}])
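## for example, docID can be omitted -- the document's index in the list is then used as its ID
## (the example text here is made up)
ret = getTokenLists([
    {"text": "patient denies fever", "docID": "note-1"},
    {"text": "patient reports chills"},
])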
  2. Sorting the entire token set by the unique tokens and their frequencies.
from topicfinder import getTokenLists, getPhrases
ret = getPhrases(getTokenLists([{"text": "boo", "docID": "1"}]))
  3. Grouping the phrases by similarity, with some lemmatization and fuzzy matching. Any word or lemmatized word shorter than 3 letters is considered impossible to group. You can also set a similarity threshold (1.0 is an exact match); the default is (for now) 0.75. If a token has a custom tokenType generated by wildgram (i.e. anything other than "token" or "noise"), tokens are grouped on that type instead. For example, with default settings, synopsis automatically groups numbers together and handles negations.
from topicfinder import getTokenLists, getPhrases, groupPhrasesBySimilarity
ret = groupPhrasesBySimilarity(getPhrases(getTokenLists([{"text": "boo", "docID": "1"}])))

ret = groupPhrasesBySimilarity(getPhrases(getTokenLists([{"text": "boo", "docID": "1"}])), threshold=0.8)

## ret is a list of dictionaries of the following form,
## sorted in descending order by the size of the tokens array:
## {"phrases": [list of phrases], "tokens": [list of unique tokens from the documents]}
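
## for example, to print the largest groups first (iterating the structure described above,
## continuing from the call above):
for group in ret:
    print(len(group["tokens"]), group["phrases"])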

You can also just call:

from topicfinder import synopsis
ret = synopsis([{"text": "boo", "docID": "1"}], threshold=0.8)

This runs steps 1-3 in one go. The steps are split up because, with medical data, you sometimes need to store off intermediate results given the size of the dataset, set up a map-reduce situation, or be extra careful about PHI.
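
As an illustration, here is a minimal sketch of running the steps separately and storing the intermediate output between them -- it assumes the intermediate results are plain JSON-serializable lists, which you should verify against your own data:

import json
from topicfinder import getTokenLists, getPhrases, groupPhrasesBySimilarity

docs = [{"text": "boo", "docID": "1"}]

## step 1: tokenize, then persist the intermediate result
tokens = getTokenLists(docs)
with open("tokens.json", "w") as f:
    json.dump(tokens, f)

## later (possibly in a separate job): reload and run steps 2 and 3
with open("tokens.json") as f:
    tokens = json.load(f)
ret = groupPhrasesBySimilarity(getPhrases(tokens), threshold=0.8)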

For anyone worried about PHI -- none of this is connected to the internet. It's all local, baby. You can also check the code if you need to.

Topic Finder can also apply known categories through the topicfinder function.

from topicfinder import topicfinder
ret = topicfinder({"text": "boo", "docID": "1"}, [{"unit": "TEST", "value": "BOO", "token": "boo", "frequency": 10}], threshold=0.8, normalizationCutoff=2)

topicfinder accepts a single doc, in the form of a dictionary where the text parameter is the text to be analyzed. The second parameter is a list of topics, where each topic is a dictionary with the following keys:

  • unit - the unit in reference (e.g. UMLS, ICD-10, Custom)
  • value - the value (e.g. the code)
  • token - the example phrase that needs to match
  • frequency - either arbitrary or real, based on past data. If arbitrary, a higher frequency will get checked first.
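
For example, a topic list with two competing entries might look like this (the codes and example text are made up); because the first entry has the higher frequency, it gets checked first:

from topicfinder import topicfinder
topics = [
    {"unit": "ICD-10", "value": "R50.9", "token": "fever", "frequency": 50},
    {"unit": "Custom", "value": "FEBRILE", "token": "febrile", "frequency": 5},
]
ret = topicfinder({"text": "pt febrile overnight", "docID": "1"}, topics, threshold=0.8, normalizationCutoff=2)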

It works like this:

  1. For each token created by wildgram,
  2. for every topic, find the topic with the highest frequency whose token example is similar to the token snippet above the threshold, and assign that topic. The normalization cutoff is a simple lemmatization measure that shortens the longer comparison string if its length is less than the cutoff times the length of the shorter string (see the sketch after this list).
  3. Keep going until all tokens are assigned.
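
A rough sketch of how the normalization cutoff might behave, assuming "shortens" means truncating the longer string to the length of the shorter one (the library's actual implementation may differ):

## illustrative only -- not the library's implementation
def apply_cutoff(a, b, cutoff=2):
    shorter, longer = sorted([a, b], key=len)
    if len(longer) < cutoff * len(shorter):
        ## crude lemmatization: drop trailing characters so suffixes
        ## (e.g. "fevers" vs "fever") don't hurt the similarity comparison
        longer = longer[:len(shorter)]
    return shorter, longer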

Tokens that have custom tokenTypes or are noise (e.g. "negation", "noise") are not checked.

It returns a list of tokens like the one generated by wildgram, but each token dictionary gets a topic key set to the assigned topic (if matched) or an empty dictionary (if not matched).
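
For example, to pull out only the tokens that matched a topic (a minimal usage sketch; the topic list is the same one used above):

from topicfinder import topicfinder
topics = [{"unit": "TEST", "value": "BOO", "token": "boo", "frequency": 10}]
ret = topicfinder({"text": "boo", "docID": "1"}, topics, threshold=0.8)
for token in ret:
    if token["topic"]:  ## an empty dictionary means no topic was matched
        print(token["topic"]["unit"], token["topic"]["value"])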

Future work will include:

  1. being able to apply an annotation normalization function (e.g. a set of synonyms rolling up)
  2. making some default normalization functions (e.g. querying UMLS, etc.) -- note that this might need to be connected to the internet to work.
  3. dealing with acronyms and short words better
  4. dealing with numbers and time durations better
