Documentation is available at https://amyolex.github.io/medtop/.

MedTop

Extracting topics from reflective medical writings.

Requirements

MedTop is only compatible with 64-bit Python. You can check whether the Python interpreter in your virtual environment is a 64-bit build with the following code.

import platform; platform.architecture()[0];
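The same check can be wrapped in a small script. This is a minimal sketch, not part of MedTop: the helper name is_64bit_python is hypothetical, and the check relies on sys.maxsize, which exceeds 2**32 only on 64-bit builds.

```python
import platform
import sys


def is_64bit_python() -> bool:
    """Return True when the running interpreter is a 64-bit build."""
    # sys.maxsize is 2**63 - 1 on 64-bit builds and 2**31 - 1 on 32-bit builds.
    return sys.maxsize > 2**32


if __name__ == "__main__":
    # platform.architecture()[0] is a string such as "64bit" or "32bit".
    print("Interpreter architecture:", platform.architecture()[0])
    print("64-bit build:", is_64bit_python())
```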

Install MedTop with pip:

pip install medtop

How to use

A template pipeline using a test dataset is provided below. You can read more about the test_data dataset here.

Each step of the pipeline has configuration options for experimenting with various methods. These are detailed in the documentation for each method. Notably, the import functions (import_from_files and import_from_csv), get_cluster_topics, visualize_clustering, and evaluate all include the option to save results to a file.

Example Pipeline

Import data

Import and pre-process documents from a text file containing a list of all documents.

from medtop.core import *
data, doc_df = import_from_files('test_data/corpus_file_list.txt', stop_words_file='stop_words.txt', save_results = False)
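If your documents live in a single directory, a file list can be generated rather than written by hand. The sketch below is not part of MedTop: write_file_list is a hypothetical helper, and it assumes the list file holds one document path per line, which this README does not spell out. Check the documentation for the exact format that import_from_files expects.

```python
from pathlib import Path


def write_file_list(doc_dir, list_path, pattern="*.txt"):
    """Write the paths of files in doc_dir matching pattern to list_path, one per line."""
    list_path = Path(list_path)
    resolved_list = list_path.resolve()
    # Sort for a reproducible order and skip the list file itself if it sits in doc_dir.
    paths = sorted(
        p for p in Path(doc_dir).glob(pattern)
        if p.is_file() and p.resolve() != resolved_list
    )
    list_path.write_text("".join(f"{p}\n" for p in paths), encoding="utf-8")
    return paths
```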

You can also consolidate your documents into a single, pipe-delimited CSV file with the columns "doc_name" and "text".

data, doc_df = import_from_csv('test_data/corpus.txt', stop_words_file='stop_words.txt', save_results = False)
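A pipe-delimited file with these two columns can be produced with Python's standard csv module. This is a minimal sketch under stated assumptions: write_corpus_csv is a hypothetical helper, and writing a header row and collapsing whitespace inside each document are choices made here, not requirements confirmed by MedTop.

```python
import csv


def write_corpus_csv(docs, out_path):
    """Write (doc_name, text) pairs to a pipe-delimited file with a header row."""
    with open(out_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle, delimiter="|")
        writer.writerow(["doc_name", "text"])
        for doc_name, text in docs:
            # Collapse runs of whitespace (including newlines) so each document stays on one row.
            writer.writerow([doc_name, " ".join(text.split())])
```

Text that itself contains a pipe character is quoted by csv.writer, so it reads back intact with a reader that uses the same delimiter.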

Transform data

Create word vectors from the most expressive phrase in each sentence of the imported documents. Seed documents can be passed as a single CSV similar to corpus documents in the import step.

NOTE: If doc_df is NOT passed to create_tfidf, you must set include_input_in_tfidf=False in get_phrases.

tfidf, dictionary = create_tfidf(doc_df, path_to_seed_topics_file_list='test_data/seed_topics_file_list.txt')
data = get_phrases(data, dictionary.token2id, tfidf, include_input_in_tfidf = True, include_sentiment=True)
data = get_vectors("tfidf", data, dictionary = dictionary, tfidf = tfidf)
Removed 67 sentences without phrases.
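To give a feel for what choosing an expressive phrase can mean, the sketch below scores every contiguous window of tokens in a sentence by its summed TF-IDF weight and keeps the best one. It is an illustration only and makes several assumptions: the helper names idf_weights and best_window are hypothetical, the scoring rule is a guess at the general idea rather than MedTop's actual get_phrases logic, and a negative window_size is treated as "use every token".

```python
import math
from collections import Counter


def idf_weights(tokenized_docs):
    """Inverse document frequency for every token in a list of tokenized documents."""
    n_docs = len(tokenized_docs)
    doc_freq = Counter(token for doc in tokenized_docs for token in set(doc))
    return {token: math.log(n_docs / df) for token, df in doc_freq.items()}


def best_window(sentence, doc_tokens, idf, window_size=3):
    """Return the contiguous window of sentence tokens with the highest total TF-IDF."""
    if window_size < 0 or window_size >= len(sentence):
        return list(sentence)  # keep every token
    tf = Counter(doc_tokens)
    # Term frequency within the document times inverse document frequency.
    weights = [tf[t] / len(doc_tokens) * idf.get(t, 0.0) for t in sentence]
    scores = [
        sum(weights[i:i + window_size])
        for i in range(len(sentence) - window_size + 1)
    ]
    start = max(range(len(scores)), key=scores.__getitem__)
    return list(sentence[start:start + window_size])


sentence = ["the", "patient", "reported", "severe", "chest", "pain", "today"]
docs = [
    sentence + ["chest", "pain", "worsened"],
    ["the", "clinic", "was", "busy", "today"],
    ["the", "nurse", "was", "kind", "today"],
]
idf = idf_weights(docs)
phrase = best_window(sentence, docs[0], idf, window_size=3)
```

Tokens that occur in every document, such as "the" and "today" here, get an IDF of zero, so windows dominated by them score low.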

Cluster data

Cluster the sentences into groups expressing similar ideas or topics. If you aren't sure how many true clusters exist in the data, try running assign_clusters with the optional parameter show_chart = True to visualize cluster quality with varying numbers of clusters. When using method='hac', you can also use show_dendrogram = True to see the cluster dendrogram.

data = assign_clusters(data, method = "hac")
cluster_df = get_cluster_topics(data, doc_df, save_results = False)
visualize_clustering(data, method = "umap", show_chart = False)
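For readers unfamiliar with hierarchical agglomerative clustering (HAC), the sketch below shows the basic idea in plain Python: start with every point in its own cluster and repeatedly merge the two closest clusters until the requested number remains. It is not MedTop's implementation. The function name average_linkage_hac is hypothetical, and average linkage with Euclidean distance is an assumption made for the example.

```python
import math


def average_linkage_hac(points, n_clusters):
    """Merge the two closest clusters (average linkage) until n_clusters remain."""
    clusters = [[i] for i in range(len(points))]

    def linkage(a, b):
        # Mean pairwise Euclidean distance between the members of two clusters.
        total = sum(math.dist(points[i], points[j]) for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > n_clusters:
        pairs = [
            (x, y)
            for x in range(len(clusters))
            for y in range(x + 1, len(clusters))
        ]
        x, y = min(pairs, key=lambda p: linkage(clusters[p[0]], clusters[p[1]]))
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]

    # Convert the cluster membership lists into one label per point.
    labels = [0] * len(points)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels = average_linkage_hac(points, n_clusters=2)
```

The order of merges is what a dendrogram displays, which is why a dendrogram can help when choosing the number of clusters.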

Evaluate results

gold_file = "test_data/gold.txt"
results_df = evaluate(data, gold_file=gold_file, save_results = False)

Document Clustering

IMPORTANT: This feature is still in alpha, meaning that we have adapted the pipeline to accommodate the clustering of documents, but have made no rigorous efforts to ensure that it works well.

To cluster documents, import the data and create the TF-IDF as above, but extract phrases, create the vectors, and cluster using the doc_df dataframe. Passing the parameter window_size=-1 to get_phrases tells the method to use all tokens instead of selecting a subset of length window_size.

doc_df = get_phrases(doc_df, dictionary.token2id, tfidf, include_input_in_tfidf = True, window_size=-1)
doc_df = get_vectors("tfidf", doc_df, dictionary = dictionary, tfidf = tfidf)
doc_df = assign_clusters(doc_df, method = "kmeans", k=4)
cluster_df = get_cluster_topics(data, doc_df, save_results = False)
visualize_clustering(data, method = "svd", show_chart = False)
Removed 0 sentences without phrases.
