https://cctrbic.github.io/medtop/

Project description

Documentation is available at https://amyolex.github.io/medtop/.

MedTop

Extracting topics from reflective medical writings.

Requirements

pip install medtop

python -m nltk.downloader all

How to use

A template pipeline is provided below using a test dataset. You can read more about the test_data dataset here

Each step of the pipeline has configuration options for experimenting with various methods. These are detailed in the documentation for each method. Notably, the import_docs, get_cluster_topics, visualize_clustering, and evaluate methods all include the option to save results to a file.

Example Pipeline

Import data

Import and pre-process documents from a text file containing a list of all documents.

from medtop.core import *
data, doc_df = import_docs('test_data/corpus_file_list.txt', save_results = True)
Results saved to output/DocumentSentenceList.txt
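
To make the import step more concrete, here is a minimal standard-library sketch of reading a file list and splitting each document into sentences. It is not medtop's implementation: the helper names (read_file_list, split_sentences, load_sentences), the one-path-per-line list format, and the naive regex splitter are all assumptions made for illustration, and import_docs may pre-process documents differently.

```python
import re
import tempfile
from pathlib import Path


def read_file_list(list_path):
    """Return the document paths named in list_path, one per non-empty line.

    Assumption: relative paths are resolved against the list file's folder.
    """
    base = Path(list_path).parent
    lines = Path(list_path).read_text(encoding="utf-8").splitlines()
    return [base / line.strip() for line in lines if line.strip()]


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def load_sentences(list_path):
    """Map each document's file name to its list of sentences."""
    return {p.name: split_sentences(p.read_text(encoding="utf-8"))
            for p in read_file_list(list_path)}


# Tiny demonstration with throwaway files (hypothetical content).
with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    (tmp / "doc1.txt").write_text("I met a patient. She was anxious!", encoding="utf-8")
    (tmp / "doc2.txt").write_text("Rounds ran late today.", encoding="utf-8")
    (tmp / "file_list.txt").write_text("doc1.txt\n\ndoc2.txt\n", encoding="utf-8")
    sentences = load_sentences(tmp / "file_list.txt")

print(sentences)
```

A real pipeline would normally use a proper sentence tokenizer rather than a regex, since abbreviations such as "Dr." break this splitter.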

Transform data

Create word vectors from the most expressive phrase in each sentence of the imported documents.

NOTE: If doc_df is NOT passed to create_tfidf, you must set include_input_in_tfidf=False in get_phrases.

tfidf, dictionary = create_tfidf(doc_df, 'test_data/seed_topics_file_list.txt')
data = get_phrases(data, dictionary.token2id, tfidf, include_input_in_tfidf = True)
data = get_vectors("tfidf", data, dictionary = dictionary, tfidf = tfidf)
Removed 43 sentences without phrases.
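
For readers unfamiliar with TF-IDF, the sketch below shows the general idea on pre-tokenized toy documents using only the standard library. It assumes the plain tf * log(N / df) weighting; the function name tfidf_vectors and the sample tokens are hypothetical, and the weighting and normalization that create_tfidf actually applies may differ.

```python
import math
from collections import Counter


def tfidf_vectors(tokenized_docs):
    """Return (vocabulary, vectors) using tf * log(N / df) weighting."""
    n_docs = len(tokenized_docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in tokenized_docs for term in set(doc))
    vocab = sorted(df)
    vectors = []
    for doc in tokenized_docs:
        counts = Counter(doc)
        total = len(doc) or 1  # guard against empty documents
        vectors.append(
            [(counts[t] / total) * math.log(n_docs / df[t]) for t in vocab]
        )
    return vocab, vectors


# Hypothetical toy corpus: each document is already a list of tokens.
docs = [
    ["patient", "felt", "anxious"],
    ["patient", "family", "anxious"],
    ["long", "night", "shift"],
]
vocab, vectors = tfidf_vectors(docs)

for doc, vec in zip(docs, vectors):
    print(doc, [round(w, 3) for w in vec])
```

Terms that appear in fewer documents receive larger weights, which is why a rare word such as "felt" outweighs "patient" in the first toy document.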

Cluster data

Cluster the sentences into groups expressing similar ideas or topics. If you aren't sure how many true clusters exist in the data, try running assign_clusters with the optional parameter show_chart = True to visualize cluster quality for varying numbers of clusters. When using method='hac', you can also set show_dendrogram = True to see the cluster dendrogram.

data = assign_clusters(data, method = "kmeans", k=4)
cluster_df = get_cluster_topics(data, doc_df, save_results = True)
visualize_clustering(data, method = "svd", show_chart = False)
Results saved to output/TopicClusterResults.txt
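
The idea of comparing cluster quality across several values of k can be illustrated with a small, self-contained k-means sketch. This is not what assign_clusters does internally: the quality measure here is the within-cluster sum of squared distances (the "elbow" heuristic), the farthest-first initialization is chosen only to keep the example deterministic, and the function names and toy points are hypothetical. The chart produced by show_chart may use a different measure.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def farthest_first(points, k):
    """Deterministic init: start at points[0], then repeatedly add the
    point farthest from all centroids chosen so far."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(sq_dist(p, c) for c in centroids))
        )
    return centroids


def assign(points, centroids):
    """Label each point with the index of its nearest centroid."""
    return [
        min(range(len(centroids)), key=lambda j: sq_dist(p, centroids[j]))
        for p in points
    ]


def kmeans(points, k, n_iter=20):
    """Run Lloyd's algorithm; return (labels, within-cluster sum of squares)."""
    centroids = farthest_first(points, k)
    for _ in range(n_iter):
        labels = assign(points, centroids)
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = tuple(sum(d) / len(members) for d in zip(*members))
    labels = assign(points, centroids)
    inertia = sum(sq_dist(p, centroids[lab]) for p, lab in zip(points, labels))
    return labels, inertia


# Hypothetical 2-D data: three well-separated groups of three points.
points = [
    (0, 0), (0, 1), (1, 0),
    (10, 10), (10, 11), (11, 10),
    (0, 10), (1, 10), (0, 11),
]
inertia_by_k = {k: kmeans(points, k)[1] for k in range(1, 5)}
labels_k3, _ = kmeans(points, 3)

for k, inertia in inertia_by_k.items():
    print(k, round(inertia, 2))
```

On data like this, the sum of squares falls sharply until k reaches the true number of groups and then flattens, which is the pattern to look for when choosing k.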

Evaluate results

gold_file = "test_data/gold.txt"
results_df = evaluate(data, gold_file = gold_file, save_results = False)

Download files

Source Distribution: medtop-0.0.2.tar.gz (18.7 kB)

Built Distribution: medtop-0.0.2-py3-none-any.whl (17.4 kB, Python 3)
