Documentation is available at https://amyolex.github.io/medtop/.
Extracting topics from reflective medical writings.
MedTop is only compatible with 64-bit Python. You can check whether the Python interpreter in your virtual environment is a 64-bit build with the following code.
import platform; platform.architecture();
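The same check can be made slightly more explicit. The following is a minimal sketch that uses only the standard library; the helper name is_64bit_python is illustrative and is not part of MedTop.

```python
import platform
import struct
import sys


def is_64bit_python() -> bool:
    """Return True when the running interpreter is a 64-bit build."""
    # A pointer ("P") is 8 bytes wide on a 64-bit build and 4 bytes on a 32-bit one.
    return struct.calcsize("P") * 8 == 64


if __name__ == "__main__":
    # platform.architecture() returns a tuple whose first item is e.g. '64bit'.
    print(platform.architecture()[0])
    print("64-bit interpreter:", is_64bit_python())
    print("sys.maxsize exceeds 2**32:", sys.maxsize > 2**32)
```

If the function returns False, install a 64-bit Python build before installing MedTop.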
pip install medtop
How to use
A template pipeline is provided below using a test dataset. You can read more about the test_data dataset here
Each step of the pipeline has configuration options for experimenting with various methods. These are detailed in the documentation for each method. Notably, the
evaluate methods all include the option to save results to a file.
Import and pre-process documents from a text file containing a list of all documents.
from medtop.core import *
data, doc_df = import_from_files('test_data/corpus_file_list.txt', stop_words_file='stop_words.txt', save_results = False)
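A file list like the one passed to import_from_files can be generated rather than written by hand. The sketch below assumes the list is a plain text file with one document path per line; that layout, the helper name write_file_list, and the "*.txt" pattern are assumptions, so compare the result against test_data/corpus_file_list.txt before relying on it.

```python
from pathlib import Path
from typing import List


def write_file_list(doc_dir: str, list_path: str, pattern: str = "*.txt") -> List[str]:
    """Write the paths of all matching documents to list_path, one per line."""
    list_file = Path(list_path).resolve()
    paths = sorted(
        str(p)
        for p in Path(doc_dir).glob(pattern)
        # Skip directories and the list file itself if it lives in doc_dir.
        if p.is_file() and p.resolve() != list_file
    )
    # Assumption: one path per line, no header.
    Path(list_path).write_text("\n".join(paths) + "\n", encoding="utf-8")
    return paths
```

For example, write_file_list("my_docs", "corpus_file_list.txt") collects every .txt file in the hypothetical my_docs directory.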
You can also consolidate your documents into a single, pipe-delimited CSV file with the columns "doc_name" and "text".
data, doc_df = import_from_csv('test_data/corpus.txt', stop_words_file='stop_words.txt', save_results = False)
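A consolidated corpus file of this kind can be produced with Python's csv module. This is a minimal sketch: the header row, the one-document-per-row layout, and the helper name write_corpus_csv are assumptions, not something the MedTop documentation confirms here.

```python
import csv


def write_corpus_csv(documents, csv_path):
    """Write (doc_name, text) pairs to a pipe-delimited file with a header row."""
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle, delimiter="|")
        writer.writerow(["doc_name", "text"])  # assumed column header
        for doc_name, text in documents:
            # Collapse line breaks and repeated whitespace so each document stays on one row.
            writer.writerow([doc_name, " ".join(text.split())])


if __name__ == "__main__":
    write_corpus_csv(
        [("doc1", "First line.\nSecond line."), ("doc2", "Short note.")],
        "corpus.txt",
    )
```

Note that csv.writer wraps any field containing a pipe character in double quotes; whether import_from_csv honors that quoting is not verified here, so it may be safer to strip pipes from the text beforehand.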
Create word vectors from the most expressive phrase in each sentence of the imported documents. Seed documents can be passed as a single CSV similar to corpus documents in the import step.
If doc_df is NOT passed to create_tfidf, you must set
tfidf, dictionary = create_tfidf(doc_df, path_to_seed_topics_file_list='test_data/seed_topics_file_list.txt')
data = get_phrases(data, dictionary.token2id, tfidf, include_input_in_tfidf = True, include_sentiment=True)
data = get_vectors("tfidf", data, dictionary = dictionary, tfidf = tfidf)
Removed 67 sentences without phrases.
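To make the idea of picking the "most expressive phrase" concrete, the sketch below scores every fixed-length window of tokens in a sentence by summed inverse document frequency and keeps the best one. It is a plain-Python illustration, not MedTop's implementation: the function names, the log(N / df) weighting, and the window scoring rule are all assumptions.

```python
import math
from collections import Counter


def idf_weights(documents):
    """Map each token to log(N / document frequency) over tokenized documents."""
    n_docs = len(documents)
    doc_freq = Counter(token for doc in documents for token in set(doc))
    return {token: math.log(n_docs / df) for token, df in doc_freq.items()}


def best_phrase(sentence, weights, window_size=3):
    """Return the window of tokens with the highest summed weight."""
    if window_size < 0 or len(sentence) <= window_size:
        # Assumed convention: a negative window means "use every token".
        return list(sentence)
    windows = [
        sentence[i:i + window_size]
        for i in range(len(sentence) - window_size + 1)
    ]
    return max(windows, key=lambda w: sum(weights.get(t, 0.0) for t in w))


if __name__ == "__main__":
    docs = [
        ["the", "patient", "felt", "anxious", "today"],
        ["the", "patient", "was", "calm"],
        ["the", "nurse", "was", "kind"],
    ]
    weights = idf_weights(docs)
    print(best_phrase(docs[0], weights))  # ['felt', 'anxious', 'today']
```

Tokens that appear in every document, such as "the" in this toy corpus, receive a weight of zero, so windows dominated by them lose to windows of rarer words.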
Cluster the sentences into groups expressing similar ideas or topics. If you aren't sure how many true clusters exist in the data, try running assign_clusters with the optional parameter show_chart = True to visualize cluster quality with varying numbers of clusters. When using method='hac', you can also use show_dendrogram = True to see the cluster dendrogram.
data = assign_clusters(data, method = "hac")
cluster_df = get_cluster_topics(data, doc_df, save_results = False)
visualize_clustering(data, method = "umap", show_chart = False)
gold_file = "test_data/gold.txt"
results_df = evaluate(data, gold_file="test_data/gold.txt", save_results = False)
IMPORTANT: This feature is still in alpha, meaning that we have adapted the pipeline to accommodate the clustering of documents, but have made no rigorous efforts to ensure that it works well.
To cluster documents, simply import the data and create the TF-IDF as above, but extract phrases, create the vectors, and cluster using the doc_df dataframe. Passing the parameter window_size=-1 to get_phrases tells the method to use all tokens instead of selecting a subset of length window_size.
doc_df = get_phrases(doc_df, dictionary.token2id, tfidf, include_input_in_tfidf = True, window_size=-1)
doc_df = get_vectors("tfidf", doc_df, dictionary = dictionary, tfidf = tfidf)
doc_df = assign_clusters(doc_df, method = "kmeans", k=4)
cluster_df = get_cluster_topics(data, doc_df, save_results = False)
visualize_clustering(data, method = "svd", show_chart = False)
Removed 0 sentences without phrases.