Documentation is available at https://amyolex.github.io/medtop/.
MedTop
Extracting topics from reflective medical writings.
Requirements
MedTop is only compatible with 64-bit Python. You can check whether the Python interpreter in your virtual environment is a 64-bit build with the following code.
import platform; platform.architecture()[0];
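As a small sketch, the same check can be turned into a guard that a script runs before importing MedTop. The helper name is_64bit_python is hypothetical and not part of MedTop; it relies only on the standard library.

```python
import platform
import struct


def is_64bit_python():
    """Return True when the running interpreter is a 64-bit build."""
    # A C pointer ("P") occupies 8 bytes on a 64-bit interpreter.
    return struct.calcsize("P") * 8 == 64


# platform.architecture()[0] reports a string such as '64bit' or '32bit'.
print(platform.architecture()[0], is_64bit_python())
```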
pip install medtop
How to use
A template pipeline is provided below using a test dataset. You can read more about the test_data dataset here
Each step of the pipeline has configuration options for experimenting with various methods. These are detailed in the documentation for each method. Notably, the import_docs, get_cluster_topics, visualize_clustering, and evaluate methods all include the option to save results to a file.
Example Pipeline
Import data
Import and pre-process documents from a text file containing a list of all documents.
from medtop.core import *
data, doc_df = import_from_files('test_data/corpus_file_list.txt', stop_words_file='stop_words.txt', save_results = False)
You can also consolidate your documents into a single, pipe-delimited CSV file with the columns "doc_name" and "text".
data, doc_df = import_from_csv('test_data/corpus.txt', stop_words_file='stop_words.txt', save_results = False)
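For illustration, a pipe-delimited file with those two columns could be produced with Python's standard csv module. This is only a sketch: the file name my_corpus.txt and the sample rows are made up, and whether MedTop expects a header row or particular quoting rules is an assumption that should be checked against the documentation.

```python
import csv
import os
import tempfile

# Hypothetical example documents; replace with your own.
docs = [
    {"doc_name": "doc_1", "text": "The patient spoke about her recovery."},
    {"doc_name": "doc_2", "text": "I learned to listen before offering advice."},
]

# Hypothetical output path, written to a temporary directory here.
corpus_path = os.path.join(tempfile.mkdtemp(), "my_corpus.txt")

with open(corpus_path, "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["doc_name", "text"], delimiter="|")
    writer.writeheader()  # assumption: a header row naming the two columns
    writer.writerows(docs)

# Read the file back to confirm the layout.
with open(corpus_path, newline="", encoding="utf-8") as f:
    rows = [dict(row) for row in csv.DictReader(f, delimiter="|")]
```

The resulting path could then be passed to import_from_csv in place of 'test_data/corpus.txt'.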
Transform data
Create word vectors from the most expressive phrase in each sentence of the imported documents. Seed documents can be passed as a single CSV similar to corpus documents in the import step.
NOTE: If doc_df is NOT passed to create_tfidf, you must set include_input_in_tfidf=False in get_phrases.
tfidf, dictionary = create_tfidf(doc_df, path_to_seed_topics_file_list='test_data/seed_topics_file_list.txt')
data = get_phrases(data, dictionary.token2id, tfidf, include_input_in_tfidf = True, include_sentiment=True)
data = get_vectors("tfidf", data, dictionary = dictionary, tfidf = tfidf)
Removed 67 sentences without phrases.
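To make the idea of a "most expressive phrase" more concrete, the sketch below picks the contiguous run of tokens with the highest total weight in a sentence. It is an illustration only and not MedTop's implementation: the function best_window, the assumption that a phrase is a contiguous window scored by summed TF-IDF weights, and the weights themselves are all hypothetical. The window_size argument mirrors the get_phrases parameter of the same name, where -1 means that all tokens are used.

```python
def best_window(tokens, weights, window_size):
    """Return the contiguous run of tokens with the highest total weight.

    tokens: list of str; weights: dict mapping token -> score
    (tokens missing from the dict score 0).
    window_size=-1, or a size >= len(tokens), keeps every token.
    """
    if window_size == -1 or window_size >= len(tokens):
        return list(tokens)
    if window_size < 1:
        raise ValueError("window_size must be -1 or a positive integer")
    scores = [weights.get(t, 0.0) for t in tokens]
    # Slide the window one token at a time, updating the running total.
    current = sum(scores[:window_size])
    best_start, best_total = 0, current
    for start in range(1, len(tokens) - window_size + 1):
        current += scores[start + window_size - 1] - scores[start - 1]
        if current > best_total:
            best_start, best_total = start, current
    return tokens[best_start:best_start + window_size]


# Made-up TF-IDF-style weights, for illustration only.
weights = {"patient": 0.2, "chronic": 0.9, "pain": 0.8, "clinic": 0.4}
sentence = ["the", "patient", "described", "chronic", "pain", "today"]
phrase = best_window(sentence, weights, window_size=3)
```

With these weights the selected window is the three tokens ending in "chronic pain"; when two windows tie, the earlier one is kept.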
Cluster data
Cluster the sentences into groups expressing similar ideas or topics. If you aren't sure how many true clusters exist in the data, try running assign_clusters with the optional parameter show_chart = True to visualize cluster quality across varying numbers of clusters. When using method='hac', you can also use show_dendrogram = True to see the cluster dendrogram.
data = assign_clusters(data, method = "hac")
cluster_df = get_cluster_topics(data, doc_df, save_results = False)
visualize_clustering(data, method = "umap", show_chart = False)
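For readers unfamiliar with hierarchical agglomerative clustering (the "hac" method), the following pure-Python sketch shows the general technique: start with every point in its own cluster and repeatedly merge the two closest clusters until k remain. It is not MedTop's implementation; the function hac, the choice of single linkage with Euclidean distance, and the toy vectors are assumptions made for illustration.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def hac(points, k):
    """Merge the two closest clusters (single linkage) until k remain.

    Returns one integer cluster label per input point.
    """
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: distance between the closest pair of members.
                d = min(euclidean(points[a], points[b])
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    labels = [0] * len(points)
    for label, members in enumerate(clusters):
        for idx in members:
            labels[idx] = label
    return labels


# Toy 2-D vectors standing in for sentence vectors.
vectors = [(0.0, 0.1), (0.1, 0.0), (5.0, 5.1), (5.1, 5.0), (9.0, 0.0)]
labels = hac(vectors, k=3)
```

Re-running such a procedure with different values of k and comparing the groupings is the kind of exploration that the show_chart and show_dendrogram options are meant to support.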
Evaluate results
gold_file = "test_data/gold.txt"
results_df = evaluate(data, gold_file=gold_file, save_results = False)
Document Clustering
IMPORTANT: This feature is still in alpha, meaning that we have adapted the pipeline to accommodate the clustering of documents, but have made no rigorous efforts to ensure that it works well.
To cluster documents, simply import the data and create the TF-IDF as above, but extract phrases, create the vectors, and cluster using the doc_df dataframe. Passing the parameter window_size=-1 to get_phrases tells the method to use all tokens instead of selecting a subset of length window_size.
doc_df = get_phrases(doc_df, dictionary.token2id, tfidf, include_input_in_tfidf = True, window_size=-1)
doc_df = get_vectors("tfidf", doc_df, dictionary = dictionary, tfidf = tfidf)
doc_df = assign_clusters(doc_df, method = "kmeans", k=4)
cluster_df = get_cluster_topics(data, doc_df, save_results = False)
visualize_clustering(data, method = "svd", show_chart = False)
Removed 0 sentences without phrases.
File details
Details for the file medtop-0.0.8.tar.gz.
File metadata
- Download URL: medtop-0.0.8.tar.gz
- Upload date:
- Size: 20.0 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.22.0 setuptools/45.2.0.post20200210 requests-toolbelt/0.9.1 tqdm/4.42.1 CPython/3.7.6
File hashes
Algorithm | Hash digest
---|---
SHA256 | a10b7ceda3a4702e0ca2364277c8d19a5bfb238b9008ace0c43916691fe241bd
MD5 | 84e5bca055d37e9d8ac06cfad2919c0b
BLAKE2b-256 | f5511ad20b580bad1712268c6d0b24d393cd2341fa10ca10a63ba292c1a37f60
File details
Details for the file medtop-0.0.8-py3-none-any.whl.
File metadata
- Download URL: medtop-0.0.8-py3-none-any.whl
- Upload date:
- Size: 18.9 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.22.0 setuptools/45.2.0.post20200210 requests-toolbelt/0.9.1 tqdm/4.42.1 CPython/3.7.6
File hashes
Algorithm | Hash digest
---|---
SHA256 | 24752bd7acfcf63da943b2ccd888f03c44a317e0317cd0cefb421e2834e5119a
MD5 | 815aabc59207423333bfd5881e43bf4e
BLAKE2b-256 | 09972843d14efbd3189669527bb880c79502f9b8709661f1ca30a42991044181