A set of tools for research on text related to rare diseases.
Project description
RDSMproj
RDSMproj (Rare Diseases Social Media Project) is a project for the National Center for Advancing Translational Sciences (NCATS) at the NIH. It mines social media (Reddit) to find subreddits related to rare diseases listed in the GARD database. The project matches rare diseases to subreddits, downloads the post and comment data, and then analyzes the text to identify the topics people are discussing.
Overview
The project is split into four packages as part of rdsmproj:
- mapper is a Python package that maps text to one or more rare diseases using nltk and spaCy. An alternate name for this package is NormMap V2.
- sm_reddit is a collection of scripts that use pmaw to download Reddit post and comment text data for topic modeling or other text analyses.
- tm_t2v is a Python package that creates topic models of text using Top2Vec.
- tm_lda is a (legacy) Python package that creates topic models of text, primarily using LDA as implemented by Gensim. This package was used in this paper.
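To make the idea of mapping text to rare diseases concrete, here is a minimal sketch of dictionary-based matching. It is not how mapper is implemented (mapper uses nltk and spaCy); the `DISEASE_TERMS` table and the `map_text` function are hypothetical and exist only for illustration.

```python
import re

# Hypothetical lookup table: canonical disease name -> surface forms to match.
# This is NOT part of rdsmproj; it only illustrates the mapping idea.
DISEASE_TERMS = {
    "Cystic fibrosis": ["cystic fibrosis", "cf"],
    "Huntington disease": ["huntington disease", "huntington's disease"],
}


def map_text(text):
    """Return the sorted canonical names whose terms occur as whole words in text."""
    lowered = text.lower()
    matches = set()
    for name, terms in DISEASE_TERMS.items():
        for term in terms:
            # Lookarounds keep short terms such as 'cf' from matching inside words.
            if re.search(r"(?<!\w)" + re.escape(term) + r"(?!\w)", lowered):
                matches.add(name)
                break
    return sorted(matches)
```

Plain string matching like this misses misspellings and inflected forms, which is one reason a package such as mapper relies on NLP tooling instead.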
Installation
Ensure that you have up-to-date copies of pip, setuptools, and wheel prior to installation.
pip install --upgrade pip setuptools wheel
For now, each package above is installed separately.
pip install rdsmproj[mapper]
pip install rdsmproj[sm_reddit]
pip install rdsmproj[tm_t2v]
pip install rdsmproj[tm_lda]
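After installing, it can help to confirm that the third-party libraries each package relies on are importable. The sketch below uses only the standard library. The `EXTRA_MODULES` mapping is an assumption inferred from the package descriptions in the Overview, not read from the rdsmproj package metadata, and `missing_modules` is a hypothetical helper.

```python
from importlib.util import find_spec

# Assumed mapping from each rdsmproj package to the third-party modules it
# relies on, inferred from the Overview (not from the package metadata).
EXTRA_MODULES = {
    "mapper": ["nltk", "spacy"],
    "sm_reddit": ["pmaw"],
    "tm_t2v": ["top2vec"],
    "tm_lda": ["gensim"],
}


def missing_modules(extra_modules=EXTRA_MODULES):
    """Return {package: [modules that cannot be found]} for incomplete installs."""
    missing = {}
    for extra, modules in extra_modules.items():
        # find_spec returns None when a top-level module is not installed.
        absent = [m for m in modules if find_spec(m) is None]
        if absent:
            missing[extra] = absent
    return missing
```

Calling `missing_modules()` returns an empty dictionary when every listed module can be found.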
Quick Start
For more information, see the API guide.
Examples using sm_reddit
sm_reddit.GetPosts
from rdsmproj import sm_reddit
pmaw_args = {'limit': 1000}
# Download posts from the example subreddit 'MachineLearning'.
# pmaw_args are passed to pmaw's search_submissions.
sm_reddit.GetPosts(name='MachineLearning', silence=False, pmaw_args=pmaw_args)
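Because `pmaw_args` is passed on to `search_submissions`, it may also be possible to restrict the download to a date range. The sketch below builds such a dictionary; `build_pmaw_args` is a hypothetical helper, and it assumes that GetPosts forwards the `after` and `before` keys unchanged and that pmaw accepts them as epoch seconds.

```python
import datetime as dt


def build_pmaw_args(start, end, limit=1000):
    """Build keyword arguments for a date-bounded search (hypothetical helper)."""
    if start >= end:
        raise ValueError("start must be earlier than end")
    return {
        "after": int(start.timestamp()),   # lower bound, epoch seconds
        "before": int(end.timestamp()),    # upper bound, epoch seconds
        "limit": limit,                    # maximum number of posts to request
    }


# Timezone-aware datetimes avoid depending on the local timezone.
utc = dt.timezone.utc
pmaw_args = build_pmaw_args(
    dt.datetime(2021, 1, 1, tzinfo=utc),
    dt.datetime(2021, 2, 1, tzinfo=utc),
)
```

The resulting `pmaw_args` can be supplied to `sm_reddit.GetPosts` in place of the simpler dictionary shown above.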
sm_reddit.GetRedditComments
from pathlib import Path
from rdsmproj import sm_reddit, utils
pmaw_args = {'limit': 1000}
# Default path to where the post data is located.
path = utils.get_data_path('posts')
data = utils.load_json(Path(path, 'MachineLearning_posts.json'))
# pmaw_args are passed to search_submission_comment_ids.
sm_reddit.GetRedditComments(data=data, silence=False, pmaw_args=pmaw_args)
Example using preprocess to process text data.
preprocess.Preprocess
from rdsmproj import preprocess as pp
# Example processes the comment data for use with tm_lda or tm_t2v.
data = pp.PreProcess(name='MachineLearning')
documents, tokenized_documents, id2word, corpus = data()
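The names `id2word` and `corpus` suggest Gensim-style structures: a mapping from integer ids to tokens, and one bag-of-words list of `(token_id, count)` pairs per document. The README does not confirm the exact return types, so the pure-Python sketch below is only an illustration of that format under that assumption. `build_bow` is a hypothetical function, and the ids it assigns may differ from the ones Gensim would assign.

```python
from collections import Counter


def build_bow(tokenized_documents):
    """Build an id-to-token map and a bag-of-words corpus from token lists."""
    token2id = {}
    for doc in tokenized_documents:
        for token in doc:
            # Assign the next free integer id the first time a token is seen.
            token2id.setdefault(token, len(token2id))
    id2word = {idx: token for token, idx in token2id.items()}
    corpus = [
        sorted(Counter(token2id[t] for t in doc).items())
        for doc in tokenized_documents
    ]
    return id2word, corpus


toy_docs = [["rare", "disease", "rare"], ["disease", "forum"]]
toy_id2word, toy_corpus = build_bow(toy_docs)
# toy_id2word -> {0: 'rare', 1: 'disease', 2: 'forum'}
# toy_corpus  -> [[(0, 2), (1, 1)], [(1, 1), (2, 1)]]
```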
Example using tm_t2v to create and analyze a top2vec model.
tm_t2v.Top2VecModel
from rdsmproj import tm_t2v
embedding_model = 'doc2vec'
name = 'MachineLearning'
clustering_method = 'leaf'
i = 0
# Creates and saves a model.
model = tm_t2v.Top2VecModel(name,
                            f'{name}_{embedding_model}_{clustering_method}_{i}',
                            documents=documents,
                            embedding_model=embedding_model,
                            speed='fast-learn'
                            ).fit()
tm_t2v.AnalyzeTopics
# Analyzes model and records the results.
tm_t2v.AnalyzeTopics(model=model,
                     model_name=f'{name}_{embedding_model}_{clustering_method}_{i}',
                     subreddit_name=name,
                     tokenized_docs=tokenized_documents,
                     id2word=id2word,
                     corpus=corpus,
                     model_type='Top2Vec')
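Both examples above build the same model name from the subreddit name, embedding model, clustering method, and run index. When training several runs or configurations, a small helper can generate these names consistently. `model_names` below is a hypothetical convenience function, not part of rdsmproj.

```python
from itertools import product


def model_names(name, embedding_models, clustering_methods, runs):
    """Yield one '<name>_<embedding>_<clustering>_<run>' label per combination."""
    combos = product(embedding_models, clustering_methods, range(runs))
    for embedding, clustering, i in combos:
        yield f"{name}_{embedding}_{clustering}_{i}"


names = list(model_names("MachineLearning", ["doc2vec"], ["leaf"], runs=2))
# names -> ['MachineLearning_doc2vec_leaf_0', 'MachineLearning_doc2vec_leaf_1']
```

Each generated name could then be used for both the model and the `model_name` argument of `AnalyzeTopics`, so the saved model and its analysis results share one label.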
To Do