Set of tools for use in research of rare disease related text.

RDSMproj

RDSMproj (Rare Diseases Social Media Project) is a project for the National Center for Advancing Translational Sciences (NCATS) at the NIH. It mines social media (Reddit) to find subreddits related to the rare diseases listed in the GARD database. The project matches rare diseases to subreddits, downloads their post and comment data, and then analyzes the text to identify the topics people are discussing.

Overview

The project is split into four packages as part of rdsmproj:

  1. mapper is a Python package that maps text to one or more rare diseases using nltk and spaCy. An alternate name for this package is NormMap V2.
  2. sm_reddit is a collection of scripts that use pmaw to download Reddit post and comment text for topic modeling or other text analyses.
  3. tm_t2v is a Python package that creates topic models of text using Top2Vec.
  4. tm_lda is a (legacy) Python package that creates topic models of text, primarily using LDA as implemented by Gensim. This package was used in this paper.
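
To make the idea of mapping text to rare diseases concrete, here is a minimal sketch using only the standard library. It is not the mapper package's implementation (which uses nltk and spaCy); the disease list, `normalize`, and `map_text` are hypothetical names chosen for illustration.

```python
import re

# Hypothetical hard-coded list for illustration only.
DISEASE_NAMES = ["Cystic fibrosis", "Huntington disease", "Marfan syndrome"]


def normalize(text: str) -> str:
    """Lowercase the text and collapse punctuation/whitespace runs to single spaces."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def map_text(text: str, names=DISEASE_NAMES) -> list:
    """Return every disease name whose normalized form occurs as whole words in the text."""
    padded = f" {normalize(text)} "
    return [name for name in names if f" {normalize(name)} " in padded]


matches = map_text("Newly diagnosed with Marfan Syndrome, also asking about cystic-fibrosis.")
print(matches)
```

Exact phrase matching like this misses synonyms, abbreviations, and misspellings, which is the kind of gap a linguistic pipeline is meant to close.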

Installation

Ensure that you have up-to-date copies of pip, setuptools, and wheel before installing.

pip install --upgrade pip setuptools wheel

For now, each package above is installed separately.

pip install rdsmproj[mapper]
pip install rdsmproj[sm_reddit]
pip install rdsmproj[tm_t2v]
pip install rdsmproj[tm_lda]
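
After installing, a quick way to see what ended up in the environment is to query the installed distributions. The sketch below uses only `importlib.metadata` from the standard library. The list of distribution names is an assumption based on the libraries named in the overview, not a statement of what each extra actually pulls in, and `installed_version` is a hypothetical helper.

```python
from importlib import metadata


def installed_version(dist_name: str):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


# Assumed distribution names; adjust to match the extras you installed.
CANDIDATES = ("rdsmproj", "nltk", "spacy", "pmaw", "top2vec", "gensim")

report = {name: installed_version(name) for name in CANDIDATES}
for name, version in report.items():
    print(f"{name}: {version or 'not installed'}")
```

Some shells (zsh, for example) treat square brackets as glob patterns, so quoting the requirement, as in `pip install "rdsmproj[mapper]"`, avoids a "no matches found" error.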

Quick Start

For more information, see the API guide.

Examples using sm_reddit

sm_reddit.GetPosts

from rdsmproj import sm_reddit

pmaw_args = {'limit':1000}
# Example subreddit 'MachineLearning'.
# Passes pmaw arguments to search_submissions.
sm_reddit.GetPosts(name='MachineLearning', silence=False, pmaw_args=pmaw_args)

sm_reddit.GetRedditComments

from pathlib import Path

from rdsmproj import sm_reddit, utils

# Same pmaw arguments as in the GetPosts example.
pmaw_args = {'limit': 1000}

# Default path to where the post data is located.
path = utils.get_data_path('posts')
data = utils.load_json(Path(path, 'MachineLearning_posts.json'))
# Example passes pmaw arguments to search_submission_comment_ids.
sm_reddit.GetRedditComments(data=data, silence=False, pmaw_args=pmaw_args)

Example using preprocess to process text data.

preprocess.PreProcess

from rdsmproj import preprocess as pp

# Example processes the comment data for use with tm_lda or tm_t2v.
data = pp.PreProcess(name='MachineLearning')
documents, tokenized_documents, id2word, corpus = data()
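
The names `id2word` and `corpus` suggest the Gensim convention: a mapping from integer ids to tokens, and one bag-of-words list of `(token_id, count)` pairs per document. That reading is an assumption, not something the README states. The standard-library sketch below builds toy versions of both to show the shape of the data; every `toy_` name is hypothetical.

```python
from collections import Counter

# Two toy tokenized documents standing in for preprocessed Reddit comments.
toy_docs = [
    ["rare", "disease", "diagnosis"],
    ["disease", "treatment", "disease"],
]

# Assign each distinct token an integer id in first-seen order.
toy_token2id = {}
for doc in toy_docs:
    for token in doc:
        toy_token2id.setdefault(token, len(toy_token2id))

# Invert the mapping: integer id -> token.
toy_id2word = {idx: token for token, idx in toy_token2id.items()}

# One bag of words per document: sorted (token_id, count) pairs.
toy_corpus = [
    sorted(Counter(toy_token2id[token] for token in doc).items())
    for doc in toy_docs
]

print(toy_id2word)  # {0: 'rare', 1: 'disease', 2: 'diagnosis', 3: 'treatment'}
print(toy_corpus)   # [[(0, 1), (1, 1), (2, 1)], [(1, 2), (3, 1)]]
```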

Example using tm_t2v to create and analyze a top2vec model.

tm_t2v.Top2VecModel

from rdsmproj import tm_t2v

embedding_model = 'doc2vec'
name = 'MachineLearning'
clustering_method = 'leaf'
i = 0

# Creates and saves a model.
model = tm_t2v.Top2VecModel(name,
                            f'{name}_{embedding_model}_{clustering_method}_{i}',
                            documents=documents,
                            embedding_model=embedding_model,
                            speed='fast-learn'
                            ).fit()

tm_t2v.AnalyzeTopics

# Analyzes model and records the results.
tm_t2v.AnalyzeTopics(model=model,
                     model_name=f'{name}_{embedding_model}_{clustering_method}_{i}',
                     subreddit_name=name,
                     tokenized_docs=tokenized_documents,
                     id2word=id2word,
                     corpus=corpus,
                     model_type='Top2Vec')

To Do

  • Test package install from TestPyPI.
  • Update main README.md Quick Start with examples for most packages.
  • Create sm_reddit README.md.
  • Create tm_t2v README.md.
  • Create tm_lda README.md.
  • Create API guide and documentation pages.
  • Add visualizations and flowcharts to the readme files.
  • Upload to PyPI.
