Tools that give easy access to prepared data for data scientists who can't be bothered to prepare it themselves.
- Introduction
- Installation
- Getting some resources
- Poking around
- Potential Calculus Experimentations
- Extra scrap
- Acknowledgements
Introduction
ODUS (for Older Drug User Study) contains data and tools to study the drug use of older drug users.
Essentially, these are tools:
- To get prepared data on 119 "trajectories", each describing 31 variables (drug use, social, etc.) over time for one of 119 respondents.
- To visualize these trajectories in various ways.
- To create PDFs of any selection of these trajectories and variables.
- To make count tables for any combination of the variables: an essential step of any Markovian or Bayesian analysis.
- To make probability (joint or conditional) tables from any combination of the variables.
- To operate on these count and probability tables, thus enabling inference operations.
Installation
You need Python 3.7+ to run this notebook.
You'll also need odus, which you get by running
pip install odus
(And if you don't have pip then, well... how to put it... ha ha ha!)
But if you're the type, you can also just get the source from https://github.com/thorwhalen/odus.
Oh, and pull requests etc. are welcome!
Stars, likes, references, and coffee also welcome.
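If you want to check the two requirements above before opening the notebook, something like the following should do. It is only a sketch: `check_environment` is a made-up helper for this README, not part of odus, and it uses nothing beyond the standard library.

```python
import sys
from importlib.util import find_spec


def check_environment(min_version=(3, 7), package="odus"):
    """Return a list of human-readable problems (empty if everything looks fine).

    Hypothetical helper, not part of odus.
    """
    problems = []
    if sys.version_info < min_version:
        wanted = ".".join(map(str, min_version))
        problems.append(f"Python {wanted}+ is required")
    if find_spec(package) is None:
        problems.append(f"{package} is not installed; try: pip install {package}")
    return problems


print(check_environment() or "ready to go")
```

An empty list means both the interpreter version and the package look fine.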
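A simple flowchart about the architecture: excel files (one per respondent) -> dataframes -> counts for any combinations of the variables -> probabilities for any combinations of the variables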
Getting some resources
from matplotlib.pylab import *
from numpy import *
import seaborn as sns
import os
from py2store.stores.local_store import RelativePathFormatStore
from py2store.mixins import ReadOnlyMixin
from py2store.base import Store
from io import BytesIO
from spyn.ppi.pot import Pot, ProbPot
from collections import UserDict, Counter
import numpy as np
import pandas as pd
from ut.ml.feature_extraction.sequential_var_sets import PVar, VarSet, DfData, VarSetFactory
from IPython.display import Image
from odus.analysis_utils import *
from odus.dacc import DfStore, counts_of_kps, Dacc, VarSetCountsStore, \
mk_pvar_struct, PotStore, _commun_columns_of_dfs, Struct, mk_pvar_str_struct, VarStr
from odus.plot_utils import plot_life_course
from odus import data_dir, data_path_of
survey_dir = data_dir
data_dir
'/D/Dropbox/dev/p3/proj/odus/odus/data'
df_store = DfStore(data_dir + '/{}.xlsx')
len(df_store)
cstore = VarSetCountsStore(df_store)
v = mk_pvar_struct(df_store, only_for_cols_in_all_dfs=True)
s = mk_pvar_str_struct(v)
f, df = cstore.df_store.head()
pstore = PotStore(df_store)
Poking around
df_store
A df_store is a key-value store where each key is an xlsx file (for example surveys/B24.xlsx) and each value is the prepared dataframe for that file.
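To make the "key-value store" idea concrete, here is a minimal sketch of a read-only store whose keys are file names and whose values are built on demand. It is not how `DfStore` is implemented: `ReadOnlyTableStore`, `fake_loader` and the file names are all hypothetical, and the loader returns a plain dict where the real thing would return a dataframe.

```python
from collections.abc import Mapping


class ReadOnlyTableStore(Mapping):
    """Hypothetical stand-in for a DfStore-like object.

    Keys are file names; values are produced on demand by a loader function.
    """

    def __init__(self, file_names, loader):
        self._file_names = list(file_names)
        self._loader = loader

    def __getitem__(self, key):
        if key not in self._file_names:
            raise KeyError(key)
        return self._loader(key)

    def __iter__(self):
        return iter(self._file_names)

    def __len__(self):
        return len(self._file_names)


def fake_loader(file_name):
    # A real loader would read and clean the spreadsheet behind file_name.
    return {"source": file_name, "rows": [{"age": 11, "ALCOHOL": 0}]}


store = ReadOnlyTableStore(["surveys/A01.xlsx", "surveys/A02.xlsx"], fake_loader)
print(len(store), list(store))
```

Because the class derives from `Mapping`, `len(store)`, `iter(store)`, `store.values()` and `store.items()` all work, which is the same dict-like surface the examples in this section rely on.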
len(df_store)
119
it = iter(df_store.values())
for i in range(5):  # skip the first five
    _ = next(it)
df = next(it)  # get the one I want
df.head(3)
category | RURAL | SUBURBAN | URBAN/CITY | HOMELESS | INCARCERATION | WORK | SON/DAUGHTER | SIBLING | FATHER/MOTHER | SPOUSE | ... | METHAMPHETAMINE | AS PRESCRIBED OPIOID | NOT AS PRESCRIBED OPIOID | HEROIN | OTHER OPIOID | INJECTED | IN TREATMENT | Selects States below | Georgia | Pennsylvania |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
age | |||||||||||||||||||||
11 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 |
12 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 |
13 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 |
3 rows × 31 columns
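The loop used above to skip five items by hand can also be written with `itertools.islice`. A small sketch, where `nth` is a helper name chosen here and the strings stand in for dataframes:

```python
from itertools import islice


def nth(iterable, n, default=None):
    """Return the n-th item (0-based) of an iterable, or default if it is too short."""
    return next(islice(iterable, n, None), default)


# Strings stand in for the dataframes that df_store.values() would yield.
values = iter(["df0", "df1", "df2", "df3", "df4", "df5", "df6"])
sixth = nth(values, 5)  # skips five items and returns the sixth
print(sixth)
```

This works on any iterable, so it should apply unchanged to `df_store.values()`.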
print(df.columns.values)
['RURAL' 'SUBURBAN' 'URBAN/CITY' 'HOMELESS' 'INCARCERATION' 'WORK'
'SON/DAUGHTER' 'SIBLING' 'FATHER/MOTHER' 'SPOUSE'
'OTHER (WHO?, FILL IN BRACKETS HERE)' 'FRIEND USER' 'FRIEND NON USER'
'MENTAL ILLNESS' 'PHYSICAL ILLNESS' 'LOSS OF LOVED ONE' 'TOBACCO'
'MARIJUANA' 'ALCOHOL' 'HAL/LSD/XTC/CLUBDRUG' 'COCAINE/CRACK'
'METHAMPHETAMINE' 'AS PRESCRIBED OPIOID' 'NOT AS PRESCRIBED OPIOID'
'HEROIN' 'OTHER OPIOID' 'INJECTED' 'IN TREATMENT' 'Selects States below'
'Georgia' 'Pennsylvania']
t = df[['ALCOHOL', 'TOBACCO']]
t.head(3)
category | ALCOHOL | TOBACCO |
---|---|---|
age | ||
11 | 0 | 0 |
12 | 0 | 0 |
13 | 0 | 0 |
c = Counter()
for i, r in t.iterrows():
    c.update([tuple(r.to_list())])
c
Counter({(0, 0): 6, (1, 0): 4, (1, 1): 9, (0, 1): 2})
def count_tuples(dataframe):
    c = Counter()
    for i, r in dataframe.iterrows():
        c.update([tuple(r.to_list())])
    return c
fields = ['ALCOHOL', 'TOBACCO']
# do it for every respondent
c = Counter()
for df in df_store.values():
    c.update(count_tuples(df[fields]))
c
Counter({(0, 1): 903, (1, 1): 1343, (0, 0): 240, (1, 0): 179})
pd.Series(c)
0 1 903
1 1 1343
0 0 240
1 0 179
dtype: int64
# Powerful! You can use that with several pairs and get some nice probabilities. Look up Naive Bayes.
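For the curious, here is roughly what "use that with several pairs" could look like as a Naive Bayes calculation. This is a sketch on made-up rows, not ODUS data, and `pair_counts` and `naive_bayes_posterior` are names invented for this example. It estimates P(WORK | ALCOHOL=1, TOBACCO=1) from the prior on WORK and two pair count tables, assuming the two features are independent given WORK.

```python
from collections import Counter

# Toy rows of (ALCOHOL, TOBACCO, WORK); made up for illustration, not ODUS data.
rows = [
    (1, 1, 1), (1, 1, 1), (1, 1, 1), (1, 0, 1), (0, 1, 1), (0, 0, 1),
    (1, 0, 0), (0, 1, 0), (0, 0, 0), (0, 0, 0),
]


def pair_counts(rows, i, j):
    """Count how often each (value of column i, value of column j) pair occurs."""
    return Counter((r[i], r[j]) for r in rows)


def naive_bayes_posterior(rows, evidence, target=2):
    """P(target | evidence), assuming features are independent given the target.

    evidence maps a column index to its observed value.
    """
    target_counts = Counter(r[target] for r in rows)
    scores = {}
    for t_val, t_count in target_counts.items():
        score = t_count / len(rows)  # prior P(target = t_val)
        for col, val in evidence.items():
            # likelihood P(feature = val | target = t_val), read off a pair count table
            score *= pair_counts(rows, col, target)[(val, t_val)] / t_count
        scores[t_val] = score
    total = sum(scores.values())
    return {t_val: score / total for t_val, score in scores.items()}


posterior = naive_bayes_posterior(rows, {0: 1, 1: 1})
print(posterior[1])  # P(WORK=1 | ALCOHOL=1, TOBACCO=1) on the toy rows, about 0.914
```

Only pair tables and a single-variable table are needed, which is why the counting step above is enough to get started.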
Viewing trajectories
import itertools
from functools import partial
from odus.util import write_images
from odus.plot_utils import plot_life, life_plots, write_trajectories_to_file
ihead = lambda it: itertools.islice(it, 0, 5)
Viewing a single trajectory
k = next(iter(df_store)) # get the first key
print(f"k: {k}") # print it
plot_life(df_store[k]) # plot the trajectory
k: surveys/B24.xlsx
plot_life(df_store[k], fields=[s.in_treatment, s.injected]) # only want two fields
Flip over all (or some) trajectories
gen = life_plots(df_store)
next(gen) # launch to get the next trajectory
<matplotlib.axes._subplots.AxesSubplot at 0x12b21f070>
Get ten trajectories, but only over four fields.
# fields = [s.in_treatment, s.injected]
fields = [s.physical_illness, s.as_prescribed_opioid, s.heroin, s.other_opioid]
keys = list(df_store)[:10]
# print(f"keys={keys}")
axs = [x for x in life_plots(df_store, fields, keys=keys)];
Making a pdf of trajectories
write_trajectories_to_file(df_store, fields, keys, fp='three_respondents_two_fields.pdf');
write_trajectories_to_file(df_store, fp='all_respondents_all_fields.pdf');
Demo s and v
print(list(filter(lambda x: not x.startswith('__'), dir(s))))
['alcohol', 'as_prescribed_opioid', 'cocaine_crack', 'father_mother', 'hal_lsd_xtc_clubdrug', 'heroin', 'homeless', 'in_treatment', 'incarceration', 'injected', 'loss_of_loved_one', 'marijuana', 'mental_illness', 'methamphetamine', 'not_as_prescribed_opioid', 'other_opioid', 'physical_illness', 'rural', 'sibling', 'son_daughter', 'suburban', 'tobacco', 'urban_city', 'work']
s.heroin
'HEROIN'
v.heroin
PVar('HEROIN', 0)
v.heroin - 1
PVar('HEROIN', -1)
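The `- 1` reads as a lag: the value of the same variable one row (one year of age) earlier. That reading is an assumption based on the `PVar('HEROIN', -1)` repr and on the totals in this README: the unlagged pair tables sum to 2665 rows, while the lagged ones sum to 2546, which is 2665 minus one row for each of the 119 respondents. A minimal sketch of counting lagged pairs within one sequence, on made-up values:

```python
from collections import Counter


def lagged_pair_counts(sequence, lag=1):
    """Count (value `lag` steps earlier, current value) pairs within one sequence.

    lag must be at least 1.
    """
    return Counter(zip(sequence[:-lag], sequence[lag:]))


# Toy yearly indicator for one respondent (made up, not ODUS data).
heroin_by_age = [0, 0, 1, 1, 1, 0, 1]
counts = lagged_pair_counts(heroin_by_age)
print(counts)
```

A sequence of length n yields n - 1 pairs, which would explain why each respondent contributes one row fewer to a lagged table.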
cstore
# cstore[v.alcohol, v.tobacco]
cstore[v.as_prescribed_opioid-1, v.heroin]
Counter({(0, 0): 1026, (1, 0): 264, (0, 1): 1108, (1, 1): 148})
pd.Series(cstore[v.as_prescribed_opioid-1, v.heroin])
0 0 1026
1 0 264
0 1 1108
1 1 148
dtype: int64
cstore[v.alcohol, v.tobacco, v.heroin]
Counter({(0, 0, 1): 427,
(1, 0, 1): 656,
(1, 1, 1): 687,
(0, 0, 0): 189,
(0, 1, 1): 476,
(0, 1, 0): 51,
(1, 0, 0): 133,
(1, 1, 0): 46})
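A three-variable count table can be summed down to any pair. The sketch below (with `marginalize` as a name chosen here) does that for the counts just shown. One thing it brings out: the ALCOHOL/TOBACCO pair counts from earlier are reproduced by keeping positions 0 and 2, which suggests the tuples are ordered (ALCOHOL, HEROIN, TOBACCO) rather than in the order requested. That ordering is inferred from the numbers, not from the odus documentation.

```python
from collections import Counter

# Counts copied from the three-variable table above.
three_way = Counter({
    (0, 0, 1): 427, (1, 0, 1): 656, (1, 1, 1): 687, (0, 0, 0): 189,
    (0, 1, 1): 476, (0, 1, 0): 51, (1, 0, 0): 133, (1, 1, 0): 46,
})


def marginalize(counts, keep):
    """Sum a count table down to the tuple positions listed in `keep`."""
    out = Counter()
    for key, n in counts.items():
        out[tuple(key[i] for i in keep)] += n
    return out


# Positions 0 and 2 give back the ALCOHOL/TOBACCO pair counts seen earlier.
alcohol_tobacco = marginalize(three_way, keep=(0, 2))
print(alcohol_tobacco)
```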
cstore[v.alcohol-1, v.alcohol]
Counter({(0, 0): 994, (1, 1): 1375, (1, 0): 90, (0, 1): 87})
cstore[v.alcohol-1, v.alcohol, v.tobacco]
Counter({(0, 0, 1): 807,
(1, 1, 1): 1220,
(1, 0, 0): 26,
(0, 1, 1): 76,
(0, 0, 0): 187,
(1, 1, 0): 155,
(0, 1, 0): 11,
(1, 0, 1): 64})
t = pd.Series(cstore[v.alcohol-1, v.alcohol, v.tobacco])
t.loc[t.index]
<pandas.core.indexing._LocIndexer at 0x130955db0>
pstore
t = pstore[s.alcohol-1, s.alcohol]
t
pval
ALCOHOL-1 ALCOHOL
0 0 994
1 87
1 0 90
1 1375
t.tb
ALCOHOL-1 | ALCOHOL | pval |
---|---|---|
0 | 0 | 994 |
0 | 1 | 87 |
1 | 0 | 90 |
1 | 1 | 1375 |
t / []
pval
ALCOHOL-1 ALCOHOL
0 0 0.390416
1 0.034171
1 0 0.035350
1 0.540063
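Judging from the output, `t / []` divides every count by the grand total, turning the count table into a joint probability table. That is an inference from the numbers, not a statement about how `Pot` implements division. The same arithmetic in plain Python:

```python
# (ALCOHOL-1, ALCOHOL) counts copied from the table above.
counts = {(0, 0): 994, (0, 1): 87, (1, 0): 90, (1, 1): 1375}


def normalize(table):
    """Divide every entry by the sum of all entries."""
    total = sum(table.values())
    return {key: value / total for key, value in table.items()}


joint = normalize(counts)
for key, p in joint.items():
    print(key, round(p, 6))
```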
t[s.alcohol-1]
pval
ALCOHOL-1
0 1081
1 1465
t / t[s.alcohol-1] # cond prob!
pval
ALCOHOL-1 ALCOHOL
0 0 0.919519
1 0.080481
1 0 0.061433
1 0.938567
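The conditional table is the pair count divided by the marginal count of the variable being conditioned on, i.e. P(ALCOHOL | ALCOHOL-1) = count(ALCOHOL-1, ALCOHOL) / count(ALCOHOL-1). A plain-Python sketch of that division, with `condition_on_first` as a name chosen for this example:

```python
from collections import defaultdict

# (ALCOHOL-1, ALCOHOL) counts copied from the table above.
counts = {(0, 0): 994, (0, 1): 87, (1, 0): 90, (1, 1): 1375}


def condition_on_first(table):
    """Divide each entry by the total of all entries sharing its first coordinate."""
    totals = defaultdict(int)
    for (first, _), n in table.items():
        totals[first] += n
    return {key: n / totals[key[0]] for key, n in table.items()}


cond = condition_on_first(counts)  # P(ALCOHOL | ALCOHOL-1)
for key, p in cond.items():
    print(key, round(p, 6))
```

Each group of entries that shares a value of ALCOHOL-1 now sums to 1, which is what distinguishes a conditional table from a joint one.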
tt = pstore[s.alcohol, s.tobacco]
tt
pval
ALCOHOL TOBACCO
0 0 240
1 903
1 0 179
1 1343
tt / tt[s.alcohol]
pval
ALCOHOL TOBACCO
0 0 0.209974
1 0.790026
1 0 0.117608
1 0.882392
tt / tt[s.tobacco]
pval
ALCOHOL TOBACCO
0 0 0.572792
1 0 0.427208
0 1 0.402048
1 1 0.597952
Scrap place
t = pstore[s.as_prescribed_opioid-1, s.heroin-1, s.heroin]
t
pval
AS PRESCRIBED OPIOID-1 HEROIN-1 HEROIN
0 0 0 927
1 172
1 0 99
1 936
1 0 0 249
1 33
1 0 15
1 115
tt = t / t[s.as_prescribed_opioid-1, s.heroin-1] # cond prob!
tt
pval
AS PRESCRIBED OPIOID-1 HEROIN-1 HEROIN
0 0 0 0.843494
1 0.156506
1 0 0.095652
1 0.904348
1 0 0 0.882979
1 0.117021
1 0 0.115385
1 0.884615
tt.tb
AS PRESCRIBED OPIOID-1 | HEROIN-1 | HEROIN | pval |
---|---|---|---|
0 | 0 | 0 | 0.843494 |
0 | 0 | 1 | 0.156506 |
0 | 1 | 0 | 0.095652 |
0 | 1 | 1 | 0.904348 |
1 | 0 | 0 | 0.882979 |
1 | 0 | 1 | 0.117021 |
1 | 1 | 0 | 0.115385 |
1 | 1 | 1 | 0.884615 |
AS PRESCRIBED OPIOID-1 HEROIN-1 HEROIN
0 0 0 0.843494
0 0 1 0.156506
1 0 0 0.882979
1 0 1 0.117021
0.117021 / 0.156506
0.7477093529960512
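The division above compares two rows of the conditional table: the chance of using heroin this year, given no heroin the year before, with and without as-prescribed opioid use the year before. The sketch below recomputes both probabilities and their ratio straight from the counts, so nothing is lost to rounding. `prob_heroin_onset` is a name made up for this example.

```python
# Counts keyed by (AS PRESCRIBED OPIOID-1, HEROIN-1, HEROIN), copied from the table above.
counts = {
    (0, 0, 0): 927, (0, 0, 1): 172, (0, 1, 0): 99, (0, 1, 1): 936,
    (1, 0, 0): 249, (1, 0, 1): 33, (1, 1, 0): 15, (1, 1, 1): 115,
}


def prob_heroin_onset(counts, prescribed_before):
    """P(HEROIN=1 | AS PRESCRIBED OPIOID-1=prescribed_before, HEROIN-1=0)."""
    started = counts[(prescribed_before, 0, 1)]
    did_not_start = counts[(prescribed_before, 0, 0)]
    return started / (started + did_not_start)


p_with = prob_heroin_onset(counts, 1)
p_without = prob_heroin_onset(counts, 0)
ratio = p_with / p_without
print(round(p_with, 6), round(p_without, 6), round(ratio, 4))
```

The ratio is a comparison of two observed frequencies in this sample, nothing more.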
prob_of_heroin_given_presc_op = 0.359223
prob_of_heroin_given_not_presc_op = 0.519213
prob_of_heroin_given_presc_op / prob_of_heroin_given_not_presc_op
0.6918605658949217
prob_of_heroin_given_not_presc_op / prob_of_heroin_given_presc_op
1.4453779407220584
Potential Calculus Experimentations
# survey_dir = '/D/Dropbox/others/Miriam/python/ProcessedSurveys'
df_store = DfStore(survey_dir + '/{}.xlsx')
len(df_store)
119
cstore = VarSetCountsStore(df_store)
v = mk_pvar_struct(df_store, only_for_cols_in_all_dfs=True)
s = mk_pvar_str_struct(v)
f, df = cstore.df_store.head()
df.head(3)
category | RURAL | SUBURBAN | URBAN/CITY | HOMELESS | INCARCERATION | WORK | SON/DAUGHTER | SIBLING | FATHER/MOTHER | SPOUSE | ... | HAL/LSD/XTC/CLUBDRUG | COCAINE/CRACK | METHAMPHETAMINE | AS PRESCRIBED OPIOID | NOT AS PRESCRIBED OPIOID | HEROIN | OTHER OPIOID | INJECTED | IN TREATMENT | Massachusetts |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
age | |||||||||||||||||||||
16 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 1 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
17 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 |
18 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 |
3 rows × 29 columns
cstore = VarSetCountsStore(df_store)
cstore.mk_pvar_attrs()
from odus.dacc import DfStore, counts_of_kps, Dacc, plot_life_course, VarSetCountsStore, mk_pvar_struct, PotStore
pstore = PotStore(df_store)
pstore.mk_pvar_attrs()
p = pstore[v.homeless - 1, v.incarceration]
p
pval
HOMELESS-1 INCARCERATION
0 0 1690
1 577
1 0 192
1 87
p / []
pval
HOMELESS-1 INCARCERATION
0 0 0.663786
1 0.226630
1 0 0.075412
1 0.034171
pstore[v.incarceration]
pval
INCARCERATION
0 1989
1 676
pstore[v.alcohol-1, v.loss_of_loved_one]
pval
ALCOHOL-1 LOSS OF LOVED ONE
0 0 990
1 91
1 0 1321
1 144
tw = pstore[v.tobacco, v.work]
mw = pstore[v.marijuana, v.work]
aw = pstore[v.alcohol, v.work]
w = pstore[v.work]
evid_t = Pot.from_hard_evidence(**{s.tobacco: 1})
evid_m = Pot.from_hard_evidence(**{s.marijuana: 1})
evid_a = Pot.from_hard_evidence(**{s.alcohol: 1})
evid_a
pval
ALCOHOL
1 1
aw
pval
ALCOHOL WORK
0 0 431
1 712
1 0 448
1 1074
w / []
pval
WORK
0 0.329831
1 0.670169
(evid_m * mw) / []
pval
MARIJUANA WORK
1 0 0.350603
1 0.649397
(evid_t * tw) / []
pval
TOBACCO WORK
1 0 0.313001
1 0.686999
(evid_a * aw) / []
pval
ALCOHOL WORK
1 0 0.29435
1 0.70565
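Judging from the outputs, multiplying by a hard-evidence pot keeps only the rows that agree with the evidence, and `/ []` then renormalizes what is left. That reading is inferred from the numbers above, not from the `spyn` source. The sketch below redoes the ALCOHOL/WORK case in plain Python and compares it with the unconditioned WORK distribution; `apply_hard_evidence` and `normalize` are names chosen for this example.

```python
# (ALCOHOL, WORK) counts copied from the aw table above.
aw = {(0, 0): 431, (0, 1): 712, (1, 0): 448, (1, 1): 1074}


def normalize(table):
    """Divide every entry by the sum of all entries."""
    total = sum(table.values())
    return {key: value / total for key, value in table.items()}


def apply_hard_evidence(table, position, value):
    """Keep only entries whose coordinate at `position` equals `value`."""
    return {key: n for key, n in table.items() if key[position] == value}


# P(WORK | ALCOHOL=1): filter on the evidence, then renormalize.
work_given_alcohol = normalize(apply_hard_evidence(aw, position=0, value=1))

# P(WORK) with no evidence: sum out ALCOHOL, then normalize.
work_totals = {}
for (_, work), n in aw.items():
    work_totals[work] = work_totals.get(work, 0) + n
work_marginal = normalize(work_totals)

print(round(work_given_alcohol[(1, 1)], 5), round(work_marginal[1], 6))
```

Comparing the two numbers shows how much the evidence shifts the distribution: here the share of WORK=1 moves from about 0.670 to about 0.706 once ALCOHOL=1 is given.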
Extra scrap
# from graphviz import Digraph
# Digraph(body="""
# raw -> data -> count -> prob
# raw [label="excel files (one per respondent)" shape=folder]
# data [label="dataframes" shape=folder]
# count [label="counts for any combinations of the variables in the data" shape=box3d]
# prob [label="probabilities for any combinations of the variables in the data" shape=box3d]
# """.split('\n'))
Acknowledgements
This study was supported by National Institute on Drug Abuse grants R15DA041657 and R21DA025298. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institute on Drug Abuse or the National Institutes of Health.
Grants: two from the National Institute on Drug Abuse, plus one from NAHDAP with Thor Whalen as PI.
National Institutes of Health, National Institute on Drug Abuse
2017-2020
1 R15 DA041657
Miriam Boeri, Aukje Lamonica, MPIs
Award: $341,565
“Suburban Opioid Study” (SOS)
National Institutes of Health, National Institute on Drug Abuse, American Recovery and Reinvestment Act
2009-2011
R21DA025298
Miriam Boeri, PI
Thor Whalen, Co-investigator
Award: $367,820
“Older Drug Users: A Life Course Study of Turning Points in Drug Use and Injection.”
National Addiction & HIV Data Archive Program (NAHDAP)
2010-2011
University of Michigan’s Inter-university Consortium for Political and Social Research (ICPSR)
Thor Whalen, PI
Data archived at http://dx.doi.org/10.3886/ICPSR34296.v1