Tools to provide easy access to prepared data to data scientists that can't be asked.

1 Introduction
2 Installation
3 Getting some resources
4 Poking around
5 Potential Calculus Experimentations
6 Extra scrap

# %load_ext autoreload
# %autoreload 2

Introduction

ODUS (for Older Drug User Study) contains data and tools to study the drug use of older drug users.

Essentially, these are tools:

To get prepared data on 119 "trajectories", each describing 31 variables (drug use, social, etc.) over time for one of 119 respondents.
To visualize these trajectories in various ways
To create pdfs of any selection of these trajectories and variables
To make count tables for any combination of the variables, an essential step of any Markovian or Bayesian analysis.
To make probability (joint or conditional) tables from any combination of the variables
To operate on these count and probability tables, thus enabling inference operations

Installation

You need to have python 3.7+ to run this notebook.

And you'll need to have odus, which you get by doing

pip install odus

(And if you don't have pip then, well... how to put it... ha ha ha!)

But if you're the type, you can also just get the source from https://github.com/thorwhalen/odus.

Oh, and pull requests etc. are welcome!

Stars, likes, references, and coffee also welcome.

And if you want to donate: donate to a charity that helps people understand substance use and shape the policies surrounding it.

A simple flowchart about the architecture:

Getting some resources

from matplotlib.pylab import *
from numpy import *
import seaborn as sns

import os
from py2store.stores.local_store import RelativePathFormatStore
from py2store.mixins import ReadOnlyMixin
from py2store.base import Store


from io import BytesIO
from spyn.ppi.pot import Pot, ProbPot
from collections import UserDict, Counter
import numpy as np
import pandas as pd

from ut.ml.feature_extraction.sequential_var_sets import PVar, VarSet, DfData, VarSetFactory
from IPython.display import Image

from odus.analysis_utils import *

from odus.dacc import DfStore, counts_of_kps, Dacc, VarSetCountsStore, \
    mk_pvar_struct, PotStore, _commun_columns_of_dfs, Struct, mk_pvar_str_struct, VarStr

from odus.plot_utils import plot_life_course

from odus import data_dir, data_path_of
survey_dir = data_dir
data_dir

'/D/Dropbox/dev/p3/proj/odus/odus/data'

df_store = DfStore(data_dir + '/{}.xlsx')
len(df_store)
cstore = VarSetCountsStore(df_store)
v = mk_pvar_struct(df_store, only_for_cols_in_all_dfs=True)
s = mk_pvar_str_struct(v)
f, df = cstore.df_store.head()
pstore = PotStore(df_store)

Poking around

df_store

A df_store is a key-value store where each key is the name of an xlsx file and each value is the prepared dataframe for that file.
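
The key-value idea can be tried out without odus installed. What follows is only a minimal sketch of a read-only store built on a path template like the '{}.xlsx' one used above. `FileStore` and its `loader` argument are hypothetical names (not the odus or py2store API), and throwaway text files stand in for the spreadsheets.

```python
import glob
import os
import tempfile
from collections.abc import Mapping


class FileStore(Mapping):
    """Read-only mapping: key -> loader(path), for paths matching a template.

    Hypothetical illustration only; this is not the odus DfStore implementation.
    """

    def __init__(self, path_template, loader):
        self._template = path_template  # e.g. '/some/dir/{}.txt'
        self._loader = loader           # function: file path -> value
        # Split the template around '{}' to convert between paths and keys
        self._prefix, self._suffix = path_template.split('{}')

    def __iter__(self):
        for path in sorted(glob.glob(self._template.format('*'))):
            yield path[len(self._prefix):len(path) - len(self._suffix)]

    def __len__(self):
        return sum(1 for _ in self)

    def __getitem__(self, key):
        path = self._template.format(key)
        if not os.path.isfile(path):
            raise KeyError(key)
        return self._loader(path)


def read_text(path):
    with open(path) as f:
        return f.read()


# Demo with two throwaway files standing in for respondent spreadsheets
demo_dir = tempfile.mkdtemp()
for name, content in [('A01', 'first'), ('B24', 'second')]:
    with open(os.path.join(demo_dir, name + '.txt'), 'w') as f:
        f.write(content)

store = FileStore(os.path.join(demo_dir, '{}.txt'), read_text)
print(list(store), len(store), store['B24'])
```

Because the class only implements `__iter__`, `__len__` and `__getitem__`, the usual `keys()`, `values()`, `items()` and `in` come for free from `Mapping`, which is why `len(df_store)` and `df_store.values()` read so naturally in the cells below.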

len(df_store)

it = iter(df_store.values())
for i in range(5):  # skip five first
    _ = next(it)
df = next(it)  # get the one I want
df.head(3)

category	RURAL	SUBURBAN	URBAN/CITY	HOMELESS	INCARCERATION	WORK	SON/DAUGHTER	SIBLING	FATHER/MOTHER	SPOUSE	...	METHAMPHETAMINE	AS PRESCRIBED OPIOID	NOT AS PRESCRIBED OPIOID	HEROIN	OTHER OPIOID	INJECTED	IN TREATMENT	Selects States below	Georgia	Pennsylvania
age
11	0	1	0	0	0	0	1	1	0	0	...	0	0	0	0	0	0	0	1	1	0
12	0	1	0	0	0	0	1	1	0	0	...	0	1	0	0	0	0	0	1	1	0
13	0	1	0	0	0	0	1	1	0	0	...	0	0	0	0	0	0	0	1	1	0

3 rows × 31 columns

print(df.columns.values)

['RURAL' 'SUBURBAN' 'URBAN/CITY' 'HOMELESS' 'INCARCERATION' 'WORK'
 'SON/DAUGHTER' 'SIBLING' 'FATHER/MOTHER' 'SPOUSE'
 'OTHER (WHO?, FILL IN BRACKETS HERE)' 'FRIEND USER' 'FRIEND NON USER'
 'MENTAL ILLNESS' 'PHYSICAL ILLNESS' 'LOSS OF LOVED ONE' 'TOBACCO'
 'MARIJUANA' 'ALCOHOL' 'HAL/LSD/XTC/CLUBDRUG' 'COCAINE/CRACK'
 'METHAMPHETAMINE' 'AS PRESCRIBED OPIOID' 'NOT AS PRESCRIBED OPIOID'
 'HEROIN' 'OTHER OPIOID' 'INJECTED' 'IN TREATMENT' 'Selects States below'
 'Georgia' 'Pennsylvania']

t = df[['ALCOHOL', 'TOBACCO']]
t.head(3)

category	ALCOHOL	TOBACCO
age
11	0	0
12	0	0
13	0	0

c = Counter()
for i, r in t.iterrows():
    c.update([tuple(r.to_list())])
c

Counter({(0, 0): 6, (1, 0): 4, (1, 1): 9, (0, 1): 2})

def count_tuples(dataframe):
    c = Counter()
    for i, r in dataframe.iterrows():
        c.update([tuple(r.to_list())])
    return c

fields = ['ALCOHOL', 'TOBACCO']
# do it for every one
c = Counter()
for df in df_store.values():
    c.update(count_tuples(df[fields]))
c

Counter({(0, 1): 903, (1, 1): 1343, (0, 0): 240, (1, 0): 179})

pd.Series(c)

0  1     903
1  1    1343
0  0     240
1  0     179
dtype: int64

# Powerful! You can use that with several pairs and get some nice probabilities. Look up Naive Bayes.
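
To make the "nice probabilities" remark concrete, here is a small sketch in plain Python that turns the count table above into a joint distribution and a marginal count table. The helper names `joint_probabilities` and `marginal` are made up for this illustration; the numbers are copied from the Counter output above.

```python
from collections import Counter

# Counts of (ALCOHOL, TOBACCO) over all respondents, copied from the output above
counts = Counter({(0, 1): 903, (1, 1): 1343, (0, 0): 240, (1, 0): 179})


def joint_probabilities(counts):
    """Divide every count by the grand total to get a joint distribution."""
    total = sum(counts.values())
    return {combo: n / total for combo, n in counts.items()}


def marginal(counts, position):
    """Sum the counts over every variable except the one at `position`."""
    out = Counter()
    for combo, n in counts.items():
        out[combo[position]] += n
    return out


joint = joint_probabilities(counts)
alcohol_counts = marginal(counts, 0)  # position 0 is ALCOHOL
print(joint[(1, 1)])
print(alcohol_counts)
```

Pairwise tables like this one (a feature against a target) are the kind of ingredient a Naive Bayes model is built from, which is presumably what the comment above is hinting at.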

Viewing trajectories

import itertools
from functools import partial
from odus.util import write_images
from odus.plot_utils import plot_life, life_plots, write_trajectories_to_file

ihead = lambda it: itertools.islice(it, 0, 5)

Viewing a single trajectory

k = next(iter(df_store))  # get the first key
print(f"k: {k}")  # print it
plot_life(df_store[k])  # plot the trajectory

k: surveys/B24.xlsx

plot_life(df_store[k], fields=[s.in_treatment, s.injected])  # only want two fields

Flip over all (or some) trajectories

gen = life_plots(df_store)

next(gen)  # launch to get the next trajectory

<matplotlib.axes._subplots.AxesSubplot at 0x12b21f070>

Get the first ten trajectories, but only over four fields.

# fields = [s.in_treatment, s.injected]
fields = [s.physical_illness, s.as_prescribed_opioid, s.heroin, s.other_opioid]
keys = list(df_store)[:10]
# print(f"keys={keys}")
axs = [x for x in life_plots(df_store, fields, keys=keys)];

Making a pdf of trajectories

write_trajectories_to_file(df_store, fields, keys, fp='three_respondents_two_fields.pdf');

write_trajectories_to_file(df_store, fp='all_respondents_all_fields.pdf');

Demo s and v

print(list(filter(lambda x: not x.startswith('__'), dir(s))))

['alcohol', 'as_prescribed_opioid', 'cocaine_crack', 'father_mother', 'hal_lsd_xtc_clubdrug', 'heroin', 'homeless', 'in_treatment', 'incarceration', 'injected', 'loss_of_loved_one', 'marijuana', 'mental_illness', 'methamphetamine', 'not_as_prescribed_opioid', 'other_opioid', 'physical_illness', 'rural', 'sibling', 'son_daughter', 'suburban', 'tobacco', 'urban_city', 'work']

s.heroin

'HEROIN'

v.heroin

PVar('HEROIN', 0)

v.heroin - 1

PVar('HEROIN', -1)
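
The `v.heroin - 1` trick reads as "the same variable, one step earlier". The real `PVar` comes from the `ut` package and its internals are not shown here, so the following is only a guess at the idea, using a hypothetical `LaggedVar` class: a column name plus a row offset, with subtraction shifting the offset.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LaggedVar:
    """A column name plus a row offset (0 = current row, -1 = previous row).

    Hypothetical stand-in for illustration; not the actual PVar class.
    """
    name: str
    offset: int = 0

    def __sub__(self, lag):
        # var - 1 means "the same variable, one step earlier"
        return LaggedVar(self.name, self.offset - lag)

    def __str__(self):
        # No suffix for the current row, a signed suffix otherwise
        return self.name if self.offset == 0 else f"{self.name}{self.offset:+d}"


heroin = LaggedVar('HEROIN')
print(repr(heroin - 1))  # LaggedVar(name='HEROIN', offset=-1)
print(str(heroin - 1))   # HEROIN-1
```

Making the class frozen keeps instances hashable, so tuples of them can serve as keys, much like the tuples of variables used to index cstore below.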

cstore

# cstore[v.alcohol, v.tobacco]
cstore[v.as_prescribed_opioid-1, v.heroin]

Counter({(0, 0): 1026, (1, 0): 264, (0, 1): 1108, (1, 1): 148})

pd.Series(cstore[v.as_prescribed_opioid-1, v.heroin])

0  0    1026
1  0     264
0  1    1108
1  1     148
dtype: int64

cstore[v.alcohol, v.tobacco, v.heroin]

Counter({(0, 0, 1): 427,
         (1, 0, 1): 656,
         (1, 1, 1): 687,
         (0, 0, 0): 189,
         (0, 1, 1): 476,
         (0, 1, 0): 51,
         (1, 0, 0): 133,
         (1, 1, 0): 46})

cstore[v.alcohol-1, v.alcohol]

Counter({(0, 0): 994, (1, 1): 1375, (1, 0): 90, (0, 1): 87})

cstore[v.alcohol-1, v.alcohol, v.tobacco]

Counter({(0, 0, 1): 807,
         (1, 1, 1): 1220,
         (1, 0, 0): 26,
         (0, 1, 1): 76,
         (0, 0, 0): 187,
         (1, 1, 0): 155,
         (0, 1, 0): 11,
         (1, 0, 1): 64})

t = pd.Series(cstore[v.alcohol-1, v.alcohol, v.tobacco])
t.loc[t.index]

<pandas.core.indexing._LocIndexer at 0x130955db0>

pstore

t = pstore[s.alcohol-1, s.alcohol]
t

                   pval
ALCOHOL-1 ALCOHOL      
0         0         994
          1          87
1         0          90
          1        1375

t.tb

ALCOHOL-1	ALCOHOL	pval
0	0	994
0	1	87
1	0	90
1	1	1375

t / []

                       pval
ALCOHOL-1 ALCOHOL          
0         0        0.390416
          1        0.034171
1         0        0.035350
          1        0.540063

t[s.alcohol-1]

           pval
ALCOHOL-1      
0          1081
1          1465

t / t[s.alcohol-1]  # cond prob!

                       pval
ALCOHOL-1 ALCOHOL          
0         0        0.919519
          1        0.080481
1         0        0.061433
          1        0.938567
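
The division above can be checked by hand: each count is divided by the total of its conditioning group, for instance 994 / (994 + 87) ≈ 0.919519. A plain-Python sketch of that arithmetic follows, with a hypothetical helper `conditional`. It mirrors the numbers, not necessarily how the Pot class does it internally.

```python
from collections import defaultdict

# Counts of (ALCOHOL-1, ALCOHOL), copied from the table above
counts = {(0, 0): 994, (0, 1): 87, (1, 0): 90, (1, 1): 1375}


def conditional(counts, given):
    """P(all variables | variables at the `given` positions), from raw counts."""
    totals = defaultdict(int)
    for combo, n in counts.items():
        totals[tuple(combo[i] for i in given)] += n
    return {
        combo: n / totals[tuple(combo[i] for i in given)]
        for combo, n in counts.items()
    }


cond = conditional(counts, given=[0])  # condition on ALCOHOL-1
for combo in sorted(cond):
    print(combo, round(cond[combo], 6))
```

Passing several positions in `given` conditions on several variables at once, which is the situation in the three-variable example of the scrap section.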

tt = pstore[s.alcohol, s.tobacco]
tt

                 pval
ALCOHOL TOBACCO      
0       0         240
        1         903
1       0         179
        1        1343

tt / tt[s.alcohol]

                     pval
ALCOHOL TOBACCO          
0       0        0.209974
        1        0.790026
1       0        0.117608
        1        0.882392

tt / tt[s.tobacco]

                     pval
ALCOHOL TOBACCO          
0       0        0.572792
1       0        0.427208
0       1        0.402048
1       1        0.597952

Scrap place

t = pstore[s.as_prescribed_opioid-1, s.heroin-1, s.heroin]
t

                                        pval
AS PRESCRIBED OPIOID-1 HEROIN-1 HEROIN      
0                      0        0        927
                                1        172
                       1        0         99
                                1        936
1                      0        0        249
                                1         33
                       1        0         15
                                1        115

tt = t / t[s.as_prescribed_opioid-1, s.heroin-1]  # cond prob!
tt

                                            pval
AS PRESCRIBED OPIOID-1 HEROIN-1 HEROIN          
0                      0        0       0.843494
                                1       0.156506
                       1        0       0.095652
                                1       0.904348
1                      0        0       0.882979
                                1       0.117021
                       1        0       0.115385
                                1       0.884615

tt.tb

AS PRESCRIBED OPIOID-1	HEROIN-1	HEROIN	pval
0	0	0	0.843494
0	0	1	0.156506
0	1	0	0.095652
0	1	1	0.904348
1	0	0	0.882979
1	0	1	0.117021
1	1	0	0.115385
1	1	1	0.884615

AS PRESCRIBED OPIOID-1	HEROIN-1	HEROIN	pval
0	0	0	0.843494
0	0	1	0.156506
1	0	0	0.882979
1	0	1	0.117021

0.117021 / 0.156506

0.7477093529960512

prob_of_heroin_given_presc_op = 0.359223
prob_of_heroin_given_not_presc_op = 0.519213

prob_of_heroin_given_presc_op / prob_of_heroin_given_not_presc_op

0.6918605658949217

prob_of_heroin_given_not_presc_op / prob_of_heroin_given_presc_op

1.4453779407220584
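
The two hard-coded probabilities above are consistent with the counts of `cstore[v.as_prescribed_opioid-1, v.heroin]` shown earlier: 148 / (264 + 148) ≈ 0.359223 and 1108 / (1026 + 1108) ≈ 0.519213. Assuming that is where they came from, the ratio (a risk ratio) can be recomputed straight from the counts rather than from copied decimals. A sketch, with a made-up helper name:

```python
# Counts of (AS PRESCRIBED OPIOID-1, HEROIN), copied from the cstore output earlier
counts = {(0, 0): 1026, (1, 0): 264, (0, 1): 1108, (1, 1): 148}


def prob_heroin_given(presc_op_prev):
    """P(HEROIN = 1 | AS PRESCRIBED OPIOID-1 = presc_op_prev), from the counts."""
    heroin = counts[(presc_op_prev, 1)]
    no_heroin = counts[(presc_op_prev, 0)]
    return heroin / (heroin + no_heroin)


p_given_presc = prob_heroin_given(1)      # 148 / 412
p_given_not_presc = prob_heroin_given(0)  # 1108 / 2134
ratio = p_given_presc / p_given_not_presc
print(round(p_given_presc, 6), round(p_given_not_presc, 6), round(ratio, 4))
```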

Potential Calculus Experimentations

# survey_dir = '/D/Dropbox/others/Miriam/python/ProcessedSurveys'
df_store = DfStore(survey_dir + '/{}.xlsx')
len(df_store)

cstore = VarSetCountsStore(df_store)
v = mk_pvar_struct(df_store, only_for_cols_in_all_dfs=True)
s = mk_pvar_str_struct(v)
f, df = cstore.df_store.head()
df.head(3)

category	RURAL	SUBURBAN	URBAN/CITY	HOMELESS	INCARCERATION	WORK	SON/DAUGHTER	SIBLING	FATHER/MOTHER	SPOUSE	...	HAL/LSD/XTC/CLUBDRUG	COCAINE/CRACK	METHAMPHETAMINE	AS PRESCRIBED OPIOID	NOT AS PRESCRIBED OPIOID	HEROIN	OTHER OPIOID	INJECTED	IN TREATMENT	Massachusetts
age
16	0	1	0	0	1	0	1	1	1	0	...	0	0	0	0	0	0	0	0	0	1
17	0	1	0	0	0	1	1	1	1	0	...	0	0	0	0	1	0	0	0	0	1
18	0	1	0	0	0	1	1	1	1	0	...	0	0	0	0	1	0	0	0	0	1

3 rows × 29 columns

cstore = VarSetCountsStore(df_store)
cstore.mk_pvar_attrs()

from odus.dacc import DfStore, counts_of_kps, Dacc, plot_life_course, VarSetCountsStore, mk_pvar_struct, PotStore
pstore = PotStore(df_store)
pstore.mk_pvar_attrs()
p = pstore[v.homeless - 1, v.incarceration]
p

                          pval
HOMELESS-1 INCARCERATION      
0          0              1690
           1               577
1          0               192
           1                87

p / []

                              pval
HOMELESS-1 INCARCERATION          
0          0              0.663786
           1              0.226630
1          0              0.075412
           1              0.034171

pstore[v.incarceration]

               pval
INCARCERATION      
0              1989
1               676

pstore[v.alcohol-1, v.loss_of_loved_one]

                             pval
ALCOHOL-1 LOSS OF LOVED ONE      
0         0                   990
          1                    91
1         0                  1321
          1                   144

tw = pstore[v.tobacco, v.work]
mw = pstore[v.marijuana, v.work]
aw = pstore[v.alcohol, v.work]
w = pstore[v.work]

evid_t = Pot.from_hard_evidence(**{s.tobacco: 1})
evid_m = Pot.from_hard_evidence(**{s.marijuana: 1})
evid_a = Pot.from_hard_evidence(**{s.alcohol: 1})
evid_a

         pval
ALCOHOL      
1           1

aw

              pval
ALCOHOL WORK      
0       0      431
        1      712
1       0      448
        1     1074

w / []

          pval
WORK          
0     0.329831
1     0.670169

(evid_m * mw) / []

                    pval
MARIJUANA WORK          
1         0     0.350603
          1     0.649397

(evid_t * tw) / []

                  pval
TOBACCO WORK          
1       0     0.313001
        1     0.686999

(evid_a * aw) / []

                 pval
ALCOHOL WORK         
1       0     0.29435
        1     0.70565
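
Judging from the outputs above, multiplying a table by a hard-evidence pot keeps only the rows that agree with the evidence, and `/ []` then rescales what is left so it sums to 1. The plain-Python sketch below reproduces the alcohol numbers under that reading. The function names are hypothetical and are not part of the spyn API.

```python
# Counts of (ALCOHOL, WORK), copied from the `aw` table above
aw_counts = {(0, 0): 431, (0, 1): 712, (1, 0): 448, (1, 1): 1074}


def apply_hard_evidence(counts, position, value):
    """Keep only the entries whose variable at `position` equals `value`."""
    return {combo: n for combo, n in counts.items() if combo[position] == value}


def normalize(table):
    """Rescale the values so that they sum to 1."""
    total = sum(table.values())
    return {combo: n / total for combo, n in table.items()}


# Evidence ALCOHOL = 1, then normalize: this is P(WORK | ALCOHOL = 1)
work_given_alcohol = normalize(apply_hard_evidence(aw_counts, position=0, value=1))
for combo, p in sorted(work_given_alcohol.items()):
    print(combo, round(p, 5))
```

Here 448 / (448 + 1074) ≈ 0.29435 and 1074 / (448 + 1074) ≈ 0.70565, matching the last table above.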

Extra scrap

# from graphviz import Digraph
# Digraph(body="""
# raw -> data -> count -> prob
# raw [label="excel files (one per respondent)" shape=folder]
# data [label="dataframes" shape=folder]
# count [label="counts for any combinations of the variables in the data" shape=box3d]
# prob [label="probabilities for any combinations of the variables in the data" shape=box3d]
# """.split('\n'))

Release history

0.0.6 (this version): Aug 3, 2020
0.0.5: Jul 5, 2020
0.0.3: Jul 5, 2020
0.0.1: May 12, 2020
