"A tool to curate a test set from a set of constraints"

Curation Magic

Automagically curate test sets based on user-given constraints

Did you ever need to sub-sample a pool of samples according to a strict set of conditions? Perhaps when designing a test set for an experiment? This package provides an easy way to sub-sample a dataframe.

The user provides two dataframes: the first holds the sample pool, and the second holds queries over these samples, each with the intended number of samples in the curated set that should satisfy it.

Install

pip install curation_magic

Instructions

Our goal is to curate a subset from a general pool of samples that satisfies a list of conditions as closely as possible.

The pool of samples is given in a dataframe, which we'll call df_samples. It has one row per sample, and its columns hold all sorts of metadata and features of the samples.

Let's see an example:

# Load dataframe from file.
import pandas as pd

df_samples = pd.read_csv('csvs/curation_pool.csv', 
                         converters={'age':int, 'birad':int})
df_samples = df_samples.set_index('study_id')
df_samples.sample(7)
          exists data_source  age  density  birad    lesion_type  largest_mass  is_pos
study_id
219            1     optimam   60        4      0  calcification           NaN       1
910            1     optimam   44        3      0  calcification           NaN       1
499            1     optimam   57        4      2            NaN           NaN       0
1250           1     optimam   56        0      0            NaN           NaN       0
1438           1         imh   50        3      2            NaN           NaN       0
1339           1         imh   44        2      1            NaN           NaN       0
191            1     optimam  101        2      0           mass         10.31       1

The conditions are given in a second dataframe, df_cond_abs. Each row of df_cond_abs is indexed by a query that can be applied to df_samples (i.e. by using df_samples.query(query_string)). For each query, the user specifies how many samples in the curated subset should satisfy it, as a lower bound (min) and an upper bound (max). The index_ref column can be ignored for now.

# Get absolute numbers constraints 
df_cond_abs = pd.read_csv('csvs/curation_conditions_abs.csv').set_index('query')
df_cond_abs
                                                            min  max  index_ref
query
is_pos == "1"                                               400  400         -1
is_pos == "0"                                               400  400         -1
data_source == "optimam" & is_pos == "0"                    160  240         -1
data_source == "imh" & is_pos == "0"                        160  240         -1
data_source == "optimam" & is_pos == "1"                    160  240         -1
data_source == "imh" & is_pos == "1"                        160  240         -1
lesion_type == "mass" & is_pos == "1"                       270  300         -1
lesion_type == "calcification" & is_pos == "1"              110  140         -1
birad == "1" & is_pos == "0"                                300  320         -1
birad == "2" & is_pos == "0"                                 80  100         -1
lesion_type == "mass" & largest_mass<=10                     30   40         -1
lesion_type == "mass" & largest_mass>10 & largest_mass<=20  140  180         -1
lesion_type == "mass" & largest_mass>20 & largest_mass<=50   75  110         -1
age<50                                                      200  240         -1
age<60 & age>=50                                            216  264         -1
age<70 & age>=60                                            176  208         -1
age>=70                                                     120  160         -1
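
If no CSV file is at hand, a conditions table with the same layout (a query index plus min, max and index_ref columns) can presumably be assembled directly in pandas. The sketch below only reproduces the layout shown above for a few of the rows; whether the curators expect anything beyond these columns is an assumption that has not been checked against the package.

```python
import pandas as pd

# A few rows copied from the table above: (query, min, max, index_ref).
rows = [
    ('is_pos == "1"', 400, 400, -1),
    ('is_pos == "0"', 400, 400, -1),
    ('age<50', 200, 240, -1),
    ('age<60 & age>=50', 216, 264, -1),
]

df_cond = pd.DataFrame(
    rows, columns=["query", "min", "max", "index_ref"]
).set_index("query")

# Sanity check: every lower bound must not exceed its upper bound.
bounds_ok = bool((df_cond["min"] <= df_cond["max"]).all())
print(df_cond)
```

A check like bounds_ok catches swapped bounds before they reach the solver.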

The function get_query_features_df applies all the queries on the df_samples dataframe, and we obtain df_bool, a boolean dataframe which has the samples as rows and the queries as columns. df_bool indicates which sample matches which query.

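# Assumes the curator module has already been imported, e.g.
# `from curation_magic import curator` (the import is not shown in the original).
df_bool = curator.get_query_features_df(df_samples, df_cond_abs.index)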
df_bool.head()
is_pos == "1" is_pos == "0" data_source == "optimam" & is_pos == "0" data_source == "imh" & is_pos == "0" data_source == "optimam" & is_pos == "1" data_source == "imh" & is_pos == "1" lesion_type == "mass" & is_pos == "1" lesion_type == "calcification" & is_pos == "1" birad == "1" & is_pos == "0" birad == "2" & is_pos == "0" lesion_type == "mass" & largest_mass<=10 lesion_type == "mass" & largest_mass>10 & largest_mass<=20 lesion_type == "mass" & largest_mass>20 & largest_mass<=50 age<50 age<60 & age>=50 age<70 & age>=60 age>=70
study_id
0 True False False False True False False True False False False False False False True False False
1 True False False False True False True False False False False True False False False False True
2 True False False False True False True False False False False True False False False False True
3 True False False False True False True False False False False True False False False True False
4 True False False False False True False False False False False False False True False False False

We can use this table to quickly see how many samples in our pool satisfy each query:

df_bool.sum()
is_pos == "1"                                                 811
is_pos == "0"                                                 655
data_source == "optimam" & is_pos == "0"                      301
data_source == "imh" & is_pos == "0"                          354
data_source == "optimam" & is_pos == "1"                      653
data_source == "imh" & is_pos == "1"                          158
lesion_type == "mass" & is_pos == "1"                         556
lesion_type == "calcification" & is_pos == "1"                188
birad == "1" & is_pos == "0"                                  399
birad == "2" & is_pos == "0"                                  195
lesion_type == "mass" & largest_mass<=10                       58
lesion_type == "mass" & largest_mass>10 & largest_mass<=20    310
lesion_type == "mass" & largest_mass>20 & largest_mass<=50    178
age<50                                                        256
age<60 & age>=50                                              489
age<70 & age>=60                                              520
age>=70                                                       201
dtype: int64
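
For intuition, a boolean table of this kind can be built with plain pandas by evaluating each query string with DataFrame.eval. This is a minimal sketch on toy data, not the package's implementation: the helper name build_bool_table and the toy frame are hypothetical, and the toy queries compare against integers rather than the quoted values used above.

```python
import pandas as pd

def build_bool_table(df, queries):
    """Hypothetical helper: one boolean column per query string."""
    return pd.DataFrame({q: df.eval(q) for q in queries}, index=df.index)

# Toy sample pool (made-up values, for illustration only).
toy = pd.DataFrame(
    {"age": [45, 52, 67, 71], "is_pos": [1, 0, 1, 0]},
    index=pd.Index([10, 11, 12, 13], name="study_id"),
)
queries = ["is_pos == 1", "age < 50", "age >= 50 & age < 70"]

toy_bool = build_bool_table(toy, queries)
counts = toy_bool.sum()  # number of pool samples matching each query
print(counts)
```

Comparing such counts with the min column shows early on whether a lower bound can be met at all: a query whose pool count is below its min will always be violated.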

Curate a subset using absolute bounds

Let's use the AbsBoundariesCurator to build a curated set that satisfies all the conditions as much as possible:

abc = curator.AbsBoundariesCurator(df_samples, df_cond_abs)

# Note: here we use the interior-point solver, which is
# faster but less accurate than the default simplex solver.
included, summary = abc.run(method='interior-point')

# The summary shows how many were included from every query,
# and the total number of violations.
summary
Theoretical violations: 4.000000001349921
included: 799
actual violations: 5
                                                            cnt  min  max  violation
is_pos == "1"                                               399  400  400          1
is_pos == "0"                                               400  400  400          0
data_source == "optimam" & is_pos == "0"                    161  160  240          0
data_source == "imh" & is_pos == "0"                        239  160  240          0
data_source == "optimam" & is_pos == "1"                    241  160  240          1
data_source == "imh" & is_pos == "1"                        158  160  240          2
lesion_type == "mass" & is_pos == "1"                       269  270  300          1
lesion_type == "calcification" & is_pos == "1"              111  110  140          0
birad == "1" & is_pos == "0"                                303  300  320          0
birad == "2" & is_pos == "0"                                 85   80  100          0
lesion_type == "mass" & largest_mass<=10                     34   30   40          0
lesion_type == "mass" & largest_mass>10 & largest_mass<=20  147  140  180          0
lesion_type == "mass" & largest_mass>20 & largest_mass<=50   84   75  110          0
age<50                                                      212  200  240          0
age<60 & age>=50                                            249  216  264          0
age<70 & age>=60                                            198  176  208          0
age>=70                                                     140  120  160          0

As the summary shows, the LP solution has about 4 violations, but after decoding it (rounding the $x_j$ values to decide which samples to include) there are 5 violations in total. The optimal LP objective value is always a lower bound on the objective of the integer program.
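
The lower-bound claim can be made concrete with one plausible formulation. This is an assumption for illustration; the package's exact objective and constraints are not shown here. Let $B_{ij} \in \{0, 1\}$ indicate whether sample $j$ satisfies query $i$ (the entries of df_bool), let $x_j$ indicate whether sample $j$ is included, and let $u_i, v_i \ge 0$ measure how far query $i$ falls below its min or above its max:

```latex
\begin{aligned}
\min_{x,\,u,\,v} \quad & \sum_i (u_i + v_i) \\
\text{s.t.} \quad & \sum_j B_{ij}\, x_j + u_i \ge \mathrm{min}_i, \\
                  & \sum_j B_{ij}\, x_j - v_i \le \mathrm{max}_i, \\
                  & u_i \ge 0, \quad v_i \ge 0, \quad 0 \le x_j \le 1 .
\end{aligned}
```

Requiring $x_j \in \{0, 1\}$ gives the integer program, while allowing $0 \le x_j \le 1$ gives the LP. Every integer solution is also feasible for the LP, so the LP optimum cannot exceed the integer optimum, and the rounded solution can only match or exceed both.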

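Our curated set has 799 members instead of 800 because it is one positive short (399 instead of 400). In addition, there is one positive study too many from optimam (241, upper bound 240), two too few positive studies from imh (158, lower bound 160), and one positive mass study too few (269, lower bound 270). The imh shortfall is unavoidable: the pool contains only 158 positive imh studies.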
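
The violation column is consistent with measuring how far cnt lies outside the interval [min, max]. That reading is inferred from the summary table rather than from the package source, and the function name violation is hypothetical. A short sketch that recomputes it for a few of the rows:

```python
# (query, cnt, min, max) rows copied from the summary above.
rows = [
    ('is_pos == "1"', 399, 400, 400),
    ('is_pos == "0"', 400, 400, 400),
    ('data_source == "optimam" & is_pos == "1"', 241, 160, 240),
    ('data_source == "imh" & is_pos == "1"', 158, 160, 240),
    ('lesion_type == "mass" & is_pos == "1"', 269, 270, 300),
    ('age<50', 212, 200, 240),
]

def violation(cnt, lo, hi):
    """Distance of cnt from the closed interval [lo, hi]; 0 if inside."""
    return max(lo - cnt, 0) + max(cnt - hi, 0)

violations = {query: violation(cnt, lo, hi) for query, cnt, lo, hi in rows}
total = sum(violations.values())
print(total)
```

The rows left out here all lie within their bounds, so the total over these rows equals the total over the full table.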

Now we can go back to the original samples dataframe, and generate the new set.

df_subset = df_samples[included]
print(len(df_subset))
799

Curate a subset using relative bounds

The fact that the condition boundaries are given as absolute integers is actually a limitation. Say we are willing to be flexible about the number of negatives we curate (e.g. anything in the range 320-480 is fine), but within the chosen set of negatives we would like at most 25% to have birad=2. Since we don't know how many negatives we'll end up with, there is no way to put a tight upper bound (in absolute numbers) on the number of birad=2 samples.

What we want is to be able to bound a query relative to the (yet unknown) number of samples that satisfy a previous query. So an alternative way to provide boundaries is in the form of a fraction relative to the resulting set satisfying a different query.
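
A relative bound of this kind stays linear in the inclusion variables, which is why an LP solver can still be used. As a sketch (the notation is assumed here, not taken from the package): let $x_j \in [0, 1]$ be the inclusion variable of sample $j$, let $B_{qj} \in \{0, 1\}$ indicate whether sample $j$ satisfies query $q$, and let $r$ be the referenced query, with fractions $f^{\min}_q$ and $f^{\max}_q$:

```latex
f^{\min}_q \sum_j B_{rj}\, x_j \;\le\; \sum_j B_{qj}\, x_j \;\le\; f^{\max}_q \sum_j B_{rj}\, x_j
```

Both sides are linear in $x$, so the unknown size of the referenced set never has to be fixed in advance.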

# Get relative fraction constraints
df_cond_rel = pd.read_csv('csvs/curation_conditions_rel.csv').set_index('query')
df_cond_rel.reset_index()
    query                                                  min     max  index_ref
 0  exists == "1"                                       800.00  800.00         -1
 1  is_pos == "1"                                         0.40    0.60          0
 2  is_pos == "0"                                         0.40    0.60          0
 3  data_source == "optimam" & is_pos == "0"              0.40    0.60          2
 4  data_source == "imh" & is_pos == "0"                  0.40    0.60          2
 5  data_source == "optimam" & is_pos == "1"              0.40    0.60          1
 6  data_source == "imh" & is_pos == "1"                  0.40    0.60          1
 7  lesion_type == "mass" & is_pos == "1"                 0.65    0.70          1
 8  lesion_type == "calcification" & is_pos == "1"        0.30    0.35          1
 9  birad == "1" & is_pos == "0"                          0.75    0.80          2
10  birad == "2" & is_pos == "0"                          0.20    0.25          2
11  lesion_type == "mass" & largest_mass<=10              0.10    0.15          7
12  lesion_type == "mass" & largest_mass>10 & larg...     0.50    0.60          7
13  lesion_type == "mass" & largest_mass>20 & larg...     0.25    0.30          7
14  age<50                                                0.25    0.30          0
15  age<60 & age>=50                                      0.27    0.33          0
16  age<70 & age>=60                                      0.22    0.26          0
17  age>=70                                               0.15    0.20          0

Here, in row 10, we ask that the number of samples satisfying the query [birad == "2" & is_pos == "0"] be at least 20% and no more than 25% of the number of samples satisfying query 2 [is_pos == "0"], as indicated by the index_ref column. This is how we can define a condition relative to the negative set without knowing how many negatives we'll have in the end!

We still have to ground the solution in some absolute number of desired samples, so we used integer boundaries for the first query above, marked by setting index_ref=-1 (otherwise the solution is not well defined and the LP solver might not converge).

Let's run the RelBoundariesCurator to solve this (here with the simplex method):

cc = curator.RelBoundariesCurator(df_samples, df_cond_rel)
included, summary = cc.run()
summary
Theoretical violations: 1.7763568394002505e-15
included: 800
actual violations: 0
                                                            cnt  min  max  violation
exists == "1"                                               800  800  800          0
is_pos == "1"                                               395  320  480          0
is_pos == "0"                                               405  320  480          0
data_source == "optimam" & is_pos == "0"                    184  162  243          0
data_source == "imh" & is_pos == "0"                        221  162  243          0
data_source == "optimam" & is_pos == "1"                    237  158  237          0
data_source == "imh" & is_pos == "1"                        158  158  237          0
lesion_type == "mass" & is_pos == "1"                       259  257  276          0
lesion_type == "calcification" & is_pos == "1"              119  118  138          0
birad == "1" & is_pos == "0"                                304  304  324          0
birad == "2" & is_pos == "0"                                 81   81  101          0
lesion_type == "mass" & largest_mass<=10                     29   26   39          0
lesion_type == "mass" & largest_mass>10 & largest_mass<=20  155  130  155          0
lesion_type == "mass" & largest_mass>20 & largest_mass<=50   65   65   78          0
age<50                                                      200  200  240          0
age<60 & age>=50                                            264  216  264          0
age<70 & age>=60                                            176  176  208          0
age>=70                                                     160  120  160          0

And we reached an optimal solution!
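
The min and max columns of this summary are absolute numbers, although the conditions were given as fractions. They appear to be the fractions multiplied by the final count of the referenced query and rounded to the nearest integer. That reading is inferred from the tables, not from the package code, and the helper name resolve_bounds is hypothetical. A minimal sketch over four of the rows:

```python
# (query, min, max, index_ref) in table order; index_ref == -1 marks absolute bounds.
conditions = [
    ('exists == "1"', 800.0, 800.0, -1),
    ('is_pos == "1"', 0.40, 0.60, 0),
    ('is_pos == "0"', 0.40, 0.60, 0),
    ('birad == "2" & is_pos == "0"', 0.20, 0.25, 2),
]
# Final counts of the same queries, copied from the cnt column of the summary.
counts = [800, 395, 405, 81]

def resolve_bounds(conditions, counts):
    """Turn relative bounds into absolute ones using the referenced counts."""
    resolved = []
    for query, lo, hi, ref in conditions:
        scale = 1 if ref == -1 else counts[ref]
        resolved.append((query, round(lo * scale), round(hi * scale)))
    return resolved

for query, lo, hi in resolve_bounds(conditions, counts):
    print(query, lo, hi)
```

In this shortened list the birad == "2" row still references position 2 (is_pos == "0"), so its bounds are taken relative to the 405 curated negatives.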
