# Curation Magic

A tool to curate a test set from a pool of samples, automagically, based on user-given constraints.
Did you ever need to sub-sample a pool of samples according to a strict set of conditions, e.g. when designing a test set for an experiment? This package provides an easy way to sub-sample a dataframe.

The user provides two dataframes: the first holds the sample pool, and the second holds queries over these samples, together with the number of samples in the curated set that should satisfy each query.
## Install

```bash
pip install curation_magic
```
## Instructions

Our goal is to curate a subset from a general pool of samples that satisfies a list of conditions as closely as possible.

The pool of samples is given in a dataframe, which we'll call df_samples. It has one row per sample, and its columns hold all sorts of metadata and features of the samples.
Let's see an example:
```python
# Load the sample pool from file.
import pandas as pd

df_samples = pd.read_csv('csvs/curation_pool.csv',
                         converters={'age': int, 'birad': int})
df_samples = df_samples.set_index('study_id')
df_samples.head(10)
```
| study_id | exists | data_source | age | density | birad | lesion_type | largest_mass | is_pos |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | optimam | 56 | 2 | 0 | calcification | NaN | 1 |
| 1 | 1 | optimam | 70 | 4 | 0 | mass | 16.87 | 1 |
| 2 | 1 | optimam | 70 | 2 | 0 | mass | 10.15 | 1 |
| 3 | 1 | optimam | 66 | 2 | 0 | mass | 10.71 | 1 |
| 4 | 1 | imh | 49 | 3 | 0 | distortion | NaN | 1 |
| 5 | 1 | optimam | 67 | 2 | 0 | mass | 9.24 | 1 |
| 6 | 1 | optimam | 47 | 4 | 0 | mass | 14.35 | 1 |
| 7 | 1 | optimam | 51 | 3 | 0 | calcification | NaN | 1 |
| 8 | 1 | optimam | 50 | 4 | 0 | calcification | NaN | 1 |
| 9 | 1 | optimam | 59 | 3 | 0 | calcification | NaN | 1 |
The conditions are given in a second dataframe, df_cond_abs. Each row of df_cond_abs is indexed by a query that can be applied to df_samples (i.e. via df_samples.query(query_string)). For each query, the user specifies how many samples in the curated subset should satisfy it, given as a lower bound and an upper bound (ignore the index_ref column for now).
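Each index of the conditions dataframe is a regular pandas query string, evaluated with DataFrame.query. A minimal sketch, using a hypothetical miniature pool, of how one constraint's query selects the samples it applies to:

```python
import pandas as pd

# Hypothetical miniature pool, just to illustrate query evaluation.
df = pd.DataFrame({
    'is_pos': [1, 1, 0, 0],
    'data_source': ['optimam', 'imh', 'optimam', 'imh'],
})

# The constraint row indexed by this query applies to exactly these samples:
subset = df.query('data_source == "optimam" & is_pos == 1')
print(len(subset))  # → 1
```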
```python
# Get absolute-number constraints.
df_cond_abs = pd.read_csv('csvs/curation_conditions_abs.csv').set_index('query')
df_cond_abs
```
| query | min | max | index_ref |
|---|---|---|---|
| is_pos == "1" | 400 | 400 | -1 |
| is_pos == "0" | 400 | 400 | -1 |
| data_source == "optimam" & is_pos == "0" | 160 | 240 | -1 |
| data_source == "imh" & is_pos == "0" | 160 | 240 | -1 |
| data_source == "optimam" & is_pos == "1" | 160 | 240 | -1 |
| data_source == "imh" & is_pos == "1" | 160 | 240 | -1 |
| lesion_type == "mass" & is_pos == "1" | 270 | 300 | -1 |
| lesion_type == "calcification" & is_pos == "1" | 110 | 140 | -1 |
| birad == "1" & is_pos == "0" | 300 | 320 | -1 |
| birad == "2" & is_pos == "0" | 80 | 100 | -1 |
| lesion_type == "mass" & largest_mass<=10 | 30 | 40 | -1 |
| lesion_type == "mass" & largest_mass>10 & largest_mass<=20 | 140 | 180 | -1 |
| lesion_type == "mass" & largest_mass>20 & largest_mass<=50 | 75 | 110 | -1 |
| age<50 | 200 | 240 | -1 |
| age<60 & age>=50 | 216 | 264 | -1 |
| age<70 & age>=60 | 176 | 208 | -1 |
| age>=70 | 120 | 160 | -1 |
Let's use the AbsBoundariesCurator to find a curated set:
```python
from curation_magic import curator

abc = curator.AbsBoundariesCurator(df_samples, df_cond_abs)

# Note: we use the interior-point solver here, which is
# faster but less accurate than the default simplex solver.
included, summary = abc.run(method='interior-point')

# The summary shows how many samples were included from every query,
# and the total number of violations.
summary
```
```
Theoretical violations: 4.000000001349921
included: 799
actual violations: 5
```
| query | cnt | min | max | violation |
|---|---|---|---|---|
| is_pos == "1" | 399 | 400 | 400 | 1 |
| is_pos == "0" | 400 | 400 | 400 | 0 |
| data_source == "optimam" & is_pos == "0" | 161 | 160 | 240 | 0 |
| data_source == "imh" & is_pos == "0" | 239 | 160 | 240 | 0 |
| data_source == "optimam" & is_pos == "1" | 241 | 160 | 240 | 1 |
| data_source == "imh" & is_pos == "1" | 158 | 160 | 240 | 2 |
| lesion_type == "mass" & is_pos == "1" | 269 | 270 | 300 | 1 |
| lesion_type == "calcification" & is_pos == "1" | 111 | 110 | 140 | 0 |
| birad == "1" & is_pos == "0" | 303 | 300 | 320 | 0 |
| birad == "2" & is_pos == "0" | 85 | 80 | 100 | 0 |
| lesion_type == "mass" & largest_mass<=10 | 34 | 30 | 40 | 0 |
| lesion_type == "mass" & largest_mass>10 & largest_mass<=20 | 147 | 140 | 180 | 0 |
| lesion_type == "mass" & largest_mass>20 & largest_mass<=50 | 84 | 75 | 110 | 0 |
| age<50 | 212 | 200 | 240 | 0 |
| age<60 & age>=50 | 249 | 216 | 264 | 0 |
| age<70 & age>=60 | 198 | 176 | 208 | 0 |
| age>=70 | 140 | 120 | 160 | 0 |
As you can see above, the linear solver had 4 violations, but after we decoded the solution (rounding the x_j values to decide which samples to include), there were 5 violations in total. Our curated set has 799 members instead of 800: one positive is missing. In addition, we have one too many positive studies from optimam, two too few positive studies from imh, and one too few positive mass studies.
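Under the hood, this kind of curation can be posed as a linear program. The sketch below is a toy reimplementation with scipy, not the package's actual code: one indicator variable x_j in [0, 1] per sample, one slack variable per query penalized in the objective, and a final rounding step to decode the fractional solution.

```python
import numpy as np
from scipy.optimize import linprog

# Toy pool of 6 samples; membership matrix A[q, j] = 1 iff sample j
# satisfies query q.
A = np.array([
    [1, 1, 1, 0, 0, 0],   # query 0: first three samples
    [0, 0, 1, 1, 1, 1],   # query 1: last four samples
])
lo = np.array([2, 2])     # lower bounds per query
hi = np.array([2, 3])     # upper bounds per query

m, n = A.shape
# Variables: [x_0..x_5, s_0, s_1]; minimize total slack (the violations).
c = np.concatenate([np.zeros(n), np.ones(m)])
# A x <= hi + s   →   A x - s <= hi
# A x >= lo - s   →  -A x - s <= -lo
A_ub = np.vstack([np.hstack([A, -np.eye(m)]),
                  np.hstack([-A, -np.eye(m)])])
b_ub = np.concatenate([hi, -lo])
res = linprog(c, A_ub=A_ub, b_ub=b_ub,
              bounds=[(0, 1)] * n + [(0, None)] * m)

# Decode: round the fractional x_j to get the inclusion indicator.
included = np.round(res.x[:n]).astype(bool)
```

Rounding is why the "actual violations" of the decoded set can exceed the "theoretical violations" of the fractional LP optimum.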
Now we can go back to the original samples dataframe, and add a new column indicating which samples would participate in the final set:
```python
df_samples['included'] = included
df_samples.head()
```
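Extracting the curated subset from the annotated dataframe is then a one-liner; a minimal sketch with a toy dataframe standing in for df_samples:

```python
import pandas as pd

# Toy stand-in for df_samples after adding the 'included' column.
df = pd.DataFrame({'age': [56, 70, 66]}, index=[0, 1, 2])
df['included'] = [True, True, False]

# Keep only the curated samples and drop the helper column.
df_curated = df[df['included']].drop(columns='included')
print(len(df_curated))  # → 2
```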
| study_id | exists | data_source | age | density | birad | lesion_type | largest_mass | is_pos | included |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | optimam | 56 | 2 | 0 | calcification | NaN | 1 | True |
| 1 | 1 | optimam | 70 | 4 | 0 | mass | 16.87 | 1 | True |
| 2 | 1 | optimam | 70 | 2 | 0 | mass | 10.15 | 1 | True |
| 3 | 1 | optimam | 66 | 2 | 0 | mass | 10.71 | 1 | False |
| 4 | 1 | imh | 49 | 3 | 0 | distortion | NaN | 1 | True |
## Using relative bounds for the constraints

The fact that the condition boundaries are given as absolute integers is actually a limitation. Say we are willing to have some flexibility regarding the number of negatives we curate (i.e. anything in the range 350-450 is fine), but within the chosen set of negatives we would like 25% to have birad=2. Since we don't know how many negatives we'll end up with, there is no way to put a tight bound (in absolute numbers) on the number of birad=2 samples.

What we want is to bound a query relative to the (yet unknown) number of samples that satisfy a previous query. So an alternative way to provide boundaries is as a fraction of the resulting set satisfying a different query.
```python
# Get relative-fraction constraints.
df_cond_rel = pd.read_csv('csvs/curation_conditions_rel.csv').set_index('query')
df_cond_rel.reset_index()
```
| | query | min | max | index_ref |
|---|---|---|---|---|
| 0 | exists == "1" | 800.00 | 800.00 | -1 |
| 1 | is_pos == "1" | 0.50 | 0.50 | 0 |
| 2 | is_pos == "0" | 0.50 | 0.50 | 0 |
| 3 | data_source == "optimam" & is_pos == "0" | 0.40 | 0.60 | 2 |
| 4 | data_source == "imh" & is_pos == "0" | 0.40 | 0.60 | 2 |
| 5 | data_source == "optimam" & is_pos == "1" | 0.40 | 0.60 | 1 |
| 6 | data_source == "imh" & is_pos == "1" | 0.40 | 0.60 | 1 |
| 7 | lesion_type == "mass" & is_pos == "1" | 0.65 | 0.70 | 1 |
| 8 | lesion_type == "calcification" & is_pos == "1" | 0.30 | 0.35 | 1 |
| 9 | birad == "1" & is_pos == "0" | 0.75 | 0.80 | 2 |
| 10 | birad == "2" & is_pos == "0" | 0.20 | 0.25 | 2 |
| 11 | lesion_type == "mass" & largest_mass<=10 | 0.10 | 0.15 | 7 |
| 12 | lesion_type == "mass" & largest_mass>10 & largest_mass<=20 | 0.50 | 0.60 | 7 |
| 13 | lesion_type == "mass" & largest_mass>20 & largest_mass<=50 | 0.25 | 0.30 | 7 |
| 14 | age<50 | 0.25 | 0.30 | 0 |
| 15 | age<60 & age>=50 | 0.27 | 0.33 | 0 |
| 16 | age<70 & age>=60 | 0.22 | 0.26 | 0 |
| 17 | age>=70 | 0.15 | 0.20 | 0 |
Here, in row 10, we ask that the number of samples satisfying the query [birad == "2" & is_pos == "0"] be at least 20% and at most 25% of the number of samples satisfying query 2 [is_pos == "0"], as indicated by the index_ref column. This is how we can define a condition on the negative set without knowing in advance how many negatives we'll end up with!

We still have to ground the solution in some absolute number of desired samples, so we used integer boundaries for the first query above (row 0, exists == "1"), simply by setting index_ref=-1 (otherwise the solution is not well defined and the LP solver might not converge).
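Note that relative bounds keep the problem linear in the inclusion indicators x_j: a constraint like count(q) >= f_min * count(ref) rewrites as a linear inequality between two sums of x_j. A quick sanity check of the inequality with hypothetical counts:

```python
# count(q) >= f_min * count(ref)   rewrites as
# sum_{j in q} x_j - f_min * sum_{j in ref} x_j >= 0   (same for f_max).
# Hypothetical counts for birad == "2" & is_pos == "0" relative to is_pos == "0":
f_min, f_max = 0.20, 0.25
n_ref = 400   # samples satisfying the reference query
n_q = 80      # samples satisfying the bounded query

assert f_min * n_ref - n_q <= 0  # lower-bound constraint holds
assert n_q - f_max * n_ref <= 0  # upper-bound constraint holds
```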
Let's run the RelBoundariesCurator to solve this (here with the simplex method):
```python
from curation_magic import curator

cc = curator.RelBoundariesCurator(df_samples, df_cond_rel)
included, summary = cc.run()
df_samples['included'] = included
summary
```
```
Theoretical violations: 4.000000000000157
included: 800
actual violations: 4
```
| query | cnt | min | max | violation |
|---|---|---|---|---|
| exists == "1" | 800 | 800 | 800 | 0 |
| is_pos == "1" | 400 | 400 | 400 | 0 |
| is_pos == "0" | 400 | 400 | 400 | 0 |
| data_source == "optimam" & is_pos == "0" | 160 | 160 | 240 | 0 |
| data_source == "imh" & is_pos == "0" | 240 | 160 | 240 | 0 |
| data_source == "optimam" & is_pos == "1" | 242 | 160 | 240 | 2 |
| data_source == "imh" & is_pos == "1" | 158 | 160 | 240 | 2 |
| lesion_type == "mass" & is_pos == "1" | 263 | 260 | 280 | 0 |
| lesion_type == "calcification" & is_pos == "1" | 120 | 120 | 140 | 0 |
| birad == "1" & is_pos == "0" | 300 | 300 | 320 | 0 |
| birad == "2" & is_pos == "0" | 80 | 80 | 100 | 0 |
| lesion_type == "mass" & largest_mass<=10 | 39 | 26 | 39 | 0 |
| lesion_type == "mass" & largest_mass>10 & largest_mass<=20 | 135 | 132 | 158 | 0 |
| lesion_type == "mass" & largest_mass>20 & largest_mass<=50 | 79 | 66 | 79 | 0 |
| age<50 | 240 | 200 | 240 | 0 |
| age<60 & age>=50 | 264 | 216 | 264 | 0 |
| age<70 & age>=60 | 176 | 176 | 208 | 0 |
| age>=70 | 120 | 120 | 160 | 0 |
In our decoded solution the total number of violations was 4, exactly the same as in the optimal LP solution. This means our solution is indeed optimal, since the optimal LP objective value is always a lower bound on the integer program objective.