Optimize adaptive sampling
Project description
Trufl
Trufl was initiated in the context of the IAEA (International Atomic Energy Agency) Coordinated Research Project titled “Monitoring and Predicting Radionuclide Uptake and Dynamics for Optimizing Remediation of Radioactive Contamination in Agriculture”.
While Trufl was originally developed to address the remediation of farmland affected by nuclear accidents, its approach and algorithms are applicable to a wide range of application domains. This includes managing legacy contaminants or monitoring any phenomenon that requires consideration of multiple decision criteria, potentially involving a large set of data.
This package leverages the work done by Floris Abrams in the context of his PhD at KU Leuven and Franck Albinet, International Consultant in Geospatial Data Science and currently PhD researcher in AI applied to nuclear remedation at KU Leuven.
Install
pip install trufl
Getting started
In highly sensitive and high-stakes situations, it is essential that decision making is informed, transparent, and accountable, with decisions being based on a thorough and objective analysis of the available data and the needs and concerns of affected communities being taken into account.
Given the time constraints and limited budgets that are often associated with data surveys (in particular ones supposed to informed highligh sensitive situation), it is crucial to make informed decisions about how to allocate resources. This is even more important when considering the many variables that can be taken into account, such as prior knowledge of the area, health and economic impacts, land use, whether remediation has already taken place, population density, and more. Our approach leverages Linear Programming techniques such as Multiple-criteria decision-making to optimize the data survey workflow:
In this demo, we will walk you through a typical workflow using the
Trufl
package. To help illustrate the process, we will use a “toy”
dataset that represents a typical spatial pattern of soil contaminants.
- We assume that we have access to the ground truth, which is a raster file that shows the spatial distribution of a soil contaminant;
- We will make decisions about how to sample and how much at
administrative units level, which in this case are simulated
as a grid (using the
gridder
utilities function); - Based on prior knowledge of the phenomenon of interest, such as
prior airborne surveys or other data, an
Optimizer
will rank each grid cell according to its priority for sampling; - We will then perform random sampling on the designated grid
cells (using a
Sampler
). To simulate the measurement process, we will use the ground truth to emulate measurements at each location (using aDataCollector
); - We will evaluate the new state of each cell based on the measurements and pass it to a new round of optimization. This process will be repeated iteratively to refine the sampling strategy.
Imports
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import rasterio
import geopandas as gpd
from trufl.utils import gridder
from trufl.sampler import Sampler, rank_to_sample
from trufl.collector import DataCollector
from trufl.callbacks import (State, MaxCB, MinCB, StdCB, CountCB, MoranICB, PriorCB)
from trufl.optimizer import Optimizer
red, black = '#BF360C', '#263238'
Our simulated ground truth
The assumed ground truth reveals a typical spatial pattern of
contaminant such as Cs137
after a nuclear accident for instance.
fname_raster = './files/ground-truth-01-4326-simulated.tif'
with rasterio.open(fname_raster) as src:
plt.axis('off')
plt.imshow(src.read(1))
Simulate administrative units
The sampling strategy will be determined on a per-grid-cell basis within the administrative unit. We define below a 10 x 10 grid over the area of interest:
gdf_grid = gridder(fname_raster, nrows=10, ncols=10)
gdf_grid.head()
geometry | |
---|---|
loc_id | |
0 | POLYGON ((-1.20830 43.26950, -1.20830 43.26042... |
1 | POLYGON ((-1.20830 43.27858, -1.20830 43.26950... |
2 | POLYGON ((-1.20830 43.28766, -1.20830 43.27858... |
3 | POLYGON ((-1.20830 43.29673, -1.20830 43.28766... |
4 | POLYGON ((-1.20830 43.30581, -1.20830 43.29673... |
[!TIP]
Note how each administrative unit is uniquely identified by its
loc_id
.
gdf_grid.boundary.plot(color=black, lw=0.5)
plt.axis('off');
What prior knowledge do we have?
At the initial time $t_0$, data sampling has not yet begun, but we can often leverage existing prior knowledge of our phenomenon of interest to inform our sampling strategy/policy. In the context of nuclear remediation, this prior knowledge can often be obtained through mobile surveys, such as airborne or carborne surveys, which can provide a coarse estimation of soil contamination levels.
In the example below, we simulate prior information about the soil property of interest by calculating the average value of the property over each grid cell.
At this stage, we have no measurements, so we simply create an empty Geopandas GeoDataFrame.
samples_t0 = gpd.GeoDataFrame(index=pd.Index([], name='loc_id'),
geometry=None, data={'value': None})
[!TIP]
We need to set an index
loc_id
and have ageometry
andvalue
columns.
Now we get/“sense” the state of our grid cells based on the simulated
prior (Mean over each grid cell
PriorCB
):
state = State(samples_t0, gdf_grid, cbs=[PriorCB(fname_raster)])
# You have to call the instance
state_t0 = state(); state_t0.head()
Prior | |
---|---|
loc_id | |
0 | 0.102492 |
1 | 0.125727 |
2 | 0.161802 |
3 | 0.184432 |
4 | 0.201405 |
[!TIP]
We get the
Prior
for each individualloc_id
(here only the first 5 shown). The currentState
is only composed of a singlePriorCB
variable but can include many more variables as we will see below.
Optimize sampling based on prior at $t_0$
# Bugs?
# - fails when ran twice
benefit_criteria = [True]
optimizer = Optimizer(state=state_t0)
df_rank = optimizer.get_rank(is_benefit_x=benefit_criteria, w_vector = [1],
n_method=None, c_method = None,
w_method=None, s_method="CP")
df_rank.head()
rank | |
---|---|
loc_id | |
92 | 1 |
93 | 2 |
91 | 3 |
94 | 4 |
82 | 5 |
gdf_grid.join(df_rank, how='left').plot(column='rank',
cmap='viridis_r',
legend_kwds={'label': 'Rank'},
legend=True)
plt.axis('off');
Start sampling
[!TIP]
It’s worth noting that in the absence of any prior knowledge, a uniform sampling strategy over the area of interest may be used. However, this approach may not be the most efficient use of the available data collection and analysis budget.
Based on the ranks (sampling priority) calculated by the
Optimizer
and given sampling budget, let’s calculate the number of samples to
be collected for each administrative unit (loc_id
):
n = rank_to_sample(df_rank['rank'].sort_index().values,
budget=300, min=1); n
array([ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 2, 3, 2, 1, 1, 1, 1, 1, 1, 3, 4, 7, 3, 1, 1,
1, 1, 1, 2, 3, 4, 4, 2, 1, 1, 1, 2, 1, 2, 2, 2, 1,
1, 1, 1, 1, 3, 2, 2, 2, 2, 2, 1, 1, 2, 5, 12, 8, 10,
5, 4, 3, 1, 1, 6, 19, 58, 29, 14, 6, 3, 2, 1, 1])
We can now decide where to sample based on this sampling schema:
sampler = Sampler(gdf_grid)
sample_locs_t0 = sampler.sample(n, method='uniform')
print(sample_locs_t0.head())
ax = sample_locs_t0.plot(markersize=2, color=red)
gdf_grid.boundary.plot(color=black, lw=0.5, ax=ax)
plt.axis('off');
geometry
loc_id
0 POINT (-1.22083 43.26342)
1 POINT (-1.22006 43.27588)
2 POINT (-1.21642 43.28474)
3 POINT (-1.21536 43.29169)
4 POINT (-1.20880 43.30279)
Emulating data collection and analysis
The data collector collects measurements at the random sampling locations in the field. In our case, we emulate this process by extracting measurements from the provided raster file.
“Measuring” variable of interest from a given raster:
dc_emulator = DataCollector(fname_raster)
samples_t0 = dc_emulator.collect(sample_locs_t0)
print(samples_t0.head())
ax = samples_t0.plot(column='value', s=2, legend=True)
gdf_grid.boundary.plot(color=black, lw=0.5, ax=ax);
plt.axis('off');
geometry value
loc_id
0 POINT (-1.22083 43.26342) 0.144753
1 POINT (-1.22006 43.27588) 0.149566
2 POINT (-1.21642 43.28474) 0.174893
3 POINT (-1.21536 43.29169) 0.192808
4 POINT (-1.20880 43.30279) 0.208858
Getting current state
state = State(samples_t0, gdf_grid, cbs=[
MaxCB(), MinCB(), StdCB(), CountCB(), MoranICB(k=5), PriorCB(fname_raster)
])
# You have to call the instance
state_t0 = state(); state_t0
Max | Min | Standard Deviation | Count | Moran.I | Prior | |
---|---|---|---|---|---|---|
loc_id | ||||||
0 | 0.145231 | 0.074017 | 0.025631 | 8 | 0.915078 | 0.102492 |
1 | 0.152964 | 0.111763 | 0.015033 | 7 | 0.679907 | 0.125727 |
2 | 0.160229 | 0.160229 | 0.000000 | 1 | NaN | 0.161802 |
3 | 0.188177 | 0.164264 | 0.008248 | 6 | 0.940746 | 0.184432 |
4 | 0.261005 | 0.241690 | 0.009657 | 2 | NaN | 0.201405 |
... | ... | ... | ... | ... | ... | ... |
95 | 0.877876 | 0.800729 | 0.027397 | 5 | NaN | 0.803670 |
96 | 0.804376 | 0.795942 | 0.004217 | 2 | NaN | 0.763408 |
97 | 0.799064 | 0.672111 | 0.047581 | 8 | 0.948441 | 0.727797 |
98 | 0.708847 | 0.661171 | 0.013101 | 9 | 0.731343 | 0.646002 |
99 | 0.704167 | 0.673213 | 0.015477 | 2 | NaN | 0.655185 |
100 rows × 6 columns
Build the ranking of polygons based on several criteria
Criteria
- MaxCB()
- MinCB()
- StdCB()
- CountCB()
- MoranICB(k=5) – Gives 2 values (value , p-value)
- PriorCB
Criteria type
-
Benefit (high values –> high score –> rank high –> prioritized sampling needed)
-
Cost (high values –> low score –> low high –> Less sampling needed)
-
MaxCB() – Benefit
-
MinCB() – ???
-
StdCB() – Benefit
-
CountCB() – Cost (Low count – higher priority because more samples need)
-
MoranICB(k=5) – Cost (high value – highly correlated – less need for sampling ?? )
-
PriorCB – Benefit
MCDM techniques
- CP – low values – good alternative
- TOPSIS – High Value – good alternative
! Everything is converted to rank to account for these differences !
benefit_criteria = [True, True, True]
state = State(samples_t0, gdf_grid, cbs=[MaxCB(), MinCB(), StdCB()])
optimizer = Optimizer(state=state())
df = optimizer.rank(is_benefit_x=benefit_criteria, w_vector = [0.3, 0.3, 0.4],
n_method=None, c_method = None, w_method=None, s_method="CP")
df.head()
rank | |
---|---|
loc_id | |
93 | 1 |
92 | 2 |
84 | 3 |
91 | 4 |
83 | 5 |
# https://kapernikov.com/ipywidgets-with-matplotlib/
Project details
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.