Optimize adaptive sampling

Trufl

Trufl was initiated in the context of the IAEA (International Atomic Energy Agency) Coordinated Research Project titled “Monitoring and Predicting Radionuclide Uptake and Dynamics for Optimizing Remediation of Radioactive Contamination in Agriculture”.

While Trufl was originally developed to address the remediation of farmland affected by nuclear accidents, its approach and algorithms are applicable to a wide range of application domains. This includes managing legacy contaminants or monitoring any phenomenon that requires consideration of multiple decision criteria, potentially involving a large set of data.

This package builds on work by Floris Abrams, carried out in the context of his PhD at KU Leuven, and by Franck Albinet, International Consultant in Geospatial Data Science and currently a PhD researcher in AI applied to nuclear remediation at KU Leuven.

Install

pip install trufl

Getting started

In highly sensitive and high-stakes situations, decision making must be informed, transparent, and accountable: decisions should rest on a thorough and objective analysis of the available data, and the needs and concerns of affected communities must be taken into account.

Given the time constraints and limited budgets often associated with data surveys (in particular those meant to inform highly sensitive situations), it is crucial to make informed decisions about how to allocate resources. This is even more important when considering the many variables that can be taken into account, such as prior knowledge of the area, health and economic impacts, land use, whether remediation has already taken place, population density, and more. Our approach leverages optimization techniques such as Multiple-Criteria Decision-Making (MCDM) to optimize the data survey workflow.

In this demo, we will walk you through a typical workflow using the Trufl package. To help illustrate the process, we will use a “toy” dataset that represents a typical spatial pattern of soil contaminants.

  1. We assume that we have access to the ground truth, which is a raster file that shows the spatial distribution of a soil contaminant;
  2. We will decide where and how much to sample at the administrative-unit level; these units are simulated here as a grid (using the gridder utility function);
  3. Based on prior knowledge of the phenomenon of interest, such as prior airborne surveys or other data, an Optimizer will rank each grid cell according to its priority for sampling;
  4. We will then perform random sampling on the designated grid cells (using a Sampler). To simulate the measurement process, we will use the ground truth to emulate measurements at each location (using a DataCollector);
  5. We will evaluate the new state of each cell based on the measurements and pass it to a new round of optimization. This process is repeated iteratively to refine the sampling strategy (a compact sketch of the full loop follows this list).
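Put together, the loop looks roughly like the sketch below. It reuses the names introduced in the demo that follows (State, PriorCB, Optimizer, rank_to_sample, Sampler, DataCollector); treat it as a reading aid for the overall flow rather than a verified, copy-paste recipe, as exact signatures may differ between trufl versions.

# Rough sketch of the adaptive-sampling loop; see the step-by-step demo below
# for the actual calls and arguments. `n_rounds` is simply the number of
# sampling rounds you want to run.
state = State(samples_t0, gdf_grid, cbs=[PriorCB(fname_raster)])
for _ in range(n_rounds):
    df_rank = Optimizer(state=state()).get_rank(
        is_benefit_x=[True], w_vector=[1],
        n_method=None, c_method=None, w_method=None, s_method="CP")
    n = rank_to_sample(df_rank['rank'].sort_index().values, budget=300, min=1)
    sample_locs = Sampler(gdf_grid).sample(n, method='uniform')
    samples = DataCollector(fname_raster).collect(sample_locs)  # emulated measurements
    state = State(samples, gdf_grid, cbs=[PriorCB(fname_raster)])  # richer callbacks possible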

Imports

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import rasterio
import geopandas as gpd

from trufl.utils import gridder
from trufl.sampler import Sampler, rank_to_sample
from trufl.collector import DataCollector
from trufl.callbacks import (State, MaxCB, MinCB, StdCB, CountCB, MoranICB, PriorCB)
from trufl.optimizer import Optimizer

red, black = '#BF360C', '#263238'

Our simulated ground truth

The assumed ground truth shows a typical spatial pattern of a contaminant such as Cs-137 after a nuclear accident.

fname_raster = './files/ground-truth-01-4326-simulated.tif'
with rasterio.open(fname_raster) as src:
    plt.axis('off')
    plt.imshow(src.read(1))

Simulate administrative units

The sampling strategy will be determined per administrative unit, which in this demo corresponds to a grid cell. We define below a 10 x 10 grid over the area of interest:

gdf_grid = gridder(fname_raster, nrows=10, ncols=10)
gdf_grid.head()
geometry
loc_id
0 POLYGON ((-1.20830 43.26950, -1.20830 43.26042...
1 POLYGON ((-1.20830 43.27858, -1.20830 43.26950...
2 POLYGON ((-1.20830 43.28766, -1.20830 43.27858...
3 POLYGON ((-1.20830 43.29673, -1.20830 43.28766...
4 POLYGON ((-1.20830 43.30581, -1.20830 43.29673...

[!TIP]

Note how each administrative unit is uniquely identified by its loc_id.

gdf_grid.boundary.plot(color=black, lw=0.5)
plt.axis('off');
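For intuition, such a regular grid can be derived directly from the raster bounds. The snippet below is only an illustration of the idea, not gridder's actual implementation:

from shapely.geometry import box

# Illustration only: build an nrows x ncols fishnet over the raster bounds.
with rasterio.open(fname_raster) as src:
    xmin, ymin, xmax, ymax = src.bounds
    crs = src.crs

nrows, ncols = 10, 10
dx, dy = (xmax - xmin) / ncols, (ymax - ymin) / nrows
cells = [box(xmin + j * dx, ymin + i * dy, xmin + (j + 1) * dx, ymin + (i + 1) * dy)
         for i in range(nrows) for j in range(ncols)]
gdf_manual = (gpd.GeoDataFrame({'loc_id': range(nrows * ncols)}, geometry=cells, crs=crs)
              .set_index('loc_id'))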

What prior knowledge do we have?

At the initial time $t_0$, data sampling has not yet begun, but we can often leverage existing prior knowledge of our phenomenon of interest to inform our sampling strategy/policy. In the context of nuclear remediation, this prior knowledge can often be obtained through mobile surveys, such as airborne or carborne surveys, which can provide a coarse estimation of soil contamination levels.

In the example below, we simulate prior information about the soil property of interest by calculating the average value of the property over each grid cell.
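Conceptually, this per-cell average can be obtained by masking the raster with each cell geometry and averaging the remaining pixels. The helper below only illustrates what PriorCB computes; it is not necessarily how PriorCB is implemented:

from rasterio.mask import mask as rio_mask

# Illustration only: mean raster value over each grid cell.
def cell_means(fname, gdf):
    means = []
    with rasterio.open(fname) as src:
        nodata = src.nodata
        for geom in gdf.geometry:
            arr, _ = rio_mask(src, [geom], crop=True)
            vals = arr[0].astype(float)
            if nodata is not None:
                vals = vals[vals != nodata]
            means.append(np.nanmean(vals))
    return pd.Series(means, index=gdf.index, name='Prior')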

At this stage, we have no measurements, so we simply create an empty Geopandas GeoDataFrame.

samples_t0 = gpd.GeoDataFrame(index=pd.Index([], name='loc_id'), 
                              geometry=None, data={'value': None})

[!TIP]

We need to set loc_id as the index and provide geometry and value columns.

Now we get/“sense” the state of our grid cells based on the simulated prior (the mean over each grid cell, computed by PriorCB):

state = State(samples_t0, gdf_grid, cbs=[PriorCB(fname_raster)])

# You have to call the instance
state_t0 = state(); state_t0.head()
Prior
loc_id
0 0.102492
1 0.125727
2 0.161802
3 0.184432
4 0.201405

[!TIP]

We get the Prior for each individual loc_id (only the first 5 are shown here). The current State is composed of a single PriorCB variable, but it can include many more variables, as we will see below.

Optimize sampling based on prior at $t_0$

# Known issue:
#   - fails when run twice in a row

benefit_criteria = [True]
optimizer = Optimizer(state=state_t0)
df_rank = optimizer.get_rank(is_benefit_x=benefit_criteria, w_vector=[1],
                             n_method=None, c_method=None,
                             w_method=None, s_method="CP")

df_rank.head()
rank
loc_id
92 1
93 2
91 3
94 4
82 5
gdf_grid.join(df_rank, how='left').plot(column='rank',
                                        cmap='viridis_r', 
                                        legend_kwds={'label': 'Rank'}, 
                                        legend=True)
plt.axis('off');

Start sampling

[!TIP]

It’s worth noting that in the absence of any prior knowledge, a uniform sampling strategy over the area of interest may be used. However, this approach may not be the most efficient use of the available data collection and analysis budget.

Based on the ranks (sampling priorities) calculated by the Optimizer and a given sampling budget, let’s calculate the number of samples to be collected for each administrative unit (loc_id):

n = rank_to_sample(df_rank['rank'].sort_index().values, 
                   budget=300, min=1); n
array([ 1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
        1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
        1,  1,  2,  3,  2,  1,  1,  1,  1,  1,  1,  3,  4,  7,  3,  1,  1,
        1,  1,  1,  2,  3,  4,  4,  2,  1,  1,  1,  2,  1,  2,  2,  2,  1,
        1,  1,  1,  1,  3,  2,  2,  2,  2,  2,  1,  1,  2,  5, 12,  8, 10,
        5,  4,  3,  1,  1,  6, 19, 58, 29, 14,  6,  3,  2,  1,  1])
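For intuition, one simple way to turn ranks into per-unit sample counts under a fixed budget is to weight units by the inverse of their rank and keep a minimum of one sample everywhere. This is only an illustration; rank_to_sample may use a different allocation rule:

# Illustration only: allocate a sampling budget across units from their ranks.
def ranks_to_counts(ranks, budget=300, floor=1):
    weights = 1.0 / np.asarray(ranks, dtype=float)                # rank 1 -> largest weight
    counts = np.floor(budget * weights / weights.sum()).astype(int)
    return np.maximum(counts, floor)                              # the floor may slightly exceed the budget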

We can now decide where to sample based on this sampling schema:

sampler = Sampler(gdf_grid)
sample_locs_t0 = sampler.sample(n, method='uniform')

print(sample_locs_t0.head())
ax = sample_locs_t0.plot(markersize=2, color=red)

gdf_grid.boundary.plot(color=black, lw=0.5, ax=ax)
plt.axis('off');
                         geometry
loc_id                           
0       POINT (-1.22083 43.26342)
1       POINT (-1.22006 43.27588)
2       POINT (-1.21642 43.28474)
3       POINT (-1.21536 43.29169)
4       POINT (-1.20880 43.30279)
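For intuition, drawing n uniformly distributed points inside a polygon can be done with simple rejection sampling, as sketched below. This is illustrative only and not necessarily how Sampler works internally:

from shapely.geometry import Point

# Illustration only: rejection-sample n uniform points inside a polygon.
def uniform_points(poly, n, seed=0):
    rng = np.random.default_rng(seed)
    xmin, ymin, xmax, ymax = poly.bounds
    pts = []
    while len(pts) < n:
        p = Point(rng.uniform(xmin, xmax), rng.uniform(ymin, ymax))
        if poly.contains(p):
            pts.append(p)
    return pts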

Emulating data collection and analysis

The data collector collects measurements at the random sampling locations in the field. In our case, we emulate this process by extracting measurements from the provided raster file.

“Measuring” the variable of interest from a given raster:

dc_emulator = DataCollector(fname_raster)
samples_t0 = dc_emulator.collect(sample_locs_t0)

print(samples_t0.head())
ax = samples_t0.plot(column='value', markersize=2, legend=True)
gdf_grid.boundary.plot(color=black, lw=0.5, ax=ax);
plt.axis('off');
                         geometry     value
loc_id                                     
0       POINT (-1.22083 43.26342)  0.144753
1       POINT (-1.22006 43.27588)  0.149566
2       POINT (-1.21642 43.28474)  0.174893
3       POINT (-1.21536 43.29169)  0.192808
4       POINT (-1.20880 43.30279)  0.208858
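Under the hood, this kind of emulation boils down to reading the raster at the point coordinates. A minimal sketch (not necessarily how DataCollector is implemented):

# Illustration only: read raster values at point locations.
def measure_at(fname, gdf_pts):
    coords = [(geom.x, geom.y) for geom in gdf_pts.geometry]
    with rasterio.open(fname) as src:
        values = [v[0] for v in src.sample(coords)]
    return gdf_pts.assign(value=values)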

Getting current state

state = State(samples_t0, gdf_grid, cbs=[
    MaxCB(), MinCB(), StdCB(), CountCB(), MoranICB(k=5), PriorCB(fname_raster)
])

# You have to call the instance
state_t0 = state(); state_t0
Max Min Standard Deviation Count Moran.I Prior
loc_id
0 0.145231 0.074017 0.025631 8 0.915078 0.102492
1 0.152964 0.111763 0.015033 7 0.679907 0.125727
2 0.160229 0.160229 0.000000 1 NaN 0.161802
3 0.188177 0.164264 0.008248 6 0.940746 0.184432
4 0.261005 0.241690 0.009657 2 NaN 0.201405
... ... ... ... ... ... ...
95 0.877876 0.800729 0.027397 5 NaN 0.803670
96 0.804376 0.795942 0.004217 2 NaN 0.763408
97 0.799064 0.672111 0.047581 8 0.948441 0.727797
98 0.708847 0.661171 0.013101 9 0.731343 0.646002
99 0.704167 0.673213 0.015477 2 NaN 0.655185

100 rows × 6 columns
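For intuition, the Max, Min, Standard Deviation and Count columns above are per-unit aggregations of the collected measurements, conceptually similar to a pandas groupby on loc_id (MoranICB and PriorCB additionally need the sample geometries and the prior raster):

# Conceptual equivalent of the summary callbacks; values may differ slightly,
# e.g. in the degrees of freedom used for the standard deviation.
summary = samples_t0.groupby('loc_id')['value'].agg(['max', 'min', 'std', 'count'])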

Build the ranking of polygons based on several criteria

Criteria

  • MaxCB()
  • MinCB()
  • StdCB()
  • CountCB()
  • MoranICB(k=5) – gives 2 values (Moran’s I value, p-value)
  • PriorCB

Criteria type

  • Benefit (high values –> high score –> high rank –> sampling prioritized)

  • Cost (high values –> low score –> low rank –> less sampling needed)

  • MaxCB() – Benefit

  • MinCB() – ???

  • StdCB() – Benefit

  • CountCB() – Cost (low count – higher priority because more samples are needed)

  • MoranICB(k=5) – Cost (high value – strongly spatially correlated – less need for sampling?)

  • PriorCB – Benefit

MCDM techniques

  • CP (Compromise Programming) – low values indicate good alternatives
  • TOPSIS – high values indicate good alternatives

Note: everything is converted to ranks to account for these differences.
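A toy illustration of that rank conversion and of a CP-like weighted score, using three benefit criteria as in the example below. The actual normalization, weighting and scoring are handled internally by the Optimizer; this is only to build intuition:

# Illustration only: ranks per criterion (benefit: high value -> rank 1,
# cost: low value -> rank 1), then a weighted, CP-like score (lower = better).
X = state_t0[['Max', 'Min', 'Standard Deviation']]
is_benefit, w = [True, True, True], [0.3, 0.3, 0.4]
ranks = pd.DataFrame({c: X[c].rank(ascending=not b) for c, b in zip(X.columns, is_benefit)})
score = (ranks * w).sum(axis=1)
priority = score.rank(method='first').astype(int)    # 1 = sample first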

benefit_criteria = [True, True, True]
state = State(samples_t0, gdf_grid, cbs=[MaxCB(), MinCB(), StdCB()])
optimizer = Optimizer(state=state())
df = optimizer.rank(is_benefit_x=benefit_criteria, w_vector=[0.3, 0.3, 0.4],
                    n_method=None, c_method=None, w_method=None, s_method="CP")

df.head()
rank
loc_id
93 1
92 2
84 3
91 4
83 5
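These ranks can then be fed back into rank_to_sample and the Sampler to plan the next sampling round, exactly as was done at $t_0$; iterating this loop is what progressively refines the sampling strategy. For instance, reusing the objects created above and the same budget:

# Next round: turn the new ranks into a sampling scheme and collect again.
n_t1 = rank_to_sample(df['rank'].sort_index().values, budget=300, min=1)
sample_locs_t1 = sampler.sample(n_t1, method='uniform')
samples_t1 = dc_emulator.collect(sample_locs_t1)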
