Batgirl

Case Study 1 - Flood Disaster in Germany

Research objective

The 2021 Ahr flood in Germany on July 14 and 15 destroyed vast areas of land, costing more than 7 billion euros and resulting in the deaths of at least 133 people.

In this case study we want to take a look at the level of destruction from a satellite’s perspective. The Sentinel-2 polar orbiter provides data in 12 spectral bands at a spatial resolution of up to 10 meters and has a revisit time of five days.

Here, we will work with two Sentinel-2 observations, one from the overpass on June 13 (before the flooding) and one from July 21 (after the flooding) respectively. The scenes were chosen due to their low cloud cover. Additionally, a Digital Elevation Model (DEM) is provided. Further data are labels for training (flooded areas) and labels for testing (flooded areas).

Table of contents

1 Read the data
2 Data inspection (Visualization)
3 Preprocessing
3.1 Reshape the data into a tabular format
3.2 Split the data set into train and test data
4 Baseline model
4.1 Training
4.2 Validation
4.3 Application
5 Hyperparameter tuning
5.1 Random search
5.2 Grid search
6 Feature selection
6.1 Recursive feature elimination
6.2 Forward feature selection

1 Read the data

# Import the Path function of the pathlib module:
from pathlib import Path

# Import the xarray library:
import xarray as xr

# Set the directory:
datadir = Path("C:/Users/Florian/Documents/Studium_Physische_Geographie_(M.Sc.)_Uni_Marburg/3_Semester/Machine_Learning_Remote_Sensing/Case_Study1/Data")

# Read Sentinel-2 data
sentinel_before = xr.open_dataset(datadir / "ahr_hochwasser_before.nc")
sentinel_after = xr.open_dataset(datadir / "ahr_hochwasser_after.nc")

# Read label data (flooded areas)
# Pixel values in the label data have the following meaning: 1 = flooded areas, 0 = non-flooded areas, NaN = no data
flooded_train = xr.open_dataset(datadir / "ahr_hochwasser_flooded_areas.tif", engine="rasterio").band_data[0]
flooded_test  = xr.open_dataset(datadir / "ahr_hochwasser_flooded_areas_test.tif", engine="rasterio").band_data[0]

# Read the DEM
dem = xr.open_dataset(datadir / "dem_utm32n_clipped_cropped.tif", engine="rasterio").band_data[0]

2 Data inspection (Visualization)

# Import the numpy library
import numpy as np

# Helper function to normalize array to range 0-1 (with clips at lower and upper percentiles (1% - 99%))
def normArray(x):
    lp = 1
    up = 99
    x = np.clip(x, np.nanpercentile(x, lp), np.nanpercentile(x, up))
    x = (x - np.nanmin(x)) / (np.nanmax(x) - np.nanmin(x))
    return x
# Import matplotlib.pyplot
import matplotlib.pyplot as plt

# Select bands for plotting
red_b = normArray(sentinel_before.B4)
green_b = normArray(sentinel_before.B3)
blue_b = normArray(sentinel_before.B2)
red_a = normArray(sentinel_after.B4)
green_a = normArray(sentinel_after.B3)
blue_a = normArray(sentinel_after.B2)

# Plot the Sentinel-2 observations for an  exemplary region
fig = plt.figure(figsize=(15,22))
ax1 = fig.add_subplot(2,1,1)
ax2 = fig.add_subplot(2,1,2)
im_before = ax1.imshow(np.array([red_b[1000:1500,0:700],green_b[1000:1500,0:700],blue_b[1000:1500,0:700]]).transpose(1,2,0))
im_after = ax2.imshow(np.array([red_a[1000:1500,0:700],green_a[1000:1500,0:700],blue_a[1000:1500,0:700]]).transpose(1,2,0))
ax1.set_title("Overview of exemplary region (True Color Composite) before flood event", fontweight="bold")
ax2.set_title("Overview of exemplary region (True Color Composite) after flood event", fontweight="bold")
fig.tight_layout()
plt.show()
../../../_images/01_Batgirl_6_0.png
# Plot the DEM
from mpl_toolkits.axes_grid1 import make_axes_locatable

fig, ax = plt.subplots(1,1,figsize=(15,12))
im = ax.imshow(dem, cmap="terrain")
divider = make_axes_locatable(ax)
cax = divider.append_axes("right", size="2%", pad=0.7)
cb = fig.colorbar(im, cax=cax, orientation="vertical")
cax.set_xlabel("Elevation [m]")
ax.set_title("Digital Elevation Model of the whole study area", fontweight="bold")
fig.tight_layout()
plt.show()
../../../_images/01_Batgirl_7_0.png

3 Preprocessing

3.1 Reshape the data into a tabular format

# Import the pandas library
import pandas as pd

# Get list of band names in satellite DataSet:
bands = list(sentinel_before.data_vars.keys())

# Instantiate empty pandas DataFrame
df = pd.DataFrame()

# Fill DataFrame with "flattened" satellite bands (--> each band is flattened to one column)
for i in range(12):
    df["b_"+bands[i]] = sentinel_before[bands[i]].values.flatten()
    df["a_"+bands[i]] = sentinel_after[bands[i]].values.flatten()

# Add DEM and labels to DataFrame 
df["dem"] = dem.values.flatten()
df["label_train"] = flooded_train.values.flatten()
df["label_test"] = flooded_test.values.flatten()

# Define predictor and target columns
predictors = df.keys()[:-2]
target = "label_train"

# Clip DataFrame to samples where all predictor bands have valid pixel values
df_valid = df.dropna(subset=predictors)
df_valid
b_B1 a_B1 b_B2 a_B2 b_B3 a_B3 b_B4 a_B4 b_B5 a_B5 ... a_B8A b_B9 a_B9 b_B11 a_B11 b_B12 a_B12 dem label_train label_test
4602 0.0172 0.0162 0.0217 0.0140 0.0335 0.0266 0.0166 0.0131 0.0683 0.0554 ... 0.3651 0.3935 0.3461 0.1745 0.1576 0.0744 0.0642 280.214813 NaN NaN
4603 0.0172 0.0162 0.0211 0.0150 0.0359 0.0271 0.0173 0.0120 0.0614 0.0536 ... 0.3622 0.3935 0.3461 0.1763 0.1583 0.0736 0.0642 279.540619 NaN NaN
4604 0.0166 0.0159 0.0215 0.0170 0.0373 0.0308 0.0168 0.0161 0.0614 0.0536 ... 0.3622 0.3935 0.3461 0.1763 0.1583 0.0736 0.0642 279.540619 NaN NaN
4605 0.0166 0.0159 0.0222 0.0150 0.0334 0.0290 0.0175 0.0148 0.0650 0.0545 ... 0.3877 0.4302 0.3806 0.1810 0.1630 0.0752 0.0682 279.311157 NaN NaN
4606 0.0166 0.0159 0.0198 0.0171 0.0407 0.0325 0.0164 0.0157 0.0650 0.0545 ... 0.3877 0.4302 0.3806 0.1810 0.1630 0.0752 0.0682 279.311157 NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
4801696 0.0221 0.0174 0.0305 0.0282 0.0597 0.0573 0.0303 0.0528 0.0932 0.0958 ... 0.3648 0.4347 0.3799 0.1859 0.1860 0.0863 0.0928 467.113068 0.0 NaN
4801697 0.0221 0.0174 0.0289 0.0251 0.0521 0.0440 0.0263 0.0298 0.0932 0.0958 ... 0.3648 0.4347 0.3799 0.1859 0.1860 0.0863 0.0928 464.409637 0.0 NaN
4801698 0.0219 0.0174 0.0225 0.0158 0.0447 0.0332 0.0187 0.0199 0.0750 0.0540 ... 0.3246 0.4347 0.3741 0.1743 0.1479 0.0759 0.0621 464.409637 0.0 NaN
4801699 0.0219 0.0174 0.0226 0.0142 0.0436 0.0309 0.0184 0.0193 0.0750 0.0540 ... 0.3246 0.4347 0.3741 0.1743 0.1479 0.0759 0.0621 461.661682 0.0 NaN
4801700 0.0219 0.0174 0.0218 0.0152 0.0430 0.0320 0.0198 0.0163 0.0787 0.0558 ... 0.3646 0.4347 0.3741 0.1854 0.1552 0.0807 0.0639 461.661682 0.0 NaN

1232367 rows × 27 columns

3.2 Split the data set into train and test data

from sklearn.model_selection import train_test_split

df_train, df_test = train_test_split(df_valid[~np.isnan(df_valid.label_train)].drop(columns=["label_test"]), random_state=42, test_size=0.2)

df_train
b_B1 a_B1 b_B2 a_B2 b_B3 a_B3 b_B4 a_B4 b_B5 a_B5 ... b_B8A a_B8A b_B9 a_B9 b_B11 a_B11 b_B12 a_B12 dem label_train
4198582 0.0470 0.0257 0.0621 0.0244 0.0997 0.0682 0.1050 0.0297 0.1913 0.1154 ... 0.3555 0.4480 0.3627 0.4233 0.3405 0.2128 0.1988 0.0945 433.486664 0.0
3703954 0.0265 0.0216 0.0254 0.0171 0.0441 0.0307 0.0250 0.0194 0.0632 0.0443 ... 0.2607 0.2271 0.2972 0.2580 0.1001 0.0841 0.0505 0.0362 318.111389 0.0
3904497 0.0270 0.0160 0.0297 0.0200 0.0462 0.0378 0.0254 0.0224 0.0741 0.0554 ... 0.3593 0.3105 0.3982 0.3281 0.1517 0.1225 0.0724 0.0538 447.801880 0.0
3886858 0.0458 0.0353 0.0426 0.0440 0.0781 0.0692 0.0525 0.0630 0.1160 0.1106 ... 0.3495 0.2752 0.3314 0.2770 0.2348 0.2019 0.1357 0.1201 223.937378 0.0
3682430 0.0604 0.0555 0.0498 0.0592 0.0761 0.0830 0.0714 0.1132 0.1267 0.1204 ... 0.3040 0.1989 0.2710 0.1940 0.2066 0.1892 0.1494 0.1382 246.357849 1.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
3643284 0.0201 0.0143 0.0198 0.0155 0.0382 0.0265 0.0182 0.0156 0.0644 0.0481 ... 0.2994 0.2504 0.3160 0.2607 0.1311 0.1014 0.0587 0.0459 387.462341 0.0
4243491 0.0276 0.0167 0.0285 0.0206 0.0508 0.0387 0.0247 0.0201 0.0759 0.0577 ... 0.3644 0.3060 0.3952 0.3312 0.1811 0.1368 0.0868 0.0628 320.460297 0.0
3688515 0.0612 0.0371 0.0990 0.0376 0.1382 0.0652 0.1542 0.0543 0.1880 0.1009 ... 0.3728 0.3740 0.3399 0.3239 0.3136 0.1967 0.2438 0.1169 233.000000 0.0
3745470 0.0310 0.0234 0.0369 0.0370 0.0705 0.0731 0.0353 0.0555 0.1185 0.1276 ... 0.5096 0.4354 0.4476 0.4068 0.1918 0.2179 0.0892 0.1096 251.853851 0.0
3649765 0.0181 0.0119 0.0230 0.0150 0.0456 0.0291 0.0215 0.0165 0.0823 0.0548 ... 0.3672 0.3023 0.4110 0.3531 0.1558 0.1192 0.0729 0.0513 353.186096 0.0

222184 rows × 26 columns

4 Baseline model

4.1 Training

The preprocessed DataFrame can be handed over to an ML model. In this study, a random forest classifier is choosen.

import warnings
warnings.filterwarnings('ignore')
import joblib

from sklearn.ensemble import RandomForestClassifier

# Train a Random Forest Classifier
model = RandomForestClassifier(random_state=42,oob_score=True,n_jobs=-1)
model.fit(df_train[predictors],df_train[target])
joblib.dump(model, datadir / "trained.model")

4.2 Validation

# Check model performance on test data set
model_score = model.score(df_test[predictors], df_test[target])
print(f"The performance of the model is {model_score*100:.2f}%.")

# Check model performance on spatially independent test data set (label_test/flooded_test)
df_test2 = df_valid[~np.isnan(df_valid.label_test)].drop(columns=["label_train"])
model_score_1 = model.score(df_test2[predictors], df_test2["label_test"])
print(f"The performance of the model considering a spatially independent test data set is {model_score_1*100:.2f}%.")
The performance of the model is 99.17%.
The performance of the model considering a spatially independent test data set is 85.40%.
# Plot the models assessment of the feature importances
feat_imps = pd.DataFrame({"feat_importance" : model.feature_importances_,
                          "predictors" : predictors}).sort_values(by="feat_importance")

fig, ax = plt.subplots(1,1,figsize=(4,6))
ax.barh(range(len(feat_imps.feat_importance)),feat_imps.feat_importance)
ax.set_yticks(range(len(feat_imps.feat_importance)))
ax.set_yticklabels(feat_imps.predictors)
ax.set_title("Feature importances")
plt.show()
../../../_images/01_Batgirl_16_0.png

4.3 Application

Apply the model to the whole domain to see which areas it would classify as flooded.

model = joblib.load(datadir / "trained.model")

prediction = model.predict(df.fillna(0)[predictors]).reshape(dem.shape)
fig, ax = plt.subplots(1,1,figsize=(20,10))
im = ax.imshow(prediction, vmin=0, vmax=1,interpolation="None", cmap="Greys")
ax.set_title("Flooded areas classified by a random forest", fontweight="bold")
plt.show()
../../../_images/01_Batgirl_19_0.png

5 Hyperparameter tuning

6 Feature selection

6.1 Recursive feature elimination

# Use the RFECV tool to run a recursive feature elimination on the model
from sklearn.model_selection import KFold
from sklearn.feature_selection import RFECV
import joblib

#model = joblib.load(datadir / "trained_hyperparam.model")

# KFold cross validation with 5 splits in the data at each step
rfecv = RFECV(estimator=model, step=1, cv=KFold(5), n_jobs = -1, scoring='accuracy', min_features_to_select = 1)
rfecv.fit(df_train[predictors], df_train[[target]])
joblib.dump(rfecv, datadir / "rfecv.model")

print("Optimal number of features : %d" % rfecv.n_features_)
Optimal number of features : 16
# Plot model accuracy for each run
plt.figure(figsize=(15,8))
plt.xlabel("Number of features selected")
plt.ylabel("Cross validation score (accuracy)")
plt.plot(range(1, len(rfecv.grid_scores_)+1), rfecv.grid_scores_)
plt.grid(axis="both")
plt.xticks(range(1,len(rfecv.grid_scores_)+1,1))
plt.title("Model accuracy for each run", fontweight="bold")
plt.show()
../../../_images/01_Batgirl_30_0.png

It can be seen that the model performance slightly increases from 25 to 16 predictors where it reaches its maximum. Afterwards it decreases, only interrupted by a slightly increase again from 5 to 4 predictors.

# List the features that add value to the model
df_train[predictors].columns[rfecv.ranking_==1]
Index(['b_B1', 'a_B1', 'a_B2', 'a_B3', 'a_B4', 'a_B5', 'b_B6', 'a_B6', 'a_B7',
       'a_B8', 'a_B8A', 'b_B9', 'a_B9', 'a_B11', 'a_B12', 'dem'],
      dtype='object')
# List all unimportant features
df_train[predictors].columns[rfecv.ranking_>1]
Index(['b_B2', 'b_B3', 'b_B4', 'b_B5', 'b_B7', 'b_B8', 'b_B8A', 'b_B11',
       'b_B12'],
      dtype='object')

6.2 Forward feature selection

Unfortunately, I wasn’t able to apply the forward feature selection due to the power of my laptop. I interrupted it after hours of computing. However, below can be seen how this could be implemented.

# Use the SequentialFeatureSelector to run a forward feature selection
import warnings
warnings.filterwarnings('ignore')
import joblib
from sklearn.feature_selection import SequentialFeatureSelector

model = joblib.load(datadir / "trained_hyperparam.model")

seqFeatSel = SequentialFeatureSelector(model, n_features_to_select=16, direction='forward', scoring="accuracy", cv=5, n_jobs=-1)
seqFeatSel.fit(df_train[predictors], df_train[[target]])
# List the features that were iteratively added to the model
selected_features = df_train[predictors].columns[seqFeatSel.support_]
selected_features
# Train/test the model again using only the selected predictors
model = RandomForestClassifier(random_state=42, n_estimators=150, max_depth=14, oob_score=True, n_jobs=-1)
model.fit(df_train[selected_features], df_train[target])
model_score = model.score(df_test[selected_features], df_test[target])
print(f"The performance of the model after forward feature selection is {model_score*100:.2f}%.")