Batgirl¶
Case Study 1 - Flood Disaster in Germany¶
Research objective¶
The 2021 Ahr flood in Germany on July 14 and 15 destroyed vast areas of land, costing more than 7 billion euros and resulting in the deaths of at least 133 people.
In this case study we want to take a look at the level of destruction from a satellite’s perspective. The Sentinel-2 polar orbiter provides data in 12 spectral bands at a spatial resolution of up to 10 meters and has a revisit time of five days.
Here, we will work with two Sentinel-2 observations, one from the overpass on June 13 (before the flooding) and one from July 21 (after the flooding) respectively. The scenes were chosen due to their low cloud cover. Additionally, a Digital Elevation Model (DEM) is provided. Further data are labels for training (flooded areas) and labels for testing (flooded areas).
Table of contents¶
1 Read the data
2 Data inspection (Visualization)
3 Preprocessing
3.1 Reshape the data into a tabular format
3.2 Split the data set into train and test data
4 Baseline model
4.1 Training
4.2 Validation
4.3 Application
5 Hyperparameter tuning
5.1 Random search
5.2 Grid search
6 Feature selection
6.1 Recursive feature elimination
6.2 Forward feature selection
1 Read the data¶
# Import the Path function of the pathlib module:
from pathlib import Path
# Import the xarray library:
import xarray as xr
# Set the directory:
datadir = Path("C:/Users/Florian/Documents/Studium_Physische_Geographie_(M.Sc.)_Uni_Marburg/3_Semester/Machine_Learning_Remote_Sensing/Case_Study1/Data")
# Read Sentinel-2 data
sentinel_before = xr.open_dataset(datadir / "ahr_hochwasser_before.nc")
sentinel_after = xr.open_dataset(datadir / "ahr_hochwasser_after.nc")
# Read label data (flooded areas)
# Pixel values in the label data have the following meaning: 1 = flooded areas, 0 = non-flooded areas, NaN = no data
flooded_train = xr.open_dataset(datadir / "ahr_hochwasser_flooded_areas.tif", engine="rasterio").band_data[0]
flooded_test = xr.open_dataset(datadir / "ahr_hochwasser_flooded_areas_test.tif", engine="rasterio").band_data[0]
# Read the DEM
dem = xr.open_dataset(datadir / "dem_utm32n_clipped_cropped.tif", engine="rasterio").band_data[0]
2 Data inspection (Visualization)¶
# Import the numpy library
import numpy as np
# Helper function to normalize array to range 0-1 (with clips at lower and upper percentiles (1% - 99%))
def normArray(x):
lp = 1
up = 99
x = np.clip(x, np.nanpercentile(x, lp), np.nanpercentile(x, up))
x = (x - np.nanmin(x)) / (np.nanmax(x) - np.nanmin(x))
return x
# Import matplotlib.pyplot
import matplotlib.pyplot as plt
# Select bands for plotting
red_b = normArray(sentinel_before.B4)
green_b = normArray(sentinel_before.B3)
blue_b = normArray(sentinel_before.B2)
red_a = normArray(sentinel_after.B4)
green_a = normArray(sentinel_after.B3)
blue_a = normArray(sentinel_after.B2)
# Plot the Sentinel-2 observations for an exemplary region
fig = plt.figure(figsize=(15,22))
ax1 = fig.add_subplot(2,1,1)
ax2 = fig.add_subplot(2,1,2)
im_before = ax1.imshow(np.array([red_b[1000:1500,0:700],green_b[1000:1500,0:700],blue_b[1000:1500,0:700]]).transpose(1,2,0))
im_after = ax2.imshow(np.array([red_a[1000:1500,0:700],green_a[1000:1500,0:700],blue_a[1000:1500,0:700]]).transpose(1,2,0))
ax1.set_title("Overview of exemplary region (True Color Composite) before flood event", fontweight="bold")
ax2.set_title("Overview of exemplary region (True Color Composite) after flood event", fontweight="bold")
fig.tight_layout()
plt.show()
# Plot the DEM
from mpl_toolkits.axes_grid1 import make_axes_locatable
fig, ax = plt.subplots(1,1,figsize=(15,12))
im = ax.imshow(dem, cmap="terrain")
divider = make_axes_locatable(ax)
cax = divider.append_axes("right", size="2%", pad=0.7)
cb = fig.colorbar(im, cax=cax, orientation="vertical")
cax.set_xlabel("Elevation [m]")
ax.set_title("Digital Elevation Model of the whole study area", fontweight="bold")
fig.tight_layout()
plt.show()
3 Preprocessing¶
3.1 Reshape the data into a tabular format¶
# Import the pandas library
import pandas as pd
# Get list of band names in satellite DataSet:
bands = list(sentinel_before.data_vars.keys())
# Instantiate empty pandas DataFrame
df = pd.DataFrame()
# Fill DataFrame with "flattened" satellite bands (--> each band is flattened to one column)
for i in range(12):
df["b_"+bands[i]] = sentinel_before[bands[i]].values.flatten()
df["a_"+bands[i]] = sentinel_after[bands[i]].values.flatten()
# Add DEM and labels to DataFrame
df["dem"] = dem.values.flatten()
df["label_train"] = flooded_train.values.flatten()
df["label_test"] = flooded_test.values.flatten()
# Define predictor and target columns
predictors = df.keys()[:-2]
target = "label_train"
# Clip DataFrame to samples where all predictor bands have valid pixel values
df_valid = df.dropna(subset=predictors)
df_valid
b_B1 | a_B1 | b_B2 | a_B2 | b_B3 | a_B3 | b_B4 | a_B4 | b_B5 | a_B5 | ... | a_B8A | b_B9 | a_B9 | b_B11 | a_B11 | b_B12 | a_B12 | dem | label_train | label_test | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
4602 | 0.0172 | 0.0162 | 0.0217 | 0.0140 | 0.0335 | 0.0266 | 0.0166 | 0.0131 | 0.0683 | 0.0554 | ... | 0.3651 | 0.3935 | 0.3461 | 0.1745 | 0.1576 | 0.0744 | 0.0642 | 280.214813 | NaN | NaN |
4603 | 0.0172 | 0.0162 | 0.0211 | 0.0150 | 0.0359 | 0.0271 | 0.0173 | 0.0120 | 0.0614 | 0.0536 | ... | 0.3622 | 0.3935 | 0.3461 | 0.1763 | 0.1583 | 0.0736 | 0.0642 | 279.540619 | NaN | NaN |
4604 | 0.0166 | 0.0159 | 0.0215 | 0.0170 | 0.0373 | 0.0308 | 0.0168 | 0.0161 | 0.0614 | 0.0536 | ... | 0.3622 | 0.3935 | 0.3461 | 0.1763 | 0.1583 | 0.0736 | 0.0642 | 279.540619 | NaN | NaN |
4605 | 0.0166 | 0.0159 | 0.0222 | 0.0150 | 0.0334 | 0.0290 | 0.0175 | 0.0148 | 0.0650 | 0.0545 | ... | 0.3877 | 0.4302 | 0.3806 | 0.1810 | 0.1630 | 0.0752 | 0.0682 | 279.311157 | NaN | NaN |
4606 | 0.0166 | 0.0159 | 0.0198 | 0.0171 | 0.0407 | 0.0325 | 0.0164 | 0.0157 | 0.0650 | 0.0545 | ... | 0.3877 | 0.4302 | 0.3806 | 0.1810 | 0.1630 | 0.0752 | 0.0682 | 279.311157 | NaN | NaN |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
4801696 | 0.0221 | 0.0174 | 0.0305 | 0.0282 | 0.0597 | 0.0573 | 0.0303 | 0.0528 | 0.0932 | 0.0958 | ... | 0.3648 | 0.4347 | 0.3799 | 0.1859 | 0.1860 | 0.0863 | 0.0928 | 467.113068 | 0.0 | NaN |
4801697 | 0.0221 | 0.0174 | 0.0289 | 0.0251 | 0.0521 | 0.0440 | 0.0263 | 0.0298 | 0.0932 | 0.0958 | ... | 0.3648 | 0.4347 | 0.3799 | 0.1859 | 0.1860 | 0.0863 | 0.0928 | 464.409637 | 0.0 | NaN |
4801698 | 0.0219 | 0.0174 | 0.0225 | 0.0158 | 0.0447 | 0.0332 | 0.0187 | 0.0199 | 0.0750 | 0.0540 | ... | 0.3246 | 0.4347 | 0.3741 | 0.1743 | 0.1479 | 0.0759 | 0.0621 | 464.409637 | 0.0 | NaN |
4801699 | 0.0219 | 0.0174 | 0.0226 | 0.0142 | 0.0436 | 0.0309 | 0.0184 | 0.0193 | 0.0750 | 0.0540 | ... | 0.3246 | 0.4347 | 0.3741 | 0.1743 | 0.1479 | 0.0759 | 0.0621 | 461.661682 | 0.0 | NaN |
4801700 | 0.0219 | 0.0174 | 0.0218 | 0.0152 | 0.0430 | 0.0320 | 0.0198 | 0.0163 | 0.0787 | 0.0558 | ... | 0.3646 | 0.4347 | 0.3741 | 0.1854 | 0.1552 | 0.0807 | 0.0639 | 461.661682 | 0.0 | NaN |
1232367 rows × 27 columns
3.2 Split the data set into train and test data¶
from sklearn.model_selection import train_test_split
df_train, df_test = train_test_split(df_valid[~np.isnan(df_valid.label_train)].drop(columns=["label_test"]), random_state=42, test_size=0.2)
df_train
b_B1 | a_B1 | b_B2 | a_B2 | b_B3 | a_B3 | b_B4 | a_B4 | b_B5 | a_B5 | ... | b_B8A | a_B8A | b_B9 | a_B9 | b_B11 | a_B11 | b_B12 | a_B12 | dem | label_train | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
4198582 | 0.0470 | 0.0257 | 0.0621 | 0.0244 | 0.0997 | 0.0682 | 0.1050 | 0.0297 | 0.1913 | 0.1154 | ... | 0.3555 | 0.4480 | 0.3627 | 0.4233 | 0.3405 | 0.2128 | 0.1988 | 0.0945 | 433.486664 | 0.0 |
3703954 | 0.0265 | 0.0216 | 0.0254 | 0.0171 | 0.0441 | 0.0307 | 0.0250 | 0.0194 | 0.0632 | 0.0443 | ... | 0.2607 | 0.2271 | 0.2972 | 0.2580 | 0.1001 | 0.0841 | 0.0505 | 0.0362 | 318.111389 | 0.0 |
3904497 | 0.0270 | 0.0160 | 0.0297 | 0.0200 | 0.0462 | 0.0378 | 0.0254 | 0.0224 | 0.0741 | 0.0554 | ... | 0.3593 | 0.3105 | 0.3982 | 0.3281 | 0.1517 | 0.1225 | 0.0724 | 0.0538 | 447.801880 | 0.0 |
3886858 | 0.0458 | 0.0353 | 0.0426 | 0.0440 | 0.0781 | 0.0692 | 0.0525 | 0.0630 | 0.1160 | 0.1106 | ... | 0.3495 | 0.2752 | 0.3314 | 0.2770 | 0.2348 | 0.2019 | 0.1357 | 0.1201 | 223.937378 | 0.0 |
3682430 | 0.0604 | 0.0555 | 0.0498 | 0.0592 | 0.0761 | 0.0830 | 0.0714 | 0.1132 | 0.1267 | 0.1204 | ... | 0.3040 | 0.1989 | 0.2710 | 0.1940 | 0.2066 | 0.1892 | 0.1494 | 0.1382 | 246.357849 | 1.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
3643284 | 0.0201 | 0.0143 | 0.0198 | 0.0155 | 0.0382 | 0.0265 | 0.0182 | 0.0156 | 0.0644 | 0.0481 | ... | 0.2994 | 0.2504 | 0.3160 | 0.2607 | 0.1311 | 0.1014 | 0.0587 | 0.0459 | 387.462341 | 0.0 |
4243491 | 0.0276 | 0.0167 | 0.0285 | 0.0206 | 0.0508 | 0.0387 | 0.0247 | 0.0201 | 0.0759 | 0.0577 | ... | 0.3644 | 0.3060 | 0.3952 | 0.3312 | 0.1811 | 0.1368 | 0.0868 | 0.0628 | 320.460297 | 0.0 |
3688515 | 0.0612 | 0.0371 | 0.0990 | 0.0376 | 0.1382 | 0.0652 | 0.1542 | 0.0543 | 0.1880 | 0.1009 | ... | 0.3728 | 0.3740 | 0.3399 | 0.3239 | 0.3136 | 0.1967 | 0.2438 | 0.1169 | 233.000000 | 0.0 |
3745470 | 0.0310 | 0.0234 | 0.0369 | 0.0370 | 0.0705 | 0.0731 | 0.0353 | 0.0555 | 0.1185 | 0.1276 | ... | 0.5096 | 0.4354 | 0.4476 | 0.4068 | 0.1918 | 0.2179 | 0.0892 | 0.1096 | 251.853851 | 0.0 |
3649765 | 0.0181 | 0.0119 | 0.0230 | 0.0150 | 0.0456 | 0.0291 | 0.0215 | 0.0165 | 0.0823 | 0.0548 | ... | 0.3672 | 0.3023 | 0.4110 | 0.3531 | 0.1558 | 0.1192 | 0.0729 | 0.0513 | 353.186096 | 0.0 |
222184 rows × 26 columns
4 Baseline model¶
4.1 Training¶
The preprocessed DataFrame can be handed over to an ML model. In this study, a random forest classifier is choosen.
import warnings
warnings.filterwarnings('ignore')
import joblib
from sklearn.ensemble import RandomForestClassifier
# Train a Random Forest Classifier
model = RandomForestClassifier(random_state=42,oob_score=True,n_jobs=-1)
model.fit(df_train[predictors],df_train[target])
joblib.dump(model, datadir / "trained.model")
4.2 Validation¶
# Check model performance on test data set
model_score = model.score(df_test[predictors], df_test[target])
print(f"The performance of the model is {model_score*100:.2f}%.")
# Check model performance on spatially independent test data set (label_test/flooded_test)
df_test2 = df_valid[~np.isnan(df_valid.label_test)].drop(columns=["label_train"])
model_score_1 = model.score(df_test2[predictors], df_test2["label_test"])
print(f"The performance of the model considering a spatially independent test data set is {model_score_1*100:.2f}%.")
The performance of the model is 99.17%.
The performance of the model considering a spatially independent test data set is 85.40%.
# Plot the models assessment of the feature importances
feat_imps = pd.DataFrame({"feat_importance" : model.feature_importances_,
"predictors" : predictors}).sort_values(by="feat_importance")
fig, ax = plt.subplots(1,1,figsize=(4,6))
ax.barh(range(len(feat_imps.feat_importance)),feat_imps.feat_importance)
ax.set_yticks(range(len(feat_imps.feat_importance)))
ax.set_yticklabels(feat_imps.predictors)
ax.set_title("Feature importances")
plt.show()
4.3 Application¶
Apply the model to the whole domain to see which areas it would classify as flooded.
model = joblib.load(datadir / "trained.model")
prediction = model.predict(df.fillna(0)[predictors]).reshape(dem.shape)
fig, ax = plt.subplots(1,1,figsize=(20,10))
im = ax.imshow(prediction, vmin=0, vmax=1,interpolation="None", cmap="Greys")
ax.set_title("Flooded areas classified by a random forest", fontweight="bold")
plt.show()
5 Hyperparameter tuning¶
5.1 Random search¶
# Use the RandomizedSearchCV tool to randomly check different parameter combinations for the model and data set
from sklearn.model_selection import RandomizedSearchCV
# Only n_estimators (number of trees) and max_depth (maximum depth of the tree) are used in this example
parameters = {'n_estimators' : range(100,300,50),
'max_depth' : range(10,15,1)}
model = RandomizedSearchCV(RandomForestClassifier(), parameters, n_iter=5, cv = 5, n_jobs=-1, scoring="accuracy", verbose=1, random_state=42)
model.fit(df_train[predictors], df_train[target])
Fitting 5 folds for each of 5 candidates, totalling 25 fits
RandomizedSearchCV(cv=5, estimator=RandomForestClassifier(), n_iter=5,
n_jobs=-1,
param_distributions={'max_depth': range(10, 15),
'n_estimators': range(100, 300, 50)},
random_state=42, scoring='accuracy', verbose=1)
# Best parameter combination found by this random testing procedure
model.best_params_
{'n_estimators': 150, 'max_depth': 14}
5.2 Grid search¶
# Use the GridSearchCV tool to test through all possible combinations of hyperparameters around these “best parameters”
from sklearn.model_selection import GridSearchCV
parameters = {'n_estimators' : range(100,200,50),
'max_depth' : range(13,15,1)}
model = GridSearchCV(RandomForestClassifier(), parameters, n_jobs=-1, cv = 2, scoring="accuracy", verbose=1)
model.fit(df_train[predictors], df_train[target])
Fitting 2 folds for each of 4 candidates, totalling 8 fits
GridSearchCV(cv=2, estimator=RandomForestClassifier(), n_jobs=-1,
param_grid={'max_depth': range(13, 15),
'n_estimators': range(100, 200, 50)},
scoring='accuracy', verbose=1)
# Print the best parameter combination
model.best_params_
{'max_depth': 14, 'n_estimators': 150}
# Run the model again with these parameters
model = RandomForestClassifier(random_state=42, n_estimators=150, max_depth=14, oob_score=True, n_jobs=-1)
model.fit(df_train[predictors],df_train[target])
joblib.dump(model, datadir / "trained_hyperparam.model")
model_score = model.score(df_test[predictors], df_test[target])
print(f"The performance of the model after hyperparameter tuning is {model_score*100:.2f}%.")
The performance of the model after hyperparameter tuning is 99.06%.
It can be seen that the accuracy of the model has not increased. For better hyperparameter tuning, more parameters should taken into account. However, even the ‘normal’ model performance without hyperparameter tuning is already quite high (99.17%).
6 Feature selection¶
6.1 Recursive feature elimination¶
# Use the RFECV tool to run a recursive feature elimination on the model
from sklearn.model_selection import KFold
from sklearn.feature_selection import RFECV
import joblib
#model = joblib.load(datadir / "trained_hyperparam.model")
# KFold cross validation with 5 splits in the data at each step
rfecv = RFECV(estimator=model, step=1, cv=KFold(5), n_jobs = -1, scoring='accuracy', min_features_to_select = 1)
rfecv.fit(df_train[predictors], df_train[[target]])
joblib.dump(rfecv, datadir / "rfecv.model")
print("Optimal number of features : %d" % rfecv.n_features_)
Optimal number of features : 16
# Plot model accuracy for each run
plt.figure(figsize=(15,8))
plt.xlabel("Number of features selected")
plt.ylabel("Cross validation score (accuracy)")
plt.plot(range(1, len(rfecv.grid_scores_)+1), rfecv.grid_scores_)
plt.grid(axis="both")
plt.xticks(range(1,len(rfecv.grid_scores_)+1,1))
plt.title("Model accuracy for each run", fontweight="bold")
plt.show()
It can be seen that the model performance slightly increases from 25 to 16 predictors where it reaches its maximum. Afterwards it decreases, only interrupted by a slightly increase again from 5 to 4 predictors.
# List the features that add value to the model
df_train[predictors].columns[rfecv.ranking_==1]
Index(['b_B1', 'a_B1', 'a_B2', 'a_B3', 'a_B4', 'a_B5', 'b_B6', 'a_B6', 'a_B7',
'a_B8', 'a_B8A', 'b_B9', 'a_B9', 'a_B11', 'a_B12', 'dem'],
dtype='object')
# List all unimportant features
df_train[predictors].columns[rfecv.ranking_>1]
Index(['b_B2', 'b_B3', 'b_B4', 'b_B5', 'b_B7', 'b_B8', 'b_B8A', 'b_B11',
'b_B12'],
dtype='object')
6.2 Forward feature selection¶
Unfortunately, I wasn’t able to apply the forward feature selection due to the power of my laptop. I interrupted it after hours of computing. However, below can be seen how this could be implemented.
# Use the SequentialFeatureSelector to run a forward feature selection
import warnings
warnings.filterwarnings('ignore')
import joblib
from sklearn.feature_selection import SequentialFeatureSelector
model = joblib.load(datadir / "trained_hyperparam.model")
seqFeatSel = SequentialFeatureSelector(model, n_features_to_select=16, direction='forward', scoring="accuracy", cv=5, n_jobs=-1)
seqFeatSel.fit(df_train[predictors], df_train[[target]])
# List the features that were iteratively added to the model
selected_features = df_train[predictors].columns[seqFeatSel.support_]
selected_features
# Train/test the model again using only the selected predictors
model = RandomForestClassifier(random_state=42, n_estimators=150, max_depth=14, oob_score=True, n_jobs=-1)
model.fit(df_train[selected_features], df_train[target])
model_score = model.score(df_test[selected_features], df_test[target])
print(f"The performance of the model after forward feature selection is {model_score*100:.2f}%.")