Baseline model


Our preprocessed DataFrame can be handed over to an ML model of the scikit-learn library.

On the example of a simple DecisionTree, this could look like this:

# We define the columns we want to include in the independent features and also the target feature
predictors = bands
target = "label"

from sklearn.tree import DecisionTreeClassifier, plot_tree

# define the classification model
model = DecisionTreeClassifier(random_state=42,max_depth=3)

# fit the model to our data[predictors],df_train[target])
DecisionTreeClassifier(max_depth=3, random_state=42)

We can then save the trained model to the hard disk so we don’t have to keep re-training it for later applications:

import joblib

joblib.dump(model, 'my_trained.model')

If we want to access the trained model again later, we can simply load it via:

model = joblib.load('my_trained.model')


We can check the model’s performance by applying the models score function to the test DataFrame (which gives us the accuracy):


We can also visualize the model to get a better understanding of its internal structure:

import matplotlib.pyplot as plt

plot_tree(model,feature_names=predictors,filled=True,class_names=["not flooded","flooded"],proportion=True)
<Figure size 1500x1000 with 1 Axes>

And we can plot the models assessment of the feature importances:

feat_imps = pd.DataFrame({"feat_importance" : model.feature_importances_,
                          "band" : bands}).sort_values(by="feat_importance")

fig, ax = plt.subplots(1,1,figsize=(4,6))


We can now go ahead and apply the model to our whole domain to see which areas it would classify as flooded:

prediction = model.predict(df.fillna(0)[df.keys()[:-1]]).reshape(labels.shape)
fig, ax = plt.subplots(1,1,figsize=(20,10))
im = ax.imshow(prediction,vmin=0,vmax=1,interpolation="None",cmap="Greys")

Finally, we can save our prediction results as a GeoTif-file. For this, we first have to convert them into an xarray-DataArray:

# Convert results into DataArray using the dims/coords information of the label data
res_da = xr.DataArray(prediction,dims=labels.dims,coords=[labels.y,labels.x]).astype(np.uint8)

# Save the DataArray as GeoTif using rioxarray"predictions.tif")


  • Train a random forest classifier on the provided training data. For validation purposes, split the data set into 80/20 training/test subsets.

  • Test the model performance using the 20% test subset.

  • Test the model performance with the provided spatially independent test data set. Why does the model performance differ for both test sets?

  • Use the model to detect flooded areas in the whole scene and save your results as a GeoTif file.

  • Visualize your results.

  • Identify possible problems/insufficiencies of the model and try to fix them.


# Load all data sets
import xarray as xr
sentinel_before = xr.open_dataset("data/")
sentinel_after = xr.open_dataset("data/")
flooded_train = xr.open_dataset("data/ahr_hochwasser_flooded_areas.tif", engine="rasterio").band_data[0]
flooded_test = xr.open_dataset("data/ahr_hochwasser_flooded_areas_test.tif", engine="rasterio").band_data[0]
dem = xr.open_dataset("data/dem_utm32n_clipped_cropped.tif", engine="rasterio").band_data[0]

# Combine all data sets in one pandas DataFrame
import pandas as pd
bands = list(sentinel_before.data_vars.keys())
df = pd.DataFrame()
for i in range(12):
    df["b_"+bands[i]] = sentinel_before[bands[i]].values.flatten()
    df["a_"+bands[i]] = sentinel_after[bands[i]].values.flatten()
df["dem"] = dem.values.flatten()
df["label_train"] = flooded_train.values.flatten()
df["label_test"] = flooded_test.values.flatten()

# Define predictor and target columns
predictors = df.keys()[:-2]
target = "label_train"

# Clip DataFrame to samples where all predictor bands have valid pixel values
df_valid = df.dropna(subset=predictors)

# Split the DataFrame into train/test (use only the labels in ahr_hochwasser_flooded_areas.tif)
from sklearn.model_selection import train_test_split
df_train, df_test = train_test_split(df_valid[~np.isnan(df_valid.label_train)].drop(columns=["label_test"]),random_state=42,test_size=0.2)

# Train a RF Classifier
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier(random_state=42,oob_score=True,n_jobs=-1)[predictors],df_train[target])

# Check model performance on test data set

# Check model performance on spatially independent test data set from ahr_hochwasser_flooded_areas_test.tif
df_test2 = df_valid[~np.isnan(df_valid.label_test)].drop(columns=["label_train"])

# Use the model to predict results for the complete scene:
prediction = model.predict(df.fillna(0)[predictors]).reshape(dem.shape)

# Convert results into DataArray using the dims/coords information of the label data
res_da = xr.DataArray(prediction,dims=labels.dims,coords=[labels.y,labels.x]).astype(np.uint8)

# Save the DataArray as GeoTif using rioxarray"predictions.tif")