Baseline model

Training

Our preprocessed DataFrame can be handed over to an ML model of the scikit-learn library.

On the example of a simple DecisionTree, this could look like this:

# We define the columns we want to include in the independent features and also the target feature
predictors = bands
target = "label"

from sklearn.tree import DecisionTreeClassifier, plot_tree

# define the classification model
model = DecisionTreeClassifier(random_state=42,max_depth=3)

# fit the model to our data
model.fit(df_train[predictors],df_train[target])
DecisionTreeClassifier(max_depth=3, random_state=42)

We can then save the trained model to the hard disk so we don’t have to keep re-training it for later applications:

import joblib

joblib.dump(model, 'my_trained.model')
['my_trained.model']

If we want to access the trained model again later, we can simply load it via:

model = joblib.load('my_trained.model')

Validation

We can check the model’s performance by applying the models score function to the test DataFrame (which gives us the accuracy):

model.score(df_test[predictors],df_test[target])
0.9714656057032783

We can also visualize the model to get a better understanding of its internal structure:

import matplotlib.pyplot as plt

plt.figure(figsize=(15,10))
plot_tree(model,feature_names=predictors,filled=True,class_names=["not flooded","flooded"],proportion=True)
plt.show()
<Figure size 1500x1000 with 1 Axes>

And we can plot the models assessment of the feature importances:

feat_imps = pd.DataFrame({"feat_importance" : model.feature_importances_,
                          "band" : bands}).sort_values(by="feat_importance")

fig, ax = plt.subplots(1,1,figsize=(4,6))
ax.barh(range(len(feat_imps.feat_importance)),feat_imps.feat_importance)
ax.set_yticks(range(len(feat_imps.feat_importance)))
ax.set_yticklabels(feat_imps.band)
plt.show()
../../../_images/04_basemodel_14_0.png

Application

We can now go ahead and apply the model to our whole domain to see which areas it would classify as flooded:

prediction = model.predict(df.fillna(0)[df.keys()[:-1]]).reshape(labels.shape)
fig, ax = plt.subplots(1,1,figsize=(20,10))
im = ax.imshow(prediction,vmin=0,vmax=1,interpolation="None",cmap="Greys")
plt.show()
../../../_images/04_basemodel_18_0.png

Finally, we can save our prediction results as a GeoTif-file. For this, we first have to convert them into an xarray-DataArray:

# Convert results into DataArray using the dims/coords information of the label data
res_da = xr.DataArray(prediction,dims=labels.dims,coords=[labels.y,labels.x]).astype(np.uint8)

# Save the DataArray as GeoTif using rioxarray
res_da.rio.to_raster("predictions.tif")

Task

  • Train a random forest classifier on the provided training data. For validation purposes, split the data set into 80/20 training/test subsets.

  • Test the model performance using the 20% test subset.

  • Test the model performance with the provided spatially independent test data set. Why does the model performance differ for both test sets?

  • Use the model to detect flooded areas in the whole scene and save your results as a GeoTif file.

  • Visualize your results.

  • Identify possible problems/insufficiencies of the model and try to fix them.

Solution

# Load all data sets
import xarray as xr
sentinel_before = xr.open_dataset("data/ahr_hochwasser_before.nc")
sentinel_after = xr.open_dataset("data/ahr_hochwasser_after.nc")
flooded_train = xr.open_dataset("data/ahr_hochwasser_flooded_areas.tif", engine="rasterio").band_data[0]
flooded_test = xr.open_dataset("data/ahr_hochwasser_flooded_areas_test.tif", engine="rasterio").band_data[0]
dem = xr.open_dataset("data/dem_utm32n_clipped_cropped.tif", engine="rasterio").band_data[0]

# Combine all data sets in one pandas DataFrame
import pandas as pd
bands = list(sentinel_before.data_vars.keys())
df = pd.DataFrame()
for i in range(12):
    df["b_"+bands[i]] = sentinel_before[bands[i]].values.flatten()
    df["a_"+bands[i]] = sentinel_after[bands[i]].values.flatten()
df["dem"] = dem.values.flatten()
df["label_train"] = flooded_train.values.flatten()
df["label_test"] = flooded_test.values.flatten()

# Define predictor and target columns
predictors = df.keys()[:-2]
target = "label_train"

# Clip DataFrame to samples where all predictor bands have valid pixel values
df_valid = df.dropna(subset=predictors)

# Split the DataFrame into train/test (use only the labels in ahr_hochwasser_flooded_areas.tif)
from sklearn.model_selection import train_test_split
df_train, df_test = train_test_split(df_valid[~np.isnan(df_valid.label_train)].drop(columns=["label_test"]),random_state=42,test_size=0.2)

# Train a RF Classifier
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier(random_state=42,oob_score=True,n_jobs=-1)
model.fit(df_train[predictors],df_train[target])

# Check model performance on test data set
print(model.score(df_test[predictors],df_test[target]))

# Check model performance on spatially independent test data set from ahr_hochwasser_flooded_areas_test.tif
df_test2 = df_valid[~np.isnan(df_valid.label_test)].drop(columns=["label_train"])
print(model.score(df_test2[predictors],df_test2["label_test"]))

# Use the model to predict results for the complete scene:
prediction = model.predict(df.fillna(0)[predictors]).reshape(dem.shape)

# Convert results into DataArray using the dims/coords information of the label data
res_da = xr.DataArray(prediction,dims=labels.dims,coords=[labels.y,labels.x]).astype(np.uint8)

# Save the DataArray as GeoTif using rioxarray
res_da.rio.to_raster("predictions.tif")
0.9917367274560283
0.8540003178008183