Baseline model¶

Training¶

Our preprocessed DataFrame can be handed over to an ML model of the scikit-learn library.

On the example of a simple DecisionTree, this could look like this:

# We define the columns we want to include in the independent features and also the target feature
predictors = bands
target = "label"

from sklearn.tree import DecisionTreeClassifier, plot_tree

# define the classification model
model = DecisionTreeClassifier(random_state=42,max_depth=3)

# fit the model to our data
model.fit(df_train[predictors],df_train[target])

DecisionTreeClassifier(max_depth=3, random_state=42)

We can then save the trained model to the hard disk so we don’t have to keep re-training it for later applications:

import joblib

joblib.dump(model, 'my_trained.model')

['my_trained.model']

If we want to access the trained model again later, we can simply load it via:

model = joblib.load('my_trained.model')

Validation¶

We can check the model’s performance by applying the models score function to the test DataFrame (which gives us the accuracy):

model.score(df_test[predictors],df_test[target])

0.9714656057032783

We can also visualize the model to get a better understanding of its internal structure:

import matplotlib.pyplot as plt

plt.figure(figsize=(15,10))
plot_tree(model,feature_names=predictors,filled=True,class_names=["not flooded","flooded"],proportion=True)
plt.show()

<Figure size 1500x1000 with 1 Axes>

And we can plot the models assessment of the feature importances:

feat_imps = pd.DataFrame({"feat_importance" : model.feature_importances_,
                          "band" : bands}).sort_values(by="feat_importance")

fig, ax = plt.subplots(1,1,figsize=(4,6))
ax.barh(range(len(feat_imps.feat_importance)),feat_imps.feat_importance)
ax.set_yticks(range(len(feat_imps.feat_importance)))
ax.set_yticklabels(feat_imps.band)
plt.show()

Application¶

We can now go ahead and apply the model to our whole domain to see which areas it would classify as flooded:

prediction = model.predict(df.fillna(0)[df.keys()[:-1]]).reshape(labels.shape)

fig, ax = plt.subplots(1,1,figsize=(20,10))
im = ax.imshow(prediction,vmin=0,vmax=1,interpolation="None",cmap="Greys")
plt.show()

Finally, we can save our prediction results as a GeoTif-file. For this, we first have to convert them into an xarray-DataArray:

# Convert results into DataArray using the dims/coords information of the label data
res_da = xr.DataArray(prediction,dims=labels.dims,coords=[labels.y,labels.x]).astype(np.uint8)

# Save the DataArray as GeoTif using rioxarray
res_da.rio.to_raster("predictions.tif")

Task¶

Train a random forest classifier on the provided training data. For validation purposes, split the data set into 80/20 training/test subsets.
Test the model performance using the 20% test subset.
Test the model performance with the provided spatially independent test data set. Why does the model performance differ for both test sets?
Use the model to detect flooded areas in the whole scene and save your results as a GeoTif file.
Visualize your results.
Identify possible problems/insufficiencies of the model and try to fix them.

Solution¶

# Load all data sets
import xarray as xr
sentinel_before = xr.open_dataset("data/ahr_hochwasser_before.nc")
sentinel_after = xr.open_dataset("data/ahr_hochwasser_after.nc")
flooded_train = xr.open_dataset("data/ahr_hochwasser_flooded_areas.tif", engine="rasterio").band_data[0]
flooded_test = xr.open_dataset("data/ahr_hochwasser_flooded_areas_test.tif", engine="rasterio").band_data[0]
dem = xr.open_dataset("data/dem_utm32n_clipped_cropped.tif", engine="rasterio").band_data[0]

# Combine all data sets in one pandas DataFrame
import pandas as pd
bands = list(sentinel_before.data_vars.keys())
df = pd.DataFrame()
for i in range(12):
    df["b_"+bands[i]] = sentinel_before[bands[i]].values.flatten()
    df["a_"+bands[i]] = sentinel_after[bands[i]].values.flatten()
df["dem"] = dem.values.flatten()
df["label_train"] = flooded_train.values.flatten()
df["label_test"] = flooded_test.values.flatten()

# Define predictor and target columns
predictors = df.keys()[:-2]
target = "label_train"

# Clip DataFrame to samples where all predictor bands have valid pixel values
df_valid = df.dropna(subset=predictors)

# Split the DataFrame into train/test (use only the labels in ahr_hochwasser_flooded_areas.tif)
from sklearn.model_selection import train_test_split
df_train, df_test = train_test_split(df_valid[~np.isnan(df_valid.label_train)].drop(columns=["label_test"]),random_state=42,test_size=0.2)

# Train a RF Classifier
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier(random_state=42,oob_score=True,n_jobs=-1)
model.fit(df_train[predictors],df_train[target])

# Check model performance on test data set
print(model.score(df_test[predictors],df_test[target]))

# Check model performance on spatially independent test data set from ahr_hochwasser_flooded_areas_test.tif
df_test2 = df_valid[~np.isnan(df_valid.label_test)].drop(columns=["label_train"])
print(model.score(df_test2[predictors],df_test2["label_test"]))

# Use the model to predict results for the complete scene:
prediction = model.predict(df.fillna(0)[predictors]).reshape(dem.shape)

# Convert results into DataArray using the dims/coords information of the label data
res_da = xr.DataArray(prediction,dims=labels.dims,coords=[labels.y,labels.x]).astype(np.uint8)

# Save the DataArray as GeoTif using rioxarray
res_da.rio.to_raster("predictions.tif")

0.9917367274560283
0.8540003178008183

Preprocessing Hyperparameter tuning

Winter Semester 21/22