Model analysis

To measure the performance of your model, there are several evaluation metrics.

Below we will cover the following most commonly used metrics for classification.

  • Accuracy

  • Precision

  • Recall or Sensitivity

  • F1 Score

#load our previously trained models:
import joblib

rf_model = joblib.load("rf_classifier.model")
kn_model = joblib.load("kn_classifier.model")

Confusion matrix

For classification problems, the confusion or contingency matrix is a good starting point. It can be used with two or more classes and is the basis for calculating the before mentioned metrics.

Image source:

The rows represent the predicted classes and the columns represent the actual classes in the validation or test data set.

In the case of two classes, the four fields of the confusion matrix have the following meaning:

  • True Positive (TP): Cases where the pixel was actually cloudy and also classified as cloudy.

  • True Negative (TN): Cases when the pixel was actually cloud free and also classified as cloud free.

  • False Positive (FP): Cases when the pixel was actually cloud free but classified as cloudy.

  • False Negative (FN): Cases when the pixel was actually cloudy but classified as cloud free.

To fill the corresponding fields in the matrix, the classification results are compared with the validation data for each pixel and the number of cases for TP, TN, FP, FN are counted.

For a good prediction, in this matrix, true and false positives as well as true and false negatives are juxtaposed. Obviously, true positives (predicted = True; measured = True) and true negatives (predicted = False; measured = False) should show high values and both false classes should show low values respectively.

We can use the ConfusionMatrixDisplay of sklearn class for this task.

The confusion matrix of our random forest model predictions looks like this:

from sklearn.metrics import ConfusionMatrixDisplay

ConfusionMatrixDisplay.from_estimator(rf_model, data_test[predictors], data_test[target],cmap="Reds")
<sklearn.metrics._plot.confusion_matrix.ConfusionMatrixDisplay at 0x7f95c469b310>

The confusion matrix of our KNN model predictions looks like this:

ConfusionMatrixDisplay.from_estimator(kn_model, data_test[predictors], data_test[target],cmap="Blues")
<sklearn.metrics._plot.confusion_matrix.ConfusionMatrixDisplay at 0x7f95c4541290>


The Accuracy indicates the proportion of correctly classified pixels out of the total number of classified pixels.

It is a good measure if the target variable classes in the data set are almost balanced.

If your dataset is unbalanced, that is, if the majority of the target variable classes belong to one class, you should not use accuracy.

Image source:

# Make predictions to calculate the evaluation scores
y_pred = rf_model.predict(data_test[predictors])
# Calculate accuracy score
from sklearn.metrics import accuracy_score

accuracy_score(data_test[target], y_pred)

To account for imbalances in the sample counts of the classes, we will also calculate the balanced accuracy score:

from sklearn.metrics import balanced_accuracy_score

balanced_accuracy_score(data_test[target], y_pred, adjusted=True)


The Precision measures the performance of the model with respect to FP and indicates how accurate the model is.

It indicates what proportion of the pixels classified as cloudy are actually cloudy.

Image source:

# Calculate precission score
from sklearn.metrics import precision_score

precision_score(data_test[target], y_pred)

Recall or Sensitivity

The Recall or Sensitivity indicates the proportion of cloudy pixels that were correctly detected by the model.

Unlike precision, it is not so much a matter of correctly detecting e.g. cloudy pixels, but rather of detecting all cloudy pixels. If the model simply classified all pixels as cloudy, we would get 100% recall.

It is therefore recommended to analyze Recall and Precision in combination.

Image source:

# Calculate precission score
from sklearn.metrics import recall_score

recall_score(data_test[target], y_pred)

F1 Score

The F1 score combines Precision and Recall into one metric by calculating the harmonic mean of the two metrics.

\( F1 = 2 \cdot \frac{Precision\cdot Recall} {Precision + Recall} \)

If Precision and Recall are high, F1 is also high. If they are both low, F1 is low. If one is high and the other is low, F1 is low.

F1 can be used to determine whether the model is actually good at classifying e.g. cloud-free pixels, or whether it is finding shortcuts (e.g., simply identifying everything as cloud-free).

# Calculate f1 score
from sklearn.metrics import f1_score

f1_score(data_test[target], y_pred)

We can use classification_report of sklearn class to get a summary of the metrics for your model.

from sklearn.metrics import classification_report

              precision    recall  f1-score   support

       False       0.88      0.88      0.88      9731
        True       0.93      0.93      0.93     16417

    accuracy                           0.91     26148
   macro avg       0.90      0.91      0.90     26148
weighted avg       0.91      0.91      0.91     26148

All of the above scores can also be calculated for a classification problem with multiple classes (e.g. practical introduction).

Image source:

Together with the confusion matrix this can give you detailed information about the classification performance with respect to specific classes.

Image source:

The evaluation metrics tell you what kind of errors the model made. The confusion matrix tells you exactly where mistakes were made.

Analysing the calculated metrics together with the confusion matrix provides more nuanced insights in how the model performs.

Receiver Operating Characteristic (ROC)

A receiver operating characteristic curve, or ROC curve, is a graphical plot that illustrates the diagnostic ability of a binary classifier system as its discrimination threshold is varied. The method was originally developed for operators of military radar receivers starting in 1941, which led to its name.

It plots the True Positive Rate (TP/(TP+FN) versus the False Positive Rate (FP/(FP+TN).

Soure: Wikipedia

Applied to our classifier models, the ROC can help to understand their abilities to identify cloudy pixels with varying classification thresholds. So let’s build an ROC.

We can create ROC curves for both our models using the RocCurveDisplay class of sklearn:

from sklearn.metrics import RocCurveDisplay
import matplotlib.pyplot as plt

predictors = ["IR_016","IR_039","IR_087","IR_097","IR_108","IR_120","IR_134","VIS006","VIS008","WV_062","WV_073","dem"]
target     = "cloudcover"

rf_disp = RocCurveDisplay.from_estimator(rf_model, data_test[predictors], data_test[target])
ax = plt.gca()
kn_disp = RocCurveDisplay.from_estimator(kn_model, data_test[predictors], data_test[target], ax=ax)

We see that both models perform well on the data with areas under the curve (AUC) of 92% (KNeighborsClassifier) and 97% (RandomForestClassifier). However, the RF model slightly outperforms the simple KN model.

Classification thresholds

But how do these different classification thresholds impact the prediction results of our models? Let’s take a look…

First, we read an exemplary scene from our test data set from 2006 and reshape it for our model to be able to digest it:

import xarray as xr

# Load DEM
dem = xr.open_dataset("data/dem_cropped.tif",engine="rasterio").band_data

# Load MSG and data into DataSet
satellite_data = xr.open_dataset("data/satellite/2006/")

# Add DEM to satellite_data-DataSet
satellite_data["dem"] = (["time","y","x"],dem.values)

# Extract all predictor variables
satellite_data_arr = np.array([satellite_data[predictor].values for predictor in predictors])

# Flatten predictors to 1d-arrays
satellite_data_arr_reshaped = satellite_data_arr.reshape(satellite_data_arr.shape[0],-1)

Then we predict for each pixel if it is cloudy or not. This time, we use the predict_proba function so that we get class probabilities for each pixel and not the class value itself:

prediction_rf = rf_model.predict_proba(np.where(np.isnan(satellite_data_arr_reshaped),0,satellite_data_arr_reshaped).T)
prediction_kn = kn_model.predict_proba(np.where(np.isnan(satellite_data_arr_reshaped),0,satellite_data_arr_reshaped).T)

Now we can define a plotting routine that depicts the predictions results together with the satellite data:

from sklearn.preprocessing import binarize

# Plotting method for creating a plot of the model predictions and the original satellite data
def plotResult(threshold_rf,threshold_kn,cmask=False):
    preds_rf = binarize(prediction_rf, threshold=threshold_rf)[:,1].reshape(dem.shape[1:])
    preds_rf = np.where(np.isnan(satellite_data.IR_039[0]),np.nan,preds_rf)
    preds_kn = binarize(prediction_kn, threshold=threshold_kn)[:,1].reshape(dem.shape[1:])
    preds_kn = np.where(np.isnan(satellite_data.IR_039[0]),np.nan,preds_kn)
    r = normArray(satellite_data.IR_039[0]*-1)
    g = normArray(satellite_data.IR_108[0]*-1)
    b = normArray(satellite_data.IR_120[0]*-1)

    fig, ax = plt.subplots(1,1,figsize=(12,8))
    if cmask: ax.imshow(np.where(satellite_data.cmask[0]>1,1,np.nan),cmap="autumn_r",interpolation="None",alpha=0.5)
    ax.set_title("Cloud mask")

And finally, we create an interactive plot that shows us the changes in our prediction results depending on the chosen classification thresholds:

from ipywidgets import interact, Layout
import ipywidgets as widgets

         threshold_rf = widgets.FloatSlider(min=-0.01, max=1, step=0.01, value=0.5, description='RF Threshold:', layout=Layout(width='75%')),
         threshold_kn = widgets.FloatSlider(min=-0.01, max=1, step=0.01, value=0.5, description='KN Threshold:', layout=Layout(width='75%')),
         cmask     = widgets.Checkbox(value=False,description='COMET C-Mask (yellow)'),

The graph is not interactive on this website as it only works in a running Jupyter Environment.

Further reading: