Brief excursus on model validation

Before we generate our ML model, we’ll take a short (theoretical) detour to illustrate the importance of choosing the right cross-validation method during model generation.

First, we read an artificially created dataset:

import pandas as pd
dataset = pd.read_csv("data/artificial_dataset.csv")

The dataset consists of 10 stations with 15 cloud cover measurements at each station (cloudy / not cloudy as 1/0) and several MSG band records. Additionally, the digital elevation model (DEM) and the station locations (as x and y coordinates) are provided:

dataset
station_id DEM cloudy x y IR_108 IR_120 VIS006 VIS008 WV_067 IR_039
0 1 57 1 3 4 297 284 264 290 288 355
1 1 57 1 3 4 333 342 257 269 289 240
2 1 57 1 3 4 292 225 259 335 336 312
3 1 57 1 3 4 257 330 324 247 293 355
4 1 57 1 3 4 293 329 228 253 301 224
... ... ... ... ... ... ... ... ... ... ... ...
145 10 62 0 8 7 239 343 299 237 240 318
146 10 62 0 8 7 311 233 347 315 283 285
147 10 62 0 8 7 268 268 261 302 295 250
148 10 62 0 8 7 357 285 240 223 292 278
149 10 62 0 8 7 348 353 243 228 324 254

150 rows × 11 columns

To keep the example simple, this data set was constructed so that each station was either completely cloud-free for the entire period or cloudy for the entire period.
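
We can quickly verify this with a small sanity check (a minimal sketch, assuming the dataset has been loaded as above):

# Number of distinct cloudiness values per station:
# a 1 everywhere confirms that each station is either always cloudy or always clear.
print(dataset.groupby("station_id")["cloudy"].nunique())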

Let’s take a look at the data. We plot the station distribution and color the stations by cloudiness:

dataset.plot.scatter("x", "y", c=dataset.cloudy, cmap="RdYlGn", label="Cloudy")
[Figure: station locations (x/y), colored by cloudiness]

We now want to create an ML classifier that can predict cloudiness (yes/no) based on the provided data.

Task

  • Download the dataset and load it via pandas.

  • Do you have an idea what could be problematic about this data set?

  • Try to create an ML model that can predict cloudiness accurately. How well does your model perform? How do you measure its accuracy?

Solution

Terrain height (a static variable) is a perfect predictor of cloudiness. This has to be accounted for during validation! A plot of the data helps to visualize the problem:

import matplotlib.pyplot as plt

fig, ax = plt.subplots(1,1,figsize=(12,6))
ax.scatter(x=range(len(dataset.x)),y=dataset.cloudy,c=dataset.DEM)
[Figure: cloudiness (0/1) for each sample, colored by station elevation (DEM)]
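
A per-station summary makes the same point in tabular form (a minimal sketch, not part of the original notebook):

# One row per station: DEM and cloudiness are constant within each station,
# so the DEM alone separates the two classes perfectly.
print(dataset.groupby("station_id")[["DEM", "cloudy"]].first())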

Let’s go ahead and create a random forest classification model and cross-validate it using sklearn’s built-in CV functions:

from sklearn.ensemble import RandomForestClassifier

predictors = ["DEM","VIS006","VIS008","WV_067","IR_039","IR_108","IR_120"]
target     = "cloudy"

model = RandomForestClassifier(n_estimators=100)

We perform a feature selection first to remove unnecessary inputs:

%%time

from sklearn.model_selection import KFold
from sklearn.feature_selection import RFECV

rfecv = RFECV(estimator=model, step=1, cv=KFold(10,shuffle=True), n_jobs = -1, scoring='accuracy', min_features_to_select = 1)
rfecv.fit(dataset[predictors], dataset[target])
CPU times: user 934 ms, sys: 73 ms, total: 1.01 s
Wall time: 2.7 s

Plotting the result shows that the model performs best (100% accuracy) when only one band is used:

fig, ax = plt.subplots(1,1,figsize=(6,4))
plt.xlabel("Number of features selected")
plt.ylabel("Cross validation score (accuracy)")
plt.plot(range(1, len(rfecv.grid_scores_)+1), rfecv.grid_scores_.mean(axis=1))
plt.grid(axis="both")
plt.xticks(range(1,len(rfecv.grid_scores_)+1,1))
plt.show()
[Figure: cross-validation accuracy vs. number of selected features]
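
Note: grid_scores_ has been deprecated and later removed in newer scikit-learn releases. If the snippet above fails, the same curve can be drawn from cv_results_ instead; a hedged sketch, assuming scikit-learn ≥ 1.0:

# Equivalent plot for newer scikit-learn versions, where the fold-averaged scores
# are stored in rfecv.cv_results_ instead of rfecv.grid_scores_.
mean_scores = rfecv.cv_results_["mean_test_score"]
plt.plot(range(1, len(mean_scores) + 1), mean_scores)
plt.xlabel("Number of features selected")
plt.ylabel("Cross validation score (accuracy)")
plt.show()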

And the best “band” is (surprise): the DEM.

dataset[predictors].columns[rfecv.ranking_==1]
Index(['DEM'], dtype='object')

So let’s see what happens if we validate a model whose only input is the DEM using a simple KFold (with 10 folds and shuffling, as during the feature selection):

predictors = ["DEM"]
model = RandomForestClassifier(n_estimators=100)

from sklearn.model_selection import cross_validate
from sklearn.model_selection import KFold

result = cross_validate(model, dataset[predictors], y=dataset[target], scoring="accuracy", cv=KFold(10,shuffle=True), n_jobs=-1)
print(result["test_score"].mean())
1.0

Wow, that’s a good score! But can we trust it?

Not really, because we didn’t strictly separate the training and test data sets by station!
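
The leakage is easy to demonstrate: with a shuffled KFold, every test fold shares stations with its training fold. A minimal sketch of such a check (the random_state is an arbitrary choice, just for reproducibility):

from sklearn.model_selection import KFold

# For each shuffled fold, list the stations that occur in both train and test:
# with a plain KFold this overlap is practically never empty, so the model has
# already "seen" every test station during training.
for fold, (train_idx, test_idx) in enumerate(KFold(10, shuffle=True, random_state=0).split(dataset)):
    shared = set(dataset.station_id.iloc[train_idx]) & set(dataset.station_id.iloc[test_idx])
    print(f"fold {fold}: stations in both train and test: {sorted(shared)}")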

Let’s validate the model again, but this time we ensure that the data set used for testing does not contain samples from any station that also appears in the training data:

from sklearn.model_selection import GroupKFold

result = cross_validate(model, dataset[predictors], y=dataset[target], groups=dataset.station_id, scoring="accuracy", cv=GroupKFold(10))
print(result["test_score"].mean())
0.2

OK, this looks much worse. We now get only around 20% accuracy, which is not even as good as a random guess. This is because we no longer help the model by providing it, during training, with information about stations that are also contained in the test data set. Knowing a station’s elevation no longer helps the model derive its cloudiness.

The accuracy is even below a random guess because during training there are always more samples of the class that is not present in the test data set: there are 5 cloudy and 5 non-cloudy stations, so removing one station for testing always increases the relative frequency of the opposite class in the training data.
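
A short calculation illustrates this; a sketch assuming, as here, that each fold holds out exactly one station:

# For each held-out station: how often does its class occur in the remaining
# training data? With 5 cloudy and 5 clear stations, the answer is always 4/9,
# i.e. the opposite class is in the majority during training.
for station in dataset.station_id.unique():
    train = dataset[dataset.station_id != station]
    test_class = dataset.loc[dataset.station_id == station, "cloudy"].iloc[0]
    share = (train.cloudy == test_class).mean()
    print(f"station {station}: share of the held-out class in training = {share:.2f}")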

So when you work with time series data from station measurements, always make sure to conduct a true Leave-Location-Out (LLO) cross-validation and not just a simple Leave-One-Out (or Leave-Many-Out) cross-validation! As shown above, in sklearn this can be achieved using the GroupKFold cross-validator.
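
Since there are exactly 10 stations, GroupKFold(10) here already behaves like a leave-one-station-out scheme; scikit-learn also provides LeaveOneGroupOut for this directly. A minimal sketch, continuing with the model and predictors defined above:

from sklearn.model_selection import LeaveOneGroupOut

# Leave-Location-Out validation: each fold holds out all samples of exactly one station.
result = cross_validate(model, dataset[predictors], y=dataset[target],
                        groups=dataset.station_id, scoring="accuracy",
                        cv=LeaveOneGroupOut())
print(result["test_score"].mean())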