Brief excursus on model validation
Before we generate our ML model, we’ll take a short (theoretical) detour to illustrate the importance of choosing the right cross-validation method during model generation.
First, we read an artificially created dataset:
```python
import pandas as pd

dataset = pd.read_csv("data/artificial_dataset.csv")
```
The dataset consists of 10 stations with 15 cloud cover measurements at each station (cloudy / not cloudy as 1/0) and a couple of MSG band records. Additionally, the digital elevation model and the station locations (as x and y coordinates) are provided:
(The displayed DataFrame has 150 rows × 11 columns.)
What is special about this data set (kept deliberately simple for this example) is that each station was either completely cloud-free for the entire period or completely cloudy for the entire period.
Let’s take a look at the data. We plot the station distribution and color the stations by cloudiness:
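The plotting cell itself is not reproduced in this excerpt. A minimal sketch of such a map, using made-up station coordinates in place of the real `x`/`y` columns from the CSV, could look like:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch also runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical stand-in for the real station table
# (x/y coordinates plus a per-station cloudiness flag).
stations = pd.DataFrame({
    "x": [1, 3, 5, 7, 9, 2, 4, 6, 8, 10],
    "y": [2, 8, 4, 6, 1, 9, 3, 7, 5, 10],
    "cloudy": [0, 1, 0, 1, 0, 1, 0, 1, 0, 1],
})

fig, ax = plt.subplots(1, 1, figsize=(6, 6))
sc = ax.scatter(stations.x, stations.y, c=stations.cloudy, cmap="coolwarm")
ax.set_xlabel("x")
ax.set_ylabel("y")
fig.colorbar(sc, ax=ax, label="cloudy (1) / not cloudy (0)")
```

With the real dataset you would pass `dataset.x` and `dataset.y` instead of the invented coordinates.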
We now want to create an ML classifier that can predict cloudiness (yes/no) based on the provided data.
- Download the dataset and load it via pandas.
- Do you have an idea what could be problematic about this data set?
- Try to create an ML model that can predict cloudiness accurately. How well does your model perform? How do you measure its accuracy?
Terrain height (as a static variable) is a perfect predictor for cloudiness. This has to be accounted for during validation! A plot of the data might help to visualize the problem even better:
```python
import matplotlib.pyplot as plt

fig, ax = plt.subplots(1, 1, figsize=(12, 6))
ax.scatter(x=range(len(dataset.x)), y=dataset.cloudy, c=dataset.DEM)
```
Let’s go ahead and create a random forest classification model and cross-validate it using sklearn’s built-in CV functions:
```python
from sklearn.ensemble import RandomForestClassifier

predictors = ["DEM", "VIS006", "VIS008", "WV_067", "IR_039", "IR_108", "IR_120"]
target = "cloudy"
model = RandomForestClassifier(n_estimators=100)
```
First, we perform a feature selection to remove unnecessary inputs:
```python
%%time
from sklearn.model_selection import KFold
from sklearn.feature_selection import RFECV

rfecv = RFECV(
    estimator=model,
    step=1,
    cv=KFold(10, shuffle=True),
    n_jobs=-1,
    scoring="accuracy",
    min_features_to_select=1,
)
# pass a 1-D Series as y (dataset[[target]] would be a one-column DataFrame)
rfecv.fit(dataset[predictors], dataset[target])
```
CPU times: user 934 ms, sys: 73 ms, total: 1.01 s
Wall time: 2.7 s
Plotting the result shows that the model performs best (100% accuracy) when only one band is used:
```python
fig, ax = plt.subplots(1, 1, figsize=(6, 4))
plt.xlabel("Number of features selected")
plt.ylabel("Cross validation score (accuracy)")
# note: newer scikit-learn versions replace grid_scores_
# with cv_results_["mean_test_score"]
plt.plot(range(1, len(rfecv.grid_scores_) + 1), rfecv.grid_scores_.mean(axis=1))
plt.grid(axis="both")
plt.xticks(range(1, len(rfecv.grid_scores_) + 1, 1))
plt.show()
```
And the best “band” is (surprise): the DEM.
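How would one read off which feature RFECV actually selected? Its `support_` attribute holds a boolean mask over the input columns. A self-contained sketch, with synthetic data in which one (made-up) feature separates the classes perfectly; in the notebook above you would fit on `dataset[predictors]` instead:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
n = 200

# Synthetic stand-in for the real data: the first column separates the
# two classes perfectly, the other three are pure noise.
y = np.repeat([0, 1], n // 2)
X = np.column_stack([
    y + rng.normal(scale=0.05, size=n),  # near-perfect predictor
    rng.normal(size=n),
    rng.normal(size=n),
    rng.normal(size=n),
])
feature_names = np.array(["perfect", "noise1", "noise2", "noise3"])

rfecv = RFECV(
    estimator=RandomForestClassifier(n_estimators=50, random_state=0),
    cv=KFold(5, shuffle=True, random_state=0),
    scoring="accuracy",
)
rfecv.fit(X, y)

# Boolean mask over the input columns: which features survived elimination
selected = feature_names[rfecv.support_]
print(selected)
```

Since recursive feature elimination drops the least important feature in each round, the perfectly separating column is eliminated last and ends up in the selected set.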
So, let’s see what happens if we validate the model whose only input is the DEM using a simple KFold (with 10 folds and shuffling, as during the feature selection):
```python
predictors = ["DEM"]
model = RandomForestClassifier(n_estimators=100)

from sklearn.model_selection import cross_validate
from sklearn.model_selection import KFold

result = cross_validate(
    model,
    dataset[predictors],
    y=dataset[target],
    scoring="accuracy",
    cv=KFold(10, shuffle=True),
    n_jobs=-1,
)
print(result["test_score"].mean())
```
Wow, that’s a good score! But can we trust it?
Not really, because we didn’t strictly separate the training and test data sets by station!
Let’s validate the model again, but this time we ensure that the data set used for testing does not contain samples from a station that is already used in training:
```python
from sklearn.model_selection import GroupKFold

result = cross_validate(
    model,
    dataset[predictors],
    y=dataset[target],
    groups=dataset.station_id,
    scoring="accuracy",
    cv=GroupKFold(10),
)
print(result["test_score"].mean())
```
OK, this looks much worse. We now only get around 20% accuracy, which is not even as good as a random guess. The reason: this time the model received no information during training about the stations contained in the test set. Knowing a station’s elevation therefore no longer helps it derive that station’s cloudiness.
The accuracy is even below a random guess because during training there are always more measurements of the class that is not contained in the test data set. (There are 5 cloudy and 5 non-cloudy stations; removing one station for testing always increases the relative frequency of the opposite class.)
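The mechanism behind both numbers can be reproduced without the original CSV. The sketch below builds a comparable synthetic dataset (10 stations × 15 samples each, with a static per-station `DEM` column whose value the model can memorize; all names and values are made up) and compares a shuffled `KFold` against `GroupKFold`:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, GroupKFold, cross_validate

# 10 stations, 15 samples each; DEM is constant per station, and the
# class label alternates with DEM, so DEM identifies the station (and
# thus its label) but carries no generalizable signal.
stations = pd.DataFrame({
    "station_id": np.repeat(np.arange(10), 15),
    "DEM": np.repeat(np.arange(10) * 100.0, 15),
    "cloudy": np.repeat([0, 1] * 5, 15),
})

model = RandomForestClassifier(n_estimators=50, random_state=0)

naive = cross_validate(
    model, stations[["DEM"]], stations["cloudy"],
    cv=KFold(10, shuffle=True, random_state=0), scoring="accuracy",
)
grouped = cross_validate(
    model, stations[["DEM"]], stations["cloudy"],
    groups=stations["station_id"], cv=GroupKFold(10), scoring="accuracy",
)

print(naive["test_score"].mean())    # close to 1.0: station leakage
print(grouped["test_score"].mean())  # close to 0.0: no leakage, no signal
```

With shuffled folds, every test sample’s station (and thus its DEM value) is also present in training, so the model can simply look the label up; once `GroupKFold` withholds whole stations, the memorized mapping is useless.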
So when you work with time series data from station measurements, always make sure to conduct a true Leave-Location-Out (LLO) cross-validation and not just a simple Leave-One (or Many)-Out cross-validation! As shown above, in sklearn this can be achieved using the GroupKFold cross-validator.