
Model generation (classifier)

Having merged the satellite data with the station measurements, we can now proceed to the machine learning part. First, we want to train a classifier with two classes (1: cloud free, 2: cloud contaminated or cloud covered). Classifier models suited to this task can be found in the scikit-learn documentation.

So let’s first read our merged data table:

import pandas as pd

# Parse the "time" column as datetimes and replace missing values with 0
data = pd.read_csv("data/stations/metar_station_measurements_with_MSG.csv",
                   parse_dates=["time"]).fillna(0)

As we only want to classify into two classes (cloudy yes/no), we have to adapt the data set a little bit:

# Binarize cloud cover: False = cloud free; True = cloud contaminated or cloud covered
data["cloudcover"] = data["cloudcover"] > 1
data
icao time cloudcover cloud_altitude x y IR_016 IR_039 IR_087 IR_097 IR_108 IR_120 IR_134 VIS006 VIS008 WV_062 WV_073 cmask dem
0 EBAW 2005-01-15 00:00:00 True 2400 2.920539e+05 4.611705e+06 0.0 263.69100 266.75824 241.82200 270.32462 270.10632 254.10730 0.0 0.0 229.80460 253.90987 3.0 5.0
1 EBBR 2005-01-15 00:00:00 True 3600 2.956507e+05 4.596070e+06 0.0 263.69100 267.39102 242.04547 270.97700 271.09344 255.38649 0.0 0.0 230.68666 254.62540 3.0 37.0
2 EBCI 2005-01-15 00:00:00 True 2200 2.967015e+05 4.571825e+06 0.0 262.45435 266.44186 241.59853 269.99844 270.43536 255.38649 0.0 0.0 231.74515 255.34093 3.0 150.0
3 EBLG 2005-01-15 00:00:00 False -999 3.608589e+05 4.580852e+06 0.0 266.16430 268.02380 242.04547 269.67227 269.77728 254.61897 0.0 0.0 230.33385 253.43285 1.0 146.0
4 EBOS 2005-01-15 00:00:00 True 2600 1.875154e+05 4.613160e+06 0.0 263.69100 268.02380 242.04547 271.30316 271.42250 254.87482 0.0 0.0 230.86308 253.43285 3.0 0.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
52007 LZKZ 2006-12-15 21:00:00 True 3300 1.430620e+06 4.434616e+06 0.0 263.76193 264.90475 242.95023 268.16614 268.26227 254.06820 0.0 0.0 231.50674 251.63990 3.0 242.0
52008 LZPP 2006-12-15 21:00:00 True 3500 1.212834e+06 4.443047e+06 0.0 262.61996 263.68170 241.99338 266.59192 267.01120 252.88803 0.0 0.0 233.18146 252.30473 3.0 170.0
52009 LZSL 2006-12-15 21:00:00 True 3300 1.296992e+06 4.439913e+06 0.0 262.61996 264.29324 242.71101 267.53647 267.94950 253.12407 0.0 0.0 232.67905 250.08860 3.0 331.0
52010 LZTT 2006-12-15 21:00:00 False -999 1.354894e+06 4.461921e+06 0.0 262.61996 264.29324 242.23259 267.22162 267.32397 252.65201 0.0 0.0 230.83685 249.42377 1.0 779.0
52011 LZZI 2006-12-15 21:00:00 True 2400 1.246986e+06 4.476102e+06 0.0 262.61996 264.29324 242.23259 267.53647 267.94950 253.12407 0.0 0.0 232.00916 250.75343 3.0 348.0

52012 rows × 19 columns
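
Since accuracy is only meaningful relative to the class balance, a quick look at how the two classes are distributed can be useful (a small aside, not part of the original workflow):

# Relative frequency of the two classes (True = cloudy)
data.cloudcover.value_counts(normalize=True)

A trivial classifier that always predicts the majority class reaches exactly that class's relative frequency as accuracy, which puts the cross-validation scores below into perspective.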

… and then choose a suitable model for the classification task. Here we use a simple KNeighborsClassifier. As discussed in the previous section, we will validate the performance of the chosen model architecture using a grouped k-fold cross-validation approach:

from sklearn.model_selection import cross_validate
from sklearn.model_selection import GroupKFold
from sklearn.neighbors import KNeighborsClassifier

predictors = ["IR_016","IR_039","IR_087","IR_097","IR_108","IR_120","IR_134","VIS006","VIS008","WV_062","WV_073","dem"]
target     = "cloudcover"

result = cross_validate(KNeighborsClassifier(n_neighbors=5, n_jobs=-1),
                        data[predictors], y=data[target],
                        groups=data.icao,  # group folds by station to avoid spatial leakage
                        scoring="accuracy", cv=GroupKFold(), n_jobs=-1)

# Average accuracy score over the folds
result["test_score"].mean()
0.8754267232222827
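
The mean hides some fold-to-fold variability; since each fold holds out a different set of stations, the individual scores are worth a glance:

# Accuracy of each of the five folds (GroupKFold defaults to five splits)
result["test_score"]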

OK, we get around 88% accuracy with this simple model. Let's try the same validation technique to check the performance of a RandomForestClassifier:

from sklearn.ensemble import RandomForestClassifier
result = cross_validate(RandomForestClassifier(n_estimators=100, n_jobs=-1),
                        data[predictors], y=data[target],
                        groups=data.icao, scoring="accuracy",
                        cv=GroupKFold(), n_jobs=-1)

# Average accuracy score over the folds
result["test_score"].mean()
0.9328457768685647

Nice, with ~93% accuracy that looks even better. So let's go ahead and train the classifiers on our data. To additionally guarantee temporal independence, we split the data set by year: 2005 for training and 2006 for testing:

# Temporal split: train on 2005, test on 2006
data_train = data[data.time.dt.year==2005]
data_test  = data[data.time.dt.year==2006]
predictors = ["IR_016","IR_039","IR_087","IR_097","IR_108","IR_120","IR_134","VIS006","VIS008","WV_062","WV_073","dem"]
target     = "cloudcover"

rf_model = RandomForestClassifier(n_estimators=100,n_jobs=-1)
rf_model.fit(data_train[predictors], data_train[target])
print(rf_model.score(data_test[predictors], data_test[target]))

kn_model = KNeighborsClassifier(n_neighbors=5,n_jobs=-1)
kn_model.fit(data_train[predictors], data_train[target])
print(kn_model.score(data_test[predictors], data_test[target]))
0.910585895670797
0.8617485084901331

We see that both models trained on data from 2005 reach good scores on the test data from 2006 (RF: 91%, KN: 86%).
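
Accuracy alone does not tell us how the errors are distributed over the two classes. To look a bit closer, one could compute a confusion matrix for the random forest on the 2006 test data, e.g.:

from sklearn.metrics import confusion_matrix

# Rows: true class, columns: predicted class (order: False = cloud free, True = cloudy)
confusion_matrix(data_test[target], rf_model.predict(data_test[predictors]))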

We can now save our models for later use:

import joblib

joblib.dump(rf_model,"rf_classifier.model")
joblib.dump(kn_model,"kn_classifier.model")
['kn_classifier.model']
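
Loading them back in a later session works the same way:

import joblib

rf_model = joblib.load("rf_classifier.model")
kn_model = joblib.load("kn_classifier.model")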

Task

Design and train an ML model suitable to build a cloud mask (cloudy yes/no) based on the provided MSG data. Don’t forget to enhance your model via hyperparameter tuning and feature selection.
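
As a starting point for the hyperparameter tuning, one could combine GridSearchCV with the grouped cross-validation used above. The following is only a minimal sketch, and the parameter grid is purely illustrative:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, GroupKFold

# Illustrative parameter grid -- adjust the ranges to your needs
param_grid = {"n_estimators": [50, 100, 200],
              "max_depth": [None, 10, 20]}

search = GridSearchCV(RandomForestClassifier(n_jobs=-1), param_grid,
                      scoring="accuracy", cv=GroupKFold(), n_jobs=-1)
search.fit(data_train[predictors], data_train[target], groups=data_train.icao)
print(search.best_params_, search.best_score_)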