Model generation (classifier)¶
After having merged the satellite data with the station measurements, we can now proceed to the machine learning part. First, we want to train a classifier that distinguishes two classes (1: cloud free; 2: cloud contaminated or cloud covered). An overview of classifier models suitable for this task can be found in the scikit-learn documentation.
So let’s first read our merged data table:
import pandas as pd
data = pd.read_csv("data/stations/metar_station_measurements_with_MSG.csv",parse_dates=["time"]).fillna(0)
As we only want to classify into two classes (cloudy yes/no), we have to adapt the data set a little bit:
# Binarize cloud cover: False = cloud free; True = Cloud contaminated or cloud covered
data.cloudcover = data.cloudcover>1
data
  | icao | time | cloudcover | cloud_altitude | x | y | IR_016 | IR_039 | IR_087 | IR_097 | IR_108 | IR_120 | IR_134 | VIS006 | VIS008 | WV_062 | WV_073 | cmask | dem
--- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | ---
0 | EBAW | 2005-01-15 00:00:00 | True | 2400 | 2.920539e+05 | 4.611705e+06 | 0.0 | 263.69100 | 266.75824 | 241.82200 | 270.32462 | 270.10632 | 254.10730 | 0.0 | 0.0 | 229.80460 | 253.90987 | 3.0 | 5.0 |
1 | EBBR | 2005-01-15 00:00:00 | True | 3600 | 2.956507e+05 | 4.596070e+06 | 0.0 | 263.69100 | 267.39102 | 242.04547 | 270.97700 | 271.09344 | 255.38649 | 0.0 | 0.0 | 230.68666 | 254.62540 | 3.0 | 37.0 |
2 | EBCI | 2005-01-15 00:00:00 | True | 2200 | 2.967015e+05 | 4.571825e+06 | 0.0 | 262.45435 | 266.44186 | 241.59853 | 269.99844 | 270.43536 | 255.38649 | 0.0 | 0.0 | 231.74515 | 255.34093 | 3.0 | 150.0 |
3 | EBLG | 2005-01-15 00:00:00 | False | -999 | 3.608589e+05 | 4.580852e+06 | 0.0 | 266.16430 | 268.02380 | 242.04547 | 269.67227 | 269.77728 | 254.61897 | 0.0 | 0.0 | 230.33385 | 253.43285 | 1.0 | 146.0 |
4 | EBOS | 2005-01-15 00:00:00 | True | 2600 | 1.875154e+05 | 4.613160e+06 | 0.0 | 263.69100 | 268.02380 | 242.04547 | 271.30316 | 271.42250 | 254.87482 | 0.0 | 0.0 | 230.86308 | 253.43285 | 3.0 | 0.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
52007 | LZKZ | 2006-12-15 21:00:00 | True | 3300 | 1.430620e+06 | 4.434616e+06 | 0.0 | 263.76193 | 264.90475 | 242.95023 | 268.16614 | 268.26227 | 254.06820 | 0.0 | 0.0 | 231.50674 | 251.63990 | 3.0 | 242.0 |
52008 | LZPP | 2006-12-15 21:00:00 | True | 3500 | 1.212834e+06 | 4.443047e+06 | 0.0 | 262.61996 | 263.68170 | 241.99338 | 266.59192 | 267.01120 | 252.88803 | 0.0 | 0.0 | 233.18146 | 252.30473 | 3.0 | 170.0 |
52009 | LZSL | 2006-12-15 21:00:00 | True | 3300 | 1.296992e+06 | 4.439913e+06 | 0.0 | 262.61996 | 264.29324 | 242.71101 | 267.53647 | 267.94950 | 253.12407 | 0.0 | 0.0 | 232.67905 | 250.08860 | 3.0 | 331.0 |
52010 | LZTT | 2006-12-15 21:00:00 | False | -999 | 1.354894e+06 | 4.461921e+06 | 0.0 | 262.61996 | 264.29324 | 242.23259 | 267.22162 | 267.32397 | 252.65201 | 0.0 | 0.0 | 230.83685 | 249.42377 | 1.0 | 779.0 |
52011 | LZZI | 2006-12-15 21:00:00 | True | 2400 | 1.246986e+06 | 4.476102e+06 | 0.0 | 262.61996 | 264.29324 | 242.23259 | 267.53647 | 267.94950 | 253.12407 | 0.0 | 0.0 | 232.00916 | 250.75343 | 3.0 | 348.0 |
52012 rows × 19 columns
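Since we will judge the models by accuracy, it is worth checking the class balance first: a trivial classifier that always predicts the majority class already reaches an accuracy equal to the majority-class fraction. A minimal sketch with a toy stand-in for the `cloudcover` column (the values here are illustrative, not from the real data set):

```python
import pandas as pd

# Toy stand-in for data.cloudcover after binarization
cloudcover = pd.Series([True, True, True, False, True, False, True, True])

# Fraction of each class
balance = cloudcover.value_counts(normalize=True)
print(balance)

# An "always predict the majority class" baseline would score this accuracy:
baseline_accuracy = balance.max()
print(baseline_accuracy)  # -> 0.75 for this toy series
```

Any model worth keeping should clearly beat this baseline on the real data.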
… and then choose a suitable model for the classification task. Here we use a simple KNeighborsClassifier. As discussed in the previous section, we validate the performance of the chosen model architecture with a grouped K-fold cross-validation approach, grouping by station so that measurements from the same station never appear in both the training and the test fold:
from sklearn.model_selection import cross_validate
from sklearn.model_selection import GroupKFold
from sklearn.neighbors import KNeighborsClassifier
predictors = ["IR_016","IR_039","IR_087","IR_097","IR_108","IR_120","IR_134","VIS006","VIS008","WV_062","WV_073","dem"]
target = "cloudcover"
result = cross_validate(KNeighborsClassifier(n_neighbors=5,n_jobs=-1), data[predictors], y=data[target], groups=data.icao, scoring="accuracy", cv=GroupKFold(), n_jobs=-1)
# Print average accuracy score
result["test_score"].mean()
0.8754267232222827
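To make the grouping behaviour explicit: `GroupKFold` guarantees that all samples sharing a group label (here the station `icao`) land on the same side of every split. A small self-contained illustration (the station names and data are made up):

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Six samples from three fictitious stations, two samples each
X = np.arange(12).reshape(6, 2)
y = np.array([0, 1, 0, 1, 0, 1])
groups = np.array(["EBAW", "EBAW", "EBBR", "EBBR", "EBCI", "EBCI"])

for train_idx, test_idx in GroupKFold(n_splits=3).split(X, y, groups):
    # No station ever appears on both sides of a split
    assert set(groups[train_idx]).isdisjoint(groups[test_idx])
    print(groups[test_idx])
```

This is what prevents the model from "memorizing" station-specific characteristics and inflating the validation score.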
OK, we reach roughly 87.5% accuracy with this simple model. Let’s try the same validation technique to check the performance of a RandomForestClassifier:
from sklearn.ensemble import RandomForestClassifier
result = cross_validate(RandomForestClassifier(n_estimators=100,n_jobs=-1), data[predictors], y=data[target], groups=data.icao, scoring="accuracy", cv=GroupKFold(), n_jobs=-1)
# Print average accuracy score
result["test_score"].mean()
0.9328457768685647
Nice, with ~93% accuracy that looks even better. So let’s go ahead and train the classifiers on our data. In order to additionally guarantee temporal independence, we split the data set into the years 2005 (for training) and 2006 (for testing):
data_train = data[data.time.dt.year==2005]
data_test = data[data.time.dt.year==2006]
predictors = ["IR_016","IR_039","IR_087","IR_097","IR_108","IR_120","IR_134","VIS006","VIS008","WV_062","WV_073","dem"]
target = "cloudcover"
rf_model = RandomForestClassifier(n_estimators=100,n_jobs=-1)
rf_model.fit(data_train[predictors], data_train[target])
print(rf_model.score(data_test[predictors], data_test[target]))
kn_model = KNeighborsClassifier(n_neighbors=5,n_jobs=-1)
kn_model.fit(data_train[predictors], data_train[target])
print(kn_model.score(data_test[predictors], data_test[target]))
0.910585895670797
0.8617485084901331
We see that both models trained on data from 2005 reach good scores (RF: 91% and KN: 86%) for the test data from 2006.
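Accuracy alone can hide asymmetric errors (e.g. systematically missing thin clouds), so per-class precision and recall are worth a look as well. A sketch using `sklearn.metrics.classification_report`, with hypothetical labels standing in for `data_test[target]` and `rf_model.predict(data_test[predictors])`:

```python
from sklearn.metrics import classification_report

# Hypothetical ground truth and predictions (placeholders, not real results)
y_true = [True, True, False, True, False, False]
y_pred = [True, True, False, False, False, True]

# Per-class precision, recall and F1 score
print(classification_report(y_true, y_pred))
```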
We can now save our models for later use:
import joblib
joblib.dump(rf_model,"rf_classifier.model")
joblib.dump(kn_model,"kn_classifier.model")
['kn_classifier.model']
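The saved models can later be restored with `joblib.load`. A self-contained round-trip sketch, using a tiny toy model and a temporary path instead of the files written above:

```python
import os
import tempfile

import joblib
from sklearn.neighbors import KNeighborsClassifier

# Fit a minimal toy model (stands in for the classifiers trained above)
X = [[0.0], [1.0], [2.0], [3.0]]
y = [0, 0, 1, 1]
model = KNeighborsClassifier(n_neighbors=1).fit(X, y)

# Dump and reload, as done for rf_classifier.model / kn_classifier.model
path = os.path.join(tempfile.mkdtemp(), "toy_classifier.model")
joblib.dump(model, path)
restored = joblib.load(path)

# The restored model predicts exactly like the original
print(restored.predict([[2.6]]))  # -> [1]
```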
Task¶
Design and train an ML model suitable to build a cloud mask (cloudy yes/no) based on the provided MSG data. Don’t forget to enhance your model via hyperparameter tuning and feature selection.
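As a starting point for the tuning part, hyperparameter search should use the same grouped cross-validation as above so that station leakage does not bias the selection. A hedged sketch with `GridSearchCV` (the parameter grid and the synthetic stand-in data are placeholders; substitute `data[predictors]`, `data[target]` and `data.icao`):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, GroupKFold

# Synthetic stand-in data: 60 samples from 6 fictitious stations
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 3))
y = rng.integers(0, 2, size=60)
groups = np.repeat(np.arange(6), 10)

# Grid search over an illustrative (not recommended) parameter grid,
# cross-validated with station-wise groups
search = GridSearchCV(
    RandomForestClassifier(n_jobs=-1),
    param_grid={"n_estimators": [50, 100], "max_depth": [None, 10]},
    scoring="accuracy",
    cv=GroupKFold(n_splits=3),
)
search.fit(X, y, groups=groups)
print(search.best_params_)
```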