Model generation (classifier)¶
After having merged the satellite data with the station measurements, we can now proceed to the machine learning part. First, we want to train a classifier that distinguishes two classes (1: cloud free; 2: cloud contaminated or cloud covered). An overview of classifier models suitable for this task can be found in the scikit-learn documentation.
So let’s first read our merged data table:
import pandas as pd
data = pd.read_csv("data/stations/metar_station_measurements_with_MSG.csv",parse_dates=["time"]).fillna(0)
As we only want to classify into two classes (cloudy yes/no), we have to adapt the data set a little bit:
# Binarize cloud cover: False = cloud free; True = Cloud contaminated or cloud covered
data.cloudcover = data.cloudcover>1
data
  | icao | time | cloudcover | cloud_altitude | x | y | IR_016 | IR_039 | IR_087 | IR_097 | IR_108 | IR_120 | IR_134 | VIS006 | VIS008 | WV_062 | WV_073 | cmask | dem
--- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | ---
0 | EBAW | 2005-01-15 00:00:00 | True | 2400 | 2.920539e+05 | 4.611705e+06 | 0.0 | 263.69100 | 266.75824 | 241.82200 | 270.32462 | 270.10632 | 254.10730 | 0.0 | 0.0 | 229.80460 | 253.90987 | 3.0 | 5.0 |
1 | EBBR | 2005-01-15 00:00:00 | True | 3600 | 2.956507e+05 | 4.596070e+06 | 0.0 | 263.69100 | 267.39102 | 242.04547 | 270.97700 | 271.09344 | 255.38649 | 0.0 | 0.0 | 230.68666 | 254.62540 | 3.0 | 37.0 |
2 | EBCI | 2005-01-15 00:00:00 | True | 2200 | 2.967015e+05 | 4.571825e+06 | 0.0 | 262.45435 | 266.44186 | 241.59853 | 269.99844 | 270.43536 | 255.38649 | 0.0 | 0.0 | 231.74515 | 255.34093 | 3.0 | 150.0 |
3 | EBLG | 2005-01-15 00:00:00 | False | -999 | 3.608589e+05 | 4.580852e+06 | 0.0 | 266.16430 | 268.02380 | 242.04547 | 269.67227 | 269.77728 | 254.61897 | 0.0 | 0.0 | 230.33385 | 253.43285 | 1.0 | 146.0 |
4 | EBOS | 2005-01-15 00:00:00 | True | 2600 | 1.875154e+05 | 4.613160e+06 | 0.0 | 263.69100 | 268.02380 | 242.04547 | 271.30316 | 271.42250 | 254.87482 | 0.0 | 0.0 | 230.86308 | 253.43285 | 3.0 | 0.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
52007 | LZKZ | 2006-12-15 21:00:00 | True | 3300 | 1.430620e+06 | 4.434616e+06 | 0.0 | 263.76193 | 264.90475 | 242.95023 | 268.16614 | 268.26227 | 254.06820 | 0.0 | 0.0 | 231.50674 | 251.63990 | 3.0 | 242.0 |
52008 | LZPP | 2006-12-15 21:00:00 | True | 3500 | 1.212834e+06 | 4.443047e+06 | 0.0 | 262.61996 | 263.68170 | 241.99338 | 266.59192 | 267.01120 | 252.88803 | 0.0 | 0.0 | 233.18146 | 252.30473 | 3.0 | 170.0 |
52009 | LZSL | 2006-12-15 21:00:00 | True | 3300 | 1.296992e+06 | 4.439913e+06 | 0.0 | 262.61996 | 264.29324 | 242.71101 | 267.53647 | 267.94950 | 253.12407 | 0.0 | 0.0 | 232.67905 | 250.08860 | 3.0 | 331.0 |
52010 | LZTT | 2006-12-15 21:00:00 | False | -999 | 1.354894e+06 | 4.461921e+06 | 0.0 | 262.61996 | 264.29324 | 242.23259 | 267.22162 | 267.32397 | 252.65201 | 0.0 | 0.0 | 230.83685 | 249.42377 | 1.0 | 779.0 |
52011 | LZZI | 2006-12-15 21:00:00 | True | 2400 | 1.246986e+06 | 4.476102e+06 | 0.0 | 262.61996 | 264.29324 | 242.23259 | 267.53647 | 267.94950 | 253.12407 | 0.0 | 0.0 | 232.00916 | 250.75343 | 3.0 | 348.0 |
52012 rows × 19 columns
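Since we will judge the models by accuracy, it is worth checking the class balance first: a trivial classifier that always predicts the majority class already reaches an accuracy equal to the majority-class fraction. A minimal sketch with a toy stand-in for the `cloudcover` column (the values here are illustrative, not from the real data set):

```python
import pandas as pd

# Toy stand-in for data.cloudcover after binarization
cloudcover = pd.Series([True, True, True, False, True, False, True, True])

# Fraction of each class
balance = cloudcover.value_counts(normalize=True)
print(balance)

# An "always predict the majority class" baseline would score this accuracy:
baseline_accuracy = balance.max()
print(baseline_accuracy)  # -> 0.75 for this toy series
```

Any model worth keeping should clearly beat this baseline on the real data.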
… and then choose a suitable model for the classification task. Here we use a simple KNeighborsClassifier. As discussed in the previous section, we validate the performance of the chosen model architecture with a grouped K-fold cross-validation approach, grouping by station so that measurements from the same station never appear in both the training and the test fold:
from sklearn.model_selection import cross_validate
from sklearn.model_selection import GroupKFold
from sklearn.neighbors import KNeighborsClassifier
predictors = ["IR_016","IR_039","IR_087","IR_097","IR_108","IR_120","IR_134","VIS006","VIS008","WV_062","WV_073","dem"]
target = "cloudcover"
result = cross_validate(KNeighborsClassifier(n_neighbors=5,n_jobs=-1), data[predictors], y=data[target], groups=data.icao, scoring="accuracy", cv=GroupKFold(), n_jobs=-1)
# Print average accuracy score
result["test_score"].mean()
0.8754267232222827
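To make the grouping behaviour explicit: `GroupKFold` guarantees that all samples sharing a group label (here the station `icao`) land on the same side of every split. A small self-contained illustration (the station names and data are made up):

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Six samples from three fictitious stations, two samples each
X = np.arange(12).reshape(6, 2)
y = np.array([0, 1, 0, 1, 0, 1])
groups = np.array(["EBAW", "EBAW", "EBBR", "EBBR", "EBCI", "EBCI"])

for train_idx, test_idx in GroupKFold(n_splits=3).split(X, y, groups):
    # No station ever appears on both sides of a split
    assert set(groups[train_idx]).isdisjoint(groups[test_idx])
    print(groups[test_idx])
```

This is what prevents the model from "memorizing" station-specific characteristics and inflating the validation score.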
OK, we reach roughly 87.5% accuracy with this simple model. Let’s try the same validation technique to check the performance of a RandomForestClassifier:
from sklearn.ensemble import RandomForestClassifier
result = cross_validate(RandomForestClassifier(n_estimators=100,n_jobs=-1), data[predictors], y=data[target], groups=data.icao, scoring="accuracy", cv=GroupKFold(), n_jobs=-1)
# Print average accuracy score
result["test_score"].mean()
0.9328457768685647
Nice, with ~93% accuracy that looks even better. So let’s go ahead and train the classifiers on our data. In order to additionally guarantee temporal independence, we split the data set into the years 2005 (for training) and 2006 (for testing):
data_train = data[data.time.dt.year==2005]
data_test = data[data.time.dt.year==2006]
predictors = ["IR_016","IR_039","IR_087","IR_097","IR_108","IR_120","IR_134","VIS006","VIS008","WV_062","WV_073","dem"]
target = "cloudcover"
rf_model = RandomForestClassifier(n_estimators=100,n_jobs=-1)
rf_model.fit(data_train[predictors], data_train[target])
print(rf_model.score(data_test[predictors], data_test[target]))
kn_model = KNeighborsClassifier(n_neighbors=5,n_jobs=-1)
kn_model.fit(data_train[predictors], data_train[target])
print(kn_model.score(data_test[predictors], data_test[target]))
0.910585895670797
0.8617485084901331
We see that both models trained on data from 2005 reach good scores (RF: 91% and KN: 86%) for the test data from 2006.
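Accuracy alone can hide asymmetric errors (e.g. systematically missing thin clouds), so per-class precision and recall are worth a look as well. A sketch using `sklearn.metrics.classification_report`, with hypothetical labels standing in for `data_test[target]` and `rf_model.predict(data_test[predictors])`:

```python
from sklearn.metrics import classification_report

# Hypothetical ground truth and predictions (placeholders, not real results)
y_true = [True, True, False, True, False, False]
y_pred = [True, True, False, False, False, True]

# Per-class precision, recall and F1 score
print(classification_report(y_true, y_pred))
```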
We can now save our models for later use:
import joblib
joblib.dump(rf_model,"rf_classifier.model")
joblib.dump(kn_model,"kn_classifier.model")
['kn_classifier.model']
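The saved models can later be restored with `joblib.load`. A self-contained round-trip sketch, using a tiny toy model and a temporary path instead of the files written above:

```python
import os
import tempfile

import joblib
from sklearn.neighbors import KNeighborsClassifier

# Fit a minimal toy model (stands in for the classifiers trained above)
X = [[0.0], [1.0], [2.0], [3.0]]
y = [0, 0, 1, 1]
model = KNeighborsClassifier(n_neighbors=1).fit(X, y)

# Dump and reload, as done for rf_classifier.model / kn_classifier.model
path = os.path.join(tempfile.mkdtemp(), "toy_classifier.model")
joblib.dump(model, path)
restored = joblib.load(path)

# The restored model predicts exactly like the original
print(restored.predict([[2.6]]))  # -> [1]
```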
Task¶
Design and train an ML model suitable to build a cloud mask (cloudy yes/no) based on the provided MSG data. Don’t forget to enhance your model via hyperparameter tuning and feature selection.
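As a starting point for the tuning part, hyperparameter search should use the same grouped cross-validation as above so that station leakage does not bias the selection. A hedged sketch with `GridSearchCV` (the parameter grid and the synthetic stand-in data are placeholders; substitute `data[predictors]`, `data[target]` and `data.icao`):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, GroupKFold

# Synthetic stand-in data: 60 samples from 6 fictitious stations
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 3))
y = rng.integers(0, 2, size=60)
groups = np.repeat(np.arange(6), 10)

# Grid search over an illustrative (not recommended) parameter grid,
# cross-validated with station-wise groups
search = GridSearchCV(
    RandomForestClassifier(n_jobs=-1),
    param_grid={"n_estimators": [50, 100], "max_depth": [None, 10]},
    scoring="accuracy",
    cv=GroupKFold(n_splits=3),
)
search.fit(X, y, groups=groups)
print(search.best_params_)
```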