Preprocessing

Merge station tables

To be able to extract MSG pixel values at station locations, we have to add the station location information to our measurement table. For this we can use the merge function:

dataframe_1.merge(dataframe_2,on="name of column on which to merge")

The result should then look like this:

station_data
icao time cloudcover cloud_altitude x y
0 EBAW 2005-01-15 00:00:00 3 2400 2.920539e+05 4.611705e+06
1 EBBR 2005-01-15 00:00:00 3 3600 2.956507e+05 4.596070e+06
2 EBCI 2005-01-15 00:00:00 3 2200 2.967015e+05 4.571825e+06
3 EBLG 2005-01-15 00:00:00 1 -999 3.608589e+05 4.580852e+06
4 EBOS 2005-01-15 00:00:00 3 2600 1.875154e+05 4.613160e+06
... ... ... ... ... ... ...
52007 LZKZ 2006-12-15 21:00:00 3 3300 1.430620e+06 4.434616e+06
52008 LZPP 2006-12-15 21:00:00 3 3500 1.212834e+06 4.443047e+06
52009 LZSL 2006-12-15 21:00:00 3 3300 1.296992e+06 4.439913e+06
52010 LZTT 2006-12-15 21:00:00 1 -999 1.354894e+06 4.461921e+06
52011 LZZI 2006-12-15 21:00:00 3 2400 1.246986e+06 4.476102e+06

52012 rows × 6 columns

Task

  • Merge the meta data table to the station measurement table

Solution

station_data = station_data.merge(station_meta_data,on="icao").sort_values(by=["time","icao"]).reset_index(drop=True)

Extract time slots

We can get a list of all time slots via:

time_slots = station_data.time.unique()

We can select all measurements of a specific time slot via:

one_time_slot = station_data[station_data.time == "2005-01-15 09:00:00"]
one_time_slot
icao time cloudcover cloud_altitude x y
3 EBAW 2005-01-15 09:00:00 1 -999 2.920539e+05 4.611705e+06
194 EBBR 2005-01-15 09:00:00 1 -999 2.956507e+05 4.596070e+06
385 EBCI 2005-01-15 09:00:00 1 -999 2.967015e+05 4.571825e+06
576 EBLG 2005-01-15 09:00:00 1 -999 3.608589e+05 4.580852e+06
767 EBOS 2005-01-15 09:00:00 1 -999 1.875154e+05 4.613160e+06
... ... ... ... ... ... ...
51072 LZKZ 2005-01-15 09:00:00 3 3100 1.430620e+06 4.434616e+06
51263 LZPP 2005-01-15 09:00:00 3 3300 1.212834e+06 4.443047e+06
51454 LZSL 2005-01-15 09:00:00 1 -999 1.296992e+06 4.439913e+06
51641 LZTT 2005-01-15 09:00:00 3 2400 1.354894e+06 4.461921e+06
51826 LZZI 2005-01-15 09:00:00 3 2600 1.246986e+06 4.476102e+06

273 rows × 6 columns

Task

  • Visualize the cloud altitude for each station on June 15 2006 12:00.

Solution

one_time_slot = station_data[station_data.time == "2006-06-15 12:00"]
one_time_slot.plot.scatter(x="x",y="y",c="cloud_altitude",cmap="jet",figsize=(14,8))
<AxesSubplot:xlabel='x', ylabel='y'>
../../../_images/03_preprocessing_18_1.png

Merge satellite and station data

We can open the MSG scene of a specific time slot by converting the time slot into a path string (and then use the xarray function from above):

pd.to_datetime("2005-01-15 09:00:00").strftime("path/to/satellite/data/%Y/%Y%m%d_%H%M.nc")
'path/to/satellite/data/2005/20050115_0900.nc'

Task

  • Load the satellite scene of June 15 2006 12:00.

Solution

satellite_data = xr.open_dataset(pd.to_datetime("2006-06-15 12:00").strftime("path/to/satellite/data/%Y/%Y%m%d_%H%M.nc"))

We can then extract all station pixel values from the MSG scene via:

station_px_values = satellite_data.interp(x=xr.DataArray(one_time_slot.x), y=xr.DataArray(one_time_slot.y), method="nearest").to_dataframe()
station_px_values
IR_016 IR_039 IR_087 IR_097 IR_108 IR_120 IR_134 VIS006 VIS008 WV_062 WV_073 cmask x y
dim_0 time
3 2005-01-15 09:00:10.176 0.312616 270.067841 267.258606 241.118332 268.481750 267.831024 251.366409 0.259972 0.332700 232.461472 251.651505 1.0 2.920539e+05 4.611705e+06
194 2005-01-15 09:00:10.176 0.266303 270.802032 269.441223 242.453674 271.101013 270.815613 254.969254 0.244680 0.307108 234.292130 255.062897 1.0 2.956507e+05 4.596070e+06
385 2005-01-15 09:00:10.176 0.231568 271.169128 270.376617 242.676224 272.083221 271.810486 255.998642 0.214095 0.281516 236.855042 256.768616 1.0 2.967015e+05 4.571825e+06
576 2005-01-15 09:00:10.176 0.208411 269.700745 269.129425 241.563446 271.101013 270.815613 255.226608 0.229387 0.358293 233.925995 255.550247 1.0 3.608589e+05 4.580852e+06
767 2005-01-15 09:00:10.176 0.347352 273.004639 270.376617 243.343903 272.738007 272.142090 254.969254 0.321142 0.383885 233.742920 253.357208 1.0 1.875154e+05 4.613160e+06
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
51072 2005-01-15 09:00:10.176 0.521027 279.612457 259.463654 235.331879 262.261078 261.530212 247.506210 0.428190 0.511847 236.305847 246.534393 3.0 1.430620e+06 4.434616e+06
51263 2005-01-15 09:00:10.176 0.382087 276.308533 269.441223 241.118332 272.410614 271.147217 252.653137 0.275265 0.358293 232.827591 249.214783 3.0 1.212834e+06 4.443047e+06
51454 2005-01-15 09:00:10.176 0.301038 273.371735 271.623810 240.895782 273.720245 273.468567 254.197220 0.183510 0.281516 235.390518 249.945801 1.0 1.296992e+06 4.439913e+06
51641 2005-01-15 09:00:10.176 0.347352 271.169128 263.828827 236.889771 266.517334 265.841309 249.564987 0.443482 0.537439 235.207458 247.752762 3.0 1.354894e+06 4.461921e+06
51826 2005-01-15 09:00:10.176 0.532606 271.903320 256.033875 234.441650 258.987030 258.877258 246.476837 0.550529 0.639808 233.376801 247.021744 3.0 1.246986e+06 4.476102e+06

273 rows × 14 columns

which we can then merge with our station measurements again:

The same can be done with the DEM data.

Progress feedback

As we want to apply the methods explained above to all of our satellite scenes, the overall processing time for the complete data set will be much larger, than for a single scene. Thus, it can make sense to provide a visual feedback about the process when looping over multiple files, eg. by showing a progress bar. This can easily be implemented with the tqdm package:

from tqdm import tqdm
import time

for i in tqdm(range(50)):
    # do sth...
    time.sleep(0.1)
100%|██████████| 50/50 [00:05<00:00,  9.92it/s]

Task

  • Prepare a data set that can be used to train an ML model.

    • Create an empty pandas-DataFrame

    • Iterate over all available time slots and…

      • extract the MSG pixel values at all stations

      • merge the pixel values with the station measurements

      • append all information to the pandas-DataFrame

    • add the terrain elevation from the DEM

    • Save the pandas-DataFrame in one CSV file.

The result should look sth. like this:

merged_data
icao time cloudcover cloud_altitude x y IR_016 IR_039 IR_087 IR_097 IR_108 IR_120 IR_134 VIS006 VIS008 WV_062 WV_073 cmask dem
0 EBAW 2005-01-15 00:00:00 3 2400 2.920539e+05 4.611705e+06 NaN 263.69100 266.75824 241.82200 270.32462 270.10632 254.10730 NaN NaN 229.80460 253.90987 3.0 5.0
1 EBBR 2005-01-15 00:00:00 3 3600 2.956507e+05 4.596070e+06 NaN 263.69100 267.39102 242.04547 270.97700 271.09344 255.38649 NaN NaN 230.68666 254.62540 3.0 37.0
2 EBCI 2005-01-15 00:00:00 3 2200 2.967015e+05 4.571825e+06 NaN 262.45435 266.44186 241.59853 269.99844 270.43536 255.38649 NaN NaN 231.74515 255.34093 3.0 150.0
3 EBLG 2005-01-15 00:00:00 1 -999 3.608589e+05 4.580852e+06 NaN 266.16430 268.02380 242.04547 269.67227 269.77728 254.61897 NaN NaN 230.33385 253.43285 1.0 146.0
4 EBOS 2005-01-15 00:00:00 3 2600 1.875154e+05 4.613160e+06 NaN 263.69100 268.02380 242.04547 271.30316 271.42250 254.87482 NaN NaN 230.86308 253.43285 3.0 0.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
52007 LZKZ 2006-12-15 21:00:00 3 3300 1.430620e+06 4.434616e+06 NaN 263.76193 264.90475 242.95023 268.16614 268.26227 254.06820 NaN NaN 231.50674 251.63990 3.0 242.0
52008 LZPP 2006-12-15 21:00:00 3 3500 1.212834e+06 4.443047e+06 NaN 262.61996 263.68170 241.99338 266.59192 267.01120 252.88803 NaN NaN 233.18146 252.30473 3.0 170.0
52009 LZSL 2006-12-15 21:00:00 3 3300 1.296992e+06 4.439913e+06 NaN 262.61996 264.29324 242.71101 267.53647 267.94950 253.12407 NaN NaN 232.67905 250.08860 3.0 331.0
52010 LZTT 2006-12-15 21:00:00 1 -999 1.354894e+06 4.461921e+06 NaN 262.61996 264.29324 242.23259 267.22162 267.32397 252.65201 NaN NaN 230.83685 249.42377 1.0 779.0
52011 LZZI 2006-12-15 21:00:00 3 2400 1.246986e+06 4.476102e+06 NaN 262.61996 264.29324 242.23259 267.53647 267.94950 253.12407 NaN NaN 232.00916 250.75343 3.0 348.0

52012 rows × 19 columns