Preprocessing

To use our 2D satellite and label data in a machine learning model, we have to “reshape” it into the tabular format described in the introduction, with one row per pixel and one column per band. We can do this using a pandas DataFrame:

import pandas as pd

# Get list of band names in satellite DataSet
bands = list(satellite_data.data_vars.keys())

# Instantiate empty pandas DataFrame
df = pd.DataFrame()

# Fill DataFrame with "flattened" satellite bands (--> each band is flattened to one column)
for band in bands:
    df[band] = satellite_data[band].values.flatten()

# Add target (labels) to DataFrame
df["label"] = labels.values.flatten()

# Define predictor and target columns
# (all columns except the last one, "label", are predictors)
predictors = df.columns[:-1]
target = "label"

# Clip DataFrame to samples where all predictor bands have valid pixel values
df_valid = df.dropna(subset=predictors)
df_valid
B1 B2 B3 B4 B5 B6 B7 B8 B8A B9 B11 B12 label
4602 0.0162 0.0140 0.0266 0.0131 0.0554 0.2530 0.3518 0.2961 0.3651 0.3461 0.1576 0.0642 NaN
4603 0.0162 0.0150 0.0271 0.0120 0.0536 0.2559 0.3404 0.3339 0.3622 0.3461 0.1583 0.0642 NaN
4604 0.0159 0.0170 0.0308 0.0161 0.0536 0.2559 0.3404 0.3683 0.3622 0.3461 0.1583 0.0642 NaN
4605 0.0159 0.0150 0.0290 0.0148 0.0545 0.2578 0.3567 0.3419 0.3877 0.3806 0.1630 0.0682 NaN
4606 0.0159 0.0171 0.0325 0.0157 0.0545 0.2578 0.3567 0.3851 0.3877 0.3806 0.1630 0.0682 NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ...
4801696 0.0174 0.0282 0.0573 0.0528 0.0958 0.2656 0.3371 0.3806 0.3648 0.3799 0.1860 0.0928 0.0
4801697 0.0174 0.0251 0.0440 0.0298 0.0958 0.2656 0.3371 0.3127 0.3648 0.3799 0.1860 0.0928 0.0
4801698 0.0174 0.0158 0.0332 0.0199 0.0540 0.2503 0.3128 0.2832 0.3246 0.3741 0.1479 0.0621 0.0
4801699 0.0174 0.0142 0.0309 0.0193 0.0540 0.2503 0.3128 0.3518 0.3246 0.3741 0.1479 0.0621 0.0
4801700 0.0174 0.0152 0.0320 0.0163 0.0558 0.2645 0.3341 0.3437 0.3646 0.3741 0.1552 0.0639 0.0

1232367 rows × 13 columns
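
Because `dropna` keeps the original row index, the index values in the table above are the flat pixel positions of the original raster. The following minimal sketch illustrates this on a hypothetical 2 × 3 toy scene (the names `toy_bands`, `toy_df`, `toy_valid` and the pixel values are made up for illustration) and shows one possible way to map per-pixel results back onto the 2D grid:

```python
import numpy as np
import pandas as pd

# Hypothetical toy "scene": two bands on a 2 x 3 pixel grid (NaN marks an invalid pixel)
height, width = 2, 3
toy_bands = {
    "B1": np.array([[0.1, 0.2, np.nan], [0.4, 0.5, 0.6]]),
    "B2": np.array([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]),
}

# Flatten each band into one column; row i corresponds to pixel (i // width, i % width)
toy_df = pd.DataFrame({name: band.flatten() for name, band in toy_bands.items()})

# Drop pixels with missing predictor values; the original flat index is preserved
toy_valid = toy_df.dropna(subset=["B1", "B2"])

# Map a per-pixel result back onto the 2D grid using the preserved index
result_flat = np.full(height * width, np.nan)
result_flat[toy_valid.index.to_numpy()] = (toy_valid["B1"] + toy_valid["B2"]).to_numpy()
result_2d = result_flat.reshape(height, width)
```

The same idea should carry over to model predictions later on: write them into a NaN-filled flat array at the positions given by the index, then reshape to the raster's height and width.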

We can now split the data into train and test subsets using scikit-learn's built-in splitting function:

from sklearn.model_selection import train_test_split

# Keep only labelled pixels, then hold out 20 % of them for testing
df_labelled = df_valid[df_valid["label"].notna()]
df_train, df_test = train_test_split(df_labelled, test_size=0.2, random_state=42)

The resulting training DataFrame looks like this:

df_train
B1 B2 B3 B4 B5 B6 B7 B8 B8A B9 B11 B12 label
4198582 0.0257 0.0244 0.0682 0.0297 0.1154 0.3508 0.4161 0.4432 0.4480 0.4233 0.2128 0.0945 0.0
3703954 0.0216 0.0171 0.0307 0.0194 0.0443 0.1614 0.2082 0.2344 0.2271 0.2580 0.0841 0.0362 0.0
3904497 0.0160 0.0200 0.0378 0.0224 0.0554 0.2272 0.2948 0.3249 0.3105 0.3281 0.1225 0.0538 0.0
3886858 0.0353 0.0440 0.0692 0.0630 0.1106 0.2097 0.2577 0.3034 0.2752 0.2770 0.2019 0.1201 0.0
3682430 0.0555 0.0592 0.0830 0.1132 0.1204 0.1604 0.1989 0.1886 0.1989 0.1940 0.1892 0.1382 1.0
... ... ... ... ... ... ... ... ... ... ... ... ... ...
3643284 0.0143 0.0155 0.0265 0.0156 0.0481 0.1740 0.2322 0.2305 0.2504 0.2607 0.1014 0.0459 0.0
4243491 0.0167 0.0206 0.0387 0.0201 0.0577 0.2151 0.2924 0.3838 0.3060 0.3312 0.1368 0.0628 0.0
3688515 0.0371 0.0376 0.0652 0.0543 0.1009 0.2760 0.3483 0.3342 0.3740 0.3239 0.1967 0.1169 0.0
3745470 0.0234 0.0370 0.0731 0.0555 0.1276 0.3327 0.3870 0.4163 0.4354 0.4068 0.2179 0.1096 0.0
3649765 0.0119 0.0150 0.0291 0.0165 0.0548 0.2178 0.2786 0.3032 0.3023 0.3531 0.1192 0.0513 0.0

222184 rows × 13 columns
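
Most models expect a feature matrix and a target vector rather than a single table. The sketch below shows one way to separate them, using a hypothetical toy table in place of `df_valid` (the names `toy`, `toy_train`, `toy_test` and the random values are assumptions for illustration). It also passes the optional `stratify` argument of `train_test_split`, which may be worth considering if the classes are imbalanced, as the mostly zero labels in the preview above suggest:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy table standing in for df_valid: two predictor bands and a label column
rng = np.random.default_rng(42)
toy = pd.DataFrame({
    "B1": rng.random(100),
    "B2": rng.random(100),
    "label": np.array([0.0] * 80 + [1.0] * 20),
})
predictors = ["B1", "B2"]
target = "label"

# stratify keeps the class proportions (here 80 % / 20 %) the same in both subsets
toy_train, toy_test = train_test_split(
    toy, test_size=0.2, random_state=42, stratify=toy[target]
)

# Split each subset into a feature matrix X and a target vector y
X_train, y_train = toy_train[predictors], toy_train[target]
X_test, y_test = toy_test[predictors], toy_test[target]
```

Whether stratification is needed depends on the actual class balance of the labels; without it, the split behaves as in the call shown earlier.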

Task

Prepare a data set for the ML task that includes all provided data sources (both Sentinel-2 scenes, the DEM, and the labels).

# Load all data sets
import xarray as xr
sentinel_before = xr.open_dataset("data/ahr_hochwasser_before.nc")
sentinel_after = xr.open_dataset("data/ahr_hochwasser_after.nc")
flooded_train = xr.open_dataset("data/ahr_hochwasser_flooded_areas.tif", engine="rasterio").band_data[0]
flooded_test = xr.open_dataset("data/ahr_hochwasser_flooded_areas_test.tif", engine="rasterio").band_data[0]
dem = xr.open_dataset("data/dem_utm32n_clipped_cropped.tif", engine="rasterio").band_data[0]

# Combine all data sets in one pandas DataFrame
import pandas as pd
bands = list(sentinel_before.data_vars.keys())
df = pd.DataFrame()
# Prefix "b_" marks the scene before the flood, "a_" the scene after it
for band in bands:
    df["b_" + band] = sentinel_before[band].values.flatten()
    df["a_" + band] = sentinel_after[band].values.flatten()
# Add elevation and both label rasters as additional columns
df["dem"] = dem.values.flatten()
df["label_train"] = flooded_train.values.flatten()
df["label_test"] = flooded_test.values.flatten()

# Define predictor and target columns
predictors = df.keys()[:-2]
target = "label_train"

# Clip DataFrame to samples where all predictor bands have valid pixel values
df_valid = df.dropna(subset=predictors)
df_valid
b_B1 a_B1 b_B2 a_B2 b_B3 a_B3 b_B4 a_B4 b_B5 a_B5 ... a_B8A b_B9 a_B9 b_B11 a_B11 b_B12 a_B12 dem label_train label_test
4602 0.0172 0.0162 0.0217 0.0140 0.0335 0.0266 0.0166 0.0131 0.0683 0.0554 ... 0.3651 0.3935 0.3461 0.1745 0.1576 0.0744 0.0642 280.214813 NaN NaN
4603 0.0172 0.0162 0.0211 0.0150 0.0359 0.0271 0.0173 0.0120 0.0614 0.0536 ... 0.3622 0.3935 0.3461 0.1763 0.1583 0.0736 0.0642 279.540619 NaN NaN
4604 0.0166 0.0159 0.0215 0.0170 0.0373 0.0308 0.0168 0.0161 0.0614 0.0536 ... 0.3622 0.3935 0.3461 0.1763 0.1583 0.0736 0.0642 279.540619 NaN NaN
4605 0.0166 0.0159 0.0222 0.0150 0.0334 0.0290 0.0175 0.0148 0.0650 0.0545 ... 0.3877 0.4302 0.3806 0.1810 0.1630 0.0752 0.0682 279.311157 NaN NaN
4606 0.0166 0.0159 0.0198 0.0171 0.0407 0.0325 0.0164 0.0157 0.0650 0.0545 ... 0.3877 0.4302 0.3806 0.1810 0.1630 0.0752 0.0682 279.311157 NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
4801696 0.0221 0.0174 0.0305 0.0282 0.0597 0.0573 0.0303 0.0528 0.0932 0.0958 ... 0.3648 0.4347 0.3799 0.1859 0.1860 0.0863 0.0928 467.113068 0.0 NaN
4801697 0.0221 0.0174 0.0289 0.0251 0.0521 0.0440 0.0263 0.0298 0.0932 0.0958 ... 0.3648 0.4347 0.3799 0.1859 0.1860 0.0863 0.0928 464.409637 0.0 NaN
4801698 0.0219 0.0174 0.0225 0.0158 0.0447 0.0332 0.0187 0.0199 0.0750 0.0540 ... 0.3246 0.4347 0.3741 0.1743 0.1479 0.0759 0.0621 464.409637 0.0 NaN
4801699 0.0219 0.0174 0.0226 0.0142 0.0436 0.0309 0.0184 0.0193 0.0750 0.0540 ... 0.3246 0.4347 0.3741 0.1743 0.1479 0.0759 0.0621 461.661682 0.0 NaN
4801700 0.0219 0.0174 0.0218 0.0152 0.0430 0.0320 0.0198 0.0163 0.0787 0.0558 ... 0.3646 0.4347 0.3741 0.1854 0.1552 0.0807 0.0639 461.661682 0.0 NaN

1232367 rows × 27 columns
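
With two label columns, the train and test subsets can be selected directly from the rows that carry the respective label instead of being split at random. The following is a minimal sketch on a hypothetical toy table with the same column layout (the values and the names `toy`, `toy_train`, `toy_test` are made up); it assumes that `label_train` and `label_test` are meant as training and test targets and cover different pixels, which the file names suggest but the text does not state:

```python
import numpy as np
import pandas as pd

# Hypothetical toy table mimicking the layout of df_valid
toy = pd.DataFrame({
    "b_B1": [0.10, 0.20, 0.30, 0.40, 0.50],
    "a_B1": [0.10, 0.30, 0.20, 0.60, 0.50],
    "dem": [280.0, 279.5, 300.0, 310.0, 460.0],
    "label_train": [np.nan, 0.0, 1.0, np.nan, np.nan],
    "label_test": [np.nan, np.nan, np.nan, 1.0, 0.0],
})
# As above, every column except the two label columns is a predictor
predictors = toy.columns[:-2]

# Rows with a training label form the training set, rows with a test label the test set
toy_train = toy[toy["label_train"].notna()]
toy_test = toy[toy["label_test"].notna()]

X_train, y_train = toy_train[predictors], toy_train["label_train"]
X_test, y_test = toy_test[predictors], toy_test["label_test"]
```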