Preprocessing
To include our 2D satellite and label data in a machine learning model, we have to reshape it into the tabular format described in the introduction, with one row per pixel and one column per band. We can do this using a pandas DataFrame:
import pandas as pd
# Get list of band names in satellite DataSet
bands = list(satellite_data.data_vars.keys())
# Instantiate empty pandas DataFrame
df = pd.DataFrame()
# Fill DataFrame with "flattened" satellite bands (--> each band is flattened to one column)
for band in bands:
    df[band] = satellite_data[band].values.flatten()
# Add target (labels) to DataFrame
df["label"] = labels.values.flatten()
# Define predictor and target columns (every column except the trailing label is a predictor)
predictors = df.columns[:-1]
target = "label"
# Clip DataFrame to samples where all predictor bands have valid pixel values
df_valid = df.dropna(subset=predictors)
df_valid
| | B1 | B2 | B3 | B4 | B5 | B6 | B7 | B8 | B8A | B9 | B11 | B12 | label |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
4602 | 0.0162 | 0.0140 | 0.0266 | 0.0131 | 0.0554 | 0.2530 | 0.3518 | 0.2961 | 0.3651 | 0.3461 | 0.1576 | 0.0642 | NaN |
4603 | 0.0162 | 0.0150 | 0.0271 | 0.0120 | 0.0536 | 0.2559 | 0.3404 | 0.3339 | 0.3622 | 0.3461 | 0.1583 | 0.0642 | NaN |
4604 | 0.0159 | 0.0170 | 0.0308 | 0.0161 | 0.0536 | 0.2559 | 0.3404 | 0.3683 | 0.3622 | 0.3461 | 0.1583 | 0.0642 | NaN |
4605 | 0.0159 | 0.0150 | 0.0290 | 0.0148 | 0.0545 | 0.2578 | 0.3567 | 0.3419 | 0.3877 | 0.3806 | 0.1630 | 0.0682 | NaN |
4606 | 0.0159 | 0.0171 | 0.0325 | 0.0157 | 0.0545 | 0.2578 | 0.3567 | 0.3851 | 0.3877 | 0.3806 | 0.1630 | 0.0682 | NaN |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
4801696 | 0.0174 | 0.0282 | 0.0573 | 0.0528 | 0.0958 | 0.2656 | 0.3371 | 0.3806 | 0.3648 | 0.3799 | 0.1860 | 0.0928 | 0.0 |
4801697 | 0.0174 | 0.0251 | 0.0440 | 0.0298 | 0.0958 | 0.2656 | 0.3371 | 0.3127 | 0.3648 | 0.3799 | 0.1860 | 0.0928 | 0.0 |
4801698 | 0.0174 | 0.0158 | 0.0332 | 0.0199 | 0.0540 | 0.2503 | 0.3128 | 0.2832 | 0.3246 | 0.3741 | 0.1479 | 0.0621 | 0.0 |
4801699 | 0.0174 | 0.0142 | 0.0309 | 0.0193 | 0.0540 | 0.2503 | 0.3128 | 0.3518 | 0.3246 | 0.3741 | 0.1479 | 0.0621 | 0.0 |
4801700 | 0.0174 | 0.0152 | 0.0320 | 0.0163 | 0.0558 | 0.2645 | 0.3341 | 0.3437 | 0.3646 | 0.3741 | 0.1552 | 0.0639 | 0.0 |
1232367 rows × 13 columns
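Because flatten() walks each band in row-major order, the DataFrame index is the flattened pixel position, and dropna() keeps that index for the remaining rows. The following is a minimal sketch of how this index can be mapped back to the 2D grid, using a made-up 2 x 3 grid (toy_bands, n_rows, n_cols and all values are illustrative and not taken from the tutorial data):

```python
import numpy as np
import pandas as pd

# Toy stand-in for two satellite bands on a 2 x 3 pixel grid (values are made up)
n_rows, n_cols = 2, 3
toy_bands = {
    "B1": np.array([[0.1, 0.2, np.nan], [0.4, 0.5, 0.6]]),
    "B2": np.array([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]),
}

# One column per band, one row per pixel (row-major order, as with .flatten())
toy_df = pd.DataFrame({name: band.flatten() for name, band in toy_bands.items()})

# Dropping invalid pixels keeps the original flat index of the remaining rows
toy_valid = toy_df.dropna()
flat_index = toy_valid.index.to_numpy()

# Recover the (row, column) grid position of every remaining pixel
rows, cols = np.unravel_index(flat_index, (n_rows, n_cols))

# Write one column back onto the 2D grid; dropped pixels stay NaN
grid = np.full(n_rows * n_cols, np.nan)
grid[flat_index] = toy_valid["B2"].to_numpy()
grid = grid.reshape(n_rows, n_cols)
```

The same reshape could later be used to turn per-pixel model predictions back into a map, provided the original grid shape is kept.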
We can now split the data into training and test subsets using scikit-learn's built-in train_test_split function. Pixels without a label are excluded first:
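from sklearn.model_selection import train_test_split
# Keep only labelled pixels, then hold out 20 % of them as the test set
df_labelled = df_valid[df_valid["label"].notna()]
df_train, df_test = train_test_split(df_labelled, test_size=0.2, random_state=42)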
The resulting training DataFrame looks like this:
df_train
| | B1 | B2 | B3 | B4 | B5 | B6 | B7 | B8 | B8A | B9 | B11 | B12 | label |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
4198582 | 0.0257 | 0.0244 | 0.0682 | 0.0297 | 0.1154 | 0.3508 | 0.4161 | 0.4432 | 0.4480 | 0.4233 | 0.2128 | 0.0945 | 0.0 |
3703954 | 0.0216 | 0.0171 | 0.0307 | 0.0194 | 0.0443 | 0.1614 | 0.2082 | 0.2344 | 0.2271 | 0.2580 | 0.0841 | 0.0362 | 0.0 |
3904497 | 0.0160 | 0.0200 | 0.0378 | 0.0224 | 0.0554 | 0.2272 | 0.2948 | 0.3249 | 0.3105 | 0.3281 | 0.1225 | 0.0538 | 0.0 |
3886858 | 0.0353 | 0.0440 | 0.0692 | 0.0630 | 0.1106 | 0.2097 | 0.2577 | 0.3034 | 0.2752 | 0.2770 | 0.2019 | 0.1201 | 0.0 |
3682430 | 0.0555 | 0.0592 | 0.0830 | 0.1132 | 0.1204 | 0.1604 | 0.1989 | 0.1886 | 0.1989 | 0.1940 | 0.1892 | 0.1382 | 1.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
3643284 | 0.0143 | 0.0155 | 0.0265 | 0.0156 | 0.0481 | 0.1740 | 0.2322 | 0.2305 | 0.2504 | 0.2607 | 0.1014 | 0.0459 | 0.0 |
4243491 | 0.0167 | 0.0206 | 0.0387 | 0.0201 | 0.0577 | 0.2151 | 0.2924 | 0.3838 | 0.3060 | 0.3312 | 0.1368 | 0.0628 | 0.0 |
3688515 | 0.0371 | 0.0376 | 0.0652 | 0.0543 | 0.1009 | 0.2760 | 0.3483 | 0.3342 | 0.3740 | 0.3239 | 0.1967 | 0.1169 | 0.0 |
3745470 | 0.0234 | 0.0370 | 0.0731 | 0.0555 | 0.1276 | 0.3327 | 0.3870 | 0.4163 | 0.4354 | 0.4068 | 0.2179 | 0.1096 | 0.0 |
3649765 | 0.0119 | 0.0150 | 0.0291 | 0.0165 | 0.0548 | 0.2178 | 0.2786 | 0.3032 | 0.3023 | 0.3531 | 0.1192 | 0.0513 | 0.0 |
222184 rows × 13 columns
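scikit-learn estimators generally expect the predictors as a 2D array of shape (n_samples, n_features) and the target as a 1D array. A minimal sketch of that separation, using a small made-up DataFrame in place of df_train (the helper split_xy and the toy values are hypothetical and not part of the tutorial):

```python
import pandas as pd


def split_xy(df, predictors, target):
    """Return the predictor matrix X and the target vector y as NumPy arrays."""
    X = df[list(predictors)].to_numpy()
    y = df[target].to_numpy()
    return X, y


# Made-up stand-in for df_train with two bands and a binary label
toy_train = pd.DataFrame(
    {"B1": [0.02, 0.03, 0.05], "B2": [0.01, 0.04, 0.06], "label": [0.0, 0.0, 1.0]}
)

X_train, y_train = split_xy(toy_train, predictors=["B1", "B2"], target="label")
```

With the real data, the same call would take df_train together with the predictors and target names defined earlier.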
Task
Prepare a data set for the ML task that includes all provided data sources (both Sentinel-2 scenes, the DEM, and the labels).
# Load all data sets
import xarray as xr
sentinel_before = xr.open_dataset("data/ahr_hochwasser_before.nc")
sentinel_after = xr.open_dataset("data/ahr_hochwasser_after.nc")
flooded_train = xr.open_dataset("data/ahr_hochwasser_flooded_areas.tif", engine="rasterio").band_data[0]
flooded_test = xr.open_dataset("data/ahr_hochwasser_flooded_areas_test.tif", engine="rasterio").band_data[0]
dem = xr.open_dataset("data/dem_utm32n_clipped_cropped.tif", engine="rasterio").band_data[0]
# Combine all data sets in one pandas DataFrame
import pandas as pd
bands = list(sentinel_before.data_vars.keys())
df = pd.DataFrame()
for band in bands:
    # Prefix "b_" marks the "before" scene, "a_" the "after" scene
    df["b_" + band] = sentinel_before[band].values.flatten()
    df["a_" + band] = sentinel_after[band].values.flatten()
df["dem"] = dem.values.flatten()
df["label_train"] = flooded_train.values.flatten()
df["label_test"] = flooded_test.values.flatten()
# Define predictor and target columns
predictors = df.keys()[:-2]
target = "label_train"
# Clip DataFrame to samples where all predictor bands have valid pixel values
df_valid = df.dropna(subset=predictors)
df_valid
| | b_B1 | a_B1 | b_B2 | a_B2 | b_B3 | a_B3 | b_B4 | a_B4 | b_B5 | a_B5 | ... | a_B8A | b_B9 | a_B9 | b_B11 | a_B11 | b_B12 | a_B12 | dem | label_train | label_test |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
4602 | 0.0172 | 0.0162 | 0.0217 | 0.0140 | 0.0335 | 0.0266 | 0.0166 | 0.0131 | 0.0683 | 0.0554 | ... | 0.3651 | 0.3935 | 0.3461 | 0.1745 | 0.1576 | 0.0744 | 0.0642 | 280.214813 | NaN | NaN |
4603 | 0.0172 | 0.0162 | 0.0211 | 0.0150 | 0.0359 | 0.0271 | 0.0173 | 0.0120 | 0.0614 | 0.0536 | ... | 0.3622 | 0.3935 | 0.3461 | 0.1763 | 0.1583 | 0.0736 | 0.0642 | 279.540619 | NaN | NaN |
4604 | 0.0166 | 0.0159 | 0.0215 | 0.0170 | 0.0373 | 0.0308 | 0.0168 | 0.0161 | 0.0614 | 0.0536 | ... | 0.3622 | 0.3935 | 0.3461 | 0.1763 | 0.1583 | 0.0736 | 0.0642 | 279.540619 | NaN | NaN |
4605 | 0.0166 | 0.0159 | 0.0222 | 0.0150 | 0.0334 | 0.0290 | 0.0175 | 0.0148 | 0.0650 | 0.0545 | ... | 0.3877 | 0.4302 | 0.3806 | 0.1810 | 0.1630 | 0.0752 | 0.0682 | 279.311157 | NaN | NaN |
4606 | 0.0166 | 0.0159 | 0.0198 | 0.0171 | 0.0407 | 0.0325 | 0.0164 | 0.0157 | 0.0650 | 0.0545 | ... | 0.3877 | 0.4302 | 0.3806 | 0.1810 | 0.1630 | 0.0752 | 0.0682 | 279.311157 | NaN | NaN |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
4801696 | 0.0221 | 0.0174 | 0.0305 | 0.0282 | 0.0597 | 0.0573 | 0.0303 | 0.0528 | 0.0932 | 0.0958 | ... | 0.3648 | 0.4347 | 0.3799 | 0.1859 | 0.1860 | 0.0863 | 0.0928 | 467.113068 | 0.0 | NaN |
4801697 | 0.0221 | 0.0174 | 0.0289 | 0.0251 | 0.0521 | 0.0440 | 0.0263 | 0.0298 | 0.0932 | 0.0958 | ... | 0.3648 | 0.4347 | 0.3799 | 0.1859 | 0.1860 | 0.0863 | 0.0928 | 464.409637 | 0.0 | NaN |
4801698 | 0.0219 | 0.0174 | 0.0225 | 0.0158 | 0.0447 | 0.0332 | 0.0187 | 0.0199 | 0.0750 | 0.0540 | ... | 0.3246 | 0.4347 | 0.3741 | 0.1743 | 0.1479 | 0.0759 | 0.0621 | 464.409637 | 0.0 | NaN |
4801699 | 0.0219 | 0.0174 | 0.0226 | 0.0142 | 0.0436 | 0.0309 | 0.0184 | 0.0193 | 0.0750 | 0.0540 | ... | 0.3246 | 0.4347 | 0.3741 | 0.1743 | 0.1479 | 0.0759 | 0.0621 | 461.661682 | 0.0 | NaN |
4801700 | 0.0219 | 0.0174 | 0.0218 | 0.0152 | 0.0430 | 0.0320 | 0.0198 | 0.0163 | 0.0787 | 0.0558 | ... | 0.3646 | 0.4347 | 0.3741 | 0.1854 | 0.1552 | 0.0807 | 0.0639 | 461.661682 | 0.0 | NaN |
1232367 rows × 27 columns
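One possible way to use the two label columns is to select the training and test samples by the rows where the respective label is present, instead of splitting randomly. The sketch below assumes that NaN marks unlabelled pixels, as the table above suggests, and runs on a made-up DataFrame (toy_df and its values are illustrative only):

```python
import numpy as np
import pandas as pd

# Made-up stand-in for df_valid: two band columns, the DEM and both label columns
toy_df = pd.DataFrame(
    {
        "b_B1": [0.02, 0.03, 0.02, 0.04],
        "a_B1": [0.02, 0.05, 0.02, 0.06],
        "dem": [280.2, 279.5, 464.4, 461.7],
        "label_train": [np.nan, 0.0, 1.0, np.nan],
        "label_test": [np.nan, np.nan, np.nan, 1.0],
    }
)

# Everything except the two trailing label columns is a predictor
predictors = toy_df.columns[:-2]

# Assumption: a pixel belongs to a subset if its label in that column is not NaN
toy_train = toy_df[toy_df["label_train"].notna()]
toy_test = toy_df[toy_df["label_test"].notna()]

X_train, y_train = toy_train[predictors].to_numpy(), toy_train["label_train"].to_numpy()
X_test, y_test = toy_test[predictors].to_numpy(), toy_test["label_test"].to_numpy()
```

Rows where both label columns are NaN are simply left out of both subsets.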