Random forest baseline model

Before we start with the UNet implementation, we want to train and test a simple pixel-based baseline model for comparison.

Load and prepare data

import xarray as xr
import numpy as np

We load the training data:

# Load data
training_data = xr.open_dataset("data/oberschelden.nc")
training_data
<xarray.Dataset>
Dimensions:         (x: 13224, y: 11108)
Coordinates:
  * y               (y) float64 5.636e+06 5.636e+06 ... 5.635e+06 5.635e+06
  * x               (x) float64 4.253e+05 4.253e+05 ... 4.266e+05 4.266e+05
Data variables:
    buildings_mask  (y, x) uint8 ...
    red             (y, x) uint8 ...
    green           (y, x) uint8 ...
    blue            (y, x) uint8 ...

Now we convert the data into tabular format:

df_train = training_data.to_dask_dataframe()[["buildings_mask","red","green","blue"]]
df_train
Dask DataFrame Structure:
               buildings_mask    red  green   blue
npartitions=9
0                       uint8  uint8  uint8  uint8
2705292                   ...    ...    ...    ...
...                       ...    ...    ...    ...
17279748                  ...    ...    ...    ...
17803745                  ...    ...    ...    ...
Dask Name: getitem, 544 tasks
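
As a quick sanity check (a small sketch using the sizes printed above), we can confirm how many pixels, i.e. rows, the full table contains:

# Total number of pixels = number of rows in the tabular view
training_data.sizes["x"] * training_data.sizes["y"]  # 146892192, i.e. ~147 million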

Then we take a sample from the data (since ~150 million pixels/samples are far too many for a random forest):

df_train_sample = df_train.sample(frac=0.0001,random_state=42).compute()
df_train_sample
           buildings_mask  red  green  blue
2315693                 0  123    116   117
123035352               0   91     89    90
112477877               0  126    131   109
71267707                0   56     59    39
124322821               0  104    100   101
...                   ...  ...    ...   ...
52705504                0  155    155   144
126099645               0  140    118    99
106530192               0  145    122   104
28372898                0  137    146   114
112970162               0   55     60    63

14689 rows × 4 columns
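
Before training, it is also worth glancing at the class balance of the sample (a quick sketch); as we will see below, the two classes are far from balanced:

# Fraction of no-building (0) vs. building (1) pixels in the sample
df_train_sample["buildings_mask"].value_counts(normalize=True)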

Train RF

We can now estimate the performance of an RF model on the data via cross-validation and then train it on the full sample:

from sklearn.model_selection import cross_validate
from sklearn.model_selection import KFold
from sklearn.ensemble import RandomForestClassifier

predictors = ["red","green","blue"]
target     = "buildings_mask"

result = cross_validate(
    RandomForestClassifier(n_estimators=100, n_jobs=-1),
    df_train_sample[predictors],
    y=df_train_sample[target],
    scoring="accuracy",
    cv=KFold(),
    n_jobs=-1,
    verbose=2,
)

# Print average accuracy score
result["test_score"].mean()
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 16 concurrent workers.
[Parallel(n_jobs=-1)]: Done   5 out of   5 | elapsed:    1.3s finished
0.9555452104820702
rf_model = RandomForestClassifier(n_estimators=100,n_jobs=-1)
rf_model.fit(df_train_sample[predictors], df_train_sample[target])
RandomForestClassifier(n_jobs=-1)
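
As an optional aside, scikit-learn exposes how much each band contributed to the splits of the fitted forest; a small sketch:

# Relative importance of the three input bands
dict(zip(predictors, rf_model.feature_importances_))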

Apply RF

Now we can apply the trained model on the test data:

# Load data
test_data = xr.open_dataset("data/volnsberg.nc",chunks={"x":2000,"y":2000})
test_data
<xarray.Dataset>
Dimensions:         (x: 4383, y: 4062)
Coordinates:
  * y               (y) float64 5.637e+06 5.637e+06 ... 5.637e+06 5.637e+06
  * x               (x) float64 4.351e+05 4.351e+05 ... 4.355e+05 4.355e+05
Data variables:
    buildings_mask  (y, x) uint8 dask.array<chunksize=(2000, 2000), meta=np.ndarray>
    red             (y, x) uint8 dask.array<chunksize=(2000, 2000), meta=np.ndarray>
    green           (y, x) uint8 dask.array<chunksize=(2000, 2000), meta=np.ndarray>
    blue            (y, x) uint8 dask.array<chunksize=(2000, 2000), meta=np.ndarray>

Note that here we “chunked” the data into multiple tiles/chunks so that we do not run out of memory during model prediction.
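
To verify the chunk layout, we can inspect the underlying dask array of one of the variables (a quick sketch):

# Chunk sizes along y and x
test_data.red.data.chunks  # ((2000, 2000, 62), (2000, 2000, 383))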

We can now define a helper function that converts a given xr.Dataset into a pandas.DataFrame, uses the trained model to predict the target variable for each pixel, and then converts the predictions back into an xr.DataArray:

# Helper function for calling a model on an xr.Dataset and returning its predictions
def predict_func(ds,model,predictors):
    
    # Convert the xr.Dataset (the tile) into a pandas.DataFrame
    df = ds.to_dataframe()
    
    # Use the model to predict values for each pixel
    result = model.predict(df[predictors])
    
    # Reshape the results back into a 2d-array
    # Note that the "order" argument should be removed when working with Windows!
    result = result.reshape(ds.y.size,ds.x.size,order="F")
    
    # Return the resulting 2d-prediction-results as xr.DataArray
    return xr.DataArray(result,dims=["y","x"],coords=[ds.y,ds.x])
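
Before mapping this helper over all chunks, we can sanity-check it on a small cutout (a quick sketch using the objects defined above):

# Run the helper on a small 100x100 subset as a sanity check
subset = test_data.isel(y=slice(0, 100), x=slice(0, 100)).compute()
predict_func(subset, rf_model, predictors)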

We also need to specify a template for the coming map_blocks call, i.e. a placeholder that describes the structure (dimensions, coordinates and chunks) of our result (a building / no-building raster):

template = test_data[list(test_data.data_vars)[1]].rename("building")
template
<xarray.DataArray 'building' (y: 4062, x: 4383)>
dask.array<open_dataset-e65f59fae7f46d8b063a2cb30821e8e2red, shape=(4062, 4383), dtype=uint8, chunksize=(2000, 2000), chunktype=numpy.ndarray>
Coordinates:
  * y        (y) float64 5.637e+06 5.637e+06 5.637e+06 ... 5.637e+06 5.637e+06
  * x        (x) float64 4.351e+05 4.351e+05 4.351e+05 ... 4.355e+05 4.355e+05
Attributes:
    transform:      [ 1.0000000e-01  0.0000000e+00  4.3506550e+05  0.0000000e...
    crs:            +init=epsg:25832
    res:            [0.1 0.1]
    is_tiled:       0
    nodatavals:     [nan nan nan nan]
    scales:         [1. 1. 1. 1.]
    offsets:        [0. 0. 0. 0.]
    AREA_OR_POINT:  Area

Now we can use the map_blocks function to apply our helper function, and thus the model, to each chunk separately. This prevents our memory from overflowing:

%%time

prediction = test_data.map_blocks(
    predict_func,
    kwargs={"model": rf_model, "predictors": predictors},
    template=template,
).compute()
CPU times: user 3min 28s, sys: 8.63 s, total: 3min 36s
Wall time: 15.5 s
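
If you want to keep this baseline prediction around for the later comparison with the UNet, one option is to write it to disk (a sketch; the file name is our own choice):

# Persist the baseline prediction (hypothetical file name)
prediction.to_netcdf("data/volnsberg_rf_prediction.nc")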

Let’s take a look at the result:

import matplotlib.pyplot as plt
plt.figure(figsize=(18,18))
plt.imshow(np.array([test_data.red,test_data.green,test_data.blue]).transpose(1,2,0))
plt.imshow(np.where(prediction==1,1,np.nan),cmap="autumn_r",interpolation="none",alpha=0.5)
plt.show()
[Figure: RGB image of the test area with the predicted building pixels overlaid in color]

Well, that doesn’t look too good. Let’s check the statistics. We will use the metric implementations from TensorFlow here, because we want to compare our results to the UNet (which is built on TensorFlow) later.

import tensorflow as tf
# Keras implementation of accuracy
m = tf.keras.metrics.Accuracy()
m.update_state(prediction, test_data.buildings_mask)
m.result().numpy()
0.9054271

OK, the accuracy is still at ~90%. This is due to the strong class imbalance in the data (most pixels are no-building pixels), so it is easy to achieve high accuracy values even if buildings are not detected well.
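
To see how much of that accuracy the majority class alone explains, we can compare against a trivial classifier that predicts “no building” everywhere (a small sketch):

# Accuracy of an all-zero ("no building") prediction
(test_data.buildings_mask.values == 0).mean()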

Another metric, the MeanIoU, is more meaningful in this context. The Mean Intersection over Union is a common evaluation metric for semantic image segmentation: it first computes the IoU for each class and then averages over the classes. The IoU is defined as follows:

\(IoU = \frac{TP}{TP + FP + FN}\)

with TP, FP and FN being true positives, false positives and false negatives, respectively:

# MeanIoU (Image segmentation metrics)
m = tf.keras.metrics.MeanIoU(num_classes=2)
m.update_state(prediction, test_data.buildings_mask)
m.result().numpy()
0.5172027

By averaging over all classes, each class gets the same weight in the metric, and thus this score is much worse (because the model performs really badly on the “building” class).
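
To make this weighting explicit, here is a small numpy sketch that recomputes the per-class IoU values by hand (using prediction and test_data from above); their unweighted mean should match the MeanIoU reported above:

y_true = test_data.buildings_mask.values.ravel()
y_pred = prediction.values.ravel()

ious = []
for c in (0, 1):
    tp = np.sum((y_true == c) & (y_pred == c))
    fp = np.sum((y_true != c) & (y_pred == c))
    fn = np.sum((y_true == c) & (y_pred != c))
    ious.append(tp / (tp + fp + fn))

# Per-class IoU and the unweighted mean over both classes
ious, np.mean(ious)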