Random forest baseline model

Before we start with the UNet implementation, we want to train and test a simple pixel-based baseline model for comparison.

Load and prepare data

import xarray as xr
import numpy as np

We load the training data:

# Load data
training_data = xr.open_dataset("data/oberschelden.nc")
training_data
<xarray.Dataset>
Dimensions:         (x: 13224, y: 11108)
Coordinates:
  * y               (y) float64 5.636e+06 5.636e+06 ... 5.635e+06 5.635e+06
  * x               (x) float64 4.253e+05 4.253e+05 ... 4.266e+05 4.266e+05
Data variables:
    buildings_mask  (y, x) uint8 ...
    red             (y, x) uint8 ...
    green           (y, x) uint8 ...
    blue            (y, x) uint8 ...

Now we convert the data into tabular format:

df_train = training_data.to_dask_dataframe()[["buildings_mask","red","green","blue"]]
df_train
Dask DataFrame Structure:
               buildings_mask    red  green   blue
npartitions=9
0                       uint8  uint8  uint8  uint8
2705292                   ...    ...    ...    ...
...                       ...    ...    ...    ...
17279748                  ...    ...    ...    ...
17803745                  ...    ...    ...    ...
Dask Name: getitem, 544 tasks
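
As a quick sanity check (a small sketch using the sizes printed above), we can confirm how many pixels, i.e. rows, the full table contains:

# Total number of pixels = number of rows in the tabular view
training_data.sizes["x"] * training_data.sizes["y"]  # 146892192, i.e. ~147 million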

Then we take a sample from the data (since ~150 million pixels/samples are far too many for a random forest):

df_train_sample = df_train.sample(frac=0.0001,random_state=42).compute()
df_train_sample
           buildings_mask  red  green  blue
2315693                 0  123    116   117
123035352               0   91     89    90
112477877               0  126    131   109
71267707                0   56     59    39
124322821               0  104    100   101
...                   ...  ...    ...   ...
52705504                0  155    155   144
126099645               0  140    118    99
106530192               0  145    122   104
28372898                0  137    146   114
112970162               0   55     60    63

14689 rows × 4 columns
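
Before training, it is also worth glancing at the class balance of the sample (a quick sketch); as we will see below, the two classes are far from balanced:

# Fraction of no-building (0) vs. building (1) pixels in the sample
df_train_sample["buildings_mask"].value_counts(normalize=True)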

Train RF

We can now estimate the performance of an RF model on the data via cross-validation and then train it on the full sample:

from sklearn.model_selection import cross_validate
from sklearn.model_selection import KFold
from sklearn.ensemble import RandomForestClassifier

predictors = ["red","green","blue"]
target     = "buildings_mask"

result = cross_validate(
    RandomForestClassifier(n_estimators=100, n_jobs=-1),
    df_train_sample[predictors],
    y=df_train_sample[target],
    scoring="accuracy",
    cv=KFold(),
    n_jobs=-1,
    verbose=2,
)

# Print average accuracy score
result["test_score"].mean()
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 16 concurrent workers.
[Parallel(n_jobs=-1)]: Done   5 out of   5 | elapsed:    1.3s finished
0.9555452104820702
rf_model = RandomForestClassifier(n_estimators=100,n_jobs=-1)
rf_model.fit(df_train_sample[predictors], df_train_sample[target])
RandomForestClassifier(n_jobs=-1)
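
As an optional aside, scikit-learn exposes how much each band contributed to the splits of the fitted forest; a small sketch:

# Relative importance of the three input bands
dict(zip(predictors, rf_model.feature_importances_))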

Apply RF

Now we can apply the trained model on the test data:

# Load data
test_data = xr.open_dataset("data/volnsberg.nc",chunks={"x":2000,"y":2000})
test_data
<xarray.Dataset>
Dimensions:         (x: 4383, y: 4062)
Coordinates:
  * y               (y) float64 5.637e+06 5.637e+06 ... 5.637e+06 5.637e+06
  * x               (x) float64 4.351e+05 4.351e+05 ... 4.355e+05 4.355e+05
Data variables:
    buildings_mask  (y, x) uint8 dask.array<chunksize=(2000, 2000), meta=np.ndarray>
    red             (y, x) uint8 dask.array<chunksize=(2000, 2000), meta=np.ndarray>
    green           (y, x) uint8 dask.array<chunksize=(2000, 2000), meta=np.ndarray>
    blue            (y, x) uint8 dask.array<chunksize=(2000, 2000), meta=np.ndarray>

Note that here we “chunked” the data into multiple tiles/chunks so that we do not run out of memory during model prediction.
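
To verify the chunk layout, we can inspect the underlying dask array of one of the variables (a quick sketch):

# Chunk sizes along y and x
test_data.red.data.chunks  # ((2000, 2000, 62), (2000, 2000, 383))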

We can now define a helper function that converts a given xr.Dataset into a pandas.DataFrame, uses the trained model to predict the target variable for each pixel, and then converts the predictions back into an xr.DataArray:

# Helper function for calling a model on an xr.Dataset and returning its predictions
def predict_func(ds,model,predictors):
    
    # Convert the xr.Dataset (the tile) into a pandas.DataFrame
    df = ds.to_dataframe()
    
    # Use the model to predict values for each pixel
    result = model.predict(df[predictors])
    
    # Reshape the results back into a 2d-array
    # Note that the "order" argument should be removed when working with Windows!
    result = result.reshape(ds.y.size,ds.x.size,order="F")
    
    # Return the resulting 2d-prediction-results as xr.DataArray
    return xr.DataArray(result,dims=["y","x"],coords=[ds.y,ds.x])
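
Before mapping this helper over all chunks, we can sanity-check it on a small cutout (a quick sketch using the objects defined above):

# Run the helper on a small 100x100 subset as a sanity check
subset = test_data.isel(y=slice(0, 100), x=slice(0, 100)).compute()
predict_func(subset, rf_model, predictors)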

We also need to specify a template for the coming map_blocks call, i.e. a placeholder that describes the structure (dimensions, coordinates and chunks) of our result (a building / no-building raster):

template = test_data[list(test_data.data_vars)[1]].rename("building")
template
<xarray.DataArray 'building' (y: 4062, x: 4383)>
dask.array<open_dataset-e65f59fae7f46d8b063a2cb30821e8e2red, shape=(4062, 4383), dtype=uint8, chunksize=(2000, 2000), chunktype=numpy.ndarray>
Coordinates:
  * y        (y) float64 5.637e+06 5.637e+06 5.637e+06 ... 5.637e+06 5.637e+06
  * x        (x) float64 4.351e+05 4.351e+05 4.351e+05 ... 4.355e+05 4.355e+05
Attributes:
    transform:      [ 1.0000000e-01  0.0000000e+00  4.3506550e+05  0.0000000e...
    crs:            +init=epsg:25832
    res:            [0.1 0.1]
    is_tiled:       0
    nodatavals:     [nan nan nan nan]
    scales:         [1. 1. 1. 1.]
    offsets:        [0. 0. 0. 0.]
    AREA_OR_POINT:  Area

Now we can use the map_blocks function to apply our helper function, and thus the model, to each chunk separately. This prevents our memory from overflowing:

%%time

prediction = test_data.map_blocks(
    predict_func,
    kwargs={"model": rf_model, "predictors": predictors},
    template=template,
).compute()
CPU times: user 3min 28s, sys: 8.63 s, total: 3min 36s
Wall time: 15.5 s
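
If you want to keep this baseline prediction around for the later comparison with the UNet, one option is to write it to disk (a sketch; the file name is our own choice):

# Persist the baseline prediction (hypothetical file name)
prediction.to_netcdf("data/volnsberg_rf_prediction.nc")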

Let’s take a look at the result:

import matplotlib.pyplot as plt
plt.figure(figsize=(18,18))
plt.imshow(np.array([test_data.red,test_data.green,test_data.blue]).transpose(1,2,0))
plt.imshow(np.where(prediction==1,1,np.nan),cmap="autumn_r",interpolation="none",alpha=0.5)
plt.show()
[Figure: RGB image of the test area with the predicted building pixels overlaid in color]

Well, that doesn’t look too good. Let’s check the statistics. We will use the metric implementations from TensorFlow here, because we want to compare our results to the UNet (which is built on TensorFlow) later.

import tensorflow as tf
# Keras implementation of accuracy
m = tf.keras.metrics.Accuracy()
m.update_state(prediction, test_data.buildings_mask)
m.result().numpy()
0.9054271

OK, the accuracy is still at ~90%. This is due to the strong class imbalance in the data (most pixels are no-building pixels), so it is easy to achieve high accuracy values even if buildings are not detected well.
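
To see how much of that accuracy the majority class alone explains, we can compare against a trivial classifier that predicts “no building” everywhere (a small sketch):

# Accuracy of an all-zero ("no building") prediction
(test_data.buildings_mask.values == 0).mean()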

Another metric, the MeanIoU, is more meaningful in this context. The Mean Intersection over Union is a common evaluation metric for semantic image segmentation: it first computes the IoU for each class and then averages over the classes. The IoU is defined as follows:

\(IoU = \frac{TP}{TP + FP + FN}\)

with TP, FP and FN being true positives, false positives and false negatives, respectively:

# MeanIoU (Image segmentation metrics)
m = tf.keras.metrics.MeanIoU(num_classes=2)
m.update_state(prediction, test_data.buildings_mask)
m.result().numpy()
0.5172027

By averaging over all classes, each class gets the same weight in the metric, and thus this score is much worse (because the model performs really badly on the “building” class).
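
To make this weighting explicit, here is a small numpy sketch that recomputes the per-class IoU values by hand (using prediction and test_data from above); their unweighted mean should match the MeanIoU reported above:

y_true = test_data.buildings_mask.values.ravel()
y_pred = prediction.values.ravel()

ious = []
for c in (0, 1):
    tp = np.sum((y_true == c) & (y_pred == c))
    fp = np.sum((y_true != c) & (y_pred == c))
    fn = np.sum((y_true == c) & (y_pred != c))
    ious.append(tp / (tp + fp + fn))

# Per-class IoU and the unweighted mean over both classes
ious, np.mean(ious)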