Random forest baseline model¶
Before we start with the UNet implementation, we want to train and test a simple pixel-based baseline model for comparison.
Load and prepare data¶
import xarray as xr
import numpy as np
We load the training data:
# Load data
training_data = xr.open_dataset("data/oberschelden.nc")
training_data
<xarray.Dataset> Dimensions: (x: 13224, y: 11108) Coordinates: * y (y) float64 5.636e+06 5.636e+06 ... 5.635e+06 5.635e+06 * x (x) float64 4.253e+05 4.253e+05 ... 4.266e+05 4.266e+05 Data variables: buildings_mask (y, x) uint8 ... red (y, x) uint8 ... green (y, x) uint8 ... blue (y, x) uint8 ...
- x: 13224
- y: 11108
- y(y)float645.636e+06 5.636e+06 ... 5.635e+06
array([5636252.35, 5636252.25, 5636252.15, ..., 5635141.85, 5635141.75, 5635141.65])
- x(x)float644.253e+05 4.253e+05 ... 4.266e+05
array([425313.85, 425313.95, 425314.05, ..., 426635.95, 426636.05, 426636.15])
- buildings_mask(y, x)uint8...
- transform :
- [ 1.0000000e-01 0.0000000e+00 4.2531380e+05 0.0000000e+00 -1.0000000e-01 5.6362524e+06]
- crs :
- +init=epsg:25832
- res :
- [0.1 0.1]
- is_tiled :
- 0
- nodatavals :
- [0. 0. 0. 0.]
- scales :
- [1. 1. 1. 1.]
- offsets :
- [0. 0. 0. 0.]
- AREA_OR_POINT :
- Area
[146892192 values with dtype=uint8]
- red(y, x)uint8...
- transform :
- [ 1.0000000e-01 0.0000000e+00 4.2531380e+05 0.0000000e+00 -1.0000000e-01 5.6362524e+06]
- crs :
- +init=epsg:25832
- res :
- [0.1 0.1]
- is_tiled :
- 0
- nodatavals :
- [nan nan nan nan]
- scales :
- [1. 1. 1. 1.]
- offsets :
- [0. 0. 0. 0.]
- AREA_OR_POINT :
- Area
[146892192 values with dtype=uint8]
- green(y, x)uint8...
- transform :
- [ 1.0000000e-01 0.0000000e+00 4.2531380e+05 0.0000000e+00 -1.0000000e-01 5.6362524e+06]
- crs :
- +init=epsg:25832
- res :
- [0.1 0.1]
- is_tiled :
- 0
- nodatavals :
- [nan nan nan nan]
- scales :
- [1. 1. 1. 1.]
- offsets :
- [0. 0. 0. 0.]
- AREA_OR_POINT :
- Area
[146892192 values with dtype=uint8]
- blue(y, x)uint8...
- transform :
- [ 1.0000000e-01 0.0000000e+00 4.2531380e+05 0.0000000e+00 -1.0000000e-01 5.6362524e+06]
- crs :
- +init=epsg:25832
- res :
- [0.1 0.1]
- is_tiled :
- 0
- nodatavals :
- [nan nan nan nan]
- scales :
- [1. 1. 1. 1.]
- offsets :
- [0. 0. 0. 0.]
- AREA_OR_POINT :
- Area
[146892192 values with dtype=uint8]
Now we convert the data into tabular format:
df_train = training_data.to_dask_dataframe()[["buildings_mask","red","green","blue"]]
df_train
buildings_mask | red | green | blue | |
---|---|---|---|---|
npartitions=9 | ||||
0 | uint8 | uint8 | uint8 | uint8 |
2705292 | ... | ... | ... | ... |
... | ... | ... | ... | ... |
17279748 | ... | ... | ... | ... |
17803745 | ... | ... | ... | ... |
Then we take a sample from the data (as ~150 million pixels/samples is way too much for a random forest):
df_train_sample = df_train.sample(frac=0.0001,random_state=42).compute()
df_train_sample
buildings_mask | red | green | blue | |
---|---|---|---|---|
2315693 | 0 | 123 | 116 | 117 |
123035352 | 0 | 91 | 89 | 90 |
112477877 | 0 | 126 | 131 | 109 |
71267707 | 0 | 56 | 59 | 39 |
124322821 | 0 | 104 | 100 | 101 |
... | ... | ... | ... | ... |
52705504 | 0 | 155 | 155 | 144 |
126099645 | 0 | 140 | 118 | 99 |
106530192 | 0 | 145 | 122 | 104 |
28372898 | 0 | 137 | 146 | 114 |
112970162 | 0 | 55 | 60 | 63 |
14689 rows × 4 columns
Train RF¶
We can now check the performance of an RF model on the data and then train it:
from sklearn.model_selection import cross_validate
from sklearn.model_selection import KFold
from sklearn.ensemble import RandomForestClassifier
predictors = ["red","green","blue"]
target = "buildings_mask"
result = cross_validate(RandomForestClassifier(n_estimators=100,n_jobs=-1), df_train_sample[predictors], y=df_train_sample[target], scoring="accuracy", cv=KFold(), n_jobs=-1, verbose=2)
# Print average accuracy score
result["test_score"].mean()
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 16 concurrent workers.
[Parallel(n_jobs=-1)]: Done 5 out of 5 | elapsed: 1.3s finished
0.9555452104820702
rf_model = RandomForestClassifier(n_estimators=100,n_jobs=-1)
rf_model.fit(df_train_sample[predictors], df_train_sample[target])
RandomForestClassifier(n_jobs=-1)
Apply RF¶
Now we can apply the trained model on the test data:
# Load data
test_data = xr.open_dataset("data/volnsberg.nc",chunks={"x":2000,"y":2000})
test_data
<xarray.Dataset> Dimensions: (x: 4383, y: 4062) Coordinates: * y (y) float64 5.637e+06 5.637e+06 ... 5.637e+06 5.637e+06 * x (x) float64 4.351e+05 4.351e+05 ... 4.355e+05 4.355e+05 Data variables: buildings_mask (y, x) uint8 dask.array<chunksize=(2000, 2000), meta=np.ndarray> red (y, x) uint8 dask.array<chunksize=(2000, 2000), meta=np.ndarray> green (y, x) uint8 dask.array<chunksize=(2000, 2000), meta=np.ndarray> blue (y, x) uint8 dask.array<chunksize=(2000, 2000), meta=np.ndarray>
- x: 4383
- y: 4062
- y(y)float645.637e+06 5.637e+06 ... 5.637e+06
array([5637426.15, 5637426.05, 5637425.95, ..., 5637020.25, 5637020.15, 5637020.05])
- x(x)float644.351e+05 4.351e+05 ... 4.355e+05
array([435065.55, 435065.65, 435065.75, ..., 435503.55, 435503.65, 435503.75])
- buildings_mask(y, x)uint8dask.array<chunksize=(2000, 2000), meta=np.ndarray>
- transform :
- [ 1.0000000e-01 0.0000000e+00 4.3506550e+05 0.0000000e+00 -1.0000000e-01 5.6374262e+06]
- crs :
- +init=epsg:25832
- res :
- [0.1 0.1]
- is_tiled :
- 0
- nodatavals :
- [0. 0. 0. 0.]
- scales :
- [1. 1. 1. 1.]
- offsets :
- [0. 0. 0. 0.]
- AREA_OR_POINT :
- Area
Array Chunk Bytes 16.98 MiB 3.81 MiB Shape (4062, 4383) (2000, 2000) Count 10 Tasks 9 Chunks Type uint8 numpy.ndarray - red(y, x)uint8dask.array<chunksize=(2000, 2000), meta=np.ndarray>
- transform :
- [ 1.0000000e-01 0.0000000e+00 4.3506550e+05 0.0000000e+00 -1.0000000e-01 5.6374262e+06]
- crs :
- +init=epsg:25832
- res :
- [0.1 0.1]
- is_tiled :
- 0
- nodatavals :
- [nan nan nan nan]
- scales :
- [1. 1. 1. 1.]
- offsets :
- [0. 0. 0. 0.]
- AREA_OR_POINT :
- Area
Array Chunk Bytes 16.98 MiB 3.81 MiB Shape (4062, 4383) (2000, 2000) Count 10 Tasks 9 Chunks Type uint8 numpy.ndarray - green(y, x)uint8dask.array<chunksize=(2000, 2000), meta=np.ndarray>
- transform :
- [ 1.0000000e-01 0.0000000e+00 4.3506550e+05 0.0000000e+00 -1.0000000e-01 5.6374262e+06]
- crs :
- +init=epsg:25832
- res :
- [0.1 0.1]
- is_tiled :
- 0
- nodatavals :
- [nan nan nan nan]
- scales :
- [1. 1. 1. 1.]
- offsets :
- [0. 0. 0. 0.]
- AREA_OR_POINT :
- Area
Array Chunk Bytes 16.98 MiB 3.81 MiB Shape (4062, 4383) (2000, 2000) Count 10 Tasks 9 Chunks Type uint8 numpy.ndarray - blue(y, x)uint8dask.array<chunksize=(2000, 2000), meta=np.ndarray>
- transform :
- [ 1.0000000e-01 0.0000000e+00 4.3506550e+05 0.0000000e+00 -1.0000000e-01 5.6374262e+06]
- crs :
- +init=epsg:25832
- res :
- [0.1 0.1]
- is_tiled :
- 0
- nodatavals :
- [nan nan nan nan]
- scales :
- [1. 1. 1. 1.]
- offsets :
- [0. 0. 0. 0.]
- AREA_OR_POINT :
- Area
Array Chunk Bytes 16.98 MiB 3.81 MiB Shape (4062, 4383) (2000, 2000) Count 10 Tasks 9 Chunks Type uint8 numpy.ndarray
Note that here we “chunked” the data into multiple tiles/chunks so that our memory is not exceeded during model prediction.
We can now define a helper function that converts a given xr.Dataset into a pandas.DataFrame, uses the trained model to predict the target variable value for each pixel and then re-converts the data back into an xr.Dataset:
# Helper function for calling a model on an xr.Dataset and returning its predictions
def predict_func(ds,model,predictors):
# Convert the xr.Dataset (the tile) into a pandas.DataFrame
df = ds.to_dataframe()
# Use the model to predict values for each pixel
result = model.predict(df[predictors])
# Reshape the results back into a 2d-array
# Note that the "order" argument should be removed when working with Windows!
result = result.reshape(ds.y.size,ds.x.size,order="F")
# Return the resulting 2d-prediction-results as xr.DataArray
return xr.DataArray(result,dims=["y","x"],coords=[ds.y,ds.x])
We also need to specify a template for the coming process, which is a placeholder for our result (building / no building raster):
template = test_data[list(test_data.data_vars)[1]].rename("building")
template
<xarray.DataArray 'building' (y: 4062, x: 4383)> dask.array<open_dataset-e65f59fae7f46d8b063a2cb30821e8e2red, shape=(4062, 4383), dtype=uint8, chunksize=(2000, 2000), chunktype=numpy.ndarray> Coordinates: * y (y) float64 5.637e+06 5.637e+06 5.637e+06 ... 5.637e+06 5.637e+06 * x (x) float64 4.351e+05 4.351e+05 4.351e+05 ... 4.355e+05 4.355e+05 Attributes: transform: [ 1.0000000e-01 0.0000000e+00 4.3506550e+05 0.0000000e... crs: +init=epsg:25832 res: [0.1 0.1] is_tiled: 0 nodatavals: [nan nan nan nan] scales: [1. 1. 1. 1.] offsets: [0. 0. 0. 0.] AREA_OR_POINT: Area
- y: 4062
- x: 4383
- dask.array<chunksize=(2000, 2000), meta=np.ndarray>
Array Chunk Bytes 16.98 MiB 3.81 MiB Shape (4062, 4383) (2000, 2000) Count 10 Tasks 9 Chunks Type uint8 numpy.ndarray - y(y)float645.637e+06 5.637e+06 ... 5.637e+06
array([5637426.15, 5637426.05, 5637425.95, ..., 5637020.25, 5637020.15, 5637020.05])
- x(x)float644.351e+05 4.351e+05 ... 4.355e+05
array([435065.55, 435065.65, 435065.75, ..., 435503.55, 435503.65, 435503.75])
- transform :
- [ 1.0000000e-01 0.0000000e+00 4.3506550e+05 0.0000000e+00 -1.0000000e-01 5.6374262e+06]
- crs :
- +init=epsg:25832
- res :
- [0.1 0.1]
- is_tiled :
- 0
- nodatavals :
- [nan nan nan nan]
- scales :
- [1. 1. 1. 1.]
- offsets :
- [0. 0. 0. 0.]
- AREA_OR_POINT :
- Area
Now we can use the map_blocks
function to apply out helper function to apply the model to each chunk separately. This prevents our memory from overflowing:
%%time
prediction = test_data.map_blocks(predict_func, kwargs={"model": rf_model, "predictors": predictors}, template=template).compute()
CPU times: user 3min 28s, sys: 8.63 s, total: 3min 36s
Wall time: 15.5 s
Let’s take a look at the result:
import matplotlib.pyplot as plt
plt.figure(figsize=(18,18))
plt.imshow(np.array([test_data.red,test_data.green,test_data.blue]).transpose(1,2,0))
plt.imshow(np.where(prediction==1,1,np.nan),cmap="autumn_r",interpolation="None",alpha=0.5)
plt.show()
Well, that doesn’t look too good. Let’s check the statistics. We will use the metrics implementations of tensorflow here because we want to compare our results to the UNet (which is built on tensorflow) later.
import tensorflow as tf
# Keras implementation of accuracy
m = tf.keras.metrics.Accuracy()
m.update_state(prediction, test_data.buildings_mask)
m.result().numpy()
0.9054271
Ok, the accuracy is still at ~90%. This is due to the strong imbalance in the data (with most pixels being no-building pixels). So it’s very easy to achieve high accuracy values, even if buildings are not really well detected.
Another statistical measure, the MeanIoU
is more meaningful in this context. The Mean Intersection-Over-Union
is a common evaluation metric for semantic image segmentation, which first computes the IOU for each semantic class and then computes the average over classes. IOU is defined as follows:
\(IOU = \frac{TP}{TP + FP + FN}\)
with TP, FP and FN being true positives, false positives and false negatives respectively:
# MeanIoU (Image segmentation metrics)
m = tf.keras.metrics.MeanIoU(num_classes=2)
m.update_state(prediction, test_data.buildings_mask)
m.result().numpy()
0.5172027
By averaging over all classes, each class gets the same weight in the metric and thus the result of this score is much worse (because the “building” class performes really bad).