# Practical introduction

## Predicting forest cover types

Before we dive into the various machine learning approaches in the context of remote sensing, let’s first take a look at a classic application: predicting forest cover types based on cartographic data. The samples in this dataset correspond to 30×30 m patches of forest in Colorado, US. The study area is depicted below:

Image source: [BD99].

The data set is described in detail here: https://archive.ics.uci.edu/ml/datasets/Covertype

It was also used in a kaggle competition: https://www.kaggle.com/c/forest-cover-type-prediction/overview

The cover types are encoded as:

• 1: Spruce/Fir

• 2: Lodgepole Pine

• 3: Ponderosa Pine

• 4: Cottonwood/Willow

• 5: Aspen

• 6: Douglas-fir

• 7: Krummholz

Our goal is to build a model that can predict these cover types using all or part of the 54 predictors given in the data set.

The original paper that was published on this data set achieved an overall accuracy of 70.58%.

Let’s see if we can beat that.

# general imports...
import warnings
warnings.filterwarnings('ignore')


We can easily load the data set using the scikit-learn library:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.datasets import fetch_covtype
dataset = fetch_covtype(as_frame=True)


This gives us a dictionary with all the information we need:

dataset.keys()

dict_keys(['data', 'target', 'frame', 'target_names', 'feature_names', 'DESCR'])

df            = dataset["frame"]
target_names  = dataset["target_names"]
feature_names = dataset["feature_names"]
class_names   = ["Spruce/Fir","Lodgepole Pine","Ponderosa Pine","Cottonwood/Willow","Aspen","Douglas-fir","Krummholz"]


Let’s first take a look at the data:

df

[Output: preview of the DataFrame — 581012 rows × 55 columns: the ten numeric terrain features, the 44 one-hot Wilderness_Area/Soil_Type columns, and the Cover_Type target.]

We can see that the data is already structured the way we need it in order to feed it into the machine learning process. However, we have to split the data into different subsets so that we can assess the performance of the models we are going to build.

### Split the data set

In the context of ML, we normally split a data set into the following subsets:

• train (used for training the model)

• test (used for testing the final model)

The train set can further be split into train and validation subsets. The validation subset is used during feature selection and hyperparameter tuning to find the best possible model, while the test set is reserved for evaluating the final model. The exact splitting procedure can differ and depends on the data structure and the problem we want to solve; a sketch of such a further split is shown after the train/test split below.

We can use the built-in train_test_split function to make a random split into 80% train and 20% test data:

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(df[feature_names], df[target_names], test_size=0.2, random_state=42)
y_train = y_train.values.ravel() # convert to 1d array
y_test  = y_test.values.ravel()  # convert to 1d array

print("Samples in train data: {0:5d}\nSamples in test data:  {1:5d}".format(y_train.size,y_test.size))

Samples in train data: 464809
Samples in test data:  116203
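
If we later wanted a dedicated validation subset for feature selection or hyperparameter tuning, we could split the training data once more. The following is only a minimal sketch (the 25% fraction is an arbitrary choice) and is not used in the remainder of this notebook:

# optional: carve a validation subset out of the training data (not used below)
X_tr, X_val, y_tr, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=42)
# 25% of the 80% train split corresponds to roughly 20% of the full data set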


### Baseline model

For comparison purposes, let’s go ahead and build a simple Decision Tree classification model, fit it to the data and check its performance…

from sklearn.tree import DecisionTreeClassifier, plot_tree

dt_model = DecisionTreeClassifier(random_state=42)
dt_model.fit(X_train, y_train)

DecisionTreeClassifier(random_state=42)

dt_model.score(X_test, y_test)

0.9389516621773965


OK, that already looks pretty good. With an overall accuracy of ~93.9%, the simple Decision Tree model already achieves good results on the test data set.

We can plot the fitted model to get an idea of what the model learned and how it uses the data to predict forest cover types:

plt.figure(figsize=(24,10))
plot_tree(dt_model,feature_names=feature_names,filled=True,class_names=class_names,proportion=True,precision=0,max_depth=2)
plt.show()


Looking at the scores of the individual forest cover types, we can see that, due to the different class counts, some classes are recognized less well by the model:

class_scores_dt = [dt_model.score(X_test[y_test==covertype], y_test[y_test==covertype]) for covertype in np.unique(y_test)]
class_counts    = np.unique(y_test, return_counts=True)[1]

fig, ax = plt.subplots(1, 2, figsize=(12,3), sharey=True)
ax[0].barh(range(len(class_scores_dt)), class_scores_dt)
ax[0].set_xlim((0.8, 1.0))
ax[0].set_yticks(range(7))            # set tick positions before assigning the labels
ax[0].set_yticklabels(class_names)
ax[0].set_title("Accuracy")
ax[1].barh(range(len(class_counts)), class_counts)
ax[1].set_title("Sample count")
plt.show()


To account for imbalances in the sample counts of the classes, we will also calculate the balanced accuracy score:

from sklearn.metrics import balanced_accuracy_score
from sklearn.metrics import accuracy_score

y_pred = dt_model.predict(X_test)
print(accuracy_score(y_test, y_pred))
print(balanced_accuracy_score(y_test, y_pred))

0.9389516621773965
0.8832865638983036


Still pretty high, but we can see that the balanced accuracy score only reaches around 88% due to some classes having lower scores.
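
As a side note, the balanced accuracy is simply the unweighted mean of the per-class recall values, which we can verify with a quick sketch:

from sklearn.metrics import recall_score

# balanced accuracy equals the unweighted mean of the per-class recalls
per_class_recall = recall_score(y_test, y_pred, average=None)
print(per_class_recall.mean())  # reproduces the balanced accuracy from above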

### Random Forest model

Let’s see if we can achieve better results with a more advanced model. For this purpose, we’ll take a random forest model:

from sklearn.ensemble import RandomForestClassifier

%%time
model = RandomForestClassifier(random_state=42, n_jobs= -1, oob_score=True, class_weight='balanced')
model.fit(X_train, y_train)

CPU times: user 4min 17s, sys: 1.66 s, total: 4min 19s
Wall time: 26.1 s


When working with random forests, we don’t necessarily need an independent validation data set to assess model performance: each tree is trained on a bootstrap sample of the training data, so the samples a tree did not see (“out-of-bag”) can serve as an internal validation set. We can therefore just use the internally calculated OOB score:

model.oob_score_

0.9537250784730933


Let’s compare that to the score on the test set:

model.score(X_test, y_test)

0.9555088939184014


As we can see, both values are pretty close to each other, indicating that the OOB score really was a good estimate of the general model performance. However, we want to compare the results to those of the baseline model, so we also calculate the balanced accuracy score:

y_pred = model.predict(X_test)
print(accuracy_score(y_test, y_pred))
print(balanced_accuracy_score(y_test, y_pred))

0.9555088939184014
0.8928133354659079


We can see that, even without any further tuning, the random forest model performs better on this data set. But let’s see if we can do even better…

### Clean up the data set

The most important thing you always have to do before jumping into the ML training phase is to clean up your data!

So let’s take a look at the data again:

df.describe()

[Output: df.describe() — summary statistics (count, mean, std, min, 25%, 50%, 75%, max) for all 55 columns.]

df.keys()

Index(['Elevation', 'Aspect', 'Slope', 'Horizontal_Distance_To_Hydrology',
'Vertical_Distance_To_Hydrology', 'Horizontal_Distance_To_Roadways',
'Hillshade_9am', 'Hillshade_Noon', 'Hillshade_3pm',
'Horizontal_Distance_To_Fire_Points', 'Wilderness_Area_0',
'Wilderness_Area_1', 'Wilderness_Area_2', 'Wilderness_Area_3',
'Soil_Type_0', 'Soil_Type_1', 'Soil_Type_2', 'Soil_Type_3',
'Soil_Type_4', 'Soil_Type_5', 'Soil_Type_6', 'Soil_Type_7',
'Soil_Type_8', 'Soil_Type_9', 'Soil_Type_10', 'Soil_Type_11',
'Soil_Type_12', 'Soil_Type_13', 'Soil_Type_14', 'Soil_Type_15',
'Soil_Type_16', 'Soil_Type_17', 'Soil_Type_18', 'Soil_Type_19',
'Soil_Type_20', 'Soil_Type_21', 'Soil_Type_22', 'Soil_Type_23',
'Soil_Type_24', 'Soil_Type_25', 'Soil_Type_26', 'Soil_Type_27',
'Soil_Type_28', 'Soil_Type_29', 'Soil_Type_30', 'Soil_Type_31',
'Soil_Type_32', 'Soil_Type_33', 'Soil_Type_34', 'Soil_Type_35',
'Soil_Type_36', 'Soil_Type_37', 'Soil_Type_38', 'Soil_Type_39',
'Cover_Type'],
dtype='object')


Well, this looks OK. However, there are many one-hot-encoded features in the data set (wilderness areas and soil types). We can recode these into two categorical variables to reduce the dimensionality of the data set significantly:

df_clean = df[df.keys()[:10]].copy()                                      # clip data set to first 10 features (without wilderness area and soil types)
df_clean["wilderness_area"] = df[df.keys()[10:14]].values.argmax(axis=1)  # add wilderness areas encoded as one variable with values 0-3
df_clean["soil_type"] = df[df.keys()[14:-1]].values.argmax(axis=1)        # add soil types encoded as one variable with values 0-39
df_clean["Cover_Type"] = df["Cover_Type"]                                 # add target feature

df_clean

[Output: preview of df_clean — 581012 rows × 13 columns: the ten numeric features plus wilderness_area, soil_type, and Cover_Type.]

We could now examine the data set further. For example, it always makes sense to look at the correlations of the predictors with each other as well as with the target feature to get an understanding of the relationships between the variables.

fig, ax = plt.subplots(1,1,figsize=(8,8))
im = ax.imshow(df_clean.corr(),cmap="Reds")
ax.set_xticks(np.arange(len(df_clean.keys()))); ax.set_xticklabels(df_clean.keys())
ax.set_yticks(np.arange(len(df_clean.keys()))); ax.set_yticklabels(df_clean.keys())
plt.setp(ax.get_xticklabels(), rotation=45, ha="right", rotation_mode="anchor")
ax.set_title("Correlation matrix")
fig.tight_layout()
plt.show()


This correlation matrix shows us that the hillshade variables and the aspect are somewhat correlated and that the elevation is related to the soil type. We can also see that the wilderness area has the strongest correlation with our target variable.
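
We can quantify this by ranking the predictors by their absolute correlation with the target; this is just a quick check on the cleaned data set:

# rank the predictors by their absolute correlation with the target variable
corr_with_target = df_clean.corr()["Cover_Type"].drop("Cover_Type")
print(corr_with_target.abs().sort_values(ascending=False))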

There are many more analyses that could be applied to the data before conducting the machine learning part. However, here we want to focus on the application of the ML methods. So, on this “clean” data set, we can train and test the model again:

feature_names_clean = ['Elevation','Aspect','Slope','Horizontal_Distance_To_Hydrology','Vertical_Distance_To_Hydrology','Horizontal_Distance_To_Roadways','Hillshade_9am','Hillshade_Noon','Hillshade_3pm','Horizontal_Distance_To_Fire_Points','wilderness_area','soil_type']

X_train, X_test, y_train, y_test = train_test_split(df_clean[feature_names_clean], df_clean[target_names], test_size=0.2, random_state=42)
y_train = y_train.values.ravel() # convert to 1d array
y_test  = y_test.values.ravel()  # convert to 1d array

%%time
model = RandomForestClassifier(random_state=42, n_jobs= -1, oob_score=True, class_weight='balanced')
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print(accuracy_score(y_test, y_pred))
print(balanced_accuracy_score(y_test, y_pred))

0.9631076650344655
0.9032239907339351
CPU times: user 4min 3s, sys: 532 ms, total: 4min 3s
Wall time: 23.3 s


This gives us a further gain of roughly one percentage point in both accuracy and balanced accuracy.

In summary, we learned: