Practical introduction¶
Predicting forest covertypes¶
Before we dive into the manifold machine learning approaches in the context of remote sensing, let’s first take a look at a classic application: Predicting forest covertypes based on cartographic data. The samples in this dataset correspond to 30×30m patches of forest in the US (Colorado). The study area is depicted below:
Image source: [BD99].
The data set is described in detail here: https://archive.ics.uci.edu/ml/datasets/Covertype
It was also used in a kaggle competition: https://www.kaggle.com/c/forest-cover-type-prediction/overview
The cover types are encoded as:
1: Spruce/Fir
2: Lodgepole Pine
3: Ponderosa Pine
4: Cottonwood/Willow
5: Aspen
6: Douglas-fir
7: Krummholz
Our goal is to build a model that can predict these cover types using all or part of the 54 predictors given in the data set.
The original paper that was published on this data set achieved an overall accuracy of 70.58%.
Let’s see if we can beat that.
# general imports...
import warnings
warnings.filterwarnings('ignore')
Load the data set¶
We can easily load the data set using the scikit-learn library:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_covtype
dataset = fetch_covtype(as_frame=True)
This gives us a dictionary with all the information we need:
dataset.keys()
dict_keys(['data', 'target', 'frame', 'target_names', 'feature_names', 'DESCR'])
df = dataset["frame"]
target_names = dataset["target_names"]
feature_names = dataset["feature_names"]
class_names = ["Spruce/Fir","Lodgepole Pine","Ponderosa Pine","Cottonwood/Willow","Aspen","Douglas-fir","Krummholz"]
Let’s first take a look at the data:
df
Elevation | Aspect | Slope | Horizontal_Distance_To_Hydrology | Vertical_Distance_To_Hydrology | Horizontal_Distance_To_Roadways | Hillshade_9am | Hillshade_Noon | Hillshade_3pm | Horizontal_Distance_To_Fire_Points | ... | Soil_Type_31 | Soil_Type_32 | Soil_Type_33 | Soil_Type_34 | Soil_Type_35 | Soil_Type_36 | Soil_Type_37 | Soil_Type_38 | Soil_Type_39 | Cover_Type | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2596.0 | 51.0 | 3.0 | 258.0 | 0.0 | 510.0 | 221.0 | 232.0 | 148.0 | 6279.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 5 |
1 | 2590.0 | 56.0 | 2.0 | 212.0 | -6.0 | 390.0 | 220.0 | 235.0 | 151.0 | 6225.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 5 |
2 | 2804.0 | 139.0 | 9.0 | 268.0 | 65.0 | 3180.0 | 234.0 | 238.0 | 135.0 | 6121.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2 |
3 | 2785.0 | 155.0 | 18.0 | 242.0 | 118.0 | 3090.0 | 238.0 | 238.0 | 122.0 | 6211.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2 |
4 | 2595.0 | 45.0 | 2.0 | 153.0 | -1.0 | 391.0 | 220.0 | 234.0 | 150.0 | 6172.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 5 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
581007 | 2396.0 | 153.0 | 20.0 | 85.0 | 17.0 | 108.0 | 240.0 | 237.0 | 118.0 | 837.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 3 |
581008 | 2391.0 | 152.0 | 19.0 | 67.0 | 12.0 | 95.0 | 240.0 | 237.0 | 119.0 | 845.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 3 |
581009 | 2386.0 | 159.0 | 17.0 | 60.0 | 7.0 | 90.0 | 236.0 | 241.0 | 130.0 | 854.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 3 |
581010 | 2384.0 | 170.0 | 15.0 | 60.0 | 5.0 | 90.0 | 230.0 | 245.0 | 143.0 | 864.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 3 |
581011 | 2383.0 | 165.0 | 13.0 | 60.0 | 4.0 | 67.0 | 231.0 | 244.0 | 141.0 | 875.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 3 |
581012 rows × 55 columns
We can see that the data is already structured the way we need it to be able to feed it into the machine learning process. However, we have to split the data into different subsets in order to be able to assess the performances of the models we are going to build.
Split the data set¶
In the context of ML, we normally split a data set into the following subsets:
train
(used for training the model)test
(used for testing the final model)
The train
set can further be split into train
and validation
subsets. The validation
subset is used within the feature selection and hyperparameter tuning processes in order to find the best possible model. The test
set is used to evaluate the performance of the final model. The exact procedure of these splits can differ and is dependent on the data structure and the problem we want to solve.
We can use the built-in train_test_split
function to make a random split into 80% train
and 20% test
data:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df[feature_names], df[target_names], test_size=0.2, random_state=42)
y_train = y_train.values.ravel() # convert to 1d array
y_test = y_test.values.ravel() # convert to 1d array
print("Samples in train data: {0:5d}\nSamples in test data: {1:5d}".format(y_train.size,y_test.size))
Samples in train data: 464809
Samples in test data: 116203
Baseline model¶
For comparison purposes, let’s go ahead and build a simple Decision Tree classification model, fit it to the data and check its performance…
from sklearn.tree import DecisionTreeClassifier, plot_tree
dt_model = DecisionTreeClassifier(random_state=42)
dt_model.fit(X_train, y_train)
DecisionTreeClassifier(random_state=42)
dt_model.score(X_test, y_test)
0.9389516621773965
OK, that looks already pretty good. With an average accuracy of ~93.8% the simple Decision Tree model already achieves good results on the validation data set.
We can plot the fitted model to get an idea of what the model learned and how it uses the data to predict forest cover types:
plt.figure(figsize=(24,10))
plot_tree(dt_model,feature_names=feature_names,filled=True,class_names=class_names,proportion=True,precision=0,max_depth=2)
plt.show()
Looking at the scores of the different forest cover types, we can see, that, due to the different class counts, some classes are recognized worse by the model:
class_scores_dt = [dt_model.score(X_test[y_test==covertype],y_test[y_test==covertype]) for covertype in np.unique(y_test)]
class_counts = np.unique(y_test,return_counts=True)[1]
fig, ax = plt.subplots(1,2,figsize=(12,3),sharey=True)
ax[0].barh(range(len(class_scores_dt)),class_scores_dt)
ax[0].set_xlim((0.8,1.0))
ax[0].set_yticklabels(class_names)
ax[0].set_yticks(range(7))
ax[0].set_title("Accuracy")
ax[1].barh(range(len(class_counts)),class_counts)
ax[1].set_title("Sample count")
plt.show()
To account for imbalances in the sample counts of the classes, we will also calculate the balanced accuracy score:
from sklearn.metrics import balanced_accuracy_score
from sklearn.metrics import accuracy_score
y_pred = dt_model.predict(X_test)
print(accuracy_score(y_test, y_pred))
print(balanced_accuracy_score(y_test, y_pred, adjusted=True))
0.9389516621773965
0.8832865638983036
Still pretty high, but we can see that the balanced accuracy score only reaches around 88% due to some classes having lower scores.
Random Forest model¶
Let’s see if we can achieve better results with a more advanced model. For this purpose, we’ll take a random forest model:
from sklearn.ensemble import RandomForestClassifier
%%time
model = RandomForestClassifier(random_state=42, n_jobs= -1, oob_score=True, class_weight='balanced')
model.fit(X_train, y_train)
CPU times: user 4min 17s, sys: 1.66 s, total: 4min 19s
Wall time: 26.1 s
When working with random forests, we don’t need an independent validation data set to assess model performance: We can just use the internally calculated OOB score (out-of-bag):
model.oob_score_
0.9537250784730933
Let’s compare that to the validation score:
model.score(X_test, y_test)
0.9555088939184014
As we can see, both values are pretty close to each other, indicating that the OOB score really was a good estimate of the general model performance. However, we want to compare the results to those of the baseline model, so we also calculate the balanced accuracy score:
y_pred = model.predict(X_test)
print(accuracy_score(y_test, y_pred))
print(balanced_accuracy_score(y_test, y_pred, adjusted=True))
0.9555088939184014
0.8928133354659079
We see, even without further adjustments, the random forest model performs better on the data set. But let’s see if we can do even better…
Clean up the data set¶
The most important thing you always have to do before jumping into the ML training phase is to clean up your data!
So let’s take a look at the data again:
df.describe()
Elevation | Aspect | Slope | Horizontal_Distance_To_Hydrology | Vertical_Distance_To_Hydrology | Horizontal_Distance_To_Roadways | Hillshade_9am | Hillshade_Noon | Hillshade_3pm | Horizontal_Distance_To_Fire_Points | ... | Soil_Type_31 | Soil_Type_32 | Soil_Type_33 | Soil_Type_34 | Soil_Type_35 | Soil_Type_36 | Soil_Type_37 | Soil_Type_38 | Soil_Type_39 | Cover_Type | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 581012.000000 | 581012.000000 | 581012.000000 | 581012.000000 | 581012.000000 | 581012.000000 | 581012.000000 | 581012.000000 | 581012.000000 | 581012.000000 | ... | 581012.000000 | 581012.000000 | 581012.000000 | 581012.000000 | 581012.000000 | 581012.000000 | 581012.000000 | 581012.000000 | 581012.000000 | 581012.000000 |
mean | 2959.365301 | 155.656807 | 14.103704 | 269.428217 | 46.418855 | 2350.146611 | 212.146049 | 223.318716 | 142.528263 | 1980.291226 | ... | 0.090392 | 0.077716 | 0.002773 | 0.003255 | 0.000205 | 0.000513 | 0.026803 | 0.023762 | 0.015060 | 2.051471 |
std | 279.984734 | 111.913721 | 7.488242 | 212.549356 | 58.295232 | 1559.254870 | 26.769889 | 19.768697 | 38.274529 | 1324.195210 | ... | 0.286743 | 0.267725 | 0.052584 | 0.056957 | 0.014310 | 0.022641 | 0.161508 | 0.152307 | 0.121791 | 1.396504 |
min | 1859.000000 | 0.000000 | 0.000000 | 0.000000 | -173.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 |
25% | 2809.000000 | 58.000000 | 9.000000 | 108.000000 | 7.000000 | 1106.000000 | 198.000000 | 213.000000 | 119.000000 | 1024.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 |
50% | 2996.000000 | 127.000000 | 13.000000 | 218.000000 | 30.000000 | 1997.000000 | 218.000000 | 226.000000 | 143.000000 | 1710.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 2.000000 |
75% | 3163.000000 | 260.000000 | 18.000000 | 384.000000 | 69.000000 | 3328.000000 | 231.000000 | 237.000000 | 168.000000 | 2550.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 2.000000 |
max | 3858.000000 | 360.000000 | 66.000000 | 1397.000000 | 601.000000 | 7117.000000 | 254.000000 | 254.000000 | 254.000000 | 7173.000000 | ... | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 7.000000 |
8 rows × 55 columns
df.keys()
Index(['Elevation', 'Aspect', 'Slope', 'Horizontal_Distance_To_Hydrology',
'Vertical_Distance_To_Hydrology', 'Horizontal_Distance_To_Roadways',
'Hillshade_9am', 'Hillshade_Noon', 'Hillshade_3pm',
'Horizontal_Distance_To_Fire_Points', 'Wilderness_Area_0',
'Wilderness_Area_1', 'Wilderness_Area_2', 'Wilderness_Area_3',
'Soil_Type_0', 'Soil_Type_1', 'Soil_Type_2', 'Soil_Type_3',
'Soil_Type_4', 'Soil_Type_5', 'Soil_Type_6', 'Soil_Type_7',
'Soil_Type_8', 'Soil_Type_9', 'Soil_Type_10', 'Soil_Type_11',
'Soil_Type_12', 'Soil_Type_13', 'Soil_Type_14', 'Soil_Type_15',
'Soil_Type_16', 'Soil_Type_17', 'Soil_Type_18', 'Soil_Type_19',
'Soil_Type_20', 'Soil_Type_21', 'Soil_Type_22', 'Soil_Type_23',
'Soil_Type_24', 'Soil_Type_25', 'Soil_Type_26', 'Soil_Type_27',
'Soil_Type_28', 'Soil_Type_29', 'Soil_Type_30', 'Soil_Type_31',
'Soil_Type_32', 'Soil_Type_33', 'Soil_Type_34', 'Soil_Type_35',
'Soil_Type_36', 'Soil_Type_37', 'Soil_Type_38', 'Soil_Type_39',
'Cover_Type'],
dtype='object')
Well, this looks ok. However, there are many 1-hot-features in the data set (wilderness areas and soil types). We can recode these into 2 categorical variables to reduce the dimensionality of the data set significantly:
df_clean = df[df.keys()[:10]] # clip data set to first 10 features (without wilderness area and soil types)
df_clean["wilderness_area"] = df[df.keys()[10:14]].values.argmax(axis=1) # add wilderness areas encoded as one variable with values 0-3
df_clean["soil_type"] = df[df.keys()[14:-1]].values.argmax(axis=1) # add soil types encoded as one variable with vlaues 0-39
df_clean["Cover_Type"] = df["Cover_Type"] # add target feature
df_clean
Elevation | Aspect | Slope | Horizontal_Distance_To_Hydrology | Vertical_Distance_To_Hydrology | Horizontal_Distance_To_Roadways | Hillshade_9am | Hillshade_Noon | Hillshade_3pm | Horizontal_Distance_To_Fire_Points | wilderness_area | soil_type | Cover_Type | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2596.0 | 51.0 | 3.0 | 258.0 | 0.0 | 510.0 | 221.0 | 232.0 | 148.0 | 6279.0 | 0 | 28 | 5 |
1 | 2590.0 | 56.0 | 2.0 | 212.0 | -6.0 | 390.0 | 220.0 | 235.0 | 151.0 | 6225.0 | 0 | 28 | 5 |
2 | 2804.0 | 139.0 | 9.0 | 268.0 | 65.0 | 3180.0 | 234.0 | 238.0 | 135.0 | 6121.0 | 0 | 11 | 2 |
3 | 2785.0 | 155.0 | 18.0 | 242.0 | 118.0 | 3090.0 | 238.0 | 238.0 | 122.0 | 6211.0 | 0 | 29 | 2 |
4 | 2595.0 | 45.0 | 2.0 | 153.0 | -1.0 | 391.0 | 220.0 | 234.0 | 150.0 | 6172.0 | 0 | 28 | 5 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
581007 | 2396.0 | 153.0 | 20.0 | 85.0 | 17.0 | 108.0 | 240.0 | 237.0 | 118.0 | 837.0 | 2 | 1 | 3 |
581008 | 2391.0 | 152.0 | 19.0 | 67.0 | 12.0 | 95.0 | 240.0 | 237.0 | 119.0 | 845.0 | 2 | 1 | 3 |
581009 | 2386.0 | 159.0 | 17.0 | 60.0 | 7.0 | 90.0 | 236.0 | 241.0 | 130.0 | 854.0 | 2 | 1 | 3 |
581010 | 2384.0 | 170.0 | 15.0 | 60.0 | 5.0 | 90.0 | 230.0 | 245.0 | 143.0 | 864.0 | 2 | 1 | 3 |
581011 | 2383.0 | 165.0 | 13.0 | 60.0 | 4.0 | 67.0 | 231.0 | 244.0 | 141.0 | 875.0 | 2 | 1 | 3 |
581012 rows × 13 columns
We could now also further examine the data set for specific features. For example, it always makes sense to look at the correlation of the predictors among each other as well as with the target feature to get an understanding of the relationships between the variables.
fig, ax = plt.subplots(1,1,figsize=(8,8))
im = ax.imshow(df_clean.corr(),cmap="Reds")
ax.set_xticks(np.arange(len(df_clean.keys()))); ax.set_xticklabels(df_clean.keys())
ax.set_yticks(np.arange(len(df_clean.keys()))); ax.set_yticklabels(df_clean.keys())
plt.setp(ax.get_xticklabels(), rotation=45, ha="right", rotation_mode="anchor")
ax.set_title("Correlation matrix")
fig.tight_layout()
plt.show()
This correlation matrix shows us that the hillshade variables and the aspect show some correlation and that the elevation is somehow linked to the soil type. We can also see, that the wilderness area has the strongest correlation with our target variable.
There are many more analyses that could be applied to the data before conducting the machine learning part. However, here we want to focus on the application of the ML methods. So, on this “clean” data set, we can train and test the model again:
feature_names_clean = ['Elevation','Aspect','Slope','Horizontal_Distance_To_Hydrology','Vertical_Distance_To_Hydrology','Horizontal_Distance_To_Roadways','Hillshade_9am','Hillshade_Noon','Hillshade_3pm','Horizontal_Distance_To_Fire_Points','wilderness_area','soil_type']
X_train, X_test, y_train, y_test = train_test_split(df_clean[feature_names_clean], df_clean[target_names], test_size=0.2, random_state=42)
y_train = y_train.values.ravel() # convert to 1d array
y_test = y_test.values.ravel() # convert to 1d array
%%time
model = RandomForestClassifier(random_state=42, n_jobs= -1, oob_score=True, class_weight='balanced')
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print(accuracy_score(y_test, y_pred))
print(balanced_accuracy_score(y_test, y_pred, adjusted=True))
0.9631076650344655
0.9032239907339351
CPU times: user 4min 3s, sys: 532 ms, total: 4min 3s
Wall time: 23.3 s
Which gives us a slight accuracy gain by ~0.01.
In summary, we learned:
Look into your data
Clean up your data
Split your data into training and testing
Try different models
In the next session, we will take a look at how we can apply these techniques in remote sensing tasks.