{ "cells": [ { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Practical introduction\n", "\n", "## Predicting forest cover types\n", "\n", "![trees](images/trees.jpg)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Before we dive into the many different machine learning approaches used in remote sensing, let's first take a look at a classic application: predicting forest cover types from cartographic data. Each sample in this data set corresponds to a 30 m × 30 m patch of forest in the US (Colorado). The study area is depicted below:\n", "\n", "![](images/tree_cov_aoi.jpg) " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "Image source: {cite}`Blackard1999`." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "The data set is described in detail here: https://archive.ics.uci.edu/ml/datasets/Covertype\n", "\n", "It was also used in a Kaggle competition: https://www.kaggle.com/c/forest-cover-type-prediction/overview" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "The cover types are encoded as:\n", "\n", "- 1: Spruce/Fir\n", "- 2: Lodgepole Pine\n", "- 3: Ponderosa Pine\n", "- 4: Cottonwood/Willow\n", "- 5: Aspen\n", "- 6: Douglas-fir\n", "- 7: Krummholz" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Our goal is to build a model that predicts these cover types using all or part of the 54 predictors given in the data set.\n", "\n", "The original paper published on this data set achieved an overall accuracy of **70.58%**.\n", "\n", "Let's see if we can beat that." 
] }, { "cell_type": "code", "execution_count": 1, "metadata": { "slideshow": { "slide_type": "skip" } }, "outputs": [], "source": [ "# general imports...\n", "import warnings\n", "warnings.filterwarnings('ignore')" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Load the data set\n", "\n", "We can easily load the data set using the scikit-learn library:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd\n", "import matplotlib.pyplot as plt" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "slideshow": { "slide_type": "-" }, "tags": [ "pyflyby-cell" ] }, "outputs": [], "source": [ "from sklearn.datasets import fetch_covtype\n", "dataset = fetch_covtype(as_frame=True)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "This gives us a dictionary with all the information we need:" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [ { "data": { "text/plain": [ "dict_keys(['data', 'target', 'frame', 'target_names', 'feature_names', 'DESCR'])" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dataset.keys()" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "df = dataset[\"frame\"]\n", "target_names = dataset[\"target_names\"]\n", "feature_names = dataset[\"feature_names\"]\n", "class_names = [\"Spruce/Fir\",\"Lodgepole Pine\",\"Ponderosa Pine\",\"Cottonwood/Willow\",\"Aspen\",\"Douglas-fir\",\"Krummholz\"]" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Let's first take a look at the data:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "slideshow": { "slide_type": "-" } 
}, "outputs": [ { "data": { "text/plain": [ " Elevation Aspect Slope Horizontal_Distance_To_Hydrology \\\n", "0 2596.0 51.0 3.0 258.0 \n", "1 2590.0 56.0 2.0 212.0 \n", "2 2804.0 139.0 9.0 268.0 \n", "3 2785.0 155.0 18.0 242.0 \n", "4 2595.0 45.0 2.0 153.0 \n", "... ... ... ... ... \n", "581007 2396.0 153.0 20.0 85.0 \n", "581008 2391.0 152.0 19.0 67.0 \n", "581009 2386.0 159.0 17.0 60.0 \n", "581010 2384.0 170.0 15.0 60.0 \n", "581011 2383.0 165.0 13.0 60.0 \n", "\n", " Vertical_Distance_To_Hydrology Horizontal_Distance_To_Roadways \\\n", "0 0.0 510.0 \n", "1 -6.0 390.0 \n", "2 65.0 3180.0 \n", "3 118.0 3090.0 \n", "4 -1.0 391.0 \n", "... ... ... \n", "581007 17.0 108.0 \n", "581008 12.0 95.0 \n", "581009 7.0 90.0 \n", "581010 5.0 90.0 \n", "581011 4.0 67.0 \n", "\n", " Hillshade_9am Hillshade_Noon Hillshade_3pm \\\n", "0 221.0 232.0 148.0 \n", "1 220.0 235.0 151.0 \n", "2 234.0 238.0 135.0 \n", "3 238.0 238.0 122.0 \n", "4 220.0 234.0 150.0 \n", "... ... ... ... \n", "581007 240.0 237.0 118.0 \n", "581008 240.0 237.0 119.0 \n", "581009 236.0 241.0 130.0 \n", "581010 230.0 245.0 143.0 \n", "581011 231.0 244.0 141.0 \n", "\n", " Horizontal_Distance_To_Fire_Points ... Soil_Type_31 Soil_Type_32 \\\n", "0 6279.0 ... 0.0 0.0 \n", "1 6225.0 ... 0.0 0.0 \n", "2 6121.0 ... 0.0 0.0 \n", "3 6211.0 ... 0.0 0.0 \n", "4 6172.0 ... 0.0 0.0 \n", "... ... ... ... ... \n", "581007 837.0 ... 0.0 0.0 \n", "581008 845.0 ... 0.0 0.0 \n", "581009 854.0 ... 0.0 0.0 \n", "581010 864.0 ... 0.0 0.0 \n", "581011 875.0 ... 0.0 0.0 \n", "\n", " Soil_Type_33 Soil_Type_34 Soil_Type_35 Soil_Type_36 Soil_Type_37 \\\n", "0 0.0 0.0 0.0 0.0 0.0 \n", "1 0.0 0.0 0.0 0.0 0.0 \n", "2 0.0 0.0 0.0 0.0 0.0 \n", "3 0.0 0.0 0.0 0.0 0.0 \n", "4 0.0 0.0 0.0 0.0 0.0 \n", "... ... ... ... ... ... 
\n", "581007 0.0 0.0 0.0 0.0 0.0 \n", "581008 0.0 0.0 0.0 0.0 0.0 \n", "581009 0.0 0.0 0.0 0.0 0.0 \n", "581010 0.0 0.0 0.0 0.0 0.0 \n", "581011 0.0 0.0 0.0 0.0 0.0 \n", "\n", " Soil_Type_38 Soil_Type_39 Cover_Type \n", "0 0.0 0.0 5 \n", "1 0.0 0.0 5 \n", "2 0.0 0.0 2 \n", "3 0.0 0.0 2 \n", "4 0.0 0.0 5 \n", "... ... ... ... \n", "581007 0.0 0.0 3 \n", "581008 0.0 0.0 3 \n", "581009 0.0 0.0 3 \n", "581010 0.0 0.0 3 \n", "581011 0.0 0.0 3 \n", "\n", "[581012 rows x 55 columns]" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "We can see that the data is already structured the way we need it to be able to feed it into the machine learning process. However, we have to split the data into different subsets in order to be able to assess the performances of the models we are going to build." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Split the data set\n", "\n", "In the context of ML, we normally split a data set into the following subsets:\n", "\n", "- `train` (used for training the model)\n", "- `test` (used for testing the final model)\n", "\n", "The `train` set can further be split into `train` and `validation` subsets. The `validation` subset is used within the feature selection and hyperparameter tuning processes in order to find the best possible model. The `test` set is used to evaluate the performance of the final model. 
The exact procedure of these splits can differ and is dependent on the data structure and the problem we want to solve.\n", "\n", "We can use the built-in `train_test_split` function to make a random split into `80% train` and `20% test` data:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "from sklearn.model_selection import train_test_split" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [], "source": [ "X_train, X_test, y_train, y_test = train_test_split(df[feature_names], df[target_names], test_size=0.2, random_state=42)\n", "y_train = y_train.values.ravel() # convert to 1d array\n", "y_test = y_test.values.ravel() # convert to 1d array" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Samples in train data: 464809\n", "Samples in test data: 116203\n" ] } ], "source": [ "print(\"Samples in train data: {0:5d}\\nSamples in test data: {1:5d}\".format(y_train.size,y_test.size))" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Baseline model\n", "For comparison purposes, let's go ahead and build a simple Decision Tree classification model, fit it to the data and check its performance..." 
] }, { "cell_type": "code", "execution_count": 10, "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [], "source": [ "from sklearn.tree import DecisionTreeClassifier, plot_tree" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [ { "data": { "text/plain": [ "DecisionTreeClassifier(random_state=42)" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dt_model = DecisionTreeClassifier(random_state=42)\n", "dt_model.fit(X_train, y_train)" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "0.9389516621773965" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dt_model.score(X_test, y_test)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "-" } }, "source": [ "OK, that already looks pretty good. With an overall accuracy of ~93.9%, the simple Decision Tree model achieves good results on the test data set." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "We can plot the fitted model to get an idea of what the model learned and how it uses the data to predict forest cover types:" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "plt.figure(figsize=(24,10))\n", "plot_tree(dt_model,feature_names=feature_names,filled=True,class_names=class_names,proportion=True,precision=0,max_depth=2)\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Looking at the scores of the different forest cover types, we can see, that, due to the different class counts, some classes are recognized worse by the model:" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [ { "data": { "image/png": "iVBORw0KGgoAAAANSUhEUgAAAxIAAADSCAYAAADexPbiAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjMuMiwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8vihELAAAACXBIWXMAAAsTAAALEwEAmpwYAAAnqElEQVR4nO3deZhlVX3v//fHFkFsbGTQtAi2Iogo2gLiBAaHnwo44IiIMlwj8SZOV9EQR7jJjai5SpQoFxMZHQhXQSMOeEEmQbCBphsVCEITRRQRaUaJtN/fH3uVHstT1XWgqk6d4v16nnrOPmuvvfd37dq1zvmetfapVBWSJEmSNIj7DTsASZIkSaPHREKSJEnSwEwkJEmSJA3MREKSJEnSwEwkJEmSJA3MREKSJEnSwEwkJEmS5rgkhyQ5YdhxSL1MJKQRk+TMJL9Osu6wY5Gk+S7JzknOS7I6yU1JvpvkKcOOa65LsirJ84Ydh2aWiYQ0QpIsAXYBCnjJLB73/rN1LEmaK5I8GPga8ElgI2Az4FDgrmHGJc0VJhLSaNkX+B5wDLDfWGGSzZN8Ockvk/wqyRE9696Y5EdJbk3ywyTbt/JK8pieesck+fu2vGuSnyb5myQ/B45O8pAkX2vH+HVbfkTP9hslOTrJz9r6U1r5ZUle3FNvnSQ3Jlk6Q+dIkqbL1gBV9YWqWlNVd1bVaVW1AiDJlknOaP3ujUk+l2TDsY3bp/LvSrIiye1J/jXJw5J8o/XJ/y/JQ1rdJa1fPrD1o9cneedEgSV5WhspuTnJpUl2naRu39eIJPdL8r4k1ya5IclxSRa1dbsm+em4/fx+lKFNtfq3ts2tSX6QZMe27nhgC+Dfk9yW5N334NxrBJhISKNlX+Bz7ecF7QVpAd0nZtcCS+g+MfsiQJJXAYe07R5MN4rxqyke68/oPoF7JHAgXX9xdHu+BXAncERP/eOB9YHHAw8FPt7KjwNe11Nvd+D6qlo+xTgkaViuBNYkOTbJbmNv+nsE+BDwcOBxwOZ0fW6vVwD/H11S8mLgG8B7gE3o+tW3jqv/bGAr4PnAwf2mByXZDDgV+Hu6fvog4EtJNu1Td8LXCGD/9vNs4NHAQv64X1+bl7R9bQh8dWzbqno98J/Ai6tqYVV9ZIB9aoSYSEgjIsnOdG/i/62qLgJ+DLwW2InuRexdVXV7Vf2mqs5tm/0F8JGq+n51rqqqa6d4yN8BH6yqu9qncL+qqi9V1R1VdSvwv4A/b7EtBnYD3lRVv66q31bVWW0/JwC7tykCAK+nSzokaU6rqluAnemmk34G+GWSryZ5WFt/VVV9u/
WTvwQ+RusXe3yyqn5RVdcB5wAXVNUlVXUXcDLw5HH1D219+Uq6D2/27hPa64CvV9XXq+p3VfVtYBndBzXjTfYasQ/wsaq6uqpuA/4WeM0A01nPbTGsoevXnzTF7TRPmEhIo2M/4LSqurE9/3wr2xy4tqru7rPN5nQJxz3xy6r6zdiTJOsn+T9tCPwW4Gxgw/Zp1+bATVX16/E7qaqfAd8FXtGG/HejG1GRpDmvqn5UVftX1SOAJ9C9KT8cIMlDk3wxyXWtXzyBbqSh1y96lu/s83zhuPo/6Vm+th1vvEcCr2rTmm5OcjNdwrO4T93JXiMe3o7Re7z7Aw/rU7efn/cs3wGs5z119y3+sqURkOSBwKuBBe2eBYB16YaTfwFskeT+fV4ofgJsOcFu76CbijTmz4De+bA1rv47gccCT62qn7d7HC6hG9r/CbBRkg2r6uY+xzqWbnTk/sD57ZM5SRopVXV5kmOAv2xFH6LrK59YVb9KsieDTQ3qZ3Pg8ra8BfCzPnV+AhxfVW+cwv5+wsSvET+jS0rGbAHcTfe68nB6XiPah0Z/MnVqEuNfQzQPOSIhjYY9gTXAtsDS9vM4umHyPYHrgcOSPCjJekme2bb7F+CgJDuk85gkYy8ay4HXJlmQ5IX86XD8eBvQfXp2c5KNgA+Oraiq6+nm/X6q3ZS9TpJn9Wx7CrA98Da6eyYkac5Lsk2Sd459sUSSzemmGn2vVdkAuI2uX9wMeNc0HPb9bQT48cABwIl96pwAvDjJC1ofvl67OfoRfepeyMSvEV8A/keSRyVZCPwDcGJLOK6kG2HYI8k6wPvoPsCaql/Q3XehecxEQhoN+wFHV9V/VtXPx37oPvnam+4GvsfQ3dz2U2AvgKo6ie5ehs8Dt9K9od+o7fNtbbub6ebJnrKWGA4HHgjcSPci+s1x618P/Jbuk7QbgLePraiqO4EvAY8Cvjz1ZkvSUN0KPBW4IMntdH3fZXQjtNB9Fez2wGq6m5+no387C7gKOB34x6o6bXyFqvoJ8FK6m7Z/STfq8C76vK9r9y/0fY0APkt3b8PZwDXAb4C3tO1WA39F94HUdcDt/PGo9dp8CHhfm3p10ADbaYSkypEnSTMvyQeAravqdWutLEn3Men+T9A1wDoT3M8gzTneIyFpxrWpUG+gG7WQJEnzgFObJM2oJG+kG3b/RlWdPex4JEnS9HBqkyRJkqSBOSIhSZIkaWAmEpIkSZIG5s3WI2iTTTapJUuWDDsMSbpHLrroohurapB/bDXy7LcljarJ+mwTiRG0ZMkSli1bNuwwJOkeSXLtsGOYbfbbkkbVZH22U5skSZIkDcxEQpIkSdLATCQkSZIkDcxEQpIkSdLATCQkSZIkDcxvbRpBK69bzZKDTx12GJL6WHXYHsMOQXPQdPfbXmeS5gJHJCRJkiQNzERCkiRJ0sBMJCRJkiQNzERCkiRJ0sBGOpFIclvP8u5J/iPJFkOMZ1WSTQaov3+SI2YyJkmSJGkmjHQiMSbJc4FPAi+sqv8ct85vppIkSZKm2ci/yU6yC/AZYPeq+nErOwa4CXgycHGSW4Hbquof2/rLgBe1XXwTOBd4GnApcDRwKPBQYJ+qujDJIcCjgMXA1sA7Wv3dgOuAF1fVb9v+3pLkxcA6wKuq6vIkGwGfBR4N3AEcWFUrxrVjec/Tx9IlRWfd6xMkSZIkzYBRH5FYF/gKsGdVXT5u3dbA86rqnWvZx2OAfwKeCGwDvBbYGTgIeE9PvS2BPYCXAicA36mq7YA7W/mYG6tqe+DTbR/QJSaXVNUT2z6PGx9EVS2tqqXA+4FlwHlriVuSJEkamlFPJH5L94b7DX3WnVRVa6awj2uqamVV/Q74AXB6VRWwEljSU+8bbdRhJbCAbiSDPvW+3B4v6infGTgeoKrOADZOsmh8IEm2Aj4K7NUzwjG27sAky5IsW3PH6ik0S5I0TPbbkua7UU8kfge8GnhKkveMW3d7z/Ld/HFb1+tZvm
vc/u7qWb7/+Hot4fhtSzYmrAes6SlPn9ir90mSBwH/Bryxqn72J5WrjqqqHatqxwXr/0kOIkmaY+y3Jc13o55IUFV30N3vsE+SfiMTAKuA7QGSbE93v8NsOhvYpx1/V7rpT7eMq3M0cHRVnTO7oUmSJEmDG/mbrQGq6qYkLwTOTnJjnypfAvZtNzR/H7hyNuMDDgGOTrKC7mbr/XpXJnkk8Epg6yT/rRX/RVUtm9UoJUmSpCka6USiqhb2LP+EP4w0fGVcvTuB50+wmyf01Nu/Z3nV2LqqOmSS4x7Ss7ykZ3kZsGtbvonuJu3x8R8DHNOejvzokCRJku47fPMqSZIkaWAmEpIkSZIGZiIhSZIkaWAmEpIkSZIGNtI3W99XbbfZIpYdtsfaK0qS5gT7bUnzkSMSkiRJkgZmIiFJkiRpYCYSkiRJkgbmPRIjaOV1q1ly8KnDDkNzxCrnXUtz3nT12/69S5pLHJGQJEmSNDATCUmSJEkDM5GQJEmSNDATCUmSJEkDG/lEIsmaJMuT/CDJpUnekWTa25XkkCQHTcN+tmnxXpJkyyTnTUd8kiRJ0myaD9/adGdVLQVI8lDg88Ai4IPDDGoSewJfqaqx+J4xvkKSBVW1ZlajkiRJkgYw8iMSvarqBuBA4M3prJfk6CQr2wjAswGS7J/kiLHtknwtya5t+Q1JrkxyZpLP9Nbrqf/GJN9vIyBfSrJ+K39Vksta+dl9ttsdeDvwF0m+08pua4+7JvlOks8DK6f3zEiSJEnTaz6MSPyRqrq6TW16KPC6VrZdkm2A05JsPdG2SR4OvB/YHrgVOAO4tE/VL1fVZ9o2fw+8Afgk8AHgBVV1XZIN+8T29SRHArdV1T/22e9OwBOq6popN1iSJEkagnk1ItEj7XFn4HiAqrocuBaYMJGgeyN/VlXdVFW/BU6aoN4TkpyTZCWwD/D4Vv5d4JgkbwQW3IO4L5woiUhyYJJlSZatuWP1Pdi1JGk22W9Lmu/mXSKR5NHAGuAG/pBQjHc3f9z29cY2n+JhjgHeXFXbAYeObV9VbwLeB2wOLE+ycZtatTzJ16ew39snWlFVR1XVjlW144L1F00xTEnSsNhvS5rv5lUikWRT4EjgiKoq4Gy6EQPalKYtgCuAVcDSJPdLsjndSATAhcCfJ3lIkvsDr5jgUBsA1ydZZ2z/7RhbVtUFVfUB4EZg86o6oKqWVtXu091eSZIkaVjmwz0SD0yyHFiHbqTheOBjbd2ngCPbFKS7gf2r6q4k3wWuobup+TLgYoB2b8M/ABcAPwN+CPQbj35/q3Nt28cGrfyjSbaiG9k4nf73V0iSJEkjb+QTiaqa8F6EqvoNsH+f8qJnJGGcz1fVUW1E4mTgtLbNIT3bfxr4dJ/9vnwK8R4y7vnC9ngmcObatpckSZLmgnk1tWmaHNJGOC6jG7U4ZajRSJIkSXPQyI9ITLequtf/vVqSJEma7xyRkCRJkjQwRyRG0HabLWLZYXsMOwxJ0hTZb0uajxyRkCRJkjQwEwlJkiRJAzORkCRJkjQwEwlJkiRJA/Nm6xG08rrVLDn41GGHMbBV3mgo6T5qqv22/aSkUeKIhCRJkqSBmUhIkiRJGpiJhCRJkqSBmUhIkiRJGpiJxBQleVmSSrLNsGORJEmShs1EYur2Bs4FXjPsQCRJkqRhM5GYgiQLgWcCb6AlEkkWJzk7yfIklyXZpZXfluR/J7k4yelJNm3lWyb5ZpKLkpwzNrKR5Jgkn0hyXpKrk7xySM2UJEmSpsxEYmr2BL5ZVVcCNyXZHngt8K2qWgo8CVje6j4IuLiqtgfOAj7Yyo8C3lJVOwAHAZ/q2f9iYGfgRcBh/QJIcmCSZUmWrblj9TQ2TZI0E+y3Jc13/kO6qdkbOLwtf7E9/3fgs0nWAU6pquVt/e+AE9vyCcCX24jGM4CTkoztc92e/Z9SVb8DfpjkYf0CqKqj6JIR1l28VU1DmyRJM8h+W9J8ZyKxFkk2Bp4DPC
FJAQuAAt4NPAvYAzg+yUer6rg+uyi6kZ+b2+hFP3f1HnK6YpckSZJmilOb1u6VwHFV9ciqWlJVmwPX0CURN1TVZ4B/BbZv9e/XtoFu+tO5VXULcE2SVwGk86RZbYUkSZI0jRyRWLu9+dP7Fr4EHAPcnuS3wG3Avm3d7cDjk1wErAb2auX7AJ9O8j5gHbopUpfObOiSJEnSzDCRWIuq2rVP2SeAT0yyzfuB948ruwZ4YZ+6+497vvAehipJkiTNGqc2SZIkSRqYicQ0c0RBkiRJ9wUmEpIkSZIG5j0SI2i7zRax7LA9hh2GJGmK7LclzUeOSEiSJEkamImEJEmSpIGZSEiSJEkamImEJEmSpIF5s/UIWnndapYcfOq93s8qb/yTpFkx1m/b70qaTxyRkCRJkjQwEwlJkiRJAzORkCRJkjQwEwlJkiRJA5tSIpHkz5J8McmPk/wwydeTbD1B3SVJXtvzfGmS3acr4OmWZP8kR/Q8X5zktCQnJ9mzp/yKJO/ref6lJC9P8qYk+7ayY5K8si2fmWTHWWyKJEmSNGvWmkgkCXAycGZVbVlV2wLvAR42wSZLgNf2PF8KzNlEoo8XAt8CzgOeAZBkY+A24Ok99Z4OnFdVR1bVcbMepSRJkjREUxmReDbw26o6cqygqpYD5yb5aJLLkqxMsldbfRiwS5LlSf4G+J/AXu35Xkk2SnJKkhVJvpfkiQBJDkny2fZJ/tVJ3trK392z/PEkZ7Tl5yY5oS3v3WK4LMmHx+KcpPyAJFcmOQt45rj2vhD4BvBdWiLRHr8GbJrOo4A7q+rnLe6DJjuB/eJI8uokH2vLb0tydVveMsm5a/+1SJIkScMzlf8j8QTgoj7lL6cbbXgSsAnw/SRnAwcDB1XViwCS/ALYsare3J5/ErikqvZM8hzguLYfgG3oEpcNgCuSfBo4G3gn8AlgR2DdJOsAOwPnJHk48GFgB+DXwGltStKFE5RfABzaylcD3wEuabEtAB5bVT9Msi7whCQPoEskzgIeDTwOeDJdorFWk8R3NvCuVm0X4FdJNhtrV5/9HAgcCLDgwZtO5dCSpCGy35Y0392bm613Br5QVWuq6hd0b7SfMsXtjgeoqjOAjZMsautOraq7qupG4Aa66VMXATsk2QC4CzifLqHYhe4N91Popl39sqruBj4HPGuS8qf2lP8XcGJPbE+lSzSoqruAHwDbA09r5efTJRXPoJv6NBV946iqnwMLW7s2Bz7f4htr1x+pqqOqaseq2nHB+ovGr5YkzTH225Lmu6kkEj+g+zR9vNzDY/bbrtrjXT1la4D7V9VvgVXAAXRv3s+hG7XYEvjRJHFMFl9NUL4b8M2e5+fRvbnfoKp+DXyPPyQSUxqRWEsc59O16wq6du1Cd+/FVPctSZIkDcVUEokz6KYTvXGsIMlT6Kbp7JVkQZJN6d5wXwjcSjc1acz452cD+7T97ArcWFW3rCWGs4GD2uM5wJuA5VVVdCMFf55kkzY1aW+60ZHJyndNsnGbIvWqnuM8Fzi95/l3gb8ELm3PV9CNTmxBl2BNxURxjG/XJXQJ0l1VtXqK+5YkSZKGYq33SFRVJXkZcHiSg4Hf0I0QvB1YSPcmu4B3t5uPfwXcneRS4BjgWODgJMuBDwGHAEcnWQHcAew3hTjPAd4LnF9Vtyf5TSujqq5P8rd09zoE+HpVfQVgkvJD6EYDrgcuBsaSod+MS2rOo7sv4kPtWHcnuQH4SVX9bgpxTxpfa8PmwNlVtSbJT4DLp7JfSZIkaZjSfaivJK8DHlFVhw07lrVZd/FWtXi/w+/1flYdtse9D0aSBpTkoqq6T/2fnbF+235X0qiZrM+eyrc23SdU1QnDjkGSJEkaFffmW5skSZIk3UeZSEiSJEkamFObRtB2my1imfNsJWlk2G9Lmo8ckZAkSZI0MBMJSZIkSQMzkZAkSZI0MO+RGEErr1vNkoNPHXYY0kjwe/s1F8y1ftu/C0nTwREJSZIkSQ
MzkZAkSZI0MBMJSZIkSQMzkZAkSZI0MBMJSZIkSQOb1UQiyZoky5NcluSkJOtPwz4PSXLQdMR3L2JYkuTO1rYfJjkyyf2SvCTJwcOMTZIkSZoJsz0icWdVLa2qJwD/Bbxplo9Pkpn6ytsfV9VS4InAtsCeVfXVqjpsho4nSZIkDc0wpzadAzwmyUZJTkmyIsn3kjwRfj/S8NkkZya5OslbxzZM8t4kVyT5f8Bje8q3TPLNJBclOSfJNq38mCQfS/Id4MNJlrZjrUhycpKHtHpvbSMKK5J8sZXtlOS8JJe0x8cyiaq6GzivtW3/JEf0xPCJto+rk7yyJ+53Jfl+O+6h03R+JUmSpBkzlESijQrsBqwEDgUuqaonAu8Bjuupug3wAmAn4INJ1kmyA/Aa4MnAy4Gn9NQ/CnhLVe0AHAR8qmfd1sDzquqd7Rh/0465Evhgq3Mw8ORWPjZacjnwrKp6MvAB4B/W0rb1gee2/Y63GNgZeBFwWKv/fGCr1salwA5JntVnvwcmWZZk2Zo7Vk8WgiRpDrDfljTfzfZ/tn5gkuVt+RzgX4ELgFcAVNUZSTZOsqjVObWq7gLuSnID8DBgF+DkqroDIMlX2+NC4BnASUnGjrduz7FPqqo1bd8bVtVZrfxY4KS2vAL4XJJTgFNa2SLg2CRbAQWsM0HbtmxtK+ArVfWNJPuPq3NKVf0O+GGSh7Wy57efS9rzhXSJxdm9G1bVUXSJEusu3qomiEGSNEfYb0ua72Y7kbiz3Ufwe+l5199jrMO9q6dsDX+It1+HfD/g5vH773H7FOLbA3gW8BLg/UkeD/wd8J2qelmSJcCZE2z740mOPaa3Pel5/FBV/Z8pxCdJkiTNCXPh61/PBvYBSLIrcGNV3bKW+i9L8sAkGwAvBmjbXJPkVW1fSfKk8RtX1Wrg10l2aUWvB85Kcj9g86r6DvBuYEO60YFFwHWt7v73vJkT+hbw39qICkk2S/LQGTiOJEmSNG1me0Sin0OAo5OsAO4A9pusclVdnOREYDlwLd0UqTH7AJ9O8j66KUhfBC7ts5v9gCPb/QxXAwcAC4AT2tSnAB+vqpuTfIRuatM7gDPucSsnbs9pSR4HnN8GZ24DXgfcMN3HkiRJkqZLqpy2OWrWXbxVLd7v8GGHIY2EVYftMewQNE6Si6pqx2HHMZvmWr/t34WkqZqsz54LU5skSZIkjRgTCUmSJEkDM5GQJEmSNLC5cLO1BrTdZotY5vxWSRoZ9tuS5iNHJCRJkiQNzERCkiRJ0sBMJCRJkiQNzHskRtDK61az5OBThx2GdJ/ld/BrUPbbkoZtJl67HJGQJEmSNDATCUmSJEkDM5GQJEmSNDATCUmSJEkDm/VEIslt93C7JUkum+54JjneIUkOGqD+/kl+mWR5kh8meWMr/59JnjdzkUqSJEmzz29tml4nVtWbkzwU+EGSr1bVB4YdlCRJkjTd5sTUpiRLk3wvyYokJyd5SCvfIcmlSc4H/rqn/vpJ/q3VPzHJBUl2bOuen+T8JBcnOSnJwla+KsmHk1zYfh7Tyh+Z5PS2r9OTbNEnvi2TfDPJRUnOSbLNZO2pqhuAHwOPTHJMklf2xHBoi23l2H6SPCjJZ5N8P8klSV46LSdWkiRJmiFzIpEAjgP+pqqeCKwEPtjKjwbeWlVPH1f/r4Bft/p/B+wAkGQT4H3A86pqe2AZ8I6e7W6pqp2AI4DDW9kRwHFtX58DPtEnvqOAt1TVDsBBwKcma0ySRwOPBq7qs/rGFtun274A3gucUVVPAZ4NfDTJgyY7hiRJkjRMQ5/alGQRsGFVndWKjgVO6lN+PLBbW94Z+CeAqrosyYpW/jRgW+C7SQAeAJzfc7gv9Dx+vC0/HXh5zzE+Mi6+hcAzWkxjxetO0Jy9kuwM3AX8ZVXd1LPNmC+3x4t6jvt84CU992SsB2wB/KgnjgOBAwEWPHjTCQ4vSZor7LclzXdDTyQmEaAmWTdR+berau8J1tcEyx
PVgW7U5uaqWjpB/V4nVtWb11Lnrva4hj+c/wCvqKorJtqoqo6iGxlh3cVbTRS7JGmOsN+WNN8NfWpTVa0Gfp1kl1b0euCsqroZWN0+4QfYp2ezc4FXAyTZFtiulX8PeGbP/Q/rJ9m6Z7u9eh7HRirOA17Tc4xzx8V3C3BNkle1fSbJk+5hcyfyLeAtacMXSZ48zfuXJEmSptUwRiTWT/LTnucfA/YDjkyyPnA1cEBbdwDw2SR30L3ZHvMp4Ng2pekSYAWwuqp+mWR/4AtJxqYfvQ+4si2vm+QCugRqbNTire0Y7wJ+2XPsXvsAn07yPmAd4IvApfeo9f39Hd09GytaMrEKeNE07l+SJEmaVqkavdHWJAuAdarqN0m2BE4Htq6q/5pkm1XAjlV14yyFOWPWXbxVLd7v8GGHId1nrTpsj2GHMNKSXFRVOw47jtlkvy1p2O7pa9dkffZcvkdiMusD30myDt39Bf99siRCkiRJ0vQayUSiqm4FBvo0q6qWzEw0kiRJ0n3P0G+2liRJkjR6TCQkSZIkDWwkpzbd12232SKWebOnJI0M+21J85EjEpIkSZIGZiIhSZIkaWAmEpIkSZIG5j0SI2jldatZcvCpww5DGmn+UznNJvttaebZr88+RyQkSZIkDcxEQpIkSdLATCQkSZIkDcxEQpIkSdLA5m0ikeS9SX6QZEWS5UmeOqQ41klyUVte02IZ+1mS5LxhxCVJkiTdG/PyW5uSPB14EbB9Vd2VZBPgAVPc9v5Vdfc0hrMzMJYs3FlVS8etf0afGBZU1ZppjEGSJEmaVvMykQAWAzdW1V0AVXUjQJJVwInAs1u911bVVUmOAW4CngxcnORW4Laq+se23WXAi6pqVZJ9gYOAAlZU1euTbAocCWzR9vv2qvpuW34h8I2JAk1yW1UtTLIr8EHgemApsO29PQmSJEnSTJmvU5tOAzZPcmWSTyX58551t1TVTsARwOE95VsDz6uqd0600ySPB94LPKeqngS8ra36J+DjVfUU4BXAv/Rs9mzgzLb8wJ5pTSf3OcROwHuryiRCkiRJc9q8HJGoqtuS7ADsQvdG/sQkB7fVX+h5/HjPZidNYTrRc4D/OzbCUVU3tfLnAdsmGav34CQbABsAN1XVHa2839SmXhdW1TX9ViQ5EDgQYMGDN11LmJKkYbPfljTfzctEAqAlBWcCZyZZCew3tqq3Ws/y7T3Ld/PHozXrtceM22bM/YCnV9WdvYVJXg18a4Cwb59oRVUdBRwFsO7irfrFIEmaQ+y3Jc1383JqU5LHJtmqp2gpcG1b3qvn8fwJdrEK2L7ta3vgUa38dODVSTZu6zZq5acBb+45/tK2OOn9EZIkSdKomq8jEguBTybZkG504Sq64eUXAesmuYAuidp7gu2/BOybZDnwfeBKgKr6QZL/BZyVZA1wCbA/8Fbgn5OsoDunZyf5a2Crqrp8RlooSZIkDdG8TCSq6iL6f60qwD9X1aHj6u8/7vmdwPMn2PexwLHjym7kDyMdY8faGfjeuHoL++xvYXs8kz/clC1JkiTNafMykZgLqupc4NxhxyFJkiTNhPtUIlFVS4YdgyRJkjQfzMubrSVJkiTNrPvUiMR8sd1mi1h22B7DDkOSNEX225LmI0ckJEmSJA3MREKSJEnSwEwkJEmSJA3MREKSJEnSwEwkJEmSJA3MREKSJEnSwEwkJEmSJA0sVTXsGDSgJLcCVww7jj42AW4cdhATmKuxGddgjGswczWux1bVBsMOYjbN4X77npir19U9MZ/aAvOrPbZl7nhkVW3ab4X/kG40XVFVOw47iPGSLJuLccHcjc24BmNcg5nLcQ07hiGYk/32PTFXr6t7Yj61BeZXe2zLaHBqkyRJkqSBmUhIkiRJGpiJxGg6atgBTGCuxgVzNzbjGoxxDca45o751GbbMnfNp/bYlhHgzdaSJEmSBuaIhCRJkqSBmUjMAUlemOSKJFclObjP+kVJ/j3JpUl+kO
SAtW2bZKMk307yH+3xIbMVV5LNk3wnyY9a+dt6tjkkyXVJlref3WcrrrZuVZKV7djLesqHeb4e23M+lie5Jcnb27rZOF8PSXJykhVJLkzyhLVtO0vnq29cc+D6mux8DfP6muh8zfT19dkkNyS5bIL1SfKJFveKJNuvrU3Tcb7mirX93oal3+9tsvOe5G9bG65I8oKe8h3aNX9V+z2nla+b5MRWfkGSJTPYlr59wii2J8l67e/30taWQ0e1LT1xLEhySZKvzYO2/EkfP8rtmRZV5c8Qf4AFwI+BRwMPAC4Fth1X5z3Ah9vypsBNre6E2wIfAQ5uywePbT9LcS0Gtm/lGwBX9sR1CHDQMM5Xe74K2KTPfod2vvrs5+d039k8W+fro8AH2/I2wOlr23aWztdEcQ37+uob1xy4viaMa6aur7aPZwHbA5dNsH534BtAgKcBF8z09TVXfqbyextibH/ye5vovAPbttjXBR7V2rSgrbsQeHr7/X4D2K2V/xVwZFt+DXDiDLalb58wiu1px13YltcBLmh/NyPXlp42vQP4PPC1Ub7O2jFWMa6PH+X2TMePIxLDtxNwVVVdXVX/BXwReOm4OgVs0DLWhXRvQO9ey7YvBY5ty8cCe85WXFV1fVVdDFBVtwI/AjYb8PjTHtda9ju08zWuznOBH1fVtQMe/97EtS1wOkBVXQ4sSfKwtWw7G+erb1xz4Pqa6HxNZmjna1yd6b6+qKqz6a7libwUOK463wM2TLKYmb2+5oqp/N6GYoLf20Tn/aXAF6vqrqq6BrgK2Kn9Hh9cVedX987nuHHbjO3r/wLPHfvUdQbaMlGfMHLtaX8nt7Wn67SfGsW2ACR5BLAH8C89xSPZlknMt/YMxERi+DYDftLz/Kf86ZuiI4DHAT8DVgJvq6rfrWXbh1XV9dB1ssBDZzGu32vDck+m+1RlzJvTTXH4bAafsnBv4yrgtCQXJTmwZ5s5cb7oPoH4wriymT5flwIvB0iyE/BI4BFr2XY2ztdEcf3ekK6vyeIa5vW11vPF9F9fUzFR7DN5fc0VU/m9zSUTnffJfoc/7VP+R9tU1d3AamDjGYu8GdcnjGR72lSg5cANwLeramTbAhwOvBvofa0b1bZA/z5+lNtzr5lIDF+/TLPGPX8BsBx4OLAUOCLJg6e47TDi6naQLAS+BLy9qm5pxZ8Gtmz1rwf+9yzH9cyq2h7YDfjrJM8a8PgzFRdJHgC8BDipZ5vZOF+HAQ9pL1xvAS6hGykZ9vU1UVzdDoZ3fU0W1zCvr7Wdr5m4vqZiothn8vqaK+ZLG+/J73DW2z5Bn9C3ap+yOdOeqlpTVUvpPgjYKT33YfUxZ9uS5EXADVV10VQ36VM2J9rSY5A+fhTac6+ZSAzfT4HNe54/gu4T614HAF9uQ55XAdfQzYGebNtftOEz2uMNsxgXSdah69A/V1VfHtugqn7ROsnfAZ+hG/qftbiq6mft8Qbg5J7jD/V8NbsBF1fVL8YKZuN8VdUtVXVAe+Hal+7+jWvWsu2Mn69J4hrq9TVZXMO8viaLq5mJ6+vexD6T19dcMZV+YS6Z6LxP9jt8RJ/yP9omyf2BRUw+Be5emaBPGNn2AFTVzcCZwAsZzbY8E3hJklV00/qek+SEEW0LMGEfP7LtmQ4mEsP3fWCrJI9qnxi+BvjquDr/STe3mTbn+bHA1WvZ9qvAfm15P+ArsxVXm8/3r8CPqupjvRuM/bE1LwP6ftPLDMX1oCQbtPIHAc/vOf7QzlfP+r0ZN+1kNs5Xkg3bOoC/AM5un+YN9fqaKK5hX1+TxDXU62uS3+OYmbi+puKrwL7pPA1Y3Yb/Z/L6mium0i/MJROd968Cr0n3jTKPArYCLmy/x1uTPK39Xe47bpuxfb0SOKPNB592k/QJI9eeJJsm2bAtPxB4HnD5KLalqv62qh5RVUvorv0zqup1o9gW6Pr1Cfr4kWzPtKk5cMf3ff2H7ltNrqS7o/+9rexNwJva8s
OB0+jm1V8GvG6ybVv5xnQ3Xv5He9xotuICdqYbiltBN5VnObB7W3d8q7+C7g9m8SzG9Wi6eeSXAj+YK+errVsf+BWwaNw+Z+N8Pb21+3Lgy8BD5sj11TeuOXB9TRTXsK+vyX6PM3l9fYFuWtRv6T5Ne8O4uAL8c4t7JbDjbFxfc+VnojYO+2eC39uE5x14b2vDFbRvmGnlO9L1Zz+muw9s7B/drkc3je4qum+oefQMtqVvnzCK7QGeSDctcUWL4wOtfOTaMq5du/KHb20aybYwQR8/qu2Zrh//s7UkSZKkgTm1SZIkSdLATCQkSZIkDcxEQpIkSdLATCQkSZIkDcxEQpIkSdLATCQkSZIkDcxEQpIkSdLATCQkSZIkDez/B7H6HHuaHKKRAAAAAElFTkSuQmCC\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "# per-class accuracy: score the model on the test samples of each cover type separately\n", "class_scores_dt = [dt_model.score(X_test[y_test==covertype],y_test[y_test==covertype]) for covertype in np.unique(y_test)]\n", "class_counts = np.unique(y_test,return_counts=True)[1]\n", "\n", "fig, ax = plt.subplots(1,2,figsize=(12,3),sharey=True)\n", "ax[0].barh(range(len(class_scores_dt)),class_scores_dt)\n", "ax[0].set_xlim((0.8,1.0))\n", "ax[0].set_yticks(range(7)) # fix the tick positions first, then label them\n", "ax[0].set_yticklabels(class_names)\n", "ax[0].set_title(\"Accuracy\")\n", "ax[1].barh(range(len(class_counts)),class_counts)\n", "ax[1].set_title(\"Sample count\")\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "To account for imbalances in the sample counts of the classes, we will also calculate the [balanced accuracy score](https://scikit-learn.org/stable/modules/model_evaluation.html#balanced-accuracy-score):" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [], "source": [ "from sklearn.metrics import balanced_accuracy_score\n", "from sklearn.metrics import accuracy_score" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0.9389516621773965\n", "0.8832865638983036\n" ] } ], "source": [ "y_pred = dt_model.predict(X_test)\n", "print(accuracy_score(y_test, y_pred))\n", "print(balanced_accuracy_score(y_test, y_pred, adjusted=True))" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "-" } }, "source": [ "Still pretty high, but the balanced accuracy score only reaches around 0.88 because some classes have lower scores. Note that with `adjusted=True` the score is corrected for chance, so that random guessing would score 0 and a perfect prediction 1." 
] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Random Forest model\n", "\n", "Let's see if we can achieve better results with a more advanced model. For this purpose, we'll use a random forest model:" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [], "source": [ "from sklearn.ensemble import RandomForestClassifier" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 4min 17s, sys: 1.66 s, total: 4min 19s\n", "Wall time: 26.1 s\n" ] } ], "source": [ "%%time\n", "model = RandomForestClassifier(random_state=42, n_jobs=-1, oob_score=True, class_weight='balanced')\n", "model.fit(X_train, y_train)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "When working with random forests, we don't need an independent validation data set to assess model performance. Each tree is trained on a bootstrap sample of the training data, so the samples left out of that bootstrap can be used to evaluate it. We can just use the internally calculated [OOB score (out-of-bag)](https://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm#ooberr):" ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [ { "data": { "text/plain": [ "0.9537250784730933" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "model.oob_score_" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "-" } }, "source": [ "Let's compare that to the score on the test set:" ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [ { "data": { "text/plain": [ "0.9555088939184014" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "model.score(X_test, y_test)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { 
"slide_type": "slide" } }, "source": [ "As we can see, both values are very close to each other (about 95.4% vs. 95.6%), indicating that the OOB score really was a good estimate of the general model performance. However, we want to compare the results to those of the baseline model, so we also calculate the balanced accuracy score:" ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0.9555088939184014\n", "0.8928133354659079\n" ] } ], "source": [ "y_pred = model.predict(X_test)\n", "print(accuracy_score(y_test, y_pred))\n", "print(balanced_accuracy_score(y_test, y_pred, adjusted=True))" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "-" } }, "source": [ "Even without further adjustments, the random forest model performs better than the baseline on the test set: the accuracy rises from about 93.9% to 95.6%, and the adjusted balanced accuracy from about 0.88 to 0.89. But let's see if we can do even better..." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Clean up the data set\n", "\n", "The most important thing you **always** have to do before jumping into the ML training phase is to **clean up your data!**\n", "\n", "So let's take a look at the data again:" ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Elevation Aspect Slope \\\n", "count 581012.000000 581012.000000 581012.000000 \n", "mean 2959.365301 155.656807 14.103704 \n", "std 279.984734 111.913721 7.488242 \n", "min 1859.000000 0.000000 0.000000 \n", "25% 2809.000000 58.000000 9.000000 \n", "50% 2996.000000 127.000000 13.000000 \n", "75% 3163.000000 260.000000 18.000000 \n", "max 3858.000000 360.000000 66.000000 \n", "\n", " Horizontal_Distance_To_Hydrology Vertical_Distance_To_Hydrology \\\n", "count 581012.000000 581012.000000 \n", "mean 269.428217 46.418855 \n", "std 212.549356 58.295232 \n", "min 0.000000 -173.000000 \n", "25% 108.000000 7.000000 \n", "50% 218.000000 30.000000 \n", "75% 384.000000 69.000000 \n", "max 1397.000000 601.000000 \n", "\n", " Horizontal_Distance_To_Roadways Hillshade_9am Hillshade_Noon \\\n", "count 581012.000000 581012.000000 581012.000000 \n", "mean 2350.146611 212.146049 223.318716 \n", "std 1559.254870 26.769889 19.768697 \n", "min 0.000000 0.000000 0.000000 \n", "25% 1106.000000 198.000000 213.000000 \n", "50% 1997.000000 218.000000 226.000000 \n", "75% 3328.000000 231.000000 237.000000 \n", "max 7117.000000 254.000000 254.000000 \n", "\n", " Hillshade_3pm Horizontal_Distance_To_Fire_Points ... Soil_Type_31 \\\n", "count 581012.000000 581012.000000 ... 581012.000000 \n", "mean 142.528263 1980.291226 ... 0.090392 \n", "std 38.274529 1324.195210 ... 0.286743 \n", "min 0.000000 0.000000 ... 0.000000 \n", "25% 119.000000 1024.000000 ... 0.000000 \n", "50% 143.000000 1710.000000 ... 0.000000 \n", "75% 168.000000 2550.000000 ... 0.000000 \n", "max 254.000000 7173.000000 ... 
1.000000 \n", "\n", " Soil_Type_32 Soil_Type_33 Soil_Type_34 Soil_Type_35 \\\n", "count 581012.000000 581012.000000 581012.000000 581012.000000 \n", "mean 0.077716 0.002773 0.003255 0.000205 \n", "std 0.267725 0.052584 0.056957 0.014310 \n", "min 0.000000 0.000000 0.000000 0.000000 \n", "25% 0.000000 0.000000 0.000000 0.000000 \n", "50% 0.000000 0.000000 0.000000 0.000000 \n", "75% 0.000000 0.000000 0.000000 0.000000 \n", "max 1.000000 1.000000 1.000000 1.000000 \n", "\n", " Soil_Type_36 Soil_Type_37 Soil_Type_38 Soil_Type_39 \\\n", "count 581012.000000 581012.000000 581012.000000 581012.000000 \n", "mean 0.000513 0.026803 0.023762 0.015060 \n", "std 0.022641 0.161508 0.152307 0.121791 \n", "min 0.000000 0.000000 0.000000 0.000000 \n", "25% 0.000000 0.000000 0.000000 0.000000 \n", "50% 0.000000 0.000000 0.000000 0.000000 \n", "75% 0.000000 0.000000 0.000000 0.000000 \n", "max 1.000000 1.000000 1.000000 1.000000 \n", "\n", " Cover_Type \n", "count 581012.000000 \n", "mean 2.051471 \n", "std 1.396504 \n", "min 1.000000 \n", "25% 1.000000 \n", "50% 2.000000 \n", "75% 2.000000 \n", "max 7.000000 \n", "\n", "[8 rows x 55 columns]" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.describe()" ] }, { "cell_type": "code", "execution_count": 23, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/plain": [ "Index(['Elevation', 'Aspect', 'Slope', 'Horizontal_Distance_To_Hydrology',\n", " 'Vertical_Distance_To_Hydrology', 'Horizontal_Distance_To_Roadways',\n", " 'Hillshade_9am', 'Hillshade_Noon', 'Hillshade_3pm',\n", " 'Horizontal_Distance_To_Fire_Points', 'Wilderness_Area_0',\n", " 'Wilderness_Area_1', 'Wilderness_Area_2', 'Wilderness_Area_3',\n", " 'Soil_Type_0', 'Soil_Type_1', 'Soil_Type_2', 'Soil_Type_3',\n", " 'Soil_Type_4', 'Soil_Type_5', 'Soil_Type_6', 'Soil_Type_7',\n", " 'Soil_Type_8', 'Soil_Type_9', 'Soil_Type_10', 'Soil_Type_11',\n", " 'Soil_Type_12', 'Soil_Type_13', 'Soil_Type_14', 
'Soil_Type_15',\n", " 'Soil_Type_16', 'Soil_Type_17', 'Soil_Type_18', 'Soil_Type_19',\n", " 'Soil_Type_20', 'Soil_Type_21', 'Soil_Type_22', 'Soil_Type_23',\n", " 'Soil_Type_24', 'Soil_Type_25', 'Soil_Type_26', 'Soil_Type_27',\n", " 'Soil_Type_28', 'Soil_Type_29', 'Soil_Type_30', 'Soil_Type_31',\n", " 'Soil_Type_32', 'Soil_Type_33', 'Soil_Type_34', 'Soil_Type_35',\n", " 'Soil_Type_36', 'Soil_Type_37', 'Soil_Type_38', 'Soil_Type_39',\n", " 'Cover_Type'],\n", " dtype='object')" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.keys()" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "-" } }, "source": [ "Well, this looks ok. However, the data set contains many one-hot encoded features (4 wilderness area columns and 40 soil type columns). We can recode these into two categorical variables, which reduces the number of predictors from 54 to 12:" ] }, { "cell_type": "code", "execution_count": 24, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "df_clean = df[df.keys()[:10]].copy() # keep the first 10 features (without wilderness areas and soil types); copy to avoid modifying a view of df\n", "df_clean[\"wilderness_area\"] = df[df.keys()[10:14]].values.argmax(axis=1) # add wilderness areas encoded as one variable with values 0-3\n", "df_clean[\"soil_type\"] = df[df.keys()[14:-1]].values.argmax(axis=1) # add soil types encoded as one variable with values 0-39\n", "df_clean[\"Cover_Type\"] = df[\"Cover_Type\"] # add target feature" ] }, { "cell_type": "code", "execution_count": 25, "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Elevation Aspect Slope Horizontal_Distance_To_Hydrology \\\n", "0 2596.0 51.0 3.0 258.0 \n", "1 2590.0 56.0 2.0 212.0 \n", "2 2804.0 139.0 9.0 268.0 \n", "3 2785.0 155.0 18.0 242.0 \n", "4 2595.0 45.0 2.0 153.0 \n", "... ... ... ... ... \n", "581007 2396.0 153.0 20.0 85.0 \n", "581008 2391.0 152.0 19.0 67.0 \n", "581009 2386.0 159.0 17.0 60.0 \n", "581010 2384.0 170.0 15.0 60.0 \n", "581011 2383.0 165.0 13.0 60.0 \n", "\n", " Vertical_Distance_To_Hydrology Horizontal_Distance_To_Roadways \\\n", "0 0.0 510.0 \n", "1 -6.0 390.0 \n", "2 65.0 3180.0 \n", "3 118.0 3090.0 \n", "4 -1.0 391.0 \n", "... ... ... \n", "581007 17.0 108.0 \n", "581008 12.0 95.0 \n", "581009 7.0 90.0 \n", "581010 5.0 90.0 \n", "581011 4.0 67.0 \n", "\n", " Hillshade_9am Hillshade_Noon Hillshade_3pm \\\n", "0 221.0 232.0 148.0 \n", "1 220.0 235.0 151.0 \n", "2 234.0 238.0 135.0 \n", "3 238.0 238.0 122.0 \n", "4 220.0 234.0 150.0 \n", "... ... ... ... \n", "581007 240.0 237.0 118.0 \n", "581008 240.0 237.0 119.0 \n", "581009 236.0 241.0 130.0 \n", "581010 230.0 245.0 143.0 \n", "581011 231.0 244.0 141.0 \n", "\n", " Horizontal_Distance_To_Fire_Points wilderness_area soil_type \\\n", "0 6279.0 0 28 \n", "1 6225.0 0 28 \n", "2 6121.0 0 11 \n", "3 6211.0 0 29 \n", "4 6172.0 0 28 \n", "... ... ... ... \n", "581007 837.0 2 1 \n", "581008 845.0 2 1 \n", "581009 854.0 2 1 \n", "581010 864.0 2 1 \n", "581011 875.0 2 1 \n", "\n", " Cover_Type \n", "0 5 \n", "1 5 \n", "2 2 \n", "3 2 \n", "4 5 \n", "... ... \n", "581007 3 \n", "581008 3 \n", "581009 3 \n", "581010 3 \n", "581011 3 \n", "\n", "[581012 rows x 13 columns]" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_clean" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "We could now also further examine the data set for specific features. 
For example, it always makes sense to look at the correlations among the predictors as well as with the target feature to get an understanding of the relationships between the variables." ] }, { "cell_type": "code", "execution_count": 26, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "fig, ax = plt.subplots(1,1,figsize=(8,8))\n", "im = ax.imshow(df_clean.corr(),cmap=\"Reds\")\n", "ax.set_xticks(np.arange(len(df_clean.keys()))); ax.set_xticklabels(df_clean.keys())\n", "ax.set_yticks(np.arange(len(df_clean.keys()))); ax.set_yticklabels(df_clean.keys())\n", "plt.setp(ax.get_xticklabels(), rotation=45, ha=\"right\", rotation_mode=\"anchor\")\n", "ax.set_title(\"Correlation matrix\")\n", "fig.tight_layout()\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "-" } }, "source": [ "This correlation matrix shows us that the hillshade variables and the aspect are correlated with each other and that the elevation is linked to the soil type. We can also see that the wilderness area has the strongest correlation with our target variable." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "There are many more analyses that could be applied to the data before conducting the machine learning part. However, here we want to focus on the application of the ML methods. 
So, on this \"clean\" data set, we can train and test the model again:" ] }, { "cell_type": "code", "execution_count": 27, "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [], "source": [ "feature_names_clean = ['Elevation','Aspect','Slope','Horizontal_Distance_To_Hydrology','Vertical_Distance_To_Hydrology','Horizontal_Distance_To_Roadways','Hillshade_9am','Hillshade_Noon','Hillshade_3pm','Horizontal_Distance_To_Fire_Points','wilderness_area','soil_type']\n", "\n", "X_train, X_test, y_train, y_test = train_test_split(df_clean[feature_names_clean], df_clean[target_names], test_size=0.2, random_state=42)\n", "y_train = y_train.values.ravel() # convert to 1d array\n", "y_test = y_test.values.ravel() # convert to 1d array" ] }, { "cell_type": "code", "execution_count": 28, "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0.9631076650344655\n", "0.9032239907339351\n", "CPU times: user 4min 3s, sys: 532 ms, total: 4min 3s\n", "Wall time: 23.3 s\n" ] } ], "source": [ "%%time\n", "model = RandomForestClassifier(random_state=42, n_jobs=-1, oob_score=True, class_weight='balanced')\n", "model.fit(X_train, y_train)\n", "y_pred = model.predict(X_test)\n", "print(accuracy_score(y_test, y_pred))\n", "print(balanced_accuracy_score(y_test, y_pred, adjusted=True))" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "-" } }, "source": [ "This gives us a slight gain of about 0.01 in both accuracy and balanced accuracy." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "In summary, the key lessons are:\n", "\n", "- Look into your data\n", "- Clean up your data\n", "- Split your data into training and test sets\n", "- Try different models\n", "\n", "In the next session, we will take a look at how we can apply these techniques in remote sensing tasks." 
] } ], "metadata": { "celltoolbar": "Slideshow", "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.8" } }, "nbformat": 4, "nbformat_minor": 4 }