Building a Pokémon Recommendation Machine
1 Introduction
When I saw the Complete Pokémon Dataset on Kaggle, I just had to download it and have a look! When I was younger, I was a big fan of Pokémon and used to play it regularly and watch the TV show (to this day I can recite much of the original Pokémon Rap). More recently, my daughter has become a fan, and watches the show incessantly (although it beats watching Peppa Pig…). So I am going to have a look over these data and see what they can show us about these pocket monsters.
This is a fairly comprehensive analysis of the data, and will include introductions to a number of different data science techniques. I may further develop these into posts of their own in the future, so will only skim over most of them here. I hope that this post shows a fairly complete example of the types of analyses that it is possible to do with data such as these for prediction and recommendation.
2 Download data
The data about each of the Pokémon can be downloaded directly from Kaggle here, and I have also downloaded some images for each of the Pokémon (at least for Generations 1 to 6) from Kaggle here.
The main data are in the form of a compressed comma-separated values (CSV) file. After unzipping the file, we are left with a plain text file where every row is a separate entry, and the various columns of the data set are separated by commas. So let’s load the data in using the read.csv function, which will generate a data.frame object, and take a look at it:
pokedat <- read.csv("pokemon.csv")
rownames(pokedat) <- pokedat[["name"]]
dim(pokedat)
## [1] 801 41
So we have information on 41 variables for 801 unique Pokémon. Each Pokémon has a unique number assigned to it in the so-called “Pokédex”, which is common between this data set and the list of Pokémon images. This makes it easy to link the two. Back in my day, there were only 151 Pokémon to keep track of, but many years later they have added more and more with each “Generation”. Pokémon Generation 8 is the most recent and has only recently been released, so we can see which generations are present in this data set by using the table function, which will show us the number of entries with each value:
table(pokedat[["generation"]])
##
## 1 2 3 4 5 6 7
## 151 100 135 107 156 72 80
So we can see the 151 Generation 1 Pokémon, but also additional Pokémon up to Generation 7. We can get a fairly broad overview of the data set by using the str function, which shows what is contained within each of the columns of this data.frame:
str(pokedat)
## 'data.frame': 801 obs. of 41 variables:
## $ abilities : Factor w/ 482 levels "['Adaptability', 'Download', 'Analytic']",..: 244 244 244 22 22 22 453 453 453 348 ...
## $ against_bug : num 1 1 1 0.5 0.5 0.25 1 1 1 1 ...
## $ against_dark : num 1 1 1 1 1 1 1 1 1 1 ...
## $ against_dragon : num 1 1 1 1 1 1 1 1 1 1 ...
## $ against_electric : num 0.5 0.5 0.5 1 1 2 2 2 2 1 ...
## $ against_fairy : num 0.5 0.5 0.5 0.5 0.5 0.5 1 1 1 1 ...
## $ against_fight : num 0.5 0.5 0.5 1 1 0.5 1 1 1 0.5 ...
## $ against_fire : num 2 2 2 0.5 0.5 0.5 0.5 0.5 0.5 2 ...
## $ against_flying : num 2 2 2 1 1 1 1 1 1 2 ...
## $ against_ghost : num 1 1 1 1 1 1 1 1 1 1 ...
## $ against_grass : num 0.25 0.25 0.25 0.5 0.5 0.25 2 2 2 0.5 ...
## $ against_ground : num 1 1 1 2 2 0 1 1 1 0.5 ...
## $ against_ice : num 2 2 2 0.5 0.5 1 0.5 0.5 0.5 1 ...
## $ against_normal : num 1 1 1 1 1 1 1 1 1 1 ...
## $ against_poison : num 1 1 1 1 1 1 1 1 1 1 ...
## $ against_psychic : num 2 2 2 1 1 1 1 1 1 1 ...
## $ against_rock : num 1 1 1 2 2 4 1 1 1 2 ...
## $ against_steel : num 1 1 1 0.5 0.5 0.5 0.5 0.5 0.5 1 ...
## $ against_water : num 0.5 0.5 0.5 2 2 2 0.5 0.5 0.5 1 ...
## $ attack : int 49 62 100 52 64 104 48 63 103 30 ...
## $ base_egg_steps : int 5120 5120 5120 5120 5120 5120 5120 5120 5120 3840 ...
## $ base_happiness : int 70 70 70 70 70 70 70 70 70 70 ...
## $ base_total : int 318 405 625 309 405 634 314 405 630 195 ...
## $ capture_rate : Factor w/ 34 levels "100","120","125",..: 26 26 26 26 26 26 26 26 26 21 ...
## $ classfication : Factor w/ 588 levels "Abundance Pokémon",..: 449 449 449 299 187 187 531 546 457 585 ...
## $ defense : int 49 63 123 43 58 78 65 80 120 35 ...
## $ experience_growth: int 1059860 1059860 1059860 1059860 1059860 1059860 1059860 1059860 1059860 1000000 ...
## $ height_m : num 0.7 1 2 0.6 1.1 1.7 0.5 1 1.6 0.3 ...
## $ hp : int 45 60 80 39 58 78 44 59 79 45 ...
## $ japanese_name : Factor w/ 801 levels "Abagouraアバゴーラ",..: 200 201 199 288 417 416 794 334 336 80 ...
## $ name : Factor w/ 801 levels "Abomasnow","Abra",..: 73 321 745 95 96 93 656 764 56 88 ...
## $ percentage_male : num 88.1 88.1 88.1 88.1 88.1 88.1 88.1 88.1 88.1 50 ...
## $ pokedex_number : int 1 2 3 4 5 6 7 8 9 10 ...
## $ sp_attack : int 65 80 122 60 80 159 50 65 135 20 ...
## $ sp_defense : int 65 80 120 50 65 115 64 80 115 20 ...
## $ speed : int 45 60 80 65 80 100 43 58 78 45 ...
## $ type1 : Factor w/ 18 levels "bug","dark","dragon",..: 10 10 10 7 7 7 18 18 18 1 ...
## $ type2 : Factor w/ 19 levels "","bug","dark",..: 15 15 15 1 1 9 1 1 1 1 ...
## $ weight_kg : num 6.9 13 100 8.5 19 90.5 9 22.5 85.5 2.9 ...
## $ generation : int 1 1 1 1 1 1 1 1 1 1 ...
## $ is_legendary : int 0 0 0 0 0 0 0 0 0 0 ...
So we have a lot of different information contained in this data set, with the vast majority represented by numerical values. As well as the name by which we likely know them, we have the original Japanese name, together with fighting statistics such as hp (Health Points), speed, attack and defense. We also get their specific abilities, which are given as a list within square brackets (e.g. ['Adaptability', 'Download', 'Analytic']). There are also various other things that we will explore in the following sections. So let’s explore these data to see how they look, and check for any inconsistencies that need to be corrected.
3 Data Cleaning
The first step in any data analysis is to check the consistency of the data: to ensure that there are no missing values (and if there are, decide the best thing to do with them), to make sure that the data are consistent and fit within expected bounds, and to generally make sure that these data make sense. These are data that somebody else has generated, so it is best not to assume that they are perfect. If there are any issues, they will propagate to our downstream analyses.
3.1 Missing Data
First of all, let’s take a look to see which of the entries for each of the columns is missing (i.e. is coded as NA):
has_na <- apply(pokedat, MAR = 2, FUN = function(x) which(is.na(x)))
has_na[sapply(has_na, length) > 0]
## $height_m
## Rattata Raticate Raichu Sandshrew Sandslash Vulpix Ninetales
## 19 20 26 27 28 37 38
## Diglett Dugtrio Meowth Persian Geodude Graveler Golem
## 50 51 52 53 74 75 76
## Grimer Muk Exeggutor Marowak Hoopa Lycanroc
## 88 89 103 105 720 745
##
## $percentage_male
## Magnemite Magneton Voltorb Electrode Staryu Starmie
## 81 82 100 101 120 121
## Ditto Porygon Articuno Zapdos Moltres Mewtwo
## 132 137 144 145 146 150
## Mew Unown Porygon2 Raikou Entei Suicune
## 151 201 233 243 244 245
## Lugia Ho-Oh Celebi Shedinja Lunatone Solrock
## 249 250 251 292 337 338
## Baltoy Claydol Beldum Metang Metagross Regirock
## 343 344 374 375 376 377
## Regice Registeel Kyogre Groudon Rayquaza Jirachi
## 378 379 382 383 384 385
## Deoxys Bronzor Bronzong Magnezone Porygon-Z Rotom
## 386 436 437 462 474 479
## Uxie Mesprit Azelf Dialga Palkia Regigigas
## 480 481 482 483 484 486
## Giratina Phione Manaphy Darkrai Shaymin Arceus
## 487 489 490 491 492 493
## Victini Klink Klang Klinklang Cryogonal Golett
## 494 599 600 601 615 622
## Golurk Cobalion Terrakion Virizion Reshiram Zekrom
## 623 638 639 640 643 644
## Kyurem Keldeo Meloetta Genesect Carbink Xerneas
## 646 647 648 649 703 716
## Yveltal Zygarde Diancie Hoopa Volcanion Type: Null
## 717 718 719 720 721 772
## Silvally Minior Dhelmise Tapu Koko Tapu Lele Tapu Bulu
## 773 774 781 785 786 787
## Tapu Fini Cosmog Cosmoem Solgaleo Lunala Nihilego
## 788 789 790 791 792 793
## Buzzwole Pheromosa Xurkitree Celesteela Kartana Guzzlord
## 794 795 796 797 798 799
## Necrozma Magearna
## 800 801
##
## $weight_kg
## Rattata Raticate Raichu Sandshrew Sandslash Vulpix Ninetales
## 19 20 26 27 28 37 38
## Diglett Dugtrio Meowth Persian Geodude Graveler Golem
## 50 51 52 53 74 75 76
## Grimer Muk Exeggutor Marowak Hoopa Lycanroc
## 88 89 103 105 720 745
In general, this seems to be a fairly complete data set, with only three of the variables showing any NA values. We can see that there are 20 Pokémon with no height or weight data:
subset(pokedat, is.na(height_m) | is.na(weight_kg))[, c("name", "height_m", "weight_kg")]
## name height_m weight_kg
## Rattata Rattata NA NA
## Raticate Raticate NA NA
## Raichu Raichu NA NA
## Sandshrew Sandshrew NA NA
## Sandslash Sandslash NA NA
## Vulpix Vulpix NA NA
## Ninetales Ninetales NA NA
## Diglett Diglett NA NA
## Dugtrio Dugtrio NA NA
## Meowth Meowth NA NA
## Persian Persian NA NA
## Geodude Geodude NA NA
## Graveler Graveler NA NA
## Golem Golem NA NA
## Grimer Grimer NA NA
## Muk Muk NA NA
## Exeggutor Exeggutor NA NA
## Marowak Marowak NA NA
## Hoopa Hoopa NA NA
## Lycanroc Lycanroc NA NA
Many of these are Pokémon that I know from Generation 1, and in fact seem to be sets of evolutions. For instance, Rattata evolves into Raticate:
Sandshrew evolves into Sandslash:
And Vulpix evolves into Ninetales:
There are also a couple of other non-Generation 1 Pokémon, including Lycanroc, which is one of my daughter’s favourites from Pokémon Sun and Moon:
However, there is no obvious reason why values are missing. There are methods that can be used to account for missing data. One possible approach is to impute the data – that is, we use the rest of the data to give us a rough idea of what we should see for these missing values. An example of this is to simply use the mean of the non-missing values for the missing variable. However, in this case, we can actually find these missing values by visiting an online Pokedex, so let’s correct these, ensuring that we match the units for weight (kg) and height (m):
missing_height <- list(Rattata = c(height_m = 0.3, weight_kg = 3.5),
Raticate = c(height_m = 0.7, weight_kg = 18.5),
Raichu = c(height_m = 0.8, weight_kg = 30.0),
Sandshrew = c(height_m = 0.6, weight_kg = 12.0),
Sandslash = c(height_m = 1.0, weight_kg = 29.5),
Vulpix = c(height_m = 0.6, weight_kg = 9.9),
Ninetales = c(height_m = 1.1, weight_kg = 19.9),
Diglett = c(height_m = 0.2, weight_kg = 0.8),
Dugtrio = c(height_m = 0.7, weight_kg = 33.3),
Meowth = c(height_m = 0.4, weight_kg = 4.2),
Persian = c(height_m = 1.0, weight_kg = 32.0),
Geodude = c(height_m = 0.4, weight_kg = 20.0),
Graveler = c(height_m = 0.3, weight_kg = 105.0),
Golem = c(height_m = 1.4, weight_kg = 300.0),
Grimer = c(height_m = 0.9, weight_kg = 30.0),
Muk = c(height_m = 1.2, weight_kg = 30.0),
Exeggutor = c(height_m = 2.0, weight_kg = 120.0),
Marowak = c(height_m = 1.0, weight_kg = 45.0),
Hoopa = c(height_m = 0.5, weight_kg = 9.0),
Lycanroc = c(height_m = 0.8, weight_kg = 25.0))
missing_height <- t(rbind.data.frame(missing_height))
pokedat[match(rownames(missing_height), pokedat[["name"]]), c("height_m", "weight_kg")] <- missing_height
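For reference, had we not been able to look the values up, the mean-imputation approach mentioned above might look something like the following (an illustrative sketch only – we have already filled in the true values, so this is not run here):
# Illustrative only: replace missing values with the mean of the non-missing values
impute_mean <- function(x) {
  x[is.na(x)] <- mean(x, na.rm = TRUE)
  x
}
# pokedat[["height_m"]]  <- impute_mean(pokedat[["height_m"]])
# pokedat[["weight_kg"]] <- impute_mean(pokedat[["weight_kg"]])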
There are also 98 Pokémon with a missing percentage_male value. This value gives the percentage of individuals of that Pokémon that you might come across in the game that are male. These seem to be spread throughout the entire list of Pokémon across all Generations, with no clear reason as to why they have missing values:
head(subset(pokedat, is.na(percentage_male)))
## abilities against_bug
## Magnemite ['Magnet Pull', 'Sturdy', 'Analytic'] 0.5
## Magneton ['Magnet Pull', 'Sturdy', 'Analytic'] 0.5
## Voltorb ['Soundproof', 'Static', 'Aftermath'] 1.0
## Electrode ['Soundproof', 'Static', 'Aftermath'] 1.0
## Staryu ['Illuminate', 'Natural Cure', 'Analytic'] 1.0
## Starmie ['Illuminate', 'Natural Cure', 'Analytic'] 2.0
## against_dark against_dragon against_electric against_fairy
## Magnemite 1 0.5 0.5 0.5
## Magneton 1 0.5 0.5 0.5
## Voltorb 1 1.0 0.5 1.0
## Electrode 1 1.0 0.5 1.0
## Staryu 1 1.0 2.0 1.0
## Starmie 2 1.0 2.0 1.0
## against_fight against_fire against_flying against_ghost
## Magnemite 2.0 2.0 0.25 1
## Magneton 2.0 2.0 0.25 1
## Voltorb 1.0 1.0 0.50 1
## Electrode 1.0 1.0 0.50 1
## Staryu 1.0 0.5 1.00 1
## Starmie 0.5 0.5 1.00 2
## against_grass against_ground against_ice against_normal
## Magnemite 0.5 4 0.5 0.5
## Magneton 0.5 4 0.5 0.5
## Voltorb 1.0 2 1.0 1.0
## Electrode 1.0 2 1.0 1.0
## Staryu 2.0 1 0.5 1.0
## Starmie 2.0 1 0.5 1.0
## against_poison against_psychic against_rock against_steel
## Magnemite 0 0.5 0.5 0.25
## Magneton 0 0.5 0.5 0.25
## Voltorb 1 1.0 1.0 0.50
## Electrode 1 1.0 1.0 0.50
## Staryu 1 1.0 1.0 0.50
## Starmie 1 0.5 1.0 0.50
## against_water attack base_egg_steps base_happiness base_total
## Magnemite 1.0 35 5120 70 325
## Magneton 1.0 60 5120 70 465
## Voltorb 1.0 30 5120 70 330
## Electrode 1.0 50 5120 70 490
## Staryu 0.5 45 5120 70 340
## Starmie 0.5 75 5120 70 520
## capture_rate classfication defense experience_growth
## Magnemite 190 Magnet Pokémon 70 1000000
## Magneton 60 Magnet Pokémon 95 1000000
## Voltorb 190 Ball Pokémon 50 1000000
## Electrode 60 Ball Pokémon 70 1000000
## Staryu 225 Starshape Pokémon 55 1250000
## Starmie 60 Mysterious Pokémon 85 1250000
## height_m hp japanese_name name percentage_male
## Magnemite 0.3 25 Coilコイル Magnemite NA
## Magneton 1.0 50 Rarecoilレアコイル Magneton NA
## Voltorb 0.5 40 Biriridamaビリリダマ Voltorb NA
## Electrode 1.2 60 Marumineマルマイン Electrode NA
## Staryu 0.8 30 Hitodemanヒトデマン Staryu NA
## Starmie 1.1 60 Starmieスターミー Starmie NA
## pokedex_number sp_attack sp_defense speed type1 type2
## Magnemite 81 95 55 45 electric steel
## Magneton 82 120 70 70 electric steel
## Voltorb 100 55 55 100 electric
## Electrode 101 80 80 150 electric
## Staryu 120 70 55 85 water
## Starmie 121 100 85 115 water psychic
## weight_kg generation is_legendary
## Magnemite 6.0 1 0
## Magneton 60.0 1 0
## Voltorb 10.4 1 0
## Electrode 66.6 1 0
## Staryu 34.5 1 0
## Starmie 80.0 1 0
However, by looking at a few of these in the Pokedex, it would appear that these are generally genderless Pokémon, which would explain the missing values. A sensible value to use in these cases would therefore be 0.5, representing an equal split of male and female:
pokedat[is.na(pokedat[["percentage_male"]]), "percentage_male"] <- 0.5
It is also worth noting that there are also missing values for the type2 variable. These were not picked up as NA values, because they are encoded as a blank entry "". We can convert this to a more descriptive factor level such as "none":
levels(pokedat[["type2"]])[levels(pokedat[["type2"]]) == ""] <- "none"
3.2 Numeric Range
The vast majority of these variables are numeric in nature, so it is worth checking the range of these values to ensure that they are within typical ranges that we might expect. For instance, we would not expect negative values, zero values, or values greater than some sensible limit for things like height, weight, etc. So let’s take an overall look at these data ranges by using the summary() function:
summary(pokedat)
## abilities against_bug
## ['Levitate'] : 29 Min. :0.2500
## ['Beast Boost'] : 7 1st Qu.:0.5000
## ['Shed Skin'] : 5 Median :1.0000
## ['Clear Body', 'Light Metal'] : 4 Mean :0.9963
## ['Justified'] : 4 3rd Qu.:1.0000
## ['Keen Eye', 'Tangled Feet', 'Big Pecks']: 4 Max. :4.0000
## (Other) :748
## against_dark against_dragon against_electric against_fairy
## Min. :0.250 Min. :0.0000 Min. :0.000 Min. :0.250
## 1st Qu.:1.000 1st Qu.:1.0000 1st Qu.:0.500 1st Qu.:1.000
## Median :1.000 Median :1.0000 Median :1.000 Median :1.000
## Mean :1.057 Mean :0.9688 Mean :1.074 Mean :1.069
## 3rd Qu.:1.000 3rd Qu.:1.0000 3rd Qu.:1.000 3rd Qu.:1.000
## Max. :4.000 Max. :2.0000 Max. :4.000 Max. :4.000
##
## against_fight against_fire against_flying against_ghost
## Min. :0.000 Min. :0.250 Min. :0.250 Min. :0.000
## 1st Qu.:0.500 1st Qu.:0.500 1st Qu.:1.000 1st Qu.:1.000
## Median :1.000 Median :1.000 Median :1.000 Median :1.000
## Mean :1.066 Mean :1.135 Mean :1.193 Mean :0.985
## 3rd Qu.:1.000 3rd Qu.:2.000 3rd Qu.:1.000 3rd Qu.:1.000
## Max. :4.000 Max. :4.000 Max. :4.000 Max. :4.000
##
## against_grass against_ground against_ice against_normal
## Min. :0.250 Min. :0.000 Min. :0.250 Min. :0.000
## 1st Qu.:0.500 1st Qu.:1.000 1st Qu.:0.500 1st Qu.:1.000
## Median :1.000 Median :1.000 Median :1.000 Median :1.000
## Mean :1.034 Mean :1.098 Mean :1.208 Mean :0.887
## 3rd Qu.:1.000 3rd Qu.:1.000 3rd Qu.:2.000 3rd Qu.:1.000
## Max. :4.000 Max. :4.000 Max. :4.000 Max. :1.000
##
## against_poison against_psychic against_rock against_steel
## Min. :0.0000 Min. :0.000 Min. :0.25 Min. :0.2500
## 1st Qu.:0.5000 1st Qu.:1.000 1st Qu.:1.00 1st Qu.:0.5000
## Median :1.0000 Median :1.000 Median :1.00 Median :1.0000
## Mean :0.9753 Mean :1.005 Mean :1.25 Mean :0.9835
## 3rd Qu.:1.0000 3rd Qu.:1.000 3rd Qu.:2.00 3rd Qu.:1.0000
## Max. :4.0000 Max. :4.000 Max. :4.00 Max. :4.0000
##
## against_water attack base_egg_steps base_happiness
## Min. :0.250 Min. : 5.00 Min. : 1280 Min. : 0.00
## 1st Qu.:0.500 1st Qu.: 55.00 1st Qu.: 5120 1st Qu.: 70.00
## Median :1.000 Median : 75.00 Median : 5120 Median : 70.00
## Mean :1.058 Mean : 77.86 Mean : 7191 Mean : 65.36
## 3rd Qu.:1.000 3rd Qu.:100.00 3rd Qu.: 6400 3rd Qu.: 70.00
## Max. :4.000 Max. :185.00 Max. :30720 Max. :140.00
##
## base_total capture_rate classfication defense
## Min. :180.0 45 :250 Dragon Pokémon : 8 Min. : 5.00
## 1st Qu.:320.0 190 : 75 Mouse Pokémon : 6 1st Qu.: 50.00
## Median :435.0 255 : 69 Mushroom Pokémon: 6 Median : 70.00
## Mean :428.4 75 : 61 Balloon Pokémon : 5 Mean : 73.01
## 3rd Qu.:505.0 3 : 58 Fairy Pokémon : 5 3rd Qu.: 90.00
## Max. :780.0 120 : 55 Flame Pokémon : 5 Max. :230.00
## (Other):233 (Other) :766
## experience_growth height_m hp
## Min. : 600000 Min. : 0.100 Min. : 1.00
## 1st Qu.:1000000 1st Qu.: 0.600 1st Qu.: 50.00
## Median :1000000 Median : 1.000 Median : 65.00
## Mean :1054996 Mean : 1.155 Mean : 68.96
## 3rd Qu.:1059860 3rd Qu.: 1.500 3rd Qu.: 80.00
## Max. :1640000 Max. :14.500 Max. :255.00
##
## japanese_name name percentage_male
## Abagouraアバゴーラ : 1 Abomasnow : 1 Min. : 0.00
## Absolアブソル : 1 Abra : 1 1st Qu.: 50.00
## Abulyアブリー : 1 Absol : 1 Median : 50.00
## Aburibbonアブリボン: 1 Accelgor : 1 Mean : 48.47
## Achamoアチャモ : 1 Aegislash : 1 3rd Qu.: 50.00
## Agehuntアゲハント : 1 Aerodactyl: 1 Max. :100.00
## (Other) :795 (Other) :795
## pokedex_number sp_attack sp_defense speed
## Min. : 1 Min. : 10.00 Min. : 20.00 Min. : 5.00
## 1st Qu.:201 1st Qu.: 45.00 1st Qu.: 50.00 1st Qu.: 45.00
## Median :401 Median : 65.00 Median : 66.00 Median : 65.00
## Mean :401 Mean : 71.31 Mean : 70.91 Mean : 66.33
## 3rd Qu.:601 3rd Qu.: 91.00 3rd Qu.: 90.00 3rd Qu.: 85.00
## Max. :801 Max. :194.00 Max. :230.00 Max. :180.00
##
## type1 type2 weight_kg generation
## water :114 :384 Min. : 0.10 Min. :1.00
## normal :105 flying : 95 1st Qu.: 9.00 1st Qu.:2.00
## grass : 78 ground : 34 Median : 27.30 Median :4.00
## bug : 72 poison : 34 Mean : 60.94 Mean :3.69
## psychic: 53 fairy : 29 3rd Qu.: 63.00 3rd Qu.:5.00
## fire : 52 psychic: 29 Max. :999.90 Max. :7.00
## (Other):327 (Other):196
## is_legendary
## Min. :0.00000
## 1st Qu.:0.00000
## Median :0.00000
## Mean :0.08739
## 3rd Qu.:0.00000
## Max. :1.00000
##
We can learn a few things from this. Looking at the character-type variables: “Levitate” is the most common ability (although note that this is only counting the cases where there is exactly one ability – more on this later); “Dragon Pokémon” is the most common classfication [sic] (although the majority of Pokémon are given a unique classification, which rather goes against the idea of classification); “water” and “normal” are very common primary types; and many Pokémon have flying as their secondary type. Looking at the numeric variables: the against_ values are numbers between 0 and 4 in multiples of 0.25 (they represent the Pokémon’s strength against particular Pokémon types during battle); attack, defense and speed range from 5 to 185, 5 to 230 and 5 to 180 respectively (so there is quite a big range depending on which Pokémon you have in a fight); there is at least one Pokémon that starts with a base happiness score of 0 (in fact there are 36, which is quite sad!); and the percentage of males varies between 0 and 100, as expected. Overall there do not appear to be any strange outliers in these data. There are some pretty big Pokémon, but we will explore this in a little more detail later.
3.3 Change Variable Class
The majority of these data have been classified correctly by the read.csv() function as either numeric or character (the character columns are factorised automatically, ready for use in modelling and plotting functions later). However, there are some changes that we may wish to make. Firstly, the generation and is_legendary variables should not be considered numeric, even though they are represented by numbers. For instance, a Pokémon could not come from Generation 2.345. So let’s convert these from numeric to factorised character variables. Note that there is a natural order to the generation values, so I will ensure that these are factorised as ordinal, just in case this comes into play later:
pokedat[["generation"]] <- factor(as.character(pokedat[["generation"]]), order = TRUE)
pokedat[["is_legendary"]] <- factor(as.character(pokedat[["is_legendary"]]))
These variables have been classified as factors even though they are represented by numeric values. The values representing the factor levels can be anything we choose, so it may be useful to change them to something more descriptive. For instance, it may be worth adding the word “Generation” to the generation variable, whilst the is_legendary variable may be more useful as a binary TRUE/FALSE value. It doesn’t matter too much what we put, since dummy numeric variables will be used for the calculations in later analyses:
levels(pokedat[["generation"]]) <- paste("Generation", levels(pokedat[["generation"]]))
levels(pokedat[["is_legendary"]]) <- c(FALSE, TRUE)
The defaults for the other factors seem to be suitable, so I will leave these as they are.
In comparison, the capture_rate value has been classed as a factor, even though it should be a number. Why is this?
levels(pokedat[["capture_rate"]])
## [1] "100" "120"
## [3] "125" "127"
## [5] "130" "140"
## [7] "145" "15"
## [9] "150" "155"
## [11] "160" "170"
## [13] "180" "190"
## [15] "200" "205"
## [17] "220" "225"
## [19] "235" "25"
## [21] "255" "3"
## [23] "30" "30 (Meteorite)255 (Core)"
## [25] "35" "45"
## [27] "50" "55"
## [29] "60" "65"
## [31] "70" "75"
## [33] "80" "90"
Aha. So there is a non-numeric value in there – 30 (Meteorite)255 (Core)
. Let’s see which Pokémon this applies to:
as.character(subset(pokedat, capture_rate == "30 (Meteorite)255 (Core)")[["name"]])
## [1] "Minior"
This is a “Meteor” type Pokémon that appears to have a different capture rate under different conditions. Based on the online Pokedex, the canonical value to use here is 30 for its Meteorite form, so let’s use this and convert it into a numeric value. One thing to bear in mind here is that factors can be a little funny. When I convert to a number, it will convert the level codes to numbers, not necessarily the values themselves. So in this case, it will not give me the values but instead will give me the order of the values. For instance, look at the output of the following example:
t <- factor(c("1", "2", "3", "20"))
data.frame(factor = t, level = levels(t), number = as.numeric(t))
## factor level number
## 1 1 1 1
## 2 2 2 2
## 3 3 20 4
## 4 20 3 3
So the levels are based on a character sort rather than a numeric sort, which puts “20” ahead of “3”. Then, to add to this further, the factor “20” is actually the 3rd level so gets the value 3, whilst the factor “3” is the 4th level, and so gets the value 4. So the output is definitely not what we would expect when converting this to a number. To avoid this, we need to convert to a character value before we do anything else:
pokedat[["capture_rate"]] <- as.character(pokedat[["capture_rate"]])
pokedat[pokedat[["name"]] == "Minior", "capture_rate"] = "30"
pokedat[["capture_rate"]] <- as.numeric(pokedat[["capture_rate"]])
This should now be a pretty clean data set ready for analysis.
4 Exploratory Analyses
There are a lot of data here, and I could spend ages exploring every facet, but I really just want to get a taste of these data here. First of all, let’s take a look at the distribution of some of the main statistics to see how they look across the dataset. For the majority of these plots, I will be using the ggplot2 library, which offers a very nice way to produce publication-quality figures using a standardised lexicon, along with the dplyr package, which provides a simple way to rearrange data into suitable formats for plotting. These are part of the Tidyverse suite of packages from Hadley Wickham, which share a common grammar, can be used together to make data science more efficient, and produce beautiful plots easily. Handy cheat sheets are available for both dplyr and ggplot2:
library("ggplot2")
library("dplyr")
library("ggrepel")
library("RColorBrewer")
4.1 Attack
First of all, let’s take a look at the attack strength of all Pokémon split by their main type:
pokedat %>% mutate('MainType' = type1) %>%
ggplot(aes(x = attack, fill = MainType)) +
geom_density(alpha = 0.2) +
xlab("Attack Strength") + ylab("Density") +
theme_bw() +
theme(axis.title = element_text(size = 24, face = "bold"),
axis.text = element_text(size = 18),
legend.title = element_text(size = 24, face = "bold"),
legend.text = element_text(size = 20))
In general, there is a wide range of attack strengths, distributed roughly normally around the mean attack strength of 77.86 that we saw earlier. We do see some differences between the different Pokémon types, but this probably isn’t the easiest way to see differences between the groups. Instead, let’s use a boxplot. Here we can see the overall distribution, with the 25th percentile, the median (50th percentile) and the 75th percentile making up the box, and outliers (points more than 1.5 times the interquartile range beyond the box) highlighted as individual points on the plot. I have added notches to the boxplots, which show the confidence interval represented by:
\[median \pm \frac{1.58 \times IQR}{\sqrt{n}}\] Where IQR is the interquartile range. Essentially, if the notches of two boxes do not overlap, it suggests that there may be a statistically significant difference between them:
pokedat %>% mutate('MainType' = type1) %>%
ggplot(aes(y = attack, x = MainType, fill = MainType)) +
geom_boxplot(notch = TRUE) +
xlab("Attack Strength") + ylab("Density") +
theme_bw() +
theme(axis.title = element_text(size = 24, face = "bold"),
axis.text.x = element_text(size = 18, angle = 90, hjust = 1),
axis.text.y = element_text(size = 18),
legend.title = element_text(size = 24, face = "bold"),
legend.text = element_text(size = 20))
From this plot, we can see that ‘dragon’, ‘fighting’, ‘ground’, ‘rock’ and ‘steel’ type Pokémon have higher attack strength, whilst ‘fairy’ and ‘psychic’ types seem to have slightly lower attack strength. This makes sense based on the names, and we can test it a little more formally using an ANOVA (analysis of variance) to look for significant effects of the Pokémon type on the attack strength. This can be done quite simply with a linear model:
attack_vs_type1 <- lm(attack ~ type1, data = pokedat)
summary(attack_vs_type1)
##
## Call:
## lm(formula = attack ~ type1, data = pokedat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -70.16 -22.31 -3.50 19.26 114.88
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 70.1250 3.6263 19.338 < 0.0000000000000002 ***
## type1dark 17.6681 6.7675 2.611 0.009208 **
## type1dragon 36.2824 6.9439 5.225 0.000000223 ***
## type1electric 0.6955 6.1178 0.114 0.909515
## type1fairy -8.0139 8.1087 -0.988 0.323308
## type1fighting 29.0536 6.8531 4.239 0.000025078 ***
## type1fire 11.3750 5.5998 2.031 0.042561 *
## type1flying -3.4583 18.1316 -0.191 0.848783
## type1ghost 2.6157 6.9439 0.377 0.706501
## type1grass 3.6442 5.0288 0.725 0.468871
## type1ground 24.6875 6.5375 3.776 0.000171 ***
## type1ice 3.1793 7.3700 0.431 0.666301
## type1normal 5.0369 4.7082 1.070 0.285036
## type1poison 2.5313 6.5375 0.387 0.698719
## type1psychic -4.5590 5.5691 -0.819 0.413253
## type1rock 20.5417 5.8473 3.513 0.000468 ***
## type1steel 22.9583 7.2527 3.166 0.001608 **
## type1water 3.1820 4.6320 0.687 0.492311
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 30.77 on 783 degrees of freedom
## Multiple R-squared: 0.1039, Adjusted R-squared: 0.08448
## F-statistic: 5.343 on 17 and 783 DF, p-value: 0.00000000002374
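Since this is a one-way ANOVA, the overall effect of type can also be reported as a classical ANOVA table using base R’s anova() function, which gives the same F-statistic shown at the bottom of the summary:
anova(attack_vs_type1)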
It is not a great model by any means, as the R-squared values suggest that Pokémon type alone explains only a small proportion of the variance in the data. From the summary of the model, we can see that the fitted model has an intercept of 70.1 (which is close to the overall mean of 77.86), with an offset specific to each of the different types. As we saw from the boxplot, ‘dragon’, ‘fighting’, ‘ground’, ‘rock’ and ‘steel’ all show significant increases in attack (as can be seen from the resulting p-values), but so do ‘dark’ and ‘fire’, which I missed when looking at the boxplot. We also see that ‘fairy’ and ‘psychic’ type Pokémon show a decrease in attack strength, although this difference is not significant. So if you want a strong Pokémon, then these types are your best bet:
- dark
- dragon
- fighting
- fire
- ground
- rock
- steel
It is also worth mentioning that there are a few outliers here, which represent ridiculously powerful Pokémon. In general, these are those with a strength greater than 150, but there are also two fairy Pokémon with higher strength than is typically seen within the ‘fairy’ type Pokémon. Let’s use the Cook’s Distance to identify the outliers from the above model. The idea here is to test the influence of each Pokémon by comparing the least-squares regression with and without the sample included. A higher value indicates a possible outlier. So let’s have a look at the outliers, which we are going to class as those with a Cook’s Distance greater than 4 times the mean of the Cook’s Distance over the entire data set (note that this is a commonly used threshold, but is entirely arbitrary):
library("ggrepel")
library("RColorBrewer")
cookdist <- cooks.distance(attack_vs_type1)
cookdist_lim <- 4*mean(cookdist, na.rm=T)#0.012
data.frame(Pokenum = pokedat[["pokedex_number"]],
Pokename = pokedat[["name"]],
CooksDist = cookdist,
Pokelab = ifelse(cookdist > cookdist_lim, as.character(pokedat[["name"]]), ""),
Outlier = ifelse(cookdist > cookdist_lim, TRUE, FALSE),
PokeCol = ifelse(cookdist > cookdist_lim, as.character(pokedat[["type1"]]), "")) %>%
ggplot(aes(x = Pokenum, y = CooksDist, color = PokeCol, shape = Outlier, size = Outlier, label = Pokelab)) +
geom_point() +
scale_shape_manual(values = c(16, 8)) +
scale_size_manual(values = c(2, 5)) +
scale_color_manual(values = c("black", colorRampPalette(brewer.pal(9, "Set1"))(length(table(pokedat[["type1"]]))))) +
geom_text_repel(nudge_x = 0.2, size = 8) +
geom_hline(yintercept = cookdist_lim, linetype = "dashed", color = "red") +
xlab("Pokedex Number") + ylab("Cook's Distance") +
theme_bw() +
theme(#legend.position = "name",
axis.title = element_text(size = 24, face = "bold"),
axis.text = element_text(size = 18),
legend.title = element_text(size = 24, face = "bold"),
legend.text = element_text(size = 20),
)
The flying Pokémon Tornadus and Noibat are the biggest outliers. Interestingly, whilst Tornadus has a particularly high attack strength of 100 for a flying type Pokémon, Noibat has a low attack strength of 30. So both are outliers compared to the flying type Pokémon as a whole, which have a mean attack of 66.7 ± 35.1. It is worth mentioning that Tornadus is a Legendary Pokémon, which probably explains its higher attack strength.
Whilst these two do not show up as outliers on the boxplots, it is likely that they are responsible for the strange shape of the notched boxplot. I suspect this odd shape is a result of the confidence interval for the notches exceeding the 25th and 75th percentiles due to the wide range of values, but I do enjoy the fact that it looks like it is itself flying!
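We can check this suspicion directly by computing the notch interval for the flying type from the formula above (a quick sketch; if the interval extends beyond the 25th and 75th percentiles, ggplot draws the folded-back notches seen in the plot):
# Notch interval for flying-type attack: median +/- 1.58 * IQR / sqrt(n)
flying_attack <- pokedat[pokedat[["type1"]] == "flying", "attack"]
notch <- 1.58 * IQR(flying_attack) / sqrt(length(flying_attack))
median(flying_attack) + c(-1, 1) * notch
quantile(flying_attack, c(0.25, 0.75))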
The two outlying fairy type Pokémon are Xerneas and Togepi, another favourite of my daughter’s:
4.2 Defence
Let’s repeat this to look at the defence values across the different Pokémon types:
pokedat %>% mutate('MainType' = type1) %>%
ggplot(aes(x = defense, fill = MainType)) +
geom_density(alpha = 0.2) +
xlab("Defence Strength") + ylab("Density") +
theme_bw() +
theme(axis.title = element_text(size = 24, face = "bold"),
axis.text = element_text(size = 18),
legend.title = element_text(size = 24, face = "bold"),
legend.text = element_text(size = 20))
There seems to be a bit more of a distinction between the different types as compared to the attack distribution, so let’s also look at the boxplots:
pokedat %>% mutate('MainType' = type1) %>%
ggplot(aes(y = defense, x = MainType, fill = MainType)) +
geom_boxplot(notch = TRUE) +
xlab("Defence Strength") + ylab("Density") +
theme_bw() +
theme(axis.title = element_text(size = 24, face = "bold"),
axis.text.x = element_text(size = 18, angle = 90, hjust = 1),
axis.text.y = element_text(size = 18),
legend.title = element_text(size = 24, face = "bold"),
legend.text = element_text(size = 20))
We see more outliers here than for the attack strength, and a very clear increase in defence for steel and rock type Pokémon (unsurprisingly), as well as for dragon types. There are no types with a clearly lower defence. Let’s once again look at this using a linear model:
defence_vs_type1 <- lm(defense ~ type1, data = pokedat)
summary(defence_vs_type1)
##
## Call:
## lm(formula = defense ~ type1, data = pokedat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -70.208 -20.847 -3.031 16.518 159.153
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 70.84722 3.37247 21.008 < 0.0000000000000002 ***
## type1dark -0.32998 6.29376 -0.052 0.9582
## type1dragon 15.41204 6.45779 2.387 0.0172 *
## type1electric -9.02671 5.68954 -1.587 0.1130
## type1fairy -2.68056 7.54108 -0.355 0.7223
## type1fighting -4.45437 6.37337 -0.699 0.4848
## type1fire -3.05876 5.20784 -0.587 0.5571
## type1flying -5.84722 16.86236 -0.347 0.7289
## type1ghost 8.67130 6.45779 1.343 0.1797
## type1grass 0.02457 4.67678 0.005 0.9958
## type1ground 13.05903 6.07981 2.148 0.0320 *
## type1ice 1.06582 6.85403 0.156 0.8765
## type1normal -11.15198 4.37865 -2.547 0.0111 *
## type1poison -0.81597 6.07981 -0.134 0.8933
## type1psychic -1.58307 5.17923 -0.306 0.7599
## type1rock 25.41944 5.43795 4.674 0.000003469793310 ***
## type1steel 49.36111 6.74494 7.318 0.000000000000623 ***
## type1water 2.63523 4.30777 0.612 0.5409
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 28.62 on 783 degrees of freedom
## Multiple R-squared: 0.1534, Adjusted R-squared: 0.135
## F-statistic: 8.347 on 17 and 783 DF, p-value: < 0.00000000000000022
As expected, we see a very clear and significant increase in defence strength for rock and steel type Pokémon, with a slight increase also seen for dragon and ground type Pokémon. There is also a slight reduction for normal type Pokémon. Flying Pokémon again show an odd distribution, so I expect to see outliers for this class once more. Let’s look at the outliers now, again using the Cook’s Distance:
cookdist <- cooks.distance(defence_vs_type1)
cookdist_lim <- 4*mean(cookdist, na.rm=T)#0.012
data.frame(Pokenum = pokedat[["pokedex_number"]],
Pokename = pokedat[["name"]],
CooksDist = cookdist,
Pokelab = ifelse(cookdist > cookdist_lim, as.character(pokedat[["name"]]), ""),
Outlier = ifelse(cookdist > cookdist_lim, TRUE, FALSE),
PokeCol = ifelse(cookdist > cookdist_lim, as.character(pokedat[["type1"]]), "")) %>%
ggplot(aes(x = Pokenum, y = CooksDist, color = PokeCol, shape = Outlier, size = Outlier, label = Pokelab)) +
geom_point() +
scale_shape_manual(values = c(16, 8)) +
scale_size_manual(values = c(2, 5)) +
scale_color_manual(values = c("black", colorRampPalette(brewer.pal(9, "Set1"))(length(table(pokedat[["type1"]]))))) +
geom_text_repel(nudge_x = 0.2, size = 8) +
geom_hline(yintercept = cookdist_lim, linetype = "dashed", color = "red") +
xlab("Pokedex Number") + ylab("Cook's Distance") +
theme_bw() +
theme(axis.title = element_text(size = 24, face = "bold"),
axis.text = element_text(size = 18),
legend.title = element_text(size = 24, face = "bold"),
legend.text = element_text(size = 20),
)
As expected, Noibat and Tornadus are outliers here, and are joined by Noivern (which is an evolution of Noibat). As it turns out, these are the only 3 Pokémon in the flying type, which explains why they show up as outliers and why the boxplot has such a strange distribution. Other outliers include ice-type Pokémon Avalugg (defence 184), steel-type Pokémon Steelix and Aggron (both with defence 230), and bug-type Pokémon Shuckle (defence 230) which doubles up as a rock type Pokémon:
As seen with Shuckle, we should bear in mind that we have only focussed on the primary Pokémon type here, and have not considered the type2 values. So it is possible that we are missing the full picture when looking only at the first type for some of these Pokémon.
4.3 Other
We could do this for all of the different values, and there are many different ways that we may want to examine these data. We can look at each variable in the data set and examine it for odd distributions, we can look for outliers (as we have done above), we can start to examine relationships between variables, and we can look for correlations between different values (which we will do further below). Indeed, a considerable amount of time should be spent exploring data in this way to ensure they are of good quality – the old saying “garbage in, garbage out” is very true. But to be honest there are more interesting things that I want to do with these data, so let’s crack on…
4.4 Ensuring we use accurate data classes throughout
It is important to ensure that we are using the data in the correct way for our analyses. First of all, let’s take a look at the abilities that each Pokémon has. As it stands, the abilities variable is not terribly useful, as it fails to link Pokémon that share some but not all of their abilities. We could spend some time decoding the abilities and creating a “dictionary” of the different abilities for each Pokémon, but perhaps a better measure is to simply look at the number of abilities that each Pokémon has:
abilities <- strsplit(as.character(pokedat[["abilities"]]), ", ")
abilities <- sapply(abilities, FUN = function(x) gsub("\\[|\\'|\\]", "", x))
names(abilities) <- rownames(pokedat)
pokedat[["number_abilities"]] <- sapply(abilities, FUN = length)
table(pokedat[["number_abilities"]])
##
## 1 2 3 4 6
## 109 245 427 7 13
So the majority of Pokémon have 1, 2 or 3 abilities, with only a small number having more than 3, and around half having exactly 3.
I now want to look only at the variables that might plausibly be descriptive of the Pokémon in some kind of model. So we can remove the individual abilities (although the number of abilities will likely be of interest), the names and Pokédex number, and the classification, which seems to be fairly ambiguous. We also need to ensure that we use dummy variables for the variables encoded as factors.
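As a quick illustration of what is meant by dummy variables here, base R’s model.matrix() expands a factor into one 0/1 indicator column per level (relative to a reference level) – this is what happens behind the scenes when a factor is passed to a modelling function:
# Dummy (indicator) encoding of the primary type: one column per non-reference level
type_dummies <- model.matrix(~ type1, data = pokedat)
dim(type_dummies)
head(type_dummies[, 1:4])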
5 Normalization
Prior to doing this, I want to look at a few different ways to normalize the variables to ensure that they are all on comparable scales. Some algorithms are more sensitive to normalization than others, and it becomes obvious why this matters here when you consider that whilst the attack stat lies between 5 and 185, the experience_growth value ranges between 600,000 and 1,640,000 – roughly 10,000 times larger. This variable would completely dominate the calculations for machine-learning algorithms such as K-Nearest Neighbours (KNN) and Support Vector Machines (SVM), where the distance between data points is important. So let’s look at how we might normalize these data prior to analysis.
5.1 Min-Max Normalization
There are a few different ways to do this. One way is to standardize the data so that all variables are brought onto a common scale of \([0,1]\), whilst maintaining the distribution of each specific variable. So if one variable has a very skewed distribution, this will be maintained. A simple way to do this is min-max scaling, where you subtract the lowest value (so that the minimum is always zero) and then divide by the range of the data, so that the values are spread over the range \([0,1]\):
\[x' = \frac{x - x_{min}}{x_{max} - x_{min}}\]
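Written as a small helper function, this is just the following (a simple sketch of the formula above; base R has no built-in min-max scaler):
# Min-max scaling: map values onto the range [0, 1]
min_max <- function(x) {
  (x - min(x, na.rm = TRUE)) / (max(x, na.rm = TRUE) - min(x, na.rm = TRUE))
}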
Let’s compare the standardized and non-standardized attack values:
attack_std <- (pokedat[["attack"]] - min(pokedat[["attack"]]))/(max(pokedat[["attack"]]) - min(pokedat[["attack"]]))
data.frame(Raw = pokedat[["attack"]],
Standardized = attack_std) %>%
tidyr::gather("class", "attack", Raw, Standardized) %>%
mutate(class = factor(class, levels = c("Raw", "Standardized"))) %>%
ggplot(aes(x = attack, fill = class)) +
geom_density(alpha = 0.2) +
facet_wrap(. ~ class, scales = "free") +
xlab("Attack Strength") + ylab("Density") +
theme_bw() +
theme(legend.position = "none",
axis.title = element_text(size = 24, face = "bold"),
axis.text = element_text(size = 18),
legend.title = element_text(size = 24, face = "bold"),
legend.text = element_text(size = 20),
strip.text = element_text(size = 24, face = "bold")
)
So as you can see here, the distribution of the attack scores remains the same, but the scales over which the distribution is spread are different.
5.2 Z-Score Normalization
However, sometimes it is preferable to instead ensure that every feature has a standard distribution that can be easily described, to make them more comparable. In this case, you can normalize the data so that the distribution of every feature is shifted towards that of a normal (Gaussian) distribution. This is the typical “bell-shaped” curve, which can be exclusively described by its mean and standard deviation. An example where I use this approach regularly is in gene-expression analysis, where we normalize the data such that the expression levels of each gene are represented by a normal distribution, so that we can test for samples where the expression lies significantly outside the confines of this distribution. We can then look at all of the genes that show significantly different expression than we would expect using hypothesis-testing approaches, and look for common functions of these genes through network analysis, gene ontology analysis and other downstream analyses.
A simple normalization approach is the z-score normalization, which will transform the data so that the distribution has a mean of 0 and a standard deviation of 1. The transformation is as follows:
\[x' = \frac{x - \mu}{\sigma}\]
So we simply subtract the mean \(\mu\) (to center the data), and divide by the standard deviation \(\sigma\). Let’s again visualise this by comparing the raw and normalized data:
attack_norm <- (pokedat[["attack"]] - mean(pokedat[["attack"]]))/sd(pokedat[["attack"]])
data.frame(Raw = pokedat[["attack"]],
Normalized = attack_norm) %>%
tidyr::gather("class", "attack", Raw, Normalized) %>%
mutate(class = factor(class, levels = c("Raw", "Normalized"))) %>%
ggplot(aes(x = attack, fill = class)) +
geom_density(alpha = 0.2) +
facet_wrap(. ~ class, scales = "free") +
xlab("Attack Strength") + ylab("Density") +
theme_bw() +
theme(legend.position = "none",
axis.title = element_text(size = 24, face = "bold"),
axis.text = element_text(size = 18),
legend.title = element_text(size = 24, face = "bold"),
legend.text = element_text(size = 20),
strip.text = element_text(size = 24, face = "bold")
)
So the normalized data now have a mean of 0, with the majority of values lying between -2 and 2. Note that since this is a fairly simple normalization technique, the shape of the distribution itself is not really changed, but there are other methods such as quantile normalization which can reshape the data to a Gaussian curve, or indeed any other distribution as required.
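Two brief asides. First, base R’s scale() function applies exactly this z-score transformation to every column of a matrix, and we will use it later. Second, as a rough illustration of the kind of rank-based reshaping just mentioned, a simple inverse-normal transform maps the ranks of a variable onto a Gaussian shape (this is one simple variant, not full quantile normalization):
# scale() reproduces the manual z-score calculation above
all.equal(as.numeric(scale(pokedat[["attack"]])), attack_norm)
# Rank-based inverse-normal transform: forces an approximately Gaussian shape
inv_normal <- function(x) qnorm((rank(x) - 0.5) / length(x))
summary(inv_normal(pokedat[["attack"]]))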
5.3 Choosing a Normalization Method
The choice of which method to use will largely depend on what you are trying to do. The min-max scaling method is commonly used in machine learning, but does not handle outliers very well – a sample with an extreme value in the raw data will still have an extreme value in the scaled data, resulting in the remaining data bunching up, since everything is bounded within the range \([0,1]\). The Z-score normalization is better at dealing with outliers, as they will simply appear far away from the mean value of 0, but the range of the data is no longer bounded, meaning that different variables will now have different ranges. Other normalization processes may make your data more homogeneous, but at the cost of potentially losing aspects of the data that may be useful.
Many machine learning algorithms use some kind of distance measure, in order to look for similarity between different items. This can be a simple Euclidean distance, which is a k-dimensional measure of “as the crow flies”. If the scales are very different between your variables, then this will cause a significant issue.
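A quick sketch makes the point, using two arbitrary Pokémon (Bulbasaur and Caterpie) and just two variables: on the raw data the distance is driven almost entirely by experience_growth, whereas after z-scoring both variables contribute on a comparable footing:
# Euclidean distance between two Pokémon, raw vs z-scored
vars <- c("attack", "experience_growth")
raw_pair <- pokedat[c("Bulbasaur", "Caterpie"), vars]
scaled_pair <- scale(pokedat[, vars])[c("Bulbasaur", "Caterpie"), ]
dist(raw_pair)     # dominated by the experience_growth column
dist(scaled_pair)  # both variables now contribute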
For the following analyses, I will use Z-score normalization, since I know that there are a few extreme outliers in these data.
6 Looking for Patterns
6.1 Correlation of Variables
One thing that is worth doing before heading down a route of data modelling is to look at the correlation structure between the variables in the data set. If there are highly correlated variables (for instance, one might imagine height and weight to be highly correlated), then both will offer the same information to the model – adding both gives no more predictive power than adding only one. There is then a danger of over-fitting the data: a model so complicated that, whilst it may fit the training data very well, it is no longer applicable to external data sets, making it essentially useless. An extreme example is a model with one variable per sample that is equal to 1 for that specific sample and 0 for every other sample. This would fit the data perfectly, but would be of absolutely no use for any other data. In addition, we risk biasing the model by including multiple variables that essentially encode the same information.
So we can calculate a pairwise correlation matrix, which will give us a measure of similarity between each pair of variables by comparing their two vectors of values across the 801 Pokémon. There are multiple measures of correlation that can be used. The standard is the Pearson Correlation, which is a measure of the linearity of the relationship between two variables. A value of 1 represents a perfectly linear positive relationship (as one variable increases, the other increases proportionally), a value of 0 represents no linear relationship between the two, and a value of -1 represents a perfectly linear inverse (anti-correlated) relationship.
For this analysis, I will be using the Spearman Correlation coefficient. Instead of using the values themselves, this method first ranks the data, and then looks at the correlation. The idea here is that the unit difference between each successive value is kept constant, meaning that outliers do not have a big impact. This is therefore independent of the distribution of the data, and is therefore a non-parametric method. Since we identified some outliers with some of these variables, this method will remove the effect that these might have.
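In other words, the Spearman coefficient is simply the Pearson coefficient computed on the ranks of the data, as a quick check confirms:
# Spearman correlation is the Pearson correlation of the ranked values
x <- pokedat[["attack"]]
y <- pokedat[["defense"]]
all.equal(cor(x, y, method = "spearman"), cor(rank(x), rank(y), method = "pearson"))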
We can represent these values by using a heatmap, which is a way of representing 3-dimensional data. The value of the correlation, rather than being represented on an axis as a value, will be represented by a colour. Values of the Spearman correlation closer to 1 will appear more red, whilst those closer to -1 will appear more blue. Those closer to 0 will appear white.
In addition, I will apply a hierarchical clustering method to ensure that the variables most similar to one another are located close to one another on the figure. Pairwise distances are calculated by looking at the Euclidean distance, and similar variables are clustered together, allowing us to pick out by eye those most similar to one another.
As mentioned earlier, we first want to ensure that we are looking at numerical values, or at least numerical representations of factors.
pokedat_vals <- pokedat[, !names(pokedat) %in% c("abilities", "classfication", "japanese_name", "name", "pokedex_number")]
pokedat_vals[["type1"]] <- as.numeric(pokedat_vals[["type1"]])
pokedat_vals[["type2"]] <- as.numeric(pokedat_vals[["type2"]])
pokedat_vals[["generation"]] <- as.numeric(pokedat_vals[["generation"]])
pokedat_vals[["is_legendary"]] <- as.numeric(pokedat_vals[["is_legendary"]])
This will give us a purely numeric data set for use in numeric calculations. Finally we will normalize the data using Z-score normalization:
pokedat_norm <- scale(pokedat_vals, center = TRUE, scale = TRUE)
Finally let’s take a look at the correlation plot:
library("pheatmap")
pokecor_var <- cor(pokedat_norm, method = "spearman")
colors <- colorRampPalette(c('dark blue','white','dark red'))(255)#colorRampPalette( rev(brewer.pal(9, "Blues")) )(255)
pheatmap(pokecor_var,
clustering_method = "complete",
show_colnames = FALSE,
show_rownames = TRUE,
col=colors,
fontsize_row = 24)
We can see a few clear cases of correlated variables here, especially between a few of the values giving attack strength against certain Pokémon types. For instance, attack against ghost- and dark-type Pokémon are very similar (which makes sense), as are attacks against electric- and rock-type Pokémon (which makes less sense). The clearest aspect of the figure is a block of high positive correlations between a number of variables associated with the Pokémon’s vital statistics. So the most similar variables are those like speed, defence strength, attack strength, height, weight, health points, experience growth, etc. This makes a lot of sense, with bigger Pokémon having better attack and defence, more health points, and so on. We also see that these variables are highly anti-correlated with the capture rate, which again makes sense – the better Pokémon are harder to catch. However, none of these correlations seems sufficiently high to require the removal of any variables prior to analysis.
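For reference, the ordering of the rows and columns in this heatmap comes from a standard hierarchical clustering; a rough base-R equivalent of what pheatmap is doing internally with these settings is:
# Roughly what pheatmap does to order the variables:
# Euclidean distances between rows of the correlation matrix,
# followed by complete-linkage hierarchical clustering
var_clust <- hclust(dist(pokecor_var), method = "complete")
plot(var_clust, cex = 0.7)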
6.2 Height vs Weight
Let’s take a look at the height vs the weight. We can also include a few additional variables to see how the Pokémon strength and defence affect the relationship. I am going to scale both the x- and y-axes using a \(log_{10}\) transformation to avoid much larger Pokémon drowning out the smaller ones:
ggplot(aes(x = height_m, y = weight_kg, color = attack, size = defense), data = pokedat) +
geom_point(alpha = 0.2) +
scale_color_gradient2(name = "Attack", midpoint = mean(pokedat[["attack"]]), low = "blue", mid = "white", high = "red") +
scale_size_continuous(name = "Defence", range = c(1, 10)) +
xlab("Height (m)") + ylab("Weight (Kg)") +
scale_x_log10() +
scale_y_log10() +
#theme_bw() +
theme(axis.title = element_text(size = 24, face = "bold"),
axis.text = element_text(size = 18),
legend.title = element_text(size = 24, face = "bold"),
legend.text = element_text(size = 20),
strip.text = element_text(size = 24, face = "bold")
)
So there is a clear relationship, and in general taller Pokémon (such as Wailord, the largest of the Pokémon) also weigh more, as we might expect:
We also see that in general the attack strength of the larger Pokémon is higher than that of the smaller ones, although there are definitely some outliers, such as Cosmoem, a legendary Pokémon only 10 cm tall that weighs nearly 1,000 kg!
6.3 Correlation between Pokémon
pokecor_sample <- cor(t(pokedat_norm), method = "spearman")
colors <- colorRampPalette(c('dark blue','white','dark red'))(255)
pheatmap(pokecor_sample,
clustering_method = "complete",
show_colnames = FALSE,
show_rownames = FALSE,
annotation = pokedat[, c("generation", "type1", "type2", "is_legendary")],
col=colors)
So here we see the correlation structure between the 801 Pokémon, and it very clearly shows similarities between groups of Pokémon. The red boxes that we see down the diagonal represent highly similar groups, and at the top I have annotated the factor variables generation, type1, type2 and is_legendary. Almost all of the legendary Pokémon cluster together, suggesting that these are all similar to one another across these variables. Similarly there are a number of Pokémon types that clearly cluster together, such as rock type and dark type Pokémon. Interestingly the Generation number seems to be unrelated to Pokémon similarity.
6.4 Principal Component Analysis
A good way to explore data such as these for underlying trends is to use a dimensionality-reduction algorithm to reduce these high-dimensional data down into a smaller number of easy-to-digest chunks. Principal component analysis (PCA) is one such approach, and can be used to look for the largest sources of variation within a high-dimensional data set. For a data set with n variables, we can think of the data as existing in an n-dimensional space. PCA is a mathematical trick that rotates the axes of this n-dimensional space so that the first axis of the rotation explains the largest possible amount of variation in the data, the second axis explains the next largest amount, the third the next largest, and so on. In many cases, a very large amount of the total variation in the original n-dimensional data set can be captured by only a small number of these so-called principal components, which can then be used to look for underlying trends in the data. So let’s have a look at how this looks:
pc <- prcomp(pokedat_vals, scale. = TRUE)
pc_plot <- as.data.frame(pc[["x"]])
pc_plot <- cbind(pc_plot, pokedat[rownames(pc_plot), c("generation", "type1", "type2", "is_legendary")])
explained_variance <- 100*((pc[["sdev"]])^2 / sum(pc[["sdev"]]^2))
screeplot(pc, type = "line", main = "Principal Component Loadings")
In this plot, we see that PC1 and PC2 explain a lot of variance compared to the other PCs, but the drop off is not complete at this point. In a lot of data sets, the first few PCs explain the vast majority of the variance, and so this plot drops off considerably to a flat line by PC3 or PC4. Let’s take a look at how discriminatory these PCs are to these data:
ggplot(aes(x = PC1, y = PC2, shape = is_legendary, color = type1), data = pc_plot) +
geom_point(size = 5, alpha = 0.7) +
xlab(sprintf("PC%d (%.2f%%)", 1, explained_variance[1])) +
ylab(sprintf("PC%d (%.2f%%)", 2, explained_variance[2])) +
theme(axis.title = element_text(size = 14, face = "bold"),
axis.text = element_text(size = 12),
legend.title = element_text(size = 18, face = "bold"),
legend.text = element_text(size = 14))
We can see that this PCA approach clearly discriminates the legendary Pokémon (triangles) from the other Pokémon (circles) in PC1, which is the new axis that explains the most variance in the data (although note that this is still only 17.2%, so not a huge amount really). There is some separation seen between the different Pokémon groups, which seems to represent the second most significant source of variation, accounting for another 10.4%. Let’s look at PC3 as well to see if this is able to discriminate the samples further:
ggplot(aes(x = PC2, y = PC3, shape = is_legendary, color = type1), data = pc_plot) +
geom_point(size = 5, alpha = 0.7) +
xlab(sprintf("PC%d (%.2f%%)", 2, explained_variance[2])) +
ylab(sprintf("PC%d (%.2f%%)", 3, explained_variance[3])) +
theme(axis.title = element_text(size = 14, face = "bold"),
axis.text = element_text(size = 12),
legend.title = element_text(size = 18, face = "bold"),
legend.text = element_text(size = 14))
By looking at the 2nd and 3rd PCs, we do see a slight separation between the Pokémon types, but it is not strong.
6.5 Correlation-Based Recommendation
As a little aside, I want to see whether it is possible to recommend similar Pokémon to any Pokémon that you might suggest. For instance, my daughter’s absolute favourite is Eevee (a cuddly fox-like Pokémon):
So let’s see whether there are any other Pokémon that are similar to Eevee, based on a simple correlation match. To do this, I will calculate the Spearman correlation coefficient between Eevee and every other Pokémon.
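The Spearman coefficient is just the Pearson correlation calculated on the ranks of the values, so it captures monotonic rather than strictly linear relationships. With no tied ranks it reduces to the familiar form

\[\rho = 1 - \frac{6 \sum_i d_i^2}{n(n^2 - 1)}\]

where \(d_i\) is the difference between the rank of variable i within each of the two Pokémon’s profiles, and n is the number of variables. So let’s see which Pokémon are the most similar: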
eevee <- as.numeric(pokedat_vals["Eevee", ])
eevee_cor <- apply(pokedat_vals, MAR = 1, FUN = function (x) cor(x, eevee, method = "spearman"))
head(sort(eevee_cor, decreasing = TRUE), 10)
## Eevee Aipom Sentret Teddiursa Patrat Zigzagoon Rattata
## 1.0000000 0.9781018 0.9728923 0.9695460 0.9693680 0.9690263 0.9689643
## Meowth Raticate Bidoof
## 0.9687552 0.9676089 0.9669312
So here are a few of the most similar Pokémon:
Well they all look kind of cuddly like Eevee, except for the really creepy evil beaver Patrat at the end!
However, what would be even more efficient (maybe not in this case, but for a much higher-dimensional data set, such as a gene-expression data set with 30,000 genes) is to use the reduced data set produced by PCA. The first 10 PCs explained around two thirds of the variance in the data, so by using these 10 values rather than the original 37, we reduce the computation with a relatively small loss of information. Let’s see what the outputs are in this case:
pc_df_sub <- as.data.frame(pc[["x"]])[, 1:10]
eevee <- as.numeric(pc_df_sub["Eevee", ])
eevee_cor_pca <- apply(pc_df_sub, MAR = 1, FUN = function (x) cor(x, eevee, method = "spearman"))
head(sort(eevee_cor_pca, decreasing = TRUE), 10)
## Eevee Spinda Delcatty Watchog Aipom Furret Smeargle
## 1.0000000 0.9636364 0.9515152 0.9515152 0.9393939 0.9272727 0.9030303
## Herdier Vanillish Meowth
## 0.9030303 0.9030303 0.8909091
Similarly cute, but with another creepy one! After checking with the experiment subject (my daughter), it seems that the best hits are definitely Watchog and Furret, so this approach seems to do a good job of picking out similarly cuddly-looking Pokémon.
Obviously here we are simply looking for Pokémon with similar characteristics. This is not as in-depth as a collaborative filtering method, where we would have some subjective ranking of items to help us determine the best Pokémon to match somebody’s needs. However, by using the reduced PCA data set we are able to find very close matches using only a subset of the data.
7 Predicting Legendary Pokémon
Based on our previous analyses, we see that there is a clear discrimination between normal and Legendary Pokémon. These are incredibly rare and very powerful Pokémon in the game. So is it possible to identify a Legendary Pokémon based on the variables available from this database? And if so, which variables are the most important?
To do predictive modelling, we first need to split our data into a training data set, used to train the model, and a validation data set, used to confirm the accuracy of the predictions that the model makes. There are more robust ways to do this, such as cross-validation and bootstrapping, which allow you to assess the accuracy of the model more reliably; cross-validation can be applied simply by using the caret package in R.
We will also split the data randomly into two data sets – one containing 80% of the Pokémon for training the model, and one containing 20% of the Pokémon for validation purposes. For the following model fitting approaches, we do not need to convert the categorical data into numbers as we did for the correlation analysis, as the factors will be handled correctly automatically. We do, however, still need to ignore the non-informative variables. In addition, I am going to remove the “against_” columns to avoid overfitting the data. So let’s generate our two data sets:
library("caret")
set.seed(0)
split_index <- createDataPartition(y = pokedat[["is_legendary"]], p = 0.8, list = FALSE)
rm_index <- which(names(pokedat) %in% c("abilities", "classfication", "japanese_name", "name", "pokedex_number"))
rm_index <- c(grep("against", names(pokedat)), rm_index)
training_dat <- pokedat[split_index, -rm_index]
validation_dat <- pokedat[-split_index, -rm_index]
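As a quick sanity check, we can confirm that the stratified split has kept the proportion of Legendary Pokémon roughly equal across the two data sets (createDataPartition samples within each class of the outcome, so these proportions should be very close):

# Proportion of Legendary Pokémon in each split
prop.table(table(training_dat[["is_legendary"]]))
prop.table(table(validation_dat[["is_legendary"]]))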
So let’s try a few different common methods used for classification purposes.
7.1 Support-Vector Machine
We can then train our SVM model, incorporating a preprocessing step to center and scale the data, and performing 10-fold cross-validation repeated 3 times:
set.seed(0)
trctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 3)
model_svm <- train(is_legendary ~.,
data = training_dat,
method = "svmLinear",
trControl = trctrl,
preProcess = c("center", "scale"),
tuneLength = 10)
model_svm
## Support Vector Machines with Linear Kernel
##
## 641 samples
## 18 predictor
## 2 classes: 'FALSE', 'TRUE'
##
## Pre-processing: centered (56), scaled (56)
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 577, 576, 576, 577, 577, 578, ...
## Resampling results:
##
## Accuracy Kappa
## 0.9802138 0.8782632
##
## Tuning parameter 'C' was held constant at a value of 1
So this model is able to predict whether a Pokémon is Legendary based on these 18 predictor variables with 98.02% accuracy (the proportion of correct predictions). The Kappa value is normalised to account for the agreement we would expect by chance – because of the imbalance in the classes, we could get a pretty good accuracy just by calling everything not Legendary.
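For reference, Cohen’s Kappa compares the observed accuracy \(p_o\) with the accuracy \(p_e\) expected from chance agreement alone, given the class frequencies:

\[\kappa = \frac{p_o - p_e}{1 - p_e}\]

A value of 1 indicates perfect agreement, and a value of 0 indicates performance no better than chance.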
This accuracy is based on resampling of the training data. How does it cope with the validation data? Let’s take a look at the confusion matrix for the predicted outcomes compared to the true values:
predict_legendary <- predict(model_svm, newdata = validation_dat)
svm_confmat <- confusionMatrix(predict_legendary, validation_dat[["is_legendary"]])
svm_confmat
## Confusion Matrix and Statistics
##
## Reference
## Prediction FALSE TRUE
## FALSE 144 1
## TRUE 2 13
##
## Accuracy : 0.9812
## 95% CI : (0.9462, 0.9961)
## No Information Rate : 0.9125
## P-Value [Acc > NIR] : 0.000314
##
## Kappa : 0.8863
##
## Mcnemar's Test P-Value : 1.000000
##
## Sensitivity : 0.9863
## Specificity : 0.9286
## Pos Pred Value : 0.9931
## Neg Pred Value : 0.8667
## Prevalence : 0.9125
## Detection Rate : 0.9000
## Detection Prevalence : 0.9062
## Balanced Accuracy : 0.9574
##
## 'Positive' Class : FALSE
##
So, according to these statistics, of the 14 Legendary Pokémon in the validation data we correctly identified 13 and missed 1, while 2 non-Legendary Pokémon were incorrectly flagged as Legendary. Our overall accuracy is 98.12%, which seems to be pretty good. It may be possible to tune this model further, by adjusting the tuning parameter C, by tweaking which variables are included in the model, or by moving away from a linear kernel. However, all told, this is a pretty good result.
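As a rough sketch of what tuning C could look like (the grid values below are arbitrary choices of mine, purely for illustration), caret lets us supply a tuneGrid of candidate values and will pick the best one by cross-validation:

set.seed(0)
# Hypothetical grid of cost values; C is the single tuning parameter
# exposed by the svmLinear method in caret
svm_grid <- expand.grid(C = c(0.25, 0.5, 1, 2, 4))
model_svm_tuned <- train(is_legendary ~ .,
                         data = training_dat,
                         method = "svmLinear",
                         trControl = trctrl,
                         preProcess = c("center", "scale"),
                         tuneGrid = svm_grid)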
7.2 k-Nearest Neighbour
Another classification method is the k-Nearest Neighbour (kNN) algorithm. The idea here is that all of the training data are retained, and a new sample is compared against them to find the k samples “closest” to it. The classification is then made from some average (typically a majority vote) of the classes of these nearest neighbours. The “nearest” neighbours can be determined using one of a number of distance measures, including the Euclidean distance described earlier.
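To make the core idea concrete, here is a minimal sketch (using the numeric pokedat_vals matrix from earlier rather than the caret model fitted below, and a query Pokémon picked arbitrarily): find the 5 nearest neighbours in scaled space and take a vote over their is_legendary labels.

# A minimal sketch of the kNN step itself, separate from the caret model below
scaled_vals <- scale(pokedat_vals)
query <- "Mewtwo"   # arbitrary example Pokémon
# Euclidean distance from the query to every Pokémon
dists <- sqrt(rowSums(sweep(scaled_vals, 2, scaled_vals[query, ])^2))
# The 5 nearest neighbours (excluding the query itself)
neighbours <- names(sort(dists[names(dists) != query]))[1:5]
# "Classify" the query by a majority vote over its neighbours' labels
table(pokedat[neighbours, "is_legendary"])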
Also, the value of k is very important. Using the caret package in R, we are able to test multiple different values of k to find the one that optimises the model accuracy. So let’s train our model, again performing 10-fold cross-validation repeated 3 times:
set.seed(0)
trctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 3)
model_knn <- train(is_legendary ~.,
data = training_dat,
method = "knn",
trControl = trctrl,
preProcess = c("center", "scale"),
tuneLength = 10)
model_knn
## k-Nearest Neighbors
##
## 641 samples
## 18 predictor
## 2 classes: 'FALSE', 'TRUE'
##
## Pre-processing: centered (56), scaled (56)
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 577, 576, 576, 577, 577, 578, ...
## Resampling results across tuning parameters:
##
## k Accuracy Kappa
## 5 0.9470304 0.5515501
## 7 0.9444345 0.5158836
## 9 0.9428470 0.4854601
## 11 0.9376461 0.4246775
## 13 0.9350660 0.3903400
## 15 0.9355788 0.3898290
## 17 0.9376542 0.4161790
## 19 0.9376784 0.4063420
## 21 0.9355786 0.3765296
## 23 0.9381750 0.4070014
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 5.
We can see that the accuracy is optimised by using k = 5 nearest neighbours, which gives an accuracy of 94.7% – below that of the SVM model. Now let’s test it on the validation data set:
predict_legendary <- predict(model_knn, newdata = validation_dat)
knn_confmat <- confusionMatrix(predict_legendary, validation_dat[["is_legendary"]])
knn_confmat
## Confusion Matrix and Statistics
##
## Reference
## Prediction FALSE TRUE
## FALSE 145 4
## TRUE 1 10
##
## Accuracy : 0.9688
## 95% CI : (0.9286, 0.9898)
## No Information Rate : 0.9125
## P-Value [Acc > NIR] : 0.004163
##
## Kappa : 0.7833
##
## Mcnemar's Test P-Value : 0.371093
##
## Sensitivity : 0.9932
## Specificity : 0.7143
## Pos Pred Value : 0.9732
## Neg Pred Value : 0.9091
## Prevalence : 0.9125
## Detection Rate : 0.9062
## Detection Prevalence : 0.9313
## Balanced Accuracy : 0.8537
##
## 'Positive' Class : FALSE
##
So we can already see that the results of this model are less impressive than those of the SVM model.
7.3 Logistic Regression
For classification problems with only two groups, logistic regression is often a good first option. This is a generalised linear model, in which a link function transforms the response so that it can be modelled on a continuous scale. In this case, the response is the probability of belonging to the Legendary class, which we need to relate to the simple yes/no outcomes in the training data. The logit function is typically used as this link, transforming a probability p into a real number:

\[\mathrm{logit}(p) = \log\left(\frac{p}{1-p}\right)\]
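Going the other way, the inverse of the logit (the logistic, or sigmoid, function) maps any real-valued linear predictor x back onto a probability between 0 and 1, which is how the model’s output can be read as a class probability:

\[p = \mathrm{logit}^{-1}(x) = \frac{1}{1 + e^{-x}}\]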
So let’s fit a logistic regression model using all of our predictor variables, again with 3 repeats of 10-fold cross-validation, and see what we get back:
set.seed(0)
trctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 3)
model_logreg <- train(is_legendary ~.,
data = training_dat,
method = "glm",
family = "binomial",
trControl = trctrl,
preProcess = c("center", "scale"),
tuneLength = 10)
model_logreg
## Generalized Linear Model
##
## 641 samples
## 18 predictor
## 2 classes: 'FALSE', 'TRUE'
##
## Pre-processing: centered (56), scaled (56)
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 577, 576, 576, 577, 577, 578, ...
## Resampling results:
##
## Accuracy Kappa
## 0.9667197 0.802881
So the cross-validated accuracy of 96.67% sits above the kNN result but slightly below that of the SVM. Let’s test it on the validation data set:
predict_legendary <- predict(model_logreg, newdata = validation_dat)
logreg_confmat <- confusionMatrix(predict_legendary, validation_dat[["is_legendary"]])
logreg_confmat
## Confusion Matrix and Statistics
##
## Reference
## Prediction FALSE TRUE
## FALSE 141 1
## TRUE 5 13
##
## Accuracy : 0.9625
## 95% CI : (0.9202, 0.9861)
## No Information Rate : 0.9125
## P-Value [Acc > NIR] : 0.01131
##
## Kappa : 0.792
##
## Mcnemar's Test P-Value : 0.22067
##
## Sensitivity : 0.9658
## Specificity : 0.9286
## Pos Pred Value : 0.9930
## Neg Pred Value : 0.7222
## Prevalence : 0.9125
## Detection Rate : 0.8812
## Detection Prevalence : 0.8875
## Balanced Accuracy : 0.9472
##
## 'Positive' Class : FALSE
##
The accuracy of this model on the validation data set is 96.25%.
7.4 Random Forest
As a final model, we will look at a random forest classifier. A random forest is built by fitting a large number of decision trees, each on a random subset of the data and of the variables, and averaging over their predictions. I will use the same cross-validation scheme as previously, and will allow caret to identify the optimum parameters:
set.seed(0)
trctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 3)
model_rf <- train(is_legendary ~.,
data = training_dat,
method = "rf",
trControl = trctrl,
preProcess = c("center", "scale"),
tuneLength = 10)
model_rf
## Random Forest
##
## 641 samples
## 18 predictor
## 2 classes: 'FALSE', 'TRUE'
##
## Pre-processing: centered (56), scaled (56)
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 577, 576, 576, 577, 577, 578, ...
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.9911611 0.9396362
## 8 0.9984292 0.9905178
## 14 0.9984292 0.9905178
## 20 0.9984292 0.9905178
## 26 0.9984292 0.9905178
## 32 0.9984292 0.9905178
## 38 0.9984292 0.9905178
## 44 0.9984292 0.9905178
## 50 0.9984292 0.9905178
## 56 0.9984292 0.9905178
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 8.
This approach identified an accuracy of 99.84% (with a Kappa of 0.99) at an mtry value of 8. This is incredibly good, and is the best outcome so far. However, it is worth noting that this approach is significantly slower than the others. Let’s check against our validation data set:
predict_legendary <- predict(model_rf, newdata = validation_dat)
confusionMatrix(predict_legendary, validation_dat[["is_legendary"]])
## Confusion Matrix and Statistics
##
## Reference
## Prediction FALSE TRUE
## FALSE 145 1
## TRUE 1 13
##
## Accuracy : 0.9875
## 95% CI : (0.9556, 0.9985)
## No Information Rate : 0.9125
## P-Value [Acc > NIR] : 0.00005782
##
## Kappa : 0.9217
##
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.9932
## Specificity : 0.9286
## Pos Pred Value : 0.9932
## Neg Pred Value : 0.9286
## Prevalence : 0.9125
## Detection Rate : 0.9062
## Detection Prevalence : 0.9125
## Balanced Accuracy : 0.9609
##
## 'Positive' Class : FALSE
##
So whilst the resampling accuracy on the training data was nearly 100%, here we see a slightly lower accuracy of 98.75% on the validation data set. This is still the best result of the models tested, but it could likely be improved further.
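As a sketch of how we might start to answer the earlier question about which variables are the most important (output not shown here), caret can extract the variable importance scores from the fitted random forest:

# Variable importance from the fitted random forest (sketch only)
rf_importance <- varImp(model_rf)
rf_importance
plot(rf_importance, top = 10)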
8 Conclusion
We have managed to explore these data in a number of different ways and have identified some interesting themes and patterns. However, there is a lot of additional work that can be done. Correlation-based recommendation seems to work to identify similar Pokémon, at least in the small number of cases that I looked at. My wife showed me this toy that can guess any Pokémon that you think of, which probably uses a similar approach. I’m in the wrong business…
It is worth noting that I have concentrated predominantly on the accuracy of the models here, but in general this is not the only metric that we should be interested in. As mentioned previously, given how many more non-Legendary Pokémon there are, we would get a pretty good accuracy if we just assumed that none of the test samples were Legendary. Sensitivity and specificity, for instance, are useful to look at: treating Legendary as the positive class, sensitivity is the proportion of Legendary Pokémon correctly identified (True Positives), and specificity is the proportion of non-Legendary Pokémon correctly identified (True Negatives). Thinking about the errors instead, False Negatives are often more of an issue than False Positives – for disease prediction, for example, we would probably prefer not to miss any potential cases.
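For reference, with Legendary treated as the positive class, these are defined in terms of the confusion matrix counts as

\[\text{Sensitivity} = \frac{TP}{TP + FN}, \qquad \text{Specificity} = \frac{TN}{TN + FP}\]

so sensitivity penalises missed Legendaries (False Negatives) and specificity penalises false alarms (False Positives).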
Also, whilst it was interesting to play with some of the most commonly used machine learning methods, I have spent no real time tuning the model parameters. The variables used in the data set could also be examined to identify those most associated with the response variable. Given that Legendary Pokémon are incredibly rare and powerful, we will inevitably find that fighting-based variables like attack and defence are highly discriminative, but associated variables such as capture rate and experience growth are likely to matter too. In addition, we have looked only at additive models here and have not considered interaction terms. We could also spend time fine-tuning the parameters of the models themselves to improve the accuracy further.
Despite this, we got some very good results from the default model parameters for several commonly used methods. kNN, SVM, logistic regression and random forests gave broadly similar results, with SVM and random forest giving the best predictive outcomes.
Ultimately, the key to developing any machine learning algorithm is to trial multiple different approaches and tune the parameters for the specific data in question. A machine learning algorithm is also only as good as the data it is trained on, so feeding in more good-quality, cleaned data will help the model to learn. In the end, machine learning is aimed at developing a model that can predict something from new data, so it needs to generalise as well as possible.
So hopefully when I come across a new Pokémon not currently found in my Pokédex, I can easily check whether it is Legendary by asking it a few questions about its vital statistics. Very useful.
Gotta catch ’em all!
9 Session Info
sessionInfo()
## R version 3.6.1 (2019-07-05)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS Sierra 10.12.6
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] caret_6.0-84 lattice_0.20-38 pheatmap_1.0.12
## [4] RColorBrewer_1.1-2 ggrepel_0.8.1 dplyr_0.8.3
## [7] ggplot2_3.2.1
##
## loaded via a namespace (and not attached):
## [1] Rcpp_1.0.2 lubridate_1.7.4 tidyr_1.0.0
## [4] class_7.3-15 assertthat_0.2.1 zeallot_0.1.0
## [7] digest_0.6.22 ipred_0.9-9 foreach_1.4.7
## [10] R6_2.4.0 plyr_1.8.4 backports_1.1.5
## [13] stats4_3.6.1 evaluate_0.14 e1071_1.7-2
## [16] blogdown_0.16 pillar_1.4.2 rlang_0.4.1
## [19] lazyeval_0.2.2 data.table_1.12.6 kernlab_0.9-27
## [22] rpart_4.1-15 Matrix_1.2-17 rmarkdown_1.16
## [25] labeling_0.3 splines_3.6.1 gower_0.2.1
## [28] stringr_1.4.0 munsell_0.5.0 compiler_3.6.1
## [31] xfun_0.10 pkgconfig_2.0.3 htmltools_0.4.0
## [34] nnet_7.3-12 tidyselect_0.2.5 tibble_2.1.3
## [37] prodlim_2018.04.18 bookdown_0.14 codetools_0.2-16
## [40] randomForest_4.6-14 crayon_1.3.4 withr_2.1.2
## [43] MASS_7.3-51.4 recipes_0.1.7 ModelMetrics_1.2.2
## [46] grid_3.6.1 nlme_3.1-141 gtable_0.3.0
## [49] lifecycle_0.1.0 magrittr_1.5 scales_1.0.0
## [52] stringi_1.4.3 reshape2_1.4.3 timeDate_3043.102
## [55] ellipsis_0.3.0 generics_0.0.2 vctrs_0.2.0
## [58] lava_1.6.6 iterators_1.0.12 tools_3.6.1
## [61] glue_1.3.1 purrr_0.3.3 survival_2.44-1.1
## [64] yaml_2.2.0 colorspace_1.4-1 knitr_1.25