Overview and Motivation

In Chess, it should be possible for either player to come out on top. It is precisely this aspect of the game that makes it difficult to tell at face value who is winning until the last move has been made - at least without the use of advanced machine learning algorithms. Even with these algorithms, every move can have a profound impact on the outcome of the match.

Like many others around the globe, I have long been interested in Chess - though I am far from the definition of a good player. While playing against a friend of mine, and losing more matches than I care to admit, I realized that this project could be a good opportunity for me to understand which factors of the game I could focus on to improve more efficiently.

To help determine which individual aspects of the game have an impact on the outcome, and thus may help me improve, I have compiled graphs of different factors along with my interpretation of what they mean.

Initial Questions

Initially I wanted to research the following questions:

  1. Is there an implicit bias in favor of the white piece as a result of getting the first move?
  2. Which openings have the highest win rate, and which have the lowest win rate?
  3. Is there a correlation between the above and the player’s Elo (skill rating)?
  4. Can the outcome of a game be predicted using machine learning?

As I explored the dataset and learned which questions its limitations would let me answer, these questions developed into the following:

  1. Is there an implicit bias in favor of the white piece as a result of getting the first move?
  2. What variables are useful in predicting who will win a game?

These are the two primary questions I want to answer.

To answer these questions I will be using a dataset from Kaggle. The dataset contains over 20,000 Chess games from Lichess (lichess.org), the 2nd largest online Chess platform.

Dataset - https://www.kaggle.com/datasets/datasnaek/chess

Exploratory Data Analysis

Loading and Exploring the Data

Load the data into a dataframe

Before anything can be done with the data, it must first be loaded using the read.csv() function. After loading the data, look at the head and tail to see what the fields look like.

library(tidyverse) # provides %>%, glimpse(), ggplot(), and str_remove_all() used below
chess_games <- read.csv("Data/chess_games.csv")

head(chess_games)
##         id rated created_at last_move_at turns victory_status winner
## 1 TZJHLljE FALSE    1.5e+12      1.5e+12    13      outoftime  white
## 2 l1NXvwaE  TRUE    1.5e+12      1.5e+12    16         resign  black
## 3 mIICvQHh  TRUE    1.5e+12      1.5e+12    61           mate  white
## 4 kWKvrqYL  TRUE    1.5e+12      1.5e+12    61           mate  white
## 5 9tXo1AUZ  TRUE    1.5e+12      1.5e+12    95           mate  white
## 6 MsoDV9wj FALSE    1.5e+12      1.5e+12     5           draw   draw
##   increment_code      white_id white_rating      black_id black_rating
## 1           15+2      bourgris         1500          a-00         1191
## 2           5+10          a-00         1322     skinnerua         1261
## 3           5+10        ischia         1496          a-00         1500
## 4           20+0 daniamurashov         1439  adivanov2009         1454
## 5           30+3     nik221107         1523  adivanov2009         1469
## 6           10+0     trelynn17         1250 franklin14532         1002
##                                                                                                                                                                                                                                                                                                                                                                                                         moves
## 1                                                                                                                                                                                                                                                                                                                                                          d4 d5 c4 c6 cxd5 e6 dxe6 fxe6 Nf3 Bb4+ Nc3 Ba5 Bf4
## 2                                                                                                                                                                                                                                                                                                                                            d4 Nc6 e4 e5 f4 f6 dxe5 fxe5 fxe5 Nxe5 Qd4 Nc6 Qe5+ Nxe5 c4 Bb4+
## 3                                                                                                                                              e4 e5 d3 d6 Be3 c6 Be2 b5 Nd2 a5 a4 c5 axb5 Nc6 bxc6 Ra6 Nc4 a4 c3 a3 Nxa3 Rxa3 Rxa3 c4 dxc4 d5 cxd5 Qxd5 exd5 Be6 Ra8+ Ke7 Bc5+ Kf6 Bxf8 Kg6 Bxg7 Kxg7 dxe6 Kh6 exf7 Nf6 Rxh8 Nh5 Bxh5 Kg5 Rxh7 Kf5 Qf3+ Ke6 Bg4+ Kd6 Rh6+ Kc5 Qe3+ Kb5 c4+ Kb4 Qc3+ Ka4 Bd1#
## 4                                                                                                                                  d4 d5 Nf3 Bf5 Nc3 Nf6 Bf4 Ng4 e3 Nc6 Be2 Qd7 O-O O-O-O Nb5 Nb4 Rc1 Nxa2 Ra1 Nb4 Nxa7+ Kb8 Nb5 Bxc2 Bxc7+ Kc8 Qd2 Qc6 Na7+ Kd7 Nxc6 bxc6 Bxd8 Kxd8 Qxb4 e5 Qb8+ Ke7 dxe5 Be4 Ra7+ Ke6 Qe8+ Kf5 Qxf7+ Nf6 Nh4+ Kg5 g3 Ng4 Qf4+ Kh5 Qxg4+ Kh6 Qf4+ g5 Qf6+ Bg6 Nxg6 Bg7 Qxg7#
## 5 e4 e5 Nf3 d6 d4 Nc6 d5 Nb4 a3 Na6 Nc3 Be7 b4 Nf6 Bg5 O-O b5 Nc5 Bxf6 Bxf6 Bd3 Qd7 O-O Nxd3 Qxd3 c6 a4 cxd5 Nxd5 Qe6 Nc7 Qg4 Nxa8 Bd7 Nc7 Rc8 Nd5 Qg6 Nxf6+ Qxf6 Rfd1 Re8 Qxd6 Bg4 Qxf6 gxf6 Rd3 Bxf3 Rxf3 Rd8 Rxf6 Kg7 Rf3 Rd2 Rg3+ Kf8 c3 Re2 f3 Rc2 Rg5 f6 Rh5 Kg7 Rd1 Kg6 Rh3 Rxc3 Rd7 Rc1+ Kf2 Rc2+ Kg3 h5 Rxb7 Kg5 Rxa7 h4+ Rxh4 Rxg2+ Kxg2 Kxh4 b6 Kg5 b7 f5 exf5 Kxf5 b8=Q e4 Rf7+ Kg5 Qg8+ Kh6 Rh7#
## 6                                                                                                                                                                                                                                                                                                                                                                                            e4 c5 Nf3 Qa5 a3
##   opening_eco                           opening_name opening_ply
## 1         D10       Slav Defense: Exchange Variation           5
## 2         B00 Nimzowitsch Defense: Kennedy Variation           4
## 3         C20  King's Pawn Game: Leonardis Variation           3
## 4         D02 Queen's Pawn Game: Zukertort Variation           3
## 5         C41                       Philidor Defense           5
## 6         B27   Sicilian Defense: Mongoose Variation           4
tail(chess_games)
##             id rated  created_at last_move_at turns victory_status winner
## 20053 EopEqqAa  TRUE 1.49981e+12  1.49981e+12    37         resign  white
## 20054 EfqH7VVH  TRUE 1.49979e+12  1.49979e+12    24         resign  white
## 20055 WSJDhbPl  TRUE 1.49970e+12  1.49970e+12    82           mate  black
## 20056 yrAas0Kj  TRUE 1.49970e+12  1.49970e+12    35           mate  white
## 20057 b0v4tRyF  TRUE 1.49970e+12  1.49970e+12   109         resign  white
## 20058 N8G2JHGG  TRUE 1.49964e+12  1.49964e+12    78           mate  black
##       increment_code     white_id white_rating           black_id black_rating
## 20053          10+10     jamboger         1219           samael88         1250
## 20054          10+10      belcolt         1691           jamboger         1220
## 20055           10+0     jamboger         1233 farrukhasomiddinov         1196
## 20056           10+0     jamboger         1219       schaaksmurf3         1286
## 20057           10+0 marcodisogno         1360           jamboger         1227
## 20058           10+0     jamboger         1235              ffbob         1339
##                                                                                                                                                                                                                                                                                                                                                                                                                                                               moves
## 20053                                                                                                                                                                                                                                                                                                                      c4 e6 d4 b6 Nc3 Bb7 Nf3 g6 h4 Bg7 Bg5 f6 Bf4 d6 e4 Ne7 d5 e5 Be3 c6 b4 c5 a3 h6 Qa4+ Nd7 Rb1 Qc7 bxc5 bxc5 Nb5 Qb8 Qa5 a6 Nc7+ Qxc7 Qxc7
## 20054                                                                                                                                                                                                                                                                                                                                                                  d4 f5 e3 e6 Nf3 Nf6 Nc3 b6 Be2 Bb7 O-O Be7 Ne5 d6 Bh5+ g6 Nxg6 hxg6 Bxg6+ Kf8 e4 fxe4 Re1 d5
## 20055                                                                                                 d4 d6 Bf4 e5 Bg3 Nf6 e3 exd4 exd4 d5 c3 Bd6 Bd3 O-O Nd2 Re8+ Kf1 Bxg3 hxg3 b6 g4 Ba6 g5 Bxd3+ Ne2 Ne4 Nxe4 Bxe4 Nf4 Qxg5 Nh3 Bxg2+ Kg1 Qg6 Nf4 Qg5 Nxg2 Re6 Qd3 Rh6 Qe2 Nc6 Re1 g6 f4 Rxh1+ Kxh1 Qh6+ Kg1 a5 Qb5 Na7 Re8+ Rxe8 Qxe8+ Kg7 Qe5+ Kg8 Qxc7 Qh5 Qxa7 Qd1+ Kh2 Qa1 Nh4 Qxb2+ Kh3 Qxc3+ Kg4 Qxd4 Qb8+ Kg7 Nf3 Qa1 Kg5 Qxa2 Ne5 Qg2+ Kh4 h5 Nxf7 Qg4#
## 20056                                                                                                                                                                                                                                                                                                                 d4 d5 Bf4 Nc6 e3 Nf6 c3 e6 Nf3 Be7 Bd3 O-O Nbd2 b6 Ne5 Nxe5 Bxe5 Nd7 Bxh7+ Kxh7 Qh5+ Kg8 Nf3 f6 Bf4 g5 Qg6+ Kh8 Nh4 Qe8 Qh6+ Kg8 Ng6 Kf7 Qh7#
## 20057 e4 d6 d4 Nf6 e5 dxe5 dxe5 Qxd1+ Kxd1 Nd5 c4 Nb6 c5 Nd5 Bc4 e6 Bxd5 exd5 Nc3 d4 Ne4 Bf5 f3 Nd7 b4 Nxe5 Bf4 f6 g4 Bxe4 fxe4 c6 Bxe5 fxe5 Nf3 Be7 Nxe5 Bf6 Nc4 O-O-O h4 h6 e5 b5 cxb6 Be7 bxa7 Kc7 a3 Rhf8 Kd2 Rf4 Rag1 d3 h5 Rf2+ Ke3 Re2+ Kf4 Rf8+ Kg3 Re3+ Nxe3 d2 Rd1 Bg5 Nf5 Kb7 Rhf1 Kxa7 Nd6 Rxf1 Rxf1 Kb6 e6 Kc7 Nf5 Kc8 e7 Kd7 a4 Bxe7 Nxe7 Kxe7 a5 Kd7 Rd1 Kc7 Rxd2 Kb7 Ra2 Ka6 Kf4 Kb5 a6 Kb6 a7 Kb5 a8=Q Kxb4 Qxc6 g5+ hxg6 Kb3 Rc2 Kb4 Qb7+ Ka3 Rc8
## 20058                                                                                                                         d4 d5 Bf4 Na6 e3 e6 c3 Nf6 Nf3 Bd7 Nbd2 b5 Bd3 Qc8 e4 b4 e5 Ne4 Nxe4 dxe4 Bxe4 bxc3 Bxa8 Qxa8 bxc3 Ba3 Rb1 c5 Qd3 O-O Qxa6 Bc6 Qxa3 Bxf3 gxf3 Qxf3 Qxa7 Qxh1+ Ke2 Qxb1 Qxc5 Qc2+ Ke3 Qxa2 Qb4 h6 c4 g5 Bg3 Qa8 c5 Rb8 Qc3 f5 f4 Qe4+ Kd2 Qg2+ Kd3 gxf4 Bxf4 Qf3+ Kc4 Qxf4 c6 Qf1+ Kc5 Rb1 Qg3+ Kf7 c7 Rc1+ Kd6 Qa6+ Kd7 Qb5+ Kd8 Qe8#
##       opening_eco                    opening_name opening_ply
## 20053         A40                 English Defense           4
## 20054         A80                   Dutch Defense           2
## 20055         A41                    Queen's Pawn           2
## 20056         D00 Queen's Pawn Game: Mason Attack           3
## 20057         B07                    Pirc Defense           4
## 20058         D00 Queen's Pawn Game: Mason Attack           3

Glimpse of the data

Now that we have seen the first and last rows, we can look at the structure of the data: the column names, their types, and a preview of the values in each. To do this, we can use the glimpse() function.

glimpse(chess_games)
## Rows: 20,058
## Columns: 16
## $ id             <chr> "TZJHLljE", "l1NXvwaE", "mIICvQHh", "kWKvrqYL", "9tXo1A~
## $ rated          <lgl> FALSE, TRUE, TRUE, TRUE, TRUE, FALSE, TRUE, FALSE, TRUE~
## $ created_at     <dbl> 1.5e+12, 1.5e+12, 1.5e+12, 1.5e+12, 1.5e+12, 1.5e+12, 1~
## $ last_move_at   <dbl> 1.5e+12, 1.5e+12, 1.5e+12, 1.5e+12, 1.5e+12, 1.5e+12, 1~
## $ turns          <int> 13, 16, 61, 61, 95, 5, 33, 9, 66, 119, 39, 38, 60, 31, ~
## $ victory_status <chr> "outoftime", "resign", "mate", "mate", "mate", "draw", ~
## $ winner         <chr> "white", "black", "white", "white", "white", "draw", "w~
## $ increment_code <chr> "15+2", "5+10", "5+10", "20+0", "30+3", "10+0", "10+0",~
## $ white_id       <chr> "bourgris", "a-00", "ischia", "daniamurashov", "nik2211~
## $ white_rating   <int> 1500, 1322, 1496, 1439, 1523, 1250, 1520, 1413, 1439, 1~
## $ black_id       <chr> "a-00", "skinnerua", "a-00", "adivanov2009", "adivanov2~
## $ black_rating   <int> 1191, 1261, 1500, 1454, 1469, 1002, 1423, 2108, 1392, 1~
## $ moves          <chr> "d4 d5 c4 c6 cxd5 e6 dxe6 fxe6 Nf3 Bb4+ Nc3 Ba5 Bf4", "~
## $ opening_eco    <chr> "D10", "B00", "C20", "D02", "C41", "B27", "D00", "B00",~
## $ opening_name   <chr> "Slav Defense: Exchange Variation", "Nimzowitsch Defens~
## $ opening_ply    <int> 5, 4, 3, 3, 5, 4, 10, 5, 6, 4, 1, 9, 3, 2, 8, 7, 8, 8, ~

Summary of the data

Looking at a summary of the data gives a general idea of the range and distribution of the values in each column.

summary(chess_games)
##       id              rated           created_at         last_move_at      
##  Length:20058       Mode :logical   Min.   :1.377e+12   Min.   :1.377e+12  
##  Class :character   FALSE:3903      1st Qu.:1.480e+12   1st Qu.:1.480e+12  
##  Mode  :character   TRUE :16155     Median :1.497e+12   Median :1.497e+12  
##                                     Mean   :1.483e+12   Mean   :1.483e+12  
##                                     3rd Qu.:1.501e+12   3rd Qu.:1.501e+12  
##                                     Max.   :1.504e+12   Max.   :1.504e+12  
##      turns        victory_status        winner          increment_code    
##  Min.   :  1.00   Length:20058       Length:20058       Length:20058      
##  1st Qu.: 37.00   Class :character   Class :character   Class :character  
##  Median : 55.00   Mode  :character   Mode  :character   Mode  :character  
##  Mean   : 60.47                                                           
##  3rd Qu.: 79.00                                                           
##  Max.   :349.00                                                           
##    white_id          white_rating    black_id          black_rating 
##  Length:20058       Min.   : 784   Length:20058       Min.   : 789  
##  Class :character   1st Qu.:1398   Class :character   1st Qu.:1391  
##  Mode  :character   Median :1567   Mode  :character   Median :1562  
##                     Mean   :1597                      Mean   :1589  
##                     3rd Qu.:1793                      3rd Qu.:1784  
##                     Max.   :2700                      Max.   :2723  
##     moves           opening_eco        opening_name        opening_ply    
##  Length:20058       Length:20058       Length:20058       Min.   : 1.000  
##  Class :character   Class :character   Class :character   1st Qu.: 3.000  
##  Mode  :character   Mode  :character   Mode  :character   Median : 4.000  
##                                                           Mean   : 4.817  
##                                                           3rd Qu.: 6.000  
##                                                           Max.   :28.000

Data Transformation

Having looked at and explored the data, we have a better understanding of what we are working with. We can now start transforming the data: getting rid of what we don’t need and manipulating the columns so they carry more useful meaning. This is essential because data is not always formatted in a way that is immediately useful. It is also important to correct for any missing values or other issues that may skew the final results.
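
As a rough illustration of checking for missing values, the sketch below counts NA entries per column and keeps only complete rows. The toy_games data frame and its values are made up for the example and are not taken from the dataset.

```r
# Toy data frame with deliberately missing values (hypothetical, not the real data)
toy_games <- data.frame(
  turns        = c(13L, 16L, NA, 61L),
  winner       = c("white", "black", "white", NA),
  white_rating = c(1500L, 1322L, 1496L, 1439L),
  stringsAsFactors = FALSE
)

# Number of missing values in each column
na_counts <- colSums(is.na(toy_games))
print(na_counts)

# Keep only the rows that have no missing value in any column
complete_rows <- toy_games[complete.cases(toy_games), ]
print(nrow(complete_rows))
```

On the real data the same calls would take chess_games in place of toy_games. Note that is.na() does not treat empty strings in character columns as missing, so those would need a separate check.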

Removing Unnecessary Columns

After looking at the fields, I can see that ‘id’, ‘created_at’, ‘last_move_at’, ‘white_id’, ‘black_id’, and ‘moves’ will not be needed, so it is safe to drop them. I will save these changes to a new data frame called chess2.

chess2 <- chess_games %>% 
  select(-id, -created_at, -last_move_at, -white_id, -black_id, -moves)

glimpse(chess2)
## Rows: 20,058
## Columns: 10
## $ rated          <lgl> FALSE, TRUE, TRUE, TRUE, TRUE, FALSE, TRUE, FALSE, TRUE~
## $ turns          <int> 13, 16, 61, 61, 95, 5, 33, 9, 66, 119, 39, 38, 60, 31, ~
## $ victory_status <chr> "outoftime", "resign", "mate", "mate", "mate", "draw", ~
## $ winner         <chr> "white", "black", "white", "white", "white", "draw", "w~
## $ increment_code <chr> "15+2", "5+10", "5+10", "20+0", "30+3", "10+0", "10+0",~
## $ white_rating   <int> 1500, 1322, 1496, 1439, 1523, 1250, 1520, 1413, 1439, 1~
## $ black_rating   <int> 1191, 1261, 1500, 1454, 1469, 1002, 1423, 2108, 1392, 1~
## $ opening_eco    <chr> "D10", "B00", "C20", "D02", "C41", "B27", "D00", "B00",~
## $ opening_name   <chr> "Slav Defense: Exchange Variation", "Nimzowitsch Defens~
## $ opening_ply    <int> 5, 4, 3, 3, 5, 4, 10, 5, 6, 4, 1, 9, 3, 2, 8, 7, 8, 8, ~

Cleaning the Data

One concern that I have is the potential for results to be skewed by non-rated games. When games are not rated, the players are more likely to be playing against friends, trying something new, or messing around, which could make those games less representative of serious play. Filtering out non-rated games removes only 3,903 of the 20,058 games in the dataset, but could help increase the reliability of the results.

chess3 <- chess2 %>% 
  filter(rated) %>% # rated is a logical column, so it can be used directly
  select(-rated)

nrow(chess2)-nrow(chess3)
## [1] 3903
head(chess3)
##   turns victory_status winner increment_code white_rating black_rating
## 1    16         resign  black           5+10         1322         1261
## 2    61           mate  white           5+10         1496         1500
## 3    61           mate  white           20+0         1439         1454
## 4    95           mate  white           30+3         1523         1469
## 5    33         resign  white           10+0         1520         1423
## 6    66         resign  black           15+0         1439         1392
##   opening_eco                               opening_name opening_ply
## 1         B00     Nimzowitsch Defense: Kennedy Variation           4
## 2         C20      King's Pawn Game: Leonardis Variation           3
## 3         D02     Queen's Pawn Game: Zukertort Variation           3
## 4         C41                           Philidor Defense           5
## 5         D00 Blackmar-Diemer Gambit: Pietrowsky Defense          10
## 6         C50      Italian Game: Schilling-Kostic Gambit           6

Another concern I have is the way games end. There are 4 possible ways for a game to end: Mate, Draw, Resign, Out of Time. Let’s graph these to see what the impact may be. My concern is that games ending in a resignation or out of time may be the result of external factors.

chess3 %>% 
  ggplot(aes(x = victory_status, color = winner, fill = winner)) +
  geom_bar(alpha = 0.7) +
  labs(title = "Outcome of Game by Winner",
       x = "Outcome",
       y = "Number of Games")

After looking at the graph, I have decided that it is best not to filter out the outcome conditions ‘resign’ and ‘outoftime’. The proportions of white victories and black victories appear to be mostly the same across these conditions, so keeping these games should not have a negative impact on the analysis.
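
The comparison of proportions can also be made explicit with a table rather than judged by eye from bar heights. This is a minimal sketch in base R; the games data frame holds invented values and only stands in for chess3.

```r
# Toy data standing in for chess3; these rows are invented for illustration
games <- data.frame(
  victory_status = c("mate", "mate", "mate", "resign", "resign", "resign",
                     "resign", "outoftime", "outoftime", "draw"),
  winner         = c("white", "black", "white", "white", "black", "white",
                     "black", "white", "black", "draw")
)

# Counts of each winner within each way of ending the game
counts <- table(games$victory_status, games$winner)

# Row-wise proportions: each row sums to 1, so rows can be compared directly
props <- prop.table(counts, margin = 1)
print(round(props, 2))
```

A similar view is available in the plot itself by passing position = "fill" to geom_bar(), which scales every bar to the same height.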

Adding to the data

There are a few columns that I believe will help better understand the data.

Here I will add the following columns:

  1. rating_difference: the difference in rating between white and black. Positive when white is rated higher and negative when black is rated higher.
  2. average_rating: the average rating of the two players.
  3. advantage: a categorical value indicating who has an advantage. A player has an advantage when their rating is higher than the other’s.

These columns will be added to a new data frame called chess4.

chess4 <- chess3 %>% 
  mutate(rating_difference = white_rating-black_rating) %>% 
  mutate(average_rating = (white_rating+black_rating)/2) %>% 
  mutate(advantage=case_when(rating_difference > 0 ~ "White Advantage",
                             rating_difference == 0 ~ "No Advantage",
                             rating_difference < 0 ~ "Black Advantage"))

glimpse(chess4)
## Rows: 16,155
## Columns: 12
## $ turns             <int> 16, 61, 61, 95, 33, 66, 119, 36, 13, 69, 43, 54, 53,~
## $ victory_status    <chr> "resign", "mate", "mate", "mate", "resign", "resign"~
## $ winner            <chr> "black", "white", "white", "white", "white", "black"~
## $ increment_code    <chr> "5+10", "5+10", "20+0", "30+3", "10+0", "15+0", "10+~
## $ white_rating      <int> 1322, 1496, 1439, 1523, 1520, 1439, 1381, 1307, 1113~
## $ black_rating      <int> 1261, 1500, 1454, 1469, 1423, 1392, 1209, 1106, 1423~
## $ opening_eco       <chr> "B00", "C20", "D02", "C41", "D00", "C50", "B01", "A2~
## $ opening_name      <chr> "Nimzowitsch Defense: Kennedy Variation", "King's Pa~
## $ opening_ply       <int> 4, 3, 3, 5, 10, 6, 4, 4, 3, 4, 8, 4, 4, 4, 3, 6, 3, ~
## $ rating_difference <int> 61, -4, -15, 54, 97, 47, 172, 201, -310, -141, 746, ~
## $ average_rating    <dbl> 1291.5, 1498.0, 1446.5, 1496.0, 1471.5, 1415.5, 1295~
## $ advantage         <chr> "White Advantage", "Black Advantage", "Black Advanta~

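Next I will add a shortened version of the opening name. This will just make it easier to deal with later on. The pattern below removes a trailing alternative name that follows a ‘|’ character; names without a ‘|’, such as the six shown in the output, are left unchanged.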

chess4 <- chess4 %>% 
  mutate(short_opening = opening_name) %>% 
  mutate(short_opening = str_remove_all(short_opening, pattern = "\\|[^|]*$"))

head(chess4)
##   turns victory_status winner increment_code white_rating black_rating
## 1    16         resign  black           5+10         1322         1261
## 2    61           mate  white           5+10         1496         1500
## 3    61           mate  white           20+0         1439         1454
## 4    95           mate  white           30+3         1523         1469
## 5    33         resign  white           10+0         1520         1423
## 6    66         resign  black           15+0         1439         1392
##   opening_eco                               opening_name opening_ply
## 1         B00     Nimzowitsch Defense: Kennedy Variation           4
## 2         C20      King's Pawn Game: Leonardis Variation           3
## 3         D02     Queen's Pawn Game: Zukertort Variation           3
## 4         C41                           Philidor Defense           5
## 5         D00 Blackmar-Diemer Gambit: Pietrowsky Defense          10
## 6         C50      Italian Game: Schilling-Kostic Gambit           6
##   rating_difference average_rating       advantage
## 1                61         1291.5 White Advantage
## 2                -4         1498.0 Black Advantage
## 3               -15         1446.5 Black Advantage
## 4                54         1496.0 White Advantage
## 5                97         1471.5 White Advantage
## 6                47         1415.5 White Advantage
##                                short_opening
## 1     Nimzowitsch Defense: Kennedy Variation
## 2      King's Pawn Game: Leonardis Variation
## 3     Queen's Pawn Game: Zukertort Variation
## 4                           Philidor Defense
## 5 Blackmar-Diemer Gambit: Pietrowsky Defense
## 6      Italian Game: Schilling-Kostic Gambit
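
To make the effect of the pattern concrete, the sketch below applies it to three sample names with base R's sub(). The third name is invented to show the "|" case. The family_only variant, which also drops the variation after the colon, is an assumption about one possible further shortening and is not what the code above does.

```r
# Sample opening names; the last one is invented to show the "|" case
opening_name <- c("Slav Defense: Exchange Variation",
                  "Philidor Defense",
                  "Example Opening | Alternate Name")

# Same pattern as above: drop the last "|" and everything after it
pipe_removed <- sub("\\|[^|]*$", "", opening_name)

# Hypothetical alternative: cut at the first ":" or "|" and trim whitespace
family_only <- trimws(sub("[:|].*$", "", opening_name))

print(pipe_removed)
print(family_only)
```

Grouping by a name like family_only would pool all variations of an opening together, which may or may not be desirable depending on the question being asked.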

Next I will add a column for advantage level. This is similar to the advantage column added before, but instead of just noting who has an advantage, it captures how significant the advantage is. I calculate this level by determining the Q1, Q2, and Q3 values of rating_difference and using those as cutoffs:

  1. MAB - if the rating difference is less than the Q1 value, then black has a major advantage
  2. SAB - if the rating difference is at least the Q1 value but less than the Q2 value, then black has a slight advantage
  3. No Advantage - if the rating difference equals the Q2 value. The Q2 value is taken to be zero, so this is the case where the two players have the same rating
  4. SAW - if the rating difference is greater than the Q2 value and less than or equal to the Q3 value, then white has a slight advantage
  5. MAW - if the rating difference is greater than the Q3 value, then white has a major advantage

# Compute the quartiles of rating_difference to use as category cutoffs
Q1_rating <- summary(chess4$rating_difference)[2] # first quartile
Q2_rating <- summary(chess4$rating_difference)[3] # second quartile (median)
Q3_rating <- summary(chess4$rating_difference)[5] # third quartile

chess4 <- chess4 %>% 
  mutate(advantage_level = case_when( rating_difference < Q1_rating ~ "MAB",
                                      rating_difference < Q2_rating ~ "SAB",
                                      rating_difference == Q2_rating ~ "No Advantage",
                                      rating_difference <= Q3_rating ~ "SAW",
                                      rating_difference > Q3_rating ~ "MAW"))

head(chess4)
##   turns victory_status winner increment_code white_rating black_rating
## 1    16         resign  black           5+10         1322         1261
## 2    61           mate  white           5+10         1496         1500
## 3    61           mate  white           20+0         1439         1454
## 4    95           mate  white           30+3         1523         1469
## 5    33         resign  white           10+0         1520         1423
## 6    66         resign  black           15+0         1439         1392
##   opening_eco                               opening_name opening_ply
## 1         B00     Nimzowitsch Defense: Kennedy Variation           4
## 2         C20      King's Pawn Game: Leonardis Variation           3
## 3         D02     Queen's Pawn Game: Zukertort Variation           3
## 4         C41                           Philidor Defense           5
## 5         D00 Blackmar-Diemer Gambit: Pietrowsky Defense          10
## 6         C50      Italian Game: Schilling-Kostic Gambit           6
##   rating_difference average_rating       advantage
## 1                61         1291.5 White Advantage
## 2                -4         1498.0 Black Advantage
## 3               -15         1446.5 Black Advantage
## 4                54         1496.0 White Advantage
## 5                97         1471.5 White Advantage
## 6                47         1415.5 White Advantage
##                                short_opening advantage_level
## 1     Nimzowitsch Defense: Kennedy Variation             SAW
## 2      King's Pawn Game: Leonardis Variation             SAB
## 3     Queen's Pawn Game: Zukertort Variation             SAB
## 4                           Philidor Defense             SAW
## 5 Blackmar-Diemer Gambit: Pietrowsky Defense             SAW
## 6      Italian Game: Schilling-Kostic Gambit             SAW
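
The cutoff logic can be sketched on a small, made-up vector of rating differences. The sketch uses quantile() instead of indexing into summary(), and nested ifelse() calls evaluated in the same order as the case_when() conditions. Both are substitutions made for a self-contained base R example, not the method used above.

```r
# Made-up rating differences (white minus black); not taken from the dataset
rating_difference <- c(-310, -141, -15, -4, 0, 47, 61, 97, 201)

# Quartiles, accessible by name: q[["25%"]], q[["50%"]], q[["75%"]]
q <- quantile(rating_difference, probs = c(0.25, 0.5, 0.75))

# Conditions are checked in order, so each value gets the first label that applies
advantage_level <- ifelse(rating_difference <  q[["25%"]], "MAB",
                   ifelse(rating_difference <  q[["50%"]], "SAB",
                   ifelse(rating_difference == q[["50%"]], "No Advantage",
                   ifelse(rating_difference <= q[["75%"]], "SAW", "MAW"))))

print(data.frame(rating_difference, advantage_level))
```

In this toy vector the median is exactly zero, so "No Advantage" lines up with equal ratings. If the median of the real data were not exactly zero, comparing against 0 directly might be the safer condition.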

The final column I want to add is the skill level. Once again, I used quartiles, this time the Q1 and Q3 values of average_rating, to define the cutoffs for the skill groupings. The groups are as follows:

  1. Low Skill - average rating is less than or equal to the Q1 value
  2. Medium Skill - average rating is greater than the Q1 value and less than the Q3 value
  3. High Skill - average rating is greater than or equal to the Q3 value

Initially, I had used an arbitrary cutoff value, but using the quartiles felt like a more natural and better way of determining the cutoff values.

Q1_skill <- summary(chess4$average_rating) [2] # first quartile
Q2_skill <- summary(chess4$average_rating) [3] # second quartile
Q3_skill <- summary(chess4$average_rating) [5] # third quartile

chess4 <- chess4 %>% 
  mutate(skill_level = case_when(average_rating <= Q1_skill ~ "Low Skill",
                                 average_rating < Q3_skill ~ "Medium Skill",
                                 average_rating >= Q3_skill ~ "High Skill"))

# transform(chess4, skill_level = factor(skill_level, levels = c("Low Skill", "Medium Skill", "High Skill")))
head(chess4)
##   turns victory_status winner increment_code white_rating black_rating
## 1    16         resign  black           5+10         1322         1261
## 2    61           mate  white           5+10         1496         1500
## 3    61           mate  white           20+0         1439         1454
## 4    95           mate  white           30+3         1523         1469
## 5    33         resign  white           10+0         1520         1423
## 6    66         resign  black           15+0         1439         1392
##   opening_eco                               opening_name opening_ply
## 1         B00     Nimzowitsch Defense: Kennedy Variation           4
## 2         C20      King's Pawn Game: Leonardis Variation           3
## 3         D02     Queen's Pawn Game: Zukertort Variation           3
## 4         C41                           Philidor Defense           5
## 5         D00 Blackmar-Diemer Gambit: Pietrowsky Defense          10
## 6         C50      Italian Game: Schilling-Kostic Gambit           6
##   rating_difference average_rating       advantage
## 1                61         1291.5 White Advantage
## 2                -4         1498.0 Black Advantage
## 3               -15         1446.5 Black Advantage
## 4                54         1496.0 White Advantage
## 5                97         1471.5 White Advantage
## 6                47         1415.5 White Advantage
##                                short_opening advantage_level  skill_level
## 1     Nimzowitsch Defense: Kennedy Variation             SAW    Low Skill
## 2      King's Pawn Game: Leonardis Variation             SAB Medium Skill
## 3     Queen's Pawn Game: Zukertort Variation             SAB Medium Skill
## 4                           Philidor Defense             SAW Medium Skill
## 5 Blackmar-Diemer Gambit: Pietrowsky Defense             SAW Medium Skill
## 6      Italian Game: Schilling-Kostic Gambit             SAW Medium Skill
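
The commented-out transform() line above hints at storing skill_level as a factor with an explicit level order. A minimal sketch of that idea on a made-up vector follows; the values are illustrative only.

```r
# Made-up skill labels, as they would appear in the skill_level column
skill_level <- c("Low Skill", "Medium Skill", "Medium Skill", "High Skill")

# Fix the order of the levels so tables and plots run Low -> Medium -> High
# instead of alphabetically (High, Low, Medium)
skill_factor <- factor(skill_level,
                       levels = c("Low Skill", "Medium Skill", "High Skill"))

print(levels(skill_factor))
print(table(skill_factor))
```

Note that transform() returns a modified copy, so its result would have to be assigned back to chess4 for the change to persist.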
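
The code below creates chess5, which keeps only the games whose rating difference falls between the Q1 and Q3 values (inclusive).

chess5 <- chess4 %>%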
  filter(rating_difference <= Q3_rating & rating_difference >= Q1_rating)

head(chess5)
##   turns victory_status winner increment_code white_rating black_rating
## 1    16         resign  black           5+10         1322         1261
## 2    61           mate  white           5+10         1496         1500
## 3    61           mate  white           20+0         1439         1454
## 4    95           mate  white           30+3         1523         1469
## 5    33         resign  white           10+0         1520         1423
## 6    66         resign  black           15+0         1439         1392
##   opening_eco                               opening_name opening_ply
## 1         B00     Nimzowitsch Defense: Kennedy Variation           4
## 2         C20      King's Pawn Game: Leonardis Variation           3
## 3         D02     Queen's Pawn Game: Zukertort Variation           3
## 4         C41                           Philidor Defense           5
## 5         D00 Blackmar-Diemer Gambit: Pietrowsky Defense          10
## 6         C50      Italian Game: Schilling-Kostic Gambit           6
##   rating_difference average_rating       advantage
## 1                61         1291.5 White Advantage
## 2                -4         1498.0 Black Advantage
## 3               -15         1446.5 Black Advantage
## 4                54         1496.0 White Advantage
## 5                97         1471.5 White Advantage
## 6                47         1415.5 White Advantage
##                                short_opening advantage_level  skill_level
## 1     Nimzowitsch Defense: Kennedy Variation             SAW    Low Skill
## 2      King's Pawn Game: Leonardis Variation             SAB Medium Skill
## 3     Queen's Pawn Game: Zukertort Variation             SAB Medium Skill
## 4                           Philidor Defense             SAW Medium Skill
## 5 Blackmar-Diemer Gambit: Pietrowsky Defense             SAW Medium Skill
## 6      Italian Game: Schilling-Kostic Gambit             SAW Medium Skill

Next, I want to determine whether I should filter out games in which the two players’ ratings are too far apart, indicating that the players are not evenly matched.

chess4 %>% 
  ggplot(aes(x=white_rating, y=black_rating, color=winner)) +
  geom_point(alpha=0.25) +
  geom_smooth(method = "lm", size=1) +
  labs(title = "Correlation Between Player Ratings in Each Game",
       subtitle = "Colored by Winner",
       x = "White Rating",
       y = "Black Rating",
       caption = "Source: Chess Game Dataset (Lichess)")
## `geom_smooth()` using formula 'y ~ x'

The graph shows a strong positive correlation, indicating that the two players’ skill levels tend to be relatively even: the higher your rating, the higher your opponent’s rating is likely to be. This tells us that matches are typically between two similarly rated players. The trend lines also tell us that in games white wins, white is generally rated higher relative to black than in games black wins. Based on this result, I have determined that it will not be necessary to filter out games in which the players are not evenly matched.

Does white have a first move advantage?

chess4 %>% 
  group_by(winner) %>% 
  summarise(count = n()) %>% 
  mutate(freq = count/sum(count)*100)
## # A tibble: 3 x 3
##   winner count  freq
##   <chr>  <int> <dbl>
## 1 black   7384 45.7 
## 2 draw     719  4.45
## 3 white   8052 49.8

Looking at the above information, we can see that black wins 45.7% of the time, white wins 49.8% of the time, and the game ends in a draw 4.5% of the time. Based solely on these numbers, it doesn’t appear that white has a significant initial advantage; however, there is more to the story, as you will find out in the next section.
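One way to put a number on what "an advantage of significance" might mean is to set the draws aside and ask whether white’s share of decisive games differs from an even split. The sketch below uses only the counts from the table above and base R’s `binom.test()`. It assumes the games are independent and ignores rating differences, so it is a rough check rather than a definitive answer.

```r
# Counts taken from the summary table above
white_wins <- 8052
black_wins <- 7384
decisive   <- white_wins + black_wins   # draws are excluded

# White's share of decisive games
white_share <- white_wins / decisive

# Exact binomial test against an even 50/50 split
bt <- binom.test(white_wins, decisive, p = 0.5)

round(white_share, 3)
bt$conf.int   # 95% confidence interval for white's share
bt$p.value
```

A small p-value here would only say that the gap is unlikely to be pure chance; whether a gap of a few percentage points matters in practice is a separate judgment.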

Plotting the Data

What is the relationship between win rate and a player’s rating advantage?

I will look at games won, compared by basic rating advantage (i.e., black advantage or white advantage).

chess4 %>% 
  filter(rating_difference != 0) %>% 
  ggplot(aes(x = advantage, color = winner, fill = winner)) +
  geom_bar(alpha = 0.7) +
  labs(title = "Games Won by Piece Color", 
       subtitle = "compared by rating advantage",
       x = "Advantage",
       y = "Number of Games",
       caption = "Source: Chess Game Dataset (Lichess)")

In the above graph, I have used the player advantage variable on the x-axis and colored the bars based on the match winner. Black advantage means that the black player had a higher rating than the white player. White advantage means that the white player had a higher rating than the black player. According to the bar graph, the player with the higher rating wins roughly 2 games out of 3, whichever color they are playing.
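The "2 out of 3" figure is read off the bar heights, so it may help to compute the proportions directly. The sketch below does this in base R on only the six preview rows printed by `head()` earlier in this report. The data frame name `preview` is my own, and six rows are far too few to say anything about the full dataset; the same code would be pointed at `chess4` to get the real shares.

```r
# The six preview rows printed by head() earlier in this report
preview <- data.frame(
  winner            = c("black", "white", "white", "white", "white", "black"),
  rating_difference = c(61, -4, -15, 54, 97, 47),
  stringsAsFactors  = FALSE
)

# Mirror the filter used for the plot: drop games with equal ratings
preview <- preview[preview$rating_difference != 0, ]

# Same labelling as the advantage column: a positive difference favors white
preview$advantage <- ifelse(preview$rating_difference > 0,
                            "White Advantage", "Black Advantage")

# Row-wise proportions: within each advantage group, who won?
win_share <- prop.table(table(preview$advantage, preview$winner), margin = 1)
win_share

# Share of games won by the higher-rated player
favored_won <- (preview$advantage == "White Advantage" & preview$winner == "white") |
               (preview$advantage == "Black Advantage" & preview$winner == "black")
mean(favored_won)
```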

Is there a correlation between the size of a player’s advantage and their win rate?

I will look at games won, compared by advantage level (i.e., major advantage or slight advantage).

chess4 %>% 
  filter(advantage_level != "No Advantage") %>% 
  ggplot(aes(x = advantage_level, color = winner, fill = winner)) +
  geom_bar(alpha = 0.7) +
  labs(title = "Games Won by Piece Color", 
       subtitle = "compared by rating advantage level",
       x = "Advantage Level",
       y = "Number of Games",
       caption = "Source: Chess Game Dataset (Lichess)")

Here I have divided the games into 4 advantage levels: major advantage black, slight advantage black, major advantage white, and slight advantage white. The graph above shows that when a player has a major advantage, they can be expected to win nearly 3 out of 4 games. When a player has only a slight advantage, they win only about half of the time. Another interesting observation is that the number of games in which a player had a major advantage is about the same as the number in which a player had a slight advantage.

To look at this again, we can revisit the graph from earlier:

chess4 %>% 
  ggplot(aes(x=white_rating, y=black_rating, color=winner)) +
  geom_point(alpha=0.25) +
  geom_smooth(method = "lm", size=1) +
  labs(title = "Correlation Between Player Ratings in Each Game",
       subtitle = "Colored by Winner",
       x = "White Rating",
       y = "Black Rating",
       caption = "Source: Chess Game Dataset (Lichess)")
## `geom_smooth()` using formula 'y ~ x'

As stated before, the graph shows a strong positive correlation between the two players’ ratings, meaning that matches are typically between two similarly rated players. The trend lines show that white tends to be rated higher in the games white wins than in the games black wins. This is why I decided it was not necessary to filter out games in which the players are not evenly matched. From this graph we can also see that as the skill level of the players goes up, so does the chance of the game ending in a draw. My presumption is that more skilled players are more likely to know how to react to another player’s move with more precision and less randomness.

What if we compare the graphs of Games won by outcome for players with the same rating vs the entire dataset?

chess4 %>% 
  filter(rating_difference == 0) %>% 
  ggplot(aes(x = winner, color = winner, fill = winner)) +
  geom_bar(alpha = 0.7) +
  labs(title = "Games Won by Outcome",
       subtitle = "for players with same rating",
       x = "Outcome",
       y = "Number of Games", 
       caption = "Source: Chess Game Dataset (Lichess)")

chess4 %>% 
  ggplot(aes(x = winner, color = winner, fill = winner)) +
  geom_bar(alpha = 0.7) +
  labs(title = "Games Won by Outcome",
       subtitle = "across all games",
       x = "Outcome",
       y = "Number of Games",
       caption = "Source: Chess Game Dataset (Lichess)")

The graphs above show that the ratio between black victories, draws, and white victories is about the same across the entire dataset as it is for players with the same rating.

Does level of skill have an impact on the outcome?

chess5 <- transform(chess5, skill_level = factor(skill_level, levels = c("Low Skill", "Medium Skill", "High Skill")))

chess5 %>% 
  ggplot(aes(x = skill_level, color = winner, fill = winner)) +
  geom_bar(alpha = 0.7) +
  labs(title = "Games Won by Piece Color", 
       subtitle = "based on skill level",
       x = "Skill Level",
       y = "Number of Games")

The graph above shows that regardless of the players’ skill level (low, medium, or high), the chance of either color winning the game is relatively consistent.

Does the number of turns in the game have an impact on the outcome?

To answer this question I used a ridgeline plot, which shows the distribution of the number of turns in a game for each outcome. I will use two methods to visualize this, though both show essentially the same thing.

chess4 %>% 
  pivot_longer(turns) %>% 
  ggplot(aes(y = winner, x = value, fill = winner)) +
  geom_density_ridges(alpha = 0.7, bandwidth = 7.5) +
  scale_fill_viridis_d(end = 0.75, option = "C") +
  scale_colour_viridis_d(end=0.75, option = "C") + 
  theme_minimal() + 
  theme(legend.position = "none", 
        axis.title.y = element_blank()) +
  labs(title = "Game result",
       subtitle = "based on number of turns",
       x = "Number of Turns the game takes",
       caption = "Source: Chess Game Dataset (Lichess)")

chess4 %>% 
  ggplot(aes(turns, fill = victory_status)) +
  geom_density(alpha=0.3) +
  facet_wrap(~winner, ncol = 1) +
  labs(title = "Game result",
       subtitle = "based on number of turns",
       x = "Number of Turns the game takes",
       caption = "Source: Chess Game Dataset (Lichess)")

From these graphs we can see that winning a game takes roughly the same number of turns for either color, with black having a slightly narrower peak. The narrower peak is likely due to the fact that white, by taking the first move, plays the more offensive side, whereas the black player is reacting to white’s moves. It can also be noted that the longer a game goes on, the more likely it is to end in a draw. This would indicate that the length of a game is a good predictor of the outcome.
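The density plots can be backed up with a simple numeric summary of game length per outcome. The sketch below uses base R’s `tapply()` on the six preview rows printed by `head()` earlier in this report. The data frame name `preview` is my own, and because none of those six games was a draw, it only illustrates the mechanics; it would need to be run on `chess4` to compare draws with decisive games.

```r
# Turns and winners from the six preview rows printed earlier
preview <- data.frame(
  turns            = c(16, 61, 61, 95, 33, 66),
  winner           = c("black", "white", "white", "white", "white", "black"),
  stringsAsFactors = FALSE
)

# Median game length and number of games for each outcome
median_turns <- tapply(preview$turns, preview$winner, median)
games        <- tapply(preview$turns, preview$winner, length)

data.frame(winner       = names(median_turns),
           median_turns = as.vector(median_turns),
           games        = as.vector(games))
```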

Does the number of turns in the opening have an impact on the outcome?

To answer this question I again used a ridgeline plot, this time showing the distribution of the number of turns in the opening for each outcome. I will use two methods to visualize this, though both show essentially the same thing.

chess4 %>% 
  pivot_longer(opening_ply) %>% 
  ggplot(aes(y = winner, x = value, fill = winner)) +
  geom_density_ridges(alpha = 0.7, bandwidth = 0.462) +
  scale_fill_viridis_d(end = 0.75, option = "C") +
  scale_colour_viridis_d(end=0.75, option = "C") + 
  theme_minimal() + 
  theme(legend.position = "none", 
        axis.title.y = element_blank()) +
  labs(x = "Number of Turns the opening takes", 
       caption = "Source: Chess Game Dataset (Lichess)")

chess4 %>% 
  ggplot(aes(opening_ply, fill = victory_status)) +
  geom_density(alpha=0.3) +
  facet_wrap(~winner, ncol = 1) +
  labs(title = "Game result",
       subtitle = "based on number of turns in opening",
       x = "Number of Turns the opening takes",
       caption = "Source: Chess Game Dataset (Lichess)")

There is no major discernible difference in the distribution of the number of turns in the opening across outcomes. This would indicate that the number of moves in an opening is not a good predictor of the outcome of the game.

Does the opening affect the win rate?

To answer this question I will use a stacked bar graph showing, for each opening, the share of games that ended in each outcome.

grouped_games <- chess4 %>% 
  group_by(short_opening, winner) %>% 
  summarise(count=n())
## `summarise()` has grouped output by 'short_opening'. You can override using the `.groups` argument.
grouped_games %>% 
  ggplot(aes(x = short_opening, y = count, fill = winner )) +
  geom_bar(position="fill", stat="identity")+
  theme_minimal()+
  labs(x = "Opening",
       y = "Win Ratio") +
  theme(axis.text.x = element_blank()) + 
  scale_fill_viridis_d(end=0.75, option="C")+
  scale_color_viridis_d(end=0.75, option="C") + 
  labs(caption = "Source: Chess Game Dataset (Lichess)")

This graph shows that while the majority of the openings are mostly balanced, there are openings which heavily favor one side. This would indicate that the opening used is highly predictive of the game’s outcome. However, it should be noted that the graph does not indicate which openings are the most played and which are the least played, so an opening that appears in only a handful of games can look heavily one-sided by chance. It can be assumed that the openings most likely to be played will be balanced for the most part. The reason for this is that in order for an opening to be played it must be accepted by the opponent. If a given opening is not balanced, it is not likely that the opponent will follow the required moves in order to play it through.
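One possible way to account for how often each opening is played is to keep only the openings with a minimum number of games before computing win ratios. The sketch below shows the idea in base R on made-up toy data: the opening names are real, but the rows and the `min_games` cutoff of 5 are invented purely for illustration and are not taken from the dataset.

```r
# Made-up toy data: the rows below are invented for illustration only
games <- data.frame(
  short_opening = c(rep("Sicilian Defense", 6),
                    rep("Italian Game", 5),
                    "Philidor Defense"),
  winner        = c("white", "black", "white", "black", "draw", "white",
                    "white", "black", "black", "white", "white",
                    "black"),
  stringsAsFactors = FALSE
)

min_games <- 5  # hypothetical cutoff; a real analysis would pick this with care

# Count games per opening and keep only the frequently played ones
counts <- table(games$short_opening)
common <- names(counts[counts >= min_games])
kept   <- games[games$short_opening %in% common, ]

# Outcome shares within each remaining opening
win_ratio <- prop.table(table(kept$short_opening, kept$winner), margin = 1)
round(win_ratio, 2)
```

In the toy data, the opening with a single game would otherwise show up as a 100% win for one side, which is exactly the kind of bar the cutoff is meant to remove.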

Does rating have an effect on the outcome of a game?

I would expect the side with a rating advantage to win more frequently, as we saw earlier. To further investigate this relationship I will use a violin plot, which will allow me to check the relationship between a quantitative variable and a categorical variable.

chess7 <- chess4 %>% 
  mutate(outcome = case_when( winner == "white" ~ 1,
                              winner == "black" ~ 0,
                              winner == "draw" ~ 0.5))

chess7 %>% 
  ggplot(aes(x = rating_difference, y = outcome, color = winner, fill = winner)) +
  geom_jitter(width = 0, height = 0.1, alpha = 0.5)+
  geom_violin(alpha = 0.8, color = "white") +
  theme_minimal() +
  labs(x = "Rating Difference ( - Black, + white)",
       y = "Outcome", 
       caption = "Source: Chess Game Dataset (Lichess)") +
  scale_fill_viridis_d(end=0.75, option="C") +
  scale_color_viridis_d(end=0.75, option="C")

There is a large overlap in rating_difference for all 3 winner categories. Overall, this trend is not a surprise considering our previous investigation found that the matchmaking system mostly matches players of similar skill. Additionally, it is not a surprise that white tends to win over black when outranking them, and the other way around. It can also be noted that draws are most likely to occur when players are of similar rating, although there is a slight increase in the number of draws when white has the advantage. Overall, this graph indicates that the rating difference is a good predictor of the outcome of the game.
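For a rough sense of how much a given rating difference should matter, the classical Elo expected-score formula can be applied to rating_difference. It uses the same scale as the outcome variable above: 1 for a white win, 0.5 for a draw, and 0 for a black win. This is only an approximation, since Lichess ratings are computed with Glicko-2 rather than classical Elo, and the function name `expected_score` is my own.

```r
# Classical Elo expected score for white,
# where rating_difference = white_rating - black_rating
expected_score <- function(rating_difference) {
  1 / (1 + 10^(-rating_difference / 400))
}

# Rating differences from the six preview rows printed earlier
diffs <- c(61, -4, -15, 54, 97, 47)

data.frame(rating_difference = diffs,
           expected_white    = round(expected_score(diffs), 3))
```

Comparing these expected scores with the average observed outcome at each rating difference would be one way to see whether white does better than its rating alone suggests.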

Final Analysis

While Chess is a game of tactics and skill, there are some advantages that exist, however slight they may be. I was able to find which variables have an impact on the outcome of the game and which variables don’t seem to have any correlation with it at all. One thing I would have liked is a data set that allows me to look at the variables as they change over time, such as tracking a unique player’s progression in the game. This would allow me to see how each variable changes the outcome in a more controlled environment.

There are 3 phases of a standard chess game, with white getting the first move: the opening, the middlegame, and the endgame. While the middlegame and endgame do play an important role, the opening sets the stage for the flow and development of the rest of the game. A good opening will give a player an advantage when entering the middlegame, prepare pieces to launch an attack, and set the defense all at the same time. Openings play such an important role that many professional players spend years studying and perfecting different openings to see which ones will impact their game.

I was unable to find a first move bias in favor of the white side, or at least not one of significance. I did, however, find that a player of higher skill is more likely to take advantage of getting the first move by having a stronger opening, though that is mostly countered by an equally skilled black player being able to react more effectively.

My findings indicate that the following variables are the best predictors for the outcome of a match:
1. Player Rating
2. Opening Played
3. Number of Turns
4. Rating Difference Between Players
5. Starting side (white or black)

I will likely continue to investigate this data set and would eventually like to build a machine learning model that predicts the winner of a chess match using these findings.
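As a starting point for that future work, a logistic regression is probably the simplest model to try. The sketch below fits one with base R’s `glm()` on simulated data, not on the Lichess games. The simulated rating differences, the Elo-style win probability used to generate outcomes, and the decision to ignore draws are all assumptions made only to show the mechanics.

```r
set.seed(42)

# Simulated data (NOT the Lichess dataset): rating differences and decisive results
n <- 2000
rating_difference <- round(rnorm(n, mean = 0, sd = 200))
p_white   <- 1 / (1 + 10^(-rating_difference / 400))  # assumed Elo-style win probability
white_win <- rbinom(n, size = 1, prob = p_white)      # 1 = white wins, 0 = black wins
sim <- data.frame(rating_difference, white_win)

# Logistic regression: probability that white wins, given the rating difference
fit <- glm(white_win ~ rating_difference, data = sim, family = binomial)
coef(fit)

# Predicted probability of a white win at a few rating differences
new_games <- data.frame(rating_difference = c(-200, 0, 200))
preds <- predict(fit, newdata = new_games, type = "response")
round(preds, 3)
```

On the real data, the other variables listed above (opening, number of turns, average rating) could be added to the formula, and draws would need either a multinomial model or a separate treatment.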