TD-learning, neural
networks, and backgammon
Kimon Tsinteris
David Wilson
Board games are well suited to demonstrating various approaches to artificial intelligence. They offer challenges of great complexity over a well-defined domain. Because the rules of games such as checkers, chess, and backgammon are unambiguous, as is the final outcome, board games provide a good basis for evaluating the performance of different algorithms and techniques.
This paper describes a successful approach to a computer program that plays a good game of backgammon. Because backgammon incorporates a stochastic process, the roll of the dice, the game tree is too large to lend itself to traditional search methods. For this reason, past attempts at computer programs that play backgammon have been largely unsuccessful, and a program that plays the game well is a major milestone. The goal of this project was to recreate that milestone.
The approach taken reproduces work done by Gerald Tesauro in 1991. We re-implemented an early version of his TD-Gammon, a neural-network-based program that successfully learns to play backgammon. A framework for the game of backgammon was written, including the board representation, random dice rolls, and valid move generation. A neural network was trained as an evaluation function whose sole output is the probability of white winning. After each roll of the dice, all valid moves are generated and the resulting board positions evaluated. As white, the program chooses the move that maximizes the evaluated output; as black, it chooses the move that minimizes it.
The truly remarkable aspect of this approach is that the computer program is self-taught. Starting with random initial weights, the neural network is trained through reinforcement learning, in particular using a technique called TD(λ). The network is trained solely through self-play, which leads it to develop its own positional knowledge of the game of backgammon. One of the most exciting consequences of this, as witnessed by Tesauro and the backgammon community as a whole, is the creation of a backgammon player free from the biases of existing human knowledge.
2. Problem Definition and Algorithm
The trained program should act as both an evaluator and a controller, using board evaluations to choose the move that leads to the best value. It will also, therefore, be stateless: the network needs as input only a single board description, and gives the same output for that description regardless of what it has seen before (excepting differences caused by training of the network). The neural network must therefore be embedded within a game-playing framework to achieve our goal of competent play, because the network by itself cannot choose moves. We have, then, the constraints on program design: a neural network evaluation function, embedded within a framework that ensures correct move choice and provides an interface through which the network can play against others.
The constraints above lead
naturally into an overall program architecture. A framework is set up that
tracks changes to the board, rolls dice, and ensures the legality of moves. The
framework is also responsible for generating the stream of moves that the
neural network must evaluate. The framework exhibits no intelligence by itself; it simply finds, and evaluates via the neural network, all legal moves given an initial board position and roll of the dice.
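As a concrete illustration of this division of labor, the short C sketch below shows only the selection step: given the framework's evaluations of every legal move, white takes the maximum and black the minimum. The function name and signature are ours, not code from the original program.

/* Hypothetical sketch of the move-selection step.  The framework has already
 * generated all legal moves and evaluated the resulting positions with the
 * neural network; values[m] is white's estimated winning chance after move m.
 * Assumes at least one legal move exists. */
int choose_move(const double values[], int num_moves, int white_to_play)
{
    int best = 0;
    for (int m = 1; m < num_moves; m++) {
        /* white maximizes the evaluation, black minimizes it */
        if (white_to_play ? values[m] > values[best]
                          : values[m] < values[best])
            best = m;
    }
    return best;   /* index of the move the program plays */
}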
The neural network, then, is responsible for the entirety of the intelligent behavior exhibited by the program. It is a three-layer, fully connected feedforward network that maps a board description onto a single output corresponding, roughly, to the chance that white will win from the evaluated board position. We use a standard sigmoid function at each of the neurons, where a neuron's output is equal to 1/(1 + e^(-sum)) and sum is the weighted sum of the inputs to that neuron. The sigmoid fulfills backpropagation's requirement that the activation function be continuously differentiable, and at the same time keeps the output within the 0-1 range appropriate for an evaluation intended to describe the white player's chance of winning.
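A direct C rendering of this activation (our own helper, not code taken from the original program) is simply:

#include <math.h>

/* Standard sigmoid activation: maps the weighted input sum into (0, 1). */
double sigmoid(double sum)
{
    return 1.0 / (1.0 + exp(-sum));
}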
In designing the inputs to the network, we chose to follow Tesauro's original TD-Gammon program closely, mapping the board position onto an input vector of 198 elements and making no attempt to limit the number of input neurons. There are no hand-crafted features among the inputs; we rely on the network itself for feature extraction and use.
A backgammon board consists of twenty-four points, and each point is represented by eight input neurons: four encode the number of white pieces there and four the number of black pieces. This board representation makes up the first 192 inputs to the network. The encoding is simple: the first neuron is on if there is at least one checker at the given point, the first two are on if there are at least two, and the first three are on if there are three or more. The fourth neuron encodes one-half the number of checkers beyond three present at that point.
The next two inputs indicate the player to move: the first is on for a move by white and the second for a move by black. The two inputs after that encode pieces on the bar, each being one-half the number of checkers on the bar for the corresponding player. Last, the number of pieces already borne off by each player is represented directly in the final two inputs. These are the only inputs whose values are expected to exceed unity with any regularity.
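The C sketch below illustrates the full 198-element encoding described above. The board structure and function names are our own illustrative choices; only the mapping onto the inputs follows the text.

#define NUM_POINTS 24
#define NUM_INPUT  198

/* Hypothetical board representation used only for this illustration. */
struct Board {
    int white[NUM_POINTS];   /* white checkers on each point */
    int black[NUM_POINTS];   /* black checkers on each point */
    int white_bar, black_bar;
    int white_off, black_off;
    int white_to_move;       /* 1 if it is white's turn */
};

/* Four units per player per point, as described in the text. */
static void encode_point(int n, double in[4])
{
    in[0] = (n >= 1) ? 1.0 : 0.0;
    in[1] = (n >= 2) ? 1.0 : 0.0;
    in[2] = (n >= 3) ? 1.0 : 0.0;
    in[3] = (n > 3) ? (n - 3) / 2.0 : 0.0;   /* half the checkers beyond three */
}

void encode_board(const struct Board *b, double x[NUM_INPUT])
{
    int k = 0;
    for (int p = 0; p < NUM_POINTS; p++) {       /* 24 points * 8 units = 192 */
        encode_point(b->white[p], &x[k]); k += 4;
        encode_point(b->black[p], &x[k]); k += 4;
    }
    x[k++] = b->white_to_move ? 1.0 : 0.0;       /* side to move */
    x[k++] = b->white_to_move ? 0.0 : 1.0;
    x[k++] = b->white_bar / 2.0;                 /* half the checkers on the bar */
    x[k++] = b->black_bar / 2.0;
    x[k++] = b->white_off;                       /* checkers borne off, unscaled */
    x[k++] = b->black_off;                       /* k is now 198 */
}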
These 198 input units are fully connected to a hidden layer of 50 units, and the hidden layer is in turn connected to the single output neuron. Each hidden neuron, and the output neuron, also has a bias input whose value is held at unity. The network configuration is therefore described by 10,001 independent weights (198x50 input-to-hidden weights, 50 hidden biases, 50 hidden-to-output weights, and 1 output bias) governing the connections between the neurons.
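A minimal C sketch of the resulting forward pass follows; the array and function names are ours. The weight count works out to 50*(198+1) + (50+1) = 10,001, matching the figure above.

#include <math.h>

#define NUM_INPUT  198
#define NUM_HIDDEN 50

static double w_ih[NUM_HIDDEN][NUM_INPUT + 1];  /* last column is the bias weight */
static double w_ho[NUM_HIDDEN + 1];             /* last entry is the output bias  */

static double sigmoid(double s) { return 1.0 / (1.0 + exp(-s)); }

/* Maps a 198-element board description onto white's estimated chance of winning. */
double evaluate(const double x[NUM_INPUT])
{
    double out = w_ho[NUM_HIDDEN];            /* output bias input held at 1 */
    for (int i = 0; i < NUM_HIDDEN; i++) {
        double s = w_ih[i][NUM_INPUT];        /* hidden bias input held at 1 */
        for (int j = 0; j < NUM_INPUT; j++)
            s += w_ih[i][j] * x[j];
        out += w_ho[i] * sigmoid(s);
    }
    return sigmoid(out);                      /* kept within (0, 1) */
}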
The network was trained using a fully incremental version of TD-backpropagation. Weight changes were calculated after every move except the first, and the changed weights were used in the next move's evaluation. The desired output during backpropagation was set to the neural network's evaluation at the following move (that is, after the opponent had moved). The procedure at each step was:
Given: a weight vector w, an eligibility trace vector e, a learning rate alpha, and a trace decay parameter lambda.
1) Evaluate the board positions reachable from the current roll using the neural network. Choose the move with the highest evaluation (lowest as black) and play it.
2) If this is the end of the game: run the backpropagation procedure with a desired output of 1 or 0, depending on whether white won or lost.
3) Otherwise, if this was not the first move:
a) Evaluate the resulting board position.
b) Calculate the error between the current evaluation and the previous one.
c) Run the backpropagation procedure, using the current evaluation as the desired output and the board position prior to the current move as the input.
End.
Backpropagation procedure:
Given an input vector x and a desired output d.
1) Calculate the error E = d - Y(x) between the desired output and the network's output Y(x).
2) Update the eligibility traces: e = lambda*e + grad(Y), where grad(Y) is the gradient of the network output with respect to each weight.
3) Update the weights: w = w + alpha*E*e.
The components of grad(Y) are:
For the weight between hidden node i and the output node, grad(i) = Y*(1-Y)*activation(i).
For the weight between input node j and hidden node i, grad(j,i) = Y*(1-Y)*weight(i)*activation(i)*(1-activation(i))*input(j).
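The C sketch below shows how such a TD(lambda) update can be implemented for the 198-50-1 network, with one eligibility trace per weight. It assumes a forward pass like the one sketched earlier has already produced the hidden activations and output y for the stored input x; all names are ours, and this is an illustration of the procedure rather than the original code.

#define NUM_INPUT  198
#define NUM_HIDDEN 50

/* One weight and one eligibility trace per connection; the extra slot in
 * each row is the bias, whose input is held at 1. */
static double w_ih[NUM_HIDDEN][NUM_INPUT + 1], e_ih[NUM_HIDDEN][NUM_INPUT + 1];
static double w_ho[NUM_HIDDEN + 1],            e_ho[NUM_HIDDEN + 1];

/* One TD(lambda) step.  x, hidden, and y come from a forward pass over the
 * previous board position; target is the next evaluation, or the final
 * reward (1 or 0) at the end of the game. */
void td_update(const double x[NUM_INPUT], const double hidden[NUM_HIDDEN],
               double y, double target, double alpha, double lambda)
{
    double err = target - y;      /* temporal difference error E */
    double dy  = y * (1.0 - y);   /* derivative of the output sigmoid */

    for (int i = 0; i < NUM_HIDDEN; i++) {
        double hi = hidden[i];
        /* gradient of the output with respect to hidden node i's net input */
        double dh = dy * w_ho[i] * hi * (1.0 - hi);

        /* hidden-to-output weight: e = lambda*e + dY/dw, then w += alpha*E*e */
        e_ho[i] = lambda * e_ho[i] + dy * hi;
        w_ho[i] += alpha * err * e_ho[i];

        /* input-to-hidden weights, including the hidden bias at index NUM_INPUT */
        for (int j = 0; j <= NUM_INPUT; j++) {
            double xj = (j < NUM_INPUT) ? x[j] : 1.0;
            e_ih[i][j] = lambda * e_ih[i][j] + dh * xj;
            w_ih[i][j] += alpha * err * e_ih[i][j];
        }
    }

    /* output bias (its input is held at 1) */
    e_ho[NUM_HIDDEN] = lambda * e_ho[NUM_HIDDEN] + dy;
    w_ho[NUM_HIDDEN] += alpha * err * e_ho[NUM_HIDDEN];
}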
The application of temporal difference learning to game playing has a fairly long history, from Samuel's checkers player through Tesauro's original TD-Gammon program, providing an excellent theoretical framework for our work. Though closely following Tesauro's work to capitalize on this history, our program nonetheless deviates in certain areas to reduce complexity or to solve difficulties that arose in training. The first of these concerns the evaluations produced by the network: while Tesauro expected the network to produce different evaluations depending on which player it was evaluating for, our network always evaluates white's chance of winning, and the external framework takes care of using the highest value for white's move and the lowest for black's. This proved to be the solution to an initial difficulty in which the network quickly saturated to a middle value (approximately 0.5) regardless of the input board position.
Implementing the full eligibility trace algorithm (initial tests used only an approximation) led to the first networks whose results matched those that previous work indicated the network should produce. Subjective evidence of learning was apparent as early as 20,000 games into training, and the network was able to beat the authors consistently by 90,000 games. Testing against implementations using the eligibility trace approximation showed the new network winning over 85% of the games.
3.1 Methodology
The goal of the project was, of course, the creation of a program that could achieve competency in playing backgammon. We chose two ways of evaluating our network's performance at this task: comparison with human expert moves on a set of board positions, and evidence of learning through testing the network against less-trained versions of the weight data. As described above, the program went through several iterations before the current, and best, design was found. It is interesting to note that many of the issues with earlier designs are directly attributable to faults in the random number generators initially used in playing the games; that is, networks trained using the standard C library's rand() function performed noticeably worse than networks trained using a random number generator with a longer period. It is likely that a portion of the poor performance of our initial approximation to TD(λ) can be explained by this, though we did not discover the problem until we had moved to the full implementation of the algorithm and therefore cannot provide more than circumstantial evidence for this.
Much of the advantage that self-play offers as a method of training the neural network lies in the fact that every position evaluated arises from an actual game, rather than from contrived positions that may fail to teach the network about the underlying probabilities or may prevent it from generalizing properly. Given that a backgammon game averages roughly 70 plies, and that more than 100,000 games were played, the network has the advantage of having seen over 7 million board positions, something hardly feasible with a crafted set of training data.
3.2 Results
Though we had originally intended to play the program on FIBS and thus obtain a numerical rating, this turned out to be largely infeasible given the time constraints. We had planned to act as intermediaries between our network's UI and the online opponents. This proved time consuming (setting the dice, playing the opponent's move, setting the dice, playing back our network's move), to the point where opponents got bored and walked away from the game.
We chose two different methods to evaluate the performance of our network as a backgammon player. The first was a round-robin tournament between the same network at various stages of its training. Figure 1 shows that a random strategy, reflected by the network trained on only one game, is not good for playing backgammon; in fact, it is an exceedingly bad strategy, managing to win only a single game of the 1,200 it played against the other contenders. The network after 20,000 games of training did notably better, and a similar improvement can be seen after it has played 50,000 games against itself. After that, the benefits of additional training appear to taper off, with the winner of the tournament being the network after 90,000 games of self-play. We expect that, because of the stochastic nature of the game and the closeness of the results, a much longer tournament would be necessary to determine the true winner with much certainty.
Figure 1: Results of the round-robin tournament between the same network at various stages of its training.
The second method was to compare our fully trained network with current expert opinion on the best opening moves. After 150,000 games of self-play, given the 15 possible opening dice rolls, the network's move matched expert opinion on 8 of the 15, more than half. As can further be seen in Table 1, its play was a partial match to the best move on 5 of the remaining 7 rolls, and on only 2 occasions was its move entirely different from expert opinion.
3.3 Discussion
The results support our initial hypothesis: that a neural network can be made to excel at backgammon. Even though our implementation provided the network with only a "raw" board encoding as input, and no other relevant information about the game of backgammon, it still managed to do a substantial amount of learning.
Dice Roll | Best Move     | Network Move  | Match
2-1       | 24/23, 13/11  | 24/23, 23/21  | Partial
3-1       | 8/5, 6/5      | 8/5, 6/5      | Yes
4-1       | 24/23, 13/9   | 24/20, 6/5    | No
5-1       | 24/23, 13/8   | 24/23, 23/18  | Partial
6-1       | 13/7, 8/7     | 13/7, 8/7     | Yes
3-2       | 24/21, 13/11  | 24/21, 13/11  | Yes
4-2       | 8/4, 6/4      | 8/4, 6/4      | Yes
5-2       | 24/22, 13/8   | 24/22, 13/8   | Yes
6-2       | 24/18, 13/11  | 24/18, 13/11  | Yes
4-3       | 24/20, 13/10  | 24/20, 24/21  | Partial
5-3       | 8/3, 6/3      | 8/3, 6/3      | Yes
6-3       | 24/18, 13/10  | 24/21, 13/7   | No
5-4       | 24/20, 13/8   | 24/20, 13/8   | Yes
6-4       | 24/18, 13/9   | 24/18, 24/20  | Partial
6-5       | 24/18, 18/13  | 24/18, 13/8   | Partial
Table 1
For the 15 possible opening dice rolls, the moves played by our neural network as compared to the moves considered best by today’s experts [2]. The neural network discovered more than half the best moves, and was completely in error on only 2 occasions.
As can be seen in Figure 1, the network made significant progress at learning backgammon through self-play. It started with purely random play, and by 20,000 games had already developed strategies such as hitting the opponent and playing it safe by covering its pieces. Table 1 shows that, with no pre-computed knowledge, the network had by the end of its training discovered many of the same opening moves that experts deem best. In all aspects, the neural network displayed phenomenal success at automatic “feature discovery” in the domain of backgammon.
There are a number of reasons why computer programs have only recently had success at playing good backgammon. At each ply, the probabilistic roll of the dice yields 21 different possibilities, each of which on average allows about 20 legal moves. This makes for a branching factor of roughly 400; compared with a branching factor of 30-40 for chess, backgammon's high number clearly prohibits brute-force search of the game tree. TD-Gammon showed, and this project confirmed, that relying on a neural network's sense of positional knowledge does much better.
Our implementation uses self-play with reinforcement learning to develop its positional judgment. Instead of being shown a handcrafted set of games and imitating expert human opinion, it becomes an expert on its own. In particular, by not relying on human interaction for its training, it is freed from some existing human biases. In fact, because of more advanced versions of TD-Gammon, well-established opening moves in backgammon have been re-examined and abandoned. The backgammon community has now accepted TD-Gammon rollouts as the "most reliable method available for analyzing backgammon plays." [5]
4. Related Work
Previous attempts at writing computer programs that played backgammon had always led to mediocre play. The game's branching factor of several hundred makes brute-force search infeasible, and most commercial programs before 1990 played at a "weak intermediate level" [5].
Neurogammon, also written by Tesauro, was the first successful attempt to teach a neural network backgammon, and in fact was TD-Gammon's precursor. Neurogammon relied on supervised learning: it was trained through backpropagation on a database of games recorded by expert backgammon players, and it managed to achieve a "strong intermediate level of play" [5]. This approach, however, is not infallible. It relies on human knowledge, and sometimes that knowledge is flawed; for instance, human play of backgammon has improved significantly in the past 20 years.
Tesauro also released subsequent versions of TD-Gammon that built upon the initial success. Instead of using only a "raw" board encoding (the number of white and black pieces at each location) as input, these versions added a set of handcrafted features deemed important by experts, including pre-computed information such as the probability of being hit or the strength of a blockade. The final version of TD-Gammon also included a 2-ply search in making its move selection. These additions made TD-Gammon the first computer program to attain strong master-level play, equaling the world's best human players.
5. Further Work
Given further time for investigation, there are several directions for the program that could prove worthwhile. Improvements to the backpropagation algorithm hold the most promise, especially momentum or Q(λ), an enhancement of the temporal difference portion of the algorithm. The most common difficulty we encountered was networks becoming caught in local minima, and efforts to overcome this would decrease the number of initial weight configurations that must be tested to find one that could be called, with any certainty, near-optimal.
More fundamental changes are also worth investigating, such as the addition of extra hidden nodes or the use of crafted features as additional inputs to the network. Preliminary testing with a larger hidden layer (80 nodes) indicated a very large performance hit, prompting the choice of 50 as the final size for the hidden layer. Previous work in this area (especially Tesauro's TD-Gammon), however, has shown that raising the hidden layer to 100 or more nodes can have a beneficial impact on trained performance.
6. Conclusion
Temporal difference learning yielded surprisingly good results in training a neural network as an evaluation function for the game of backgammon. Furthermore, the use of self-play for training appears to have in no way hindered the learning process. Our program clearly shows evidence of feature discovery in the domain of backgammon and, after sufficient training, plays a good game.
These findings, mirrored in our results, show that temporal difference learning applied to neural networks is a viable alternative to traditional search. The success of TD-Gammon not only reshaped master-level play of backgammon, but also sparked renewed interest in the field. TD(λ) is being applied to other board games such as chess and Othello, with initial findings showing some success. Though the results are not as dramatic as in backgammon, temporal difference learning appears to be a good complement to the more traditional search techniques employed today.
References
1. Barto, A. G. and Sutton, R. S. Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA, 1998.
2. Damish, M. Backgammon -- Frequently Asked Questions. http://www.cybercom.net/~damish/backgammon/bg-faq.html, 1995.
3. Press, W. H., Teukolsky, S. A., Vetterling, W. T., and Flannery, B. P. Numerical Recipes in C. Cambridge University Press, Cambridge, 1992.
4. Russell, S. and Norvig, P. Artificial Intelligence: A Modern Approach. Prentice Hall, Upper Saddle River, NJ, 1995.
5. Tesauro, G. Temporal Difference Learning and TD-Gammon. Communications of the ACM, 38(3):58-68, 1995.
6. Tesauro, G. Practical Issues in Temporal Difference Learning. Machine Learning, 8(3-4):257-277, 1992.
7. Tesauro, G. and Sejnowski, T. J. A Parallel Network that Learns to Play Backgammon. Artificial Intelligence, 39(3):357-390, 1989.