The Learning Algorithm
The goal of the learning process is to find
the set of 8 values (w1 through w8 in the evaluation function) that corresponds to optimal
Tetris play. This amounts to searching the 8-dimensional weight space for the
global maximum in performance. Since the path by which the solution is reached
is irrelevant, an iterative improvement algorithm seemed practical.
The program uses a modified version of the
Random-Restart Hill-Climbing (RRHC) algorithm. In RRHC, a series of
hill-climbing searches (each continually moving in the direction of increasing
value) is conducted from randomly generated initial states. The best result
obtained from any of the searches is then taken as the best solution found.
There are a few well-known drawbacks to RRHC.
One is getting stuck on local maxima, which is addressed here by a
"slack" parameter, discussed below. Another is the possibility of
oscillating from side to side on a gently sloping peak. To mitigate this, the
granularity (step size) of the search is also specified by a user-controlled
parameter, allowing the search to be refined.
Note: While playing games in the learning
process, I disabled the Next Tetrad Preview, so the shallow search is
always executed. The only reason for doing this was to have a quicker learning
process. Since the player performs significantly worse without the preview, the
games end much faster. This results in significant time savings over the
thousands of games typically played during the learning algorithm. In addition,
this is valid because the optimal strategy (weight setting) for the shallow
search is also the optimal strategy for the deep search. This can be seen in
the Results section.
The algorithm uses several parameters set by
the user. These are:
R: Total number of runs. When completed, the user will be presented with R "optimal" weight settings, one obtained from each of the various starting points in the search space.

P: Total number of passes through the weights. (See algorithm below.)

N: Number of games to play at each weight setting. Performance at each weight setting is based on averaging the performance over N games.

STEPSIZE: The granularity of the search: how many units to increase/decrease each weight at each step.

SLACK: When the current performance is at most SLACK % lower than the best performance so far, continue searching in the same direction. The idea is to "jump off" of local maxima; see Figure 4 below. Additionally, due to the random nature of Tetris, performance that is slightly worse over one set of N games may turn out to be equal or slightly better over another N games. The slack allows roughly equivalent performance to be treated as equivalent.
Figure 4. Jumping off of a local maximum. The slack is represented
by the yellow areas. Although the two blue points represent worse performance
than the best seen so far, they lie within the SLACK %, allowing the search to
discover points of higher performance. The larger the SLACK %, the less likely
the search will get stuck on a local maximum. However, a larger SLACK % also
corresponds to a longer search time.
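The SLACK test can be sketched as a small predicate. This is a sketch with hypothetical names, not the program's actual code:

```python
def within_slack(avg_lines: float, best_avg_lines: float, slack_pct: float) -> bool:
    """True when the current average lines is at most slack_pct % worse than
    the best average seen so far, so the search continues in the same
    direction instead of stopping at a possible local maximum."""
    return avg_lines >= best_avg_lines * (1.0 - slack_pct / 100.0)

# With SLACK = 10, an average of 95 lines against a best of 100 is within
# slack, but 85 is not:
print(within_slack(95.0, 100.0, 10.0))   # True
print(within_slack(85.0, 100.0, 10.0))   # False
```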
PERFORMANCE: Performance is represented by a performance vector p = (Avg. Lines, Avg. Score, Avg. Tetrads), ordered lexicographically. That is, (a, b, c) > (d, e, f) if and only if (a > d), or (a = d and b > e), or (a = d and b = e and c > f). The averages are calculated over N games.

Why measure performance this way? Scoring varies among versions of Tetris, so it is difficult to know how well one plays from the Tetris score alone. The count of lines completed, on the other hand, remains consistent. (It still is not perfect, since the speed at which tetrads drop can differ among versions.) Thus, the number of lines completed is considered most important, followed by the number of points scored. In the unlikely event of a tie in the first two numbers, the number of tetrads is used to determine which performance is better: the more tetrads used, the longer the player remained "alive," and the better the player performed.

A distinction is sometimes made between "playing for points" and "playing for lines." Since more points are awarded for completing multiple lines simultaneously, "playing for points" means taking riskier moves with the goal of getting a high score, while "playing for lines" means being perfectly content completing one line at a time. Although the goal of this program is to maximize the number of lines, an increase in average points always accompanied an increase in average lines in the learning process (see the Results section).
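Since the vector is ordered lexicographically, a language with built-in tuple comparison expresses this ordering directly. For example, in Python (illustrative values only):

```python
# Performance vectors compare lexicographically: average lines first,
# then average score, then average tetrads. Python tuples already
# implement exactly this ordering.
p1 = (120.4, 35210.0, 310.2)   # (avg. lines, avg. score, avg. tetrads)
p2 = (120.4, 36890.0, 305.7)

# Equal lines, so the tie is broken by average score:
assert p2 > p1

# More lines always wins, regardless of score or tetrads:
p3 = (121.0, 30000.0, 290.0)
assert p3 > p2
```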
In the algorithm below, the best performance
exhibited up to the current time is stored in p*. Each weight wi
is continually updated, with initial and optimal values stored in temporary
variables w0 and w*, respectively.
Each run ends with an "optimal" p*
and corresponding set of weights. It is up to the user to identify the best p*
overall.
"MODIFIED RRHC" ALGORITHM
For each run (1 to R):
|
1. |
Initialize weights wi, 1 £ i £ 8 to random numbers between 0 and 600. |
|
2. |
Play N games and calculate performance p. Set p* ¬ p. |
|
For each pass (1 to P): |
|||
|
For each weight wi (i ¬ 1 to 8): |
|||
|
1. |
Set w0 ¬ wi and w* ¬ wi. |
||
|
2. |
Increase wi by STEPSIZE. |
||
|
3. |
Play N games and calculate performance p. |
||
|
4. |
If p > p* (lexicographically),
then the current performance is the best so far. Set p* ¬ p and w* ¬ wi. Go to step 2. |
||
|
5. |
If (avg. lines in p) ³ (SLACK % worse than avg. lines in p*), go to step 2. |
||
|
6. |
At this step, avg. lines in p is over SLACK % worse than the best. Set wi ¬ w*. |
||
|
7. |
If wi ¹ w0, then increasing wi improved performance. Go to Step 13. |
||
|
8. |
Decrease wi by STEPSIZE. |
||
|
9. |
Play N games and calculate performance p. |
||
|
10. |
If p > p* (lexicographically),
then the current performance is the best so far. Set p* ¬ p and w* ¬ wi. Go to step 8. |
||
|
11. |
If (avg. lines in p) ³ (SLACK % worse than avg. lines in p*), go to step 8. |
||
|
12. |
At this step, avg. lines in p is over SLACK % worse than the best. Set wi ¬ w*. |
||
|
13. |
Adjustment of wi is now complete. |
||
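The loop above can be sketched in Python. This is a sketch under assumptions, not the program's actual code: evaluate stands in for "play N games and average the results" and must return an (avg. lines, avg. score, avg. tetrads) tuple, which Python orders lexicographically; all names are illustrative.

```python
import random

NUM_WEIGHTS = 8

def modified_rrhc(evaluate, R, P, STEPSIZE, SLACK):
    """Modified RRHC as listed above.  Returns one (p*, weights) pair per run;
    picking the best p* overall is left to the caller (the "user")."""
    results = []
    for _ in range(R):                                    # each run
        w = [random.uniform(0.0, 600.0) for _ in range(NUM_WEIGHTS)]
        p_best = evaluate(w)                              # p* <- p
        for _ in range(P):                                # each pass
            for i in range(NUM_WEIGHTS):                  # each weight
                for step in (+STEPSIZE, -STEPSIZE):       # increase, then decrease
                    w0 = w_star = w[i]                    # step 1
                    while True:
                        w[i] += step                      # steps 2 / 8
                        p = evaluate(w)                   # steps 3 / 9
                        if p > p_best:                    # steps 4 / 10: new best
                            p_best, w_star = p, w[i]
                        elif p[0] < p_best[0] * (1.0 - SLACK / 100.0):
                            w[i] = w_star                 # steps 6 / 12: back up
                            break
                        # otherwise within slack (steps 5 / 11): keep stepping
                    if w[i] != w0:                        # step 7: increasing helped,
                        break                             # so skip the decrease phase
        results.append((p_best, list(w)))
    return results
```

One design note: because performance is a tuple, the single comparison p > p_best implements the full lexicographic rule (lines, then score, then tetrads) with no extra code.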