The Learning Algorithm
The goal of the learning process is to find
the set of 8 values (w1 through w8 in the evaluation function) that corresponds to optimal
Tetris play. This amounts to searching the 8-dimensional weight space for the
global maximum in performance. Since the path by which the solution is reached
is irrelevant, an iterative improvement algorithm seemed practical.
The program uses a modified version of the
Random-Restart Hill-Climbing (RRHC) algorithm. In RRHC, a series of
hill-climbing searches (each continually moving in the direction of increasing
value) is conducted from randomly generated initial states. The best result
obtained from any of the searches is then taken as the best solution found.
There are a few well-known drawbacks to RRHC.
One is getting stuck on local maxima, which is addressed here by a
"slack" parameter, discussed below. Another is the possibility of
oscillating from side to side on a gently sloping peak. To mitigate this, the
granularity (step size) of the search is also specified by a user-controlled
parameter, allowing the search to be refined.
Note: While playing games in the learning
process, I disabled the Next Tetrad Preview, so the shallow search is
always executed. The only reason for doing this was to have a quicker learning
process. Since the player performs significantly worse without the preview, the
games end much faster. This results in significant time savings over the
thousands of games typically played during the learning algorithm. In addition,
this is valid because the optimal strategy (weight setting) for the shallow
search is also the optimal strategy for the deep search. This can be seen in
the Results section.
The algorithm uses several parameters set by
the user. These are:
R: Total number of runs. When completed, the user will be presented with R "optimal" weight settings, one obtained from each of the various starting points in the search space.

P: Total number of passes through the weights. (See algorithm below.)

N: Number of games to play at each weight setting. Performance at each weight setting is based on averaging the performance over N games.

STEPSIZE: The granularity of the search: how many units to increase/decrease each weight at each step.

SLACK: When the current performance is at most SLACK % lower than the best performance so far, continue searching in the same direction. The idea is to "jump off" of local maxima; see Figure 4 below. Additionally, due to the random nature of Tetris, performance that is slightly worse over one set of N games may turn out to be equal or slightly better over another N games. The slack allows roughly equivalent performance to be treated as equivalent.
Figure 4. Jumping off of a local maximum. The slack is represented
by the yellow areas. Although the two blue points represent worse performance
than the best seen so far, they lie within the SLACK %, allowing the search to
discover points of higher performance. The larger the SLACK %, the less likely
the search will get stuck on a local maximum. However, a larger SLACK % also
corresponds to a longer search time.
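The SLACK test can be sketched as a small predicate. This is a sketch with hypothetical names, not the program's actual code:

```python
def within_slack(avg_lines: float, best_avg_lines: float, slack_pct: float) -> bool:
    """True when the current average lines is at most slack_pct % worse than
    the best average seen so far, so the search continues in the same
    direction instead of stopping at a possible local maximum."""
    return avg_lines >= best_avg_lines * (1.0 - slack_pct / 100.0)

# With SLACK = 10, an average of 95 lines against a best of 100 is within
# slack, but 85 is not:
print(within_slack(95.0, 100.0, 10.0))   # True
print(within_slack(85.0, 100.0, 10.0))   # False
```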
PERFORMANCE: Performance is represented by a performance vector p = (Avg. Lines, Avg. Score, Avg. Tetrads), ordered lexicographically. That is, (a, b, c) > (d, e, f) if and only if (a > d), or (a = d and b > e), or (a = d and b = e and c > f). The averages are calculated over N games.

Why measure performance this way? Scoring varies among versions of Tetris, so it is difficult to know how well one plays from the Tetris score alone. The count of lines completed, on the other hand, remains consistent. (It still is not perfect, since the speed at which tetrads drop can differ among versions.) Thus, the number of lines completed is considered most important, followed by the number of points scored. In the unlikely event of a tie in the first two numbers, the number of tetrads is used to determine which performance is better: the more tetrads used, the longer the player remained "alive," and the better the player performed.

A distinction is sometimes made between "playing for points" and "playing for lines." Since more points are awarded for completing multiple lines simultaneously, "playing for points" means taking riskier moves with the goal of getting a high score, while "playing for lines" means being perfectly content completing one line at a time. Although the goal of this program is to maximize the number of lines, an increase in average points always accompanied an increase in average lines in the learning process (see the Results section).
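Since the vector is ordered lexicographically, a language with built-in tuple comparison expresses this ordering directly. For example, in Python (illustrative values only):

```python
# Performance vectors compare lexicographically: average lines first,
# then average score, then average tetrads. Python tuples already
# implement exactly this ordering.
p1 = (120.4, 35210.0, 310.2)   # (avg. lines, avg. score, avg. tetrads)
p2 = (120.4, 36890.0, 305.7)

# Equal lines, so the tie is broken by average score:
assert p2 > p1

# More lines always wins, regardless of score or tetrads:
p3 = (121.0, 30000.0, 290.0)
assert p3 > p2
```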
In the algorithm below, the best performance
exhibited up to the current time is stored in p*. Each weight wi
is continually updated, with initial and optimal values stored in temporary
variables w0 and w*, respectively.
Each run ends with an "optimal" p*
and corresponding set of weights. It is up to the user to identify the best p*
overall.
"MODIFIED RRHC" ALGORITHM
For each run (1 to R):
|
1. |
Initialize weights wi, 1 £ i £ 8 to random numbers between 0 and 600. |
|
2. |
Play N games and calculate performance p. Set p* ¬ p. |
|
For each pass (1 to P): |
|||
|
For each weight wi (i ¬ 1 to 8): |
|||
|
1. |
Set w0 ¬ wi and w* ¬ wi. |
||
|
2. |
Increase wi by STEPSIZE. |
||
|
3. |
Play N games and calculate performance p. |
||
|
4. |
If p > p* (lexicographically),
then the current performance is the best so far. Set p* ¬ p and w* ¬ wi. Go to step 2. |
||
|
5. |
If (avg. lines in p) ³ (SLACK % worse than avg. lines in p*), go to step 2. |
||
|
6. |
At this step, avg. lines in p is over SLACK % worse than the best. Set wi ¬ w*. |
||
|
7. |
If wi ¹ w0, then increasing wi improved performance. Go to Step 13. |
||
|
8. |
Decrease wi by STEPSIZE. |
||
|
9. |
Play N games and calculate performance p. |
||
|
10. |
If p > p* (lexicographically),
then the current performance is the best so far. Set p* ¬ p and w* ¬ wi. Go to step 8. |
||
|
11. |
If (avg. lines in p) ³ (SLACK % worse than avg. lines in p*), go to step 8. |
||
|
12. |
At this step, avg. lines in p is over SLACK % worse than the best. Set wi ¬ w*. |
||
|
13. |
Adjustment of wi is now complete. |
||
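The loop above can be sketched in Python. This is a sketch under assumptions, not the program's actual code: evaluate stands in for "play N games and average the results" and must return an (avg. lines, avg. score, avg. tetrads) tuple, which Python orders lexicographically; all names are illustrative.

```python
import random

NUM_WEIGHTS = 8

def modified_rrhc(evaluate, R, P, STEPSIZE, SLACK):
    """Modified RRHC as listed above.  Returns one (p*, weights) pair per run;
    picking the best p* overall is left to the caller (the "user")."""
    results = []
    for _ in range(R):                                    # each run
        w = [random.uniform(0.0, 600.0) for _ in range(NUM_WEIGHTS)]
        p_best = evaluate(w)                              # p* <- p
        for _ in range(P):                                # each pass
            for i in range(NUM_WEIGHTS):                  # each weight
                for step in (+STEPSIZE, -STEPSIZE):       # increase, then decrease
                    w0 = w_star = w[i]                    # step 1
                    while True:
                        w[i] += step                      # steps 2 / 8
                        p = evaluate(w)                   # steps 3 / 9
                        if p > p_best:                    # steps 4 / 10: new best
                            p_best, w_star = p, w[i]
                        elif p[0] < p_best[0] * (1.0 - SLACK / 100.0):
                            w[i] = w_star                 # steps 6 / 12: back up
                            break
                        # otherwise within slack (steps 5 / 11): keep stepping
                    if w[i] != w0:                        # step 7: increasing helped,
                        break                             # so skip the decrease phase
        results.append((p_best, list(w)))
    return results
```

One design note: because performance is a tuple, the single comparison p > p_best implements the full lexicographic rule (lines, then score, then tetrads) with no extra code.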