mercoledì 22 agosto 2007

How to test a chess engine?

A common problem in computer chess is to test a chess engine. We can test for strength or for bugs, but it always is an hard task and you need planning, math and time.

Testing for bugs

I'll write about perft and divide later in another column, here only some suggestion about your engine:

  1. Use lots of assertions; use default in your 'switch-case'; use maximum level of warning in your compiler and try to verify every alert message : Fruit is a very good example of this style of programming.

  2. "Premature optimization is the root of all evil": first write a chess engine that works than you can write a chess engine that is fast!

  3. Write routines twice. First a slow simple one and later a fastest-optimized one: you can test the second one with the first (very useful for moves generator and make move).

  4. Write a perft function.

  5. Write a 'flip' function: a function that switches the board around so that white is black and black is white. Does your program score the position the same? Believe me it is very useful to find bug in evaluation.

  6. Write a chess engine that can play not only under time constrictions, but even for fixed depth and for fixed node: they are very useful for debugging and proving the value of certain algorithms.

  7. Put in lots of switches. Whenever you detect a problem it is often useful to turn off hash tables, selectivity, quies tricks, etc and find out who the culprit is. This is quite useful sometimes.

  8. Use a versioning system for your source code.

Testing for strength

You have written a new version of your chess engine and now you want to know if the new version is strongest than the old one: how? Usually we can use collection of positions (test-suites) and test-matches.

Test-suites

A test-suite is a collection of chess positions with a 'best move' or an 'avoid move' for each position. You can find a lot of test-suite like WAC, bt2630, lapuce... Test-suites aren't a good way to test strength of a chess engine: when you optimize your engine to have a good result with test-suites you are using only a very limited number of chess positions that it isn't a good sample of all chess positions that an engine can see in his life.

You can use test-suites like a fast "sanity check": if your chess engine changes its perfomance dramatically against a test-suite maybe there is a bug.

Test-matches

A test-match is the best way (the only?) to test strength of a chess engine. You can organize a tournament between your new chess engine, your old one and some others chess engines. Now the only problem is to know how many games you need to get accurate results.

In computer chess forum you can find post like "After 5 games between program X and Y: X wins for 3-2 so X is stronger than Y". This post is simply wrong and useless because 'test' done in this way has no significance from a statistical point of view! If you repeat the same 'test-match' with same chess engines and same computer you can have very different result every time you make this 'test'. There are a lot of things that can modify a chess match result between two engine. In an old post on CCC, Christophe Theron (Chess Tiger's author) wrote:

The first experiments I did were self-test program A against A (that is EXACTLY the same program playing both sides). I use a set of predefined openings, each program playing white, then black in each opening. 10 openings give 20 games for example. What a surprise: self-playing a program against itself in these conditions almost never gives the expected 50% result!
What should have been a simple procedure then becomes a nightmare.
First of all, why doesn't the match end in a 50% result???
Answer: because of timing errors. I use the PC internal timer to measure the thinking time, and the clock precision is 1/18.2 second (roughly). A search beginning right after a clock tick is different from the samesearch started just BEFORE a clock tick, the difference being themeasured time. So a chess engine could sometimes decide to continue asearch, and sometimes decide to stop it, randomly...
Second point: OK, I understand why the result is not exactly 50%.
But shouldn't it be close to 50%? And how close? Answer: make several matches between exactly the same programs, and write down the minimumand maximum score you get. Ok, good idea. Let's try... Well the results range from 39% to 58% when I do 20 games matches. When I try 24 games matches, I get 46%-56% In order to stop the nightmare, I now do 28 games matches (14 openings), and decide the result is significant if it is below or equal to 40%, or above or equal to 60%. The problem is that a change has to be quite good to win a match by 60% or more! If it is a slight improvement, chances are you never notice it by self-play!

And Don Dailey (CilkChess):

When I test 2 identical versions I always get exactly a 50-50 result: I designed my program to be completely deterministic (in these kind of games) and this has also proven to be a good debugging aid. In my deterministic mode I clear hash tables before every move, as well as killer and history information and anything that will prevent it from being deterministic. So when I implement many algorithm (like fast check testing again) the program better look at exactly the same number of nodes or something is wrong.

A chess match is a ''random event": we can't know early how a chess match end but we are sure that a chess match between A and B can end in 3 ways; A win, B win, draw with probability P1, P2 and P3. Where P1+P2+P3=1 (100%). I don't want to write a statistical essay so i'll try to be simplest as i can. We don't know P1, P2, P3 values (you can imagine that P1=P2=P3=33% but it is a supposition only) so to have accurate results about your engine you need a test-match of infinite games and it is impossible to do! Statistical helps us: if you run a test match of N games and you know that your chess engine have won x% of games you know that it isn't a true score but you can calculate a range (centered in x) in which you are quite sure that the real score is. We call this range 'confidence interval'. We call it in this way because we aren't sure that true score is in the range but we can say that 'with a probability (or confidence) of M% true score is in this range'; where M% is usually 95%, 99%, 99,9%. How to do that: we have a score, we calculate the possible error (standard error) so true score is:

score - z * standard error '<' true score '<' score + z * standard error

Where z is a number that we choose to have different level of confidence: we have a z to have a 95% of confidence, a z to have a 99% of confidence, a z to have a 99.9% of confidence.

You can calculate a confidence interval in a lot of way, here i suggest a way to do that because i believe it is more useful: we will use relative frequency (in this way we haven't to know variance of population: P1, P2, P3) and we will calculate each relative frequency separately: in this way we can use a binomial distribution instead of a trinomial one.

The formula is:

range = p ± Z *SQRT ((p*q)/(n-1))

p:= relative frequency
q:= 1-p
Z:= 1.96 if we want a 95% of confidence or 2.58 for 99% of confidence
SQRT := square root
n:= number of games played

[for more information W. G. Cochran Sampling Techniques, 3rd ed. John Wiley, New York, 428 pp.]

You can notice that:

  • The standard error ( SQRT ((p*q)/(n-1)) ) becomes smaller when n becomes bigger that is when you play more game. With the same level of confidence you have a smaller range if your engines plays more games.
  • With same standard error if you use a bigger confidence (99% instead of 95%) you have a bigger range.

How to use this formula?

Example 1. In a test match of 200 games between my new chess engine (A) and my old chess engine (B) i have this result: 70 games won by A (35%), 64 games won by B (32%), 66 draw. So A has won more games than B: is A better than B?

Let's calculate a confidence interval for the relative frequency of the games won by A

n = 200
pA= games won by A/n = 70/200 = 0.35
q = 1-pA = 1-0.35 = 0.65

Lower Limit = pA - Z *SQRT ((pA*q)/(n-1)) = 0.35 - 1.96 * SQRT ((0.35*0.65)/(200-1)) = 0.35 - 0.07 = 0.28
Upper Limit = pA + Z *SQRT ((pA*q)/(n-1)) = 0.35 + 0.07 = 0.42

so the true score of A, with a 95% of confidence, is between 0.28 and 0.42 (28%, 48%)

now we must calculate the same for B

n = 200
pB= games won by B/n = 64/200 = 0.32
q = 1-pB = 1-0.32 = 0.68

Lower Limit = pB - Z *SQRT ((pB*q)/(n-1)) = 0.32 - 1.96 * SQRT ((0.32*0.68)/(200-1)) = 0.32 - 0.06 = 0.26
Upper Limit = pB + Z *SQRT ((p*q)/(n-1)) = 0.32 + 0.06 = 0.38

so the true score of B, with a 95% of confidence, is between 0.26 and 0.38: we can't say that A is better than B because the worst possible result of A (0.28) is lower than the best possible result of B (0.38).

If we had smaller range maybe we could say if A was better than B. To have small confidence interval we need more games.

Example 2. In a test match of 4500 games between my new chess engine (A) and my old chess engine (B) i have this result: 35% games won by A, 32% games won by B, 33% draw. So A has won more games than B: is A better than B?

A:
n = 4500 ; pA= 0.35 ; q = 1-pA = 1-0.35 = 0.65
Lower Limit = pA - Z *SQRT ((pA*q)/(n-1)) = 0.35 - 1.96 * SQRT ((0.35*0.65)/(4500-1)) = 0.35 - 0.014 = 0.336
Upper Limit = pA + Z *SQRT ((pA*q)/(n-1)) = 0.35 + 0.014 = 0.364

B:
n = 4500 ; pB= 0.32 ; q = 1-pB = 1-0.32 = 0.68
Lower Limit = pB - Z *SQRT ((pB*q)/(n-1)) = 0.32 - 1.96 * SQRT ((0.32*0.68)/(4500-1)) = 0.32 - 0.013 = 0.307
Upper Limit = pB + Z *SQRT ((p*q)/(n-1)) = 0.32 + 0.013 = 0.333

So after 4500 games we can say that A is better than B with 95% of confidence

You now understand that it is very difficult measure little improvements because we need a lot of games, but with relative frequency you can project a test match ad hoc. You must only choose how your range must be.

Example 3. You want to do a test match between your new chess engine and your old chess engine. You want calculate how many games you need to have that the difference between Upper Limit and Lower Limit is 0.04 (4%): in this way if you have a score like 30% for A and 35% for B (or 40% vs 45% and so on...) you can say who is better

We put ourselves in the worst case that is when variance (n * p * q) is at its maximum and it is when p = 0.5. We use this formula:

n. of games = ((Z^2) * p * q)/(b^2))

[W. G. Cochran Sampling Techniques, 3rd ed. John Wiley, New York]

where b = (Upper Limit - Lower Limit) /2

so n= ((1.96^2) * 0.5 *0.5)/(0.020^2) = 2401 games with 95% of confidence

2400 are a lot of games, but here we are in the worst case: after 2400 games we are confident that A is better than B or not. If we know (from a past experiment) how much p can be we can use this information to calculate a smaller n. For example if you know that your engine score is about 30% you need only

n= ((1.96^2) * 0.3 *0.7)/(0.020^2) = 2016,84 = 2017 games with 95% of confidence (about 20% less games)

Now you know that to measure strength of your chess engine you need a lot of games: how to do a test match and how many time you need?

  1. Use a dedicated test machine (an old computer is good) instead of your main PC.
  2. Use one of wb tournament manager like PSWBTM.
  3. Disable book.
  4. Use Nunn positions, Silver Positions or Daley positions as starting point.
  5. For every positions play games like White and like Black.
  6. Disable ponder.
  7. Disable learning.
  8. Play a lot of games (thousands, not hundreds)!!!

How long does a 2000 games match takes to get accurate results?

Don't use tournament time controls: it is a waste of time. Try to use a time control that allow your engine to search at least 3 ply: 1' + 1" or 2' + 2" time control is a good choice.

Many of you have mailed me to say thank you about this column. Thank you too! But there's a better way to say "Thanks" by adding a link to this blog in your blog/website

0 commenti:

Posta un commento

Per favore non dimenticate di inserire il vostro nome o nickname e un riferimento al vostro blog... non amo i commenti anonimi per cui firmatevi. Per evitare spam i commenti ai post più vecchi di 15 giorni sono moderati: non verranno, dunque, visualizzati immediatamente.