How to test a chess engine?
A common problem in computer chess is to test a chess engine. We can test for strength or for bugs, but it always is an hard task and you need planning, math and time. Testing for strength You have written a new version of your chess engine and now you want to know if the new version is strongest than the old one: how? Usually we can use collection of positions (test-suites) and test-matches. Test-suites A test-suite is a collection of chess positions with a 'best move' or an 'avoid move' for each position. You can find a lot of test-suite like WAC, bt2630, lapuce... Test-suites aren't a good way to test strength of a chess engine: when you optimize your engine to have a good result with test-suites you are using only a very limited number of chess positions that it isn't a good sample of all chess positions that an engine can see in his life. You can use test-suites like a fast "sanity check": if your chess engine changes its perfomance dramatically against a test-suite maybe there is a bug. Test-matches A test-match is the best way (the only?) to test strength of a chess engine. You can organize a tournament between your new chess engine, your old one and some others chess engines. Now the only problem is to know how many games you need to get accurate results. In computer chess forum you can find post like "After 5 games between program X and Y: X wins for 3-2 so X is stronger than Y". This post is simply wrong and useless because 'test' done in this way has no significance from a statistical point of view! If you repeat the same 'test-match' with same chess engines and same computer you can have very different result every time you make this 'test'. There are a lot of things that can modify a chess match result between two engine. In an old post on CCC, Christophe Theron (Chess Tiger's author) wrote: The first experiments I did were self-test program A against A (that is EXACTLY the same program playing both sides). I use a set of predefined openings, each program playing white, then black in each opening. 10 openings give 20 games for example. What a surprise: self-playing a program against itself in these conditions almost never gives the expected 50% result! And Don Dailey (CilkChess): When I test 2 identical versions I always get exactly a 50-50 result: I designed my program to be completely deterministic (in these kind of games) and this has also proven to be a good debugging aid. In my deterministic mode I clear hash tables before every move, as well as killer and history information and anything that will prevent it from being deterministic. So when I implement many algorithm (like fast check testing again) the program better look at exactly the same number of nodes or something is wrong. A chess match is a ''random event": we can't know early how a chess match end but we are sure that a chess match between A and B can end in 3 ways; A win, B win, draw with probability P1, P2 and P3. Where P1+P2+P3=1 (100%). I don't want to write a statistical essay so i'll try to be simplest as i can. We don't know P1, P2, P3 values (you can imagine that P1=P2=P3=33% but it is a supposition only) so to have accurate results about your engine you need a test-match of infinite games and it is impossible to do! Statistical helps us: if you run a test match of N games and you know that your chess engine have won x% of games you know that it isn't a true score but you can calculate a range (centered in x) in which you are quite sure that the real score is. We call this range 'confidence interval'. We call it in this way because we aren't sure that true score is in the range but we can say that 'with a probability (or confidence) of M% true score is in this range'; where M% is usually 95%, 99%, 99,9%. How to do that: we have a score, we calculate the possible error (standard error) so true score is: score - z * standard error '<' true score '<' score + z * standard error Where z is a number that we choose to have different level of confidence: we have a z to have a 95% of confidence, a z to have a 99% of confidence, a z to have a 99.9% of confidence. You can calculate a confidence interval in a lot of way, here i suggest a way to do that because i believe it is more useful: we will use relative frequency (in this way we haven't to know variance of population: P1, P2, P3) and we will calculate each relative frequency separately: in this way we can use a binomial distribution instead of a trinomial one. The formula is: range = p ± Z *SQRT ((p*q)/(n-1)) p:= relative frequency [for more information W. G. Cochran Sampling Techniques, 3rd ed. John Wiley, New York, 428 pp.] You can notice that: How to use this formula? Example 1. In a test match of 200 games between my new chess engine (A) and my old chess engine (B) i have this result: 70 games won by A (35%), 64 games won by B (32%), 66 draw. So A has won more games than B: is A better than B? Let's calculate a confidence interval for the relative frequency of the games won by A n = 200 Lower Limit = pA - Z *SQRT ((pA*q)/(n-1)) = 0.35 - 1.96 * SQRT ((0.35*0.65)/(200-1)) = 0.35 - 0.07 = 0.28 so the true score of A, with a 95% of confidence, is between 0.28 and 0.42 (28%, 48%) now we must calculate the same for B n = 200 Lower Limit = pB - Z *SQRT ((pB*q)/(n-1)) = 0.32 - 1.96 * SQRT ((0.32*0.68)/(200-1)) = 0.32 - 0.06 = 0.26 If we had smaller range maybe we could say if A was better than B. To have small confidence interval we need more games. Example 2. In a test match of 4500 games between my new chess engine (A) and my old chess engine (B) i have this result: 35% games won by A, 32% games won by B, 33% draw. So A has won more games than B: is A better than B? A: B: So after 4500 games we can say that A is better than B with 95% of confidence You now understand that it is very difficult measure little improvements because we need a lot of games, but with relative frequency you can project a test match ad hoc. You must only choose how your range must be. Example 3. You want to do a test match between your new chess engine and your old chess engine. You want calculate how many games you need to have that the difference between Upper Limit and Lower Limit is 0.04 (4%): in this way if you have a score like 30% for A and 35% for B (or 40% vs 45% and so on...) you can say who is better We put ourselves in the worst case that is when variance (n * p * q) is at its maximum and it is when p = 0.5. We use this formula: n. of games = ((Z^2) * p * q)/(b^2)) [W. G. Cochran Sampling Techniques, 3rd ed. John Wiley, New York] where b = (Upper Limit - Lower Limit) /2 so n= ((1.96^2) * 0.5 *0.5)/(0.020^2) = 2401 games with 95% of confidence 2400 are a lot of games, but here we are in the worst case: after 2400 games we are confident that A is better than B or not. If we know (from a past experiment) how much p can be we can use this information to calculate a smaller n. For example if you know that your engine score is about 30% you need only n= ((1.96^2) * 0.3 *0.7)/(0.020^2) = 2016,84 = 2017 games with 95% of confidence (about 20% less games) Now you know that to measure strength of your chess engine you need a lot of games: how to do a test match and how many time you need? How long does a 2000 games match takes to get accurate results? Don't use tournament time controls: it is a waste of time. Try to use a time control that allow your engine to search at least 3 ply: 1' + 1" or 2' + 2" time control is a good choice. Many of you have mailed me to say thank you about this column. Thank you too! But there's a better way to say "Thanks" by adding a link to this blog in your blog/website
Testing for bugs
I'll write about perft and divide later in another column, here only some suggestion about your engine:
What should have been a simple procedure then becomes a nightmare.
First of all, why doesn't the match end in a 50% result???
Answer: because of timing errors. I use the PC internal timer to measure the thinking time, and the clock precision is 1/18.2 second (roughly). A search beginning right after a clock tick is different from the samesearch started just BEFORE a clock tick, the difference being themeasured time. So a chess engine could sometimes decide to continue asearch, and sometimes decide to stop it, randomly...
Second point: OK, I understand why the result is not exactly 50%.
But shouldn't it be close to 50%? And how close? Answer: make several matches between exactly the same programs, and write down the minimumand maximum score you get. Ok, good idea. Let's try... Well the results range from 39% to 58% when I do 20 games matches. When I try 24 games matches, I get 46%-56% In order to stop the nightmare, I now do 28 games matches (14 openings), and decide the result is significant if it is below or equal to 40%, or above or equal to 60%. The problem is that a change has to be quite good to win a match by 60% or more! If it is a slight improvement, chances are you never notice it by self-play!
q:= 1-p
Z:= 1.96 if we want a 95% of confidence or 2.58 for 99% of confidence
SQRT := square root
n:= number of games played
pA= games won by A/n = 70/200 = 0.35
q = 1-pA = 1-0.35 = 0.65
Upper Limit = pA + Z *SQRT ((pA*q)/(n-1)) = 0.35 + 0.07 = 0.42
pB= games won by B/n = 64/200 = 0.32
q = 1-pB = 1-0.32 = 0.68
Upper Limit = pB + Z *SQRT ((p*q)/(n-1)) = 0.32 + 0.06 = 0.38
so the true score of B, with a 95% of confidence, is between 0.26 and 0.38: we can't say that A is better than B because the worst possible result of A (0.28) is lower than the best possible result of B (0.38).
n = 4500 ; pA= 0.35 ; q = 1-pA = 1-0.35 = 0.65
Lower Limit = pA - Z *SQRT ((pA*q)/(n-1)) = 0.35 - 1.96 * SQRT ((0.35*0.65)/(4500-1)) = 0.35 - 0.014 = 0.336
Upper Limit = pA + Z *SQRT ((pA*q)/(n-1)) = 0.35 + 0.014 = 0.364
n = 4500 ; pB= 0.32 ; q = 1-pB = 1-0.32 = 0.68
Lower Limit = pB - Z *SQRT ((pB*q)/(n-1)) = 0.32 - 1.96 * SQRT ((0.32*0.68)/(4500-1)) = 0.32 - 0.013 = 0.307
Upper Limit = pB + Z *SQRT ((p*q)/(n-1)) = 0.32 + 0.013 = 0.333







0 commenti:
Posta un commento
Per favore non dimenticate di inserire il vostro nome o nickname e un riferimento al vostro blog... non amo i commenti anonimi per cui firmatevi. Per evitare spam i commenti ai post più vecchi di 15 giorni sono moderati: non verranno, dunque, visualizzati immediatamente.