Squared probability errors
Suppose 3 forecasters A, B, & C make 10 precipitation probability forecasts each. A forecasts 
0% each time, B forecasts 50% each time, and C 100% each time. Suppose measurable precipitation 
occurs 5 of the 10 times. Clearly forecaster B was best, because none could discern any 
difference among the 10 situations and at least B was forecasting 50% when 50% verified.
Yet if we use linear errors, they all get the same score:

A: 5|0-1| + 5|0-0| = 5
B: 5|.5-1| + 5|.5-0| = 5
C: 5|1-1| + 5|1-0| = 5

for which |a| is the absolute value of a. The first term in each sum covers the cases when 
precipitation verified, and the second term the cases when it did not. If we use squared errors however:

A: 5(0-1)² + 5(0-0)² = 5
B: 5(.5-1)² + 5(.5-0)² = 2.5
C: 5(1-1)² + 5(1-0)² = 5

B's score is much better than A's or C's. The same would be true for A & B if precipitation 
occurred 3 times of 10, and B forecast 30% each time. Linear errors:

A: 3|0-1| + 7|0-0| = 3
B: 3|.3-1| + 7|.3-0| = 4.2
C: 3|1-1| + 7|1-0| = 7

Squared errors:

A: 3(0-1)² + 7(0-0)² = 3
B: 3(.3-1)² + 7(.3-0)² = 2.1
C: 3(1-1)² + 7(1-0)² = 7

So if linear errors were used and the forecaster saying all 0% or 100% is on the right side of 
50%, their forecasts appear at least as accurate as those of the one saying the correct 
probability - which reduces this to a yes/no type of forecast (thereby defeating the purpose of 
using probabilities).
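
The totals above can be recomputed with a few lines of Python. This is only a minimal sketch: the 
function name total_errors and the layout of the scenarios dictionary are my own choices for 
illustration, and it assumes each forecaster issues the same probability every time, as in the 
two examples.

```python
def total_errors(forecast, wet_cases, dry_cases):
    """Return (linear, squared) error totals for one constant probability forecast.

    forecast is a probability between 0 and 1; precipitation verifies as 1, none as 0.
    """
    linear = wet_cases * abs(forecast - 1) + dry_cases * abs(forecast - 0)
    squared = wet_cases * (forecast - 1) ** 2 + dry_cases * (forecast - 0) ** 2
    return linear, squared


# Each entry: (cases with precipitation, cases without, constant forecast per forecaster).
scenarios = {
    "5 of 10": (5, 5, {"A": 0.0, "B": 0.5, "C": 1.0}),
    "3 of 10": (3, 7, {"A": 0.0, "B": 0.3, "C": 1.0}),
}

results = {}
for label, (wet, dry, forecasts) in scenarios.items():
    for name, f in forecasts.items():
        linear, squared = total_errors(f, wet, dry)
        results[(label, name)] = (linear, squared)
        print(f"{label}  {name}: linear = {linear:.2f}  squared = {squared:.2f}")
```

In both scenarios the squared total is lowest for B, while the linear total never favors B over 
the forecaster who sits on the right side of 50%.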

Now suppose that in the first case forecaster C became daring and decided to forecast 90% for
one of the cases in which precipitation did not occur. Using linear errors:

A: 5|0-1| + 5|0-0| = 5
B: 5|.5-1| + 5|.5-0| = 5
C: 5|1-1| + 4|1-0| + 1|.9-0| = 4.9

C nudges ahead by a tiny margin. Using squared errors though:

A: 5(0-1)² + 5(0-0)² = 5
B: 5(.5-1)² + 5(.5-0)² = 2.5
C: 5(1-1)² + 4(1-0)² + 1(.9-0)² = 4.81

C's score is only a little better than A's and not close to B's - which makes sense because B 
forecast 50% each time when 50% occurred, whereas C managed only once to discern a slightly 
smaller chance than 100% for one of the events which did not occur.

So if linear error were used and a forecaster has any skill at all, it would be best to forecast 
100% if they think precipitation is more likely than not, and 0% otherwise. Squared error, 
though, rewards forecasting the probability which actually verifies - i.e., it is a proper 
scoring system.
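
One way to see this is to look at the error a forecaster can expect, on average, for each 
possible forecast when precipitation really occurs with some given probability. The sketch below 
does that by brute force over forecasts of 0%, 1%, ..., 100%. The names expected_errors and 
best_forecasts, and the 1% grid, are assumptions made for the illustration only.

```python
def expected_errors(forecast, true_prob):
    """Expected linear and squared error of one forecast when precipitation
    occurs with probability true_prob."""
    linear = true_prob * abs(forecast - 1) + (1 - true_prob) * abs(forecast - 0)
    squared = true_prob * (forecast - 1) ** 2 + (1 - true_prob) * (forecast - 0) ** 2
    return linear, squared


def best_forecasts(true_prob, steps=100):
    """Search forecasts 0.00, 0.01, ..., 1.00 for the lowest expected error.

    Returns (best forecast under linear error, best forecast under squared error).
    """
    grid = [i / steps for i in range(steps + 1)]
    best_linear = min(grid, key=lambda f: expected_errors(f, true_prob)[0])
    best_squared = min(grid, key=lambda f: expected_errors(f, true_prob)[1])
    return best_linear, best_squared


for p in (0.3, 0.7, 0.9):
    best_linear, best_squared = best_forecasts(p)
    print(f"true probability {p}: linear favors {best_linear}, squared favors {best_squared}")
```

Under linear error the best forecast jumps to 0% or 100% depending on which side of 50% the true 
probability falls, while under squared error the best forecast is the true probability itself.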

Some people suggest using squared errors not only for probabilities, but also for variables such 
as geopotential height and temperature. Below is an example of why that is not a good idea.

Suppose forecasters A & B are back at it, and they make the following temperature forecasts:

A: 73 70 67 65 90 77 83 69 80 82   Linear error = 54  Squared error = 678
B: 70 71 74 65 87 80 86 68 81 81   Linear error = 43  Squared error = 729
v: 70 67 75 66 88 83 60 65 77 81

for which v is the verification. B's total temperature error was 11° lower than A's (43 versus 
54, about 20% lower), and B beat A 6 of 10 times with one tie. A forecast better than B only 3 
times. Yet if the errors are squared, forecaster A receives a great reward for the 1 forecast of 
10 which was a little less terrible than B's - those of 83° & 86° when 60° verified. 676 of B's 
squared total comes from that one forecast, compared with 53 from the other nine. Such a bust 
might, for example, be a strong cold front which was not expected to pass until the next day - 
that would seldom happen for day 1, but quite a bit more often in the medium range, such as 
day 5. With more forecasts, the effect of such outliers becomes smaller - though the squared 
scoring system always fails to some extent to properly reward the forecasts.
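
The numbers in this example can be checked with a short script. It is just a sketch of the 
bookkeeping, using the ten forecasts and verifications from the table; the helper abs_errors and 
the variable names are mine.

```python
a = [73, 70, 67, 65, 90, 77, 83, 69, 80, 82]  # forecaster A
b = [70, 71, 74, 65, 87, 80, 86, 68, 81, 81]  # forecaster B
v = [70, 67, 75, 66, 88, 83, 60, 65, 77, 81]  # verification


def abs_errors(forecasts, verification):
    """Absolute error of each forecast, in degrees."""
    return [abs(f - o) for f, o in zip(forecasts, verification)]


err_a = abs_errors(a, v)
err_b = abs_errors(b, v)

linear_a, linear_b = sum(err_a), sum(err_b)
squared_a = sum(e ** 2 for e in err_a)
squared_b = sum(e ** 2 for e in err_b)

# Head-to-head count: who had the smaller error on each of the 10 forecasts.
b_wins = sum(eb < ea for ea, eb in zip(err_a, err_b))
a_wins = sum(ea < eb for ea, eb in zip(err_a, err_b))
ties = sum(ea == eb for ea, eb in zip(err_a, err_b))

# Share of B's squared total that comes from its single worst forecast.
worst_b = max(err_b) ** 2

print(linear_a, linear_b)            # 54 43
print(squared_a, squared_b)          # 678 729
print(b_wins, a_wins, ties)          # 6 3 1
print(worst_b, squared_b - worst_b)  # 676 53
```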

An argument can certainly be made that a large temperature error is much more significant 
than a smaller one and should be punished accordingly. For example, replace A & B's forecasts 
of 83° & 86° with 67° & 80° respectively, and forecaster B would still be 1° ahead (37 versus 
38). Yet a person can claim that one forecast of B's was really bad and A's not good but okay -
so that should be worth more than the 13 point difference the linear error gives. Well maybe, 
though it takes quite a few other better forecasts to overcome that deficit. Any exponent 
could be used I suppose - raise the errors to the power of 1.25, etc. - or a different type 
of scoring system altogether. Seems best to me to just use the linear error and leave such 
issues to some amount of subjective interpretation.
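
To get a feel for how the choice of exponent changes the ranking, here is a rough sketch which 
scores the modified forecasts (67° and 80° in place of 83° and 86°) with the errors raised to 
several powers. The function power_error and the particular exponents tried are only 
illustrative assumptions, not a recommended scoring system.

```python
def power_error(forecasts, verification, exponent):
    """Sum of absolute errors, each raised to the given exponent."""
    return sum(abs(f - o) ** exponent for f, o in zip(forecasts, verification))


v = [70, 67, 75, 66, 88, 83, 60, 65, 77, 81]      # verification
a_mod = [73, 70, 67, 65, 90, 77, 67, 69, 80, 82]  # A, with 83 replaced by 67
b_mod = [70, 71, 74, 65, 87, 80, 80, 68, 81, 81]  # B, with 86 replaced by 80

for exponent in (1, 1.25, 2):
    score_a = power_error(a_mod, v, exponent)
    score_b = power_error(b_mod, v, exponent)
    leader = "A" if score_a < score_b else "B"
    print(f"exponent {exponent}: A = {score_a:.1f}  B = {score_b:.1f}  lower total: {leader}")
```

With an exponent of 1 B has the lower total (37 versus 38), and with an exponent of 2 A does 
(198 versus 453), so the ranking depends entirely on which exponent one happens to pick.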

A favorite trick of people verifying ensemble forecasts is to use squared errors for variables 
for which linear errors are appropriate, because the ensemble mean avoids large errors. Thus
it can appear to verify better than, say, an operational model.

