Squared probability errors
Suppose 3 forecasters A, B, & C make 10 precipitation probability forecasts each. A forecasts
0% each time, B forecasts 50% each time, and C 100% each time. Suppose measurable precipitation
occurs 5 of the 10 times. Clearly forecaster B was best, because none could discern any
difference among the 10 situations and at least B was forecasting 50% when 50% verified.
Yet if we use linear errors, they all get the same score:
A: 5|0-1| + 5|0-0| = 5
B: 5|.5-1| + 5|.5-0| = 5
C: 5|1-1| + 5|1-0| = 5
where |a| is the absolute value of a, the first term in each sum covers the cases in which
precipitation occurred, and the second term those in which it did not. If we use squared errors however:
A: 5(0-1)^2 + 5(0-0)^2 = 5
B: 5(.5-1)^2 + 5(.5-0)^2 = 2.5
C: 5(1-1)^2 + 5(1-0)^2 = 5
B's score is much better than A's or C's. Linear errors fare even worse if precipitation
occurs 3 times of 10 and B forecasts 30% each time, because A then scores better than B.
Linear errors:
A: 3|0-1| + 7|0-0| = 3
B: 3|.3-1| + 7|.3-0| = 4.2
C: 3|1-1| + 7|1-0| = 7
Squared errors:
A: 3(0-1)^2 + 7(0-0)^2 = 3
B: 3(.3-1)^2 + 7(.3-0)^2 = 2.1
C: 3(1-1)^2 + 7(1-0)^2 = 7
So if linear errors are used, a forecaster who says all 0% or all 100% and is on the right
side of 50% appears at least as accurate as the one stating the correct probability - which
reduces this to a yes/no type of forecast (thereby defeating the purpose of using probabilities).
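The totals for both cases can be checked with a short script. This is only a sketch: the helper names (total_error, score_table) are made up for illustration, and outcomes are coded as 1 when precipitation occurred and 0 when it did not.

```python
def total_error(forecasts, outcomes, power):
    """Sum of |forecast - outcome| ** power over all cases."""
    return sum(abs(f - o) ** power for f, o in zip(forecasts, outcomes))


def score_table(forecast_sets, outcomes):
    """Return {name: (linear total, squared total)} for each forecaster."""
    return {
        name: (total_error(f, outcomes, 1), total_error(f, outcomes, 2))
        for name, f in forecast_sets.items()
    }


# Case 1: precipitation on 5 of 10 occasions (1 = occurred, 0 = did not).
outcomes_5 = [1] * 5 + [0] * 5
case_5 = score_table(
    {"A": [0.0] * 10, "B": [0.5] * 10, "C": [1.0] * 10}, outcomes_5
)

# Case 2: precipitation on 3 of 10 occasions, with B forecasting 30%.
outcomes_3 = [1] * 3 + [0] * 7
case_3 = score_table(
    {"A": [0.0] * 10, "B": [0.3] * 10, "C": [1.0] * 10}, outcomes_3
)

for label, table in (("5 of 10", case_5), ("3 of 10", case_3)):
    for name, (linear, squared) in table.items():
        print(f"{label}  {name}: linear={linear:.2f}  squared={squared:.2f}")
```

Lower totals are better under both measures, so the printed columns can be compared directly against the hand calculations.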
Now suppose that in the first case forecaster C became daring and decided to forecast 90% for
one of the cases in which precipitation did not occur. Using linear errors:
A: 5|0-1| + 5|0-0| = 5
B: 5|.5-1| + 5|.5-0| = 5
C: 5|1-1| + 4|1-0| + 1|.9-0| = 4.9
C nudges ahead by a tiny margin. Using squared errors though:
A: 5(0-1)^2 + 5(0-0)^2 = 5
B: 5(.5-1)^2 + 5(.5-0)^2 = 2.5
C: 5(1-1)^2 + 4(1-0)^2 + 1(.9-0)^2 = 4.81
C's score is only a little better than A's and not close to B's - which makes sense because B
forecast 50% each time when 50% occurred, whereas C managed only once to discern a slightly
smaller chance than 100% for one of the 5 events in which precipitation did not occur.
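To see how the ranking flips between the two measures in this third case, here is a minimal sketch. The names are illustrative only, and the position of C's 90% forecast among the dry cases is an arbitrary choice that does not affect the totals.

```python
def total_error(forecasts, outcomes, power):
    """Sum of |forecast - outcome| ** power over all cases."""
    return sum(abs(f - o) ** power for f, o in zip(forecasts, outcomes))


# 5 cases with precipitation (1) followed by 5 without (0).
outcomes = [1] * 5 + [0] * 5

forecasts = {
    "A": [0.0] * 10,
    "B": [0.5] * 10,
    "C": [1.0] * 9 + [0.9],  # C lowers one dry-case forecast to 90%
}

linear = {name: total_error(f, outcomes, 1) for name, f in forecasts.items()}
squared = {name: total_error(f, outcomes, 2) for name, f in forecasts.items()}

# Rank forecasters from best (lowest total) to worst under each measure.
ranking_linear = sorted(linear, key=linear.get)
ranking_squared = sorted(squared, key=squared.get)

print("linear :", ranking_linear, linear)
print("squared:", ranking_squared, squared)
```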
So if linear error were used and a forecaster has any skill at all, it would be best to forecast
100% if they think precipitation is more likely than not, and 0% otherwise. Squared error
properly rewards forecast accuracy though: the expected score is best when the forecast equals
the probability the forecaster actually believes - i.e., it is a proper scoring system.
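The reasoning can be illustrated numerically. The sketch below assumes precipitation occurs with some true probability p, computes the expected error of a constant forecast f, and searches a 1% grid for the forecast with the lowest expected error. The function names and the grid resolution are assumptions made for this example.

```python
def expected_score(f, p, power):
    """Expected |f - outcome| ** power when precipitation occurs with probability p."""
    return p * abs(f - 1) ** power + (1 - p) * abs(f - 0) ** power


def best_forecast(p, power, steps=100):
    """Forecast on the grid 0%, 1%, ..., 100% that minimizes the expected score."""
    grid = [i / steps for i in range(steps + 1)]
    return min(grid, key=lambda f: expected_score(f, p, power))


for p in (0.3, 0.7):
    print(
        f"true probability {p:.0%}: "
        f"best under linear error = {best_forecast(p, 1):.0%}, "
        f"best under squared error = {best_forecast(p, 2):.0%}"
    )
```

Under linear error the expected score is a straight line in f, so its minimum sits at 0% or 100%; under squared error it is a parabola whose minimum is at f = p.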
Some people suggest using squared errors not only for probabilities, but also for variables such
as geopotential height and temperature. Below is an example of why that is not a good idea.
Suppose forecasters A & B are back at it, and they make the following temperature forecasts:
A: 73 70 67 65 90 77 83 69 80 82 Linear error = 54 Squared error = 678
B: 70 71 74 65 87 80 86 68 81 81 Linear error = 43 Squared error = 729
v: 70 67 75 66 88 83 60 65 77 81
where v is the verification. B's total temperature error was 11° lower than A's (about 20%
lower), and B beat A 6 of 10 times with one tie; A forecast better than B only 3 times. Yet if
the errors are squared, forecaster A receives a great reward for the 1 forecast of 10 which
was a little less terrible than B's - those of 83° & 86° when 60° verified. 676 of B's total
comes from that, compared with 53 for the others. That might be, for example, a strong cold
front which was not expected to pass until the next day - which would seldom happen for day 1,
but quite a bit more often for the medium range such as day 5. With more forecasts, the effect
of such outliers becomes smaller - though the squared scoring system always fails to some extent
to properly reward the forecasts.
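The temperature totals, the head-to-head count, and the share of B's squared total that comes from the single bust can be reproduced as follows. The variable and function names are illustrative only; the numbers are the forecasts and verification listed above.

```python
forecast_a = [73, 70, 67, 65, 90, 77, 83, 69, 80, 82]
forecast_b = [70, 71, 74, 65, 87, 80, 86, 68, 81, 81]
verified = [70, 67, 75, 66, 88, 83, 60, 65, 77, 81]


def abs_errors(forecast, verification):
    """Absolute error of each forecast against its verification."""
    return [abs(f - v) for f, v in zip(forecast, verification)]


err_a = abs_errors(forecast_a, verified)
err_b = abs_errors(forecast_b, verified)

linear_a, linear_b = sum(err_a), sum(err_b)
squared_a = sum(e * e for e in err_a)
squared_b = sum(e * e for e in err_b)

# Head-to-head comparison, case by case.
b_wins = sum(eb < ea for ea, eb in zip(err_a, err_b))
a_wins = sum(ea < eb for ea, eb in zip(err_a, err_b))
ties = sum(ea == eb for ea, eb in zip(err_a, err_b))

# Contribution of B's single largest error to B's squared total.
worst_b_squared = max(err_b) ** 2

print(f"A: linear={linear_a} squared={squared_a}")
print(f"B: linear={linear_b} squared={squared_b}")
print(f"B better {b_wins} times, A better {a_wins} times, ties {ties}")
print(f"B's largest error contributes {worst_b_squared} of {squared_b}")
```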
An argument can certainly be made that a large temperature error is much more significant
than a smaller one and should be punished accordingly. For example, replace A & B's forecasts
of 83° & 86° with 67° & 80° respectively, and forecaster B would still be 1° ahead (37 vs. 38).
Yet a person can claim that one forecast of B's was really bad and A's not good but okay -
so that should be worth more than the 13 point difference the linear error gives. Well maybe,
though it takes quite a few other better forecasts to overcome that deficit. Any exponent
could be used I suppose - raise the errors to the power of 1.25, etc. - or a different type
of scoring system altogether. Seems best to me to just use the linear error and leave such
issues to some amount of subjective interpretation.
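For anyone who wants to experiment with other exponents, a rough sketch follows. It uses the modified forecasts from this paragraph (67° and 80° in place of 83° and 86°). The function name power_score and the particular exponents tried are just choices made for this example, not a recommended scoring system.

```python
# Forecasts with the seventh values replaced as described (67 and 80).
forecast_a = [73, 70, 67, 65, 90, 77, 67, 69, 80, 82]
forecast_b = [70, 71, 74, 65, 87, 80, 80, 68, 81, 81]
verified = [70, 67, 75, 66, 88, 83, 60, 65, 77, 81]


def power_score(forecast, verification, exponent):
    """Sum of absolute errors raised to the given exponent (1 = linear, 2 = squared)."""
    return sum(abs(f - v) ** exponent for f, v in zip(forecast, verification))


for exponent in (1, 1.25, 1.5, 2):
    score_a = power_score(forecast_a, verified, exponent)
    score_b = power_score(forecast_b, verified, exponent)
    leader = "A" if score_a < score_b else "B" if score_b < score_a else "tie"
    print(f"exponent {exponent}: A={score_a:.1f}  B={score_b:.1f}  lower: {leader}")
```

The exponent at which the lead changes hands depends entirely on the sample, which is one way of seeing how much subjective judgment goes into choosing it.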
A favorite trick of people verifying ensemble forecasts is to use squared errors for variables
for which linear errors are appropriate, because the ensemble mean avoids large errors. Thus
it can apparently verify better than say an operational model.