Is Your Weatherman Lying To You?
Let's say you've hired a weatherman or trained some ML model to tell you the chance of rain tomorrow. How should you evaluate how well your human/AI weather forecaster is doing? Hopefully we can all agree that any sane evaluation scheme should:
- say the forecaster did well if they correctly estimate the chance of rain every single day.
- say the forecaster did badly if they keep overestimating the chance of rain by e.g. 20%.
- say the forecaster did badly if making decisions based on their forecasts would have caused significant regret.
- not force the forecaster to lie (e.g., predict 50% when they know for certain it will rain) because the truth would have somehow resulted in a far worse evaluation.
Yet all of the evaluation measures you may now be thinking of fail at least one of these requirements. I'll refer to #1 as completeness, #2 as soundness, #3 as decision-theoreticness, and #4 as truthfulness [1].

Let's start with proper scoring rules, the most famous of which is the Brier score / squared error. It's not hard to see that Brier score basically punishes forecasters for how random the weather is, violating completeness (#1). In contrast, calibration measures like expected calibration error and smooth calibration error clearly satisfy completeness (#1) and soundness (#2), e.g. if rain is always a coin toss, always predicting a 50% chance of rain gives vanishing calibration error. However, calibration measures are not truthful (#4) and are usually not decision-theoretic (#3).†
There's been some progress in the past few years towards completing this picture by designing better calibration measures. KPLST 2023 [4] introduced U-Calibration error, which is decision-theoretic but not truthful (#4). In HQYZ 2024 [1], we introduced Subsampled-Smooth Calibration error, which is truthful but not decision-theoretic (#3).
It actually turns out that you can't do better. In recent work, QZ 2025 [2], we show that no evaluation measure can satisfy all four desiderata in the worst case. One melodramatic implication of this impossibility result is that alignment is impossible when it comes to rewarding an AI for providing good forecasts of the future.
However, this impossibility mainly exists in the worst case: it largely vanishes if we relax the assumption that nature can adversarially set the chance of rain with arbitrary precision. Indeed, we succeeded in designing a calibration measure that is best of all worlds in the smoothed setting, combining lessons from calibration (to satisfy #1 and #2), proper scoring rules (to satisfy #3), and subsampling (to satisfy #4) to provide a reliable metric for evaluating forecasters.
Forecasting Refresher
Let's recall the usual mathematical setup for forecasting. Every night, the forecaster makes a prediction \(p_t \in [0, 1]\) for the chance of rain tomorrow (day \(t\)). Every morning, they see whether it rained, denoted by the event \(x_t \in \{0, 1\}\); we'll also have \(p^*_t \in [0, 1]\) denote the true probability of rain. Using this language, we can write any forecasting evaluation as a mapping from predictions and events to a real number; e.g., expected calibration error is \(\mathsf{ECE}(\mathbf{p}, \mathbf{x}) = \sum_{v \in \mathbb{R}} |\sum_{t=1}^T 1[p_t = v] \cdot (p_t - x_t)|\) and Brier score is \(\mathsf{Brier}(\mathbf{p}, \mathbf{x}) = \sum_{t=1}^T (p_t - x_t)^2\).
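To make these definitions concrete, here is a minimal sketch of the two measures in Python, assuming predictions and outcomes arrive as NumPy arrays (the helper names `ece` and `brier` are just for illustration):

```python
import numpy as np

def ece(p: np.ndarray, x: np.ndarray) -> float:
    """Expected calibration error: group timesteps by predicted value v and
    sum the absolute biases |sum_t 1[p_t = v] * (p_t - x_t)|."""
    return float(sum(abs(np.sum(p[p == v] - x[p == v])) for v in np.unique(p)))

def brier(p: np.ndarray, x: np.ndarray) -> float:
    """Brier score: cumulative squared error sum_t (p_t - x_t)^2."""
    return float(np.sum((p - x) ** 2))
```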
We can now verify some basic facts, like Brier score's violation of completeness. Even if the forecaster correctly predicts that rain is a coin toss every day, i.e. \(p_t = p^*_t = 0.5\), the forecaster will pay a linear Brier score of \(\mathsf{Brier}(\mathbf{p}, \mathbf{x}) \in \Omega(T)\). In contrast, the expected calibration error (which is complete) of the predictions is vanishing: \[\mathbb{E}[\mathsf{ECE}(\mathbf{p}, \mathbf{x})] = \mathbb{E}\left[\left|\mathsf{Binomial}(T, \tfrac 12) - \tfrac 12 T\right|\right] \in \Theta(\sqrt T).\]
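As a sanity check, a quick Monte Carlo run (reusing the `ece` and `brier` helpers sketched above; the horizon and seed are arbitrary) reproduces the two growth rates:

```python
import numpy as np

rng = np.random.default_rng(0)
T = 10_000
p = np.full(T, 0.5)               # truthful forecasts: rain is a coin toss
x = rng.binomial(1, 0.5, size=T)  # realized weather

print(brier(p, x))  # exactly T/4 here: grows linearly, punishing the weather's randomness
print(ece(p, x))    # |# rainy days - T/2|, on the order of sqrt(T)
```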
I'll also briefly mention a recent example of a decision-theoretic calibration measure: the U-Calibration error of KPLST 2023 [4]. U-Calibration error is defined as the worst possible regret that an agent can incur when following the forecaster's predictions, satisfying Requirement #3 by definition. Formally, taking the maximum over action sets \(A\) and utility functions \(u: A \times \{0, 1\} \to [-1, 1]\), the U-Calibration error is \[ \mathsf{UCal}(\mathbf{p}, \mathbf{x}) = \max_{A,\, u} \left( \max_{a^* \in A} \sum_{t=1}^T u(a^*, x_t) - \sum_{t=1}^T u(a_t, x_t) \right), \] where \(a_t = \arg\max_{a \in A} \big[ p_t \cdot u(a, 1) + (1 - p_t) \cdot u(a, 0) \big]\) is the agent's best response to the forecast \(p_t\).
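The maximum over all bounded utilities makes U-Calibration error awkward to compute directly, but restricting it to a finite menu of two-action "umbrella" utilities already conveys the idea. The sketch below is only that restriction, not the full measure from [4]; the cost grid is arbitrary.

```python
import numpy as np

def ucal_finite_menu(p: np.ndarray, x: np.ndarray,
                     costs=np.linspace(0.05, 0.95, 19)) -> float:
    """Worst regret over a menu of two-action utilities: for each cost c,
    'take umbrella' yields -c regardless of rain, and 'skip it' yields -1
    if it rains and 0 otherwise. The agent best-responds to each forecast;
    regret is measured against the best single action in hindsight."""
    worst = 0.0
    for c in costs:
        umbrella = p >= c                                   # best response to forecast p_t
        u_agent = np.where(umbrella, -c, -x.astype(float))  # realized utilities
        u_always = -c * len(x)                              # always take the umbrella
        u_never = -float(np.sum(x))                         # never take it
        worst = max(worst, max(u_always, u_never) - float(np.sum(u_agent)))
    return worst
```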
The question now is: how do you introduce truthfulness into these evaluation measures?

How To Make Forecasters Stop Lying
There are three main forms of dishonesty that one should worry about with forecasters.
Patching up the past
A forecaster may lie about the future to cancel out errors they made in the past. This was first noted by [7], and ECE admits a particularly simple example: suppose the true probabilities of rain are \(p^*_1 = 0.5, p^*_2 = 0, p^*_3 = 1\), and this cycle repeats for \(T\) days. Even if the forecaster exactly predicted these true probabilities, i.e. \(p_t = p^*_t\), they should expect to pay \(\mathsf{ECE}(\mathbf{p}, \mathbf{x}) \in \Omega(\sqrt T)\). Now suppose the forecaster instead lies, predicting \(p_1 = 0.5, p_2 = 0.5, p_3 = 1\) if it rains on day 1 and \(p_1 = 0.5, p_2 = 0, p_3 = 0.5\) otherwise (repeating this scheme for all \(T\) days). In this case, they pay \(\mathsf{ECE}(\mathbf{p}, \mathbf{x}) = 0\) as their lies cancel out all past errors. Compared to lying, telling the truth increases the forecaster's ECE by a factor of \(\tfrac{\Omega(\sqrt T)}{0} \to \infty\).
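Simulating this example (reusing the `ece` helper from earlier; the horizon is arbitrary) shows the gap directly:

```python
import numpy as np

rng = np.random.default_rng(1)
cycles = 3_000
# each cycle: a coin-toss day, a certainly-dry day, a certainly-rainy day
x = np.concatenate([[rng.binomial(1, 0.5), 0, 1] for _ in range(cycles)])

truthful = np.tile([0.5, 0.0, 1.0], cycles)
liar = np.concatenate([
    [0.5, 0.5, 1.0] if x[3 * i] == 1 else [0.5, 0.0, 0.5]
    for i in range(cycles)
])

print(ece(truthful, x))  # Theta(sqrt(T)): honest errors on the coin-toss days pile up
print(ece(liar, x))      # 0: each cycle's lie exactly cancels the day-1 error
```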
In HQYZ 2024 [1], we remedy this non-truthfulness by randomly subsampling the timesteps included in the computation of smooth calibration error. This works because patching up the past is a delicate operation that subsampling disrupts.
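As a toy illustration of why this works (the measure in [1] subsamples smooth calibration error, not ECE, but the effect is the same in spirit), subsampling the example above destroys the liar's carefully paired cancellations:

```python
import numpy as np

def subsampled_ece(p: np.ndarray, x: np.ndarray, rng, keep: float = 0.5) -> float:
    """ECE computed only on a random subset of timesteps."""
    mask = rng.random(len(p)) < keep
    return ece(p[mask], x[mask])

rng = np.random.default_rng(2)
print(subsampled_ece(truthful, x, rng))  # still Theta(sqrt(T))
print(subsampled_ece(liar, x, rng))      # no longer 0: most lies lose their cancelling partner
```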
It's worth noting that we can also view subsampling as negating non-truthfulness via (weak) conditional guarantees—in a fashion deeply connected with multicalibration [5], e.g. see [6]. Specifically, let's adopt a batch perspective and associate random features to each timestep: subsampling can then be seen as demanding that calibration error is low not just marginally on the entire history, but also on specific feature groups. Such demands effectively force a forecaster to predict closer to the Bayes classifier, which is equivalent to truthful predictions.
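A rough sketch of this batch/group view, with illustrative random features: demanding low calibration error within every feature group already rules out the liar's trick above, since the paired lies usually do not land in the same group.

```python
import numpy as np

def worst_group_ece(p: np.ndarray, x: np.ndarray, groups: np.ndarray) -> float:
    """Max of ECE over the timestep groups defined by a feature column."""
    return max(ece(p[groups == g], x[groups == g]) for g in np.unique(groups))

rng = np.random.default_rng(3)
groups = rng.integers(0, 4, size=len(x))  # illustrative random features
print(worst_group_ece(truthful, x, groups), worst_group_ece(liar, x, groups))
# both are Theta(sqrt(T))-ish: the liar no longer gains anything from patching
```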
Discontinuity
A forecaster may slightly lie about the future (differing from the truth by an \(\epsilon\) amount) when an evaluation measure is discontinuous. This is the reason why in HQYZ 2024 [1] we apply our subsampling method to smooth calibration error rather than ECE: the latter's discontinuity also introduces an additional form of dishonesty.
This source of non-truthfulness is actually the most challenging to overcome if we want to preserve decision-theoretic guarantees. When we look at how downstream agents best-respond to a forecaster's predictions, the mapping is itself discontinuous, meaning that any non-trivial decision-theoretic evaluation measure must also reflect this discontinuity. Fortunately, for reasonable calibration measures, this discontinuity is only a problem in non-smooth worst-case settings, where nature can adversarially adjust the chance of rain above and below a precise threshold.
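To see the discontinuity concretely, here is a minimal (illustrative) best-response map for a single "umbrella" agent; an \(\epsilon\) change in the forecast can flip the action and change the realized utility by a constant:

```python
c = 0.3  # the agent's umbrella cost (illustrative)

def realized_utility(p: float, rained: int) -> float:
    take_umbrella = p >= c            # best response to the forecast p
    return -c if take_umbrella else -float(rained)

for p in (0.299, 0.300, 0.301):
    print(p, realized_utility(p, rained=1))
# 0.299 -> -1.0, but 0.300 and 0.301 -> -0.3: the map from p to realized utility jumps at c
```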
Hedging
A forecaster may exaggerate the uncertainty of their forecasts to encourage downstream agents to be more risk-averse. This form of non-truthfulness is unique to decision-theoretic evaluation measures, and it has a rather intuitive explanation: even a forecaster that is, by definition, aligned with minimizing the regret of downstream agents may paternalistically push those agents toward risk-averse behavior, since the agents take forecasts at face value and act on them to maximize utility.
For a simple example, suppose that the probability of rain is \(p^*_t = \tfrac 15\) for the first half of the days and \(p^*_t = \tfrac 45\) for the rest. The U-Calibration error of truthfully forecasting these probabilities is \(\Omega(\sqrt T)\). In contrast, if the forecaster lies and predicts \(p_t = \tfrac 25\) for the first half and \(p_t = \tfrac 35\) for the rest, the U-Calibration error is exponentially small: \(\exp(-\Omega(T))\).
Somewhat surprisingly, this form of non-truthfulness can be mitigated by simply modifying the U-Calibration error measure to be more two-sided, penalizing poor uncertainty quantification without impacting the other desiderata. We refer to the resulting calibration measure as Step Calibration error.
Putting It All Together
Putting all of these insights together gives a complete, sound, decision-theoretic, and truthful calibration measure: "subsampled" Step Calibration error. It also has the added benefit of being achievable in online settings with a rate of \(O(\sqrt{T})\) (i.e., you can realistically hope to do well with respect to this evaluation measure even in adversarial forecasting problems). The math of proving its truthfulness is somewhat involved, as it requires bounding a random walk whose variance is itself random and potentially poorly behaved; so please see [2] (and [1]) for details!
† Expected calibration error does provide decision-theoretic guarantees, but it cannot be obtained at the usual \(O(\sqrt{T})\) rate that we expect in online learning: in the worst case, every forecasting algorithm must incur an ECE of \(\omega(\sqrt T)\). Smooth calibration measures, like smooth calibration error, do not provide decision-theoretic guarantees but can be obtained at the usual \(O(\sqrt{T})\) rate. See [4].
References
- [1] Haghtalab, N., Qiao, M., Yang, K., & Zhao, E. (2024). Truthfulness of Calibration Measures. In Advances in Neural Information Processing Systems (NeurIPS).
- [2] Qiao, M., & Zhao, E. (2025). Truthfulness of Decision-Theoretic Calibration Measures. Preprint.
- [3] Forbes. (2007, July 20). In Pictures: America's Wildest Weather Cities.
- [4] Kleinberg, R., Paes Leme, R., Schneider, J., & Teng, Y. (2023). U-Calibration: Forecasting for an Unknown Agent. In Conference on Learning Theory (COLT), 5143–5145.
- [5] Hébert-Johnson, U., Kim, M., Reingold, O., & Rothblum, G. (2018). Multicalibration: Calibration for the (Computationally-Identifiable) Masses. In International Conference on Machine Learning (ICML), 1939–1948.
- [6] Haghtalab, N., Jordan, M., & Zhao, E. (2023). A Unifying Perspective on Multicalibration: Game Dynamics for Multi-Objective Learning. In Advances in Neural Information Processing Systems (NeurIPS), 72464–72506.
- [7] Qiao, M., & Zheng, L. (2024). On the Distance from Calibration in Sequential Prediction. In Conference on Learning Theory (COLT), 4307–4357.
Thanks for reading! Anonymous feedback can be left here. Feel free to reach out if you think there's something I should add or clarify.