by Y. Qiang Sun, Pedram Hassanzadeh, Jonathan Weare and Dorian S. Abbot
Artificial intelligence (AI) has revolutionized daily to biweekly weather forecasting. Certain deep learning models—like NVIDIA’s FourCastNet [8], Google DeepMind’s GraphCast and GenCast [5, 9], Microsoft’s Aurora [1], and the Artificial Intelligence Forecasting System by the European Centre for Medium-Range Weather Forecasts (ECMWF) [7]—outperform traditional numerical weather prediction models in both accuracy and computational speed. These developments are transforming short- and medium-range weather forecasting—and potentially long-term climate projections—but a crucial question remains: Can these models reliably predict the most extreme and rare weather events, including those on which they were never trained? Such events often cause the most societal impact.
This question lies at the heart of out-of-distribution (OOD) generalization in statistical learning, the limits of data-driven inference in physical systems, and philosophical questions about AI’s ability to actually learn and understand. In a recent study, we addressed it by investigating an AI model’s capability to forecast so-called gray swan tropical cyclones (TCs): physically possible but exceedingly rare OOD events [12]. Our work has major practical implications and highlights a key research area in the fast-paced AI revolution of weather and climate modeling.
Experimental Setup: A Controlled OOD Test
We trained five versions of the FourCastNet model—a transformer-based deep neural network that predicts the evolution of the three-dimensional atmospheric state every six hours—on variations of the ECMWF Reanalysis v5 (ERA5) dataset (see Figure 1) [12]. In the version called noTC, we removed all training samples from 1979-2015 that contained major TCs (Category 3 to 5 storms) anywhere in the world [12]. We then tested the model’s prowess on 20 Category 5 TCs from 2018-2023.

Our analysis aimed to diagnose AI models’ capability to learn weather dynamics well enough for extrapolation, which is a key aspect of “learning” [4]. This type of controlled experiment is uncommon, as training such models is extremely computationally intensive. Training resources are typically devoted to improving operational performance rather than to increasing our understanding of how AI models function. In addition to a model trained on all of the data (Full), we included models with randomly removed data (Rand) and with targeted removal of Category 3 to 5 TCs from only the North Atlantic (noNA) or Western Pacific (noWP) basins.
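The data-withholding variants can be pictured as boolean masks over the training samples. The following is a minimal sketch rather than the study's actual pipeline: the function names, basin codes, and toy storm records are hypothetical, and it assumes that each six-hourly sample can be matched to storm-track records by its time index. It also assumes, purely for illustration, that the random control drops the same number of samples as the targeted filter.

```python
import numpy as np

def keep_mask(n_samples, tc_time_idx, tc_category, tc_basin, basins=None, min_cat=3):
    """Return a boolean mask that is False for samples containing a major TC.

    tc_time_idx, tc_category, and tc_basin are hypothetical storm-track records:
    the sample index at which a storm exists, its category, and its basin code.
    With basins=None, major TCs anywhere are removed (a noTC-style filter);
    otherwise only those in the listed basins (a noNA- or noWP-style filter).
    """
    tc_time_idx = np.asarray(tc_time_idx)
    major = np.asarray(tc_category) >= min_cat
    if basins is not None:
        major &= np.isin(np.asarray(tc_basin), basins)
    keep = np.ones(n_samples, dtype=bool)
    keep[tc_time_idx[major]] = False
    return keep

def random_mask(n_samples, n_remove, seed=0):
    """Rand-style control: drop n_remove samples chosen uniformly at random."""
    rng = np.random.default_rng(seed)
    keep = np.ones(n_samples, dtype=bool)
    keep[rng.choice(n_samples, size=n_remove, replace=False)] = False
    return keep

# Toy records, made up for illustration: sample index, category, basin code.
times = [2, 2, 5, 7, 9]
cats = [4, 1, 3, 5, 2]
basin = ["WP", "NA", "NA", "WP", "NA"]

no_tc = keep_mask(12, times, cats, basin)                 # drop major TCs everywhere
no_na = keep_mask(12, times, cats, basin, basins=["NA"])  # drop only North Atlantic ones
rand = random_mask(12, n_remove=int((~no_tc).sum()))      # size-matched random control
```

Comparing a targeted filter against a size-matched random one helps separate the effect of losing the extreme events from the effect of simply having less training data.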
Results: Failure to Extrapolate
The Full and Rand models successfully predicted the intensification of Category 5 storms. The noTC model, however, failed entirely. When forecasting Hurricane Lee (an out-of-sample Category 5 TC), all ensemble members in noTC predicted a weakening storm (see Figure 1b). In fact, the lowest predicted pressure never dropped below 980 hectopascals (the Category 2 TC range defined in ERA5) — far above the 960 hectopascal value in ERA5. Note that for TCs, lower pressure means stronger storms, i.e., stronger winds.
Mathematically, this failure reveals a key breakdown: the model does not extrapolate from Category 1 and 2 storms to Category 5. Instead, it reverts to its learned distribution and confidently predicts mild conditions when a catastrophe is approaching.
Why Physics Matters: The Joint Distributions

To explain the failure, we utilized a mathematical diagnostic: the joint probability density function of mean sea-level pressure and 10-meter wind speeds. The results in Figure 2 demonstrate why extratropical storms—which the model had witnessed during training—could not substitute for TCs.
In the tropics, low pressure is tightly coupled with high winds, a manifestation of convective dynamics and latent heating. In the middle latitudes, the seasonal cycle and other large-scale dynamics obscure the pressure-wind relationship. Although the noTC model had seen strong pressure anomalies in the middle latitudes, the dynamics associated with those anomalies differed substantially from TC dynamics. As a result, FourCastNet-noTC could not learn from such middle-latitude low-pressure events for OOD generalization in the tropics.
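A joint probability density of this kind can be estimated with a two-dimensional histogram. The sketch below is illustrative only: the samples are synthetic stand-ins rather than ERA5 data, the function name `joint_pdf` is hypothetical, and the "tropical" and "middle-latitude" relationships are invented to mimic a tight versus a weak pressure-wind coupling.

```python
import numpy as np

def joint_pdf(mslp_hpa, wind_ms, bins=(40, 40)):
    """Estimate the joint density of sea-level pressure and wind speed."""
    pdf, p_edges, w_edges = np.histogram2d(mslp_hpa, wind_ms, bins=bins, density=True)
    return pdf, p_edges, w_edges

rng = np.random.default_rng(0)
n = 20000

# Synthetic "tropical" samples: wind speed rises as pressure falls (assumed form).
mslp_trop = rng.uniform(920.0, 1010.0, n)
wind_trop = np.clip(0.6 * (1010.0 - mslp_trop) + rng.normal(0.0, 2.0, n), 0.0, None)

# Synthetic "middle-latitude" samples: wind drawn independently of pressure.
mslp_mid = rng.uniform(940.0, 1030.0, n)
wind_mid = rng.uniform(0.0, 30.0, n)

pdf_trop, p_edges, w_edges = joint_pdf(mslp_trop, wind_trop)

# A density should integrate to one over the (pressure, wind) plane.
cell_area = np.outer(np.diff(p_edges), np.diff(w_edges))
total_prob = (pdf_trop * cell_area).sum()

# Correlation is a crude one-number summary of how tight each coupling is.
corr_trop = np.corrcoef(mslp_trop, wind_trop)[0, 1]
corr_mid = np.corrcoef(mslp_mid, wind_mid)[0, 1]
```

In this toy setup the tropical density concentrates along a narrow ridge in the pressure-wind plane, while the middle-latitude density is spread out, which is the qualitative contrast described above.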
Some Hope: Transfer Across Basins
Encouragingly, the noNA and noWP models, which were trained with the removal of TCs from only one tropical basin, still forecasted strong storms across basins, just not as well as the model with access to all of the data. This outcome suggests that FourCastNet captures some dynamical similarity across regions. In other words, it did not overfit on location-specific data in latitude-longitude coordinates, but instead learned structures that likely live in a low-dimensional representation. Further work has shown similar behavior in other AI weather models for extreme precipitation, which allows the models to forecast events that are gray swans for a given region but common in other parts of the world [11]. Atmospheric dynamicists and applied mathematicians will naturally wish to understand this transfer, which may motivate further work in mathematical climate science to (i) identify invariant representations that support better generalization across heterogeneous regions or over time, and (ii) embed such representations in the next generation of AI weather and climate models.
Physics-informed AI: A Mathematical Necessity
Our study indicates that the augmentation of training data benefits extrapolation beyond historical data. We also learned that despite its apparent forecasting skill, FourCastNet does not obey known physical laws. Specifically, its wind and pressure fields violate the gradient-wind balance — a foundational dynamical equilibrium in TCs. From a mathematical modeling perspective, such a violation suggests that physics-agnostic learning leads to solutions that are outside of a lower-dimensional manifold on which the physical system concentrates. This realization invites the question: Can we improve extrapolation by enforcing more physical constraints?
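For an axisymmetric vortex, gradient-wind balance states that the centrifugal and Coriolis accelerations together balance the radial pressure-gradient force: v^2/r + f v = (1/rho) dp/dr, where v is the tangential wind, r the radius, f the Coriolis parameter, p the pressure, and rho the air density. The following sketch shows one way to measure departures from this balance. It is not the diagnostic used in the study; the function name, the constant-density assumption, and the idealized solid-body vortex are all illustrative choices.

```python
import numpy as np

def gradient_wind_residual(r, v, p, f, rho=1.15):
    """Residual of gradient-wind balance, v^2/r + f*v - (1/rho) dp/dr, in m/s^2.

    r: radius from the storm center (m), strictly positive
    v: tangential wind (m/s), p: pressure (Pa)
    f: Coriolis parameter (1/s), rho: air density (kg/m^3, assumed constant)
    A balanced vortex gives a residual of zero.
    """
    dp_dr = np.gradient(p, r, edge_order=2)
    return v**2 / r + f * v - dp_dr / rho

# Idealized test vortex (hypothetical values): solid-body rotation out to 50 km.
V, R, f, rho = 50.0, 50e3, 5e-5, 1.15
r = np.linspace(1e3, R, 50)
v = V * r / R

# Pressure obtained by integrating the balance equation analytically for this v(r).
p = 95000.0 + 0.5 * rho * (V**2 / R**2 + f * V / R) * r**2

res_balanced = gradient_wind_residual(r, v, p, f, rho)

# Halving the winds while keeping the same pressure field breaks the balance.
res_weak = gradient_wind_residual(r, 0.5 * v, p, f, rho)
```

A residual like this could, in principle, serve either as an evaluation metric for forecast fields or as a penalty term during training, though whether such a constraint improves extrapolation is the open question posed above.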
Implications for AI Applications and Beyond
While our study focused on one AI model and one type of extreme event, we expect the key results to apply broadly to other AI models and weather extremes. Our findings pose both a challenge and an opportunity for the applied math and broader AI communities:
- How can we characterize and quantify OOD generalization in high-dimensional, physical, and spatiotemporal systems?
- What is the role of manifold learning, operator theory, or rare-event sampling in the construction of more robust AI weather and climate models?
- Can dynamical systems theory help explain or guide the training process for edge-case extremes?
Moreover, this work motivates increased collaboration between atmospheric scientists, mathematicians, and computer scientists who wish to develop methods that preserve physical structure, improve uncertainty quantification, and extend trustworthiness to rare but societally critical events. One example of such a collaborative approach couples AI weather models and mathematical frameworks from rare event sampling; when used with traditional weather models, the latter have shown promising results in handling gray swan events [2, 10]. Coupling these frameworks with fast AI models can improve both the frameworks and the models themselves [6].
Conclusion: Predicting the Unpredictable
AI weather models are revolutionizing forecasting techniques, but scientists do not yet understand their limits or how and what they learn. Our study reveals that these models do not extrapolate to gray swan events that they have not seen during training, despite thousands of accurate forecasts in normal conditions. Overcoming this limitation requires the mathematical machinery of rare-event theory, physics-informed learning, or dynamical constraints. Predicting the unpredictable is not just a computational problem; it’s a scientific and mathematical frontier.
Original post on SIAM News
Y. Qiang Sun delivered a minisymposium presentation on this research at the 2025 SIAM Conference on Applications of Dynamical Systems, which took place in Denver, Colo., earlier this year.