The Forecast Everyone Gets Wrong at the Same Time

Why decorrelation and interpretability matter more than accuracy in energy trading — and what we are building to fix it

Mar 23, 2026

There is a peculiar kind of risk that energy traders rarely talk about openly, yet every seasoned desk professional has lived through it at least once. It is not the risk of being wrong. It is the risk of being wrong in the exact same way as everyone else, at the exact same moment, in the exact same direction.

In equity markets, this is called systemic risk and it is managed obsessively through diversification, hedging, and stress testing. In energy trading — and wind energy trading in particular — the same phenomenon exists, but its source is something almost no one in traditional finance would think to look at: the weather models.

This is the problem we set out to solve.

The infrastructure that every wind energy trader relies on

To understand the opportunity, you first need to understand how wind energy trading actually works in practice.

Wind farm operators and energy traders must submit production bids to electricity markets — day-ahead markets typically close at noon for the following day’s delivery, and intraday markets allow corrections up until an hour or less before actual delivery. The bids are binding. If a trader commits to delivering 40 megawatts between 14:00 and 15:00 tomorrow and the turbines only produce 28 megawatts, the difference has to be covered in the real-time balancing market, at whatever price the grid operator sets — a price that, by design, punishes deviation.
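The arithmetic of the example above can be sketched in a few lines. The function name and the balancing price are hypothetical, chosen only for illustration; real balancing prices vary by market and by settlement period.

```python
def imbalance_mwh(bid_mw: float, produced_mw: float, hours: float = 1.0) -> float:
    """Energy shortfall (positive) or surplus (negative) against the bid, in MWh."""
    return (bid_mw - produced_mw) * hours

# The 14:00-15:00 example from the text: 40 MW committed, 28 MW produced.
shortfall = imbalance_mwh(bid_mw=40.0, produced_mw=28.0)

# Hypothetical balancing price per MWh, for illustration only.
ASSUMED_BALANCING_PRICE = 150.0
cost = shortfall * ASSUMED_BALANCING_PRICE
print(f"shortfall: {shortfall} MWh, cost at assumed price: {cost}")
```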

These deviations are called imbalances, and the penalties for them are not trivial. In the Swiss electricity market alone, imbalance penalties exceed CHF 11.7 billion per year. Across Europe as a whole, the figure is substantially larger.

To avoid these penalties, traders rely on Numerical Weather Prediction (NWP) models: the physics-based atmospheric simulations run by major meteorological agencies. The European Centre for Medium-Range Weather Forecasts (ECMWF), based in Reading in the UK, produces what is widely considered the gold-standard global weather model. The US runs GFS (Global Forecast System), operated by NOAA, the US weather agency. Germany runs ICON (Icosahedral Nonhydrostatic model), operated by DWD, the German weather service. France’s Météo-France operates AROME (Application of Research to Operations at Mesoscale), a very high-resolution short-range model.

These are extraordinary scientific and computational achievements. They run on some of the world’s most powerful supercomputers, assimilate millions of observations from satellites, radiosonde balloons, aircraft, and ground stations every six hours, and produce forecasts that, for most practical purposes, are genuinely impressive.

But they all share a structural weakness. And that weakness is the foundation of what we are building.

The correlation problem nobody talks about

Every major NWP model uses broadly similar approaches to the fundamental problem of atmospheric simulation. They all solve variants of the same fluid dynamics equations. They all rely on similar data assimilation techniques to incorporate observations. They all make similar assumptions about the boundary conditions and physical parameterisations that are computationally necessary to make global forecasting tractable.

This means that when the atmosphere does something genuinely unusual — a rapidly developing low-pressure system, an unexpected wind shear event, a sudden boundary layer collapse — all the major models tend to miss it in similar ways. Their errors are correlated.

For a wind energy trader, this creates a particularly dangerous situation. The standard industry practice is to combine multiple NWP models — typically through a machine learning ensemble that learns optimal blending weights — and use that blend as the forecast. This is a reasonable approach. But because the underlying models are structurally correlated, the ensemble’s failure modes are also correlated. When ECMWF misses a ramp event, GFS and ICON are very likely missing it too. The ensemble that was supposed to reduce risk actually concentrates it.
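A small calculation shows why blending correlated models buys so little. This is a sketch under simplifying assumptions that are not taken from the article: every model has the same error standard deviation, every pair of models has the same error correlation rho, and the blend is an equal-weight average. The rho values are illustrative, not measured.

```python
import math

def blend_error_std(sigma: float, n_models: int, rho: float) -> float:
    """Std of the equal-weight average of n_models errors that each have
    standard deviation sigma and a common pairwise correlation rho."""
    variance = sigma**2 * (1.0 + (n_models - 1) * rho) / n_models
    return math.sqrt(variance)

# Three models, unit error std, increasing error correlation.
for rho in (0.0, 0.5, 0.9):
    print(rho, round(blend_error_std(sigma=1.0, n_models=3, rho=rho), 3))
```

With independent errors, the blend cuts the error standard deviation to about 58% of a single model's. At a correlation of 0.9 it stays at roughly 97%, so the blend fails almost exactly when its members do.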

The practical consequence is that the worst penalty events — the ones that cause the largest imbalances and the highest costs — tend to occur simultaneously across all market participants who are using standard NWP-based forecasting. Everyone bids wrong at the same time. The balancing market becomes extremely stressed precisely when it can least afford to be.

There is a second, subtler consequence. Because all the major forecasting services are drawing on the same NWP substrate, their products are highly correlated with each other. A trader who buys forecast products from multiple vendors may believe they are diversifying their information sources. In reality, they are paying for multiple versions of the same underlying signal.

The black box that nobody trusts

There is a second failure mode in existing ML forecast products that is, in some ways, even more damaging than the correlation problem. It is the problem of interpretability — or rather, the lack of it.

Modern machine learning models for wind power forecasting can achieve genuinely impressive accuracy metrics when evaluated on historical data. But accuracy averaged over historical data is not the same thing as usefulness in real-time trading decisions. A trader does not need a model that is merely accurate on average. They need a model that tells them, in real time, when to trust the forecast and when not to.

This is a fundamentally different capability, and most existing products do not provide it.

When a forecast says “expected production is 35 megawatts,” a trader needs to know: is this a confident forecast driven by strong atmospheric signal, or is it an uncertain interpolation in a regime where the model historically performs poorly? Is the model relying primarily on satellite observations that are currently limited by cloud cover? Is there a ramp event risk that the point forecast is hiding? Is this period particularly prone to error because it falls during a transitional weather pattern that the training data underrepresents?

Without this information, even a highly accurate model gets misused. Traders either over-trust it in situations where it should be questioned, or override it in situations where it is actually reliable. The interpretability gap translates directly into P&L losses that never show up in accuracy metrics.

This is not a criticism of the data scientists building these products. It reflects a genuine technical difficulty — explaining the outputs of complex ensemble models in real time, in terms that are actionable for a trader on a dealing desk, is a hard problem. But it is a solvable one.

What decorrelation actually means in practice

The core thesis of the Decorrelated Ensemble Engine — the DEE — is that the right objective function for a wind power forecast is not accuracy alone. It is accuracy subject to a structural independence constraint from the NWP baseline.

In mathematical terms, we are solving:

minimise forecast error, subject to: correlation with ECMWF ≤ target threshold

This is a constrained optimisation problem, and it changes the character of the solution fundamentally. A model trained purely on accuracy will converge toward the NWP signal, because the NWP signal contains most of the useful information about tomorrow’s wind. A model trained on accuracy subject to a decorrelation constraint is forced to find independent sources of information — and to weight them in proportion to how much structural independence they provide, not just how much accuracy they add.
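One way to write the problem down, as a sketch rather than the project's exact formulation. All symbols are introduced here for illustration: ŷ_t(θ) is the DEE forecast with parameters θ, y_t is the observed production, b_t is the ECMWF-based baseline forecast, and ρ_max is the target threshold. Two choices below are assumptions: squared error as the error measure, and a constraint on forecast errors rather than on the forecasts themselves, which is only one reading of "correlation with ECMWF".

```latex
\min_{\theta} \; \frac{1}{T}\sum_{t=1}^{T} \bigl(\hat{y}_t(\theta) - y_t\bigr)^2
\quad \text{subject to} \quad
\bigl|\operatorname{corr}\bigl(\hat{y}(\theta) - y,\; b - y\bigr)\bigr| \le \rho_{\max}
```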

What are those independent information sources? They are the data that the global NWP models either do not assimilate or do not represent at sufficient spatial resolution:

SCADA telemetry from the turbines themselves — real-time measurements of rotor speed, pitch angle, nacelle direction, power output, and wind speed at hub height, updated every ten minutes. This is hyper-local data that no global model can replicate.

Wind mast measurements at the specific site, at multiple heights, giving direct observations of wind shear and turbulence intensity that the NWP boundary layer parameterisations approximate but never capture precisely.

Satellite imagery — specifically the atmospheric motion vectors and cloud-top temperature fields from Meteosat’s SEVIRI instrument, which provide information about the atmospheric state between NWP model update cycles.

The combination of these data sources, processed through a feature engineering pipeline that explicitly constructs NWP residuals as input features, produces models that fail differently from the global NWP ensemble. When ECMWF fails, the DEE may still be right — because it is drawing on information that ECMWF never had.

The three-model architecture and why it matters

The DEE uses three structurally independent base models, each chosen specifically because its failure modes are different from the others:

An XGBoost gradient-boosted tree model learns static, non-temporal mappings from features to residuals. It is fast, robust to outliers, and highly interpretable through SHAP (SHapley Additive exPlanations) attribution. Its weakness is that it has no memory — it treats each forecast period as independent of the previous ones.

An LSTM (Long Short-Term Memory) neural network operates on twenty-four-hour sequences of weather variables, capturing the temporal autocorrelations and diurnal patterns that the tree model cannot see. Wind ramp events, for example, tend to be preceded by characteristic sequences of atmospheric changes that a sequence model can learn to recognise. The LSTM’s failure modes are genuinely different from the XGBoost model’s — the two models are wrong at different times.

A Gaussian Process provides calibrated uncertainty estimates alongside its point forecasts. Unlike the other two models, the GP produces a full predictive distribution rather than a single number. Its uncertainty estimates — the σ values — are used downstream in the conformal prediction layer to construct statistically rigorous P10/P50/P90 forecast intervals, and in the trust scorer to signal when the model is operating in unfamiliar territory.

These three models are combined by the decorrelation stacker, which finds blending weights that minimise forecast error subject to the correlation constraint. The stacker is retrained weekly as new SCADA data arrives, so the weights track the evolving characteristics of each site.
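A minimal sketch of what such a stacker could look like, not the repository's implementation. It makes several assumptions: a coarse grid search stands in for a proper constrained solver, the constraint is applied to the correlation between blend errors and baseline errors, and the data is synthetic. The names `fit_stacker` and `rho_max` are hypothetical.

```python
import itertools
import math
import random

def pearson(a, b):
    """Plain Pearson correlation of two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)

def fit_stacker(preds, y, baseline, rho_max, step=0.05):
    """Grid-search non-negative weights summing to 1 for three base models,
    minimising MSE while keeping |corr(blend error, baseline error)| <= rho_max.
    Returns (weights, mse, corr), or None if no grid point is feasible."""
    base_err = [b - t for b, t in zip(baseline, y)]
    k = round(1 / step)
    best = None
    for i, j in itertools.product(range(k + 1), repeat=2):
        if i + j > k:
            continue
        w = (i / k, j / k, (k - i - j) / k)
        blend = [sum(wm * p[t] for wm, p in zip(w, preds)) for t in range(len(y))]
        err = [f - t for f, t in zip(blend, y)]
        mse = sum(e * e for e in err) / len(err)
        corr = pearson(err, base_err)
        if abs(corr) <= rho_max and (best is None or mse < best[1]):
            best = (w, mse, corr)
    return best

# Synthetic data: `shared` is an error component common to NWP-driven models.
rng = random.Random(0)
T = 300
y = [20 + 10 * math.sin(t / 12) for t in range(T)]
shared = [rng.gauss(0, 2) for _ in range(T)]
baseline = [yt + s for yt, s in zip(y, shared)]
preds = [
    [yt + 0.9 * s + rng.gauss(0, 1.0) for yt, s in zip(y, shared)],  # tracks NWP closely
    [yt + 0.4 * s + rng.gauss(0, 2.0) for yt, s in zip(y, shared)],  # partly independent
    [yt + rng.gauss(0, 2.5) for yt in y],                            # independent, noisier
]
result = fit_stacker(preds, y, baseline, rho_max=0.3)
print(result)
```

Tightening `rho_max` pushes weight toward the models whose errors do not follow the baseline, at some cost in raw accuracy. That is the trade-off the constraint is meant to enforce.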

Conformal prediction — uncertainty that actually means something

One of the persistent frustrations of probabilistic weather and energy forecasting is that uncertainty bands are often poorly calibrated. A model that claims to produce 80% confidence intervals should, by definition, contain the actual outcome 80% of the time. In practice, most forecast uncertainty products are significantly overconfident in stable weather regimes and underconfident in turbulent ones.

The DEE addresses this through split conformal prediction, a technique from statistical learning theory that provides a finite-sample marginal coverage guarantee regardless of the model’s quality, provided the calibration data and the future data are exchangeable. The method is straightforward: a held-out calibration set is used to compute nonconformity scores, which measure how surprised the model should have been by outcomes it has already seen. These scores are then used to set prediction interval widths for future forecasts, ensuring that the stated coverage probability is achieved in expectation.
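The procedure can be sketched in a few lines. This is a minimal version using absolute residuals as the nonconformity score, which is an assumption: the article does not say which score the DEE uses, and a score normalised by the GP's σ would give intervals that widen and narrow with the regime. The calibration numbers are made up.

```python
import math

def conformal_halfwidth(cal_preds, cal_actuals, coverage=0.8):
    """Half-width q such that [pred - q, pred + q] has marginal coverage of at
    least `coverage` under exchangeability (split conformal, absolute residuals)."""
    scores = sorted(abs(a - p) for p, a in zip(cal_preds, cal_actuals))
    n = len(scores)
    rank = math.ceil((n + 1) * coverage)  # finite-sample corrected rank
    if rank > n:
        return math.inf  # calibration set too small for this coverage level
    return scores[rank - 1]

# Made-up calibration set: ten past forecasts of 30 MW and what actually happened.
cal_preds = [30.0] * 10
cal_actuals = [30.0 + d for d in (1, -2, 3, -4, 5, -6, 7, -8, 9, -10)]

q = conformal_halfwidth(cal_preds, cal_actuals, coverage=0.8)
new_forecast = 35.0
interval = (new_forecast - q, new_forecast + q)
print(q, interval)
```

The `(n + 1)` in the rank is what turns an empirical quantile into a finite-sample guarantee. With only ten calibration points the interval is wide, and it tightens as the calibration set grows.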

The practical consequence for a trader is that a P10/P50/P90 forecast from the DEE means what it says. The interval between P10 and P90 contains the actual outcome 80% of the time, not as a theoretical claim but as an empirically validated property of the specific site and model. This is the foundation on which the bid optimisation layer builds.

From forecast to bid — the probabilistic optimisation

The translation from forecast uncertainty to optimal market bid is a classic problem in stochastic optimisation, structurally the same as the newsvendor problem. Under an asymmetric penalty structure, where under-delivering is substantially more costly than over-delivering, the optimal bid is not the median forecast. It is a lower quantile of the forecast distribution, because a conservative bid makes a costly shortfall less likely. The exact quantile depends on the penalty rates.

Formally, if the penalty for under-delivery is α_under per MWh and the penalty for over-delivery is α_over per MWh, the optimal bid quantity is:

Q* = F⁻¹( α_over / (α_over + α_under) )

where F⁻¹ is the inverse of the forecast cumulative distribution function. In other words, the bid is the quantile corresponding to the ratio of over-delivery cost to total cost. This follows from setting the derivative of the expected penalty, α_under · F(Q) − α_over · (1 − F(Q)), to zero.

Under the penalty structure typical of the Swiss electricity market, with α_under approximately eight times α_over, this resolves to roughly the 11th percentile of the forecast distribution (1/9 ≈ 0.11). The bid optimizer automatically computes this, applies ramp rate constraints between adjacent settlement periods, and adjusts the effective quantile downward when the trust score signals that the forecast uncertainty is understated.
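A brute-force check of the quantile rule, taking under-delivery to mean production falling short of the bid, as in the 40 MW example earlier. The sketch minimises the expected penalty over candidate bids and compares the result with the closed-form critical quantile. The scenario set (1 to 100 MW, equally likely) and the 8:1 penalty ratio are illustrative, and the function names are hypothetical.

```python
def expected_penalty(bid, samples, a_under, a_over):
    """Mean penalty over production scenarios: a shortfall (production < bid)
    costs a_under per MWh, a surplus (production > bid) costs a_over per MWh."""
    total = 0.0
    for p in samples:
        total += a_under * max(bid - p, 0.0) + a_over * max(p - bid, 0.0)
    return total / len(samples)

def critical_quantile(a_under, a_over):
    """Quantile of the production distribution that minimises expected penalty."""
    return a_over / (a_under + a_over)

# Hypothetical production scenarios: 1, 2, ..., 100 MW, equally likely.
samples = [float(p) for p in range(1, 101)]
A_UNDER, A_OVER = 8.0, 1.0  # illustrative 8:1 penalty ratio

best_bid = min(samples, key=lambda q: expected_penalty(q, samples, A_UNDER, A_OVER))
print("critical quantile:", round(critical_quantile(A_UNDER, A_OVER), 3))
print("brute-force optimal bid:", best_bid)
print("penalty at optimal bid:", expected_penalty(best_bid, samples, A_UNDER, A_OVER))
print("penalty at 89 MW bid:  ", expected_penalty(89.0, samples, A_UNDER, A_OVER))
```

When shortfalls are the expensive side, the brute-force minimum sits near the bottom of the distribution, where the closed form puts it. A bid near the top of the distribution is several times more costly in expectation.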

The intraday rebalancing agent then monitors open positions continuously as the delivery window approaches, re-evaluating whether the expected saving from submitting a correction order exceeds the transaction cost, and whether the revised forecast has moved enough — relative to its own uncertainty — to justify acting on.
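The two tests described above can be sketched as a single decision rule. The function name, the `min_sigmas` threshold, and the example numbers are all hypothetical. The article gives no specific values, and a production agent would also need to respect gate closure times.

```python
def should_rebalance(old_forecast, new_forecast, sigma, expected_saving,
                     transaction_cost, min_sigmas=1.0):
    """Submit a correction only if it pays for itself AND the forecast revision
    is large relative to the forecast's own uncertainty (sigma)."""
    pays_off = expected_saving > transaction_cost
    significant = abs(new_forecast - old_forecast) > min_sigmas * sigma
    return pays_off and significant

# Forecast drops from 35 MW to 28 MW with sigma of 4 MW: large move, worth acting on.
print(should_rebalance(35.0, 28.0, sigma=4.0, expected_saving=500.0, transaction_cost=120.0))
# Forecast moves only 2 MW: within noise, so hold the position.
print(should_rebalance(35.0, 33.0, sigma=4.0, expected_saving=500.0, transaction_cost=120.0))
```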

Why interpretability is not just a nice-to-have

The trust scorer is the component that we believe differentiates the DEE most clearly from existing products in the market, and it is worth dwelling on the reasoning behind it.

A forecast that traders don’t trust is worthless, regardless of its accuracy. This sounds obvious, but its implications run deeper than they first appear.

When traders override a black-box forecast and are subsequently wrong, the lesson they draw is often not that the override was mistaken — it is that the model was mistaken. This is a natural human response to uncertainty, but it creates a systematic problem: traders learn to override models in precisely the situations where the models are most valuable, because those are the situations where the model’s view diverges most strongly from the trader’s intuition.

The trust scorer attacks this problem directly. It is a composite signal, built from four components: the rolling correlation between the DEE forecast and the ECMWF baseline (low correlation is good — it means the DEE is genuinely adding independent information); the recent empirical coverage of the conformal prediction intervals (a coverage shortfall signals that the uncertainty is being underestimated); the freshness of the SCADA and satellite data inputs (stale inputs degrade confidence); and the stability of the SHAP feature attributions (when the model’s reasoning is changing rapidly, trust should be lower).
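A composite of this kind might be assembled as follows. This is a sketch only: the mapping of each component onto a 0 to 1 scale, the equal weights, the 60-minute staleness horizon, and the `shap_stability` input (assumed to be a precomputed value between 0 and 1) are all illustrative assumptions, not the DEE's actual scoring rules.

```python
def trust_score(corr_with_baseline, empirical_coverage, target_coverage,
                data_age_minutes, shap_stability,
                max_age_minutes=60.0, weights=(0.25, 0.25, 0.25, 0.25)):
    """Combine four signals, each mapped to [0, 1], into one score in [0, 1]."""
    def clamp(x):
        return max(0.0, min(1.0, x))

    # Low correlation with the ECMWF baseline means more independent information.
    independence = clamp(1.0 - abs(corr_with_baseline))
    # Only a coverage shortfall is penalised, not over-coverage.
    shortfall = max(target_coverage - empirical_coverage, 0.0)
    calibration = clamp(1.0 - shortfall / target_coverage)
    # Stale SCADA or satellite inputs reduce confidence linearly up to the horizon.
    freshness = clamp(1.0 - data_age_minutes / max_age_minutes)
    # Stability of SHAP attributions, assumed already scaled to [0, 1].
    stability = clamp(shap_stability)

    parts = (independence, calibration, freshness, stability)
    return sum(w * p for w, p in zip(weights, parts))

fresh = trust_score(0.2, 0.78, 0.80, data_age_minutes=10, shap_stability=0.9)
stale = trust_score(0.2, 0.78, 0.80, data_age_minutes=50, shap_stability=0.9)
print(round(fresh, 3), round(stale, 3))
```

Keeping the four components separate until the final weighted sum means each one can be shown to the trader alongside the score, so the score remains explainable.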

The trust score is not a recommendation to trade. It is information. It tells a trader: here is what the model knows, here is how confident it is, here is why, and here is what could change that confidence. That is the conversation that builds trust over time — not a black box that claims to be accurate, but a transparent collaborator that explains its reasoning.

The generalised portfolio — beyond wind energy

The architecture we have described was designed initially with wind energy trading as the primary application. But the underlying intellectual framework — decorrelated signals, interpretable uncertainty, probabilistic execution — applies much more broadly.

Any market where physical-world fundamentals drive price dynamics, and where those fundamentals are partially observable through satellite and sensor data before they appear in prices, is a candidate. Natural gas markets. Agricultural commodity markets. Freight rates. Solar energy. Hydropower reservoir levels. The common thread is that there exists information — real-time, high-resolution, locally specific — that global models and consensus forecasts do not fully incorporate.

For hedge fund desks running systematic commodity strategies, this translates into a signal generation framework based on alternative data sources with genuine structural independence from the standard factor models. For energy trading desks at utilities and integrated energy companies, it translates into an operational risk management capability — knowing, before the settlement window, which periods carry elevated imbalance risk and which do not.

The generalisation from wind energy to a broader alternative data alpha framework is the second layer of what we are building, and it is where the long-term commercial opportunity is largest.

What we are looking for in partners

This project is at an early but substantive stage. The core architecture is designed and documented. The ingestion pipeline for the four major NWP models is built. The feature engineering layer, the three base models, the decorrelation stacker, the conformal interval estimator, the trust scorer, and the probabilistic bid optimizer are all implemented and tested.

What we need now are partners who can contribute the things that code cannot provide.

We need site-specific SCADA data — even a single wind farm, with twelve to twenty-four months of historical ten-minute production and meteorological records, is sufficient to train and evaluate the first version of the DEE at a real site.

We need access to historical intraday market data — bid submissions, imbalance settlements, and balancing mechanism prices — so that the bid optimizer and rebalancing agent can be backtested against actual market conditions rather than synthetic scenarios.

We need domain expertise from traders and risk managers who understand the real operational constraints of energy trading desks — the gate closure timings, the practical limits on bid revisions, the relationship between the balancing mechanism and the day-ahead market in each specific jurisdiction.

And we need honest feedback on what we have built so far — what is genuinely differentiated and what is still too academic to be operationally useful.

If you are involved in wind energy trading, work at an energy desk at a utility or independent power producer, manage a commodity-focused hedge fund, or provide forecasting services to energy market participants, we would like to talk.

A closing thought on accuracy versus edge

There is a temptation, in any project that uses machine learning, to measure progress primarily in accuracy metrics. Lower RMSE. Higher R². Better mean absolute error. These metrics are important — a forecast that is systematically wrong is worthless regardless of its other properties — but they are not the whole story, and in trading applications they can be actively misleading.

The insight that prompted this project is that in a market where all participants are drawing on the same underlying NWP signal, marginal accuracy improvements on that signal are worth less than structural independence from it. A forecast that is slightly less accurate on average, but fails in different ways from the consensus, is more valuable than a more accurate forecast that fails at the same time as everyone else.

This is a fundamentally different way of thinking about forecasting quality. It shifts the objective from “how close are we to the truth on average” to “how different are we from the crowd when the crowd is wrong.” That difference in framing — from accuracy to decorrelation, from average performance to tail independence — is what drives every design decision in the architecture we have described.

We believe it is the right framing for the problem. We are building the tools to test that belief against real markets. And we are looking for partners who share the intuition.

The Algo-Energy-Trading project is open source. The repository, technical documentation, and full architecture specification are available at github.com/nunoedgar-invest/Algo-Energy-Trading. Enquiries from potential data and industry partners are welcome.
