Strategy Lab #5 — Can LLM Agents Actually Trade? A Falsification Framework for AI-Driven Stock Selection

ENTER Invest · Algorithmic Token · 29 May 2026. Edited by Nuno Edgar Nunes Fernandes

May 29, 2026

Most LLM agentic trading research reports impressive returns. A paper published this week exposes why almost none of those results mean what they claim. The fix — a masking protocol and a return attribution framework — is something every practitioner evaluating AI trading systems needs to understand and implement

A Note on These Strategy Labs

This is the fifth Strategy Lab post from Algorithmic Token. With these Strategy Labs we aim to produce a first implementation of experimental algorithmic frameworks for trading strategies, that would, conditioned on feedback and positive backtest results, be further developed into proper functional production code. The experimental algorithms generated here are based on the papers we have read and analysed, and where we have found a way for a potential trading strategy implementation. The interested reader may find here their own ideas, and we encourage feedback in the comments section or through direct email, about possible additions or improvements to the implementations. From this edition on there will be a paid section for the Experimental Code suggestion with a brief explanation and risk disclaimer about using the Code.

Share Algorithmic Token

The Concept

There is a pattern in LLM trading research that anyone who has been reading the literature closely will have noticed. A new paper drops. The abstract reports a Sharpe ratio of 1.8, or cumulative returns of 40% over two years, or consistent outperformance of a buy-and-hold benchmark. The methodology section describes an LLM agent that reads news, analyses fundamentals, and makes portfolio decisions. The backtest runs over a historical period that overlaps heavily with the model’s training data. The paper concludes that LLM agents can trade profitably.

The problem is not that these papers are dishonest. The problem is that two structural failures make their results almost uninterpretable — and nobody has been checking for both simultaneously.

Failure 1 — Knowledge leakage. If you ask an LLM to trade AAPL (Apple stock ticker) in 2023 and the model was trained on data that includes financial news from 2023, the model may be recalling memorized outcomes rather than performing genuine investment reasoning. The backtest looks like skill. It is memory. The paper does not distinguish between them.

Failure 2 — Return attribution. Even if leakage is controlled, cumulative return is a poor measure of trading skill. An LLM agent that simply goes long large-cap technology stocks in 2023–2024 will show strong returns. That is not alpha — it is market beta and style exposure. Attributing the return to the agent’s decision-making without decomposing it into market, style, and genuine stock-selection components overstates the case dramatically.

A paper published this week by a team at a Chinese research institution constructs a benchmark — KTD-Fin — that addresses both failures simultaneously. The results are clarifying and uncomfortable: under proper leakage control and return attribution, most LLM agents show little to no genuine stock-selection alpha.

“The question is not whether the LLM agent made money. The question is whether the source of those returns reflects transferable investment skill — or memorized market history and passive beta exposure.”

The Research Basis

Primary reference:

From Knowing to Doing: A Memory-Controlled Benchmark for LLM Trading Agents on Stock Markets arXiv:2605.28359 · q-fin.TR · May 2026

The paper introduces KTD-Fin (Knowing-To-Doing Financial Benchmark), built on Qlib and instantiated on the CSI300 universe of Chinese A-share stocks over a 2024–2026 trading window. It addresses the two evaluation failures through:

A four-level masking protocol that anonymises tickers, calendar dates, and tool-return timestamps — preventing the model from ever seeing a real identifier even transiently through a tool call.
A Barra-style daily cross-sectional attribution that decomposes each agent’s realized return into market factor, style factors, and residual stock-selection alpha.

The masking protocol is certified by an independent ten-attacker probe: the best any attacker can do is recover the correct ticker 1.5% of the time. The mask is statistically sound.

KTD-FIN fixes two weak spots in LLM trading evals: models can lean on memorized tickers/dates/narratives, and raw returns can hide whether performance came from beta, style exposure, or actual stock-selection skill. So — Source:

𝚐𝔪𝟾𝚡𝚡𝟾@gm8xx8

KTD-FIN fixes two weak spots in LLM trading evals: models can lean on memorized tickers/dates/narratives, and raw returns can hide whether performance came from beta, style exposure, or actual stock-selection skill. So it masks identifiers and calendar info, then decomposes

6:13 AM · May 28, 2026 · 881 Views

1 Reply · 2 Reposts · 14 Likes

The results across ten frontier LLM agents: masking substantially changes agent rationales — models stop referencing company-specific narratives and shift toward factor-based reasoning. More critically, attribution analysis shows that most agents’ cumulative returns under leakage-controlled evaluation are largely explained by passive market and style exposure. Genuine stock-selection alpha is minimal and not persistent.

Supporting references:

Qlib documentation — Microsoft Research Qlib — the open-source quantitative investment platform on which KTD-Fin is built.
Barra factor model methodology — MSCI Barra Factor Models — the standard attribution framework used in the paper.
López de Prado (2018) — Advances in Financial Machine Learning — Chapter 11 covers feature importance and attribution, directly relevant to understanding what genuine alpha looks like versus beta exposure.

What the Four Masking Levels Mean

Understanding the masking protocol is essential to understanding what the paper proves. The four levels build progressively:

The performance gap between Bright and Blinded conditions is the measure of knowledge leakage. If an agent performs well in Bright but poorly in Blinded, the bright-condition returns were driven by memorized market knowledge, not investment skill.

The Barra-Style Attribution Decomposition

The return attribution framework is the second key contribution. For any portfolio of returns r, the decomposition is:

r(t) = beta_market  × r_market(t)
     + beta_style   × r_style(t)
     + alpha(t)
     + epsilon(t)

Where:
    beta_market  = exposure to broad market factor (systematic risk)
    r_market(t)  = market return on day t
    beta_style   = exposure to style factors (size, momentum, value, quality)
    r_style(t)   = style factor returns on day t
    alpha(t)     = residual return unexplained by market and style factors
    epsilon(t)   = idiosyncratic noise

Alpha is the only component that reflects genuine investment skill. Market beta and style factor returns are available for free through passive index funds and factor ETFs. An LLM agent that generates returns primarily through market beta has not demonstrated trading ability — it has demonstrated that it knows which direction the market moved in its training data.

The paper’s finding: most agents show alpha close to zero under the blinded condition. The cumulative returns reported in bright-condition backtests are overwhelmingly explained by market and style exposure — not by the agents’ actual stock-selection decisions.

Strategy Logic — The Evaluation Framework

The experimental algorithm we implement is not a trading strategy in the conventional sense. It is an evaluation harness for LLM trading agents — a framework that any practitioner can use to test whether an AI-driven trading system is generating genuine alpha or simply riding market and style exposures.

The harness has three components:

Component 1 — The Masking Protocol

Input: historical OHLCV + fundamental data with real identifiers

Masking steps:
    1. Replace ticker symbols with anonymous IDs
       (e.g. AAPL → ASSET_0047, consistent across all prompts)
    2. Replace calendar dates with relative indices
       (e.g. 2024-03-15 → T+142, with T_0 = arbitrary start)
    3. Replace company names in any text fields with generic labels
       (e.g. "Apple Inc" → "Technology Company A")
    4. Verify mask integrity: run attacker probe
       - Sample 100 masked assets
       - Attempt ticker recovery from price patterns alone
       - Accept mask only if top-1 recovery rate < 5%

Output: anonymised dataset with no real-world identifiers

Component 2 — The Attribution Decomposition

For each agent's daily portfolio return r(t):

1. Regress r(t) on market return r_mkt(t):
   Estimate beta_market via rolling OLS (Ordinary Least Squares) (60-day window)

2. Regress residual on style factor returns:
   Style factors: size (log market cap), momentum (12-1 month return),
   value (book-to-price), quality (ROE), volatility (realised vol)
   Estimate beta_style via rolling OLS on factor returns

3. Compute daily alpha:
   alpha(t) = r(t) - beta_market * r_mkt(t) - beta_style * r_style(t)

4. Evaluate alpha persistence:
   - Information Ratio (IR) of alpha series
   - T-statistic of mean alpha
   - Hit rate: fraction of days with positive alpha
    Genuine skill threshold (institutional standard):
    IR > 0.5 AND T-stat > 2.0 AND hit rate > 52%

Component 3 — The Three Decision Modes

Mode 1 — Memory-only:
    Agent makes decisions from internal knowledge only
    No real-time data tools provided
    Tests pure memorisation vs. reasoning

Mode 2 — Fixed-candidate:
    Agent receives a fixed set of anonymised candidate assets
    with current-period fundamentals and price data
    Tests stock selection from provided information

Mode 3 — Open-research:
    Agent can query anonymised data tools freely
    Closest to a real trading agent deployment
    Tests full decision-making pipeline under anonymisation

⚙ Experimental Algorithm — Paid Subscriber Access

The following section contains the full Python implementation of the KTD-Fin evaluation framework — the masking protocol, the Barra-style attribution decomposition, and the agent evaluation harness. This is experimental code intended for research and educational purposes.

Production Disclaimer: The code below is an experimental algorithm provided for educational and research purposes. It has not been audited, stress-tested, or validated for production deployment. Before using any part of this implementation in a live or paper trading environment, readers should independently review the logic, validate assumptions against their own data, test under realistic market conditions including transaction costs and slippage, and consult qualified financial and technical professionals. ENTER Invest accepts no liability for outcomes resulting from direct or adapted use of this code.

# Strategy Lab #5 — KTD-Fin Evaluation Framework
# Algorithmic Token · ENTER Invest
# Experimental algorithm — see production disclaimer above
#
# Reference: arXiv:2605.28359 — KTD-Fin benchmark
# Built on Qlib framework: https://github.com/microsoft/qlib

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
import hashlib
import warnings
warnings.filterwarnings("ignore")


# ---------------------------------------------------------------------------
# Component 1 — Masking Protocol
# ---------------------------------------------------------------------------

class AssetMasker:
    """
    Anonymises financial dataset identifiers to isolate agent investment
    reasoning from memorised market knowledge.

    Implements the data-side masking protocol from KTD-Fin (arXiv:2605.28359).
    Masking is consistent across all prompts and tool-return values — an agent
    in blinded mode never sees a real identifier even transiently.

    Code sketch: The masker maintains a deterministic mapping from real
    identifiers to anonymous IDs using SHA-256 hashing with a fixed seed,
    ensuring consistency across sessions. Calendar dates are replaced with
    relative integer indices anchored to an arbitrary T_0.
    """

    def __init__(self, seed: int = 42):
        self.seed         = seed
        self._ticker_map  = {}
        self._date_map    = {}
        self._t0          = None

    def _hash_ticker(self, ticker: str) -> str:
        """Generate a deterministic anonymous ID for a ticker."""
        raw   = f"{self.seed}:{ticker}".encode()
        hashed = hashlib.sha256(raw).hexdigest()[:6].upper()
        return f"ASSET_{hashed}"

    def fit(self, tickers: list, dates: pd.DatetimeIndex):
        """
        Build the anonymisation mapping from a universe of tickers and dates.

        Parameters
        ----------
        tickers : list — list of real ticker symbols
        dates   : pd.DatetimeIndex — full date range to anonymise
        """
        self._ticker_map = {t: self._hash_ticker(t) for t in tickers}
        self._t0         = dates.min()
        self._date_map   = {
            d: f"T+{(d - self._t0).days}" for d in dates
        }

    def mask_ticker(self, ticker: str) -> str:
        """Replace real ticker with anonymous ID."""
        return self._ticker_map.get(ticker, self._hash_ticker(ticker))

    def mask_date(self, date) -> str:
        """Replace calendar date with relative index."""
        if isinstance(date, str):
            date = pd.Timestamp(date)
        return self._date_map.get(date, f"T+{(date - self._t0).days}")

    def mask_dataframe(self, df: pd.DataFrame,
                        ticker_cols: list = None,
                        date_cols: list = None) -> pd.DataFrame:
        """
        Apply masking to a full DataFrame in one pass.

        Parameters
        ----------
        df          : pd.DataFrame — data to anonymise
        ticker_cols : list — column names containing ticker symbols
        date_cols   : list — column names containing calendar dates
        """
        masked = df.copy()
        if ticker_cols:
            for col in ticker_cols:
                if col in masked.columns:
                    masked[col] = masked[col].map(
                        lambda x: self.mask_ticker(str(x)))
        if date_cols:
            for col in date_cols:
                if col in masked.columns:
                    masked[col] = pd.to_datetime(
                        masked[col]).map(self.mask_date)
        return masked

    def verify_mask_integrity(self, masked_df: pd.DataFrame,
                               real_tickers: list,
                               ticker_col: str,
                               max_recovery_rate: float = 0.05) -> dict:
        """
        Attacker probe: verify that real tickers cannot be recovered
        from the masked dataset.

        Simulates the ten-attacker probe methodology from KTD-Fin.
        Attempts pattern-matching recovery of real tickers from
        anonymised IDs. Mask is accepted if recovery rate < max_recovery_rate.

        Parameters
        ----------
        masked_df          : pd.DataFrame — anonymised data
        real_tickers       : list  — ground-truth ticker list
        ticker_col         : str   — column name of masked IDs
        max_recovery_rate  : float — maximum acceptable recovery rate

        Returns
        -------
        dict with keys: recovery_rate, mask_accepted, n_tested
        """
        masked_ids  = masked_df[ticker_col].unique()
        n_tested    = min(100, len(masked_ids))
        sample_ids  = np.random.choice(masked_ids, n_tested, replace=False)
        recovered   = 0

        # Simple attacker: try to reverse the hash
        reverse_map = {self._hash_ticker(t): t for t in real_tickers}
        for mid in sample_ids:
            if mid in reverse_map:
                recovered += 1

        recovery_rate = recovered / n_tested
        return {
            "recovery_rate":  recovery_rate,
            "mask_accepted":  recovery_rate <= max_recovery_rate,
            "n_tested":       n_tested,
        }


# ---------------------------------------------------------------------------
# Component 2 — Barra-Style Attribution Decomposition
# ---------------------------------------------------------------------------

def compute_style_factors(prices: pd.DataFrame,
                           fundamentals: pd.DataFrame = None,
                           lookback_momentum: int = 252,
                           skip_momentum: int = 21) -> pd.DataFrame:
    """
    Compute daily cross-sectional style factor exposures.

    Implements a simplified Barra-style factor construction using
    publicly available OHLCV and fundamental data.

    Style factors computed:
        - Momentum: 12-1 month return (skip most recent month)
        - Volatility: 21-day realised volatility (negative loading)
        - Size: log market cap (if fundamentals provided)
        - Value: book-to-price (if fundamentals provided)

    Parameters
    ----------
    prices          : pd.DataFrame — daily close prices, cols = tickers
    fundamentals    : pd.DataFrame — optional; market cap, book value
    lookback_momentum : int — momentum lookback in days
    skip_momentum   : int  — days to skip (avoid short-term reversal)

    Returns
    -------
    pd.DataFrame — daily factor returns (index=dates, cols=factor names)
    """
    returns   = prices.pct_change()

    # Momentum factor return: long top quintile, short bottom quintile
    momentum  = (returns.rolling(lookback_momentum - skip_momentum)
                 .sum().shift(skip_momentum))
    mom_quintile = momentum.rank(axis=1, pct=True)
    mom_long  = returns[mom_quintile > 0.8].mean(axis=1)
    mom_short = returns[mom_quintile < 0.2].mean(axis=1)
    factor_momentum = mom_long - mom_short

    # Volatility factor return: low vol long, high vol short
    vol_est   = returns.rolling(21).std()
    vol_rank  = vol_est.rank(axis=1, pct=True)
    vol_long  = returns[vol_rank < 0.2].mean(axis=1)   # low vol
    vol_short = returns[vol_rank > 0.8].mean(axis=1)   # high vol
    factor_volatility = vol_long - vol_short

    factors = pd.DataFrame({
        "momentum":   factor_momentum,
        "volatility": factor_volatility,
    })

    return factors.dropna()


def barra_attribution(portfolio_returns: pd.Series,
                       market_returns: pd.Series,
                       factor_returns: pd.DataFrame,
                       rolling_window: int = 60) -> pd.DataFrame:
    """
    Decompose portfolio returns into market beta, style beta, and alpha.

    Implements the daily cross-sectional Barra-style attribution from
    KTD-Fin (arXiv:2605.28359). Rolling OLS ensures the attribution
    is forward-consistent — no future factor loadings used.

    Parameters
    ----------
    portfolio_returns : pd.Series   — agent's daily portfolio returns
    market_returns   : pd.Series   — market benchmark daily returns
    factor_returns   : pd.DataFrame — style factor daily returns
    rolling_window   : int          — OLS estimation window (default 60 days)

    Returns
    -------
    pd.DataFrame — daily attribution with columns:
        market_return, style_return, alpha, total_return
    """
    aligned = pd.concat([
        portfolio_returns.rename("portfolio"),
        market_returns.rename("market"),
        factor_returns,
    ], axis=1).dropna()

    results = []

    for i in range(rolling_window, len(aligned)):
        window   = aligned.iloc[i - rolling_window:i]
        current  = aligned.iloc[i]

        X = window[["market"] + list(factor_returns.columns)].values
        y = window["portfolio"].values

        reg  = LinearRegression(fit_intercept=True).fit(X, y)
        betas = reg.coef_

        market_contrib  = betas[0] * current["market"]
        style_contribs  = sum(betas[j+1] * current[factor_returns.columns[j]]
                              for j in range(len(factor_returns.columns)))
        alpha           = (current["portfolio"]
                           - market_contrib - style_contribs)

        results.append({
            "date":           aligned.index[i],
            "total_return":   current["portfolio"],
            "market_return":  market_contrib,
            "style_return":   style_contribs,
            "alpha":          alpha,
        })

    return pd.DataFrame(results).set_index("date")


def evaluate_alpha_quality(attribution: pd.DataFrame,
                            min_ir: float = 0.5,
                            min_tstat: float = 2.0,
                            min_hit_rate: float = 0.52) -> dict:
    """
    Assess whether alpha is genuine using institutional-grade criteria.

    Mirrors the three-metric evaluation framework from KTD-Fin.

    Parameters
    ----------
    attribution   : pd.DataFrame — output of barra_attribution()
    min_ir        : float — minimum Information Ratio threshold
    min_tstat     : float — minimum T-statistic threshold
    min_hit_rate  : float — minimum positive-alpha day fraction

    Returns
    -------
    dict with keys: ir, t_stat, hit_rate, verdict, criteria_passed
    """
    from scipy import stats

    alpha     = attribution["alpha"].dropna()
    ir        = float(alpha.mean() / alpha.std() * np.sqrt(252))
    t_stat, _ = stats.ttest_1samp(alpha, 0)
    hit_rate  = float((alpha > 0).mean())

    c1 = ir       >= min_ir
    c2 = t_stat   >= min_tstat
    c3 = hit_rate >= min_hit_rate

    return {
        "ir":              round(ir, 3),
        "t_stat":          round(float(t_stat), 3),
        "hit_rate":        round(hit_rate, 3),
        "criteria_passed": sum([c1, c2, c3]),
        "verdict":         "GENUINE ALPHA" if (c1 and c2 and c3) else "INSUFFICIENT EVIDENCE",
        "c1_ir":           c1,
        "c2_tstat":        c2,
        "c3_hit_rate":     c3,
    }


# ---------------------------------------------------------------------------
# Full evaluation pipeline
# ---------------------------------------------------------------------------

def run_agent_evaluation(portfolio_returns: pd.Series,
                          market_returns: pd.Series,
                          prices: pd.DataFrame,
                          agent_label: str = "LLM Agent",
                          verbose: bool = True) -> dict:
    """
    End-to-end KTD-Fin style evaluation of any trading agent.

    Takes portfolio returns and decomposes them into market beta,
    style exposure, and genuine alpha. Reports institutional verdict.

    Parameters
    ----------
    portfolio_returns : pd.Series   — agent's daily portfolio returns
    market_returns   : pd.Series   — market benchmark (e.g. CSI300 or SPY)
    prices           : pd.DataFrame — asset price universe for factor construction
    agent_label      : str          — name for reporting
    verbose          : bool         — print diagnostics

    Returns
    -------
    dict with keys: attribution, alpha_quality, summary
    """
    factor_returns = compute_style_factors(prices)
    attribution    = barra_attribution(portfolio_returns,
                                       market_returns,
                                       factor_returns)
    alpha_quality  = evaluate_alpha_quality(attribution)

    if verbose:
        print(f"\n{'='*55}")
        print(f"  Agent Evaluation: {agent_label}")
        print(f"{'='*55}")
        print(f"  Total return   : {attribution['total_return'].sum():.2%}")
        print(f"  Market contrib : {attribution['market_return'].sum():.2%}")
        print(f"  Style contrib  : {attribution['style_return'].sum():.2%}")
        print(f"  Alpha (residual): {attribution['alpha'].sum():.2%}")
        print()
        print(f"  Alpha quality:")
        print(f"    Information Ratio : {alpha_quality['ir']:.3f}  "
              f"({'✓' if alpha_quality['c1_ir'] else '✗'} ≥ 0.5)")
        print(f"    T-statistic       : {alpha_quality['t_stat']:.3f}  "
              f"({'✓' if alpha_quality['c2_tstat'] else '✗'} ≥ 2.0)")
        print(f"    Hit rate          : {alpha_quality['hit_rate']:.1%}  "
              f"({'✓' if alpha_quality['c3_hit_rate'] else '✗'} ≥ 52%)")
        print(f"\n  VERDICT: {alpha_quality['verdict']}")
        print(f"  Criteria passed: {alpha_quality['criteria_passed']} / 3")

    return {
        "attribution":   attribution,
        "alpha_quality": alpha_quality,
    }


# ---------------------------------------------------------------------------
# Entry point — demo with synthetic agent returns
# ---------------------------------------------------------------------------

if __name__ == "__main__":
    print("=" * 55)
    print("Strategy Lab #5 — KTD-Fin Evaluation Framework")
    print("Algorithmic Token · ENTER Invest")
    print("=" * 55)
    print()
    print("Generating synthetic demonstration data...")
    print("(Replace with real agent returns for meaningful results)")
    print()

    np.random.seed(42)
    dates            = pd.date_range("2024-01-01", "2025-12-31", freq="B")
    n                = len(dates)
    market_returns   = pd.Series(np.random.normal(0.0003, 0.012, n),
                                 index=dates, name="market")

    # Synthetic agent: mostly market beta + small genuine alpha
    agent_returns    = (0.8 * market_returns
                        + pd.Series(np.random.normal(0.0001, 0.008, n),
                                    index=dates))

    # Synthetic price universe for factor construction
    prices           = pd.DataFrame(
        np.cumprod(1 + np.random.normal(0.0002, 0.015, (n, 20)), axis=0),
        index=dates,
        columns=[f"ASSET_{i:03d}" for i in range(20)]
    )

    run_agent_evaluation(
        portfolio_returns = agent_returns,
        market_returns    = market_returns,
        prices            = prices,
        agent_label       = "Synthetic LLM Agent (demo)",
        verbose           = True,
    )

    print()
    print("NOTE: Replace synthetic data with real agent portfolio")
    print("returns and market benchmark for meaningful evaluation.")
    print("See arXiv:2605.28359 for the full KTD-Fin methodology.")


# ---------------------------------------------------------------------------
# Risk Disclosure
# ---------------------------------------------------------------------------
# The experimental algorithms and implementations in this file are provided
# for educational and research purposes only. Past performance is not
# indicative of future results. All algorithmic trading carries significant
# financial risk, including the potential total loss of capital. Nothing
# here constitutes financial advice. ENTER Invest does not manage client
# funds based on strategies described here unless explicitly contracted.
# ---------------------------------------------------------------------------

Backtest Sketch

This Lab is structured differently from the previous four. The deliverable is not a tradeable strategy — it is an evaluation framework. The "backtest" is therefore a framework validation: does the attribution decomposition correctly identify the source of returns in a controlled setting?

Validation approach:

Synthetic agent with known composition: construct an agent whose returns are 80% market beta and 20% genuine alpha by design, then verify that the attribution framework correctly recovers these proportions within a reasonable margin.
Bright vs blinded comparison: run any available LLM trading agent in both conditions and measure the performance gap attributable to knowledge leakage.
Alpha persistence test: apply the three institutional criteria (IR ≥ 0.5, T-stat ≥ 2.0, hit rate ≥ 52%) to the decomposed alpha series.

Tradeability assessment

Data access scores higher than the intraday Labs because the framework works with daily equity data — widely available via yfinance for US markets, Qlib for Chinese A-shares. Implementation complexity is moderate — the masking protocol is straightforward; the rolling Barra attribution requires careful window management to avoid look-ahead. Strategy novelty is the highest score in the series so far: no prior Strategy Lab has addressed LLM agent evaluation, and the attribution decomposition approach is genuinely underused in practitioner contexts.

What This Means for Practitioners

The KTD-Fin findings have direct implications for anyone evaluating AI-driven trading systems — which increasingly means anyone evaluating any trading system, since LLM components are being embedded in execution and signal generation pipelines across the industry.

Three practical takeaways:

First — never evaluate a system on data the model could have seen during training. This applies to LLMs specifically but also to any ML model trained on historical financial data. The minimum safe gap between training cutoff and evaluation period is longer than most practitioners currently use.

Second — cumulative return is not evidence of skill. Before attributing performance to an agent’s decision-making, decompose the return into market, style, and residual components. If the residual alpha is not statistically significant, the return does not demonstrate transferable skill.

Third — the masking test is a useful sanity check for any systematic strategy. If a strategy’s performance degrades substantially when company names and dates are anonymised — when the model is forced to reason purely from price and fundamental structure — that degradation is a measure of how much the edge depends on pattern memorisation rather than structural market insights.

Discussion about this post

Ready for more?