PortBench: A Correlation-Aware, Full-Pipeline Benchmark for LLM-Driven Portfolio Management

Background & Motivation

Large language models (LLMs) have shown strong performance across diverse financial tasks, yet portfolio management (PM) remains poorly benchmarked. Existing benchmarks ignore cross-asset correlation structures and fail to evaluate the complete PM decision pipeline, missing the compounding errors that arise as reasoning propagates through sequential allocation stages.

We introduce PortBench with the following key contributions:

Dataset

Market Base Dataset

183 instruments across 6 asset classes (equities, bonds, commodities, crypto, real estate, cash) spanning 2015–2025, with daily prices, returns, macro indicators, and news. Inter-class correlations are low while intra-class correlations are high, so true diversification means crossing asset-class boundaries, not just picking more tickers.

Static Layer

6,269 QA Pairs

6,269 correlation-based QA pairs across 7 templates (T1–T7) and 4 difficulty levels, auto-generated from historical data via analytical formulas. The templates test correlation reasoning, from single-asset prediction through multi-asset constrained allocation to regime-driven rebalancing. Questions and ground truths are derived automatically (no human annotation needed), and new templates can be added on demand.

Dynamic Layer

Five-Stage Allocation Pipeline

Models execute S1 (Market Interpretation) → S2 (Signal Generation) → S3 (Weight Optimization) → S4 (Execution Simulation) → S5 (Risk Monitoring) sequentially at each rebalance date. A stateful sandbox tracks per-stage scores, weights, and NAV through time to reveal how early errors cascade into final outcomes. The pipeline is evaluated under 3 investor profiles and 3 historical stress regimes.

Metrics

Dual-Layer Correlation Score + CEPS

The dual-layer correlation score measures whether portfolios truly exploit inter-class hedging and avoid intra-class concentration. CEPS, a cross-stage error propagation score, quantifies how reasoning errors compound across pipeline stages. Unlike prior benchmarks, which average stage scores, CEPS penalizes error cascades.

PortBench framework overview: market data collection, dual-layer evaluation with static QA and dynamic pipeline, three risk profiles and stress regimes

Overview of PortBench. We first collect the Market Base Dataset (183 instruments × 6 asset classes, 2015–2025), then build a dual-layer evaluation framework on top: a static QA layer (6,269 correlation-based pairs) and a dynamic five-stage pipeline, jointly assessed under three risk profiles and three historical stress regimes.

Two-layer evaluation framework: static QA layer with 6,269 pairs above, dynamic five-stage pipeline below, both fed by market snapshots

Evaluation framework. Static QA layer (Top): seven task templates generated automatically from historical data. Dynamic five-stage pipeline (Bottom): executed sequentially at every rebalance date under three investor profiles and three stress regimes.

Key Findings

27/30

LLMs Fail to Beat Equal-Weight

Across 30 model–profile combinations, 27 / 30 underperform a naive 1/N equal-weight baseline on Sharpe ratio. High-quality market interpretation does not guarantee better portfolios. Only Qwen3.6-Plus (Balanced) simultaneously beats Equal-Weight and passes all stress gates.

6/10

Stress Rankings Diverge from Normal-Period

Normal-period CEPS rankings do not predict stress survival. 6 / 10 LLMs fail the Conservative stress gate, all during the 2022 Crypto Collapse. Compliant allocations still produce double-digit drawdowns via cross-asset contagion. Normal-period benchmarks systematically overestimate robustness.

ρ = −0.32

QA Performance ≠ Pipeline Competence

Spearman rank correlation between QA accuracy and pipeline CEPS is ρ = −0.32. GLM-5.1 ranks 7th in QA yet 1st in CEPS; Kimi-K2.6 ranks last in QA yet 3rd in CEPS. QA measures isolated recall; CEPS measures sustained reasoning across five causally dependent stages.

The sections below let you interactively explore each layer of PortBench. Start with the raw Market Base Dataset, then dive into the two evaluation layers that run on top of it.

Market Base Dataset

The Market Base Dataset covers 183 unique financial instruments spanning 2015–2025 across six heterogeneous asset classes, collected from Yahoo Finance, FRED, and Kaggle. Equities exhibit the broadest coverage (126 tickers), reflecting the diversity of broad-market, sector, and factor ETFs. Commodities (16) and bonds (15) provide representative cross-class hedging opportunities; cryptocurrency (12) captures major and mid-cap digital assets; real estate (10) and cash equivalents (4) round out the defensive allocation universe.

Correlation analysis reveals that inter-class average correlations are generally low while intra-class correlations are strongly positive. True diversification requires crossing asset class boundaries, not merely spreading across tickers within the same class, directly motivating the two-layer correlation scoring design.
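The intra- versus inter-class contrast described above can be checked with a few lines of code. The sketch below is illustrative only: the tickers, class labels, and return series are made-up toy values rather than PortBench data, and `mean_class_correlations` is a hypothetical helper name.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length return series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def mean_class_correlations(returns, asset_class):
    """Return (mean intra-class, mean inter-class) pairwise correlation."""
    intra, inter = [], []
    for a, b in combinations(returns, 2):
        rho = pearson(returns[a], returns[b])
        # Same class -> intra-class pair; different class -> inter-class pair.
        (intra if asset_class[a] == asset_class[b] else inter).append(rho)
    return sum(intra) / len(intra), sum(inter) / len(inter)


# Toy daily returns (hypothetical): two equities and two bonds.
returns = {
    "EQ_A": [0.010, -0.020, 0.015, -0.005, 0.020],
    "EQ_B": [0.012, -0.018, 0.010, -0.004, 0.022],
    "BD_A": [-0.002, 0.004, -0.001, 0.003, -0.003],
    "BD_B": [-0.001, 0.005, -0.002, 0.002, -0.004],
}
asset_class = {"EQ_A": "equity", "EQ_B": "equity", "BD_A": "bond", "BD_B": "bond"}

intra, inter = mean_class_correlations(returns, asset_class)
print(f"mean intra-class: {intra:.2f}, mean inter-class: {inter:.2f}")
```

In this toy universe, adding a second equity barely changes the correlation profile, while adding a bond does, which is the pattern the correlation score is designed to reward.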

Interactive Market Snapshot

183 instruments across 6 asset classes, daily data 2015–2025. Each monthly snapshot includes macro indicators, per-asset price summaries, and cross-class correlations.

Bar chart showing instrument count per asset class: 126 equities, 15 bonds, 16 commodities, 12 crypto, 10 real estate, 4 cash equivalents

Number of unique tickers/series per asset class.

Pairwise Pearson correlation heatmap of daily returns across all 183 instruments, 2015–2022, showing strong intra-class and weak inter-class correlation

Pairwise Pearson correlation matrix (daily returns, 2015–2022).

Bar chart of mean inter-class correlation for each asset class vs. all others, showing low average cross-asset correlation

Mean pairwise correlation: each class vs. all others.

Normalized Price Trajectories by Asset Class (2015–2025)

Base = 100 at first listing date. Each panel shows representative instruments from one asset class.

Equities (126 tickers: broad-market, sector, factor ETFs)

Bonds (15 series: full yield curve, TIPS, credit)

Commodities (16 tickers: energy, metals, agriculture)

Cryptocurrency (12 tickers: major and mid-cap digital assets)

Real Estate (10 series: REITs and housing indices)

Cash Equivalents (4 series: money market, ultra-short duration)


Evaluation Framework

The evaluation framework has two complementary layers. Switch between the tabs below to explore the QA Dataset and the Pipeline Evaluation in detail. Every component is objective, traceable, and scalable: QA pairs are auto-generated without human annotation, and pipeline ground truths are validated against realized future returns withheld from model prompts, eliminating oracle leakage and enabling seamless extension to new periods and assets.

QA Dataset (T1–T7)

6,269 correlation-aware QA pairs across 7 task templates (T1–T7), spanning complexity levels 1–4 and three market regimes. Each template tests a different aspect of portfolio reasoning. The QA browser shows sample question–answer pairs for each template.

Pipeline Evaluation

At each rebalance date a MarketSnapshot is constructed and passed to the LLM for five-stage evaluation (S1–S5). The pipeline browser shows, for a selected model, market scenario, and date, the model's input and stage-by-stage output against ground truth.

At a Glance

10 LLMs evaluated

5 classical baselines

6 asset classes

183 financial instruments

6,269 QA pairs

10 years of data (2015–2025)

Five-Stage Decision Pipeline

At each rebalance date the LLM executes S1–S5 sequentially. LLMs and classical baselines share the identical backtest environment for controlled comparison.

S1 · Market Interpretation: Continuous sentiment views v_i ∈ [−1, +1] per asset. Scored as 1 − MAE(views, ground truth) / 2.

S2 · Signal Generation: Views are discretized into buy / hold / sell signals (thresholds ±0.2). Scored as the fraction of assets with the correct direction.

S3 · Weight Optimization: Portfolio weights are scored by the two-layer correlation score: weight accuracy (L₁ distance to the signal-constrained max-Sharpe optimum, α = 0.5) plus correlation structure (intra-class concentration penalty and inter-class hedging credit). Ground truth is computed from realized future returns with strict point-in-time (PiT) safety.

S4 · Execution Simulation: Deterministic. Fixed transaction costs (10 bps slippage + 5 bps commission). Scores turnover deviation from the oracle rebalancing rate.

S5 · Risk Monitoring: Deterministic. 50% rebalance-trigger accuracy + 50% VaR/drawdown estimation accuracy. The rebalance trigger fires when the maximum single-asset drift exceeds 5%.
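The simpler stage rules above can be sketched in a few lines. This is a minimal illustration, not the benchmark's implementation: the function names are hypothetical, the handling of views exactly at ±0.2, the turnover definition (sum of absolute weight changes), and the drift definition (absolute gap to target weight) are all assumptions, and the S3 score is omitted because its exact form is not given here.

```python
COST_RATE = (10 + 5) / 10_000  # 10 bps slippage + 5 bps commission


def s1_score(views, truth):
    """S1: 1 - MAE / 2, which maps views in [-1, +1] to a score in [0, 1]."""
    mae = sum(abs(v - t) for v, t in zip(views, truth)) / len(views)
    return 1.0 - mae / 2.0


def to_signal(view, threshold=0.2):
    """S2: discretize one view (assumption: |view| <= threshold means hold)."""
    if view > threshold:
        return "buy"
    if view < -threshold:
        return "sell"
    return "hold"


def s2_score(signals, truth_signals):
    """S2: fraction of assets whose signal matches the ground-truth direction."""
    correct = sum(s == t for s, t in zip(signals, truth_signals))
    return correct / len(signals)


def transaction_cost(old_weights, new_weights, rate=COST_RATE):
    """S4: cost as a fraction of NAV (assumption: rate times sum of |weight change|)."""
    turnover = sum(abs(n - o) for o, n in zip(old_weights, new_weights))
    return rate * turnover


def rebalance_triggered(weights, targets, limit=0.05):
    """S5: fire when the largest single-asset drift from target exceeds the limit."""
    return max(abs(w - t) for w, t in zip(weights, targets)) > limit


# Hypothetical three-asset example.
views, truth_views = [0.6, -0.4, 0.1], [0.4, -0.8, 0.3]
signals = [to_signal(v) for v in views]
truth_signals = [to_signal(v) for v in truth_views]

print(round(s1_score(views, truth_views), 3))
print(signals, round(s2_score(signals, truth_signals), 3))
print(transaction_cost([0.5, 0.3, 0.2], [0.4, 0.4, 0.2]))
print(rebalance_triggered([0.47, 0.35, 0.18], [0.40, 0.40, 0.20]))
```

In the example the third view (0.1) falls inside the hold band while its ground truth (0.3) is a buy, so a small S1 error still costs a full asset in S2. This is one way a strong interpretation score can coexist with a weak signal score.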

CEPS: Cross-Stage Error Propagation Score

Prior benchmarks obscure early reasoning failures by averaging scores. CEPS penalizes error cascades (a strong stage followed by a weak one) more heavily than uniform mediocrity, capturing the operational reality that a perfectly interpreted market view is worthless if signal generation immediately fails.

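CEPS = clip( mean(σ₁…σ₅) − λ · Σ max(σₜ − σₜ₊₁, 0), 0, 1 )

Here σₜ is the score of stage t, and the sum runs over the four consecutive stage pairs (t = 1…4). Default propagation weight: λ = 0.1.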

Ground truth for all LLM-scored stages is derived from realized future returns withheld from prompts, guaranteeing point-in-time (PiT) safety throughout.

CEPS: Cascade vs. Uniform Mediocrity

Two models with identical average stage scores (0.526) receive different CEPS scores because one cascades errors while the other is uniformly mediocre.

	S1	S2	S3	S4	S5	Avg
Model A (cascade)	0.792	0.506	0.714	0.136	0.480	0.526
Model B (uniform)	0.526	0.526	0.526	0.526	0.526	0.526

	Model A (cascade)	Model B (uniform)
Isolated avg	0.526	0.526
Cascade drops	(0.792 − 0.506) + (0.714 − 0.136) = 0.286 + 0.578 = 0.864	0
Penalty (λ = 0.1)	0.1 × 0.864 = 0.086	0
CEPS	0.526 − 0.086 = 0.440	0.526 − 0 = 0.526

The cascade penalty (λ = 0.1) reduces Model A's CEPS by 0.086, penalizing the sharp S1→S2 and S3→S4 drops that indicate brittle error propagation. Model B's uniform mediocrity incurs no penalty.
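The CEPS formula is short enough to write out directly. The sketch below applies it to the Model A and Model B stage scores from the comparison above; the function name `ceps` is an illustrative choice, and the benchmark's own aggregation (for example, across rebalance dates) may differ.

```python
def ceps(stage_scores, lam=0.1):
    """Mean stage score minus lam times the sum of stage-to-stage drops, clipped to [0, 1]."""
    mean_score = sum(stage_scores) / len(stage_scores)
    # Only decreases between consecutive stages count; improvements are not rewarded.
    drops = sum(max(a - b, 0.0) for a, b in zip(stage_scores, stage_scores[1:]))
    return min(max(mean_score - lam * drops, 0.0), 1.0)


model_a = [0.792, 0.506, 0.714, 0.136, 0.480]  # cascade
model_b = [0.526] * 5                          # uniform mediocrity

print(f"Model A: {ceps(model_a):.4f}")
print(f"Model B: {ceps(model_b):.4f}")
```

Without intermediate rounding, Model A comes out at about 0.4392. The 0.440 in the table results from rounding the mean (0.526) and the penalty (0.086) before subtracting.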
        

Investor Profiles & Stress Regimes

CEPS is evaluated under three investor risk profiles with escalating risk tolerance, and back-tested across three historical stress regimes to assess robustness when market conditions deteriorate sharply.

Each profile is tested under three historical stress regimes:

Regime	Year	Period	Driver
China Shock	2015	Aug 2015 – Feb 2016	liquidity-driven
COVID Crash	2020	Feb 2020 – May 2020	pandemic-driven
Crypto Collapse	2022	May 2022 – Dec 2022	monetary-tightening-driven

A model passes the stress gate for a given profile if its maximum drawdown across all three regimes stays within the profile's tolerance. Six of ten models fail the Conservative stress gate, all during the 2022 Crypto Collapse, where small crypto exposures compliant with allocation caps amplify into double-digit drawdowns (compliance trap: every process constraint satisfied, outcome safety violated).
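The gate rule reduces to a worst-case check across regimes. The sketch below assumes drawdowns are reported as negative percentages (as in the results table) and uses made-up drawdown figures and a made-up tolerance; the actual per-profile tolerances are not stated here.

```python
def passes_stress_gate(max_drawdown_by_regime, tolerance_pct):
    """Pass only if the worst drawdown over all regimes stays within tolerance."""
    worst = max(abs(dd) for dd in max_drawdown_by_regime.values())
    return worst <= tolerance_pct


# Hypothetical drawdowns (percent) for one model under one profile.
drawdowns = {"China Shock": -4.1, "COVID Crash": -7.8, "Crypto Collapse": -11.2}

print(passes_stress_gate(drawdowns, tolerance_pct=10.0))  # hypothetical tolerance
print(passes_stress_gate(drawdowns, tolerance_pct=15.0))  # looser hypothetical tolerance
```

Because the check uses the maximum over regimes, a single bad regime is enough to fail the gate, regardless of how well the model held up in the other two.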

Pipeline Performance: All Three Profiles

🔑 Key Finding

Across 30 model–profile combinations (10 LLMs × 3 profiles), 27 of 30 underperform a naive 1/N equal-weight allocation on Sharpe ratio. This holds despite strong S1–S3 pipeline scores, since high-quality market interpretation and signal generation do not guarantee portfolios that outperform a rule-based baseline. Only Qwen3.6-Plus (Balanced) simultaneously beats Equal-Weight and passes all stress gates.

The table below reports per-stage scores, CEPS, and financial outcomes for all ten LLMs and five classical baselines (normal period, 2024). Bold = column best within each profile. Select a profile to view results.

Model	Pipeline Scores					CEPS	Financial Outcomes				Gate
Model	S1	S2	S3	S4	S5	CEPS	Sharpe	Ret%	MaxDD%	Vol%	Gate
DeepSeek-V4-Pro	.766	.406	.752	.173	.483	.436	0.217	5.49	−3.53	7.61	✗
GLM-5.1	.769	.421	.751	.224	.561	.421	0.764	9.54	−3.14	8.14	✗
DeepSeek-V4-Flash	.764	.390	.766	.219	.386	.402	0.080	4.54	−7.43	8.24	✗
Kimi-K2.6	.791	.438	.758	.177	.319	.396	0.576	9.51	−4.90	9.64	✗
Qwen3.7-Max	.750	.387	.746	.158	.395	.387	0.450	7.36	−3.00	6.86	✓
Qwen3.6-Plus	.815	.466	.752	.128	.339	.386	0.548	9.13	−5.03	9.43	✓
Qwen3.6-35B-A3B	.748	.445	.749	.177	.347	.383	−0.033	3.58	−5.54	11.10	✓
Hunyuan3-Preview	.804	.527	.759	.029	.256	.372	0.621	9.95	−5.45	10.06	✗
Doubao-Seed-2.0-Lite	.768	.370	.752	.060	.339	.330	0.462	7.17	−3.01	8.28	✓
Doubao-Seed-2.0-Pro	.781	.449	.744	.094	.263	.325	0.708	8.85	−3.05	7.60	✗
Equal-Weight (EqW)	N/A	N/A	N/A	N/A	N/A	N/A	0.740	12.13	−5.09	10.25	N/A
60/40	N/A	N/A	N/A	N/A	N/A	N/A	0.651	10.17	−4.27	8.82	N/A
Risk Parity	N/A	N/A	N/A	N/A	N/A	N/A	0.111	4.56	−2.02	3.24	N/A
Cov. Risk Parity	N/A	N/A	N/A	N/A	N/A	N/A	−0.147	3.71	−2.02	2.98	N/A
Min-Variance	N/A	N/A	N/A	N/A	N/A	N/A	−0.601	2.45	−2.02	2.71	N/A
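The comparison against Equal-Weight can be read straight off the table. The sketch below copies the Sharpe and Gate columns for the profile shown above (a single profile, so it covers 10 of the 30 model–profile combinations) and lists which models beat the Equal-Weight Sharpe ratio and which of those also pass the gate.

```python
EQW_SHARPE = 0.740  # Equal-Weight row

# (Sharpe, passes gate) per model, copied from the table above.
results = {
    "DeepSeek-V4-Pro": (0.217, False),
    "GLM-5.1": (0.764, False),
    "DeepSeek-V4-Flash": (0.080, False),
    "Kimi-K2.6": (0.576, False),
    "Qwen3.7-Max": (0.450, True),
    "Qwen3.6-Plus": (0.548, True),
    "Qwen3.6-35B-A3B": (-0.033, True),
    "Hunyuan3-Preview": (0.621, False),
    "Doubao-Seed-2.0-Lite": (0.462, True),
    "Doubao-Seed-2.0-Pro": (0.708, False),
}

beats_eqw = [m for m, (sharpe, _) in results.items() if sharpe > EQW_SHARPE]
beats_and_passes = [m for m in beats_eqw if results[m][1]]

print("Beat Equal-Weight on Sharpe:", beats_eqw)
print("Beat Equal-Weight and pass the gate:", beats_and_passes)
```

For this profile only one model clears the Equal-Weight Sharpe ratio, and it does not pass the gate, which is consistent with the finding that a single model–profile combination achieves both.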

Risk-adjusted return metrics (Sharpe, total return, max drawdown, CEPS) under the Balanced profile.

Portfolio NAV trajectories (2024). Shaded band = range across all LLMs; dashed lines = classical baselines.

Stress Regime Results

🔑 Key Finding

Normal-period CEPS rankings do not predict stress survival. Models with strong normal-period CEPS rankings can collapse under historical stress, and the correlation between normal and stress-period scores is weak. Ranking models by calm-market performance alone is insufficient and potentially misleading for real-world deployment. Benchmarks that evaluate only under i.i.d. market conditions systematically overestimate model robustness.

More troubling still, 6 of 10 LLMs fail the Conservative stress gate, all during the 2022 Crypto Collapse. These models satisfy every allocation constraint (equity/crypto caps, bond/cash floors, drawdown/VaR limits) yet still suffer double-digit drawdowns. Small, compliant crypto exposures amplify into portfolio-level losses because the models fail to anticipate cross-asset contagion. Checking procedural boxes does not guarantee outcome-level safety. Stress-regime evaluation is not optional, as it is the only layer that reveals which models are genuinely safe.

The stress gate is defined as follows: a model passes for a given profile if its maximum drawdown across all three historical stress regimes stays within the profile's tolerance. The images below show aggregate stress performance across all models and profiles.

Grouped bar chart showing maximum drawdown for each LLM across three stress regimes (China Shock, COVID, Crypto Collapse), with Conservative tolerance threshold line

Max drawdown across three stress regimes (worst case, all profiles). Six of ten LLMs fail Conservative.

Normal-period vs. stress-period CEPS (2022, Conservative). High normal scores do not predict stress survival.

Takeaway: Stress-regime evaluation is the only layer that reveals which models are genuinely safe for deployment; normal-period benchmarks alone systematically overestimate robustness.

Static QA Evaluation Results

The QA–pipeline rank dissociation noted in the Key Findings raises a natural question: where do different models excel or struggle within the QA layer itself? The table below breaks down per-template accuracy, revealing sharp divergence between formula-driven tasks (T4, T5) and judgment-driven ones (T1, T2, T6, T7).

Per-template accuracy (full & restricted covariance conditions), formula vs. judgment averages, and accuracy by market regime. Bold = column best. Pink rows = Mean < 0.65.

Model	Per-Template (Full)							Mean	Restricted		Task Type		Market Regime
Model	T1	T2	T3	T4	T5	T6	T7	Mean	T4_r	T5_r	F	J	Bull	Bear	Side.
DeepSeek-V4-Flash	.520	.843	.945	1.00	.932	.652	.843	.819	.975	.860	.966	.715	.827	.823	.812
Qwen3.7-Max	.500	.859	.951	1.00	.954	.724	.742	.819	1.00	.990	.977	.706	.814	.863	.810
DeepSeek-V4-Pro	.520	.837	.963	1.00	.992	.652	.760	.818	1.00	.660	.996	.692	.844	.846	.802
Doubao-Seed-2.0-Lite	.460	.798	.957	.956	.897	.810	.747	.804	.961	.940	.927	.704	.780	.846	.806
Doubao-Seed-2.0-Pro	.440	.847	.963	.991	.912	.824	.530	.787	.979	.923	.952	.660	.764	.806	.792
Qwen3.6-Plus	.440	.858	.968	1.00	.804	.640	.768	.783	1.00	.810	.902	.677	.799	.801	.771
GLM-5.1	.440	.855	.964	1.00	.421	.882	.738	.757	1.00	.531	.711	.729	.778	.765	.746
Qwen3.6-35B-A3B	.460	.808	.961	1.00	.230	.564	.763	.684	1.00	.320	.615	.649	.714	.729	.662
Hunyuan3-Preview	.460	.386	.336	.975	.958	.468	.783	.624	.982	.974	.967	.524	.664	.663	.597
Kimi-K2.6	.420	.422	.493	.956	.280	.684	.320	.511	.978	.710	.618	.462	.556	.531	.487

F = mean(T4,T5) formula-driven; J = mean(T1,T2,T6,T7) judgment-driven. Restricted (T4_r, T5_r) withholds the covariance matrix: 7 of 10 models perform better without it, confirming format matching rather than genuine numerical reasoning.
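The restricted-condition count can be checked against the table. The sketch below assumes one particular reading, namely that a model "performs better" when its mean over (T4_r, T5_r) exceeds its mean over (T4, T5). That reading is an assumption, although it does reproduce the 7-of-10 figure from the rounded table entries.

```python
# (T4, T5, T4_r, T5_r) per model, copied from the per-template table above.
scores = {
    "DeepSeek-V4-Flash": (1.00, 0.932, 0.975, 0.860),
    "Qwen3.7-Max": (1.00, 0.954, 1.00, 0.990),
    "DeepSeek-V4-Pro": (1.00, 0.992, 1.00, 0.660),
    "Doubao-Seed-2.0-Lite": (0.956, 0.897, 0.961, 0.940),
    "Doubao-Seed-2.0-Pro": (0.991, 0.912, 0.979, 0.923),
    "Qwen3.6-Plus": (1.00, 0.804, 1.00, 0.810),
    "GLM-5.1": (1.00, 0.421, 1.00, 0.531),
    "Qwen3.6-35B-A3B": (1.00, 0.230, 1.00, 0.320),
    "Hunyuan3-Preview": (0.975, 0.958, 0.982, 0.974),
    "Kimi-K2.6": (0.956, 0.280, 0.978, 0.710),
}


def mean(values):
    return sum(values) / len(values)


# Models whose restricted formula average exceeds their full-information average F.
improved = [
    model
    for model, (t4, t5, t4_r, t5_r) in scores.items()
    if mean((t4_r, t5_r)) > mean((t4, t5))
]

print(f"{len(improved)} of {len(scores)} models score higher without the covariance matrix")
```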

QA–Pipeline Rank Dissociation

🔑 Key Finding

QA performance does not imply pipeline competence. The Spearman rank correlation between QA accuracy and pipeline CEPS is ρ = −0.32. GLM-5.1 ranks 7th in QA yet 1st in CEPS; Kimi-K2.6 ranks last in QA yet 3rd in CEPS. Doubao-Seed-2.0-Lite ranks 4th in QA but last in CEPS, answering static questions correctly yet failing to translate that knowledge into executable portfolio decisions. QA measures isolated factual recall; CEPS measures sustained reasoning across five causally dependent stages. The dissociation is driven by S4 execution fidelity: models that ace formula-driven tasks (T4/T5) often collapse when translating signals into actual orders.

Model	QA Mean	QA Rank	CEPS_bal	CEPS Rank	ΔRank
DeepSeek-V4-Flash	.819	1	.463	2	−1
Qwen3.7-Max	.819	2	.384	8	−6
DeepSeek-V4-Pro	.818	3	.365	9	−6
Doubao-Seed-2.0-Lite	.804	4	.357	10	−6
Doubao-Seed-2.0-Pro	.787	5	.405	6	−1
Qwen3.6-Plus	.783	6	.426	4	+2
GLM-5.1	.757	7	.470	1	+6
Qwen3.6-35B-A3B	.684	8	.424	5	+3
Hunyuan3-Preview	.624	9	.389	7	+2
Kimi-K2.6	.511	10	.434	3	+7
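The rank correlation can be recomputed from the two score columns above. The sketch below is a dependency-free Spearman implementation that assigns tied values their average rank; it works from the rounded table entries (where two QA means tie at .819), so agreement with the reported ρ is approximate.

```python
from math import sqrt


def average_ranks(values):
    """Rank values in ascending order, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # positions i..j share one average rank
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    return ranks


def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))


def spearman(x, y):
    """Spearman's rho as the Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


# QA Mean and CEPS_bal columns, in the row order of the table above.
qa_mean = [0.819, 0.819, 0.818, 0.804, 0.787, 0.783, 0.757, 0.684, 0.624, 0.511]
ceps_bal = [0.463, 0.384, 0.365, 0.357, 0.405, 0.426, 0.470, 0.424, 0.389, 0.434]

print(f"Spearman rho = {spearman(qa_mean, ceps_bal):.2f}")
```

With tie-averaged ranks this gives ρ ≈ −0.32. Using the table's integer QA ranks (1 and 2 for the tied pair) would give a slightly weaker negative value, so the tie handling matters at this sample size.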

Quadrant scatter plot of S2 signal accuracy vs. S4 execution fidelity per model, highlighting models with strong signals but poor execution

S2 (signal) vs. S4 (execution). Hunyuan3-Preview leads S2 yet collapses in S4: strong signals, no execution.

Profile Alignment Score (PAS) per model. GLM-5.1 applies a nearly identical allocation regardless of risk profile (σ = 0.014).

Citation

@article{zhao2026portbench,
  title={PortBench: A Correlation-Aware, Full-Pipeline Benchmark for LLM-Driven Portfolio Management},
  author={Zhao, Yuxuan and Chen, Sijia and Su, Ningxin},
  journal={arXiv preprint arXiv:2605.27887},
  year={2026}
}
