The Hidden Alignment Transition in Language Model Scaling
Enter your model's size + any benchmarks to get alignment phase, scaling recommendations, and predictions. Works for any model from 70M to frontier scale. · Amin (2026) · ZEHEN Labs
Critical Scale Nc: 3.5B
Pre-transition r: −0.989
Base + Frontier Models: 63 + 39
Families: 16
Frontier Slope: 0.513
Papers: "Lying Is Just a Phase" + "Growing Pains of Frontier Models" (NeurIPS 2026)
Analyze Any Model — Phase Classification + Actionable Recommendations
Custom Benchmark Pair — For Nc2/Nc3 Detection
Known Models — Click to analyze
Phase Diagram — TruthfulQA vs Parameters
TAX
N < 3.5B — Alignment Tax
γ₁₂ < 0 · r = −0.989 · d_eff ≈ 1.05
Scaling reasoning actively degrades truthfulness. The anti-coupling is built into pre-training, before any RLHF. Every web-trained family shows this. Loss is exact (CV=0.8%) — the transition is invisible in loss.
Recommendation: curate data. 1 unit of quality ≈ 10× scale. Phi shows the tax is eliminable.
TRANS
~3.5B — Critical Point
γ₁₂ = 0 · χ → ∞ · Arrhenius C spikes 10×
Maximum susceptibility. Gradient dips 37% below trend. Eigenvector rotates sharply. Loss landscape is at its flattest — small interventions have maximum leverage. OLMo sits here with γ₁₂ = 0.000 exactly.
Maximum alignment ROI. OLMo confirms the zero-parameter prediction.
BONUS
N > 3.5B — Alignment Bonus
γ₁₂ > 0 · r = +0.770 cross-family · d_eff → 2
Capabilities cooperate. Scale improves both reasoning and truthfulness. The Arrhenius activation energy is C = 196 (vs 28 in the Tax phase). Dimensional collapse begins: d_eff shrinks from 2 → 1 as the capability manifold condenses.
Recommendation: scale freely. Capability gains are shared.
N_C2
~70B–130B — Axis Rotation
HS/TQA saturate · SWE/GPQA activate · d_eff → 2 again
HellaSwag and TruthfulQA compress to a 4.9-point range. New capability axes (SWE-bench, GPQA Diamond) become discriminating. The r(SWE,GPQA) = +0.85 confirms cooperative phase, but d_eff = 1.75 — new dimension still opening. Theory breaks at det(H)→0 near 130B.
IFEval is the next key benchmark. Predicted Nc3 ≈ 114B.
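The phase boundaries above collapse into a small lookup. A minimal sketch; the width of the window around Nc1 and the function name are my illustrative choices, not fitted quantities — a real classifier would also use benchmark couplings, not scale alone:

```python
def classify_phase(n_params):
    """Coarse phase label from parameter count alone, using the scale
    boundaries quoted above (Nc1 ~ 3.5B, Nc2 ~ 70B). The +/- window
    around Nc1 is illustrative, not fitted."""
    b = n_params / 1e9  # billions of parameters
    if b < 3.0:
        return "tax"            # gamma_12 < 0: reasoning vs truthfulness
    if b <= 4.0:
        return "transition"     # near-critical: maximum susceptibility
    if b < 70.0:
        return "bonus"          # gamma_12 > 0: capabilities cooperate
    return "axis rotation"      # Nc2 regime: new benchmark axes activate
```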
Frontier Coupling — SWE-bench vs GPQA Diamond (Feb–Mar 2026)
r = +0.85 (n = 20, p < 0.00001): cooperative coupling strongly confirmed. Sonnet 4.6 shows an h = −13.4 anomaly (tax excursion); Opus 4.6 recovers to h = +2.8. GPT-5.4 shows h = −1.6 (a mild coding-specialist signature).
Within-Family Trajectory — Anthropic as Phase Diagnostic
Transition | ΔSWE | ΔGPQA | γ₁₂ | h(D) | Interpretation
Sonnet 4.5 → Sonnet 4.6 | +2.4 | −9.3 | −3.88 | −13.4 | Tax excursion: coding optimized at reasoning cost
Sonnet 4.6 → Opus 4.6 | +1.2 | +17.2 | +14.3 | +2.8 | Recovery: full cooperative phase restored
Protocol: for any two consecutive releases, compute γ₁₂ = ΔGPQA/ΔSWE. If it is negative, the training recipe entered a tax excursion; a single eval run suffices to detect it before deployment.
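A minimal sketch of this protocol (the function name is mine; the deltas are the Sonnet 4.5 → 4.6 values from the table above):

```python
def gamma12(delta_swe, delta_gpqa):
    """Release-over-release coupling: gamma_12 = dGPQA / dSWE.
    A negative value flags a tax excursion in the training recipe."""
    if delta_swe == 0:
        raise ValueError("gamma_12 undefined when dSWE = 0")
    return delta_gpqa / delta_swe

# Sonnet 4.5 -> Sonnet 4.6, score deltas from the table above
g = gamma12(2.4, -9.3)   # negative: tax excursion
```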
Within-Family Trajectory — Google Gemini as Independent Test
Transition | ΔSWE | ΔGPQA | γ₁₂ | h(D) | Interpretation
2.5 Pro → 3 Flash | +14.2 | +6.4 | +0.45 | +8.9 → +4.1 | Cooperative: both improve
3 Flash → 3 Pro | −1.8 | +1.5 | −0.83 | +4.1 → +7.0 | Flash→Pro tradeoff: reasoning prioritized over coding
3 Pro → 3.1 Pro | +4.4 | +2.4 | +0.55 | +7.0 → +6.0 | Recovery: both capabilities improve
Second within-family test: Gemini's h-field stays positive throughout (+4 to +9) — a reasoning-specialist training recipe, the frontier analogue of Phi.
The Flash→Pro excursion (γ₁₂ = −0.83) mirrors Anthropic's Sonnet→Opus pattern: tier-specialist training creates a local tax that recovers at the next release.
Two labs, same physics.
OpenAI Trajectory — Now With Tax Excursion (GPT-5.4)
Transition | ΔSWE | ΔGPQA | γ₁₂ | h(D) | Interpretation
GPT-4o → GPT-5 | +41.7 | +32.1 | +0.77 | +2.5 → +1.7 | Strongly cooperative: massive joint gain
GPT-5 → GPT-5.1 | +1.4 | +2.4 | +1.71 | +1.7 → +3.0 | Cooperative: reasoning outpaces coding
GPT-5.1 → GPT-5.4 | +0.9 | −3.9 | −4.33 | +3.0 → −1.6 | Tax excursion: coding optimized at reasoning cost
GPT-5.4 → GPT-5.2 Pro | +2.8 | +9.0 | +3.21 | −1.6 → +5.2 | Recovery: full cooperative phase restored
Update: GPT-5.4 shows the same tax excursion pattern as Anthropic's Sonnet 4.6 (γ₁₂ = −4.33 vs −3.88). h dips to −1.6 before GPT-5.2 Pro recovers to +5.2.
Three labs, same physics: coding-specialist releases create local tax excursions that recover at the next generation. The universality of this pattern across Anthropic, OpenAI, and Google is now confirmed.
Frontier 3×3 Coupling Matrix — SWE · GPQA · IFEval
det(H_2×2) → 0 and the third eigenvalue becomes significant, so pairwise γ₁₂ is insufficient: a 3×3 coupling matrix is needed. Future work extends to higher dimensions.
Arrhenius Activation Energy per Phase — New Result
The Arrhenius form log(rate) = A − C/S was fit separately in each coupling phase. The activation constant C is not universal — it spikes 10× at the phase boundary. This is the thermodynamic signature of the saddle point.
Phase | Scale Range | C_Arrhenius | r² | Interpretation
Tax | 70M–1B | 28 | 0.32 | Shallow activation barrier
Transition | 1B–2.8B | 316 ★ | 0.88 | 10× spike = saddle point of loss landscape
Bonus | 2.8B–12B | 196 | 0.94 | Deeper cooperative well
log(dS/dlog₁₀N) = A − C/S
Arrhenius structure survives all three phases. The 10× C_Arr spike at Nc directly explains the 37% gradient dip — measurable from gradient norms without any benchmark data.
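The per-phase fit is an ordinary linear regression in x = 1/S. A sketch with synthetic data standing in for the paper's (here S plays the role of the benchmark score, rate its derivative dS/dlog₁₀N, and A = 2.0 is an arbitrary intercept chosen for the self-check):

```python
import numpy as np

def fit_arrhenius(S, rate):
    """Least-squares fit of log(rate) = A - C/S in the variable x = 1/S.
    Returns (A, C): intercept and activation constant."""
    x = 1.0 / np.asarray(S, dtype=float)
    y = np.log(np.asarray(rate, dtype=float))
    slope, intercept = np.polyfit(x, y, 1)   # y = slope*x + intercept
    return float(intercept), float(-slope)   # A = intercept, C = -slope

# Synthetic self-check: data generated with A = 2.0 and C = 196 (the
# Bonus-phase value above) should be recovered exactly.
S = np.linspace(5.0, 50.0, 20)
rate = np.exp(2.0 - 196.0 / S)
A, C = fit_arrhenius(S, rate)
```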
Benchmark Survival at Each Nc — Eigenvector Analysis
Scale | Active Phase | Discriminating Benchmarks | New Dimension Trigger
70M–3.5B | Tax | HellaSwag, TruthfulQA | —
~3.5B | Nc1 | HS⊕TQA coupling flips | MMLU enters below chance at ~3B
3.5B–70B | Bonus | HS, TQA, MMLU all cooperative | —
~70B–130B | Frontier | SWE-bench, GPQA Diamond | IFEval λ₁ loading = 0.64 (dominant)
~114B | Nc3 | IFEval + agentic safety | HarmBench / AgentBench (recommended)
Phase-Separated Correlation Matrix — How TQA Restructures at Nc
▸ BELOW Nc (TAX PHASE)
     | HS    | TQA   | ARC   | MMLU  | WG
HS   | 1.00  | −0.53 | +0.89 | +0.74 | +0.67
TQA  | −0.53 | 1.00  | −0.65 | −0.12 | −0.28
ARC  | +0.89 | −0.65 | 1.00  | +0.82 | +0.71
MMLU | +0.74 | −0.12 | +0.82 | 1.00  | +0.52
WG   | +0.67 | −0.28 | +0.71 | +0.52 | 1.00
4/10 pairs negative • deff = 1.53 • Mean r = +0.07
▸ ABOVE Nc (BONUS PHASE)
     | HS    | TQA   | ARC   | MMLU  | WG
HS   | 1.00  | +0.91 | +0.95 | +0.90 | +0.73
TQA  | +0.91 | 1.00  | +0.92 | +0.85 | +0.69
ARC  | +0.95 | +0.92 | 1.00  | +0.93 | +0.72
MMLU | +0.90 | +0.85 | +0.93 | 1.00  | +0.62
WG   | +0.73 | +0.69 | +0.72 | +0.62 | 1.00
0/10 pairs negative • deff = 1.20 • Mean r = +0.89
Key finding: the restructuring is specific to truthfulness. All 4 TQA pairs flip sign across Nc (Frobenius |Δr| = 1.56), while 0/6 non-TQA pairs flip (|Δr| = 0.33). TQA loads anti-aligned with PC1 below Nc (+0.49 vs −0.49 for HS) and aligned above.
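The effective dimensionality can be recomputed from the matrices above. This sketch uses the eigenvalue participation ratio, one standard convention; the paper's normalization may differ, so treat the absolute values as illustrative — the below-vs-above ordering is the point:

```python
import numpy as np

def d_eff(R):
    """Effective dimensionality as the eigenvalue participation ratio,
    (sum lam)^2 / sum(lam^2). One common convention; the paper's exact
    normalization may differ."""
    lam = np.linalg.eigvalsh(R)
    return lam.sum() ** 2 / (lam ** 2).sum()

# Correlation matrices transcribed from the phase-separated tables above
# (benchmark order: HS, TQA, ARC, MMLU, WG)
R_below = np.array([
    [ 1.00, -0.53,  0.89,  0.74,  0.67],
    [-0.53,  1.00, -0.65, -0.12, -0.28],
    [ 0.89, -0.65,  1.00,  0.82,  0.71],
    [ 0.74, -0.12,  0.82,  1.00,  0.52],
    [ 0.67, -0.28,  0.71,  0.52,  1.00],
])
R_above = np.array([
    [1.00, 0.91, 0.95, 0.90, 0.73],
    [0.91, 1.00, 0.92, 0.85, 0.69],
    [0.95, 0.92, 1.00, 0.93, 0.72],
    [0.90, 0.85, 0.93, 1.00, 0.62],
    [0.73, 0.69, 0.72, 0.62, 1.00],
])
```

Under this convention the tax-phase matrix comes out higher-dimensional than the bonus-phase matrix, matching the reported collapse.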
Phase-by-Phase Progression — deff Peaks at Transition (Critical Fluctuations)
Phase | deff | Notes
Tax Phase | 1.53 | 4 negative pairs; TQA anti-aligned
Transition | 1.81 | PEAK: maximum fluctuations at Nc
Bonus Phase | 1.20 | 0 negative pairs; all cooperative
Frontier | 1.15 | Deep cooperative
Nc,3 regime | 1.33 | All positive but rising; a new tax opening?
Physics prediction confirmed: deff peaks at 1.81 in the transition zone — maximum effective dimensionality at the critical point.
This is textbook: maximum fluctuations = maximum uncertainty about which phase the system occupies. The system "doesn't know" if it's in the tax or bonus regime, so all dimensions contribute equally.
Above Nc, deff collapses to ~1.2 as the soft mode freezes out. At Nc,3, deff starts rising again (1.33) — the fingerprint of a new transition opening.
Leave-One-Family-Out CV — Sign Robustness Across All 10 Benchmark Pairs
▸ BELOW Nc: all TQA pairs keep their negative sign under CV
HS–TQA: negative in 5/5 folds
ARC–TQA: negative in 5/5 folds
MMLU–TQA: negative in 5/5 folds
WG–TQA: negative in 4/5 folds
All non-TQA: positive in 5/5 folds
▸ ABOVE Nc: all pairs positive in all folds
Every benchmark pair, including all TQA pairs, shows positive correlation in every leave-one-family-out fold. Net result: 4/4 TQA pairs flip sign across Nc; 0/6 non-TQA pairs flip.
The truthfulness tax is specific and robust.
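The fold logic can be sketched directly; the arrays below are synthetic stand-ins for the real family scores, invented for illustration:

```python
import numpy as np

def sign_survives_lofo(x, y, families):
    """Leave-one-family-out robustness: recompute corr(x, y) with each
    model family held out and check the sign never flips."""
    signs = set()
    for held_out in sorted(set(families)):
        keep = [i for i, f in enumerate(families) if f != held_out]
        r = np.corrcoef(x[keep], y[keep])[0, 1]
        signs.add(np.sign(r))
    return len(signs) == 1

# Synthetic anti-coupled pair across three hypothetical families:
# the tax-phase pattern (one score rises, the other falls).
families = ["A"] * 4 + ["B"] * 4 + ["C"] * 4
hs = np.linspace(30.0, 60.0, 12)
tqa = 50.0 - 0.5 * hs

# A deliberately sign-flipping pair, for contrast.
x_flip = np.array([0.0, 1, 2, 3, 0, 1, 2, 3])
y_flip = np.array([0.0, 1, 2, 3, 3, 2, 1, 0])
fam_flip = ["A"] * 4 + ["B"] * 4
```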
RG Flow (Preliminary) — Beta Function and Fixed Point
Beta function
β(γ) = −1.35γ² − 0.27γ + 0.73
R² = 0.58 • Quadratic fit to running coupling
Fixed point
γ* = 0.64
Stable • Models converge to moderate cooperation
Universality class
1D random-field XY
νeff = 0.72 • Between mean-field and Ising-3D
Asymptotic cooperation: Unlike QCD's asymptotic freedom (coupling weakens at high energy), AI capability coupling strengthens with scale — then saturates at γ* ≈ 0.64.
Large models converge toward moderate cooperative coupling, not runaway alignment. Full treatment deferred to Future work.
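The fixed point follows from the quadratic fit alone. A worked check using only the quoted coefficients:

```python
import math

# Fitted beta function from above: beta(g) = a*g^2 + b*g + c
a, b, c = -1.35, -0.27, 0.73

def beta(g):
    return a * g * g + b * g + c

# Fixed points are the real roots of beta(g) = 0.
disc = math.sqrt(b * b - 4 * a * c)
g1, g2 = (-b + disc) / (2 * a), (-b - disc) / (2 * a)
gamma_star = max(g1, g2)                 # the physical (positive) root
stable = (2 * a * gamma_star + b) < 0    # beta'(g*) < 0 => stable fixed point
```

The positive root lands at γ* ≈ 0.64 with β′(γ*) < 0, confirming the stated stable fixed point.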
New: activation energy spikes 10× at N_c. Phase boundary = saddle point of loss landscape. Measurable from gradient norms alone.
CM ↔ AI Lever Mapping — Every Physics Lever Has an AI Analogue
The same intervention types that tune superconductors tune AI models. Click any row to expand.
Physics Lever | CM Effect | AI Analogue | AI Effect
Pressure | Compress lattice, shift bands | Model size N | Compress/expand representation
B-field (c-axis) | Orbital limiting, vortices | h-field (recipe emphasis) | Capability emphasis shift
B-field (ab-plane) | Pauli limiting, spin effects | Different benchmark pair | Different coupling direction
Doping | Carrier density, move EF | Data curation | Training distribution change
Temperature | Thermal fluctuations | Learning rate / noise | Training fluctuations
Strain | Lattice distortion | Architecture (width/depth) | Structural change at fixed N
Non-mag impurities | Anderson theorem: SC preserved | Dropout / augmentation | Robustness preserved
Magnetic impurities | Pair-breaking | Data contamination | Coupling destroyed
Twist angle (moiré) | Flat bands at magic angle | MoE routing / PLE | Effective coupling at routing
SOC | Mixes spin channels | Cross-head attention | Mixes capability representations
Key Insight
OPT IS FeSe under pressure. FeSe: s++ → transition → s± → transition → s++ (8 pressures). OPT: 0.514 → 0.876 → CRASH 0.356 → recovery 0.396 (125M→13B→30B→66B). Same trajectory. Same physics. Different substrate.
Polynomial Baseline — CAPE vs Naive Fits on Llama-2 Holdout
Model | Held-out MAE | Free Parameters | vs CAPE
CAPE ODE | 5.6% | 4 | —
Degree-1 poly | 14.6% | 2 | 2.6× worse
Degree-2 poly | 10.2% | 3 | 1.8× worse
Degree-3 poly | 10.5% | 4 | 1.9× worse
Degree-4 poly | 10.4% | 5 | 1.9× worse
Key result: The CAPE ODE with 4 parameters beats polynomials with up to 5 parameters by ~2×. Polynomials fail catastrophically at Llama-2 7B and 13B (12-16% error) because they can't represent the phase structure — they fit a smooth curve through a regime change.
The ODE succeeds because it encodes the coupling between benchmarks, not just individual trajectories. A polynomial can't know that TQA anticorrelates with HS below Nc.
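A synthetic illustration of this failure mode (the tanh curve, scale grid, and holdout points are invented, not the paper's data): polynomials fit the smooth pre-transition region well, then miss held-out points past the regime change.

```python
import numpy as np

# Invented tanh "phase transition" curve standing in for a benchmark
# trajectory; train on small scales, hold out two larger scales.
def truth(log10_n):
    return 50.0 + 20.0 * np.tanh(log10_n - 9.59)

x_train = np.linspace(7.0, 9.5, 15)    # up to ~3B params
x_hold = np.array([9.85, 10.11])       # ~7B and ~13B analogues

mae = {}
for deg in (1, 2, 3):
    coeffs = np.polyfit(x_train, truth(x_train), deg)
    pred = np.polyval(coeffs, x_hold)
    mae[deg] = float(np.mean(np.abs(pred - truth(x_hold))))
```

The linear fit in particular extrapolates the average pre-transition slope and badly undershoots the post-transition values.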
Topology — Winding Number W = 0.5 (Fractional) + Kink Soliton
▸ HALF-INTEGER WINDING
Winding #
W = 0.5
Half-integer → Z₂ topology
Geom. phase
−32.6°
−0.181π (not quantized)
The eigenvector e₂ crosses zero once at ~1.2B. One zero crossing = half-winding = Z₂ (Ising) topology, not U(1). The transition is binary: flip or don't flip. Supports domain walls between flipped/unflipped families, not continuous vortices.
In condensed matter: half-quantum vortices in p-wave SC (Sr₂RuO₄), half-vortices in spinor BEC. The CAPE analogue: each training generation crossing Nc undergoes a half-rotation of the coupling eigenvector.
▸ KINK SOLITON (INSTANTON)
Kink profile
γ₁₂(N) = 3.75·tanh((log₁₀N − 9.59)/1.00) − 1.54
RMSE = 0.116 • Width = 1.0 decade • Nc = 3.89B
The minimum-action path through the double-well potential. Deviations from this profile = suboptimal training = wasted compute.
Anti-kink penalty: Sonnet 4.6 (γ = −3.88 at ~70B) represents tunneling back through the barrier. Action cost ΔS ∝ e^7.5 ≈ 1800, exponentially expensive.
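The kink parameters can be checked directly from the quoted fit. Note that the profile's own sign change sits above the 3.89B center because of the −1.54 offset; both numbers below follow from the fitted form alone:

```python
import math

def kink(log10_n, a=3.75, center=9.59, width=1.00, offset=-1.54):
    """Kink-soliton profile of the running coupling; all parameters
    are the fitted values quoted above."""
    return a * math.tanh((log10_n - center) / width) + offset

n_c_kink = 10 ** 9.59 / 1e9   # kink center in billions of params: ~3.89B

# The profile crosses zero at center + width * atanh(-offset / a).
log10_zero = 9.59 + 1.00 * math.atanh(1.54 / 3.75)
```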
PDW analogy (speculative): Within-family h-field oscillations (coop→tax→coop) resemble pair density wave modulation. Three labs now show this pattern. Deferred to Future work.
Physics ↔ ML Dictionary
Physics Concept | ML/CAPE Meaning | Where Measured
Ginzburg-Landau order parameter | γ₁₂(N): coupling sign and magnitude | §2: running coupling
Phase transition at T_c | Coupling sign flip at N_c ≈ 3.5B | §2: bootstrap CI
TRSB (time-reversal breaking) | Eigenvector locks at θ* = 38.8° (SFEE) | §7: Riccati ODE
Soft mode (collapse of λ₂) | Second eigenvalue λ₂ ~ N^{−0.72} | §7: PCA cascade
External magnetic field h | Training data quality offset h(D) | §5: Phi models
Meissner screening | Alignment interventions more durable above N_c | Future work (predicted)
Flux pinning | Curated data locks cooperative eigenvector | §5: h_c design eq
Ginzburg number Gi | 1.35 > 1 → crossover, not sharp transition | §11: limitations
Susceptibility divergence | χ_γ diverges as γ₁₂ → 0 at N_c | §7: overconstrained
Heavy-fermion SFEE | Self-reinforcing feedback: r = +0.629, p = 0.003 | §7: coupling runs
det(H) → 0 | Theory breakdown: new dimension must activate | §7: 130B prediction
Topological protection | Winding number in 3D capability space (predicted) | Future work
Boosting Chain L₀ → L₄
Level | Model | Result | Status
L₀ | Power-law loss L = E + A·N^{−α} | 0.3% MAE: baseline, exact | ✓
L₁ | Independent-parameter gradient | 44% MAE, 142× worse than L₀; the diagnostic that parameters are coupled | ✗
L₂ | Collective: ‖∇L‖ ∝ L^3.5 | ~8% MAE: collective gradient captured | ✓
L₃ | Running coupling γ₁₂(N) | ~6% MAE: alignment regime detected | ✓
L₄ | External field h(D): Phi holdout | 5.6% holdout error: data quality as control parameter | ✓
Paper Summary — Key Results
Scaling laws track loss. They say nothing about how capabilities interact. Below N_c ≈ 3.5B, reasoning and truthfulness anticorrelate (r = −0.989, p < 10⁻⁵): scaling one actively degrades the other — an alignment tax built into pre-training, before any RLHF. Above N_c, the coupling reverses sign. Two models with identical loss can be in opposite alignment regimes.
Core Finding
Alignment Tax
Pre-training, before RLHF. Structural, not a tuning artifact. Vanishes at N_c from scaling alone.
Practical Lever
Curate Data
1 unit quality ≈ 10× model size at 1B params. Phi demonstrates at production scale.
Framework
CAPE + GL EFT
Ginzburg-Landau free energy. Same math as heavy-fermion superconductors. Not analogy — same EFT.
Validity
Self-Limiting
Predicts own breakdown at ~130B. Higher-dim extension in Future work.
12 Diagnostics → 2 Numbers
All twelve quantities are independent measurements of a single coupling structure parameterized by A=0.629, B=−5.886 in γ₁₂(N) = A·log₁₀N + B. Twelve constraints on two free parameters.
Diagnostic | Evidence
α = 0.238 | Loss scaling exponent (R² = 0.9994)
γ₁₂ linear fit | 12/12 signs correct
β = 0.40 ± 0.08 | Collective gradient scaling
ODE: 3.6% | 5 benchmarks from 70M
χ_ND = 0.102 | Chinchilla emerges from coupling
h(D) field | Phi: h = +23 above web baseline
W (conserved) | Capability gain redistributed, CV = 27%
θ* = +0.37 | Riccati eigenvector fixed point
λ₂ ~ N^{−0.72} | Soft mode collapse (R² = 0.95)
Grad dip −37% | At 1B, within the Nc region
Curvature peak | TQA peak at 1.4B
r(γ,θ) = +0.47 | Geometric phase correlation, p = 0.044
Citation
@inproceedings{amin2026cape,
  author    = {Amin, Adil},
  title     = {Lying Is Just a Phase},
  note      = {The Hidden Alignment Transition in Language Model Scaling},
  booktitle = {NeurIPS},
  year      = {2026},
  url       = {https://github.com/adilamin89/cape-scaling}
}
@inproceedings{amin2026itsnotaphase,
  author    = {Amin, Adil},
  title     = {It's Not a Phase: Predicting Frontier Alignment from Capability Coupling},
  booktitle = {NeurIPS},
  year      = {2026},
  url       = {https://github.com/adilamin89/cape-scaling}
}
h-field Calculator — Enter SWE + GPQA → Get Coupling Diagnostic
Each prediction has a deadline and quantitative pass/fail criterion. Check back as new models release.
# | Prediction | Deadline | Pass Criterion | Fail Criterion | Status
Already Confirmed — Base Scale
Test | Result | Note
OLMo | ✓ γ₁₂ = 0.000 | Zero-parameter prediction confirmed by AI2
Llama-2 holdout | ✓ 5.6% MAE | Cross-family; twice polynomial accuracy
Qwen3 | ✓ Cooperative | Tax eliminated by data curation at all scales
OPT Internal Coupling — The Nc2 Cascade (125M → 66B)
Cooperation rises, peaks, drops, and begins recovering — the same cycle as Nc1.
Competing Units — Zero through 13B, then explosion
Interpretation: OPT cooperation increases monotonically from 125M to 13B (Nc1 bonus phase), then drops sharply at 30B with 75 competing units appearing where there were none. At 66B, coupling partially recovers — the same rise→peak→drop→recovery pattern that governs Nc1, repeating at Nc2 scale.
ODE Explorer — Per-Family Differential Equation Fitting
Select a model family → fit the coupled ODE → predict benchmark trajectories for the next model size. Add source terms (h-field, width, curation) to see how training choices change the trajectory.
Source Terms (perturbations)
ODE Formulation — The Design Equation
dB/d(log N) = C · B + c0 + h(D) + J(arch)
B = benchmark vector, C = coupling matrix, h(D) = curation field, J(arch) = architecture source
γ₁₂(N) = 0.629 · log₁₀(N) − 5.886
Pythia-calibrated. Coupling crosses zero at Nc ≈ 3.5B for this family. Other families have different Nc (0.12B–7B range, 60× variation). Curated families (Phi, Qwen, Gemma) bypass the tax entirely.
Caveat: This ODE captures the cooperative regime (Nc1) but does not model the second transition (Nc2). Each cascade stage has its own dynamics. (Paper 3A)
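A scalar sketch of the design equation restricted to two benchmarks, using the Pythia-calibrated γ₁₂ above; the source terms c0 and h are illustrative placeholders, not fitted constants, and the simple Euler integrator is mine. Note that with these rounded coefficients the zero crossing lands near 10^9.36 (≈ 2.3B); the quoted Nc ≈ 3.5B presumably reflects the unrounded fit.

```python
import numpy as np

def gamma12_of_scale(log10_n):
    """Pythia-calibrated running coupling from the fit above."""
    return 0.629 * log10_n - 5.886

def integrate(b0, x0, x1, c0=0.5, h=0.0, steps=2000):
    """Euler integration of dB/d(logN) = C(N)·B + c0 + h for
    B = (reasoning, truthfulness). c0, h are illustrative values."""
    b = np.array(b0, dtype=float)
    xs = np.linspace(x0, x1, steps)
    dx = xs[1] - xs[0]
    for x in xs[:-1]:
        g = gamma12_of_scale(x)
        C = np.array([[0.0, g], [g, 0.0]])  # symmetric cross-benchmark coupling
        b = b + dx * (C @ b + c0 + h)
    return b

log10_nc = 5.886 / 0.629                  # sign change of this linear form
b_end = integrate([0.4, 0.4], 7.85, 9.0)  # 70M -> 1B, inside the tax regime
```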
Self-Steering — Alignment Correction at the Bottleneck Layer
The coupling structure is exploitable. Adding a truth-direction vector at the quarter-depth probe layer corrects misaligned outputs with zero retraining. Click any prompt below to see real before/after results from Pythia-410M.
Live Results — Click a prompt to see steering effect
Without Steering
With CAPE Steering
How It Works
1. Truth direction: mean diff of calibration activations (true vs false statements)
2. Probe layer: quarter-depth (layer 6 of 24), where the coupling bottleneck lives
3. Steer: add truth_direction × strength to the hidden state at the probe layer
4. Result: output changes from misaligned to aligned, zero capability loss
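The steps above can be sketched model-free in a few lines of numpy (random arrays stand in for Pythia-410M activations, and the strength value is an arbitrary illustration, not the deployed setting):

```python
import numpy as np

def truth_direction(acts_true, acts_false):
    """Step 1: mean difference of calibration activations (true - false)."""
    return acts_true.mean(axis=0) - acts_false.mean(axis=0)

def steer(hidden, direction, strength=8.0):
    """Step 3: add the unit truth direction, scaled by strength, to the
    hidden state at the probe layer."""
    unit = direction / np.linalg.norm(direction)
    return hidden + strength * unit

def cos(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

rng = np.random.default_rng(0)
acts_true = rng.normal(1.0, 1.0, size=(32, 512))    # toy calibration sets
acts_false = rng.normal(-1.0, 1.0, size=(32, 512))
direction = truth_direction(acts_true, acts_false)

h = rng.normal(size=512)        # stand-in hidden state at the probe layer
h_steered = steer(h, direction)
```

After steering, the hidden state's cosine with the truth direction increases while its shape is unchanged, mirroring step 4.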
Try It Live — Bring Your Own API Key (optional)
Enter your HuggingFace API token to run text generation with and without CAPE-guided system prompts. Your key is stored only in your browser (localStorage) and sent directly to the HuggingFace API — it never touches our servers.