All of these improvements were developed by Claude running autonomously over ~2 days using autoresearch. I didn't touch anything - incredible. All tuning was done on d12 but generalized easily to larger models (e.g. d24 in particular). This means we will also get a new "Time to GPT-2" Leaderboard entry, which I will push separately.
Optimizer & schedule changes: - Increase unembedding LR 0.004 -> 0.008, weight decay 0.2 -> 0.28 - Per-group Adam betas and weight decay (instead of shared global betas) - Muon beta2 0.95 -> 0.9, momentum warmup target 0.95 -> 0.97 over 400 steps - Warmup: ratio-based -> absolute steps (default 40) - Warmdown ratio 0.5 -> 0.65, final LR fraction 0.0 -> 0.05 - Weight decay schedule: linear -> cosine decay - Polar express norm factor 1.02 -> 1.01 Architecture & init changes: - VE gate: channels 32 -> 12, scale range 2x -> 3x, init small positive - Add post-QK-norm scaling (q,k *= 1.15) for sharper attention - Embedding init std 1.0 -> 0.8, MLP c_fc init 0.5x smaller - RoPE base theta 10K -> 100K - Short attention window: seq_len/2 -> ~seq_len/3 (ceil to 128 tile) - Logit softcap 20 -> 15
This commit is contained in:
+1
-1
@@ -113,7 +113,7 @@ def muon_step_fused(
|
||||
|
||||
# Polar express
|
||||
X = g.bfloat16()
|
||||
X = X / (X.norm(dim=(-2, -1), keepdim=True) * 1.02 + 1e-6)
|
||||
X = X / (X.norm(dim=(-2, -1), keepdim=True) * 1.01 + 1e-6)
|
||||
if g.size(-2) > g.size(-1): # Tall matrix
|
||||
for a, b, c in polar_express_coeffs[:ns_steps]:
|
||||
A = X.mT @ X
|
||||
|
||||
Reference in New Issue
Block a user