Autoresearch round 2: smear, backout, and hyperparameter tuning

New architectural features:
- Smear: mix previous token embedding into current position via learned
  gate, providing cheap bigram-like info (works in training + KV cache)
- Backout: subtract learned fraction of mid-layer residual before logit
  projection to remove low-level features
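
A minimal NumPy sketch of the two features above, for illustration only. The commit message does not give the exact formulas, so everything here is an assumption: the additive form of the mix, the gate being a sigmoid of a linear projection of the current token, and the names `smear`, `backout`, `gate_w`, and `backout_lambda`.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def smear(x, gate_w):
    """Add a gated copy of the previous token's embedding to each position.

    x:      (T, D) token embeddings
    gate_w: (D,) hypothetical gate weights (learned in a real model)
    """
    x_prev = np.zeros_like(x)
    x_prev[1:] = x[:-1]                   # shift right by one; position 0 has no predecessor
    gate = sigmoid(x @ gate_w)[:, None]   # (T, 1), computed from the current token
    return x + gate * x_prev

def backout(x_final, x_mid, backout_lambda):
    """Subtract a learned fraction of a mid-layer residual before the logit projection."""
    return x_final - backout_lambda * x_mid

x = np.arange(6, dtype=float).reshape(3, 2)
smeared = smear(x, gate_w=np.zeros(2))    # zero weights give a constant gate of 0.5
```

Under these assumptions, each position needs only its immediate predecessor's embedding, so step-by-step decoding would only have to keep the last token's embedding alongside the KV cache.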

Hyperparameter tuning:
- Muon momentum warmdown 0.97→0.90 during LR warmdown phase
- Non-uniform per-layer init: resid_lambdas 1.15→1.05, x0_lambdas 0.20→0.05
- c_fc init scale 0.4x, QK norm scale 1.2, sliding window seq_len/4
- Speedrun data:params ratio reduced to 8
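
One possible reading of the non-uniform per-layer init bullet is a ramp across depth, from the first value at layer 0 to the second at the last layer. The commit message does not say how intermediate layers are set, so the linear interpolation, the helper name `per_layer_ramp`, and the depth of 12 are assumptions in this sketch.

```python
def per_layer_ramp(start, end, n_layer):
    """Linearly interpolate an init value from `start` (layer 0) to `end` (last layer)."""
    if n_layer == 1:
        return [start]
    return [start + (end - start) * i / (n_layer - 1) for i in range(n_layer)]

n_layer = 12  # hypothetical depth, for illustration only
resid_lambdas_init = per_layer_ramp(1.15, 1.05, n_layer)
x0_lambdas_init = per_layer_ramp(0.20, 0.05, n_layer)
```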

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Andrej Karpathy
2026-03-14 17:03:06 +00:00
parent f068604948
commit a825e63f81
4 changed files with 73 additions and 18 deletions
+11 -4
@@ -367,11 +367,18 @@ def get_lr_multiplier(it):
         progress = (num_iterations - it) / warmdown_iters
         return progress * 1.0 + (1 - progress) * args.final_lr_frac

-# Momentum scheduler for Muon optimizer (warms up to 0.97 over the first 400 steps)
+# Momentum scheduler for Muon optimizer (warms up to 0.97, warms down to 0.90 during LR warmdown)
 def get_muon_momentum(it):
-    frac = min(it / 400, 1)
-    momentum = (1 - frac) * 0.85 + frac * 0.97
-    return momentum
+    warmdown_iters = round(args.warmdown_ratio * num_iterations)
+    warmdown_start = num_iterations - warmdown_iters
+    if it < 400:
+        frac = it / 400
+        return (1 - frac) * 0.85 + frac * 0.97
+    elif it >= warmdown_start:
+        progress = (it - warmdown_start) / warmdown_iters
+        return 0.97 * (1 - progress) + 0.90 * progress
+    else:
+        return 0.97

 # Weight decay scheduler for Muon optimizer (cosine decay to zero over the course of training)
 def get_weight_decay(it):
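
For spot-checking the new momentum schedule outside the training script, here is a standalone sketch of the same piecewise function. It takes `num_iterations` and `warmdown_ratio` as explicit arguments instead of reading `args` and globals. The name `muon_momentum`, the `warmup_steps` parameter, and the guard against a zero-length warmdown are additions for this example, not part of the commit.

```python
def muon_momentum(it, num_iterations, warmdown_ratio, warmup_steps=400):
    """Warm up 0.85 -> 0.97, hold at 0.97, then ramp linearly to 0.90 during warmdown."""
    warmdown_iters = round(warmdown_ratio * num_iterations)
    warmdown_start = num_iterations - warmdown_iters
    if it < warmup_steps:
        frac = it / warmup_steps
        return (1 - frac) * 0.85 + frac * 0.97
    # The `warmdown_iters > 0` check is an added guard against division by zero.
    if warmdown_iters > 0 and it >= warmdown_start:
        progress = (it - warmdown_start) / warmdown_iters
        return 0.97 * (1 - progress) + 0.90 * progress
    return 0.97

# Hypothetical run length and warmdown ratio, chosen only to exercise each branch.
for step in (0, 200, 400, 500, 750, 1000):
    print(step, round(muon_momentum(step, num_iterations=1000, warmdown_ratio=0.5), 4))
```

As in the diff, the warmup branch is checked first, so if warmdown begins before step 400 the warmup values take precedence until step 400.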