Autoresearch round 2: smear, backout, and hyperparameter tuning
New architectural features:
- Smear: mix previous token embedding into current position via learned gate,
  providing cheap bigram-like info (works in training + KV cache)
- Backout: subtract learned fraction of mid-layer residual before logit
  projection to remove low-level features

Hyperparameter tuning:
- Muon momentum warmdown 0.97→0.90 during LR warmdown phase
- Non-uniform per-layer init: resid_lambdas 1.15→1.05, x0_lambdas 0.20→0.05
- c_fc init scale 0.4x, QK norm scale 1.2, sliding window seq_len/4
- Speedrun data:params ratio reduced to 8

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
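The hunk shown on this page covers only the momentum schedule, so the smear and backout code itself is not visible here. As a rough illustration of the two ideas as the commit message describes them, here is a minimal pure-Python sketch. The function names, the scalar per-position gate, and passing the gate logits in directly are all assumptions made for illustration; the actual implementation may differ in shape and detail.

```python
import math

def smear(embeddings, gate_logits):
    """Mix each position's predecessor embedding into it (illustrative only).

    embeddings:  list of T vectors (lists of floats), one per token position.
    gate_logits: list of T floats. In a real model these would come from a
                 learned projection; here they are passed in (assumption).
    """
    out = [list(embeddings[0])]  # position 0 has no predecessor to mix in
    for t in range(1, len(embeddings)):
        gate = 1.0 / (1.0 + math.exp(-gate_logits[t]))  # sigmoid, in (0, 1)
        prev, cur = embeddings[t - 1], embeddings[t]
        out.append([c + gate * p for c, p in zip(cur, prev)])
    return out

def backout(final_resid, mid_resid, backout_lambda):
    """Subtract a fraction of a mid-layer residual from the final residual.

    backout_lambda stands in for the learned fraction (assumption: a scalar).
    """
    return [f - backout_lambda * m for f, m in zip(final_resid, mid_resid)]

smeared = smear([[1.0, 2.0], [3.0, 4.0]], gate_logits=[0.0, 0.0])
backed_out = backout([1.0, 2.0], [0.5, 1.0], backout_lambda=0.2)
```

Because each position only needs the embedding of the token immediately before it, a scheme like this needs one extra cached vector during incremental decoding, which is consistent with the "works in training + KV cache" note.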
+11
-4
@@ -367,11 +367,18 @@ def get_lr_multiplier(it):
         progress = (num_iterations - it) / warmdown_iters
         return progress * 1.0 + (1 - progress) * args.final_lr_frac
 
-# Momentum scheduler for Muon optimizer (warms up to 0.97 over the first 400 steps)
+# Momentum scheduler for Muon optimizer (warms up to 0.97, warms down to 0.90 during LR warmdown)
 def get_muon_momentum(it):
-    frac = min(it / 400, 1)
-    momentum = (1 - frac) * 0.85 + frac * 0.97
-    return momentum
+    warmdown_iters = round(args.warmdown_ratio * num_iterations)
+    warmdown_start = num_iterations - warmdown_iters
+    if it < 400:
+        frac = it / 400
+        return (1 - frac) * 0.85 + frac * 0.97
+    elif it >= warmdown_start:
+        progress = (it - warmdown_start) / warmdown_iters
+        return 0.97 * (1 - progress) + 0.90 * progress
+    else:
+        return 0.97
 
 # Weight decay scheduler for Muon optimizer (cosine decay to zero over the course of training)
 def get_weight_decay(it):
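To see the shape of the new schedule, the function can be lifted out of the training script and run on its own. The sketch below takes `num_iterations` and `warmdown_ratio` as explicit parameters instead of reading them from `args`. The name `muon_momentum`, the `warmup_steps` parameter, the zero-length warmdown guard, and the values 1000 and 0.5 are illustrative assumptions, not settings from the commit.

```python
def muon_momentum(it, num_iterations, warmdown_ratio, warmup_steps=400):
    """Standalone copy of the schedule: 0.85 -> 0.97 warmup, 0.97 -> 0.90 warmdown."""
    warmdown_iters = round(warmdown_ratio * num_iterations)
    warmdown_start = num_iterations - warmdown_iters
    if it < warmup_steps:
        # Linear warmup from 0.85 to 0.97 over the first warmup_steps steps
        frac = it / warmup_steps
        return (1 - frac) * 0.85 + frac * 0.97
    elif it >= warmdown_start and warmdown_iters > 0:
        # Linear warmdown from 0.97 to 0.90 across the LR warmdown phase.
        # The warmdown_iters > 0 guard is an addition in this sketch (not in
        # the commit) to avoid dividing by zero when warmdown_ratio is 0.
        progress = (it - warmdown_start) / warmdown_iters
        return 0.97 * (1 - progress) + 0.90 * progress
    else:
        # Plateau between warmup and warmdown
        return 0.97

# Illustrative run: 1000 steps with the last half spent in warmdown
schedule = {it: muon_momentum(it, 1000, 0.5) for it in (0, 200, 400, 500, 750, 1000)}
```

The warmup branch is checked first, so in a run short enough that warmdown would begin before step 400, momentum follows the warmup ramp until step 400 and then jumps to wherever the warmdown line already is.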