Commit Graph

5 Commits

Author SHA1 Message Date
Andrej Karpathy c0dbf1f3ff use COMPUTE_DTYPE-aware cast in Muon polar express step
The bf16 cast is intentional for speed on Hopper+ GPUs, but should be
skipped on other platforms rather than blindly applied. fp16 is unstable
here due to its limited exponent range, and fp32 platforms don't benefit
from the cast. Now: bf16 when COMPUTE_DTYPE is bf16, no cast otherwise.

Inspired by PR #667.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-25 20:19:14 +00:00
Andrej Karpathy 6ed7d1d82c All of these improvements were developed by Claude running autonomously over ~2 days using autoresearch. I didn't touch anything - incredible. All tuning was done on d12 but generalized easily to larger models (e.g. d24 in particular). This means we will also get a new "Time to GPT-2" Leaderboard entry, which I will push separately.
Optimizer & schedule changes:
- Increase unembedding LR 0.004 -> 0.008, weight decay 0.2 -> 0.28
- Per-group Adam betas and weight decay (instead of shared global betas)
- Muon beta2 0.95 -> 0.9, momentum warmup target 0.95 -> 0.97 over 400 steps
- Warmup: ratio-based -> absolute steps (default 40)
- Warmdown ratio 0.5 -> 0.65, final LR fraction 0.0 -> 0.05
- Weight decay schedule: linear -> cosine decay
- Polar express norm factor 1.02 -> 1.01

Architecture & init changes:
- VE gate: channels 32 -> 12, scale range 2x -> 3x, init small positive
- Add post-QK-norm scaling (q,k *= 1.15) for sharper attention
- Embedding init std 1.0 -> 0.8, MLP c_fc init 0.5x smaller
- RoPE base theta 10K -> 100K
- Short attention window: seq_len/2 -> ~seq_len/3 (ceil to 128 tile)
- Logit softcap 20 -> 15
2026-03-09 20:45:17 +00:00
Andrej Karpathy 98eed6df18 bring back an assert guarding against bad param sizing 2026-02-05 18:14:30 +00:00
Andrej Karpathy 718e5e9d67 correctly reference NorMuon and fix misleading terms that i may have hastily ported over from modded-nanogpt 2026-02-05 01:39:26 +00:00
Andrej Karpathy 41bb2eac32 Combine AdamW and Muon into single MuonAdamW optimizer, cleaner, ty @chrisjmccormick for idea/help 2026-01-29 00:52:08 +00:00