nanochat-omni/nanochat/muon.py at b33e394528103f26c3190b55c11ca4d942f6ad7f

fam/nanochat-omni

Andrej Karpathy 2c4473dd1b Big Muon optimizer changes inspired by the latest modded-nanogpt. Added Polar Express, Adafactor-style variance reduction, cautious weight decay, and a schedule that ramps weight decay linearly down to zero. Tuned the optimal weight decay for model sizes d8, d12, d16, and d20 and found a scaling law, optimal wd \propto 1/channels^2, which is now the default in the code. --weight_decay of base_train is now on by default and set according to these experiments. These changes gave a solid improvement in val_bpb.

2026-01-11 16:56:59 +00:00
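
The commit message names two weight decay rules: an optimum that scales as 1/channels^2 and a linear ramp down to zero over training. A minimal sketch of how the two might combine is given here. The function names, the reference width `ref_channels`, and the reference value `ref_wd` are hypothetical placeholders chosen for illustration, not values taken from the repository.

```python
def scaled_weight_decay(channels: int, ref_channels: int = 768, ref_wd: float = 0.1) -> float:
    """Scale weight decay as 1/channels^2, anchored at a reference width.

    ref_channels and ref_wd are illustrative assumptions, not tuned values.
    """
    return ref_wd * (ref_channels / channels) ** 2


def weight_decay_at(step: int, total_steps: int, base_wd: float) -> float:
    """Ramp weight decay linearly from base_wd at step 0 to zero at total_steps."""
    frac = min(max(step / total_steps, 0.0), 1.0)  # clamp progress to [0, 1]
    return base_wd * (1.0 - frac)


if __name__ == "__main__":
    # Doubling the width cuts the base weight decay by a factor of four.
    base_wd = scaled_weight_decay(channels=1536)
    for step in (0, 500, 1000):
        print(step, weight_decay_at(step, total_steps=1000, base_wd=base_wd))
```

Under this sketch, a wider model starts from a smaller weight decay and every model reaches zero weight decay at the final step.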

14 KiB
