nanochat-omni/scripts/base_train.py at 2c4473dd1b608a403700b098f867b202c2a03522 - nanochat-omni

fam/nanochat-omni

Andrej Karpathy 2c4473dd1b Big Muon optimizer changes inspired by the latest modded-nanogpt. Added Polar Express, Adafactor-style variance reduction, cautious weight decay, and a weight decay schedule that ramps linearly down to zero. Tuned the optimal weight decay for multiple model sizes (d8, d12, d16, d20) and found a scaling law, optimal wd \propto 1/channels^2, which is now built into the code as the default. --weight_decay in base_train is now on by default and configured according to all of these experiments. A solid improvement in val_bpb was observed as a result of these changes.
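
The two weight decay rules named in the commit message (a base value proportional to 1/channels^2, and a linear ramp down to zero over training) can be sketched as below. This is a minimal illustration, not the code in base_train.py: the function names, the reference point `ref_wd` at `ref_channels`, and the example model width are all hypothetical values chosen for the sketch.

```python
def scaled_weight_decay(channels: int, ref_wd: float = 0.1, ref_channels: int = 768) -> float:
    """Weight decay following wd proportional to 1/channels^2.

    ref_wd and ref_channels are hypothetical anchor values (assumptions),
    not the tuned constants from the commit.
    """
    return ref_wd * (ref_channels / channels) ** 2


def weight_decay_at_step(step: int, total_steps: int, base_wd: float) -> float:
    """Linearly ramp weight decay from base_wd at step 0 down to zero at total_steps."""
    frac = min(max(step / total_steps, 0.0), 1.0)  # clamp training progress to [0, 1]
    return base_wd * (1.0 - frac)


if __name__ == "__main__":
    base_wd = scaled_weight_decay(channels=1536)  # hypothetical model width
    for step in (0, 250, 500, 750, 1000):
        print(step, weight_decay_at_step(step, total_steps=1000, base_wd=base_wd))
```

Under this rule, doubling the channel count divides the base weight decay by four, and the schedule leaves no decay at the final step.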

2026-01-11 16:56:59 +00:00

22 KiB
