correctly reference NorMuon and fix misleading terms that I may have hastily ported over from modded-nanogpt

Andrej Karpathy
2026-02-05 01:39:26 +00:00
parent 542beb0c8c
commit 718e5e9d67
2 changed files with 7 additions and 3 deletions
@@ -67,6 +67,10 @@ Polar Express Sign Method for orthogonalization.
https://arxiv.org/pdf/2505.16932
by Noah Amsel, David Persson, Christopher Musco, Robert M. Gower.
NorMuon variance reduction: per-neuron/column adaptive learning rate that normalizes
update scales after orthogonalization (Muon's output has non-uniform scales across neurons).
https://arxiv.org/pdf/2510.05491
Some of the changes in nanochat implementation:
- Uses a simpler, more general approach to parameter grouping and stacking
- Uses a single fused kernel for the momentum -> polar_express -> variance_reduction -> update step
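The variance reduction step described in the docstring above can be sketched as follows. This is a minimal pure-Python illustration, not the fused kernel from the commit: the function name `normuon_variance_reduction`, the `beta2` and `eps` defaults, the choice of rows as neurons, and the final rescaling that preserves the overall update norm are all assumptions made for illustration.

```python
import math

def normuon_variance_reduction(update, second_moment, beta2=0.95, eps=1e-10):
    """Normalize per-neuron scales of an (already orthogonalized) update.

    update:        list of rows, one row per neuron (assumed layout)
    second_moment: list of floats, one per row; updated in place as an EMA
    beta2, eps:    hypothetical defaults chosen for this sketch
    """
    # Frobenius norm of the incoming update, used to restore overall scale later
    norm_before = math.sqrt(sum(x * x for row in update for x in row))

    scaled = []
    for i, row in enumerate(update):
        # Mean squared entry of this neuron's update
        mean_sq = sum(x * x for x in row) / len(row)
        # Exponential moving average of the per-neuron second moment
        second_moment[i] = beta2 * second_moment[i] + (1.0 - beta2) * mean_sq
        # Divide the row by its running RMS so all neurons get comparable scale
        inv = 1.0 / (math.sqrt(second_moment[i]) + eps)
        scaled.append([x * inv for x in row])

    # Rescale so the whole update keeps its original Frobenius norm (assumption)
    norm_after = math.sqrt(sum(x * x for row in scaled for x in row))
    rescale = norm_before / max(norm_after, eps)
    return [[x * rescale for x in row] for row in scaled]

# Two neurons whose update rows differ in scale by a factor of 10
state = [0.0, 0.0]
balanced = normuon_variance_reduction([[3.0, 4.0], [0.3, 0.4]], state)
row_norms = [math.sqrt(sum(x * x for x in row)) for row in balanced]
```

On the first step with a zero-initialized state, both rows end up with the same norm, which is the "non-uniform scales across neurons" correction the docstring refers to. A real implementation would operate on tensors and keep `second_moment` as optimizer state across steps.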