nanochat-omni/nanochat at 2c4473dd1b608a403700b098f867b202c2a03522 - nanochat-omni - Gitea: Git with a cup of tea

fam/nanochat-omni

Andrej Karpathy 2c4473dd1b Big Muon optimizer changes inspired by latest of modded-nanogpt. Added Polar Express, Adafactor-style variance reduction, cautious weight decay, schedule weight decay linearly to ramp down to zero. Tuned optimum weight decay for multiple model sizes d8, d12, d16, d20 and found a scaling law with optimum wd \propto 1/channels^2, including it as default into code. --weight_decay of base_train is now default on and configured optimally according to all of these experiments. Solid bump to val_bpb observed as a result of these changes.

2026-01-11 16:56:59 +00:00

__init__.py

initial commit

2025-10-13 06:49:24 -07:00

adamw.py

fix adamw slight bug. this chunk was copy pasted originally from modded-nanogpt, which still seems to have the bug

2026-01-08 18:18:42 +00:00

checkpoint_manager.py

rename checkpoint_dir to checkpoints_dir for consistency.

2025-12-08 18:32:12 -08:00

common.py

fix: safe DDP cleanup (check initialized PG, not just env) (#256)

2025-12-27 20:27:40 -08:00

core_eval.py

initial commit

2025-10-13 06:49:24 -07:00

dataloader.py

feat: pad vocab size to 64 for DDP optimizers and efficiency

2025-12-09 12:38:18 +01:00

dataset.py

initial commit

2025-10-13 06:49:24 -07:00

engine.py

delete the configurator in favor of argparse and clean up a lot of kwarg details to make them more consistent across all scripts

2026-01-04 19:14:23 +00:00

execution.py

nit delete redundant catch/raise in execute

2025-10-29 08:10:03 -07:00

gpt.py

Big Muon optimizer changes inspired by latest of modded-nanogpt. Added Polar Express, Adafactor-style variance reduction, cautious weight decay, schedule weight decay linearly to ramp down to zero. Tuned optimum weight decay for multiple model sizes d8, d12, d16, d20 and found a scaling law with optimum wd \propto 1/channels^2, including it as default into code. --weight_decay of base_train is now default on and configured optimally according to all of these experiments. Solid bump to val_bpb observed as a result of these changes.

2026-01-11 16:56:59 +00:00

logo.svg

initial commit

2025-10-13 06:49:24 -07:00

loss_eval.py

fix typos

2025-11-14 11:20:25 +01:00

muon.py

Big Muon optimizer changes inspired by latest of modded-nanogpt. Added Polar Express, Adafactor-style variance reduction, cautious weight decay, schedule weight decay linearly to ramp down to zero. Tuned optimum weight decay for multiple model sizes d8, d12, d16, d20 and found a scaling law with optimum wd \propto 1/channels^2, including it as default into code. --weight_decay of base_train is now default on and configured optimally according to all of these experiments. Solid bump to val_bpb observed as a result of these changes.

2026-01-11 16:56:59 +00:00

report.py

fix small bug where this would break if git stage has deleted files

2026-01-04 19:11:43 +00:00

tokenizer.py

alright add transformers as a dep of the repo because it should be easy to evaluate the CORE score of HF models. Not super happy about it but i tried it and the uv.lock doesn't get bloated as much as i expected

2026-01-04 20:37:28 +00:00

ui.html

Fix conversation scroll to bottom on some browsers + remove duplicated padding (#348)

2025-12-31 13:03:22 -08:00