Commit Graph

62 Commits

Author SHA1 Message Date
Andrej Karpathy b33e394528 oops actually make SSSL the default window pattern 2026-01-11 21:50:35 +00:00
Andrej Karpathy fbc1484e8c add alternating window size patterns for the GPT layers, following GPT-3. Experimented a bit and found the pattern SSSL to work well - 3 short, 1 long alternating. This is now the new default and the plots look quite a bit better on flops vs. bpb 2026-01-11 21:49:54 +00:00
Andrej Karpathy aa530cdad5 Add learnable lambdas that gate the residual connection and a skip connection to the input embeddings, solid bump to val_bpb 2026-01-11 18:47:35 +00:00
Andrej Karpathy 2c4473dd1b Big Muon optimizer changes inspired by latest of modded-nanogpt. Added Polar Express, Adafactor-style variance reduction, cautious weight decay, schedule weight decay linearly to ramp down to zero. Tuned optimum weight decay for multiple model sizes d8, d12, d16, d20 and found a scaling law with optimum wd \propto 1/channels^2, including it as default into code. --weight_decay of base_train is now default on and configured optimally according to all of these experiments. Solid bump to val_bpb observed as a result of these changes. 2026-01-11 16:56:59 +00:00
Andrej Karpathy 061f83c152 delete grad_clip. appears to not be necessary at all. not only was it buggy because the clipping happened per gpu before grad synchronization, but it costs ~2% MFU, and it also doesn't even help. I tried deleting it a while ago and back then it did help. So I'm guessing that some hyperparameter tuning obviated the reason for it since then 2026-01-08 02:16:50 +00:00
Andrej Karpathy ccf4b7f9bf nudge hyperparameters of the base script with the results of the sweeps and miniseries. vocab size down to 32K. D:N ratio from 20 to 8. add miniseries script 2026-01-07 22:11:59 +00:00
Adria Blancafort 1b5de29e71 Fix undefined variable in chat_rl after recent refactor (* Fix undefined variable * Remove unused import 're' from chat_rl.py) 2026-01-07 09:08:57 -08:00
Andrej Karpathy ae0bf52529 tune hyperparameters based on overnight sweeps. warmdown_ratio is the biggest free win, increasing 0.2 -> 0.4, and embedding lr can be larger bumping 0.2 -> 0.3 2026-01-05 18:57:46 +00:00
Andrej Karpathy 9d4c9b786d many small fixes to base_train: reporting ETA, allowing some additional kwarg flexibility, making sure we don't crash when e.g. depth = 11 - we now calculate the closest num_heads that works 2026-01-05 00:38:09 +00:00
Andrej Karpathy eb7bbc1b66 delete the configurator in favor of argparse and clean up a lot of kwarg details to make them more consistent across all scripts 2026-01-04 19:14:23 +00:00
Andrej Karpathy 48abd7d85f simplify, clarify and slightly tune model initialization. should be very slightly better possibly, but certainly a lot clearer 2026-01-01 21:15:09 +00:00
helloaidank 389d019a0b small change to doc string at top of tok_train.py (#402) 2025-12-31 12:57:26 -08:00
Andrej 088726aa7d clean up model_tag handling across scripts a bit more. 2025-12-27 20:01:09 -08:00
Andrej Karpathy 2874eda59a update to new os env var to get rid of deprecation warning 2025-12-28 03:32:46 +00:00
DU Wenjie ea4229851b bugfix 2025-12-26 19:02:12 +08:00
DU Wenjie 7840049189 bugfix keep same args style in scripts/base_eval.py 2025-12-26 17:29:08 +08:00
duwenjie 92c6654b95 bugfix save and load ckpt from model_tag dir 2025-12-21 15:07:04 +08:00
Andrej 39cccc527f small bugfix make mid_train script work even with a tiny number of iterations 2025-12-08 18:27:32 -08:00
Andrej 8b1cecaa95 Apply suggestion from @svlandeg for nicer looking comparison (Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>) 2025-12-08 18:27:06 -08:00
Andrej 58f3e84e01 clean up train/val loader in sft for consistency with mid/base 2025-12-08 18:23:57 -08:00
Sanzo00 53b3a4fb81 fix: missing val_bpb on resume 2025-11-22 11:04:20 +08:00
svlandeg 4bcc3bb698 clarify comment 2025-11-21 13:19:45 +01:00
Eric Silberstein f37d45c21f remove unneeded iter() 2025-11-20 15:14:56 -05:00
Eric Silberstein dddb95caac make mid_train script work even with a tiny number of iterations 2025-11-19 15:52:20 -05:00
Andrej 4763ce612a Small fixes to typos 2025-11-14 07:25:59 -08:00
svlandeg a2fb3c83a6 fix typos 2025-11-14 11:20:25 +01:00
Andrej Karpathy c6abcdfe3a big change: add pretraining resumption logic so that checkpoints can now be approximately resumed and training can continue. this is useful for very long runs when you don't want the anxiety of your run crashing for some reason. alternatively, it's a way to recover training in the event of loss spikes. i mean, this should have been there in v0 but it's ok. the resumption is approximate to control complexity and bloat, but it's possible we want to change that in the future. to use, set --save_every to a step interval to write checkpoints with, and then use --resume_from_step to resume optimization from a given step. only base model training (pretraining) supports this atm, but it's ok because midtraining is comparably quite a bit faster. 2025-11-13 15:34:40 +00:00
Andrej Karpathy c6b7ab7440 grad clip logging and printing and cosmetics 2025-11-05 21:08:30 +00:00
svlandeg 2ce62ec076 ensure consistency of quotes within each statement 2025-11-03 21:52:02 +01:00
svlandeg c72b8b2309 add explicit UTF-8 encoding 2025-11-03 21:27:12 +01:00
Dipesh Babu 226953b841 fix: open JSONL and results CSV with UTF-8 encoding for portability 2025-11-03 01:20:56 -05:00
svlandeg 52e85aaf80 Merge branch 'master' into fix/typo 2025-11-02 13:41:13 +01:00
Andrej Karpathy cf587acb1a move eval bundle download to be lazy and inside the python code so that we can substantially simplify the run bash scripts 2025-11-01 16:04:38 +00:00
Andrej Karpathy 7d2c4a3d95 delete pandas dep in base_eval use csv instead 2025-11-01 15:28:30 +00:00
Andrej dfc88334b6 fix tok/sec calculation bug when grad accum steps > 1 (Fix tok/sec metrics for base_train and mid_train when gradient accumulation is not 1) 2025-10-30 08:36:32 -07:00
svlandeg 70319851fc fix typo 2025-10-29 19:48:34 +01:00
svlandeg 8c9b004c99 typo fixes in scripts 2025-10-28 20:17:31 +01:00
water-vapor a9de4b1038 Fix tok/sec metrics for base_train and mid_train when gradient accumulation is not 1 2025-10-26 01:43:49 -05:00
Andrej Karpathy 8892470f29 add the SpellingBee task so that nanochat can count r in strawberry etc. along the way we had to add a bunch of new functionality, e.g. extend the calculator to support the count function of python. possibly the current TaskMixture uses way too many synthetic examples of SpellingBee because the eval gives us exactly 100% performance on spelling. We can tune this later to reclaim some wall clock time here I think 2025-10-24 14:02:48 +00:00
Andrej Karpathy 81597cd616 move the lr schedule args up in base_train so they are tunable in configurator 2025-10-24 13:27:31 +00:00
Luke Stanley defd1246aa Fix Torch crash caused by pinning on CPU 2025-10-21 20:28:10 +00:00
Andrej Karpathy a088b7a6ec use enable_gqa of pytorch sdpa, allows us to delete some code, didnt realize it's available 2025-10-21 18:07:33 +00:00
Andrej Karpathy 5bdc99abfb merge and resolve conflict 2025-10-21 17:19:10 +00:00
Andrej Karpathy dfcb1c16f1 Merge branch 'master' into cpu-mps-dev 2025-10-21 17:15:53 +00:00
Andrej Karpathy fe5aed940b add personality to nanochat. breaks previous code on git pull and requires download of a new file from s3, but there is a helpful error message so hopefully its ok 2025-10-21 15:04:58 +00:00
karpathy 2e9669e03a upgrading all other files to be able to use cpu/mps as well as cuda. various minor other changes ,e.g. changing max_iterations to num_iterations in sft script for consistency in naming 2025-10-20 10:15:17 -07:00
Andrej Karpathy c1d2ed1c13 use orig_model in sampling, silly of me to miss this 2025-10-20 00:05:09 +00:00
Andrej Karpathy 2bc521a6de use orig_model in sampling, silly of me to miss this 2025-10-20 00:04:15 +00:00
karpathy ae02650afe update the midtraining script too 2025-10-16 16:33:17 -07:00
karpathy df600b6ed5 many small tweaks. base, eval, core work now i think 2025-10-16 15:46:18 -07:00