14 Commits

Author SHA1 Message Date
Andrej 7931e0903a rename checkpoint_dir to checkpoints_dir for consistency. 2025-12-08 18:32:12 -08:00
Andrej Karpathy 90442de35f fix bug where any rank has to be able to create checkpoint_dir if saving optim 2025-12-08 20:45:19 +00:00
Eric Silberstein a4a0959c73 renamed find_largest_model() argument checkpoint_dir to checkpoints_dir for clarity 2025-11-19 15:33:36 -05:00
Andrej Karpathy c6abcdfe3a big change: add pretraining resumption logic so that checkpoints can now be approximately resumed and training can continue. this is useful for very long runs when you don't want the anxiety of your run crashing for some reason. alternatively, it's a way to recover training in the event of loss spikes. i mean, this should have been there in v0 but it's ok. the resumption is approximate to control complexity and bloat, but it's possible we want to change that in the future. to use, set --save_every to a step interval to write checkpoints with, and then use --resume_from_step to resume optimization from a given step. only base model training (pretraining) supports this atm, but it's ok because midtraining is comparably quite a bit faster. 2025-11-13 15:34:40 +00:00
Andrej d1558c7873 handle bf16 on MPS by casting to fp32 during load checkpoint 2025-11-04 09:42:50 -08:00
Dipesh Babu 7a40ee77b4 fix: cast bf16 to fp32 on MPS (like CPU) to avoid dtype issues 2025-11-03 16:00:56 -05:00
svlandeg 2ce62ec076 ensure consistency of quotes within each statement 2025-11-03 21:52:02 +01:00
svlandeg c72b8b2309 add explicit UTF-8 encoding 2025-11-03 21:27:12 +01:00
Josh Odom f1e15f5f4d Fixing subtle bug: lstrip removes all matching characters, including potentially required ones. Use removeprefix instead. 2025-11-02 23:40:37 -06:00
svlandeg 5bfcd31b73 revert more formatting changes 2025-11-02 14:17:10 +01:00
svlandeg 036a3c5881 revert formatting changes to facilitate review 2025-11-02 14:16:43 +01:00
Manuel Saelices d54c9cbf8c CPU Support, as bfloat16 params breaks inference 2025-11-01 23:38:50 +01:00
Mirza-Samad-Ahmed-Baig afaa5b4c90 Fix: Handle missing d<number> model tags in find_largest_model 2025-10-14 00:24:07 +03:00
karpathy 3a5e0bc50b initial commit 2025-10-13 06:49:24 -07:00
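
The lstrip fix in f1e15f5f4d is easy to misread, so here is a minimal sketch of the pitfall that commit message describes. The prefix and the example names are hypothetical, chosen only for illustration; they are not taken from the repository. It assumes Python 3.9 or later, where str.removeprefix is available.

```python
# Hypothetical prefix and names for illustration; not taken from the repository.
PREFIX = "model_"


def strip_with_lstrip(name: str) -> str:
    # str.lstrip treats its argument as a SET of characters and keeps
    # stripping from the left while the next character is in that set.
    return name.lstrip(PREFIX)


def strip_with_removeprefix(name: str) -> str:
    # str.removeprefix (Python 3.9+) removes the exact prefix at most once.
    return name.removeprefix(PREFIX)


# Works by luck: "0" is not one of the characters in "model_".
lucky = strip_with_lstrip("model_000100")        # "000100"

# Breaks: every character of "demo" is also in "model_", so all of it is stripped.
broken = strip_with_lstrip("model_demo")         # ""

# removeprefix only drops the literal prefix.
fixed = strip_with_removeprefix("model_demo")    # "demo"
```

The lstrip version appears correct as long as the remainder starts with a character outside the prefix's character set, which is why this kind of bug tends to go unnoticed until a name happens to collide.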