Andrej Karpathy
|
c6abcdfe3a
|
big change: add pretraining resumption logic so that checkpoints can now be approximately resumed and training can continue. this is useful for very long runs when you don't want the anxiety of your run crashing for some reason. alternatively, it's a way to recover training in the event of loss spikes. i mean, this should have been there in v0 but it's ok. the resumption is approximate to control complexity and bloat, but it's possible we want to change that in the future. to use, set --save_every to the step interval at which checkpoints are written, and then use --resume_from_step to resume optimization from a given step. only base model training (pretraining) supports this atm, but it's ok because midtraining is comparatively quite a bit faster.
|
2025-11-13 15:34:40 +00:00 |
|
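A minimal sketch of the save/resume pattern described in the commit message above. Only the flag names `--save_every` and `--resume_from_step` come from the message. The `train` function, the JSON checkpoint layout, the file naming, and the `-1` defaults are hypothetical stand-ins and are not taken from the repository. Note that this toy resumes exactly, whereas the commit describes the real resumption as approximate.

```python
import json
import tempfile
from pathlib import Path


def train(checkpoint_dir, num_steps, save_every=-1, resume_from_step=-1):
    """Toy loop (hypothetical): the 'model' is one float nudged toward 1.0."""
    checkpoint_dir = Path(checkpoint_dir)
    state = {"step": 0, "weight": 0.0}

    # Resume: load the state written at the requested step, if any.
    if resume_from_step >= 0:
        path = checkpoint_dir / f"step_{resume_from_step:06d}.json"
        with open(path, encoding="utf-8") as f:
            state = json.load(f)

    while state["step"] < num_steps:
        state["weight"] += 0.1 * (1.0 - state["weight"])  # stand-in for an optimizer step
        state["step"] += 1

        # Save: write a checkpoint every `save_every` steps.
        if save_every > 0 and state["step"] % save_every == 0:
            path = checkpoint_dir / f"step_{state['step']:06d}.json"
            with open(path, "w", encoding="utf-8") as f:
                json.dump(state, f)

    return state


with tempfile.TemporaryDirectory() as tmp:
    full = train(tmp, num_steps=10, save_every=5)           # uninterrupted run
    resumed = train(tmp, num_steps=10, resume_from_step=5)  # pick up at step 5
```

In this sketch a run resumed from step 5 ends in the same state as the uninterrupted run, because the checkpoint captures everything the loop depends on.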
Andrej
|
d1558c7873
|
handle bf16 on MPS by casting to fp32 during load checkpoint
|
2025-11-04 09:42:50 -08:00 |
|
Dipesh Babu
|
7a40ee77b4
|
fix: cast bf16 to fp32 on MPS (like CPU) to avoid dtype issues
|
2025-11-03 16:00:56 -05:00 |
|
svlandeg
|
2ce62ec076
|
ensure consistency of quotes within each statement
|
2025-11-03 21:52:02 +01:00 |
|
svlandeg
|
c72b8b2309
|
add explicit UTF-8 encoding
|
2025-11-03 21:27:12 +01:00 |
|
Josh Odom
|
f1e15f5f4d
|
Fixing subtle bug: lstrip removes all matching characters, including potentially required ones. Use removeprefix instead.
|
2025-11-02 23:40:37 -06:00 |
|
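The bug class named in the commit message above can be reproduced in a few lines. `str.lstrip` treats its argument as a set of characters to strip, not as a literal prefix, while `str.removeprefix` (Python 3.9+) removes the exact prefix once. The key and prefix below are assumptions chosen for illustration. The commit message does not say which strings were involved.

```python
# Hypothetical example strings, chosen to show the difference.
key = "_orig_mod.model.weight"
prefix = "_orig_mod."

# lstrip strips any leading run of the characters {_, o, r, i, g, m, d, .},
# so it also eats the "mod" at the start of "model".
stripped_wrong = key.lstrip(prefix)

# removeprefix removes the literal prefix only, and only if it is present.
stripped_right = key.removeprefix(prefix)

print(stripped_wrong)  # el.weight
print(stripped_right)  # model.weight
```

A key that does not start with the prefix passes through `removeprefix` unchanged, which is usually the behavior wanted when cleaning up key names.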
svlandeg
|
5bfcd31b73
|
revert more formatting changes
|
2025-11-02 14:17:10 +01:00 |
|
svlandeg
|
036a3c5881
|
revert formatting changes to facilitate review
|
2025-11-02 14:16:43 +01:00 |
|
Manuel Saelices
|
d54c9cbf8c
|
CPU Support, as bfloat16 params break inference
|
2025-11-01 23:38:50 +01:00 |
|
Mirza-Samad-Ahmed-Baig
|
afaa5b4c90
|
Fix: Handle missing d<number> model tags in find_largest_model
|
2025-10-14 00:24:07 +03:00 |
|
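A sketch of the kind of guard the commit message above describes: picking the largest model among tags of the form `d<number>` while tolerating tags that do not match. The function name `find_largest_model_tag`, its signature, and the choice to return `None` when nothing matches are assumptions of this sketch. The commit message does not state what the actual `find_largest_model` does in that case.

```python
import re


def find_largest_model_tag(tags):
    """Return the d<number> tag with the largest number, or None (hypothetical)."""
    candidates = []
    for tag in tags:
        match = re.fullmatch(r"d(\d+)", tag)
        if match:  # skip tags that are not of the form d<number>
            candidates.append((int(match.group(1)), tag))
    if not candidates:
        return None  # sketch's choice; the real fallback is not given in the log
    # Compare numerically so that d32 ranks above d8.
    return max(candidates)[1]


print(find_largest_model_tag(["d4", "d32", "d8", "scratch"]))  # d32
print(find_largest_model_tag(["scratch", "baseline"]))         # None
```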
karpathy
|
3a5e0bc50b
|
initial commit
|
2025-10-13 06:49:24 -07:00 |
|