Andrej Karpathy
|
22a71aa3d3
|
fuse adamw into a single torch compiled kernel similar to muon. it's about 1.7X faster, but overall it's so tiny that it's not making a major dent
|
2026-01-15 23:30:44 +00:00 |
|
Andrej Karpathy
|
aa530cdad5
|
Add learnable lambdas that gate the residual connection and a skip connection to the input embeddings, solid bump to val_bpb
|
2026-01-11 18:47:35 +00:00 |
|
Andrej Karpathy
|
4ddc803797
|
fix adamw slight bug. this chunk was copy pasted originally from modded-nanogpt, which still seems to have the bug
|
2026-01-08 18:18:42 +00:00 |
|
Matěj Kripner
|
bbc57da7d5
|
slightly nicer error message
|
2025-12-09 12:46:48 +01:00 |
|
Matěj Kripner
|
f1bf69d562
|
feat: pad vocab size to 64 for DDP optimizers and efficiency
|
2025-12-09 12:38:18 +01:00 |
|
Sermet Pekin
|
49cd02f283
|
fix: remove unnecessary tensor allocation in DistAdamW optimizer
|
2025-10-20 12:03:26 +03:00 |
|
karpathy
|
3a5e0bc50b
|
initial commit
|
2025-10-13 06:49:24 -07:00 |
|