12 Commits

Author SHA1 Message Date
Andrej Karpathy e1770a3061 remove spurious cast, gets compiled away anyway but it's confusing people 2025-12-27 23:07:48 +00:00
Matěj Kripner d314e96aa2 formatting 2025-12-09 12:48:46 +01:00
Matěj Kripner f1bf69d562 feat: pad vocab size to 64 for DDP optimizers and efficiency 2025-12-09 12:38:18 +01:00
Andrej 849d95ae1f remove unnecessary check to make the logic in CausalSelfAttention.forward() clearer 2025-12-08 18:30:37 -08:00
Andrej Karpathy bffdb2ef91 group common code to make things neater in gpt logit computation 2025-12-09 02:01:05 +00:00
spjosyula 16788eed3c fix(model): apply float32 cast before logits softcapping 2025-11-23 20:12:09 +05:30
    This change ensures that the logits softcapping operation (tanh) is performed in float32 precision rather than bfloat16. Previously, the code cast to float32 after the tanh operation, which meant the non-linearity was computed with bfloat16 precision.
Eric Silberstein 5c93a56be5 remove unnecessary check 2025-11-19 16:31:41 -05:00
Sam Abrahams 11e68bf442 Fix comment: rotary embeddings final dimension size 2025-11-17 11:32:56 -05:00
Andrej Karpathy bc1fca39f3 mqa -> gqa to reduce confusion 2025-11-15 15:43:37 +00:00
Andrej Karpathy a088b7a6ec use enable_gqa of pytorch sdpa, allows us to delete some code, didnt realize it's available 2025-10-21 18:07:33 +00:00
karpathy 306bc380ab add support for CPU and for MPS. I had to change a few cosmetic things. I also discovered I think a bit of a bug, where I was casting wte to bfloat16 in the wrong place (the model init) instead of in init_weights 2025-10-16 10:04:43 -07:00
karpathy 3a5e0bc50b initial commit 2025-10-13 06:49:24 -07:00
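
Note on f1bf69d562 (pad vocab size to 64): the commit message says the vocabulary size is padded to a multiple of 64. A minimal sketch of that rounding follows. The helper name `pad_vocab_size` and the example vocabulary size are assumptions made for illustration, not taken from the commit, and the actual code may differ.

```python
def pad_vocab_size(vocab_size: int, multiple: int = 64) -> int:
    """Round vocab_size up to the nearest multiple of `multiple`.

    Hypothetical helper for illustration; not the repository's code.
    """
    if vocab_size <= 0 or multiple <= 0:
        raise ValueError("vocab_size and multiple must be positive")
    # Ceiling division via negated floor division, then scale back up.
    return -(-vocab_size // multiple) * multiple


if __name__ == "__main__":
    # 50257 is the GPT-2 vocabulary size, used here only as an example input.
    print(pad_vocab_size(50257))  # 50304
    print(pad_vocab_size(64))     # 64 (already a multiple, left unchanged)
```

The padded rows correspond to no real token, so a model that uses this would presumably need to ignore or slice off the extra logits.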