nanochat-omni

Author	SHA1	Message	Date
Andrej Karpathy	2ff7d51252	integrate Flash Attention 3. +9% tok_per_sec for d12 with ctx even as low as 2048 out of the box nice. also, ready to tune windows huge	2026-01-11 20:33:19 +00:00
Andrej Karpathy	aa530cdad5	Add learnable lambdas that gate the residual connection and a skip connection to the input embeddings, solid bump to val_bpb	2026-01-11 18:47:35 +00:00
Andrej Karpathy	2c4473dd1b	Big Muon optimizer changes inspired by latest of modded-nanogpt. Added Polar Express, Adafactor-style variance reduction, cautious weight decay, schedule weight decay linearly to ramp down to zero. Tuned optimum weight decay for multiple model sizes d8, d12, d16, d20 and found a scaling law with optimum wd \propto 1/channels^2, including it as default into code. --weight_decay of base_train is now default on and configured optimally according to all of these experiments. Solid bump to val_bpb observed as a result of these changes.	2026-01-11 16:56:59 +00:00
Andrej Karpathy	ccf4b7f9bf	nudge hyperparameters of the base script with the results of the sweeps and miniseries. vocab size down to 32K. D:N ratio from 20 to 8. add miniseries script	2026-01-07 22:11:59 +00:00
Andrej Karpathy	48abd7d85f	simplify, clarify and slightly tune model initialization. should be very slightly better possibly, but certainly a lot clearer	2026-01-01 21:15:09 +00:00
Andrej Karpathy	e1770a3061	remove spurious cast, gets compiled away anyway but it's confusing people	2025-12-27 23:07:48 +00:00
Matěj Kripner	d314e96aa2	formatting	2025-12-09 12:48:46 +01:00
Matěj Kripner	f1bf69d562	feat: pad vocab size to 64 for DDP optimizers and efficiency	2025-12-09 12:38:18 +01:00
Andrej	849d95ae1f	remove unnecessary check to make the logic in CausalSelfAttention.forward() clearer	2025-12-08 18:30:37 -08:00
Andrej Karpathy	bffdb2ef91	group common code to make things neater in gpt logit computation	2025-12-09 02:01:05 +00:00
spjosyula	16788eed3c	fix(model): apply float32 cast before logits softcapping. This change ensures that the logits softcapping operation (tanh) is performed in float32 precision rather than bfloat16. Previously, the code cast to float32 after the tanh operation, which meant the non-linearity was computed with bfloat16 precision.	2025-11-23 20:12:09 +05:30
Eric Silberstein	5c93a56be5	remove unnecessary check	2025-11-19 16:31:41 -05:00
Sam Abrahams	11e68bf442	Fix comment: rotary embeddings final dimension size	2025-11-17 11:32:56 -05:00
Andrej Karpathy	bc1fca39f3	mqa -> gqa to reduce confusion	2025-11-15 15:43:37 +00:00
Andrej Karpathy	a088b7a6ec	use enable_gqa of pytorch sdpa, allows us to delete some code, didnt realize it's available	2025-10-21 18:07:33 +00:00
karpathy	306bc380ab	add support for CPU and for MPS. I had to change a few cosmetic things. I also discovered I think a bit of a bug, where I was casting wte to bfloat16 in the wrong place (the model init) instead of in init_weights	2025-10-16 10:04:43 -07:00
karpathy	3a5e0bc50b	initial commit	2025-10-13 06:49:24 -07:00

17 Commits
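
Commit f1bf69d562 mentions padding the vocab size "to 64". A minimal sketch of what that could look like follows, assuming it means rounding the vocab size up to the next multiple of 64. The helper name pad_vocab_size and its signature are hypothetical and are not taken from the repository.

```python
def pad_vocab_size(vocab_size: int, multiple: int = 64) -> int:
    """Round vocab_size up to the nearest multiple of `multiple`.

    Hypothetical helper for illustration only; the repository's actual
    implementation may differ.
    """
    if vocab_size <= 0 or multiple <= 0:
        raise ValueError("vocab_size and multiple must be positive")
    # Integer ceiling division, then scale back up to a multiple.
    return ((vocab_size + multiple - 1) // multiple) * multiple


if __name__ == "__main__":
    # Sizes already on a multiple of 64 are left unchanged.
    for v in (32768, 50257, 65):
        print(v, "->", pad_vocab_size(v))
```

With a padded size, the embedding and output projection dimensions divide evenly by 64. Any extra rows would correspond to token ids the tokenizer never emits, so they would need to be ignored or masked when computing logits.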