nanochat-omni

Author	SHA1	Message	Date
Andrej Karpathy	22a71aa3d3	fuse adamw into a single torch compiled kernel similar to muon. it's about 1.7X faster, but overall it's so tiny that it's not making a major dent	2026-01-15 23:30:44 +00:00
Andrej Karpathy	aa530cdad5	Add learnable lambdas that gate the residual connection and a skip connection to the input embeddings, solid bump to val_bpb	2026-01-11 18:47:35 +00:00
Andrej Karpathy	4ddc803797	fix adamw slight bug. this chunk was copy pasted originally from modded-nanogpt, which still seems to have the bug	2026-01-08 18:18:42 +00:00
Matěj Kripner	bbc57da7d5	slightly nicer error message	2025-12-09 12:46:48 +01:00
Matěj Kripner	f1bf69d562	feat: pad vocab size to 64 for DDP optimizers and efficiency	2025-12-09 12:38:18 +01:00
Sermet Pekin	49cd02f283	fix: remove unnecessary tensor allocation in DistAdamW optimizer	2025-10-20 12:03:26 +03:00
karpathy	3a5e0bc50b	initial commit	2025-10-13 06:49:24 -07:00
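
Note on f1bf69d562 (pad vocab size to 64): the commit message does not show the code, so the following is only a minimal sketch of rounding a vocabulary size up to a multiple of 64. The helper name `pad_to_multiple` is hypothetical and is not taken from the repository. The assumption here is that padding makes the embedding rows divide evenly across ranks in a sharded optimizer and gives matmul-friendly shapes.

```python
def pad_to_multiple(n: int, multiple: int = 64) -> int:
    """Round n up to the nearest multiple of `multiple` (hypothetical helper)."""
    if n < 0 or multiple <= 0:
        raise ValueError("n must be >= 0 and multiple must be > 0")
    # Integer ceiling division, then scale back up.
    return ((n + multiple - 1) // multiple) * multiple


if __name__ == "__main__":
    # Example with a GPT-2 sized vocabulary (illustrative value only).
    print(pad_to_multiple(50257))
```

Under this scheme the extra rows are unused token ids; a model that pads this way would typically ignore or mask the corresponding logits.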