nanochat-omni

Author	SHA1	Message	Date
Andrej Karpathy	6bb92403d5	changes and optimizations to muon, making it more efficient and simpler/cleaner a bit	2026-01-15 03:20:48 +00:00
Andrej Karpathy	3142ca1a28	minor helpful message	2026-01-15 03:20:21 +00:00
Andrej Karpathy	3b50b77ed3	fix base_loss to report correct loss by switching the dataloader to the new default	2026-01-13 22:09:36 +00:00
Andrej Karpathy	43c29dd9d5	Big DataLoader refactor: BOS-aligned dataloaders with epoch tracking for pre/mid-training. The new DataLoader ensures that every token sequence in train/val batches has a BOS token at the beginning. Therefore, no token streams start abruptly in the middle of a document, which could be confusing for the model. Note that this changes the loss scale because there are fewer confusing tokens in the train/val batches. The main downside is that we now waste about 35% of tokens due to cropping. This is ok because we have a lot of data. See dev/LOG.md entry for this change for a lot more information.	2026-01-13 20:05:47 +00:00
Andrej Karpathy	23985413aa	adjust the comment on the regex pattern per recent experiment, see dev/LOG.md	2026-01-13 17:50:39 +00:00
Andrej Karpathy	21608ec51e	allow base_loss to report the loss of any arbitrary huggingface model similar to base_eval. had to change dataloader to be a lot better and just take tokenizer, not load the nanochat one. much better this way anyway	2026-01-12 03:10:13 +00:00
Andrej Karpathy	fbc1484e8c	add alternating window size patterns for the GPT layers, following GPT-3. Experimented a bit and found the pattern SSSL to work well - 3 short, 1 long alternating. This is now the new default and the plots look quite a bit better on flops vs. bpb	2026-01-11 21:49:54 +00:00
Andrej Karpathy	2ff7d51252	integrate Flash Attention 3. +9% tok_per_sec for d12 with ctx even as low as 2048 out of the box nice. also, ready to tune windows huge	2026-01-11 20:33:19 +00:00
Andrej Karpathy	201d705957	recover the ability to load old checkpoints by patching the lambdas if they don't exist in checkpoints	2026-01-11 20:13:12 +00:00
Andrej Karpathy	aa530cdad5	Add learnable lambdas that gate the residual connection and a skip connection to the input embeddings, solid bump to val_bpb	2026-01-11 18:47:35 +00:00
Andrej Karpathy	2c4473dd1b	Big Muon optimizer changes inspired by latest of modded-nanogpt. Added Polar Express, Adafactor-style variance reduction, cautious weight decay, schedule weight decay linearly to ramp down to zero. Tuned optimum weight decay for multiple model sizes d8, d12, d16, d20 and found a scaling law with optimum wd \propto 1/channels^2, including it as default into code. --weight_decay of base_train is now default on and configured optimally according to all of these experiments. Solid bump to val_bpb observed as a result of these changes.	2026-01-11 16:56:59 +00:00
Andrej Karpathy	4ddc803797	fix adamw slight bug. this chunk was copy pasted originally from modded-nanogpt, which still seems to have the bug	2026-01-08 18:18:42 +00:00
Andrej Karpathy	ccf4b7f9bf	nudge hyperparameters of the base script with the results of the sweeps and miniseries. vocab size down to 32K. D:N ratio from 20 to 8. add miniseries script	2026-01-07 22:11:59 +00:00
Andrej Karpathy	962b6bfba3	alright add transformers as a dep of the repo because it should be easy to evaluate the CORE score of HF models. Not super happy about it but i tried it and the uv.lock doesn't get bloated as much as i expected	2026-01-04 20:37:28 +00:00
Andrej Karpathy	eb7bbc1b66	delete the configurator in favor of argparse and clean up a lot of kwarg details to make them more consistent across all scripts	2026-01-04 19:14:23 +00:00
Andrej Karpathy	507d54224a	fix small bug where this would break if git stage has deleted files	2026-01-04 19:11:43 +00:00
Andrej Karpathy	be56d29b87	simplify redundant if/elif in bloat metrics 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-04 01:40:42 +00:00
Andrej Karpathy	ee79f29fbd	replace files-to-prompt with git ls-files for bloat metrics. files-to-prompt was including untracked files (knowledge/, dev scripts, etc.) which inflated the bloat metrics. now we use git ls-files to only count tracked source files, which is more accurate and removes an external dependency. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-04 01:38:15 +00:00
Andrej Karpathy	48abd7d85f	simplify, clarify and slightly tune model initialization. should be very slightly better possibly, but certainly a lot clearer	2026-01-01 21:15:09 +00:00
Paweł Krefta	10231dfb40	Fix conversation scroll to bottom on some browsers + remove duplicated padding (#348)	2025-12-31 13:03:22 -08:00
Andrej Karpathy	8f979a8bda	fix: sample first token independently for each row in multi-sample generation. Previously, when generating multiple samples (num_samples > 1), the first token after prefill was sampled once and broadcast to all rows, causing all samples to start identically. Now the prefill logits are expanded to num_samples and sampled independently for each row. Also simplified the generation loop by moving the forward pass to the end of the loop, eliminating the first_iteration flag and if/else branching. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2025-12-28 04:52:13 +00:00
Dipesh Babu	2f2d7ab80c	fix: safe DDP cleanup (check initialized PG, not just env) (#256)	2025-12-27 20:27:40 -08:00
Andrej Karpathy	e1770a3061	remove spurious cast, gets compiled away anyway but it's confusing people	2025-12-27 23:07:48 +00:00
Andrej Karpathy	49389ecaa8	fix tf32 warning for deprecated api use	2025-12-27 22:03:06 +00:00
Matěj Kripner	d314e96aa2	formatting	2025-12-09 12:48:46 +01:00
Matěj Kripner	bbc57da7d5	slightly nicer error message	2025-12-09 12:46:48 +01:00
Matěj Kripner	f1bf69d562	feat: pad vocab size to 64 for DDP optimizers and efficiency	2025-12-09 12:38:18 +01:00
Andrej	7931e0903a	rename checkpoint_dir to checkpoints_dir for consistency.	2025-12-08 18:32:12 -08:00
Andrej	849d95ae1f	remove unnecessary check to make the logic in CausalSelfAttention.forward() clearer	2025-12-08 18:30:37 -08:00
Andrej	1b2a675c88	Improve KV cache code readability	2025-12-08 18:19:05 -08:00
Andrej	72a7cf2bc4	Fix distributed Parquet dataloader resume for multi-epoch training	2025-12-08 18:15:02 -08:00
Andrej Karpathy	bffdb2ef91	group common code to make things neater in gpt logit computation	2025-12-09 02:01:05 +00:00
Andrej	cbf30c842c	apply float32 cast before logits softcapping so the tanh is in fp32. torch compile fuses this correctly with no extra memory costs.	2025-12-08 14:17:43 -08:00
Andrej Karpathy	90442de35f	fix bug where any rank has to be able to create checkpoint_dir if saving optim	2025-12-08 20:45:19 +00:00
sunyujun03	01ea71be39	Fix distributed Parquet dataloader resume for multi-epoch training	2025-12-08 00:10:19 -06:00
deepbuilder	06677c30e0	Refactor dimension validation for KV cache	2025-11-28 15:22:18 -05:00
deepbuilder	a770dcef2e	Fix kv_cache indexing to explicitly include head dimension	2025-11-28 15:00:14 -05:00
spjosyula	16788eed3c	fix(model): apply float32 cast before logits softcapping. This change ensures that the logits softcapping operation (tanh) is performed in float32 precision rather than bfloat16. Previously, the code cast to float32 after the tanh operation, which meant the non-linearity was computed with bfloat16 precision.	2025-11-23 20:12:09 +05:30
Eric Silberstein	5c93a56be5	remove unnecessary check	2025-11-19 16:31:41 -05:00
Eric Silberstein	a4a0959c73	renamed find_largest_model() argument checkpoint_dir to checkpoints_dir for clarity	2025-11-19 15:33:36 -05:00
Sam Abrahams	11e68bf442	Fix comment: rotary embeddings final dimension size	2025-11-17 11:32:56 -05:00
Andrej Karpathy	bc1fca39f3	mqa -> gqa to reduce confusion	2025-11-15 15:43:37 +00:00
Andrej	f66a780f68	Fix torch.dtype mismatching when running engine inline test.	2025-11-14 07:28:29 -08:00
Andrej	4763ce612a	Small fixes to typos	2025-11-14 07:25:59 -08:00
Sofie Van Landeghem	c6f5bd67db	revert change of base to sft for quick inline test	2025-11-14 12:20:03 +01:00
svlandeg	a2fb3c83a6	fix typos	2025-11-14 11:20:25 +01:00
Andrej Karpathy	c6abcdfe3a	big change: add pretraining resumption logic so that checkpoints can now be approximately resumed and training can continue. this is useful for very long runs when you don't want the anxiety of your run crashing for some reason. alternatively, it's a way to recover training in the event of loss spikes. i mean, this should have been there in v0 but it's ok. the resumption is approximate to control complexity and bloat, but it's possible we want to change that in the future. to use, set --save_every to a step interval to write checkpoints with, and then use --resume_from_step to resume optimization from a given step. only base model training (pretraining) supports this atm, but it's ok because midtraining is comparably quite a bit faster.	2025-11-13 15:34:40 +00:00
Andrej Karpathy	91f09ccd0d	minor fix comment in engine	2025-11-13 15:28:18 +00:00
howardgao@outlook.com	b399e43168	fix engine test bug	2025-11-06 08:56:45 +08:00
Andrej	3a2ae631c4	Merge branch 'master' into master	2025-11-04 16:35:02 -08:00
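
Commit 43c29dd9d5 above describes BOS-aligned batches: every row starts with a BOS token, and tokens that do not fit are cropped. The sketch below is one simple way such packing could work, not the repository's actual DataLoader. The function name `bos_aligned_rows`, the `BOS = 0` token id, and the greedy fill-then-crop policy are all assumptions made for illustration.

```python
BOS = 0  # hypothetical BOS token id, chosen only for this sketch


def bos_aligned_rows(docs, seq_len):
    """Yield rows of exactly seq_len tokens, each beginning with BOS.

    docs: iterable of token-id lists (without a leading BOS).
    Assumed policy: fill a row with whole documents, crop the document
    that overflows the row, and discard its cropped remainder so the
    next row can start cleanly at a document boundary.
    """
    row = []
    for doc in docs:
        tokens = [BOS] + list(doc)
        space = seq_len - len(row)
        row.extend(tokens[:space])  # anything past `space` is wasted
        if len(row) == seq_len:
            yield row
            row = []
    # A trailing partial row is dropped rather than padded.


docs = [[1, 2, 3], [4, 5, 6, 7, 8], [9], [10, 11]]
rows = list(bos_aligned_rows(docs, seq_len=6))
```

In this toy run the second document is cropped after its first token and the last two documents never fill a row, which illustrates the kind of token waste the commit message mentions.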


98 Commits