Commit Graph

92 Commits

Author SHA1 Message Date
Andrej Karpathy fbc1484e8c add alternating window size patterns for the GPT layers, following GPT-3. Experimented a bit and found the pattern SSSL to work well - 3 short, 1 long alternating. This is now the new default and the plots look quite a bit better on flops vs. bpb 2026-01-11 21:49:54 +00:00
Andrej Karpathy 2ff7d51252 integrate Flash Attention 3. +9% tok_per_sec for d12 with ctx even as low as 2048 out of the box nice. also, ready to tune windows huge 2026-01-11 20:33:19 +00:00
Andrej Karpathy 201d705957 recover the ability to load old checkpoints by patching the lambdas if they don't exist in checkpoints 2026-01-11 20:13:12 +00:00
Andrej Karpathy aa530cdad5 Add learnable lambdas that gate the residual connection and a skip connection to the input embeddings, solid bump to val_bpb 2026-01-11 18:47:35 +00:00
Andrej Karpathy 2c4473dd1b Big Muon optimizer changes inspired by latest of modded-nanogpt. Added Polar Express, Adafactor-style variance reduction, cautious weight decay, schedule weight decay linearly to ramp down to zero. Tuned optimum weight decay for multiple model sizes d8, d12, d16, d20 and found a scaling law with optimum wd \propto 1/channels^2, including it as default into code. --weight_decay of base_train is now default on and configured optimally according to all of these experiments. Solid bump to val_bpb observed as a result of these changes. 2026-01-11 16:56:59 +00:00
Andrej Karpathy 4ddc803797 fix adamw slight bug. this chunk was copy pasted originally from modded-nanogpt, which still seems to have the bug 2026-01-08 18:18:42 +00:00
Andrej Karpathy ccf4b7f9bf nudge hyperparameters of the base script with the results of the sweeps and miniseries. vocab size down to 32K. D:N ratio from 20 to 8. add miniseries script 2026-01-07 22:11:59 +00:00
Andrej Karpathy 962b6bfba3 alright add transformers as a dep of the repo because it should be easy to evaluate the CORE score of HF models. Not super happy about it but i tried it and the uv.lock doesn't get bloated as much as i expected 2026-01-04 20:37:28 +00:00
Andrej Karpathy eb7bbc1b66 delete the configurator in favor of argparse and clean up a lot of kwarg details to make them more consistent across all scripts 2026-01-04 19:14:23 +00:00
Andrej Karpathy 507d54224a fix small bug where this would break if git stage has deleted files 2026-01-04 19:11:43 +00:00
Andrej Karpathy be56d29b87 simplify redundant if/elif in bloat metrics
🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-04 01:40:42 +00:00
Andrej Karpathy ee79f29fbd replace files-to-prompt with git ls-files for bloat metrics
files-to-prompt was including untracked files (knowledge/, dev scripts, etc.) which inflated the bloat metrics. now we use git ls-files to only count tracked source files, which is more accurate and removes an external dependency.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-04 01:38:15 +00:00
Andrej Karpathy 48abd7d85f simplify, clarify and slightly tune model initialization. should be very slightly better possibly, but certainly a lot clearer 2026-01-01 21:15:09 +00:00
Paweł Krefta 10231dfb40 Fix conversation scroll to bottom on some browsers + remove duplicated padding (#348) 2025-12-31 13:03:22 -08:00
Andrej Karpathy 8f979a8bda fix: sample first token independently for each row in multi-sample generation
Previously, when generating multiple samples (num_samples > 1), the first
token after prefill was sampled once and broadcast to all rows, causing
all samples to start identically. Now the prefill logits are expanded to
num_samples and sampled independently for each row.

Also simplified the generation loop by moving the forward pass to the end
of the loop, eliminating the first_iteration flag and if/else branching.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-28 04:52:13 +00:00
Dipesh Babu 2f2d7ab80c fix: safe DDP cleanup (check initialized PG, not just env) (#256) 2025-12-27 20:27:40 -08:00
Andrej Karpathy e1770a3061 remove spurious cast, gets compiled away anyway but it's confusing people 2025-12-27 23:07:48 +00:00
Andrej Karpathy 49389ecaa8 fix tf32 warning for deprecated api use 2025-12-27 22:03:06 +00:00
Matěj Kripner d314e96aa2 formatting 2025-12-09 12:48:46 +01:00
Matěj Kripner bbc57da7d5 slightly nicer error message 2025-12-09 12:46:48 +01:00
Matěj Kripner f1bf69d562 feat: pad vocab size to 64 for DDP optimizers and efficiency 2025-12-09 12:38:18 +01:00
Andrej 7931e0903a rename checkpoint_dir to checkpoints_dir for consistency. 2025-12-08 18:32:12 -08:00
Andrej 849d95ae1f remove unnecessary check to make the logic in CausalSelfAttention.forward() clearer 2025-12-08 18:30:37 -08:00
Andrej 1b2a675c88 Improve KV cache code readability 2025-12-08 18:19:05 -08:00
Andrej 72a7cf2bc4 Fix distributed Parquet dataloader resume for multi-epoch training 2025-12-08 18:15:02 -08:00
Andrej Karpathy bffdb2ef91 group common code to make things neater in gpt logit computation 2025-12-09 02:01:05 +00:00
Andrej cbf30c842c apply float32 cast before logits softcapping so the tanh is in fp32. torch compile fuses this correctly with no extra memory costs. 2025-12-08 14:17:43 -08:00
Andrej Karpathy 90442de35f fix bug where any rank has to be able to create checkpoint_dir if saving optim 2025-12-08 20:45:19 +00:00
sunyujun03 01ea71be39 Fix distributed Parquet dataloader resume for multi-epoch training 2025-12-08 00:10:19 -06:00
deepbuilder 06677c30e0 Refactor dimension validation for KV cache 2025-11-28 15:22:18 -05:00
deepbuilder a770dcef2e Fix kv_cache indexing to explicitly include head dimension 2025-11-28 15:00:14 -05:00
spjosyula 16788eed3c fix(model): apply float32 cast before logits softcapping
This change ensures that the logits softcapping operation (tanh) is performed in float32 precision rather than bfloat16. Previously, the code cast to float32 after the tanh operation, which meant the non-linearity was computed with bfloat16 precision
2025-11-23 20:12:09 +05:30
Eric Silberstein 5c93a56be5 remove unnecessary check 2025-11-19 16:31:41 -05:00
Eric Silberstein a4a0959c73 renamed find_largest_model() argument checkpoint_dir to checkpoints_dir for clarity 2025-11-19 15:33:36 -05:00
Sam Abrahams 11e68bf442 Fix comment: rotary embeddings final dimension size 2025-11-17 11:32:56 -05:00
Andrej Karpathy bc1fca39f3 mqa -> gqa to reduce confusion 2025-11-15 15:43:37 +00:00
Andrej f66a780f68 Fix torch.dtype mismatching when running engine inline test. 2025-11-14 07:28:29 -08:00
Andrej 4763ce612a Small fixes to typos 2025-11-14 07:25:59 -08:00
Sofie Van Landeghem c6f5bd67db revert change of base to sft for quick inline test 2025-11-14 12:20:03 +01:00
svlandeg a2fb3c83a6 fix typos 2025-11-14 11:20:25 +01:00
Andrej Karpathy c6abcdfe3a big change: add pretraining resumption logic so that checkpoints can now be approximately resumed and training can continue. this is useful for very long runs when you don't want the anxiety of your run crashing for some reason. alternatively, it's a way to recover training in the event of loss spikes. i mean, this should have been there in v0 but it's ok. the resumption is approximate to control complexity and bloat, but it's possible we want to change that in the future. to use, set --save_every to a step interval to write checkpoints with, and then use --resume_from_step to resume optimization from a given step. only base model training (pretraining) supports this atm, but it's ok because midtraining is comparably quite a bit faster. 2025-11-13 15:34:40 +00:00
Andrej Karpathy 91f09ccd0d minor fix comment in engine 2025-11-13 15:28:18 +00:00
howardgao@outlook.com b399e43168 fix engine test bug 2025-11-06 08:56:45 +08:00
Andrej 3a2ae631c4 Merge branch 'master' into master 2025-11-04 16:35:02 -08:00
Andrej d1558c7873 handle bf16 on MPS by casting to fp32 during load checkpoint 2025-11-04 09:42:50 -08:00
Yasser Makram 1e89af9862 Replace fcntl with filelock for Windows compatibility 2025-11-04 07:22:34 +00:00
Dipesh Babu 7a40ee77b4 fix: cast bf16 to fp32 on MPS (like CPU) to avoid dtype issues 2025-11-03 16:00:56 -05:00
svlandeg 2ce62ec076 ensure consistency of quotes within each statement 2025-11-03 21:52:02 +01:00
svlandeg c72b8b2309 add explicit UTF-8 encoding 2025-11-03 21:27:12 +01:00
Josh Odom f1e15f5f4d Fixing subtle bug: lstrip removes all matching characters, including potentially required ones. Use removeprefix instead. 2025-11-02 23:40:37 -06:00