Commit Graph

130 Commits

Author SHA1 Message Date
Andrej Karpathy 788dadeb88 a number of upgrades to SFT script to bring it up to date w.r.t. pretraining and tuning some of its kwargs based on sweeps 2026-02-16 14:41:53 +00:00
Andrej Karpathy e569b59f92 delete torchao dependency, create our own exact API-matched version of Float8Linear, document it very well. for some poorly understood reason, the performance is not only ~identical but actually runs 3% faster. despite of it being significantly simpler and much less code. i don't fully understand why/how atm 2026-02-10 18:46:39 +00:00
Andrej Karpathy 98eed6df18 bring back an assert guarding against bad param sizing 2026-02-05 18:14:30 +00:00
Andrej Karpathy 718e5e9d67 correctly reference NorMuon and fix misleading terms that i may have hastily ported over from modded-nanogpt 2026-02-05 01:39:26 +00:00
Sofie Van Landeghem 72b9064f9d remove leftover mid references (#491) 2026-02-02 08:33:46 -08:00
Andrej Karpathy e8fec97d4c slightly more efficient dataloader that reduces the number of python objects flying around and causing strain on runtime and garbage collector 2026-02-02 01:17:30 +00:00
Sofie Van Landeghem 4d6415b8ef use _PEAK_FLOPS_TABLE instead of if-else structure (#479) 2026-01-31 19:45:06 -08:00
Sofie Van Landeghem 43078c347e clean up original tokenizing_distributed_data_loader (#478) 2026-01-31 19:44:12 -08:00
Franci Penov dc291c627f Add Blackwell (SM100) GPU support via SDPA fallback (#475) 2026-01-31 19:42:58 -08:00
Andrej Karpathy 3ba42e8135 Fix SDPA KV-cache decode to respect sliding window (#456)
SDPA fallback now respects sliding window during single-token KV-cache
decode by slicing K/V to the last (window + 1) tokens.

Also simplifies the mask building for chunk inference to properly apply
sliding window in that path as well.

Fixes #452

Co-Authored-By: Kartik Vashishta <kartikv776@gmail.com>
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-30 17:32:12 +00:00
Harsh Gupta 2e17723817 Fix generate() crash when top_k=0 (#467)
Prevent a crash in generate() by skipping top-k filtering when top_k is set to 0
2026-01-30 09:21:02 -08:00
Andrej Karpathy d6c4f3b923 i think this is the new torch 2.9+ API for declaring tf32 preference 2026-01-30 17:03:15 +00:00
Andrej Karpathy 6a341f2ecf contiguous views and single HtoD transfer for inputs/targets much cleaner 2026-01-30 00:23:01 +00:00
Andrej Karpathy 41bb2eac32 Combine AdamW and Muon into single MuonAdamW optimizer, cleaner, ty @chrisjmccormick for idea/help 2026-01-29 00:52:08 +00:00
Andrej Karpathy 74554be3b5 revert engram, not seeing an improvement at larger scale 2026-01-28 20:07:39 +00:00
Andrej Karpathy c8d93beed2 add engram-lite, add log, tune scaling laws analysis scripts 2026-01-27 22:31:17 +00:00
Andrej Karpathy 59e36cc727 first version of engram following modded nanogpt style 2026-01-25 18:59:51 +00:00
Andrej Karpathy 85b3e95e09 320 experiments just to tune the adam beta1 of x0 a little bit up from 0.8 to 0.96 2026-01-25 00:04:02 +00:00
xiayan0118 6a477eedbd fix: pass device_type to compute_init in engine.__main__ (#451)
When running engine.py directly on non-GPU devices (CPU, MPS),
compute_init() needs the device_type parameter to initialize correctly.
This fixes failures on machines without CUDA support.
2026-01-19 17:19:51 -08:00
Andrej Karpathy a91743c168 Merge branch 've' 2026-01-18 15:14:39 +00:00
Andrej Karpathy e7ed2082b8 update the default GPTConfig kwargs otherwise they are confusing 2026-01-17 21:16:46 +00:00
karpathy f9a7e0f111 update the CPU/MPS script to give reasonable results. The model can at least answer that Paris is the capital of France and knows that the sky is blue, for about 40 minutes of training on my macbook. Also fixed a bug that existed due to KVCache bfloat16 dtype assumption 2026-01-17 12:27:30 -08:00
Andrej Karpathy f5425245f9 more GPU types from PR 147 thanks @Qubitium 2026-01-17 03:22:20 +00:00
Andrej Karpathy 2955650327 add detection of device to report more correct mfu for bf16 2026-01-17 03:16:14 +00:00
Yamahammer e1dafc510f Reduce token waste in BOS bestfit by cropping shortest doc (#445)
When no document fits the remaining row space, crop the shortest
document in the buffer instead of the first. This minimizes
discarded tokens.

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-16 18:50:34 -08:00
Andrej Karpathy e85db6b4a4 alternating design 2026-01-16 23:52:12 +00:00
Andrej Karpathy 9a88194c3f simply one VE per layer, works best 2026-01-16 22:08:52 +00:00
Andrej Karpathy 0b58d70e99 full ve version works very well 2026-01-16 21:16:47 +00:00
Andrej Karpathy e3f58b838e ranked version 2026-01-16 20:59:42 +00:00
Andrej Karpathy b62a5bc44a naturally i failed to include the actual code in the previous commit facepalm 2026-01-16 17:39:41 +00:00
Andrej Karpathy 8203efa919 implement flash attention 3 fallback to pytorch sdpa by touching as few lines of code as possible in main files and keeping all implementation to a single file. add tests. add helpful warning messages for the user. 2026-01-16 17:37:51 +00:00
Andrej Karpathy 22a71aa3d3 fuse adamw into a single torch compiled kernel similar to muon. it's about 1.7X faster, but overall it's so tiny that it's not making a major dent 2026-01-15 23:30:44 +00:00
Andrej Karpathy 6bb92403d5 changes and optimizations to muon, making it more efficient and simpler/cleaner a bit 2026-01-15 03:20:48 +00:00
Andrej Karpathy 3142ca1a28 minor helpful message 2026-01-15 03:20:21 +00:00
Andrej Karpathy 3b50b77ed3 fix base_loss to report correct loss by switching the dataloader to the new default 2026-01-13 22:09:36 +00:00
Andrej Karpathy 43c29dd9d5 Big DataLoader refactor: BOS-aligned dataloaders with epoch tracking for pre/mid-training
The new DataLoader ensures that every token sequence in train/val batches has a BOS token
at the beginning. Therefore, no token streams start abruptly in the middle of a document,
which could be confusing for the model. Note that this changes the loss scale because there
are fewer confusing tokens in the train/val batches. The main downside is that we now waste
about 35% of tokens due to cropping. This is ok because we have a lot of data. See dev/LOG.md
entry for this change for a lot more information.
2026-01-13 20:05:47 +00:00
Andrej Karpathy 23985413aa adjust the comment on the regex pattern per recent experimnet see dev/LOG.md 2026-01-13 17:50:39 +00:00
Andrej Karpathy 21608ec51e allow base_loss to report the loss of any arbitrary huggingface model similar to base_eval. had to change dataloader to be a lot better and just take tokenizer, not load the nanochat one. much better this way anyway 2026-01-12 03:10:13 +00:00
Andrej Karpathy fbc1484e8c add alternating window size patterns for the GPT layers, following GPT-3. Experimented a bit and found the pattern SSSL to work well - 3 short, 1 long alternating. This is now the new default and the plots look quite a bit better on flops vs. bpb 2026-01-11 21:49:54 +00:00
Andrej Karpathy 2ff7d51252 integrate Flash Attention 3. +9% tok_per_sec for d12 with ctx even as low as 2048 out of the box nice. also, ready to tune windows huge 2026-01-11 20:33:19 +00:00
Andrej Karpathy 201d705957 recover the ability to load old checkpoints by patching the lambdas if they don't exist in checkpoints 2026-01-11 20:13:12 +00:00
Andrej Karpathy aa530cdad5 Add learnable lambdas that gate the residual connection and a skip connection to the input embeddings, solid bump to val_bpb 2026-01-11 18:47:35 +00:00
Andrej Karpathy 2c4473dd1b Big Muon optimizer changes inspired by latest of modded-nanogpt. Added Polar Express, Adafactor-style variance reduction, cautious weight decay, schedule weight decay linearly to ramp down to zero. Tuned optimum weight decay for multiple model sizes d8, d12, d16, d20 and found a scaling law with optimum wd \propto 1/channels^2, including it as default into code. --weight_decay of base_train is now default on and configured optimally according to all of these experiments. Solid bump to val_bpb observed as a result of these changes. 2026-01-11 16:56:59 +00:00
Andrej Karpathy 4ddc803797 fix adamw slight bug. this chunk was copy pasted originally from modded-nanogpt, which still seems to have the bug 2026-01-08 18:18:42 +00:00
Andrej Karpathy ccf4b7f9bf nudge hyperparameters of the base script with the results of the sweeps and miniseries. vocab size down to 32K. D:N ratio from 20 to 8. add miniseries script 2026-01-07 22:11:59 +00:00
Andrej Karpathy 962b6bfba3 alright add transformers as a dep of the repo because it should be easy to evaluate the CORE score of HF models. Not super happy about it but i tried it and the uv.lock doesn't get bloated as much as i expected 2026-01-04 20:37:28 +00:00
Andrej Karpathy eb7bbc1b66 delete the configurator in favor of argparse and clean up a lot of kwarg details to make them more consistent across all scripts 2026-01-04 19:14:23 +00:00
Andrej Karpathy 507d54224a fix small bug where this would break if git stage has deleted files 2026-01-04 19:11:43 +00:00
Andrej Karpathy be56d29b87 simplify redundant if/elif in bloat metrics
🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-04 01:40:42 +00:00
Andrej Karpathy ee79f29fbd replace files-to-prompt with git ls-files for bloat metrics
files-to-prompt was including untracked files (knowledge/, dev scripts, etc.) which inflated the bloat metrics. now we use git ls-files to only count tracked source files, which is more accurate and removes an external dependency.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-04 01:38:15 +00:00