Franci Penov
dc291c627f
Add Blackwell (SM100) GPU support via SDPA fallback (#475)
2026-01-31 19:42:58 -08:00
Andrej Karpathy
3ba42e8135
Fix SDPA KV-cache decode to respect sliding window (#456)
...
SDPA fallback now respects sliding window during single-token KV-cache
decode by slicing K/V to the last (window + 1) tokens.
Also simplifies the mask building for chunk inference to properly apply
sliding window in that path as well.
Fixes #452
Co-Authored-By: Kartik Vashishta <kartikv776@gmail.com>
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-30 17:32:12 +00:00
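
A minimal sketch of the K/V slicing idea described in the commit above, using plain Python lists in place of tensors. The helper name `slice_kv_for_decode` and the convention that a negative window means "no limit" are assumptions made for illustration; they are not taken from the repository.

```python
def slice_kv_for_decode(k_cache, v_cache, window):
    """Keep only the keys/values that a single new query token may attend to.

    k_cache / v_cache: per-position lists, newest token last.
    window: number of earlier tokens visible in addition to the current one.
            A negative value is treated as "no limit" (an assumption here).
    """
    if window < 0 or len(k_cache) <= window + 1:
        # Nothing to trim: the whole cache is inside the window.
        return k_cache, v_cache
    keep = window + 1  # current token plus `window` tokens before it
    return k_cache[-keep:], v_cache[-keep:]


# Hypothetical cache of 10 positions and a window of 3 previous tokens.
k = list(range(10))
v = [x * 10 for x in k]
k_win, v_win = slice_kv_for_decode(k, v, window=3)
```

With a window of 3, only the last four cache positions remain, which matches the "last (window + 1) tokens" wording in the commit body.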
Harsh Gupta
2e17723817
Fix generate() crash when top_k=0 (#467)
...
Prevent a crash in generate() by skipping top-k filtering when top_k is set to 0
2026-01-30 09:21:02 -08:00
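
The guard described in the commit above could look roughly like the following pure-Python sketch. The function name `apply_top_k` is hypothetical, and treating `None` or negative values as "disabled" in addition to 0 is an assumption, not something the commit states.

```python
import math


def apply_top_k(logits, top_k):
    """Mask everything except the top_k largest logits with -inf.

    Filtering is skipped when top_k is None or <= 0 (only the 0 case
    comes from the commit; the rest is an assumption of this sketch).
    """
    if top_k is None or top_k <= 0:
        return list(logits)
    k = min(top_k, len(logits))  # never ask for more entries than exist
    threshold = sorted(logits, reverse=True)[k - 1]
    # Ties at the threshold are all kept, so more than k entries may survive.
    return [x if x >= threshold else -math.inf for x in logits]


unfiltered = apply_top_k([1.0, 3.0, 2.0], top_k=0)
filtered = apply_top_k([1.0, 3.0, 2.0], top_k=2)
```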
Andrej Karpathy
d6c4f3b923
i think this is the new torch 2.9+ API for declaring tf32 preference
2026-01-30 17:03:15 +00:00
Andrej Karpathy
6a341f2ecf
contiguous views and single HtoD transfer for inputs/targets much cleaner
2026-01-30 00:23:01 +00:00
Andrej Karpathy
41bb2eac32
Combine AdamW and Muon into single MuonAdamW optimizer, cleaner, ty @chrisjmccormick for idea/help
2026-01-29 00:52:08 +00:00
Andrej Karpathy
74554be3b5
revert engram, not seeing an improvement at larger scale
2026-01-28 20:07:39 +00:00
Andrej Karpathy
c8d93beed2
add engram-lite, add log, tune scaling laws analysis scripts
2026-01-27 22:31:17 +00:00
Andrej Karpathy
59e36cc727
first version of engram following modded nanogpt style
2026-01-25 18:59:51 +00:00
Andrej Karpathy
85b3e95e09
320 experiments just to tune the adam beta1 of x0 a little bit up from 0.8 to 0.96
2026-01-25 00:04:02 +00:00
xiayan0118
6a477eedbd
fix: pass device_type to compute_init in engine.__main__ (#451)
...
When running engine.py directly on non-GPU devices (CPU, MPS),
compute_init() needs the device_type parameter to initialize correctly.
This fixes failures on machines without CUDA support.
2026-01-19 17:19:51 -08:00
Andrej Karpathy
a91743c168
Merge branch 've'
2026-01-18 15:14:39 +00:00
Andrej Karpathy
e7ed2082b8
update the default GPTConfig kwargs otherwise they are confusing
2026-01-17 21:16:46 +00:00
karpathy
f9a7e0f111
update the CPU/MPS script to give reasonable results. The model can at least answer that Paris is the capital of France and knows that the sky is blue, for about 40 minutes of training on my macbook. Also fixed a bug that existed due to KVCache bfloat16 dtype assumption
2026-01-17 12:27:30 -08:00
Andrej Karpathy
f5425245f9
more GPU types from PR 147 thanks @Qubitium
2026-01-17 03:22:20 +00:00
Andrej Karpathy
2955650327
add detection of device to report more correct mfu for bf16
2026-01-17 03:16:14 +00:00
Yamahammer
e1dafc510f
Reduce token waste in BOS bestfit by cropping shortest doc (#445)
...
When no document fits the remaining row space, crop the shortest
document in the buffer instead of the first. This minimizes
discarded tokens.
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-16 18:50:34 -08:00
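
The packing rule from the commit above can be illustrated with a small sketch. It assumes documents are plain lists of token ids and that "best fit" means picking the longest buffered document that still fits; the function name `fill_row` and the omission of BOS handling are simplifications made here, not the repository's code.

```python
def fill_row(buffer, row_len):
    """Pack documents from `buffer` into one row of at most `row_len` tokens.

    Returns (row, discarded) and removes the documents it used from `buffer`.
    """
    row, discarded = [], 0
    while len(row) < row_len and buffer:
        space = row_len - len(row)
        fitting = [d for d in buffer if len(d) <= space]
        if fitting:
            doc = max(fitting, key=len)  # best fit: longest doc that still fits
            buffer.remove(doc)
            row.extend(doc)
        else:
            doc = min(buffer, key=len)   # nothing fits: crop the shortest doc
            buffer.remove(doc)
            row.extend(doc[:space])
            discarded += len(doc) - space
    return row, discarded


# Hypothetical buffer: three documents of length 7, 5 and 4, row of 8 tokens.
docs = [[1] * 7, [2] * 5, [3] * 4]
row, discarded = fill_row(docs, row_len=8)
```

In this example the length-7 document is placed first, leaving one free slot. Cropping the shortest remaining document (length 4) discards 3 tokens, whereas cropping the first remaining one (length 5) would discard 4.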
Andrej Karpathy
e85db6b4a4
alternating design
2026-01-16 23:52:12 +00:00
Andrej Karpathy
9a88194c3f
simply one VE per layer, works best
2026-01-16 22:08:52 +00:00
Andrej Karpathy
0b58d70e99
full ve version works very well
2026-01-16 21:16:47 +00:00
Andrej Karpathy
e3f58b838e
ranked version
2026-01-16 20:59:42 +00:00
Andrej Karpathy
b62a5bc44a
naturally i failed to include the actual code in the previous commit facepalm
2026-01-16 17:39:41 +00:00
Andrej Karpathy
8203efa919
implement flash attention 3 fallback to pytorch sdpa by touching as few lines of code as possible in main files and keeping all implementation to a single file. add tests. add helpful warning messages for the user.
2026-01-16 17:37:51 +00:00
Andrej Karpathy
22a71aa3d3
fuse adamw into a single torch compiled kernel similar to muon. it's about 1.7X faster, but overall it's so tiny that it's not making a major dent
2026-01-15 23:30:44 +00:00
Andrej Karpathy
6bb92403d5
changes and optimizations to muon, making it more efficient and simpler/cleaner a bit
2026-01-15 03:20:48 +00:00
Andrej Karpathy
3142ca1a28
minor helpful message
2026-01-15 03:20:21 +00:00
Andrej Karpathy
3b50b77ed3
fix base_loss to report correct loss by switching the dataloader to the new default
2026-01-13 22:09:36 +00:00
Andrej Karpathy
43c29dd9d5
Big DataLoader refactor: BOS-aligned dataloaders with epoch tracking for pre/mid-training
...
The new DataLoader ensures that every token sequence in train/val batches has a BOS token
at the beginning. Therefore, no token streams start abruptly in the middle of a document,
which could be confusing for the model. Note that this changes the loss scale because there
are fewer confusing tokens in the train/val batches. The main downside is that we now waste
about 35% of tokens due to cropping. This is ok because we have a lot of data. See dev/LOG.md
entry for this change for a lot more information.
2026-01-13 20:05:47 +00:00
Andrej Karpathy
23985413aa
adjust the comment on the regex pattern per recent experiment, see dev/LOG.md
2026-01-13 17:50:39 +00:00
Andrej Karpathy
21608ec51e
allow base_loss to report the loss of any arbitrary huggingface model similar to base_eval. had to change dataloader to be a lot better and just take tokenizer, not load the nanochat one. much better this way anyway
2026-01-12 03:10:13 +00:00
Andrej Karpathy
fbc1484e8c
add alternating window size patterns for the GPT layers, following GPT-3. Experimented a bit and found the pattern SSSL to work well - 3 short, 1 long alternating. This is now the new default and the plots look quite a bit better on flops vs. bpb
2026-01-11 21:49:54 +00:00
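
One way to read the SSSL pattern from the commit above is as a string that is tiled across the layers, where `S` selects a short attention window and `L` a long one. The sketch below assumes that reading; the function name `layer_windows` and the concrete window sizes are hypothetical.

```python
def layer_windows(pattern, n_layers, short, long):
    """Expand a pattern such as "SSSL" into one window size per layer."""
    sizes = {"S": short, "L": long}
    # Tile the pattern over the layers: layer i uses pattern[i mod len(pattern)].
    return [sizes[pattern[i % len(pattern)]] for i in range(n_layers)]


# Hypothetical sizes: 512-token short windows, 2048-token long windows.
windows = layer_windows("SSSL", n_layers=8, short=512, long=2048)
```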
Andrej Karpathy
2ff7d51252
integrate Flash Attention 3. +9% tok_per_sec for d12 with ctx even as low as 2048 out of the box nice. also, ready to tune windows huge
2026-01-11 20:33:19 +00:00
Andrej Karpathy
201d705957
recover the ability to load old checkpoints by patching the lambdas if they don't exist in checkpoints
2026-01-11 20:13:12 +00:00
Andrej Karpathy
aa530cdad5
Add learnable lambdas that gate the residual connection and a skip connection to the input embeddings, solid bump to val_bpb
2026-01-11 18:47:35 +00:00
Andrej Karpathy
2c4473dd1b
Big Muon optimizer changes inspired by latest of modded-nanogpt. Added Polar Express, Adafactor-style variance reduction, cautious weight decay, schedule weight decay linearly to ramp down to zero. Tuned optimum weight decay for multiple model sizes d8, d12, d16, d20 and found a scaling law with optimum wd \propto 1/channels^2, including it as default into code. --weight_decay of base_train is now default on and configured optimally according to all of these experiments. Solid bump to val_bpb observed as a result of these changes.
2026-01-11 16:56:59 +00:00
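
Two of the ideas in the commit above, the scaling of weight decay with 1/channels^2 and the linear ramp of weight decay down to zero, can be sketched as follows. The reference values and both function names are placeholders chosen for illustration; the tuned constants used in the repository are not given in this log.

```python
def scaled_weight_decay(ref_wd, ref_channels, channels):
    """Scale a reference weight decay so that wd is proportional to 1 / channels**2."""
    return ref_wd * (ref_channels / channels) ** 2


def weight_decay_at(step, total_steps, wd_max):
    """Linearly ramp weight decay from wd_max at step 0 down to 0 at total_steps."""
    frac = min(max(step / total_steps, 0.0), 1.0)  # clamp progress to [0, 1]
    return wd_max * (1.0 - frac)


# Placeholder numbers: doubling the channel count quarters the weight decay.
wd_small = scaled_weight_decay(ref_wd=0.1, ref_channels=512, channels=512)
wd_large = scaled_weight_decay(ref_wd=0.1, ref_channels=512, channels=1024)
wd_mid_run = weight_decay_at(step=50, total_steps=100, wd_max=wd_small)
```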
Andrej Karpathy
4ddc803797
fix adamw slight bug. this chunk was copy pasted originally from modded-nanogpt, which still seems to have the bug
2026-01-08 18:18:42 +00:00
Andrej Karpathy
ccf4b7f9bf
nudge hyperparameters of the base script with the results of the sweeps and miniseries. vocab size down to 32K. D:N ratio from 20 to 8. add miniseries script
2026-01-07 22:11:59 +00:00
Andrej Karpathy
962b6bfba3
alright add transformers as a dep of the repo because it should be easy to evaluate the CORE score of HF models. Not super happy about it but i tried it and the uv.lock doesn't get bloated as much as i expected
2026-01-04 20:37:28 +00:00
Andrej Karpathy
eb7bbc1b66
delete the configurator in favor of argparse and clean up a lot of kwarg details to make them more consistent across all scripts
2026-01-04 19:14:23 +00:00
Andrej Karpathy
507d54224a
fix small bug where this would break if git stage has deleted files
2026-01-04 19:11:43 +00:00
Andrej Karpathy
be56d29b87
simplify redundant if/elif in bloat metrics
...
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-04 01:40:42 +00:00
Andrej Karpathy
ee79f29fbd
replace files-to-prompt with git ls-files for bloat metrics
...
files-to-prompt was including untracked files (knowledge/, dev scripts, etc.) which inflated the bloat metrics. now we use git ls-files to only count tracked source files, which is more accurate and removes an external dependency.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-04 01:38:15 +00:00
Andrej Karpathy
48abd7d85f
simplify, clarify and slightly tune model initialization. should be very slightly better possibly, but certainly a lot clearer
2026-01-01 21:15:09 +00:00
Paweł Krefta
10231dfb40
Fix conversation scroll to bottom on some browsers + remove duplicated padding (#348)
2025-12-31 13:03:22 -08:00
Andrej Karpathy
8f979a8bda
fix: sample first token independently for each row in multi-sample generation
...
Previously, when generating multiple samples (num_samples > 1), the first
token after prefill was sampled once and broadcast to all rows, causing
all samples to start identically. Now the prefill logits are expanded to
num_samples and sampled independently for each row.
Also simplified the generation loop by moving the forward pass to the end
of the loop, eliminating the first_iteration flag and if/else branching.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-28 04:52:13 +00:00
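
The fix described in the commit above amounts to expanding the prefill logits to one row per sample and drawing from each row separately. A pure-Python sketch of that idea, with hypothetical helper names and a softmax sampler written out by hand, might look like this:

```python
import math
import random


def sample_from_logits(logits, rng):
    """Draw one token id from softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(x - m) for x in logits]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


def sample_first_tokens(prefill_logits, num_samples, rng):
    """Expand the prefill logits to num_samples rows and sample each row independently."""
    expanded = [prefill_logits] * num_samples
    return [sample_from_logits(row, rng) for row in expanded]


# Hypothetical setup: a uniform distribution over 50 tokens, 8 samples.
rng = random.Random(0)
first_tokens = sample_first_tokens([0.0] * 50, num_samples=8, rng=rng)
```

Sampling once and broadcasting the result would give every row the same first token. Drawing per row lets the samples diverge from the very first generated position.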
Dipesh Babu
2f2d7ab80c
fix: safe DDP cleanup (check initialized PG, not just env) (#256)
2025-12-27 20:27:40 -08:00
Andrej Karpathy
e1770a3061
remove spurious cast, gets compiled away anyway but it's confusing people
2025-12-27 23:07:48 +00:00
Andrej Karpathy
49389ecaa8
fix tf32 warning for deprecated api use
2025-12-27 22:03:06 +00:00
Matěj Kripner
d314e96aa2
formatting
2025-12-09 12:48:46 +01:00
Matěj Kripner
bbc57da7d5
slightly nicer error message
2025-12-09 12:46:48 +01:00