b19b4f3e49
fix bug in speedrun script, batch size that doesn't OOM on 8XH100 for d24 is 16
Andrej Karpathy
2026-02-02 15:50:14 +00:00
230d6cf6c6
tune the synthetic data generation script. delete the king andrej stuff lol. also, upgrade to gemini 3
Andrej Karpathy
2026-02-02 01:45:59 +00:00
07c4dd4cd9
manually control the over-active garbage collector, save a small few minutes from a typical run
Andrej Karpathy
2026-02-02 01:44:30 +00:00
e8fec97d4c
slightly more efficient dataloader that reduces the number of python objects flying around and causing strain on runtime and garbage collector
Andrej Karpathy
2026-02-02 01:17:30 +00:00
8b4849d548
fix bug in chat_sft, the attention window must be preserved sigh
Andrej Karpathy
2026-02-01 20:58:44 +00:00
eaf49a33c8
fix path which i think was modified during the refactor and this is a bug introduced by claude i believe
Andrej Karpathy
2026-02-01 20:15:19 +00:00
31b61d2d17
fix broken import sigh
Andrej Karpathy
2026-02-01 05:03:44 +00:00
4d6415b8ef
use _PEAK_FLOPS_TABLE instead of if-else structure (#479)
Sofie Van Landeghem
2026-02-01 04:45:06 +01:00
43078c347e
clean up original tokenizing_distributed_data_loader (#478)
Sofie Van Landeghem
2026-02-01 04:44:12 +01:00
dc291c627f
Add Blackwell (SM100) GPU support via SDPA fallback (#475)
Franci Penov
2026-01-31 19:42:58 -08:00
0307997f9b
merge two files base_loss and base_eval into a single file, it's nicer this way, and unify the huggingface code associated with both
Andrej Karpathy
2026-02-01 02:36:43 +00:00
1ddaad1c1c
nuke midtraining from orbit, it's not as needed now that we have a BOS-aligned dataloader. Also change the README a lot. midtraining is not yet fully properly erased across the board, but good enough for step 1
Andrej Karpathy
2026-01-31 19:12:25 +00:00
348fbb301b
fix dataloader for midtrain to never crop data. we can't just throw it away like we do in pretraining
Andrej Karpathy
2026-01-31 18:21:36 +00:00
3c3a3d7042
warmdown of 0.5 is slightly better
Andrej Karpathy
2026-01-31 01:08:44 +00:00
4d8dbaf6e0
Fix escape character in README bibtex entry (#454)
Andrei Panferov
2026-01-30 18:34:02 +01:00
3ba42e8135
Fix SDPA KV-cache decode to respect sliding window (#456)
Andrej Karpathy
2026-01-30 17:32:12 +00:00
ace6740bdd
feat: allow top_k=0 in web api to disable filtering (#458)
Aarushi Singh
2026-01-30 22:51:41 +05:30
02baa15405
i am feeling in a delete mood today. i need to delete a lot of code. there is too much code and surface area and complexity. ew
Andrej Karpathy
2026-01-30 17:08:53 +00:00
d6c4f3b923
i think this is the new torch 2.9+ API for declaring tf32 preference
Andrej Karpathy
2026-01-30 17:03:15 +00:00
067daa7758
small fix cpu script ty PR #474
Andrej Karpathy
2026-01-30 02:11:25 +00:00
6a341f2ecf
contiguous views and single HtoD transfer for inputs/targets much cleaner
Andrej Karpathy
2026-01-30 00:23:01 +00:00
ebd4d9bbf5
tried muonh, appealing but didn't work out of the box
Andrej Karpathy
2026-01-29 19:01:36 +00:00
41bb2eac32
Combine AdamW and Muon into single MuonAdamW optimizer, cleaner, ty @chrisjmccormick for idea/help
Andrej Karpathy
2026-01-29 00:50:50 +00:00
64a651a63c
include .claude is ok
Andrej Karpathy
2026-01-29 00:35:02 +00:00
65df0de42b
add arxiv reading skill
Andrej Karpathy
2026-01-29 00:34:24 +00:00
74554be3b5
revert engram, not seeing an improvement at larger scale
Andrej Karpathy
2026-01-28 20:07:39 +00:00
d5418ea5a1
Fix link to DeepSeek Engram paper (#470)
Sofie Van Landeghem
2026-01-28 17:31:44 +01:00
c88bbf8133
Merge branch 'engram'
Andrej Karpathy
2026-01-27 22:33:16 +00:00
8630d32be4
quick fix to not OOM main speedrun script
Andrej Karpathy
2026-01-26 22:31:42 +00:00
59e36cc727
first version of engram following modded nanogpt style
Andrej Karpathy
2026-01-25 18:59:51 +00:00
85b3e95e09
320 experiments just to tune the adam beta1 of x0 a little bit up from 0.8 to 0.96
Andrej Karpathy
2026-01-25 00:03:55 +00:00
6a477eedbd
fix: pass device_type to compute_init in engine.__main__ (#451)
xiayan0118
2026-01-19 17:19:51 -08:00
63bb5831e2
something i've wanted to do for a while - move all .sh runs to their own directory so they don't pollute root dir
Andrej Karpathy
2026-01-18 15:27:41 +00:00
a91743c168
Merge branch 've'
Andrej Karpathy
2026-01-18 15:14:39 +00:00
d58fcd9d73
log for jan 17
Andrej Karpathy
2026-01-18 03:01:13 +00:00
babde18ce1
small tweaks
Andrej Karpathy
2026-01-18 03:00:38 +00:00
cf5c9e5b8e
resolve a crash for odd depths because FA3 needs head_dim % 8 == 0
Andrej Karpathy
2026-01-18 00:07:08 +00:00
413e91aa0f
optimal ratio is now around 4
Andrej Karpathy
2026-01-17 23:51:09 +00:00
e7ed2082b8
update the default GPTConfig kwargs otherwise they are confusing
Andrej Karpathy
2026-01-17 21:16:46 +00:00
f9a7e0f111
update the CPU/MPS script to give reasonable results. The model can at least answer that Paris is the capital of France and knows that the sky is blue, for about 40 minutes of training on my macbook. Also fixed a bug that existed due to KVCache bfloat16 dtype assumption
karpathy
2026-01-17 12:27:30 -08:00
f5425245f9
more GPU types from PR 147 thanks @Qubitium
Andrej Karpathy
2026-01-17 03:22:20 +00:00
2955650327
add detection of device to report more correct mfu for bf16
Andrej Karpathy
2026-01-17 03:16:12 +00:00
e1dafc510f
Reduce token waste in BOS bestfit by cropping shortest doc (#445)
Yamahammer
2026-01-16 21:50:34 -05:00
6460dc6382
tweaks to readme a bit
Andrej Karpathy
2026-01-17 02:28:31 +00:00
1933e85046
brief update to log
Andrej Karpathy
2026-01-17 00:25:50 +00:00
3b95d4fd39
allow label for scaling laws script
Andrej Karpathy
2026-01-17 00:23:30 +00:00
e85db6b4a4
alternating design
Andrej Karpathy
2026-01-16 23:52:12 +00:00
9a88194c3f
simply one VE per layer, works best
Andrej Karpathy
2026-01-16 22:08:52 +00:00
0b58d70e99
full ve version works very well
Andrej Karpathy
2026-01-16 21:16:47 +00:00
e3f58b838e
ranked version
Andrej Karpathy
2026-01-16 20:59:42 +00:00
184d4c12b1
also add to log about the FA3 changes
Andrej Karpathy
2026-01-16 18:25:04 +00:00
b62a5bc44a
naturally i failed to include the actual code in the previous commit facepalm
Andrej Karpathy
2026-01-16 17:39:41 +00:00
8203efa919
implement flash attention 3 fallback to pytorch sdpa by touching as few lines of code as possible in main files and keeping all implementation to a single file. add tests. add helpful warning messages for the user.
Andrej Karpathy
2026-01-16 17:37:51 +00:00
50413d2d67
typo in comments: change "GAPO" to "DAPO"
Haoyu Wang
2026-01-16 01:03:42 -05:00
fbf2bbea25
update log with a bunch of attempts
Andrej Karpathy
2026-01-16 02:21:17 +00:00
747ed4491f
add negative result on olmo3 pretraining mix
Andrej Karpathy
2026-01-16 00:43:54 +00:00
7d1700c521
add zstd lib
Andrej Karpathy
2026-01-16 00:40:59 +00:00
d4ea28d4e2
Fix args in readme (#438)
Sofie Van Landeghem
2026-01-16 01:26:38 +01:00
bdcc030ffa
oops legacy spurious line now
Andrej Karpathy
2026-01-15 23:32:20 +00:00
22a71aa3d3
fuse adamw into a single torch compiled kernel similar to muon. it's about 1.7X faster, but overall it's so tiny that it's not making a major dent
Andrej Karpathy
2026-01-15 23:30:44 +00:00
255f8b9af6
cleanly separate cpu and gpu sections
Andrej Karpathy
2026-01-15 23:30:11 +00:00
6bb92403d5
changes and optimizations to muon, making it more efficient and simpler/cleaner a bit
Andrej Karpathy
2026-01-15 03:20:48 +00:00
3142ca1a28
minor helpful message
Andrej Karpathy
2026-01-15 03:20:21 +00:00
7312ec9898
fix buggy midtrain and update all kwargs to be idiomatic. that is, argparse uses dashes, variables use underscores. the underscores are just a remnant of the previous Configurator object. This is the right way
Andrej Karpathy
2026-01-13 22:45:27 +00:00
3b50b77ed3
fix base_loss to report correct loss by switching the dataloader to the new default
Andrej Karpathy
2026-01-13 22:09:36 +00:00
f92efce169
add negative result about not allowing attention across BOS tokens. A lot more code complexity for basically no gain in performance
Andrej Karpathy
2026-01-13 21:33:54 +00:00
43c29dd9d5
Big DataLoader refactor: BOS-aligned dataloaders with epoch tracking for pre/mid-training
Andrej Karpathy
2026-01-13 20:05:47 +00:00
23985413aa
adjust the comment on the regex pattern per recent experiment, see dev/LOG.md
Andrej Karpathy
2026-01-13 17:50:39 +00:00
64b48d0e5c
validated that \p{N}{1,2} is the correct number of digits to group up to in the regex pattern of the GPT-4 tokenizer (2 down from 3), leading to the best val_bpb for 32K vocabs
Andrej Karpathy
2026-01-13 17:45:06 +00:00
238353c998
document my struggle with fp8 integration yesterday, it's not working like i thought it would and i suffered. one day i will return to continue the fight.
Andrej Karpathy
2026-01-13 17:14:29 +00:00
4610a838a1
record negative result on MTP
Andrej Karpathy
2026-01-12 05:23:47 +00:00
21608ec51e
allow base_loss to report the loss of any arbitrary huggingface model similar to base_eval. had to change dataloader to be a lot better and just take tokenizer, not load the nanochat one. much better this way anyway
Andrej Karpathy
2026-01-12 03:10:13 +00:00
aa95fb2e03
make miniseries more generic and easier to run and less hard coded
Andrej Karpathy
2026-01-12 02:54:35 +00:00
b33e394528
oops actually make SSSL the default window pattern
Andrej Karpathy
2026-01-11 21:50:35 +00:00
fbc1484e8c
add alternating window size patterns for the GPT layers, following GPT-3. Experimented a bit and found the pattern SSSL to work well - 3 short, 1 long alternating. This is now the new default and the plots look quite a bit better on flops vs. bpb
Andrej Karpathy
2026-01-11 21:49:54 +00:00
2ff7d51252
integrate Flash Attention 3. +9% tok_per_sec for d12 with ctx even as low as 2048 out of the box nice. also, ready to tune windows huge
Andrej Karpathy
2026-01-11 20:33:19 +00:00
201d705957
recover the ability to load old checkpoints by patching the lambdas if they don't exist in checkpoints
Andrej Karpathy
2026-01-11 20:13:12 +00:00
aa530cdad5
Add learnable lambdas that gate the residual connection and a skip connection to the input embeddings, solid bump to val_bpb
Andrej Karpathy
2026-01-11 18:47:35 +00:00
2c4473dd1b
Big Muon optimizer changes inspired by latest of modded-nanogpt. Added Polar Express, Adafactor-style variance reduction, cautious weight decay, schedule weight decay linearly to ramp down to zero. Tuned optimum weight decay for multiple model sizes d8, d12, d16, d20 and found a scaling law with optimum wd \propto 1/channels^2, including it as default into code. --weight_decay of base_train is now default on and configured optimally according to all of these experiments. Solid bump to val_bpb observed as a result of these changes.
Andrej Karpathy
2026-01-11 16:56:59 +00:00
f5a0ea4d3f
take out these gitignore dirs
Andrej Karpathy
2026-01-08 18:18:39 +00:00
4ddc803797
fix adamw slight bug. this chunk was copy pasted originally from modded-nanogpt, which still seems to have the bug
Andrej Karpathy
2026-01-08 18:18:22 +00:00
a1ccb3dc0b
remove rust compilation as rustbpe is now installed from separate package (#416)
Sofie Van Landeghem
2026-01-08 15:18:37 +01:00
061f83c152
delete grad_clip. appears to not be necessary at all. not only was it buggy because the clipping happened per gpu before grad synchronization, but it costs ~2% MFU, and it also doesn't even help. I tried deleting it a while ago and back then it did help. So I'm guessing that some hyperparameter tuning obviated the reason for it since then
Andrej Karpathy
2026-01-08 02:16:50 +00:00
e8c30c3b19
add notebook used for scaling laws analysis
Andrej Karpathy
2026-01-07 22:28:53 +00:00
3af4dcf6ee
also add scaling_laws.sh script if it's a useful reference
Andrej Karpathy
2026-01-07 22:25:13 +00:00
4cc605b940
quick pointer to miniseries post in readme for now
Andrej Karpathy
2026-01-07 22:14:21 +00:00
ccf4b7f9bf
nudge hyperparameters of the base script with the results of the sweeps and miniseries. vocab size down to 32K. D:N ratio from 20 to 8. add miniseries script
Andrej Karpathy
2026-01-07 22:11:52 +00:00
1b5de29e71
Fix undefined variable in chat_rl after recent refactor
Adria Blancafort
2026-01-07 18:08:57 +01:00
ae0bf52529
tune hyperparameters based on overnight sweeps. warmdown_ratio is the biggest free win, increasing 0.2 -> 0.4, and embedding lr can be larger bumping 0.2 -> 0.3
Andrej Karpathy
2026-01-05 18:57:46 +00:00
eec0c79563
also add matplotlib dep so that we can have jupyter notebooks
Andrej Karpathy
2026-01-05 18:41:09 +00:00
54e59c38ad
add notebook on deriving the CORE estimates for the GPT-3 miniseries.
Andrej Karpathy
2026-01-05 18:40:28 +00:00
9d4c9b786d
many small fixes to base_train: reporting ETA, allowing some additional kwarg flexibility, making sure we don't crash when e.g. depth = 11 - we now calculate the closest num_heads that works
Andrej Karpathy
2026-01-05 00:38:09 +00:00
962b6bfba3
alright add transformers as a dep of the repo because it should be easy to evaluate the CORE score of HF models. Not super happy about it but i tried it and the uv.lock doesn't get bloated as much as i expected
Andrej Karpathy
2026-01-04 20:37:28 +00:00
ed2082fbc4
sane secrets management
Andrej Karpathy
2026-01-04 19:29:22 +00:00