nanochat-omni

fam/nanochat-omni

Commit Graph

main

mochi/issue-4

pre-fork

#5

72b9064f9d remove leftover mid references (#491) Sofie Van Landeghem 2026-02-02 17:33:46 +01:00
b19b4f3e49 fix bug in speedrun script, batch size that doesn't OOM on 8XH100 for d24 is 16 Andrej Karpathy 2026-02-02 15:50:14 +00:00
230d6cf6c6 tune the synthetic data generation script. delete the king andrej stuff lol. also, upgrade to gemini 3 Andrej Karpathy 2026-02-02 01:45:59 +00:00
07c4dd4cd9 manually control the over-active garbage collector, save a small few minutes from a typical run Andrej Karpathy 2026-02-02 01:44:30 +00:00
e8fec97d4c slightly more efficient dataloader that reduces the number of python objects flying around and causing strain on runtime and garbage collector Andrej Karpathy 2026-02-02 01:17:30 +00:00
8b4849d548 fix bug in chat_sft, the attention window must be preserved sigh Andrej Karpathy 2026-02-01 20:58:44 +00:00
eaf49a33c8 fix path which i think was modified during the refactor and this is a bug introduced by claude i believe Andrej Karpathy 2026-02-01 20:15:19 +00:00
31b61d2d17 fix broken import sigh Andrej Karpathy 2026-02-01 05:03:44 +00:00
4d6415b8ef use _PEAK_FLOPS_TABLE instead of if-else structure (#479) Sofie Van Landeghem 2026-02-01 04:45:06 +01:00
43078c347e clean up original tokenizing_distributed_data_loader (#478) Sofie Van Landeghem 2026-02-01 04:44:12 +01:00
dc291c627f Add Blackwell (SM100) GPU support via SDPA fallback (#475) Franci Penov 2026-01-31 19:42:58 -08:00
0307997f9b merge two files base_loss and base_eval into a single file, it's nicer this way, and unify the huggingface code associated with both Andrej Karpathy 2026-02-01 02:36:43 +00:00
1ddaad1c1c nuke midtraining from orbit, it's not as needed now that we have a BOS-aligned dataloader. Also change the README a lot. midtrianing is not yet fully properly erased across the board, but good enough for step 1 Andrej Karpathy 2026-01-31 19:12:25 +00:00
348fbb301b fix dataloader for midtrain to never crop data. we can't just throw it away like we do in pretraining Andrej Karpathy 2026-01-31 18:21:36 +00:00
3c3a3d7042 warmdown of 0.5 is slightly better: Andrej Karpathy 2026-01-31 01:08:44 +00:00
4d8dbaf6e0 Fix escape character in README bibtex entry (#454) Andrei Panferov 2026-01-30 18:34:02 +01:00
3ba42e8135 Fix SDPA KV-cache decode to respect sliding window (#456) Andrej Karpathy 2026-01-30 17:32:12 +00:00
ace6740bdd feat: allow top_k=0 in web api to disable filtering (#458) Aarushi Singh 2026-01-30 22:51:41 +05:30
2e17723817 Fix generate() crash when top_k=0 (#467) Harsh Gupta 2026-01-30 22:51:02 +05:30
02baa15405 i am feeling in a delete mood today. i need to delete a lot of code. there is too much code and surface area and complexity. ew Andrej Karpathy 2026-01-30 17:08:53 +00:00
d6c4f3b923 i think this is the new torch 2.9+ API for declaring tf32 preference Andrej Karpathy 2026-01-30 17:03:15 +00:00
067daa7758 small fix cpu script ty PR #474 Andrej Karpathy 2026-01-30 02:11:25 +00:00
6a341f2ecf contiguous views and single HtoD transfer for inputs/targets much cleaner Andrej Karpathy 2026-01-30 00:23:01 +00:00
ebd4d9bbf5 tried muonh, appealing but didn't work out of the box Andrej Karpathy 2026-01-29 19:01:36 +00:00
41bb2eac32 Combine AdamW and Muon into single MuonAdamW optimizer, cleaner, ty @chrisjmccormick for idea/help Andrej Karpathy 2026-01-29 00:50:50 +00:00
64a651a63c include .claude is ok Andrej Karpathy 2026-01-29 00:35:02 +00:00
65df0de42b add arxiv reading skill Andrej Karpathy 2026-01-29 00:34:24 +00:00
74554be3b5 revert engram, not seeing an improvement at larger scale Andrej Karpathy 2026-01-28 20:07:39 +00:00
d5418ea5a1 Fix link to DeepSeek Engram paper (#470) Sofie Van Landeghem 2026-01-28 17:31:44 +01:00
c88bbf8133 Merge branch 'engram' Andrej Karpathy 2026-01-27 22:33:16 +00:00
c8d93beed2 add engram-lite, add log, tune scaling laws analysis scripts Andrej Karpathy 2026-01-27 22:31:17 +00:00
8630d32be4 quick fix to not OOM main speedrun script Andrej Karpathy 2026-01-26 22:31:42 +00:00
59e36cc727 first version of engram following modded nanogpt style Andrej Karpathy 2026-01-25 18:59:51 +00:00
85b3e95e09 320 experiments just to tune the adam beta1 of x0 a little bit up from 0.8 to 0.96 Andrej Karpathy 2026-01-25 00:03:55 +00:00
6a477eedbd fix: pass device_type to compute_init in engine.__main__ (#451) xiayan0118 2026-01-19 17:19:51 -08:00
63bb5831e2 something i've wanted to do for a while - move all .sh runs to their own directory so they don't pollute root dir Andrej Karpathy 2026-01-18 15:27:41 +00:00
a91743c168 Merge branch 've' Andrej Karpathy 2026-01-18 15:14:39 +00:00
d58fcd9d73 log for jan 17 Andrej Karpathy 2026-01-18 03:01:13 +00:00
babde18ce1 small tweaks Andrej Karpathy 2026-01-18 03:00:38 +00:00
cf5c9e5b8e resolve a crash for odd depths because FA3 needs head_dim % 8 == 0 Andrej Karpathy 2026-01-18 00:07:08 +00:00
413e91aa0f optimal ratio is now around 4 Andrej Karpathy 2026-01-17 23:51:09 +00:00
e7ed2082b8 update the default GPTConfig kwargs otherwise they are confusing Andrej Karpathy 2026-01-17 21:16:46 +00:00
f9a7e0f111 update the CPU/MPS script to give reasonable results. The model can at least answer that Paris is the capital of France and knows that the sky is blue, for about 40 minutes of training on my macbook. Also fixed a bug that existed due to KVCache bfloat16 dtype assumption karpathy 2026-01-17 12:27:30 -08:00
f5425245f9 more GPU types from PR 147 thanks @Qubitium Andrej Karpathy 2026-01-17 03:22:20 +00:00
2955650327 add detection of device to report more correct mfu for bf16 Andrej Karpathy 2026-01-17 03:16:12 +00:00
77a46902e4 Fix WANDB_RUN parameter passing in runcpu.sh (#407) Yury Kirpichev 2026-01-16 18:59:44 -08:00
bbc4413c58 Add high value engine tests for core invariants (33 LoC) (#396) Barış Özmen 2026-01-17 05:59:12 +03:00
f42ae9e901 fix condition to perform bpb evaluation (#324) Nitish Pandey 2026-01-17 08:26:43 +05:30
e1dafc510f Reduce token waste in BOS bestfit by cropping shortest doc (#445) Yamahammer 2026-01-16 21:50:34 -05:00
6460dc6382 tweaks to readme a bit Andrej Karpathy 2026-01-17 02:28:31 +00:00
1933e85046 brief update to log Andrej Karpathy 2026-01-17 00:25:50 +00:00
3b95d4fd39 allow label for scaling laws script Andrej Karpathy 2026-01-17 00:23:30 +00:00
e85db6b4a4 alternating design Andrej Karpathy 2026-01-16 23:52:12 +00:00
9a88194c3f simply one VE per layer, works best Andrej Karpathy 2026-01-16 22:08:52 +00:00
0b58d70e99 full ve version works very well Andrej Karpathy 2026-01-16 21:16:47 +00:00
e3f58b838e ranked version Andrej Karpathy 2026-01-16 20:59:42 +00:00
184d4c12b1 also add to log about the FA3 changes Andrej Karpathy 2026-01-16 18:25:04 +00:00
b62a5bc44a naturally i failed to include the actual code in the previous commit facepalm Andrej Karpathy 2026-01-16 17:39:41 +00:00
8203efa919 implement flash attention 3 fallback to pytorch sdpa by touching as few lines of code as possible in main files and keeping all implementation to a single file. add tests. add helpful warning messages for the user. Andrej Karpathy 2026-01-16 17:37:51 +00:00
50413d2d67 typo in comments: change "GAPO" to "DAPO" Haoyu Wang 2026-01-16 01:03:42 -05:00
fbf2bbea25 update log with a bunch of attempts Andrej Karpathy 2026-01-16 02:21:17 +00:00
747ed4491f add negative result on olmo3 pretraining mix Andrej Karpathy 2026-01-16 00:43:54 +00:00
7d1700c521 add zstd lib Andrej Karpathy 2026-01-16 00:40:59 +00:00
d4ea28d4e2 Fix args in readme (#438) Sofie Van Landeghem 2026-01-16 01:26:38 +01:00
bdcc030ffa oops legacy spurious line now Andrej Karpathy 2026-01-15 23:32:20 +00:00
22a71aa3d3 fuse adamw into a single torch compiled kernel similar to muon. it's about 1.7X faster, but overall it's so tiny that it's not making a major dent Andrej Karpathy 2026-01-15 23:30:44 +00:00
255f8b9af6 cleanly separate cpu and gpu sections Andrej Karpathy 2026-01-15 23:30:11 +00:00
6bb92403d5 changes and optimizations to muon, making it more efficient and simpler/cleaner a bit Andrej Karpathy 2026-01-15 03:20:48 +00:00
3142ca1a28 minor helpful message Andrej Karpathy 2026-01-15 03:20:21 +00:00
7312ec9898 fix buggy midtrain and update all kwargs to be idiomatic. that is, argparse uses dashes variables use underscores. the underscores are just a remnant of the previous Configurator object. This is the right way Andrej Karpathy 2026-01-13 22:45:27 +00:00
3b50b77ed3 fix base_loss to report correct loss by switching the dataloader to the new default Andrej Karpathy 2026-01-13 22:09:36 +00:00
f92efce169 add negative result about not allowing attention across BOS tokens. A lot more code complexity for basically no gain in performance Andrej Karpathy 2026-01-13 21:33:54 +00:00
43c29dd9d5 Big DataLoader refactor: BOS-aligned dataloaders with epoch tracking for pre/mid-training Andrej Karpathy 2026-01-13 20:05:47 +00:00
23985413aa adjust the comment on the regex pattern per recent experimnet see dev/LOG.md Andrej Karpathy 2026-01-13 17:50:39 +00:00
64b48d0e5c validated that \p{N}{1,2} is the correct number of digits to group up to in the regex pattern of the GPT-4 tokenizer (2 down from 3), leading to the best val_bpb for 32K vocabs Andrej Karpathy 2026-01-13 17:45:06 +00:00
238353c998 document my struggle with fp8 integration yesterday, it's not working like i thought it would and i suffered. one day i will return to continue the fight. Andrej Karpathy 2026-01-13 17:14:29 +00:00
4610a838a1 record negative result on MTP Andrej Karpathy 2026-01-12 05:23:47 +00:00
21608ec51e allow base_loss to report the loss of any arbitrary huggingface model similar to base_eval. had to change dataloader to be a lot better and just take tokenizer, not load the nanochat one. much better this way anyway Andrej Karpathy 2026-01-12 03:10:13 +00:00
aa95fb2e03 make miniseries more generic and easier to run and less hard coded Andrej Karpathy 2026-01-12 02:54:35 +00:00
b33e394528 oops actually make SSSL the default window pattern Andrej Karpathy 2026-01-11 21:50:35 +00:00
fbc1484e8c add alternating window size patterns for the GPT layers, following GPT-3. Experimented a bit and found the pattern SSSL to work well - 3 short, 1 long alternating. This is now the new default and the plots look quite a bit better on flops vs. bpb Andrej Karpathy 2026-01-11 21:49:54 +00:00
2ff7d51252 integrate Flash Attention 3. +9% tok_per_sec for d12 with ctx even as low as 2048 out of the box nice. also, ready to tune windows huge Andrej Karpathy 2026-01-11 20:33:19 +00:00
201d705957 recover the ability to load old checkpoints by patching the lambdas if they don't exist in checkpoints Andrej Karpathy 2026-01-11 20:13:12 +00:00
aa530cdad5 Add learnable lambdas that gate the residual connection and a skip connection to the input embeddings, solid bump to val_bpb Andrej Karpathy 2026-01-11 18:47:35 +00:00
2c4473dd1b Big Muon optimizer changes inspired by latest of modded-nanogpt. Added Polar Express, Adafactor-style variance reduction, cautious weight decay, schedule weight decay linearly to ramp down to zero. Tuned optimum weight decay for multiple model sizes d8, d12, d16, d20 and found a scaling law with optimum wd \propto 1/channels^2, including it as default into code. --weight_decay of base_train is now default on and configured optimally according to all of these experiments. Solid bump to val_bpb observed as a result of these changes. Andrej Karpathy 2026-01-11 16:56:59 +00:00
f5a0ea4d3f take out these gitignore dirs Andrej Karpathy 2026-01-08 18:18:39 +00:00
4ddc803797 fix adamw slight bug. this chunk was copy pasted originally from modded-nanogpt, which still seems to have the bug Andrej Karpathy 2026-01-08 18:18:22 +00:00
a1ccb3dc0b remove rust compilation as rustbpe is now installed from separate package (#416) Sofie Van Landeghem 2026-01-08 15:18:37 +01:00
061f83c152 delete grad_clip. appears to not be necessary at all. not only was it buggy because the clipping happened per gpu before grad synchronization, but it costs ~2% MFU, and it also doesn't even help. I tried deleting it a while ago and back then it did help. So I'm guessing that some hyperparameter tuning obviated the reason for it since then Andrej Karpathy 2026-01-08 02:16:50 +00:00
e8c30c3b19 add notebook used for scaling laws analysis Andrej Karpathy 2026-01-07 22:28:53 +00:00
3af4dcf6ee also add scaling_laws.sh script if it's a useful reference Andrej Karpathy 2026-01-07 22:25:13 +00:00
4cc605b940 quick pointer to miniseries post in readme for now Andrej Karpathy 2026-01-07 22:14:21 +00:00
ccf4b7f9bf nudge hyperparameters of the base script with the results of the sweeps and miniseries. vocab size down to 32K. D:N ratio from 20 to 8. add miniseries script Andrej Karpathy 2026-01-07 22:11:52 +00:00
1b5de29e71 Fix undefined variable in chat_rl after recent refactor Adria Blancafort 2026-01-07 18:08:57 +01:00
ae0bf52529 tune hyperparameters based on overnight sweeps. warmdown_ratio is the biggest free win, increasing 0.2 -> 0.4, and embedding lr can be larger bumping 0.2 -> 0.3 Andrej Karpathy 2026-01-05 18:57:46 +00:00
eec0c79563 also add matplotlib dep so that we can have jupyter notebooks Andrej Karpathy 2026-01-05 18:41:09 +00:00
54e59c38ad add notebook on deriving the CORE estimates for the GPT-3 miniseries. Andrej Karpathy 2026-01-05 18:40:28 +00:00
9d4c9b786d many small fixes to base_train: reporting ETA, allowing some additional kwarg flexibility, making sure we don't crash when e.g. depth = 11 - we now calculate the closest num_heads that works Andrej Karpathy 2026-01-05 00:38:09 +00:00
962b6bfba3 alright add transformers as a dep of the repo because it should be easy to evaluate the CORE score of HF models. Not super happy about it but i tried it and the uv.lock doesn't get bloated as much as i expected Andrej Karpathy 2026-01-04 20:37:28 +00:00
ed2082fbc4 sane secrets management Andrej Karpathy 2026-01-04 19:29:22 +00:00
