9cae824aa5
Merge pull request 'doc: prefer ModelScope for Whisper encoder weights (closes #4)' (#5) from mochi/issue-4 into main
fam
2026-05-05 21:26:21 +00:00
8ef90bc154
add setuptools for CPU run
svlandeg
2026-04-13 10:50:57 +02:00
94b73ad29a
fix: initialize smear and backout lambdas in init_weights
Marcin Bogdanski
2026-04-03 20:39:55 +00:00
a445144d39
create a group for dev dependencies, there is no need to install all this other stuff just for speedrun and it's exposing people to dependency chain attacks. we need to delete more dependencies. dependencies bad bad bad
Andrej Karpathy
2026-03-26 03:40:55 +00:00
03be953668
delete non-essential deps from legacy use
Andrej Karpathy
2026-03-26 03:10:08 +00:00
7808dc7159
Merge pull request #595 from svlandeg/fix/typo
Andrej
2026-03-25 14:40:25 -07:00
a4ed96687b
Merge pull request #634 from 2bitbit/fix-docs-and-comments
Andrej
2026-03-25 14:31:49 -07:00
7b70f6b411
Merge pull request #639 from mathieu-lacage/master
Andrej
2026-03-25 14:29:30 -07:00
47e983eea7
fix: use meta device in disable_fp8 to avoid VRAM spike (#616)
RoomWithOutRoof
2026-03-26 05:24:57 +08:00
c0dbf1f3ff
use COMPUTE_DTYPE-aware cast in Muon polar express step
Andrej Karpathy
2026-03-25 20:19:14 +00:00
4e1694cc95
bunch of ideas tried from openai/parameter-golf, all negative results for nanochat
Andrej Karpathy
2026-03-24 22:13:13 +00:00
1cd94d768f
bump D:N ratio to 12 per recent scaling laws re-run
Andrej Karpathy
2026-03-24 19:25:50 +00:00
c16db281ff
fix small bug with params logging and batch size
Andrej Karpathy
2026-03-24 19:25:34 +00:00
dfe7d39ce8
Merge branch 'master' into fix/typo
svlandeg
2026-03-18 17:01:45 +01:00
5019accc5b
fix scaling laws scripts after the bigram embeddings were removed
Andrej Karpathy
2026-03-17 16:55:56 +00:00
f068604948
new leaderboard entry coming from improvements of autoresearch round 1, time to gpt-2 from 2.02 hours to 1.80 hours
Andrej Karpathy
2026-03-10 06:26:39 +00:00
6ed7d1d82c
All of these improvements were developed by Claude running autonomously over ~2 days using autoresearch. I didn't touch anything - incredible. All tuning was done on d12 but generalized easily to larger models (e.g. d24 in particular). This means we will also get a new "Time to GPT-2" Leaderboard entry, which I will push separately.
Andrej Karpathy
2026-03-09 20:45:17 +00:00
f8ff0439b9
two more small typos
svlandeg
2026-03-06 11:03:00 +01:00
1076f97059
delete autocast, an unnecessary thorn in my side, manage dtypes directly
Andrej Karpathy
2026-03-04 23:55:24 +00:00
752abc836e
Ensure that inputs and targets are contiguous (#569)
Sofie Van Landeghem
2026-03-04 22:58:27 +01:00
4b4077425b
Document new Leaderboard entry; congrats @ddudek for pointing out ClimbMix. Time to GPT-2 is now 2.01 hours, down from 2.76 previously
Andrej Karpathy
2026-03-04 20:02:07 +00:00
324e69c45d
big, breaking change but large upside: swap previous FineWeb-EDU dataset to NVIDIA ClimbMix dataset. Requires people to download the data shards. The upside is that training a GPT-2 capability model now only takes ~2 hours, down from 2.76 hours, so this is a huge win data-wise
Andrej Karpathy
2026-03-04 19:47:12 +00:00
b07604ebaa
document the legacy fineweb100b dataset and the new climbmix400b dataset
Andrej Karpathy
2026-03-03 17:24:31 +00:00
aba30cb037
tune logit softcap?
Andrej Karpathy
2026-03-02 18:19:37 +00:00
83dccc20ae
Restore completion-only loss masking in SFT dataloader (#582)
Anish
2026-03-03 06:07:47 +05:30
f5fe7925ed
update dev log with recent
Andrej Karpathy
2026-02-17 15:44:54 +00:00
1415fb7617
tune the data mixture a bit, load optimizer by default when SFT. These were confirmed to be best settings from sweeps of sft
Andrej Karpathy
2026-02-16 20:23:04 +00:00
77f8fb8303
a number of upgrades to SFT script to bring it up to date w.r.t. pretraining and tuning some of its kwargs based on sweeps
Andrej Karpathy
2026-02-16 14:41:53 +00:00
0a23f87643
Fix bug in setting precision (#538)
George Shakan
2026-02-18 10:42:11 -05:00
4a6e47b0c6
update dev log with recent
Andrej Karpathy
2026-02-17 15:44:54 +00:00
8180e1d8c1
tune the data mixture a bit, load optimizer by default when SFT. These were confirmed to be best settings from sweeps of sft
Andrej Karpathy
2026-02-16 20:23:04 +00:00
788dadeb88
a number of upgrades to SFT script to bring it up to date w.r.t. pretraining and tuning some of its kwargs based on sweeps
Andrej Karpathy
2026-02-16 14:41:53 +00:00
124f49be98
Removed redundant quantization of gradients
Alan
2026-02-15 15:41:33 +00:00
d9678ff0f9
Save FP8 tensors in autograd ctx instead of full-precision inputs
Alan
2026-02-15 14:31:54 +00:00
2f09686724
clarify that this is bf16 mfu we're talking about
Andrej Karpathy
2026-02-10 23:35:00 +00:00
e569b59f92
delete torchao dependency, create our own exact API-matched version of Float8Linear, document it very well. for some poorly understood reason, the performance is not only ~identical but actually runs 3% faster, despite it being significantly simpler and much less code. i don't fully understand why/how atm
Andrej Karpathy
2026-02-10 18:46:39 +00:00
1ec0a34779
at 28 and above we start to need batch size 8
Andrej Karpathy
2026-02-08 18:26:34 +00:00
ff46300720
tune miniseries just a bit, fairly cosmetic, keep to even depths where the math works out nicely in model sizing
Andrej Karpathy
2026-02-08 17:54:12 +00:00
aeff095e97
better comments/flow on all the hyperparameter transfer stuff, and change the WD scaling from my empirical 1/d^2 to a bit more principled version based on Tepoch. All of that theory is based on AdamW and could be suboptimal for Muon
Andrej Karpathy
2026-02-06 19:22:28 +00:00
685271dc8d
new optimal ratio for d26 training
Andrej Karpathy
2026-02-06 19:21:27 +00:00
e527521a3f
briefly mention batch ramp experimentation too, too weak to merge in my few attempts
Andrej Karpathy
2026-02-05 22:21:03 +00:00
96522798f1
docs docs docs
Andrej Karpathy
2026-02-05 20:27:07 +00:00
5fdd5cdb24
new leaderboard record via new auto-calculated optimal batch size. for d26 it is 1M, up from 0.5M that was default earlier
Andrej Karpathy
2026-02-05 20:11:32 +00:00
2c062aaa94
nit: don't mutate args, create new var for total_batch_size
Andrej Karpathy
2026-02-05 19:59:46 +00:00
f41dd3cbd7
auto-calculate optimal batch size. the original setting of 0.5M was only optimal for d12, but d26 prefers 1M and so on
Andrej Karpathy
2026-02-05 19:40:37 +00:00
98eed6df18
bring back an assert guarding against bad param sizing
Andrej Karpathy
2026-02-05 18:14:09 +00:00
012da1a78b
Typo fixes (#480)
Sofie Van Landeghem
2026-02-05 19:12:50 +01:00
75b302f331
fix hash commit on leaderboard and a paragraph clarification
Andrej Karpathy
2026-02-05 16:14:28 +00:00
1144d186ed
try and fail relu^2 -> swiglu
Andrej Karpathy
2026-02-05 02:42:46 +00:00
d63b7ab9ac
try and fail relu^2 -> swiglu
Andrej Karpathy
2026-02-05 02:41:46 +00:00
718e5e9d67
correctly reference NorMuon and fix misleading terms that i may have hastily ported over from modded-nanogpt
Andrej Karpathy
2026-02-05 01:39:26 +00:00
542beb0c8c
bump speedrun to be the up to date leaderboard run
Andrej Karpathy
2026-02-04 02:12:04 +00:00
d510b1385b
quick experiments to log
Andrej Karpathy
2026-02-03 23:21:39 +00:00
16b8ac7da3
oops forgot to attach leaderboard file too
Andrej Karpathy
2026-02-03 21:06:12 +00:00
fe55b092b8
minor cosmetics for the table
Andrej Karpathy
2026-02-03 21:05:28 +00:00
a67eba35dc
add feb2 new leaderboard record from upgrading to fp8 training, +4.3% speedup to time to GPT-2
Andrej Karpathy
2026-02-03 20:54:30 +00:00
6079f78fc3
add fp8 training with torchao
Andrej Karpathy
2026-02-03 20:51:26 +00:00
8ebc14b348
small touchups to the eval script, re-order items etc, cosmetic
Andrej Karpathy
2026-02-03 20:25:48 +00:00