Commit Graph

381 Commits

Author SHA1 Message Date
Andrej Karpathy dc54a1a307 tried and failed at DyT 2026-05-05 03:17:21 +00:00
Andrej 0aaca56805 Merge pull request #706 from svlandeg/fix/cpu
Add setuptools for CPU run
2026-04-14 11:33:14 -07:00
Andrej b9b6ce137b Merge pull request #686 from marcinbogdanski/fix/init-smear-backout-lambdas
Initialize smear and backout lambdas in init_weights()
2026-04-13 16:08:04 -07:00
Sofie Van Landeghem a3ca42a678 add comment 2026-04-13 14:17:23 +02:00
Sofie Van Landeghem 9822cc7424 use nn.init and initialize smear gate's weight as well 2026-04-13 14:03:18 +02:00
svlandeg 12839c11e3 update uv lock 2026-04-13 11:20:38 +02:00
svlandeg 8ef90bc154 add setuptools for CPU run 2026-04-13 10:50:57 +02:00
Marcin Bogdanski 94b73ad29a fix: initialize smear and backout lambdas in init_weights 2026-04-03 20:39:55 +00:00
Andrej Karpathy a445144d39 create a group for dev dependencies, there is no need to install all this other stuff just for speedrun and it's exposing people to dependency chain attacks. we need to delete more dependencies. dependencies bad bad bad 2026-03-26 03:41:28 +00:00
Andrej Karpathy 03be953668 delete non-essential deps from legacy use 2026-03-26 03:41:28 +00:00
Andrej 7808dc7159 Merge pull request #595 from svlandeg/fix/typo
Small fixes
2026-03-25 14:40:25 -07:00
Andrej a4ed96687b Merge pull request #634 from 2bitbit/fix-docs-and-comments
fix: correct minor typos in help text, README, and comments
2026-03-25 14:31:49 -07:00
Andrej 7b70f6b411 Merge pull request #639 from mathieu-lacage/master
Verified: both paths return identical data (99,842 rows), and all splits under subset='all' load cleanly.
2026-03-25 14:29:30 -07:00
RoomWithOutRoof 47e983eea7 fix: use meta device in disable_fp8 to avoid VRAM spike (#616)
When swapping Float8Linear to Linear in disable_fp8 context manager,
using device=fp8_module.weight.device directly allocates new tensors
on GPU, causing unnecessary VRAM spike (~1GB for large models).

This fix uses device='meta' to avoid physical memory allocation,
then swaps in the weight tensor reference. This eliminates the
unnecessary VRAM spike during evaluation phase.

Fixes issue #592

Co-authored-by: RoomWithOutRoof <roomwithoutroof@sparklab.ai>
2026-03-25 14:24:57 -07:00
Andrej Karpathy c0dbf1f3ff use COMPUTE_DTYPE-aware cast in Muon polar express step
The bf16 cast is intentional for speed on Hopper+ GPUs, but should be
skipped on other platforms rather than blindly applied. fp16 is unstable
here due to its limited exponent range, and fp32 platforms don't benefit
from the cast. Now: bf16 when COMPUTE_DTYPE is bf16, no cast otherwise.

Inspired by PR #667.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-25 20:19:14 +00:00
Andrej Karpathy 4e1694cc95 bunch of ideas tried from openai/parameter-golf, all negative results for nanochat 2026-03-24 22:13:13 +00:00
Andrej Karpathy 1cd94d768f bump D:N ratio to 12 per recent scaling laws re-run 2026-03-24 19:25:50 +00:00
Andrej Karpathy c16db281ff fix small bug with params logging and batch size 2026-03-24 19:25:34 +00:00
svlandeg dfe7d39ce8 Merge branch 'master' into fix/typo 2026-03-18 17:01:45 +01:00
Andrej Karpathy 5019accc5b fix scaling laws scripts after the bigram embeddings were removed 2026-03-17 16:55:56 +00:00
svlandeg 51f42a4406 ~1.5h :-) 2026-03-15 22:29:27 +01:00
svlandeg 1f9e42a855 two more typos, from PR 645 2026-03-15 22:27:18 +01:00
svlandeg bd6e9c8d5f fix numbering 2026-03-15 22:18:18 +01:00
svlandeg 02e865c2ab Merge branch 'master' into fix/typo 2026-03-15 22:18:01 +01:00
Andrej Karpathy 1b1cc3c599 submit new time to GPT-2 leaderboard entry: 99 minutes 2026-03-14 17:15:01 +00:00
Andrej Karpathy a825e63f81 Autoresearch round 2: smear, backout, and hyperparameter tuning
New architectural features:
- Smear: mix previous token embedding into current position via learned
  gate, providing cheap bigram-like info (works in training + KV cache)
- Backout: subtract learned fraction of mid-layer residual before logit
  projection to remove low-level features

Hyperparameter tuning:
- Muon momentum warmdown 0.97→0.90 during LR warmdown phase
- Non-uniform per-layer init: resid_lambdas 1.15→1.05, x0_lambdas 0.20→0.05
- c_fc init scale 0.4x, QK norm scale 1.2, sliding window seq_len/4
- Speedrun data:params ratio reduced to 8

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-14 17:03:06 +00:00
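The smear and backout ideas described in the commit above can be sketched in plain Python. This is a minimal illustration, not the repository's implementation: the function names, the scalar sigmoid gate, and the list-of-lists representation are all assumptions made for readability, and the real gate may well be computed per position or per channel.

```python
import math

def smear(embeddings, gate_logit):
    """Mix each position's previous-token embedding into the current one.

    embeddings: list of per-position vectors (lists of floats), non-empty.
    gate_logit: hypothetical learned scalar; squashed to (0, 1) by a sigmoid.
    """
    gate = 1.0 / (1.0 + math.exp(-gate_logit))  # assumed gate form
    out = [list(embeddings[0])]  # the first position has no predecessor
    for prev, cur in zip(embeddings, embeddings[1:]):
        # Add a gated copy of the *unsmeared* previous embedding.
        out.append([c + gate * p for p, c in zip(prev, cur)])
    return out

def backout(final_residual, mid_residual, backout_lambda):
    """Subtract a learned fraction of a mid-layer residual before the logit projection."""
    return [f - backout_lambda * m for f, m in zip(final_residual, mid_residual)]

smeared = smear([[1.0, 2.0], [3.0, 4.0]], gate_logit=0.0)  # gate = 0.5
cleaned = backout([1.0, 2.0], [0.5, 1.0], backout_lambda=0.2)
```

Because smear only looks one token back, it would also work during incremental decoding as long as the previous token's embedding is kept around, which matches the commit's note that it works with the KV cache.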
svlandeg 6405b26d24 Merge branch 'master' into fix/typo 2026-03-13 13:56:50 +01:00
svlandeg 1052d25d45 we only need to wait 2h now! 2026-03-13 13:46:16 +01:00
Mathieu Lacage a641b6ca96 MMLU main split is named auxiliary_train, not train 2026-03-13 13:19:10 +01:00
2bitbit 2bb93b2ae4 fix: correct minor typos in help text, README, and comments 2026-03-12 17:03:26 +08:00
svlandeg d96558bcb0 fix heading, cf #622 2026-03-10 09:57:30 +01:00
Andrej Karpathy f068604948 new leaderboard entry coming from improvements of autoresearch round 1, time to gpt-2 from 2.02 hours to 1.80 hours 2026-03-10 06:26:39 +00:00
Andrej Karpathy 6ed7d1d82c All of these improvements were developed by Claude running autonomously over ~2 days using autoresearch. I didn't touch anything - incredible. All tuning was done on d12 but generalized easily to larger models (e.g. d24 in particular). This means we will also get a new "Time to GPT-2" Leaderboard entry, which I will push separately.
Optimizer & schedule changes:
- Increase unembedding LR 0.004 -> 0.008, weight decay 0.2 -> 0.28
- Per-group Adam betas and weight decay (instead of shared global betas)
- Muon beta2 0.95 -> 0.9, momentum warmup target 0.95 -> 0.97 over 400 steps
- Warmup: ratio-based -> absolute steps (default 40)
- Warmdown ratio 0.5 -> 0.65, final LR fraction 0.0 -> 0.05
- Weight decay schedule: linear -> cosine decay
- Polar express norm factor 1.02 -> 1.01

Architecture & init changes:
- VE gate: channels 32 -> 12, scale range 2x -> 3x, init small positive
- Add post-QK-norm scaling (q,k *= 1.15) for sharper attention
- Embedding init std 1.0 -> 0.8, MLP c_fc init 0.5x smaller
- RoPE base theta 10K -> 100K
- Short attention window: seq_len/2 -> ~seq_len/3 (ceil to 128 tile)
- Logit softcap 20 -> 15
2026-03-09 20:45:17 +00:00
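The warmup and warmdown settings listed in the commit above can be read as a learning-rate multiplier schedule. The sketch below uses the stated defaults (40 absolute warmup steps, warmdown ratio 0.65, final LR fraction 0.05); the linear ramp shapes and the function name `lr_multiplier` are assumptions, since the commit message gives only the parameter values.

```python
def lr_multiplier(step, num_steps, warmup_steps=40, warmdown_ratio=0.65, final_lr_frac=0.05):
    """Return the factor applied to the base learning rate at `step`.

    Assumed shape: linear warmup, constant plateau, linear warmdown
    that ends at `final_lr_frac` instead of zero.
    """
    warmdown_steps = round(warmdown_ratio * num_steps)
    if step < warmup_steps:
        # Warmup is counted in absolute steps, not as a fraction of the run.
        return (step + 1) / warmup_steps
    if step <= num_steps - warmdown_steps:
        return 1.0
    # progress is 1.0 at the start of warmdown and 0.0 at the final step.
    progress = (num_steps - step) / warmdown_steps
    return progress * 1.0 + (1.0 - progress) * final_lr_frac

# Sample the schedule for a hypothetical 1000-step run.
schedule = [lr_multiplier(s, 1000) for s in (0, 39, 350, 675, 1000)]
```

With an absolute warmup, lengthening the run leaves the first 40 steps unchanged, whereas a ratio-based warmup would stretch along with the total step count.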
svlandeg f8ff0439b9 two more small typos 2026-03-06 11:03:00 +01:00
Andrej Karpathy 1076f97059 delete autocast, an unnecessary thorn in my side, manage dtypes directly 2026-03-04 23:55:30 +00:00
Sofie Van Landeghem 752abc836e Ensure that inputs and targets are contiguous (#569)
* call reshape instead of view in case the tensors are not contiguous

* fix directly in data loader instead
2026-03-04 13:58:27 -08:00
Andrej Karpathy 4b4077425b Document new Leaderboard entry; congrats @ddudek for pointing out ClimbMix. Time to GPT-2 is now 2.01 hours, down from 2.76 previously 2026-03-04 20:02:07 +00:00
Andrej Karpathy 324e69c45d big, breaking change but large upside: swap previous FineWeb-EDU dataset to NVIDIA ClimbMix dataset. Requires people to download the data shards. The upside is that training a GPT-2 capability model now only takes ~2 hours, down from 2.76 hours, so this is a huge win data-wise 2026-03-04 19:47:12 +00:00
Andrej Karpathy b07604ebaa document the legacy fineweb100b dataset and the new climbmix400b dataset 2026-03-03 17:24:31 +00:00
Andrej Karpathy aba30cb037 tune logit softcap? 2026-03-03 00:38:53 +00:00
Anish 83dccc20ae Restore completion-only loss masking in SFT dataloader (#582)
* printing steps count

* adding reply only loss for chat

* using the mask from the tokenizer's render_conversation function

* undoing some changes

* putting back the comment which got removed accidentally, no functionality change
2026-03-02 16:37:47 -08:00
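Completion-only loss masking, as restored in the commit above, means that only assistant-reply tokens contribute to the training loss. The sketch below shows the general technique in plain Python, assuming a token list plus a parallel 0/1 mask of the kind the commit attributes to render_conversation. The helper names and the `-1` ignore value are hypothetical and chosen for illustration only.

```python
IGNORE_INDEX = -1  # assumed sentinel for "do not train on this position"

def build_inputs_and_targets(token_ids, mask):
    """Shift tokens by one and hide targets that are not part of a reply.

    token_ids: list of ints for one rendered conversation.
    mask: list of 0/1 flags, 1 where the token belongs to an assistant reply.
    """
    inputs = token_ids[:-1]
    targets = [
        tok if m == 1 else IGNORE_INDEX
        for tok, m in zip(token_ids[1:], mask[1:])
    ]
    return inputs, targets

def masked_mean_loss(per_token_losses, targets):
    """Average the loss over positions whose target is not ignored."""
    kept = [l for l, t in zip(per_token_losses, targets) if t != IGNORE_INDEX]
    return sum(kept) / len(kept) if kept else 0.0

inputs, targets = build_inputs_and_targets([10, 11, 12, 13, 14], [0, 0, 1, 1, 0])
loss = masked_mean_loss([1.0, 2.0, 4.0, 8.0], targets)
```

The mask is shifted together with the targets, so a position is trained on only when the token it is asked to predict belongs to a reply.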
Dipesh Babu c7ba252142 docs: fix typos in experiment log (#547) 2026-02-20 08:03:45 -08:00
Andrej Karpathy 2dffdc8cf6 document MoE exploration 2026-02-19 02:53:47 +00:00
Andrej Karpathy 48804bff3a report negative result on fineweb dataset 2026-02-18 23:45:31 +00:00
Andrej Karpathy bb5137860e fix comment 2026-02-18 23:26:22 +00:00
Andrej Karpathy 458555117b Merge branch 'Chetter2-patch-1' 2026-02-18 23:17:39 +00:00
Andrej Karpathy bac5a35dd7 fix minor bug in fp8 application to skip tiny matmuls 2026-02-18 23:17:29 +00:00
George Shakan ad55575326 Fix bug in setting precision (#538) 2026-02-18 15:49:18 +00:00
Sofie Van Landeghem cac43e8511 Fix MockModel's device definition (#535)
* fix MockModel's device definition

* cleanup
2026-02-18 15:49:18 +00:00
Andrej Karpathy f5fe7925ed update dev log with recent 2026-02-18 15:49:18 +00:00