nanochat-omni

Author	SHA1	Message	Date
Fam Zheng	d760915daa	omni: gpt.forward — optional audio_features for soft-token prepend W1 needs the GPT to consume Whisper-projector outputs as a prefix of "soft tokens" sitting in front of the text token embeddings (LLaVA-style). The change is intentionally minimal: - forward() takes a new keyword-only arg `audio_features` of shape (B, T_a, n_embd). They must already be projected to n_embd by the caller (Projector lives in nanochat.audio, kept out of GPT itself). - The audio rows are normed (matches the post-wte norm convention) and concatenated after smear so smear stays a strictly text-side op (its prev-token semantics aren't defined for soft tokens, and revisiting that belongs to a later phase). - Rotary embeddings are re-sliced for T_a + T_text. Audio gets positions 0..T_a-1, text 0-shifts to T_a..T_full-1. The 10× over-allocated rotary cache in __init__ already covers this. - value_embeds lookup uses an idx padded with 0 for audio positions. They feed the v residual but the gate (`ve_gate`) is input-dependent and will learn to suppress the dummy rows; for W1 smoke this is fine. - targets are auto-padded with -1 (ignore_index) over audio positions so the LM is only graded on text predictions. Not yet supported: audio_features with kv_cache. The KV-cache path is a prefill+decode protocol that would need its own audio-aware semantics; W1 runs train-style forwards only, so we just assert.	2026-05-05 22:38:49 +01:00
Fam Zheng	7939990181	patch: CN mirrors for pytorch-wheels and HF datasets - pyproject.toml + uv.lock: pytorch-cu128/cpu indexes → mirror.sjtu.edu.cn (aliyun lacks 2.9.1, sjtu has it) - nanochat/dataset.py: climbmix BASE_URL → hf-mirror.com For ailab (CN, RTX 5090) where direct pytorch.org and huggingface.co are unreachable. Override at uv-sync time with UV_DEFAULT_INDEX env.	2026-05-05 22:21:21 +01:00
Sofie Van Landeghem	a3ca42a678	add comment	2026-04-13 14:17:23 +02:00
Sofie Van Landeghem	9822cc7424	use nn.init and initialize smear gate's weight as well	2026-04-13 14:03:18 +02:00
Marcin Bogdanski	94b73ad29a	fix: initialize smear and backout lambdas in init_weights	2026-04-03 20:39:55 +00:00
Andrej Karpathy	c0dbf1f3ff	use COMPUTE_DTYPE-aware cast in Muon polar express step The bf16 cast is intentional for speed on Hopper+ GPUs, but should be skipped on other platforms rather than blindly applied. fp16 is unstable here due to its limited exponent range, and fp32 platforms don't benefit from the cast. Now: bf16 when COMPUTE_DTYPE is bf16, no cast otherwise. Inspired by PR #667. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-25 20:19:14 +00:00
Andrej Karpathy	a825e63f81	Autoresearch round 2: smear, backout, and hyperparameter tuning New architectural features: - Smear: mix previous token embedding into current position via learned gate, providing cheap bigram-like info (works in training + KV cache) - Backout: subtract learned fraction of mid-layer residual before logit projection to remove low-level features Hyperparameter tuning: - Muon momentum warmdown 0.97→0.90 during LR warmdown phase - Non-uniform per-layer init: resid_lambdas 1.15→1.05, x0_lambdas 0.20→0.05 - c_fc init scale 0.4x, QK norm scale 1.2, sliding window seq_len/4 - Speedrun data:params ratio reduced to 8 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-14 17:03:06 +00:00
Andrej Karpathy	6ed7d1d82c	All of these improvements were developed by Claude running autonomously over ~2 days using autoresearch. I didn't touch anything - incredible. All tuning was done on d12 but generalized easily to larger models (e.g. d24 in particular). This means we will also get a new "Time to GPT-2" Leaderboard entry, which I will push separately. Optimizer & schedule changes: - Increase unembedding LR 0.004 -> 0.008, weight decay 0.2 -> 0.28 - Per-group Adam betas and weight decay (instead of shared global betas) - Muon beta2 0.95 -> 0.9, momentum warmup target 0.95 -> 0.97 over 400 steps - Warmup: ratio-based -> absolute steps (default 40) - Warmdown ratio 0.5 -> 0.65, final LR fraction 0.0 -> 0.05 - Weight decay schedule: linear -> cosine decay - Polar express norm factor 1.02 -> 1.01 Architecture & init changes: - VE gate: channels 32 -> 12, scale range 2x -> 3x, init small positive - Add post-QK-norm scaling (q,k *= 1.15) for sharper attention - Embedding init std 1.0 -> 0.8, MLP c_fc init 0.5x smaller - RoPE base theta 10K -> 100K - Short attention window: seq_len/2 -> ~seq_len/3 (ceil to 128 tile) - Logit softcap 20 -> 15	2026-03-09 20:45:17 +00:00
Andrej Karpathy	1076f97059	delete autocast, an unnecessary thorn in my side, manage dtypes directly	2026-03-04 23:55:30 +00:00
Andrej Karpathy	324e69c45d	big, breaking change but large upside: swap previous FineWeb-EDU dataset to NVIDIA ClimbMix dataset. Requires people to download the data shards. The upside is that training GPT-2 capability model now only takes ~2 hours, down from 2.76 hours, so this is a huge win data-wise	2026-03-04 19:47:12 +00:00
Andrej Karpathy	aba30cb037	tune logit softcap?	2026-03-03 00:38:53 +00:00
George Shakan	ad55575326	Fix bug in setting precision (#538)	2026-02-18 15:49:18 +00:00
Andrej Karpathy	1415fb7617	tune the data mixture a bit, load optimizer by default when SFT. These were confirmed to be best settings from sweeps of sft	2026-02-18 15:49:18 +00:00
Andrej Karpathy	77f8fb8303	a number of upgrades to SFT script to bring it up to date w.r.t. pretraining and tuning some of its kwargs based on sweeps	2026-02-18 15:49:18 +00:00
Alan	124f49be98	Removed redundant quantization of gradients	2026-02-15 15:41:33 +00:00
Alan	d9678ff0f9	Save FP8 tensors in autograd ctx instead of full-precision inputs Store quantized input/weight and their inverse scales in _Float8Matmul ctx to avoid re-quantization in backward and reduce saved-activation memory without changing numerics.	2026-02-15 14:31:54 +00:00
Andrej Karpathy	e569b59f92	delete torchao dependency, create our own exact API-matched version of Float8Linear, document it very well. for some poorly understood reason, the performance is not only ~identical but actually runs 3% faster, despite it being significantly simpler and much less code. i don't fully understand why/how atm	2026-02-10 18:46:39 +00:00
Andrej Karpathy	98eed6df18	bring back an assert guarding against bad param sizing	2026-02-05 18:14:30 +00:00
Andrej Karpathy	718e5e9d67	correctly reference NorMuon and fix misleading terms that i may have hastily ported over from modded-nanogpt	2026-02-05 01:39:26 +00:00
Sofie Van Landeghem	72b9064f9d	remove leftover mid references (#491)	2026-02-02 08:33:46 -08:00
Andrej Karpathy	e8fec97d4c	slightly more efficient dataloader that reduces the number of python objects flying around and causing strain on runtime and garbage collector	2026-02-02 01:17:30 +00:00
Sofie Van Landeghem	4d6415b8ef	use _PEAK_FLOPS_TABLE instead of if-else structure (#479)	2026-01-31 19:45:06 -08:00
Sofie Van Landeghem	43078c347e	clean up original tokenizing_distributed_data_loader (#478)	2026-01-31 19:44:12 -08:00
Franci Penov	dc291c627f	Add Blackwell (SM100) GPU support via SDPA fallback (#475)	2026-01-31 19:42:58 -08:00
Andrej Karpathy	3ba42e8135	Fix SDPA KV-cache decode to respect sliding window (#456) SDPA fallback now respects sliding window during single-token KV-cache decode by slicing K/V to the last (window + 1) tokens. Also simplifies the mask building for chunk inference to properly apply sliding window in that path as well. Fixes #452 Co-Authored-By: Kartik Vashishta <kartikv776@gmail.com> Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-30 17:32:12 +00:00
Harsh Gupta	2e17723817	Fix generate() crash when top_k=0 (#467) Prevent a crash in generate() by skipping top-k filtering when top_k is set to 0	2026-01-30 09:21:02 -08:00
Andrej Karpathy	d6c4f3b923	i think this is the new torch 2.9+ API for declaring tf32 preference	2026-01-30 17:03:15 +00:00
Andrej Karpathy	6a341f2ecf	contiguous views and single HtoD transfer for inputs/targets much cleaner	2026-01-30 00:23:01 +00:00
Andrej Karpathy	41bb2eac32	Combine AdamW and Muon into single MuonAdamW optimizer, cleaner, ty @chrisjmccormick for idea/help	2026-01-29 00:52:08 +00:00
Andrej Karpathy	74554be3b5	revert engram, not seeing an improvement at larger scale	2026-01-28 20:07:39 +00:00
Andrej Karpathy	c8d93beed2	add engram-lite, add log, tune scaling laws analysis scripts	2026-01-27 22:31:17 +00:00
Andrej Karpathy	59e36cc727	first version of engram following modded nanogpt style	2026-01-25 18:59:51 +00:00
Andrej Karpathy	85b3e95e09	320 experiments just to tune the adam beta1 of x0 a little bit up from 0.8 to 0.96	2026-01-25 00:04:02 +00:00
xiayan0118	6a477eedbd	fix: pass device_type to compute_init in engine.__main__ (#451) When running engine.py directly on non-GPU devices (CPU, MPS), compute_init() needs the device_type parameter to initialize correctly. This fixes failures on machines without CUDA support.	2026-01-19 17:19:51 -08:00
Andrej Karpathy	a91743c168	Merge branch 've'	2026-01-18 15:14:39 +00:00
Andrej Karpathy	e7ed2082b8	update the default GPTConfig kwargs otherwise they are confusing	2026-01-17 21:16:46 +00:00
karpathy	f9a7e0f111	update the CPU/MPS script to give reasonable results. The model can at least answer that Paris is the capital of France and knows that the sky is blue, for about 40 minutes of training on my macbook. Also fixed a bug that existed due to KVCache bfloat16 dtype assumption	2026-01-17 12:27:30 -08:00
Andrej Karpathy	f5425245f9	more GPU types from PR 147 thanks @Qubitium	2026-01-17 03:22:20 +00:00
Andrej Karpathy	2955650327	add detection of device to report more correct mfu for bf16	2026-01-17 03:16:14 +00:00
Yamahammer	e1dafc510f	Reduce token waste in BOS bestfit by cropping shortest doc (#445) When no document fits the remaining row space, crop the shortest document in the buffer instead of the first. This minimizes discarded tokens. Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-16 18:50:34 -08:00
Andrej Karpathy	e85db6b4a4	alternating design	2026-01-16 23:52:12 +00:00
Andrej Karpathy	9a88194c3f	simply one VE per layer, works best	2026-01-16 22:08:52 +00:00
Andrej Karpathy	0b58d70e99	full ve version works very well	2026-01-16 21:16:47 +00:00
Andrej Karpathy	e3f58b838e	ranked version	2026-01-16 20:59:42 +00:00
Andrej Karpathy	b62a5bc44a	naturally i failed to include the actual code in the previous commit facepalm	2026-01-16 17:39:41 +00:00
Andrej Karpathy	8203efa919	implement flash attention 3 fallback to pytorch sdpa by touching as few lines of code as possible in main files and keeping all implementation to a single file. add tests. add helpful warning messages for the user.	2026-01-16 17:37:51 +00:00
Andrej Karpathy	22a71aa3d3	fuse adamw into a single torch compiled kernel similar to muon. it's about 1.7X faster, but overall it's so tiny that it's not making a major dent	2026-01-15 23:30:44 +00:00
Andrej Karpathy	6bb92403d5	changes and optimizations to muon, making it more efficient and simpler/cleaner a bit	2026-01-15 03:20:48 +00:00
Andrej Karpathy	3142ca1a28	minor helpful message	2026-01-15 03:20:21 +00:00
Andrej Karpathy	3b50b77ed3	fix base_loss to report correct loss by switching the dataloader to the new default	2026-01-13 22:09:36 +00:00
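
Note on e1dafc510f: the commit message describes a fallback for row packing: when no buffered document fits the remaining space, crop the shortest one rather than the first. The sketch below is only an illustration of that rule, not the repository's dataloader. The function name `pack_row`, the list-of-lists buffer, and the "longest document that still fits" selection are assumptions made for the example, and BOS handling is left out.

```python
def pack_row(buffer, capacity):
    """Fill one row of `capacity` tokens from `buffer` (a list of token lists).

    Hypothetical helper for illustration only.
    Returns (row, discarded), where `discarded` counts tokens lost to cropping.
    Mutates `buffer` by removing the documents it consumes.
    """
    row, discarded = [], 0
    while len(row) < capacity and buffer:
        remaining = capacity - len(row)
        fitting = [i for i, doc in enumerate(buffer) if len(doc) <= remaining]
        if fitting:
            # Assumed best-fit rule: take the longest document that still fits.
            i = max(fitting, key=lambda j: len(buffer[j]))
            row.extend(buffer.pop(i))
        else:
            # Nothing fits: crop the shortest document, which wastes the fewest tokens.
            i = min(range(len(buffer)), key=lambda j: len(buffer[j]))
            doc = buffer.pop(i)
            row.extend(doc[:remaining])
            discarded += len(doc) - remaining
    return row, discarded


if __name__ == "__main__":
    docs = [[1] * 10, [2] * 6, [3] * 4]
    print(pack_row(docs, capacity=8))
```

With documents of length 10, 6 and 4 and a row of 8 tokens, the 6-token document is placed first, then the 4-token document is cropped to 2 tokens, so 2 tokens are discarded. Cropping the first buffered document (length 10) at that point would have discarded 8.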

145 Commits
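
Note on d760915daa: the commit message describes how a sequence is laid out once T_a audio soft tokens are prepended: audio takes positions 0..T_a-1, text shifts to T_a..T_full-1, the value-embedding index is padded with 0 over the audio rows, and targets are padded with -1 there so the loss skips them. The sketch below only illustrates that bookkeeping with plain Python lists. It is not the code in gpt.forward, and `prefix_layout` and its return keys are hypothetical names chosen for the example.

```python
IGNORE_INDEX = -1  # per the commit message, targets of -1 are ignored by the loss


def prefix_layout(text_idx, text_targets, t_audio):
    """Describe one sequence after `t_audio` soft tokens are prepended.

    Hypothetical helper for illustration; the real model works on tensors.
    """
    t_full = t_audio + len(text_idx)
    positions = list(range(t_full))
    return {
        # Rotary positions: audio occupies the front, text is shifted by t_audio.
        "audio_positions": positions[:t_audio],
        "text_positions": positions[t_audio:],
        # Token ids used for the value-embedding lookup, with dummy 0s for audio rows.
        "ve_idx": [0] * t_audio + list(text_idx),
        # Targets padded so that only text predictions are graded.
        "targets": [IGNORE_INDEX] * t_audio + list(text_targets),
    }


if __name__ == "__main__":
    layout = prefix_layout(text_idx=[11, 12, 13], text_targets=[12, 13, 14], t_audio=2)
    for key, value in layout.items():
        print(key, value)
```

For two audio rows and three text tokens, the text positions come out as 2, 3, 4 and the first two targets are -1, which matches the layout the commit message describes.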