nanochat-omni

Author	SHA1	Message	Date
Fam Zheng	3c1cc3302f	omni: W1 audio align smoke — synthetic dataset + 50-step script End-to-end smoke proving the audio path: wav -> WhisperEncoder (frozen) -> Projector -> prepend to text embeddings -> tiny d6 GPT (random init) -> CE loss on text only Pass criterion is a plain "loss drops by at least 0.5". On a 4090 the run finishes in ~1 s and goes 5.55 -> 0.17 over 50 steps, so the threshold has plenty of headroom against false positives. Two design calls worth keeping in mind: 1. Synthetic sine clips, not LibriSpeech. W1 is forward-path proof, not alignment quality, and a deterministic offline dataset means no network on the smoke path. data/audio_smoke/manifest.jsonl is the only thing committed; wavs are regenerated by audio_smoke_data.py and gitignored. W2 swaps in real LibriSpeech. 2. Standalone byte-level tokenizer (UTF-8 bytes + a single BOS, vocab=257). Avoids depending on a trained nanochat BPE — the d6 GPT is random anyway, so vocab choice doesn't matter for "does the gradient flow" smoke. W2 onwards uses the real BPE on a real base. Caveat documented in doc/todo.md: because the LM is also random and being trained, the loss-down here mostly reflects the LM memorising 5 short strings, not Whisper-Projector alignment. That's fine for proving plumbing; W2 freezes the LM so projector-only gradient is the only path to lower loss.	2026-05-05 22:39:20 +01:00
Fam Zheng	7cc94cf584	omni: nanochat/audio.py - frozen Whisper encoder + Projector The audio modality module that pairs with the gpt.forward audio_features hook. Two things live here: WhisperEncoder: thin wrapper around transformers' WhisperModel.encoder. - Weight loading prefers ModelScope when WHISPER_MS_ID is set (matches the CN-mirror policy in doc/todo.md: modelscope is first-class for model weights, hf-mirror is fallback). Otherwise falls back to plain HF, with WHISPER_HF_ID as the override and `openai/whisper-base` as the default (the smallest variant that still produces useful features for smoke). - Encoder params have requires_grad=False from __init__ so they never appear in the optimizer's param list. Caller does not need to remember to freeze it. - preprocess() runs the feature extractor; forward() takes (B, n_mels, T_mel) and returns last_hidden_state (B, T_enc, d_model). Whisper pads every clip to 30 s, so T_enc is a constant 1500 regardless of input duration: handy for batching, wasteful for short clips. We accept the waste at W1; W2 can switch to streaming-style chunking. - Note for W3+/W5+: last_hidden_state is the most text-semantic layer. When we start caring about timbre / prosody / emotion ("texture perception"), we should expose middle layers or a learnable weighted sum across layers. Projector: 2-layer MLP (in_dim → out_dim → out_dim) with GELU and the nanochat Linear class so master weights stay fp32 while forward runs in the activation dtype (bf16). fc2 is zero-initialized so the model starts ignoring audio entirely, which gives a clean baseline before any training signal flows through (audio path is opt-out by default, opt-in by training).	2026-05-05 22:39:05 +01:00
Fam Zheng	d760915daa	omni: gpt.forward - optional audio_features for soft-token prepend W1 needs the GPT to consume Whisper-projector outputs as a prefix of "soft tokens" sitting in front of the text token embeddings (LLaVA-style). The change is intentionally minimal: - forward() takes a new keyword-only arg `audio_features` of shape (B, T_a, n_embd). They must already be projected to n_embd by the caller (Projector lives in nanochat.audio, kept out of GPT itself). - The audio rows are normed (matches the post-wte norm convention) and concatenated after smear so smear stays a strictly text-side op (its prev-token semantics aren't defined for soft tokens, and revisiting that belongs to a later phase). - Rotary embeddings are re-sliced for T_a + T_text. Audio gets positions 0..T_a-1, text shifts to T_a..T_full-1. The 10× over-allocated rotary cache in __init__ already covers this. - value_embeds lookup uses an idx padded with 0 for audio positions. They feed the v residual but the gate (`ve_gate`) is input-dependent and will learn to suppress the dummy rows; for W1 smoke this is fine. - targets are auto-padded with -1 (ignore_index) over audio positions so the LM is only graded on text predictions. Not yet supported: audio_features with kv_cache. The KV-cache path is a prefill+decode protocol that would need its own audio-aware semantics; W1 runs train-style forwards only, so we just assert.	2026-05-05 22:38:49 +01:00
fam	9cae824aa5	Merge pull request 'doc: prefer ModelScope for Whisper encoder weights (closes #4)' (#5) from mochi/issue-4 into main Reviewed-on: https://famzheng.me/gitea/fam/nanochat-omni/pulls/5	2026-05-05 21:26:21 +00:00
mochi	62642b805b	doc: prefer ModelScope for Whisper encoder weights (closes #4) The W1 todo entry for audio.py's WhisperEncoder previously said to pull weights from the HF mirror, but pulling from HF inside China (even via hf-mirror) frequently stalls. Switch to preferring ModelScope (e.g. iic/Whisper-large-v3 / iic/Whisper-small), with the HF mirror kept as a fallback. While at it, the infra-decision entry also aligns the mirror list into three tracks (pip / model weights / HF datasets) and states explicitly that modelscope is the first choice for model weights.	2026-05-05 22:25:38 +01:00
Fam Zheng	b585e07dc2	omni: CI smoke + docs + README preamble - .gitea/workflows/smoke.yml: gitea CI on ailab gpu runner (manual git clone since actions/checkout@v4 mis-resolves subpath gitea); injects WANDB_API_KEY + CI_RUN_TAG=smoke-$run_number - scripts/smoke.sh: in-place smoke (uv sync + 1 shard + tokenizer + d=6 50-step base_train); idempotent cache at /data/nanochat-smoke/ - doc/research_feasibility.md: voice-first multimodal feasibility study (mochi) - doc/todo.md: phase-by-phase roadmap (W1 Whisper smoke → W4 MVP) - README.md: omni preamble pointing at upstream nanochat README - .gitignore: exclude .claude/ runtime files	2026-05-05 22:21:31 +01:00
Fam Zheng	7939990181	patch: CN mirrors for pytorch-wheels and HF datasets - pyproject.toml + uv.lock: pytorch-cu128/cpu indexes → mirror.sjtu.edu.cn (aliyun lacks 2.9.1, sjtu has it) - nanochat/dataset.py: climbmix BASE_URL → hf-mirror.com For ailab (CN, RTX 5090) where direct pytorch.org and huggingface.co are unreachable. Override at uv-sync time with UV_DEFAULT_INDEX env.	2026-05-05 22:21:21 +01:00
Andrej Karpathy	dc54a1a307	tried and failed at DyT	2026-05-05 03:17:21 +00:00
Andrej	0aaca56805	Merge pull request #706 from svlandeg/fix/cpu Add setuptools for CPU run	2026-04-14 11:33:14 -07:00
Andrej	b9b6ce137b	Merge pull request #686 from marcinbogdanski/fix/init-smear-backout-lambdas Initialize smear and backout lambdas in init_weights()	2026-04-13 16:08:04 -07:00
Sofie Van Landeghem	a3ca42a678	add comment	2026-04-13 14:17:23 +02:00
Sofie Van Landeghem	9822cc7424	use nn.init and initialize smear gate's weight as well	2026-04-13 14:03:18 +02:00
svlandeg	12839c11e3	update uv lock	2026-04-13 11:20:38 +02:00
svlandeg	8ef90bc154	add setuptools for CPU run	2026-04-13 10:50:57 +02:00
Marcin Bogdanski	94b73ad29a	fix: initialize smear and backout lambdas in init_weights	2026-04-03 20:39:55 +00:00
Andrej Karpathy	a445144d39	create a group for dev dependencies, there is no need to install all this other stuff just for speedrun and it's exposing people to dependency chain attacks. we need to delete more dependencies. dependencies bad bad bad	2026-03-26 03:41:28 +00:00
Andrej Karpathy	03be953668	delete non-essential deps from legacy use	2026-03-26 03:41:28 +00:00
Andrej	7808dc7159	Merge pull request #595 from svlandeg/fix/typo Small fixes	2026-03-25 14:40:25 -07:00
Andrej	a4ed96687b	Merge pull request #634 from 2bitbit/fix-docs-and-comments fix: correct minor typos in help text, README, and comments	2026-03-25 14:31:49 -07:00
Andrej	7b70f6b411	Merge pull request #639 from mathieu-lacage/master Verified: both paths return identical data (99,842 rows), and all splits under subset='all' load cleanly.	2026-03-25 14:29:30 -07:00
RoomWithOutRoof	47e983eea7	fix: use meta device in disable_fp8 to avoid VRAM spike (#616) When swapping Float8Linear to Linear in disable_fp8 context manager, using device=fp8_module.weight.device directly allocates new tensors on GPU, causing unnecessary VRAM spike (~1GB for large models). This fix uses device='meta' to avoid physical memory allocation, then swaps in the weight tensor reference. This eliminates the unnecessary VRAM spike during evaluation phase. Fixes issue #592 Co-authored-by: RoomWithOutRoof <roomwithoutroof@sparklab.ai>	2026-03-25 14:24:57 -07:00
Andrej Karpathy	c0dbf1f3ff	use COMPUTE_DTYPE-aware cast in Muon polar express step The bf16 cast is intentional for speed on Hopper+ GPUs, but should be skipped on other platforms rather than blindly applied. fp16 is unstable here due to its limited exponent range, and fp32 platforms don't benefit from the cast. Now: bf16 when COMPUTE_DTYPE is bf16, no cast otherwise. Inspired by PR #667. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-25 20:19:14 +00:00
Andrej Karpathy	4e1694cc95	bunch of ideas tried from openai/parameter-golf, all negative results for nanochat	2026-03-24 22:13:13 +00:00
Andrej Karpathy	1cd94d768f	bump D:N ratio to 12 per recent scaling laws re-run	2026-03-24 19:25:50 +00:00
Andrej Karpathy	c16db281ff	fix small bug with params logging and batch size	2026-03-24 19:25:34 +00:00
svlandeg	dfe7d39ce8	Merge branch 'master' into fix/typo	2026-03-18 17:01:45 +01:00
Andrej Karpathy	5019accc5b	fix scaling laws scripts after the bigram embeddings were removed	2026-03-17 16:55:56 +00:00
svlandeg	51f42a4406	~1.5h :-)	2026-03-15 22:29:27 +01:00
svlandeg	1f9e42a855	two more typos, from PR 645	2026-03-15 22:27:18 +01:00
svlandeg	bd6e9c8d5f	fix numbering	2026-03-15 22:18:18 +01:00
svlandeg	02e865c2ab	Merge branch 'master' into fix/typo	2026-03-15 22:18:01 +01:00
Andrej Karpathy	1b1cc3c599	submit new time to GPT-2 leaderboard entry: 99 minutes	2026-03-14 17:15:01 +00:00
Andrej Karpathy	a825e63f81	Autoresearch round 2: smear, backout, and hyperparameter tuning New architectural features: - Smear: mix previous token embedding into current position via learned gate, providing cheap bigram-like info (works in training + KV cache) - Backout: subtract learned fraction of mid-layer residual before logit projection to remove low-level features Hyperparameter tuning: - Muon momentum warmdown 0.97→0.90 during LR warmdown phase - Non-uniform per-layer init: resid_lambdas 1.15→1.05, x0_lambdas 0.20→0.05 - c_fc init scale 0.4x, QK norm scale 1.2, sliding window seq_len/4 - Speedrun data:params ratio reduced to 8 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-14 17:03:06 +00:00
svlandeg	6405b26d24	Merge branch 'master' into fix/typo	2026-03-13 13:56:50 +01:00
svlandeg	1052d25d45	we only need to wait 2h now!	2026-03-13 13:46:16 +01:00
Mathieu Lacage	a641b6ca96	MMLU main split is named auxiliary_train, not train	2026-03-13 13:19:10 +01:00
2bitbit	2bb93b2ae4	fix: correct minor typos in help text, README, and comments	2026-03-12 17:03:26 +08:00
svlandeg	d96558bcb0	fix heading, cf #622	2026-03-10 09:57:30 +01:00
Andrej Karpathy	f068604948	new leaderboard entry coming from improvements of autoresearch round 1, time to gpt-2 from 2.02 hours to 1.80 hours	2026-03-10 06:26:39 +00:00
Andrej Karpathy	6ed7d1d82c	All of these improvements were developed by Claude running autonomously over ~2 days using autoresearch. I didn't touch anything - incredible. All tuning was done on d12 but generalized easily to larger models (e.g. d24 in particular). This means we will also get a new "Time to GPT-2" Leaderboard entry, which I will push separately. Optimizer & schedule changes: - Increase unembedding LR 0.004 -> 0.008, weight decay 0.2 -> 0.28 - Per-group Adam betas and weight decay (instead of shared global betas) - Muon beta2 0.95 -> 0.9, momentum warmup target 0.95 -> 0.97 over 400 steps - Warmup: ratio-based -> absolute steps (default 40) - Warmdown ratio 0.5 -> 0.65, final LR fraction 0.0 -> 0.05 - Weight decay schedule: linear -> cosine decay - Polar express norm factor 1.02 -> 1.01 Architecture & init changes: - VE gate: channels 32 -> 12, scale range 2x -> 3x, init small positive - Add post-QK-norm scaling (q,k *= 1.15) for sharper attention - Embedding init std 1.0 -> 0.8, MLP c_fc init 0.5x smaller - RoPE base theta 10K -> 100K - Short attention window: seq_len/2 -> ~seq_len/3 (ceil to 128 tile) - Logit softcap 20 -> 15	2026-03-09 20:45:17 +00:00
svlandeg	f8ff0439b9	two more small typos	2026-03-06 11:03:00 +01:00
Andrej Karpathy	1076f97059	delete autocast, an unnecessary thorn in my side, manage dtypes directly	2026-03-04 23:55:30 +00:00
Sofie Van Landeghem	752abc836e	Ensure that inputs and targets are contiguous (#569) * call reshape instead of view in case the tensors are not contiguous * fix directly in data loader instead	2026-03-04 13:58:27 -08:00
Andrej Karpathy	4b4077425b	Document new Leaderboard entry congrats @ddudek for pointing out ClimbMix, time to GPT-2 is now 2.01 hours, down from 2.76 previously	2026-03-04 20:02:07 +00:00
Andrej Karpathy	324e69c45d	big, breaking change but large upside: swap previous FineWeb-EDU dataset to NVIDIA ClimbMix dataset. Requires people to download the data shards. The upside is that training GPT-2 capability model now only takes ~2 hours, down from 2.76 hours, so this is a huge win data-wise	2026-03-04 19:47:12 +00:00
Andrej Karpathy	b07604ebaa	document the legacy fineweb100b dataset and the new climbmix400b dataset	2026-03-03 17:24:31 +00:00
Andrej Karpathy	aba30cb037	tune logit softcap?	2026-03-03 00:38:53 +00:00
Anish	83dccc20ae	Restore completion-only loss masking in SFT dataloader (#582) * printing steps count * adding reply only loss for chat * using the mask by render_conversation function of tokeniser * undoing some changes * putting back the comment which got removed accidentally, no functionality change	2026-03-02 16:37:47 -08:00
Dipesh Babu	c7ba252142	docs: fix typos in experiment log (#547)	2026-02-20 08:03:45 -08:00
Andrej Karpathy	2dffdc8cf6	document MoE exploration	2026-02-19 02:53:47 +00:00
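
Illustrative sketch for the W1 commits above (3c1cc3302f and d760915daa). It is not the repository's code: it only mirrors, in plain Python, the text-side bookkeeping those messages describe, namely a byte-level tokenizer with 256 byte ids plus one BOS (vocab=257), targets padded with -1 over the audio prefix, and positions 0..T_a-1 for audio followed by T_a..T_full-1 for text. The function names and the choice of 256 as the BOS id are assumptions made for illustration.

```python
# Hypothetical names throughout; only the numbers 257 and -1 come from the commit messages.
BOS_ID = 256        # assumption: BOS takes the single id above the 256 byte values
VOCAB_SIZE = 257    # UTF-8 bytes (0..255) + one BOS
IGNORE_INDEX = -1   # targets at audio positions are excluded from the loss


def encode(text: str) -> list[int]:
    """Byte-level encoding: BOS followed by the raw UTF-8 bytes of the text."""
    return [BOS_ID] + list(text.encode("utf-8"))


def decode(ids: list[int]) -> str:
    """Inverse of encode: drop BOS and decode the remaining bytes as UTF-8."""
    return bytes(i for i in ids if i != BOS_ID).decode("utf-8")


def pad_targets(targets: list[int], t_audio: int) -> list[int]:
    """Prepend IGNORE_INDEX for each audio position so only text is graded."""
    return [IGNORE_INDEX] * t_audio + list(targets)


def positions(t_audio: int, t_text: int) -> tuple[list[int], list[int]]:
    """Audio soft tokens take 0..T_a-1; text follows at T_a..T_a+T_text-1."""
    audio = list(range(t_audio))
    text = list(range(t_audio, t_audio + t_text))
    return audio, text


if __name__ == "__main__":
    ids = encode("hi")
    inputs, targets = ids[:-1], ids[1:]   # next-token prediction pairs
    print(inputs, pad_targets(targets, t_audio=3))
    print(positions(t_audio=3, t_text=len(inputs)))
```

With a 3-row audio prefix, the padded target sequence has the same length as the concatenated audio-plus-text input, so a cross-entropy loss that skips -1 entries is computed over text predictions only.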

388 Commits