omni: W1 audio align smoke — synthetic dataset + 50-step script

End-to-end smoke proving the audio path:
  wav -> WhisperEncoder (frozen) -> Projector -> prepend to text embeddings
      -> tiny d6 GPT (random init) -> CE loss on text only
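
A minimal sketch of the "prepend, then CE loss on text only" step, in plain Python with no framework. It assumes the audio soft tokens occupy the first `n_audio` positions and that a target of -1 means "ignore", as the todo entry in this commit describes. The helper names `build_targets` and `masked_ce` are hypothetical and not taken from nanochat.

```python
import math

IGNORE = -1  # target value for positions that carry no loss (audio soft tokens)

def build_targets(n_audio, text_targets):
    """Pad targets so the prepended audio positions are ignored."""
    return [IGNORE] * n_audio + list(text_targets)

def masked_ce(logits, targets):
    """Mean cross-entropy over positions whose target is not IGNORE.

    logits: one list of per-class scores per position.
    """
    total, count = 0.0, 0
    for row, tgt in zip(logits, targets):
        if tgt == IGNORE:
            continue
        m = max(row)  # subtract the max for a numerically stable log-sum-exp
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[tgt]
        count += 1
    return total / count if count else 0.0

targets = build_targets(n_audio=2, text_targets=[1, 0])
logits = [[9.0, -9.0], [-9.0, 9.0],  # audio positions: ignored by the loss
          [0.0, 0.0], [0.0, 0.0]]    # text positions: uniform over 2 classes
loss = masked_ce(logits, targets)    # uniform over 2 classes gives log(2)
```

Whatever the model predicts at the audio positions has no effect on the loss, so gradients reach the projector only through the text positions that attend to the soft tokens.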

Pass criterion is a plain "loss drops by at least 0.5". On a 4090 the run
finishes in ~1 s and goes 5.55 -> 0.17 over 50 steps, so the threshold has
plenty of headroom against spurious failures.
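
The pass criterion can be sketched as a one-line check. This assumes the script compares the first and last recorded loss; the actual script may compare averages or other steps, and `smoke_passed` is a hypothetical name.

```python
def smoke_passed(losses, min_drop=0.5):
    """True if the loss fell by at least `min_drop` from first to last step."""
    if len(losses) < 2:
        return False
    return (losses[0] - losses[-1]) >= min_drop

# Endpoints reported in this commit message: 5.55 -> 0.17.
passed = smoke_passed([5.55, 0.17])
# A made-up flat run, for contrast only (not measured data).
flat = smoke_passed([5.55, 5.40, 5.30])
```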

Two design calls worth keeping in mind:

1. Synthetic sine clips, not LibriSpeech. W1 is forward-path proof, not
   alignment quality, and a deterministic offline dataset means no network
   on the smoke path. data/audio_smoke/manifest.jsonl is the only thing
   committed; wavs are regenerated by audio_smoke_data.py and gitignored.
   W2 swaps in real LibriSpeech.

2. Standalone byte-level tokenizer (UTF-8 bytes + a single BOS, vocab=257).
   Avoids depending on a trained nanochat BPE: the d6 GPT is random
   anyway, so vocab choice doesn't matter for a "does the gradient flow"
   smoke test. W2 onwards uses the real BPE on a real base.
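
The two design calls above can be sketched with the standard library alone. This is an illustration under assumptions, not the contents of audio_smoke_data.py: the 440 Hz tone, 0.3 amplitude, 16-bit mono format, and BOS id of 256 are assumed values, and all function names are hypothetical. The 16 kHz sample rate is the rate Whisper expects.

```python
import math
import struct
import wave

def write_sine_wav(path, freq_hz=440.0, seconds=5.0, sample_rate=16000):
    """Write a mono 16-bit PCM sine clip; same inputs give the same samples."""
    n = int(seconds * sample_rate)
    samples = (int(0.3 * 32767 * math.sin(2 * math.pi * freq_hz * i / sample_rate))
               for i in range(n))
    with wave.open(path, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)            # 2 bytes per sample = 16-bit
        w.setframerate(sample_rate)
        w.writeframes(b"".join(struct.pack("<h", s) for s in samples))

BOS_ID = 256       # assumed layout: ids 0..255 are raw bytes, 256 is the BOS
VOCAB_SIZE = 257

def encode(text):
    """BOS followed by the UTF-8 bytes of `text`."""
    return [BOS_ID] + list(text.encode("utf-8"))

def decode(ids):
    """Drop special ids and decode the remaining bytes as UTF-8."""
    return bytes(i for i in ids if i < 256).decode("utf-8", errors="replace")

ids = encode("héllo")  # 'é' takes two bytes in UTF-8
```

Because the clips are regenerated from a formula, only the manifest needs to be committed, and the tokenizer needs no trained vocabulary file.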

Caveat documented in doc/todo.md: because the LM is also random and being
trained, the loss-down here mostly reflects the LM memorising 5 short
strings, not Whisper-Projector alignment. That's fine for proving
plumbing; W2 freezes the LM so projector-only gradient is the only path
to lower loss.
Fam Zheng
2026-05-05 22:39:20 +01:00
parent 7cc94cf584
commit 3c1cc3302f
5 changed files with 274 additions and 4 deletions
+9 -4
@@ -19,12 +19,17 @@
See the module diagram in research §1.2.
- [ ] `nanochat/audio.py`: WhisperEncoder wrapper (frozen; weights preferably pulled from ModelScope, e.g. `iic/Whisper-large-v3` / `iic/Whisper-small`, with an HF mirror kept as fallback) + ProjectorMLP, output dimension matched to nanochat `model_dim`
- [ ] `nanochat/gpt.py`: add an optional `audio_features` parameter to `GPT.forward()`, prepended as soft tokens ahead of the text embeddings
- [ ] mini dataset: 1 to 10 five-second wav clips + captions, stored under `data/audio_smoke/`; no audio committed to git, only a manifest + download script
- [ ] `scripts/audio_align_smoke.py`: 50 steps, d6 nanochat base, passes if the loss drops
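- [x] `nanochat/audio.py`: WhisperEncoder wrapper (frozen; ModelScope preferred via `WHISPER_MS_ID`, HF fallback defaulting to `openai/whisper-base`) + ProjectorMLP, output dimension matched to nanochat `n_embd`
- [x] `nanochat/gpt.py`: add an optional `audio_features` parameter to `GPT.forward()`, prepended as soft tokens ahead of the text embeddings (the kv_cache path is not supported yet; targets at audio positions are automatically masked to -1)
- [x] mini dataset: 5 five-second synthetic sine clips + captions, stored under `data/audio_smoke/`; wavs are generated by `scripts/audio_smoke_data.py` and excluded via gitignore
- [x] `scripts/audio_align_smoke.py`: 50 steps, randomly initialised d6 GPT, byte-level tokenizer, passes if the loss drops (measured on a 4090: ~1 s, 5.55→0.17)
- [ ] Add an audio smoke job to CI (install ffmpeg on the ailab runner; running Whisper through transformers is enough)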
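Possible W1 follow-ups (shelved for now; left to the W3+/W5+ audio-texture tasks):
- Currently uses `last_hidden_state` (the layer most biased toward text semantics); for texture perception it should switch to an intermediate layer / a multi-layer weighted sum / w2v-bert
- The d6 GPT is randomly initialised, so the alignment signal is really training the LM rather than the projector; only once W2 moves to a real base, freezes the LM, and trains the projector alone does it become true weak alignment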
## W2: S1 weak-alignment training
- [ ] Pull LibriSpeech 100h (HF mirror), pre-extract Whisper-base encoder features and write them to disk as webdataset