nanochat-omni/doc/todo.md at 3c1cc3302f90c49271488ada29b9137c4946f53f

Fam Zheng 3c1cc3302f omni: W1 audio align smoke — synthetic dataset + 50-step script

End-to-end smoke proving the audio path:
  wav -> WhisperEncoder (frozen) -> Projector -> prepend to text embeddings
      -> tiny d6 GPT (random init) -> CE loss on text only
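
A minimal sketch of the "CE loss on text only" part, assuming the audio frames are prepended as extra sequence positions and that masked positions use -100 (the default `ignore_index` of PyTorch's cross-entropy). `build_targets` is a hypothetical helper, not a function from this repo, and the choice to let the last audio position predict the first text token is an assumption.

```python
IGNORE_INDEX = -100  # assumption: matches torch.nn.functional.cross_entropy's default


def build_targets(n_audio, text_ids):
    """Next-token targets for the sequence [audio frames..., text tokens...].

    Position i predicts the element at position i + 1. Only positions whose
    next element is a text token receive a real target; everything else is
    masked so it contributes nothing to the loss.
    """
    seq_len = n_audio + len(text_ids)
    targets = [IGNORE_INDEX] * seq_len
    for j, tok in enumerate(text_ids):
        pos = n_audio + j - 1  # the position that predicts text token j
        if pos >= 0:
            targets[pos] = tok
    return targets
```

With 3 audio frames and the text ids `[256, 104, 105]`, the first two audio positions and the final position are masked, so the loss only sees text tokens.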

The pass criterion is simply "loss drops by at least 0.5". On a 4090 the run
finishes in ~1 s and goes 5.55 -> 0.17 over 50 steps, so the threshold leaves
plenty of headroom against spurious failures.
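
The pass criterion could be checked with something like the following sketch. `loss_dropped` and its arguments are hypothetical names, and comparing the first and last recorded loss is an assumption about how the smoke script measures the drop.

```python
def loss_dropped(losses, min_drop=0.5):
    """Return True if the loss fell by at least `min_drop` from first to last step."""
    if len(losses) < 2:
        raise ValueError("need at least two loss values to measure a drop")
    return (losses[0] - losses[-1]) >= min_drop
```

For the run quoted above, `loss_dropped([5.55, 0.17])` is true with a margin of roughly 4.9 over the threshold.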

Two design calls worth keeping in mind:

1. Synthetic sine clips, not LibriSpeech. W1 is forward-path proof, not
   alignment quality, and a deterministic offline dataset means no network
   on the smoke path. data/audio_smoke/manifest.jsonl is the only thing
   committed; wavs are regenerated by audio_smoke_data.py and gitignored.
   W2 swaps in real LibriSpeech.

2. Standalone byte-level tokenizer (UTF-8 bytes + a single BOS, vocab=257).
   Avoids depending on a trained nanochat BPE — the d6 GPT is random
   anyway, so vocab choice doesn't matter for "does the gradient flow"
   smoke. W2 onwards uses the real BPE on a real base.
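
Both design calls can be sketched with the standard library alone. This is an illustration under stated assumptions, not the contents of audio_smoke_data.py: the function names, the BOS id of 256, the clip length, and the amplitude are all hypothetical. The 16 kHz sample rate is chosen because that is what Whisper expects as input.

```python
import io
import math
import struct
import wave

BOS_ID = 256      # assumption: BOS sits just above the 256 byte values
VOCAB_SIZE = 257  # 256 UTF-8 byte values + 1 BOS


def encode(text):
    """Byte-level tokenisation: BOS followed by the raw UTF-8 bytes."""
    return [BOS_ID] + list(text.encode("utf-8"))


def decode(ids):
    """Inverse of encode; drops BOS and any id outside the byte range."""
    return bytes(i for i in ids if i < 256).decode("utf-8", errors="replace")


def sine_wav_bytes(freq_hz, seconds=1.0, sample_rate=16000, amplitude=0.3):
    """Return a mono 16-bit PCM WAV of a pure sine tone as bytes.

    No randomness is involved, so the same arguments always yield the
    same bytes and the clips can be regenerated offline.
    """
    n_frames = int(seconds * sample_rate)
    frames = b"".join(
        struct.pack(
            "<h",
            int(amplitude * 32767 * math.sin(2 * math.pi * freq_hz * t / sample_rate)),
        )
        for t in range(n_frames)
    )
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)  # 16-bit samples
        w.setframerate(sample_rate)
        w.writeframes(frames)
    return buf.getvalue()
```

Because the generator is deterministic, only the manifest needs to live in git; regenerating the wavs on any machine gives identical files.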

Caveat documented in doc/todo.md: because the LM is also randomly initialised
and being trained, the loss drop here mostly reflects the LM memorising 5 short
strings, not Whisper-Projector alignment. That's fine for proving the
plumbing; W2 freezes the LM so that the projector's gradient is the only path
to lower loss.

nanochat-omni TODO

Near term: repo structure / engineering infrastructure

W1: Whisper encoder + Projector forward smoke

W2: S1 weak-alignment training

W3: S2 instruction + LoRA

W4: MVP demo

W5+: scale-up / higher-quality data / vision

Decisions

Shelved / TBD
