3c1cc3302f
End-to-end smoke proving the audio path:
wav -> WhisperEncoder (frozen) -> Projector -> prepend to text embeddings
-> tiny d6 GPT (random init) -> CE loss on text only
Pass criterion is a plain "loss drops by at least 0.5". On a 4090 the run
finishes in ~1 s and goes 5.55 -> 0.17 over 50 steps, so the threshold
leaves plenty of headroom against spurious failures.
Two design calls worth keeping in mind:
1. Synthetic sine clips, not LibriSpeech. W1 is forward-path proof, not
alignment quality, and a deterministic offline dataset means no network
on the smoke path. data/audio_smoke/manifest.jsonl is the only thing
committed; wavs are regenerated by audio_smoke_data.py and gitignored.
W2 swaps in real LibriSpeech.
2. Standalone byte-level tokenizer (UTF-8 bytes + a single BOS, vocab=257).
Avoids depending on a trained nanochat BPE — the d6 GPT is random
anyway, so vocab choice doesn't matter for "does the gradient flow"
smoke. W2 onwards uses the real BPE on a real base.
Caveat documented in doc/todo.md: because the LM is also random and being
trained, the loss-down here mostly reflects the LM memorising 5 short
strings, not Whisper-Projector alignment. That's fine for proving
plumbing; W2 freezes the LM so projector-only gradient is the only path
to lower loss.
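
As an illustration of design call 1, a deterministic sine clip can be produced with the standard library alone. This is a minimal sketch, not the contents of audio_smoke_data.py: the function name, the 16 kHz mono 16-bit format, the amplitude, and the 440 Hz tone are all assumptions chosen for the example (16 kHz is the rate Whisper models expect; 5 s matches the clip length noted in doc/todo.md).

```python
import math
import struct
import wave


def write_sine_wav(path, freq_hz, seconds=5.0, sample_rate=16000, amplitude=0.3):
    """Write a mono 16-bit PCM sine tone to `path` and return the frame count.

    Hypothetical helper: the real generator script may use different
    parameters or a different audio library.
    """
    n_frames = int(seconds * sample_rate)
    frames = bytearray()
    for i in range(n_frames):
        sample = amplitude * math.sin(2.0 * math.pi * freq_hz * i / sample_rate)
        # Scale to the signed 16-bit range and pack as little-endian.
        frames += struct.pack("<h", int(sample * 32767))
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)            # mono
        wav.setsampwidth(2)            # 2 bytes = 16-bit samples
        wav.setframerate(sample_rate)
        wav.writeframes(bytes(frames))
    return n_frames
```

Because the samples depend only on the arguments, regenerating the wavs always yields identical files, which is what lets the manifest be the only committed artifact.
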
68 lines
3.7 KiB
Markdown
# nanochat-omni TODO

Positioning: **voice-texture-aware speech input** (audio-first, text-only output; vision comes later).
Reference: the W1–W8 timeline in [research_feasibility.md](research_feasibility.md) (mochi, 2026-05-05).

---

## Near term: repo structure / engineering infrastructure

- [ ] **Flatten the submodule into a monorepo fork**
  - `git remote add upstream https://github.com/karpathy/nanochat.git`
  - Rewrite `main`: `git reset --hard upstream/main`, then cherry-pick our 7 commits (CI / smoke / wandb / gitignore / README)
  - Force-push (nobody has forked the repo, so this is safe), then delete `.gitmodules` and `upstream/nanochat/`
  - To pull upstream updates afterwards: `git fetch upstream && git merge upstream/main`
- [ ] Apply the CN mirror patch directly to `pyproject.toml` / `nanochat/dataset.py` (no `sed` needed after the fork)
- [ ] Re-path the CI smoke to follow the fork (`upstream/nanochat/` → repo root)

## W1: Whisper encoder + Projector forward smoke

See the module diagram in research §1.2.

- [x] `nanochat/audio.py`: WhisperEncoder wrapper (frozen; ModelScope first via `WHISPER_MS_ID`, with an HF fallback that defaults to `openai/whisper-base`) + Projector (MLP whose output dimension matches nanochat's `n_embd`)
- [x] `nanochat/gpt.py`: `GPT.forward()` takes an optional `audio_features` argument that is prepended to the text embeddings as soft tokens (the kv_cache path is not supported yet; targets at audio positions are automatically masked to -1)
- [x] Mini dataset: five 5 s synthetic sine clips with captions under `data/audio_smoke/` (wavs are generated by `scripts/audio_smoke_data.py` and excluded via gitignore)
- [x] `scripts/audio_align_smoke.py`: 50 steps, randomly initialized d6 GPT, byte-level tokenizer; passes if the loss drops (measured on a 4090: ~1 s, 5.55→0.17)
- [ ] Add an audio smoke job to CI (install ffmpeg on the ailab runner; loading Whisper through transformers is enough)

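The byte-level tokenizer and the -1 masking of audio positions from the checklist above can be sketched in plain Python. This is only an illustration under stated assumptions, not the code in `scripts/audio_align_smoke.py` or `nanochat/gpt.py`: the function names are hypothetical, BOS is assumed to take id 256 (the first id after the 256 byte values, giving vocab=257), and the inputs/targets shift shown is the usual next-token convention.

```python
BOS_ID = 256        # assumption: BOS is the one id beyond the 256 byte values
VOCAB_SIZE = 257    # 256 UTF-8 byte values + BOS
IGNORE_INDEX = -1   # target value that the loss skips


def encode(text: str) -> list:
    """Byte-level encoding: BOS followed by the raw UTF-8 bytes."""
    return [BOS_ID] + list(text.encode("utf-8"))


def decode(ids) -> str:
    """Drop special ids and decode the remaining bytes back to text."""
    return bytes(i for i in ids if i < 256).decode("utf-8", errors="replace")


def build_example(text: str, n_audio: int):
    """Return (text_inputs, targets) for a clip with `n_audio` soft tokens.

    The model sees n_audio projected audio vectors followed by the text
    input embeddings, so targets must cover n_audio + len(text_inputs)
    positions; every audio position is masked with IGNORE_INDEX.
    """
    ids = encode(text)
    text_inputs, text_targets = ids[:-1], ids[1:]   # next-token shift
    targets = [IGNORE_INDEX] * n_audio + text_targets
    return text_inputs, targets


if __name__ == "__main__":
    inputs, targets = build_example("hi", n_audio=3)
    print(inputs, targets)
```

With this layout the cross-entropy loss only ever sees text positions, which matches the "CE loss on text only" behaviour the smoke relies on.
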
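Possible W1 follow-ups (shelved for the W3+/W5+ texture tasks):

- Currently uses `last_hidden_state` (the layer most biased toward text semantics); for texture awareness, switch to an intermediate layer, a multi-layer weighted sum, or w2v-bert
- The d6 GPT is randomly initialized, so the alignment signal is really training the LM rather than the projector; true weak alignment only starts in W2, once a real base is in place, the LM is frozen, and only the projector is trained
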
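One common way to realise the "multi-layer weighted sum" mentioned above is to mix the per-layer hidden states with softmax-normalised scalar weights. The sketch below shows that mixing on plain nested lists so it runs without any dependencies. It is an assumption about how the follow-up might be done, not a design the project has committed to; the function names are hypothetical, and in a real model the logits would be learnable parameters and the layers would be tensors.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_layer_sum(layers, logits):
    """Mix L hidden-state matrices into one.

    layers: list of L matrices, each shaped [T][D] (time steps x features)
    logits: list of L scalars; softmax turns them into mixing weights
    """
    weights = softmax(logits)
    n_steps, n_dims = len(layers[0]), len(layers[0][0])
    mixed = [[0.0] * n_dims for _ in range(n_steps)]
    for weight, layer in zip(weights, layers):
        for t in range(n_steps):
            for d in range(n_dims):
                mixed[t][d] += weight * layer[t][d]
    return mixed
```

Equal logits reduce this to a plain average of the layers, while one very large logit approaches picking a single intermediate layer, so the same mechanism covers both options listed in the first bullet.
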
## W2: S1 weak-alignment training

- [ ] Pull LibriSpeech 100h (HF mirror) and pre-extract Whisper-base encoder features to disk as webdataset
- [ ] `scripts/audio_align_train.py`: freeze the LM and Whisper, train only the Projector
- [ ] PCA visualization of alignment quality (do the features cluster in the text embedding space?)
- [ ] wandb project: `nanochat-omni-audio` (kept separate from `nanochat`, the project for the nanochat text base)

## W3: S2 instruction tuning + LoRA

- [ ] Wire LoRA into nanochat `Linear` (rank=16, attention/MLP only)
- [ ] A mix of 50k audio instruction examples (AudioBench + self-synthesized)
- [ ] Eval: a self-built 200-question AudioBench-mini

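To get a feel for what rank=16 costs, recall that LoRA factors the weight update of a `d_out x d_in` linear layer as `B @ A`, with `A` of shape `rank x d_in` and `B` of shape `d_out x rank`, so each adapted layer adds `rank * (d_in + d_out)` trainable parameters. A minimal sketch of that arithmetic follows; the helper names are hypothetical and the width of 768 is purely illustrative, not taken from any nanochat config.

```python
def lora_param_count(d_in: int, d_out: int, rank: int = 16) -> int:
    """Trainable parameters added by one LoRA adapter (A: rank x d_in, B: d_out x rank)."""
    return rank * (d_in + d_out)


def full_param_count(d_in: int, d_out: int) -> int:
    """Parameters in the frozen dense weight, for comparison (bias ignored)."""
    return d_in * d_out


if __name__ == "__main__":
    width = 768  # hypothetical layer width, for illustration only
    added = lora_param_count(width, width)
    frozen = full_param_count(width, width)
    print(f"adapter: {added} params, dense weight: {frozen} params, "
          f"ratio: {added / frozen:.3f}")
```
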
## W4: MVP demo

- [ ] Reuse `scripts/chat_web.py` and add recording upload
- [ ] AudioBench-mini accuracy ≥40% (baseline 25%)
- [ ] End-to-end time to first token <2 s on a 4090

## W5+: scale-up / texture data / vision

See research §4.1; to be expanded during W5–W8.

## Decisions

- **backbone**: self-trained nanochat d12 → d20 → d26 (no borrowing an off-the-shelf gemma/qwen; keep the hackable spirit)
- **order**: audio first, vision in W7+; no multimodal output (TTS/imagegen)
- **infra**: training and smoke CI both run on ailab (5090, 32G); CN mirrors are sjtu/aliyun (pip), modelscope (model weights, preferred), and hf-mirror (HF datasets / weights fallback)
- **monorepo fork pattern**: upstream nanochat's code is our code; omni changes go directly into the `nanochat/` package

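## Shelved / undecided

- [ ] Vision path: starts in W7+, follows the LLaVA recipe, and shares the Projector abstraction with audio
- [ ] Self-synthesized texture data: generate emotional variants with ailab CosyVoice or IndexTTS (servers already exist on s1/i7; the cross-machine data production pipeline is undecided)
- [ ] B40 / GB10 hardware tests: not required for the MVP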