Fam Zheng 3c1cc3302f omni: W1 audio align smoke — synthetic dataset + 50-step script
End-to-end smoke proving the audio path:
  wav -> WhisperEncoder (frozen) -> Projector -> prepend to text embeddings
      -> tiny d6 GPT (random init) -> CE loss on text only

Pass criterion is a plain "loss drops by at least 0.5". On a 4090 the run
finishes in ~1 s and goes 5.55 -> 0.17 over 50 steps, a drop of about 5.4,
so the threshold has plenty of headroom against spurious failures.

Two design calls worth keeping in mind:

1. Synthetic sine clips, not LibriSpeech. W1 is forward-path proof, not
   alignment quality, and a deterministic offline dataset means no network
   on the smoke path. data/audio_smoke/manifest.jsonl is the only thing
   committed; wavs are regenerated by audio_smoke_data.py and gitignored.
   W2 swaps in real LibriSpeech.

2. Standalone byte-level tokenizer (UTF-8 bytes + a single BOS, vocab=257).
   Avoids depending on a trained nanochat BPE — the d6 GPT is random
   anyway, so vocab choice doesn't matter for "does the gradient flow"
   smoke. W2 onwards uses the real BPE on a real base.

Caveat documented in doc/todo.md: because the LM is also random and being
trained, the loss-down here mostly reflects the LM memorising 5 short
strings, not Whisper-Projector alignment. That's fine for proving
plumbing; W2 freezes the LM so projector-only gradient is the only path
to lower loss.
2026-05-05 22:39:20 +01:00

# nanochat-omni TODO
Positioning: **texture-aware speech input** (audio-first, text-only output; vision comes later).

Reference: [research_feasibility.md](research_feasibility.md) (mochi, 2026-05-05), the W1-W8 schedule.

---
## Near term: repo structure / engineering infrastructure
- [ ] **Flatten the submodule into a monorepo fork**
  - `git remote add upstream https://github.com/karpathy/nanochat.git`
  - Rewrite `main`: `git reset --hard upstream/main`, then cherry-pick our 7 commits (CI / smoke / wandb / gitignore / README)
  - Force push (nobody has forked the repo, so this is safe); delete `.gitmodules` and `upstream/nanochat/`
  - To pull upstream updates: `git fetch upstream && git merge upstream/main`
- [ ] Apply the CN mirror patch directly to `pyproject.toml` / `nanochat/dataset.py` (no sed needed after the fork)
- [ ] Re-path the CI smoke to follow the fork (`upstream/nanochat/` → repo root)
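## W1: Whisper encoder + Projector forward smoke
See the module diagram in research §1.2.
- [x] `nanochat/audio.py`: WhisperEncoder wrapper (frozen; ModelScope preferred via `WHISPER_MS_ID`, HF fallback defaulting to `openai/whisper-base`) + Projector (MLP, output dimension matched to nanochat `n_embd`)
- [x] `nanochat/gpt.py`: `GPT.forward()` takes an optional `audio_features` argument, prepended as soft tokens in front of the text embeddings (the kv_cache path is not supported yet; targets at audio positions are automatically masked to -1)
- [x] Mini dataset: 5 synthetic 5 s sine clips with captions, stored under `data/audio_smoke/` (wavs are generated by `scripts/audio_smoke_data.py` and gitignored)
- [x] `scripts/audio_align_smoke.py`: 50 steps, randomly initialized d6 GPT, byte-level tokenizer, passes if the loss drops (measured on a 4090: ~1 s, 5.55→0.17)
- [ ] Add an audio smoke job to CI (install ffmpeg on the ailab runner; running whisper through transformers is enough)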
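
The byte-level tokenizer and the -1 target mask from the W1 items can be sketched in plain Python. This is a minimal illustration, not the code in `nanochat/gpt.py` or `scripts/audio_align_smoke.py`: the function names, the choice of 256 as the BOS id, and the layout of the target list are assumptions made for the example.

```python
BOS_ID = 256        # assumption: BOS takes the one id above the 256 byte values
VOCAB_SIZE = 257    # 256 byte values + 1 BOS
IGNORE_INDEX = -1   # targets equal to -1 are assumed to be skipped by the loss


def encode(text: str) -> list[int]:
    """UTF-8 bytes prefixed with a single BOS token."""
    return [BOS_ID] + list(text.encode("utf-8"))


def decode(ids: list[int]) -> str:
    """Drop BOS and decode the remaining byte ids back to text."""
    return bytes(i for i in ids if i != BOS_ID).decode("utf-8", errors="replace")


def build_targets(n_audio: int, token_ids: list[int]) -> list[int]:
    """Next-token targets for the sequence [audio soft tokens..., text tokens...].

    Audio positions carry no text label, so in this sketch every one of them
    is IGNORE_INDEX. Text position i predicts token i + 1, and the final
    position has nothing left to predict.
    """
    text_targets = token_ids[1:] + [IGNORE_INDEX]
    return [IGNORE_INDEX] * n_audio + text_targets
```

With 3 audio soft tokens in front of the caption `"hi"`, `build_targets(3, encode("hi"))` gives `[-1, -1, -1, 104, 105, -1]`, so only the two text bytes contribute to the cross-entropy loss.
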
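
Possible W1 follow-ups (parked for the W3+/W5+ texture tasks):
- The encoder output currently used is `last_hidden_state` (the layer most biased toward textual semantics); for texture awareness it should switch to an intermediate layer / a multi-layer weighted sum / w2v-bert
- The d6 GPT is randomly initialized, so the alignment signal is really training the LM rather than the projector; true weak alignment only starts in W2, once a real base is in place with the LM frozen and only the projector trained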
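
One of the options above, a multi-layer weighted sum, can be illustrated as follows. This is a sketch under assumptions: the per-layer encoder outputs are taken to be already available as nested lists, the function names are hypothetical, and a real version would use tensors with the layer logits as learnable parameters.

```python
import math


def softmax(logits: list[float]) -> list[float]:
    """Numerically stable softmax over a short list of scalars."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_layer_sum(layers, layer_logits):
    """Mix L encoder layers into one feature sequence.

    layers:       list of L layers, each shaped [n_frames][dim]
    layer_logits: L scalars; softmax turns them into mixing weights
    """
    weights = softmax(layer_logits)
    n_frames, dim = len(layers[0]), len(layers[0][0])
    return [
        [sum(w * layer[t][d] for w, layer in zip(weights, layers)) for d in range(dim)]
        for t in range(n_frames)
    ]
```

With all logits equal the result is a plain average of the layers; training the logits would let the projector input lean toward whichever layers carry more acoustic detail.
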
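## W2: S1 weak-alignment training
- [ ] Pull LibriSpeech 100h (HF mirror); pre-extract Whisper-base encoder features and write them to disk as webdataset
- [ ] `scripts/audio_align_train.py`: freeze the LM and Whisper, train only the Projector
- [ ] PCA visualization of alignment quality (do the features cluster in the text-embedding space?)
- [ ] wandb project: `nanochat-omni-audio` (kept separate from `nanochat`, the project used for the nanochat text base)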
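
For the feature pre-extraction item, the on-disk layout can be sketched with the standard library alone. A WebDataset shard is an ordinary tar file in which consecutive members sharing a basename form one sample. Everything else here is an assumption for illustration: the function names, the `.bin` / `.txt` extensions, and the raw little-endian float32 encoding. A real pipeline would more likely use the webdataset library's own writer and a tensor format.

```python
import io
import struct
import tarfile


def add_member(tar: tarfile.TarFile, name: str, payload: bytes) -> None:
    """Append one in-memory file to an open tar archive."""
    info = tarfile.TarInfo(name=name)
    info.size = len(payload)
    tar.addfile(info, io.BytesIO(payload))


def write_shard(path, samples) -> None:
    """Write (key, feature_floats, transcript) samples into one tar shard.

    Members of one sample share the key as basename and are written back to
    back, which is how WebDataset groups files into samples.
    """
    with tarfile.open(path, "w") as tar:
        for key, feats, text in samples:
            # Hypothetical encoding: flat little-endian float32 values.
            add_member(tar, f"{key}.bin", struct.pack(f"<{len(feats)}f", *feats))
            add_member(tar, f"{key}.txt", text.encode("utf-8"))
```

Storing features this way means the frozen Whisper encoder runs once per utterance, and projector training afterwards only streams tar shards.
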
## W3: S2 instruction tuning + LoRA
- [ ] Hook LoRA into nanochat `Linear` (rank=16, attention/MLP only)
- [ ] Mix of 50k audio instruction samples (AudioBench + self-synthesized)
- [ ] eval: self-built 200-question AudioBench-mini
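
To give a feel for what rank=16 costs, the extra trainable parameters of one LoRA-adapted linear layer can be counted directly. This is a small arithmetic sketch: the helper names are hypothetical and the 768-wide layer in the usage note is an example size, not a figure from this repo.

```python
def lora_extra_params(d_in: int, d_out: int, rank: int = 16) -> int:
    """Trainable parameters LoRA adds to one d_in -> d_out linear layer.

    LoRA learns a low-rank update B @ A on top of the frozen weight, with
    A shaped (rank, d_in) and B shaped (d_out, rank).
    """
    return rank * (d_in + d_out)


def lora_fraction(d_in: int, d_out: int, rank: int = 16) -> float:
    """Extra LoRA parameters relative to the frozen d_out x d_in weight."""
    return lora_extra_params(d_in, d_out, rank) / (d_in * d_out)
```

For a hypothetical 768 x 768 projection, `lora_extra_params(768, 768)` is 24576 against 589824 frozen weights, about 4% of the layer.
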
## W4: MVP demo
- [ ] Reuse `scripts/chat_web.py`, add recording upload
- [ ] AudioBench-mini accuracy ≥40% (baseline 25%)
- [ ] End-to-end first-token latency <2 s on a 4090
## W5+: scale-up / texture data / vision
See research §4.1; to be fleshed out in W5-W8.
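## Decisions
- **backbone**: self-trained nanochat d12 → d20 → d26 (no borrowing of off-the-shelf gemma/qwen; keep the hackable spirit)
- **Order**: audio first, vision at W7+; multimodal output (TTS/imagegen) is out of scope
- **infra**: training and smoke CI both run on ailab (5090, 32G); CN mirrors are sjtu/aliyun (pip), modelscope (model weights, preferred), and hf-mirror (fallback for HF datasets / weights)
- **monorepo fork pattern**: upstream nanochat's code is our code; omni changes go directly into `nanochat/`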
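## Parked / TBD
- [ ] Vision path: starts at W7+, following the LLaVA recipe and sharing the Projector abstraction with audio
- [ ] Self-synthesized texture data: generate emotional variants with CosyVoice or IndexTTS on ailab (servers already exist on s1/i7; the cross-machine data production pipeline is TBD)
- [ ] Real-hardware tests on B40 / GB10: the MVP does not depend on them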