From 9dba7422bfc272e8edbd174225b95c3d53476789 Mon Sep 17 00:00:00 2001
From: Fam Zheng
Date: Tue, 5 May 2026 22:17:36 +0100
Subject: [PATCH] doc: rename docs/ → doc/ and add todo.md
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Concrete near-term tasks (monorepo fork prep + W1 audio smoke) on top of
mochi's research feasibility doc.
---
 {docs => doc}/research_feasibility.md |  0
 doc/todo.md                           | 62 +++++++++++++++++++++++++++
 2 files changed, 62 insertions(+)
 rename {docs => doc}/research_feasibility.md (100%)
 create mode 100644 doc/todo.md

diff --git a/docs/research_feasibility.md b/doc/research_feasibility.md
similarity index 100%
rename from docs/research_feasibility.md
rename to doc/research_feasibility.md
diff --git a/doc/todo.md b/doc/todo.md
new file mode 100644
index 0000000..16fcba8
--- /dev/null
+++ b/doc/todo.md
@@ -0,0 +1,62 @@
+# nanochat-omni TODO
+
+Positioning: **texture-aware speech input** (audio-first, text-only output, vision deferred).
+Reference: the W1–W8 schedule in [research_feasibility.md](research_feasibility.md) (mochi, 2026-05-05).
+
+---
+
+## Near term: repo structure / engineering infrastructure
+
+- [ ] **Flatten the submodule into a monorepo fork**
+  - `git remote add upstream https://github.com/karpathy/nanochat.git`
+  - Rewrite `main`: `git reset --hard upstream/main`, then cherry-pick our 7 commits (CI / smoke / wandb / gitignore / README)
+  - Force push (nobody has forked the repo, so this is safe); delete `.gitmodules` and `upstream/nanochat/`
+  - To pull upstream updates afterwards: `git fetch upstream && git merge upstream/main`
+- [ ] Apply the CN mirror patch directly to `pyproject.toml` / `nanochat/dataset.py` (no sed needed after the fork)
+- [ ] Update the CI smoke paths to match the fork (`upstream/nanochat/` → repo root)
+
+## W1: Whisper encoder + Projector forward smoke
+
+See the module diagram in research §1.2.
+
+- [ ] `nanochat/audio.py`: WhisperEncoder wrapper (frozen, weights pulled from the HF mirror) + Projector (MLP, output dimension matched to nanochat `model_dim`)
+- [ ] `nanochat/gpt.py`: add an optional `audio_features` argument to `GPT.forward()`, prepended as soft tokens ahead of the text embeddings
+- [ ] Mini dataset: 1–10 wav clips of 5 s each with transcripts, stored under `data/audio_smoke/` (no audio in git, only a manifest + download script)
+- [ ] `scripts/audio_align_smoke.py`: 50 steps on the d6 nanochat base; passes if the loss goes down
+- [ ] Add an audio smoke job to CI (install ffmpeg on the ailab runner; loading whisper through transformers is enough)
+
+## W2: S1 weak-alignment training
+
+- [ ] Pull LibriSpeech 100h (HF mirror); precompute Whisper-base encoder features and write them to disk as webdataset
+- [ ] `scripts/audio_align_train.py`: freeze the LM and Whisper, train only the Projector
+- [ ] PCA visualization of alignment quality (do the projected features cluster in the text-embedding space?)
+- [ ] wandb project: `nanochat-omni-audio` (kept separate from `nanochat`, the project for the nanochat text base)
+
+## W3: S2 instructions + LoRA
+
+- [ ] Add LoRA to nanochat `Linear` layers (rank=16, attention/MLP only)
+- [ ] Audio instruction data mix of 50k samples (AudioBench + self-synthesized)
+- [ ] eval: self-built 200-question AudioBench-mini
+
+## W4: MVP demo
+
+- [ ] Reuse `scripts/chat_web.py` and add recording upload
+- [ ] AudioBench-mini accuracy ≥40% (baseline 25%)
+- [ ] End-to-end time to first token <2 s on a 4090
+
+## W5+: scale-up / texture data / vision
+
+See research §4.1; to be expanded during W5–W8.
+
+## Decisions
+
+- **backbone**: self-trained nanochat d12 → d20 → d26 (no off-the-shelf gemma/qwen; keep the hackable spirit)
+- **order**: audio first, vision from W7 onward, no multimodal output (TTS/imagegen)
+- **infra**: training and smoke CI both run on ailab (5090, 32G); CN mirrors via sjtu/aliyun/hf-mirror
+- **monorepo fork pattern**: upstream nanochat code is our code; omni changes go straight into the `nanochat/` package
+
+## Shelved / TBD
+
+- [ ] vision path: starts W7+, following the LLaVA recipe and reusing the Projector abstraction from audio
+- [ ] Self-synthesized texture data: use CosyVoice or IndexTTS on ailab to generate emotion variants (servers already exist on s1/i7; the cross-machine data production pipeline is TBD)
+- [ ] B40 / GB10 real-hardware testing: the MVP does not depend on it
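
Reviewer sketch, not part of the patch: the W1 item for `scripts/audio_align_smoke.py` only says the smoke passes if the loss goes down over 50 steps. One possible way to make that check robust to step-to-step noise is to compare the mean of the first few losses with the mean of the last few. The function name `loss_went_down` and the parameters `window` and `min_rel_drop` are hypothetical and do not come from the TODO or from nanochat.

```python
from statistics import mean


def loss_went_down(losses, window=5, min_rel_drop=0.0):
    """Return True if the tail of the loss curve sits below its head.

    losses:       per-step training losses, in step order.
    window:       number of steps averaged at each end (assumed value).
    min_rel_drop: required relative drop, e.g. 0.05 for 5% (assumed knob).
    """
    if len(losses) < 2 * window:
        raise ValueError("need at least 2 * window loss values")
    head = mean(losses[:window])    # average loss over the first steps
    tail = mean(losses[-window:])   # average loss over the last steps
    return tail < head * (1.0 - min_rel_drop)


if __name__ == "__main__":
    # Synthetic 50-step curve standing in for a real smoke run.
    fake_losses = [5.0 * 0.98 ** step for step in range(50)]
    print("smoke pass:", loss_went_down(fake_losses))
```

Averaging over a window rather than comparing only the first and last step is a design choice to keep a single noisy batch from flipping the CI result; the window size would need tuning against real runs.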