From 9dba7422bfc272e8edbd174225b95c3d53476789 Mon Sep 17 00:00:00 2001
From: Fam Zheng <fam@euphon.net>
Date: Tue, 5 May 2026 22:17:36 +0100
Subject: [PATCH] =?UTF-8?q?doc:=20rename=20docs/=20=E2=86=92=20doc/=20and?=
 =?UTF-8?q?=20add=20todo.md?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Concrete near-term tasks (monorepo fork prep + W1 audio smoke) on
top of mochi's research feasibility doc.
---
 {docs => doc}/research_feasibility.md |  0
 doc/todo.md                           | 62 +++++++++++++++++++++++++++
 2 files changed, 62 insertions(+)
 rename {docs => doc}/research_feasibility.md (100%)
 create mode 100644 doc/todo.md

diff --git a/docs/research_feasibility.md b/doc/research_feasibility.md
similarity index 100%
rename from docs/research_feasibility.md
rename to doc/research_feasibility.md
diff --git a/doc/todo.md b/doc/todo.md
new file mode 100644
index 0000000..16fcba8
--- /dev/null
+++ b/doc/todo.md
@@ -0,0 +1,62 @@
+# nanochat-omni TODO
+
+Positioning: **texture-aware speech input** (audio-first, text-only output, vision deferred to a later phase).
+Reference: the W1–W8 timeline in [research_feasibility.md](research_feasibility.md) (mochi, 2026-05-05).
+
+---
+
+## Near term — repo structure / engineering infrastructure
+
+- [ ] **Flatten the submodule into a monorepo fork**
+  - `git remote add upstream https://github.com/karpathy/nanochat.git`
+  - Rewrite `main`: `git fetch upstream`, then `git reset --hard upstream/main` + cherry-pick our 7 commits (CI / smoke / wandb / gitignore / README)
+  - Force push (nobody has forked the repo, so it is safe), then delete `.gitmodules` + `upstream/nanochat/`
+  - To pull upstream updates: `git fetch upstream && git merge upstream/main`
+- [ ] Apply the CN mirror patch directly to `pyproject.toml` / `nanochat/dataset.py` (no more sed after the fork)
+- [ ] Re-path the CI smoke to follow the fork (`upstream/nanochat/` → repo root)
+
+## W1 — Whisper encoder + Projector forward smoke
+
+See the module diagram in research §1.2.
+
+- [ ] `nanochat/audio.py`: WhisperEncoder wrapper (frozen, weights pulled from the HF mirror) + Projector (MLP, output dimension matched to nanochat's `model_dim`)
+- [ ] `nanochat/gpt.py`: add an optional `audio_features` argument to `GPT.forward()`, prepended as soft tokens ahead of the text embeddings
+- [ ] Mini dataset: 1–10 wav clips of 5 s each + transcripts, stored under `data/audio_smoke/` (no audio in git, only a manifest + download script)
+- [ ] `scripts/audio_align_smoke.py`: 50 steps on the d6 nanochat base; passes if the loss goes down
+- [ ] Add an audio smoke job to CI (install ffmpeg on the ailab runner; Whisper via transformers is enough)
+
+## W2 — S1 weak alignment training
+
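+- [ ] Pull LibriSpeech 100h (HF mirror), pre-extract Whisper-base encoder features and store them as webdataset shards
+- [ ] `scripts/audio_align_train.py`: freeze the LM + Whisper, train only the Projector
+- [ ] PCA visualization of alignment quality (do the projected features cluster in the text embedding space?)
+- [ ] wandb project: `nanochat-omni-audio` (kept separate from `nanochat`, the project for the nanochat text base)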
+
+## W3 — S2 instruction tuning + LoRA
+
+- [ ] Hook LoRA into nanochat's `Linear` (rank=16, attention/MLP only)
+- [ ] Mix of 50k audio instruction samples (AudioBench + self-synthesized)
+- [ ] eval: self-built 200-question AudioBench-mini
+
+## W4 — MVP demo
+
+- [ ] Reuse `scripts/chat_web.py`, add recording upload
+- [ ] AudioBench-mini accuracy ≥40% (baseline 25%)
+- [ ] End-to-end time to first token <2s on a 4090
+
+## W5+ — scale-up / texture data / vision
+
+See research §4.1; to be expanded during W5–W8.
+
+## Decisions
+
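+- **backbone**: self-trained nanochat d12 → d20 → d26 (no borrowing an off-the-shelf gemma/qwen; keep the hackable spirit)
+- **order**: audio first, vision scheduled for W7+, no multimodal output (TTS/imagegen)
+- **infra**: training + smoke CI both run on ailab (5090, 32G); CN mirrors go through sjtu/aliyun/hf-mirror
+- **monorepo fork pattern**: upstream nanochat's code is our code; omni changes go straight into the `nanochat/` package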
+
+## Deferred / TBD
+
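+- [ ] vision path: start at W7+, follow the LLaVA recipe, share the Projector abstraction with audio
+- [ ] Self-synthesized texture data: generate emotional variants with ailab CosyVoice or IndexTTS (servers already exist on s1/i7; the cross-machine data production pipeline is TBD)
+- [ ] B40 / GB10 real-hardware testing: not required for the MVP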