Fam Zheng b585e07dc2
omni: CI smoke + docs + README preamble
- .gitea/workflows/smoke.yml: gitea CI on ailab gpu runner (manual git
  clone since actions/checkout@v4 mis-resolves subpath gitea); injects
  WANDB_API_KEY + CI_RUN_TAG=smoke-$run_number
- scripts/smoke.sh: in-place smoke (uv sync + 1 shard + tokenizer +
  d=6 50-step base_train); idempotent cache at /data/nanochat-smoke/
- doc/research_feasibility.md: voice-first multimodal feasibility study (mochi)
- doc/todo.md: phase-by-phase roadmap (W1 Whisper smoke → W4 MVP)
- README.md: omni preamble pointing at upstream nanochat README
- .gitignore: exclude .claude/ runtime files
2026-05-05 22:21:31 +01:00

# nanochat-omni TODO
Positioning: **voice-texture-aware speech input** (audio-first, text-only output; vision comes later).

Reference: [research_feasibility.md](research_feasibility.md) (mochi, 2026-05-05), W1–W8 schedule.

---
## Near term — repo structure / engineering infrastructure
- [ ] **Flatten the submodule into a monorepo fork**
  - `git remote add upstream https://github.com/karpathy/nanochat.git`
  - Rewrite `main`: `git reset --hard upstream/main` + cherry-pick our 7 commits (CI / smoke / wandb / gitignore / README)
  - Force push (nobody has forked the repo, so this is safe), then delete `.gitmodules` + `upstream/nanochat/`
  - To pull upstream updates: `git fetch upstream && git merge upstream/main`
- [ ] Apply the CN mirror patch directly to `pyproject.toml` / `nanochat/dataset.py` (no more sed after the fork)
- [ ] Re-path the CI smoke after the fork (`upstream/nanochat/` → repo root)
## W1 — Whisper encoder + Projector forward smoke
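See the module diagram in research §1.2.
- [ ] `nanochat/audio.py`: WhisperEncoder wrapper (frozen, weights pulled from the HF mirror) + Projector (MLP, output dimension aligned with nanochat `model_dim`)
- [ ] `nanochat/gpt.py`: add an optional `audio_features` argument to `GPT.forward()`, prepended as soft tokens in front of the text embeddings
- [ ] mini dataset: 110 wav clips of 5s each + transcripts, stored under `data/audio_smoke/` (no audio in git, only a manifest + download script)
- [ ] `scripts/audio_align_smoke.py`: 50 steps on the d6 nanochat base; passes if the loss goes down
- [ ] Add an audio smoke job to CI (install ffmpeg on the ailab runner; Whisper can simply go through transformers)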
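
The soft-token prepend described in the W1 items can be sketched at the shape level as follows. This is only an illustration in NumPy, not the planned `nanochat/audio.py` or `GPT.forward()` code: the function names, the ReLU activation, and all sizes (including the encoder width of 512) are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def projector_mlp(audio_feats, w1, b1, w2, b2):
    """Two-layer MLP: (B, T_audio, enc_dim) -> (B, T_audio, model_dim)."""
    # ReLU is an assumed activation; the real Projector may use another one.
    hidden = np.maximum(audio_feats @ w1 + b1, 0.0)
    return hidden @ w2 + b2

def prepend_soft_tokens(audio_tokens, text_emb):
    """Concatenate along the sequence axis: audio soft tokens first, text second."""
    return np.concatenate([audio_tokens, text_emb], axis=1)

# Hypothetical sizes, for illustration only.
batch, t_audio, t_text = 2, 8, 5
enc_dim, hidden_dim, model_dim = 512, 256, 384

audio_feats = rng.standard_normal((batch, t_audio, enc_dim))  # stand-in for frozen encoder output
text_emb = rng.standard_normal((batch, t_text, model_dim))    # stand-in for token embeddings

w1 = rng.standard_normal((enc_dim, hidden_dim)) * 0.02
b1 = np.zeros(hidden_dim)
w2 = rng.standard_normal((hidden_dim, model_dim)) * 0.02
b2 = np.zeros(model_dim)

audio_tokens = projector_mlp(audio_feats, w1, b1, w2, b2)
fused = prepend_soft_tokens(audio_tokens, text_emb)
print(fused.shape)  # (2, 13, 384): 8 audio positions followed by 5 text positions
```

The transformer then sees a sequence of length `t_audio + t_text`, so any loss mask or position handling has to account for the extra leading positions.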
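## W2 — S1 weak alignment training
- [ ] Pull LibriSpeech 100h (HF mirror), pre-extract Whisper-base encoder features and write them to disk as webdataset
- [ ] `scripts/audio_align_train.py`: freeze the LM + Whisper, train only the Projector
- [ ] Visualize alignment with PCA (do the features cluster in the text embedding space?)
- [ ] wandb project: `nanochat-omni-audio` (kept separate from `nanochat`, the project for the nanochat text base)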
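
One possible way to run the PCA check is to pool projected audio features with text embeddings and look at them in a shared 2D projection. The sketch below uses NumPy's SVD; the arrays are random placeholders rather than real features, and `pca_2d` is a hypothetical helper name.

```python
import numpy as np

def pca_2d(points):
    """Project the rows of `points` (N, D) onto their top-2 principal components."""
    centered = points - points.mean(axis=0, keepdims=True)
    # Rows of vt are principal directions, ordered by decreasing singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T

rng = np.random.default_rng(0)
dim, n = 64, 100
audio_proj = rng.standard_normal((n, dim)) + 3.0  # placeholder: projected audio features
text_emb = rng.standard_normal((n, dim))          # placeholder: text token embeddings

# Fit one PCA on the pooled points so both modalities share the same axes.
coords = pca_2d(np.concatenate([audio_proj, text_emb], axis=0))
audio_xy, text_xy = coords[:n], coords[n:]

# Distance between the two centroids in the 2D view; it should shrink as alignment improves.
gap = np.linalg.norm(audio_xy.mean(axis=0) - text_xy.mean(axis=0))
print(coords.shape, round(float(gap), 2))
```

A scatter plot of `audio_xy` and `text_xy` in two colors gives the visual check; the centroid gap is just one scalar that could also be logged to wandb.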
## W3 — S2 instruction tuning + LoRA
- [ ] Hook LoRA into nanochat `Linear` (rank=16, attention/MLP only)
- [ ] 50k-sample audio instruction data mix (AudioBench + self-synthesized)
- [ ] eval: self-built 200-question AudioBench-mini
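
As a rough picture of what hooking LoRA into a `Linear` involves, here is a forward-only NumPy sketch. It is not the nanochat integration: the class name `LoRALinear`, the `alpha` scaling value, the init scale, and the layer sizes are assumptions; only rank=16 comes from the item above.

```python
import numpy as np

class LoRALinear:
    """Frozen base weight plus a trainable low-rank update: W + (alpha / rank) * B @ A."""

    def __init__(self, weight, rank=16, alpha=16.0, seed=0):
        out_dim, in_dim = weight.shape
        rng = np.random.default_rng(seed)
        self.weight = weight                                  # frozen base weight
        self.a = rng.standard_normal((rank, in_dim)) * 0.01   # trainable, small random init
        self.b = np.zeros((out_dim, rank))                    # trainable, zero init
        self.scale = alpha / rank                             # alpha is an assumed hyperparameter

    def __call__(self, x):
        base = x @ self.weight.T
        update = (x @ self.a.T) @ self.b.T
        return base + self.scale * update

    def num_trainable(self):
        return self.a.size + self.b.size

rng = np.random.default_rng(1)
base_weight = rng.standard_normal((32, 48))  # (out_dim, in_dim), hypothetical sizes
layer = LoRALinear(base_weight, rank=16)
x = rng.standard_normal((4, 48))

same_at_init = np.allclose(layer(x), x @ base_weight.T)
print(same_at_init)           # True: B starts at zero, so the adapter is a no-op at init
print(layer.num_trainable())  # 1280 = 16 * (48 + 32)
```

Zero-initializing `B` means the wrapped model starts out identical to the base model, so instruction tuning begins from the aligned checkpoint rather than from a perturbed one.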
## W4 — MVP demo
- [ ] Reuse `scripts/chat_web.py`, add recording upload
- [ ] AudioBench-mini accuracy ≥40% (baseline 25%)
- [ ] End-to-end time to first token <2s on a 4090
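
The two numeric targets above could be checked with a small gate like the one below. This is a sketch using only the standard library; the helper names and the fake token stream are hypothetical, and the thresholds are the ones listed in the W4 items.

```python
import time

def accuracy(predictions, answers):
    """Fraction of exact matches between predictions and reference answers."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)

def time_to_first_token(token_stream):
    """Return (first_token, seconds elapsed until it arrived)."""
    start = time.perf_counter()
    first = next(iter(token_stream))
    return first, time.perf_counter() - start

def mvp_gate(acc, first_token_s, min_acc=0.40, max_latency_s=2.0):
    """True when accuracy is at least 40% and the first token arrives in under 2s."""
    return acc >= min_acc and first_token_s < max_latency_s

def fake_stream():
    # Stand-in for the real model's streaming generator.
    time.sleep(0.01)
    yield "hello"
    yield "world"

acc = accuracy(["a", "b", "c", "d"], ["a", "b", "x", "y"])
first, latency = time_to_first_token(fake_stream())
print(acc, first, mvp_gate(acc, latency))
```

For the latency target, the timer should start when the audio upload is received rather than when decoding begins, since the item asks for an end-to-end number.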
## W5+ — scale-up / voice-texture data / vision
See research §4.1; to be expanded in W5–W8.
## Decisions
- **backbone**: self-trained nanochat d12 → d20 → d26 (no off-the-shelf gemma/qwen; keep the hackable spirit)
- **order**: audio first, vision from W7+, no multimodal output (TTS/imagegen)
- **infra**: training + smoke CI both run on ailab (5090, 32G); CN mirrors go through sjtu/aliyun/hf-mirror
- **monorepo fork pattern**: upstream nanochat's code is our code; omni changes go directly into `nanochat/`
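## On hold / undecided
- [ ] vision path: start at W7+, follow the LLaVA recipe, and reuse the Projector abstraction shared with audio
- [ ] Self-synthesized voice-texture data: use ailab CosyVoice or IndexTTS to generate emotional variants (servers already exist on s1/i7; the cross-machine data production pipeline is still to be decided)
- [ ] B40 / GB10 real-hardware testing: the MVP does not depend on it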