omni: CI smoke + docs + README preamble
smoke / nanochat-smoke (push) Successful in 2m30s

- .gitea/workflows/smoke.yml: gitea CI on ailab gpu runner (manual git
  clone since actions/checkout@v4 mis-resolves subpath gitea); injects
  WANDB_API_KEY + CI_RUN_TAG=smoke-$run_number
- scripts/smoke.sh: in-place smoke (uv sync + 1 shard + tokenizer +
  d=6 50-step base_train); idempotent cache at /data/nanochat-smoke/
- doc/research_feasibility.md: voice-first multimodal feasibility study (mochi)
- doc/todo.md: phase-by-phase roadmap (W1 Whisper smoke → W4 MVP)
- README.md: omni preamble pointing at upstream nanochat README
- .gitignore: exclude .claude/ runtime files
Fam Zheng
2026-05-05 22:21:31 +01:00
parent 7939990181
commit b585e07dc2
6 changed files with 367 additions and 1 deletions
@@ -0,0 +1,62 @@
# nanochat-omni TODO
Positioning: **texture-aware speech input** (audio-first, text-only output; vision deferred to later phases).
Reference: the W1-W8 timeline in [research_feasibility.md](research_feasibility.md) (mochi, 2026-05-05).
---
## Near term: repo structure / engineering infrastructure
- [ ] **Flatten the submodule into a monorepo fork**
  - `git remote add upstream https://github.com/karpathy/nanochat.git`
  - Rewrite `main`: `git reset --hard upstream/main`, then cherry-pick our 7 commits (CI / smoke / wandb / gitignore / README)
  - Force-push (safe, since nobody has forked the repo), then delete `.gitmodules` and `upstream/nanochat/`
  - To pull upstream updates: `git fetch upstream && git merge upstream/main`
- [ ] Apply the CN mirror patch directly to `pyproject.toml` / `nanochat/dataset.py` (no sed needed after the fork)
- [ ] Re-path the CI smoke to follow the fork (`upstream/nanochat/` → repo root)
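## W1: Whisper encoder + Projector forward smoke
See the module diagram in research §1.2.
- [ ] `nanochat/audio.py`: WhisperEncoder wrapper (frozen, weights pulled from the HF mirror) + Projector (MLP, output dimension matched to nanochat's `model_dim`)
- [ ] `nanochat/gpt.py`: add an optional `audio_features` argument to `GPT.forward()`, prepended as soft tokens ahead of the text embeddings
- [ ] mini dataset: 110 wav clips of 5s each, with transcripts, stored under `data/audio_smoke/` (no audio committed to git, only a manifest + download script)
- [ ] `scripts/audio_align_smoke.py`: 50 steps on the d6 nanochat base; passes if the loss goes down
- [ ] Add an audio smoke job to CI (install ffmpeg on the ailab runner; whisper can simply go through transformers)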
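
The soft-token idea in the `GPT.forward()` item can be sketched without any framework. This is a minimal illustration, not the planned implementation: the helper names (`init_linear`, `project`, `prepend_soft_tokens`) and the toy dimensions are assumptions, and the real code would presumably use PyTorch tensors.

```python
import random

def init_linear(in_dim, out_dim, rng):
    """Return (weight, bias) for a dense layer; weight is out_dim x in_dim."""
    scale = in_dim ** -0.5
    weight = [[rng.uniform(-scale, scale) for _ in range(in_dim)]
              for _ in range(out_dim)]
    return weight, [0.0] * out_dim

def linear(x, layer):
    weight, bias = layer
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weight, bias)]

def project(frames, fc1, fc2):
    """Two-layer MLP with ReLU, applied to each audio frame independently."""
    out = []
    for frame in frames:
        hidden = [max(0.0, h) for h in linear(frame, fc1)]
        out.append(linear(hidden, fc2))
    return out

def prepend_soft_tokens(audio_tokens, text_embeddings):
    """Concatenate along the sequence axis: audio soft tokens first, then text."""
    model_dim = len(text_embeddings[0])
    assert all(len(tok) == model_dim for tok in audio_tokens), \
        "projector output must match model_dim"
    return audio_tokens + text_embeddings

rng = random.Random(0)
ENCODER_DIM, MODEL_DIM = 8, 6  # toy sizes, not real Whisper / nanochat dims
fc1 = init_linear(ENCODER_DIM, 16, rng)
fc2 = init_linear(16, MODEL_DIM, rng)

audio_frames = [[rng.gauss(0, 1) for _ in range(ENCODER_DIM)] for _ in range(3)]
text_embeddings = [[rng.gauss(0, 1) for _ in range(MODEL_DIM)] for _ in range(5)]

sequence = prepend_soft_tokens(project(audio_frames, fc1, fc2), text_embeddings)
print(len(sequence), len(sequence[0]))  # 8 6
```

The text embeddings are left untouched and only shifted right by the number of audio frames, so a real forward pass would likely also need to keep the soft-token positions out of the language-model loss.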
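## W2: S1 weak-alignment training
- [ ] Pull LibriSpeech 100h (HF mirror); pre-extract Whisper-base encoder features and write them to disk as webdataset
- [ ] `scripts/audio_align_train.py`: freeze the LM and Whisper, train only the Projector
- [ ] Visualize the alignment with PCA (do the projected features cluster in the text embedding space?)
- [ ] wandb project: `nanochat-omni-audio` (kept separate from the `nanochat` project used for the nanochat text base)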
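
The "freeze the LM and Whisper, train only the Projector" step comes down to selecting parameters by module. The sketch below models that selection with plain Python objects; the `Param` class and every parameter name are hypothetical. In PyTorch the same effect is usually obtained by calling `requires_grad_(False)` on the frozen modules and passing only the projector's parameters to the optimizer.

```python
from dataclasses import dataclass

@dataclass
class Param:
    name: str
    numel: int
    requires_grad: bool = True

def freeze_all_but(params, trainable_prefix):
    """Mark only parameters under `trainable_prefix` as trainable; return them."""
    for p in params:
        p.requires_grad = p.name.startswith(trainable_prefix)
    return [p for p in params if p.requires_grad]

# Hypothetical parameter names; the real module layout may differ.
params = [
    Param("lm.wte.weight", 1000),
    Param("lm.blocks.0.attn.c_q.weight", 400),
    Param("whisper.encoder.conv1.weight", 300),
    Param("projector.fc1.weight", 128),
    Param("projector.fc2.weight", 96),
]

trainable = freeze_all_but(params, "projector.")
trainable_numel = sum(p.numel for p in trainable)
total_numel = sum(p.numel for p in params)
print([p.name for p in trainable], f"{trainable_numel}/{total_numel} trainable")
```

Logging the trainable/total ratio at startup is a cheap way to catch a backbone that was accidentally left unfrozen.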
## W3: S2 instruction tuning + LoRA
- [ ] Wire LoRA into nanochat's `Linear` layers (rank=16, attention/MLP only)
- [ ] 50k audio-instruction samples, mixed (AudioBench + self-synthesized)
- [ ] eval: a self-built 200-question AudioBench-mini
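
For the LoRA item, a framework-free sketch of the standard formulation: the frozen weight `W` is augmented with a low-rank update `(alpha / r) * B @ A`, where `B` starts at zero so the adapted layer initially matches the base layer. The class name, `alpha` value, and toy dimensions are assumptions; only rank=16 comes from the checklist.

```python
import random

def matvec(matrix, vec):
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]

class LoRALinear:
    """Frozen base weight W plus a trainable low-rank update (alpha / r) * B @ A."""

    def __init__(self, weight, rank=16, alpha=16.0, seed=0):
        rng = random.Random(seed)
        out_dim, in_dim = len(weight), len(weight[0])
        self.weight = weight            # frozen, out_dim x in_dim
        self.scale = alpha / rank
        # A starts random and B starts at zero, so the update is initially a no-op.
        self.A = [[rng.gauss(0, 0.01) for _ in range(in_dim)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(out_dim)]

    def forward(self, x):
        base = matvec(self.weight, x)
        delta = matvec(self.B, matvec(self.A, x))
        return [b + self.scale * d for b, d in zip(base, delta)]

    def num_trainable(self):
        return len(self.A) * len(self.A[0]) + len(self.B) * len(self.B[0])

rng = random.Random(1)
DIM = 128  # toy size, not a real nanochat model_dim
W = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(DIM)]
layer = LoRALinear(W, rank=16)
x = [rng.gauss(0, 1) for _ in range(DIM)]

y_init = layer.forward(x)
print(y_init == matvec(W, x))            # True: zero-initialized B changes nothing
print(layer.num_trainable(), DIM * DIM)  # 4096 16384
```

At this toy size the adapter trains a quarter of the parameters of the full matrix; the saving grows as the layer width increases relative to the rank.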
## W4: MVP demo
- [ ] Reuse `scripts/chat_web.py`; add recording upload
- [ ] AudioBench-mini accuracy ≥40% (baseline 25%)
- [ ] End-to-end time to first token <2s on a 4090
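
The first-token latency target can be checked with a small timing harness around any token generator. This is a sketch under assumptions: `fake_generate` is a stand-in that only sleeps, and the real measurement would wrap whatever streaming generator the chat server exposes.

```python
import time

def time_to_first_token(token_stream):
    """Return (first_token, seconds elapsed until it arrived)."""
    start = time.perf_counter()
    first = next(iter(token_stream))
    return first, time.perf_counter() - start

def fake_generate(prefill_seconds=0.05):
    """Stand-in for audio encode + prefill + decode; not the real chat engine."""
    time.sleep(prefill_seconds)  # simulated encoder + prefill cost
    for tok in ["hello", " world"]:
        yield tok

BUDGET_SECONDS = 2.0  # the <2s target from the checklist
first, elapsed = time_to_first_token(fake_generate())
print(f"first token {first!r} after {elapsed:.3f}s, "
      f"within budget: {elapsed < BUDGET_SECONDS}")
```

Because a generator body only runs on the first `next()`, the timer covers the simulated prefill. With an eager API, the clock would need to start before the request is issued.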
## W5+: scale-up / texture data / vision
See research §4.1; to be fleshed out during W5-W8.
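## Decisions
- **backbone**: self-trained nanochat, d12 → d20 → d26 (no borrowing off-the-shelf gemma/qwen; keep the hackable spirit)
- **order**: audio first, vision scheduled for W7+, no multimodal output (TTS/imagegen)
- **infra**: training and smoke CI both run on ailab (5090, 32G); CN mirrors go through sjtu/aliyun/hf-mirror
- **monorepo fork pattern**: upstream nanochat's code is our code; omni changes go directly into `nanochat/`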
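## Shelved / TBD
- [ ] Vision path: starts at W7+, following the LLaVA recipe and sharing the Projector abstraction with audio
- [ ] Self-synthesized texture data: use CosyVoice or IndexTTS on ailab to generate emotional variants (servers already exist on s1/i7; the cross-machine data production pipeline is TBD)
- [ ] B40 / GB10 hands-on testing: the MVP does not depend on it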