doc: W1 audio smoke summary

Walkthrough of the three W1 commits, the 4090 result (50 steps in ~1s, loss 5.55 → 0.17), and the limitations to keep in mind before reading into the loss-down (LM is also random + tiny vocab, so the drop is mostly memorisation, not Whisper-Projector alignment — W2 freezes the LM specifically to test that). Includes the W2 hand-off checklist.
omni: W1 audio align smoke — synthetic dataset + 50-step script
2026-05-05 22:40:57 +01:00 · 2026-05-05 22:39:20 +01:00 · 2026-05-05 22:39:05 +01:00 · 2026-05-05 22:38:49 +01:00
8 changed files with 561 additions and 6 deletions
@@ -14,3 +14,6 @@ wandb/

 # Claude Code runtime
 .claude/
+
+# W1 audio smoke: regenerated by scripts/audio_smoke_data.py, only manifest is committed
+data/audio_smoke/wavs/
@@ -0,0 +1,5 @@
+{"wav": "wavs/sine_0220.wav", "text": "low tone", "sr": 16000}
+{"wav": "wavs/sine_0330.wav", "text": "mid low tone", "sr": 16000}
+{"wav": "wavs/sine_0440.wav", "text": "middle tone", "sr": 16000}
+{"wav": "wavs/sine_0660.wav", "text": "mid high tone", "sr": 16000}
+{"wav": "wavs/sine_0880.wav", "text": "high tone", "sr": 16000}
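Only the manifest is committed and the wavs are regenerated locally, so a pre-flight check that every listed file exists can save a confusing `FileNotFoundError` later. A minimal sketch, assuming the manifest layout shown above; `missing_wavs` is a hypothetical helper and not part of the repo:

```python
import json
from pathlib import Path


def missing_wavs(data_dir):
    """Return manifest entries whose wav file is absent under data_dir.

    Hypothetical helper: assumes one JSON object per line with a relative
    "wav" path, as in data/audio_smoke/manifest.jsonl.
    """
    data_dir = Path(data_dir)
    missing = []
    with open(data_dir / "manifest.jsonl", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            item = json.loads(line)
            if not (data_dir / item["wav"]).exists():
                missing.append(item["wav"])
    return missing
```

On a fresh clone, before `scripts/audio_smoke_data.py` has run, this would list all five entries.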
@@ -19,12 +19,17 @@

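 See the module diagram in research §1.2.

-- [ ] `nanochat/audio.py`: WhisperEncoder wrapper (frozen; weights preferably from ModelScope, e.g. `iic/Whisper-large-v3` / `iic/Whisper-small`; HF mirror kept as fallback) + Projector (MLP, output dimension matched to nanochat `model_dim`)
-- [ ] `nanochat/gpt.py`: add an optional `audio_features` parameter to `GPT.forward()`, prepended as soft tokens in front of the text embeddings
-- [ ] mini dataset: 1–10 clips of 5 s wav + transcripts under `data/audio_smoke/` (no audio stored in git, only a manifest + download script)
-- [ ] `scripts/audio_align_smoke.py`: 50 steps, d6 nanochat base, passes if the loss decreases
+- [x] `nanochat/audio.py`: WhisperEncoder wrapper (frozen; ModelScope first via `WHISPER_MS_ID`, HF fallback defaulting to `openai/whisper-base`) + Projector (MLP, output dimension matched to nanochat `n_embd`)
+- [x] `nanochat/gpt.py`: add an optional `audio_features` parameter to `GPT.forward()`, prepended as soft tokens in front of the text embeddings (the kv_cache path is not supported yet; targets at audio positions are automatically masked with -1)
+- [x] mini dataset: 5 synthetic 5 s sine clips + transcripts under `data/audio_smoke/` (wavs are generated by `scripts/audio_smoke_data.py` and excluded via gitignore)
+- [x] `scripts/audio_align_smoke.py`: 50 steps, randomly initialized d6 GPT, byte-level tokenizer, passes if the loss decreases (measured on a 4090: ~1 s, 5.55→0.17)
 - [ ] add an audio smoke job to CI (install ffmpeg on the ailab runner; Whisper via transformers is enough)

+Possible W1 follow-ups (parked for the W3+/W5+ texture-perception (质感) tasks):
+
+- Currently uses `last_hidden_state` (the most text-semantic layer); for texture perception, switch to a middle layer / a multi-layer weighted sum / w2v-bert
+- The d6 GPT is randomly initialized, so the alignment signal is really training the LM rather than the projector; true weak alignment comes in W2: load a real base, freeze the LM, and train only the projector
+
 ## W2 — S1 weak-alignment training

 - [ ] pull LibriSpeech 100h (HF mirror), pre-extract Whisper-base encoder features and write them to disk as webdataset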
@@ -0,0 +1,153 @@
+# W1 — Audio path forward smoke: summary
+
+> Stage: W1 (see [`doc/todo.md`](todo.md) / [`doc/research_feasibility.md`](research_feasibility.md) §1.2)
+> Author: @mochi
+> Date: 2026-05-05
+> Status: ✅ path works end to end; CI integration deferred to W2
+
+---
+
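+## 0. Goal
+
+W1 has exactly one goal, **proof of plumbing**:
+
+```
+wav → WhisperEncoder (frozen) → Projector → prepend to text embedding
+    → randomly initialized d6 GPT → CE loss on text only
+```
+
+Get this chain running and check that gradients can flow from the loss back to the Projector. Alignment quality
+and transcription ability are explicitly **not** goals here; those belong to W2/W3.
+
+The pass criterion is deliberately loose: after 50 training steps, `losses[0] - losses[-1] >= 0.5`.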
+
+---
+
+## 1. Implementation slices
+
+All W1 code changes land in three commits, each with a self-contained scope:
+
+| Commit | Scope | Core change |
+|---|---|---|
+| [`d760915`](../../../commit/d760915) | `nanochat/gpt.py` | `forward()` gains an `audio_features` keyword argument |
+| [`7cc94cf`](../../../commit/7cc94cf) | `nanochat/audio.py` (new) | `WhisperEncoder` + `Projector` modules |
+| [`3c1cc33`](../../../commit/3c1cc33) | `scripts/`, `data/` | synthetic dataset + 50-step smoke script |
+
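+### 1.1 Audio prepend in GPT.forward
+
+LLaVA-style: the projector's soft tokens are concatenated **in front of** the text embeddings;
+everything else stays as before. The change is 18 lines. Key points:
+
+- **When to prepend**: after smear, before the transformer trunk. Smear's prev-token
+  semantics are undefined for soft tokens, so it stays a strictly text-only operation.
+- **Rotary positions**: cos/sin used to be sliced by text length; with audio present the slice becomes `[0, T_a + T_text)`.
+  The rotary cache is already over-allocated to `sequence_len * 10` in `__init__`, so it covers the longer sequence.
+- **Value embedding**: `self.value_embeds[i](idx)` needs token ids, so audio positions are filled
+  with 0. `ve_gate` is input-dependent and is expected to learn to suppress these dummy rows; W1 does not address this.
+- **Target alignment**: when targets are passed, `-1` (`ignore_index`) is automatically prepended over the audio positions,
+  so the loss counts text positions only.
+- **Unsupported path**: `kv_cache is not None` hits an `assert`. The KV cache is a prefill
+  + decode protocol, and giving it audio semantics needs a redesign. W1 only runs train-style forwards,
+  so this does not matter yet.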
+
+### 1.2 nanochat/audio.py
+
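+Two classes, no implicit state.
+
+**`WhisperEncoder`**
+
+- Weight loading order: if `WHISPER_MS_ID` is set, try ModelScope first (CN mirror policy, see the
+  decisions section of `doc/todo.md`); if that fails or the variable is unset, use HF (`WHISPER_HF_ID`, default
+  `openai/whisper-base`). The HF path automatically honors `HF_ENDPOINT=hf-mirror.com`,
+  which is compatible with the existing env in `scripts/smoke.sh`.
+- Frozen in `__init__` (`requires_grad = False` + `eval()`), so callers do not have to
+  remember to freeze it: one fewer way to trip up.
+- `preprocess(audios)` goes through transformers' `WhisperFeatureExtractor`;
+  `forward(input_features)` returns the encoder's `last_hidden_state`.
+- **Design compromise**: Whisper pads every clip to 30 s, so the encoder always outputs 1500 frames.
+  A 5 s sample therefore costs 6× the compute it needs. W1 does not optimize this; W2 can switch to streaming chunking.
+
+**`Projector`**
+
+- Two-layer MLP: `in_dim → out_dim → out_dim`, GELU activation, no biases.
+- Uses `nanochat.gpt.Linear`: master weights stay fp32 and are cast to the input dtype (bf16) in forward,
+  matching the style of the main nanochat model.
+- **`fc2` is zero-initialized**: at step 0 the model ignores audio entirely, so training starts from a clean
+  baseline. The audio path is opt-out by default and opt-in only through training. This helps debugging:
+  if forward runs but the loss does not move, the projector not learning is the first place to look.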
+
+### 1.3 W1 smoke
+
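+**Synthetic data** (`scripts/audio_smoke_data.py`): five 5 s sine clips (220/330/440/660/
+880 Hz, plus a second harmonic and Gaussian noise at 1% amplitude so the log-mel of a pure tone does not degenerate), with text labels
+"low / mid low / middle / mid high / high tone" in that order. Files live in `data/audio_smoke/wavs/`
+(gitignored); `manifest.jsonl` is committed. PCM16 is written with the stdlib `wave` module, so no extra dependencies.
+
+Why synthetic rather than LibriSpeech? W1 is a forward proof and does not need realistic data; a network
+dependency would only make the smoke flaky. W2 brings in real data.
+
+**Alignment script** (`scripts/audio_align_smoke.py`):
+- **Byte-level tokenizer**: UTF-8 bytes + a separate `<BOS>`, vocab=257. This sidesteps the `tok_train`
+  prerequisite of the nanochat BPE and keeps W1 fully standalone. W2 switches back to the real BPE.
+- Flow: load the 5 wavs → pre-extract Whisper features once (the encoder is frozen, so recomputing them every step
+  would be wasted work) → 50 steps of AdamW, training projector + LM together.
+- Pass criterion: `losses[0] - losses[-1] >= 0.5`, deliberately loose so that only gross breakage fails it.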
+
+---
+
+## 2. Measured results
+
+cpc-i7 (RTX 4090 24G, bf16, CUDA 12.8):
+
+```
+input_features: (5, 80, 3000)         # batch × n_mels × T_mel
+whisper features: (5, 1500, 512)      # batch × T_a × d_audio (whisper-base)
+GPT: depth=6 n_embd=384 n_head=6
+text idx: (5, 13)                     # longest tokenized transcript (incl. BOS) - 1
+
+step 000 | loss 5.5533
+step 005 | loss 1.7214
+step 010 | loss 0.9479
+...
+step 049 | loss 0.1658
+
+Done 50 steps in 0.9s | start=5.5533 end=0.1658 drop=5.3875
+PASS
+```
+
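+- 50 training steps take ~0.9 s (excluding the first Whisper download and the encoding pass)
+- `tests/`: 13 passed / 10 skipped, so the `forward()` change did not break existing paths
+- Peak GPU memory was not measured; whisper-base + d6 + B=5 should stay under 2 GB (an estimate)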
+
+---
+
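+## 3. Known limitations / left for later
+
+In descending order of how much attention they deserve:
+
+1. **A falling loss does not prove that alignment was learned.** The LM is being trained too, and 5 short strings can easily be
+   memorized by the LM alone. Whether the projector really carries audio information can only be verified in W2 with the LM frozen;
+   at that point the projector → audio path is the only route by which the loss can change.
+2. **`last_hidden_state` is Whisper's most text-semantic layer.** The project's "texture perception" (质感感知)
+   positioning requires non-text signals such as timbre / prosody / emotion to reach the LM; in W3+/W5+
+   we should switch to a middle layer, a multi-layer weighted sum, or simply move to wav2vec2 / w2v-bert
+   (`facebook/w2v-bert-2.0` is already in the cpc-i7 cache).
+3. **Forced 30 s padding** makes a 5 s clip cost 6× the compute it needs. In the short term this is a constant cost of W2 data preparation;
+   in the long term it needs streaming-style chunking or a non-Whisper backbone.
+4. **The CI smoke job is not wired in yet.** W1 runs locally on the 4090; the plan is to add the audio
+   smoke to `scripts/smoke.sh` + `.gitea/workflows/smoke.yml` alongside W2, so everything runs on the ailab
+   runner.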
+
+---
+
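+## 4. Hand-off to W2
+
+The W2 section of [`doc/todo.md`](todo.md) is already written. Key hand-off points:
+
+- **Data**: LibriSpeech 100h (HF mirror), with Whisper-base features pre-extracted into webdataset
+- **Freezing strategy**: freeze **both** Whisper and the LM and train only the Projector. This is the real
+  weak alignment, and also the experiment that tests limitation 1 in §3
+- **Visualization**: run PCA on projector outputs against the LM embedding space and check whether different samples cluster in the
+  text embedding space
+- **wandb**: use a separate project named `nanochat-omni-audio` so runs do not pollute the
+  `nanochat` project used for the nanochat text base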
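The 1500-frame figure and the 6× overhead quoted in the summary follow from Whisper's fixed front end. A back-of-the-envelope sketch, assuming the standard Whisper settings (16 kHz input, a 160-sample hop, a 30 s window, and a stride-2 convolution in the encoder); the constant and function names are illustrative only:

```python
SAMPLE_RATE = 16_000   # Hz, Whisper's expected input rate
HOP_LENGTH = 160       # samples per log-mel frame (10 ms), assumed standard setting
CHUNK_SECONDS = 30     # every clip is padded or trimmed to this length
ENCODER_STRIDE = 2     # the encoder's second conv layer halves the frame count


def mel_frames(seconds):
    """Number of log-mel frames for a clip of the given duration."""
    return int(seconds * SAMPLE_RATE) // HOP_LENGTH


def encoder_frames(seconds):
    """Number of encoder output frames (soft tokens) for that duration."""
    return mel_frames(seconds) // ENCODER_STRIDE


padded = encoder_frames(CHUNK_SECONDS)   # frames the encoder always computes
useful = encoder_frames(5)               # frames that would cover a 5 s clip
print(mel_frames(CHUNK_SECONDS), padded, useful, padded / useful)
```

The first two numbers line up with the `(5, 80, 3000)` and `(5, 1500, 512)` shapes in the measured log; the ratio is the 6× cited for 5 s clips.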
@@ -0,0 +1,116 @@
+"""
+Audio modality for nanochat-omni (W1).
+
+Frozen Whisper encoder produces soft tokens; Projector maps them into nanochat's
+residual stream (n_embd) so they can be prepended to text token embeddings
+LLaVA-style. Output remains text-only.
+
+Weights:
+- ModelScope first when WHISPER_MS_ID is set (e.g. iic/Whisper-small,
+  iic/Whisper-large-v3) — preferred path on CN boxes (ailab/zy/etc).
+- HuggingFace fallback (honors HF_ENDPOINT for hf-mirror).
+
+The encoder is held frozen; only Projector is trained.
+"""
+
+import os
+
+import torch
+import torch.nn as nn
+import torch.nn.functional as F
+
+from nanochat.gpt import Linear
+
+
+def _load_whisper_via_modelscope(ms_id):
+    from modelscope import snapshot_download
+    local_path = snapshot_download(ms_id)
+    from transformers import WhisperModel, WhisperFeatureExtractor
+    extractor = WhisperFeatureExtractor.from_pretrained(local_path)
+    model = WhisperModel.from_pretrained(local_path)
+    return extractor, model.encoder
+
+
+def _load_whisper_via_hf(hf_id):
+    from transformers import WhisperModel, WhisperFeatureExtractor
+    extractor = WhisperFeatureExtractor.from_pretrained(hf_id)
+    model = WhisperModel.from_pretrained(hf_id)
+    return extractor, model.encoder
+
+
+def load_whisper(hf_id="openai/whisper-base", ms_id=None):
+    """Load (feature_extractor, encoder). Tries ModelScope if ms_id is given,
+    falls back to HuggingFace. Returns the .encoder submodule (no decoder)."""
+    ms_id = ms_id or os.environ.get("WHISPER_MS_ID")
+    hf_id = os.environ.get("WHISPER_HF_ID", hf_id)
+    errors = []
+    if ms_id:
+        try:
+            return _load_whisper_via_modelscope(ms_id)
+        except Exception as e:
+            errors.append(f"modelscope({ms_id}): {e}")
+    try:
+        return _load_whisper_via_hf(hf_id)
+    except Exception as e:
+        errors.append(f"hf({hf_id}): {e}")
+        raise RuntimeError("Failed to load Whisper encoder. Tried: " + " | ".join(errors))
+
+
+class WhisperEncoder(nn.Module):
+    """Frozen Whisper encoder. Forward takes log-mel input_features
+    (B, n_mels, T_mel) and returns (B, T_enc, d_model)."""
+
+    def __init__(self, hf_id="openai/whisper-base", ms_id=None, device=None, dtype=None):
+        super().__init__()
+        extractor, encoder = load_whisper(hf_id=hf_id, ms_id=ms_id)
+        self.feature_extractor = extractor
+        self.encoder = encoder
+        for p in self.encoder.parameters():
+            p.requires_grad = False
+        self.encoder.eval()
+        self._d_model = encoder.config.d_model
+        self.sampling_rate = extractor.sampling_rate
+        if device is not None or dtype is not None:
+            self.encoder.to(device=device, dtype=dtype)
+
+    @property
+    def d_model(self):
+        return self._d_model
+
+    def preprocess(self, audio_arrays):
+        """audio_arrays: list of 1D np.float32 (mono, sampling_rate Hz).
+        Returns input_features tensor (B, n_mels, T_mel)."""
+        out = self.feature_extractor(
+            audio_arrays,
+            sampling_rate=self.sampling_rate,
+            return_tensors="pt",
+        )
+        return out.input_features
+
+    @torch.no_grad()
+    def forward(self, input_features):
+        out = self.encoder(input_features=input_features)
+        return out.last_hidden_state
+
+
+class Projector(nn.Module):
+    """LLaVA-style 2-layer MLP: audio_d -> hidden -> n_embd.
+
+    Uses nanochat's Linear so master weights stay fp32 while forward runs in
+    the activation dtype (typically bf16). Matches the convention in gpt.py.
+    """
+
+    def __init__(self, in_dim, out_dim, hidden_dim=None):
+        super().__init__()
+        hidden_dim = hidden_dim or out_dim
+        self.fc1 = Linear(in_dim, hidden_dim, bias=False)
+        self.fc2 = Linear(hidden_dim, out_dim, bias=False)
+        s = (3.0 / in_dim) ** 0.5
+        torch.nn.init.uniform_(self.fc1.weight, -s, s)
+        torch.nn.init.zeros_(self.fc2.weight)
+
+    def forward(self, x):
+        x = self.fc1(x)
+        x = F.gelu(x)
+        x = self.fc2(x)
+        return x
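Zero-initializing `fc2` has a consequence worth spelling out: at step 0 the projector output is exactly zero, `fc2` receives a non-zero gradient, and `fc1` receives none until `fc2` has moved. A dependency-free sketch with scalar weights (a 1-1-1 MLP with hand-written backprop and a squared-error loss). This is a simplification: the real path goes through `norm` and a cross-entropy loss, and all names here are illustrative rather than taken from the repo:

```python
import math


def gelu(x):
    """Exact GELU: x * Phi(x)."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def gelu_grad(x):
    """d/dx GELU(x) = Phi(x) + x * phi(x)."""
    cdf = 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
    pdf = math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)
    return cdf + x * pdf


def forward_backward(w1, w2, x, target):
    """Scalar stand-in for fc1 -> GELU -> fc2 with L = 0.5 * (y - target)^2."""
    h = w1 * x
    a = gelu(h)
    y = w2 * a
    dy = y - target                        # dL/dy
    grad_w2 = dy * a                       # depends on the hidden activation only
    grad_w1 = dy * w2 * gelu_grad(h) * x   # scaled by w2, so zero while w2 == 0
    return y, grad_w1, grad_w2


y, g1, g2 = forward_backward(w1=0.7, w2=0.0, x=1.5, target=1.0)
print(y, g1, g2)
```

So the first optimizer step can only move `fc2`; `fc1` starts learning from the second step onward.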
@@ -413,7 +413,7 @@ class GPT(nn.Module):
            group["initial_lr"] = group["lr"]
        return optimizer

-    def forward(self, idx, targets=None, kv_cache=None, loss_reduction='mean'):
+    def forward(self, idx, targets=None, kv_cache=None, loss_reduction='mean', audio_features=None):
        B, T = idx.size()

        # Grab the rotary embeddings for the current sequence length (they are of shape (1, seq_len, 1, head_dim/2))
@@ -448,6 +448,22 @@ class GPT(nn.Module):
                gate = self.smear_lambda.to(x.dtype) * torch.sigmoid(self.smear_gate(x[:, :, :24]))
                x = x + gate * x_pre_smear

+        # Audio soft-token prepend (LLaVA-style): audio_features must already be projected to n_embd.
+        idx_for_ve = idx
+        if audio_features is not None:
+            assert kv_cache is None, "audio_features prepend not supported with kv_cache"
+            audio = norm(audio_features.to(COMPUTE_DTYPE))
+            T_a = audio.size(1)
+            x = torch.cat([audio, x], dim=1)
+            T_full = T_a + T
+            assert T_full <= self.cos.size(1), f"Sequence length grew beyond rotary cache: {T_full} > {self.cos.size(1)}"
+            cos_sin = self.cos[:, :T_full], self.sin[:, :T_full]
+            idx_pad = torch.zeros((B, T_a), dtype=idx.dtype, device=idx.device)
+            idx_for_ve = torch.cat([idx_pad, idx], dim=1)
+            if targets is not None:
+                pad = torch.full((B, T_a), -1, dtype=targets.dtype, device=targets.device)
+                targets = torch.cat([pad, targets], dim=1)
+
        # Forward the trunk of the Transformer
        x0 = x  # save initial normalized embedding for x0 residual
        n_layer = self.config.n_layer
@@ -455,7 +471,7 @@ class GPT(nn.Module):
        x_backout = None
        for i, block in enumerate(self.transformer.h):
            x = self.resid_lambdas[i] * x + self.x0_lambdas[i] * x0
-            ve = self.value_embeds[str(i)](idx).to(x.dtype) if str(i) in self.value_embeds else None
+            ve = self.value_embeds[str(i)](idx_for_ve).to(x.dtype) if str(i) in self.value_embeds else None
            x = block(x, ve, cos_sin, self.window_sizes[i], kv_cache)
            if i == backout_layer:
                x_backout = x
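The bookkeeping in the audio prepend hunk is easy to check with plain lists. The sketch below mirrors it for a single sequence, without tensors: audio positions get a dummy token id 0 for the value-embedding lookup and a target of -1 so the loss skips them. `prepend_audio` and `graded_positions` are hypothetical helpers for illustration, not code from `gpt.py`:

```python
IGNORE_INDEX = -1  # target value that the loss skips


def prepend_audio(idx, targets, num_audio_tokens):
    """Extend one sequence's token ids and targets to cover T_a audio slots."""
    idx_for_ve = [0] * num_audio_tokens + list(idx)   # dummy ids for value embeds
    full_targets = [IGNORE_INDEX] * num_audio_tokens + list(targets)
    return idx_for_ve, full_targets


def graded_positions(targets):
    """Positions that contribute to the loss."""
    return [i for i, t in enumerate(targets) if t != IGNORE_INDEX]


idx = [256, 104, 105]        # <BOS>, 'h', 'i'
targets = [104, 105, -1]     # next-token targets; the last slot is padding
idx_for_ve, full_targets = prepend_audio(idx, targets, num_audio_tokens=4)
print(idx_for_ve)                      # four zeros followed by the text ids
print(graded_positions(full_targets))  # only the two real text targets remain
```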
@@ -0,0 +1,179 @@
+"""
+W1 smoke: prove the audio path works end-to-end.
+
+Pipeline:
+    wav -> WhisperEncoder (frozen) -> Projector -> prepend to text embeddings
+        -> tiny d6 GPT (random init) -> CE loss on text tokens only
+
+The model is randomly initialized and the LM is trained alongside the projector
+on 5 synthetic sine clips, so this only validates that forward+backward run and
+the loss decreases. Pass criterion: end loss < start loss by a clear margin.
+
+Standalone tokenizer (UTF-8 bytes + a single BOS) so the smoke does not depend
+on the nanochat BPE tokenizer being trained yet — that prerequisite belongs to
+W2 onwards.
+
+Usage:
+    python -m scripts.audio_align_smoke
+"""
+
+import argparse
+import json
+import time
+import wave
+from pathlib import Path
+
+import numpy as np
+import torch
+
+from nanochat.audio import Projector, WhisperEncoder
+from nanochat.common import (
+    COMPUTE_DTYPE,
+    autodetect_device_type,
+    compute_cleanup,
+    compute_init,
+)
+from nanochat.gpt import GPT, GPTConfig
+
+
+# Byte-level tokenizer: vocab[0..255] = raw UTF-8 byte, 256 = <BOS>.
+BOS_ID = 256
+VOCAB_SIZE = 257
+
+
+def encode(text):
+    return [BOS_ID] + list(text.encode("utf-8"))
+
+
+def load_manifest(data_dir):
+    items = []
+    with open(Path(data_dir) / "manifest.jsonl") as f:
+        for line in f:
+            items.append(json.loads(line))
+    return items
+
+
+def load_wav_mono16k(path):
+    """Read a mono PCM16 WAV (matches scripts/audio_smoke_data.py output)."""
+    with wave.open(str(path), "rb") as w:
+        assert w.getnchannels() == 1, f"expected mono, got {w.getnchannels()} channels"
+        assert w.getsampwidth() == 2, f"expected pcm16, got sampwidth {w.getsampwidth()}"
+        sr = w.getframerate()
+        frames = w.readframes(w.getnframes())
+    audio = np.frombuffer(frames, dtype=np.int16).astype(np.float32) / 32768.0
+    return audio, sr
+
+
+def build_gpt(depth, head_dim, max_seq_len, device):
+    base_dim = depth * 64  # nanochat's default aspect ratio
+    model_dim = ((base_dim + head_dim - 1) // head_dim) * head_dim
+    n_head = model_dim // head_dim
+    config = GPTConfig(
+        sequence_len=max_seq_len,
+        vocab_size=VOCAB_SIZE,
+        n_layer=depth,
+        n_head=n_head,
+        n_kv_head=n_head,
+        n_embd=model_dim,
+        window_pattern="L",
+    )
+    with torch.device("meta"):
+        model = GPT(config)
+    model.to_empty(device=device)
+    model.init_weights()
+    return model, config
+
+
+def pack_text_batch(text_ids_list, device):
+    """idx[i, t] is input token; targets[i, t] is the next token (or -1 to ignore).
+    Right-pad to the longest sequence with 0/-1.
+    """
+    in_len = max(len(ids) for ids in text_ids_list) - 1
+    B = len(text_ids_list)
+    idx = torch.zeros((B, in_len), dtype=torch.long, device=device)
+    targets = torch.full((B, in_len), -1, dtype=torch.long, device=device)
+    for i, ids in enumerate(text_ids_list):
+        L = len(ids) - 1
+        idx[i, :L] = torch.tensor(ids[:-1], dtype=torch.long, device=device)
+        targets[i, :L] = torch.tensor(ids[1:], dtype=torch.long, device=device)
+    return idx, targets
+
+
+def main():
+    parser = argparse.ArgumentParser()
+    parser.add_argument("--data-dir", default="data/audio_smoke")
+    parser.add_argument("--depth", type=int, default=6)
+    parser.add_argument("--head-dim", type=int, default=64)
+    parser.add_argument("--max-seq-len", type=int, default=2048)
+    parser.add_argument("--num-iters", type=int, default=50)
+    parser.add_argument("--lr", type=float, default=3e-3)
+    parser.add_argument("--whisper", default="openai/whisper-base",
+                        help="HF Whisper id (override via WHISPER_HF_ID env)")
+    parser.add_argument("--loss-drop-min", type=float, default=0.5,
+                        help="end loss must be at least this much lower than start loss")
+    args = parser.parse_args()
+
+    device_type = autodetect_device_type()
+    ddp, _, _, _, device = compute_init(device_type)
+    assert not ddp, "smoke is single-process"
+
+    # The manifest is committed but wavs/ is gitignored, so regenerate when either is missing.
+    if not all((Path(args.data_dir) / p).exists() for p in ("manifest.jsonl", "wavs")):
+        from scripts.audio_smoke_data import generate_synthetic
+        generate_synthetic(args.data_dir)
+
+    items = load_manifest(args.data_dir)
+    audios = [load_wav_mono16k(Path(args.data_dir) / it["wav"])[0] for it in items]
+    texts = [it["text"] for it in items]
+    print(f"loaded {len(items)} samples: {texts}")
+
+    # Frozen Whisper encoder + Projector to nanochat n_embd
+    print(f"loading Whisper encoder ({args.whisper})...")
+    whisper = WhisperEncoder(hf_id=args.whisper, device=device, dtype=COMPUTE_DTYPE)
+
+    # Pre-extract Whisper input_features and encoder outputs once; encoder is frozen
+    # so its output never changes across training steps -> hoist out of the loop.
+    input_features = whisper.preprocess(audios).to(device=device, dtype=COMPUTE_DTYPE)
+    print(f"input_features: {tuple(input_features.shape)}")
+    audio_feats = whisper(input_features).detach()
+    print(f"whisper features: {tuple(audio_feats.shape)} (T_a soft tokens)")
+
+    # GPT (random init, d6 by default) and Projector
+    gpt, config = build_gpt(args.depth, args.head_dim, args.max_seq_len, device)
+    print(f"GPT: depth={config.n_layer} n_embd={config.n_embd} n_head={config.n_head}")
+    projector = Projector(in_dim=whisper.d_model, out_dim=config.n_embd).to(device=device)
+
+    # Tokenize transcripts and pack into a batch
+    text_ids_list = [encode(t) for t in texts]
+    idx, targets = pack_text_batch(text_ids_list, device=device)
+    print(f"text idx: {tuple(idx.shape)} (max_text_len-1)")
+
+    # Single AdamW over projector + LM. Whisper stays frozen (requires_grad=False
+    # was set in WhisperEncoder.__init__, so its params won't appear here anyway).
+    train_params = list(projector.parameters()) + [p for p in gpt.parameters() if p.requires_grad]
+    optim = torch.optim.AdamW(train_params, lr=args.lr, betas=(0.9, 0.95), weight_decay=0.0)
+
+    losses = []
+    t0 = time.time()
+    for step in range(args.num_iters):
+        soft_tokens = projector(audio_feats)  # (B, T_a, n_embd)
+        loss = gpt(idx, targets=targets, audio_features=soft_tokens)
+        optim.zero_grad(set_to_none=True)
+        loss.backward()
+        torch.nn.utils.clip_grad_norm_(train_params, 1.0)
+        optim.step()
+        losses.append(loss.item())
+        if step % 5 == 0 or step == args.num_iters - 1:
+            print(f"step {step:03d} | loss {loss.item():.4f}")
+    dt = time.time() - t0
+
+    drop = losses[0] - losses[-1]
+    print(f"\nDone {args.num_iters} steps in {dt:.1f}s | start={losses[0]:.4f} end={losses[-1]:.4f} drop={drop:.4f}")
+    assert drop >= args.loss_drop_min, f"loss did not drop enough: {drop:.4f} < {args.loss_drop_min}"
+    print("PASS: audio path forward+backward works, loss is descending.")
+
+    compute_cleanup()
+
+
+if __name__ == "__main__":
+    main()
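To see where the `(5, 13)` batch shape comes from, the tokenizer and packing logic can be replayed without torch. A minimal sketch using plain lists; `decode` and `pack` are illustrative additions (the script itself only defines `encode` and the tensor-based `pack_text_batch`):

```python
BOS_ID = 256  # ids 0..255 are raw UTF-8 bytes


def encode(text):
    return [BOS_ID] + list(text.encode("utf-8"))


def decode(ids):
    """Inverse of encode (illustrative; drops BOS)."""
    return bytes(i for i in ids if i != BOS_ID).decode("utf-8")


def pack(batch):
    """Shift-by-one inputs/targets, right-padded with 0 / -1."""
    in_len = max(len(ids) for ids in batch) - 1
    idx = [ids[:-1] + [0] * (in_len - len(ids) + 1) for ids in batch]
    targets = [ids[1:] + [-1] * (in_len - len(ids) + 1) for ids in batch]
    return idx, targets


texts = ["low tone", "mid low tone", "middle tone", "mid high tone", "high tone"]
idx, targets = pack([encode(t) for t in texts])
print(len(idx), len(idx[0]))   # batch size and padded input length
```

The longest transcript, "mid high tone", is 13 bytes, so it encodes to 14 ids including `<BOS>` and yields 13 input/target pairs; shorter rows are padded up to that length.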
@@ -0,0 +1,78 @@
+"""
+Generate the W1 audio smoke dataset: a handful of 5s sine-wave clips paired
+with deterministic transcripts.
+
+Why synthetic instead of real speech: W1 only proves the forward path
+(WhisperEncoder -> Projector -> GPT prepend) and that the projector's gradient
+flows into a decreasing loss on a tiny fixed set. Real speech adds a network
+dependency to a step that should be reproducible offline. W2 swaps in
+LibriSpeech.
+
+Audio files land under data/audio_smoke/wavs/ (gitignored). The manifest
+data/audio_smoke/manifest.jsonl is the only artifact committed.
+
+Usage:
+    python -m scripts.audio_smoke_data
+"""
+
+import argparse
+import json
+import wave
+from pathlib import Path
+
+import numpy as np
+
+
+SAMPLES = [
+    (220.0, "low tone"),
+    (330.0, "mid low tone"),
+    (440.0, "middle tone"),
+    (660.0, "mid high tone"),
+    (880.0, "high tone"),
+]
+SR = 16000
+DURATION_S = 5.0
+
+
+def synth_sine(freq_hz, duration_s=DURATION_S, sr=SR):
+    """Sine + 2nd harmonic + a sliver of noise so Whisper sees non-degenerate
+    frames (a pure tone collapses to a near-constant log-mel)."""
+    t = np.arange(int(sr * duration_s)) / sr
+    x = 0.5 * np.sin(2 * np.pi * freq_hz * t) + 0.25 * np.sin(2 * np.pi * 2 * freq_hz * t)
+    rng = np.random.default_rng(int(freq_hz))
+    x = x + 0.01 * rng.standard_normal(len(x))
+    return x.astype(np.float32)
+
+
+def write_wav_pcm16(path, audio, sr=SR):
+    """Write mono PCM16 WAV using the stdlib (no scipy/soundfile dependency)."""
+    pcm = np.clip(audio, -1.0, 1.0)
+    pcm = (pcm * 32767.0).astype(np.int16)
+    with wave.open(str(path), "wb") as w:
+        w.setnchannels(1)
+        w.setsampwidth(2)
+        w.setframerate(sr)
+        w.writeframes(pcm.tobytes())
+
+
+def generate_synthetic(data_dir):
+    data_dir = Path(data_dir)
+    wav_dir = data_dir / "wavs"
+    wav_dir.mkdir(parents=True, exist_ok=True)
+    manifest_path = data_dir / "manifest.jsonl"
+    with open(manifest_path, "w") as f:
+        for freq, text in SAMPLES:
+            name = f"sine_{int(freq):04d}.wav"
+            wav_path = wav_dir / name
+            if not wav_path.exists():
+                write_wav_pcm16(wav_path, synth_sine(freq))
+            f.write(json.dumps({"wav": f"wavs/{name}", "text": text, "sr": SR}) + "\n")
+    print(f"Wrote {len(SAMPLES)} samples to {data_dir}")
+    return manifest_path
+
+
+if __name__ == "__main__":
+    parser = argparse.ArgumentParser()
+    parser.add_argument("--data-dir", default="data/audio_smoke")
+    args = parser.parse_args()
+    generate_synthetic(args.data_dir)
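The PCM16 writer above scales by 32767 while the reader in `audio_align_smoke.py` divides by 32768, which is harmless but means the float round trip is not exact. A dependency-free sketch that measures the error, using only `math`, `struct`, and `wave`; the helper names and the temp-file location are illustrative:

```python
import math
import os
import struct
import tempfile
import wave


def write_pcm16(path, samples, sr=16000):
    """Write floats in [-1, 1] as mono PCM16 (scale by 32767, truncate)."""
    ints = [int(max(-1.0, min(1.0, s)) * 32767.0) for s in samples]
    with wave.open(path, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)
        w.setframerate(sr)
        w.writeframes(struct.pack(f"<{len(ints)}h", *ints))


def read_pcm16(path):
    """Read mono PCM16 back to floats (scale by 1/32768)."""
    with wave.open(path, "rb") as w:
        n, sr = w.getnframes(), w.getframerate()
        ints = struct.unpack(f"<{n}h", w.readframes(n))
    return [i / 32768.0 for i in ints], sr


sr = 16000
tone = [0.5 * math.sin(2 * math.pi * 440.0 * t / sr) for t in range(sr // 10)]
path = os.path.join(tempfile.mkdtemp(), "tone.wav")
write_pcm16(path, tone, sr)
restored, sr_back = read_pcm16(path)
max_err = max(abs(a - b) for a, b in zip(tone, restored))
print(sr_back, len(restored), max_err)
```

The error stays well below the 1% noise floor that `synth_sine` adds, so the mismatch does not matter for the smoke.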
Author	SHA1	Message	Date
Fam Zheng	8c7210abeb	doc: W1 audio smoke summary Walkthrough of the three W1 commits, the 4090 result (50 steps in ~1s, loss 5.55 → 0.17), and the limitations to keep in mind before reading into the loss-down (LM is also random + tiny vocab, so the drop is mostly memorisation, not Whisper-Projector alignment — W2 freezes the LM specifically to test that). Includes the W2 hand-off checklist.	2026-05-05 22:40:57 +01:00
Fam Zheng	3c1cc3302f	omni: W1 audio align smoke — synthetic dataset + 50-step script End-to-end smoke proving the audio path: wav -> WhisperEncoder (frozen) -> Projector -> prepend to text embeddings -> tiny d6 GPT (random init) -> CE loss on text only Pass criterion is a plain "loss drops by at least 0.5". On a 4090 the run finishes in ~1 s and goes 5.55 -> 0.17 over 50 steps, so the threshold has plenty of headroom against false positives. Two design calls worth keeping in mind: 1. Synthetic sine clips, not LibriSpeech. W1 is forward-path proof, not alignment quality, and a deterministic offline dataset means no network on the smoke path. data/audio_smoke/manifest.jsonl is the only thing committed; wavs are regenerated by audio_smoke_data.py and gitignored. W2 swaps in real LibriSpeech. 2. Standalone byte-level tokenizer (UTF-8 bytes + a single BOS, vocab=257). Avoids depending on a trained nanochat BPE — the d6 GPT is random anyway, so vocab choice doesn't matter for "does the gradient flow" smoke. W2 onwards uses the real BPE on a real base. Caveat documented in doc/todo.md: because the LM is also random and being trained, the loss-down here mostly reflects the LM memorising 5 short strings, not Whisper-Projector alignment. That's fine for proving plumbing; W2 freezes the LM so projector-only gradient is the only path to lower loss.	2026-05-05 22:39:20 +01:00
Fam Zheng	7cc94cf584	omni: nanochat/audio.py — frozen Whisper encoder + Projector The audio modality module that pairs with the gpt.forward audio_features hook. Two things live here: WhisperEncoder: thin wrapper around transformers' WhisperModel.encoder. - Weight loading prefers ModelScope when WHISPER_MS_ID is set (matches the CN-mirror policy in doc/todo.md — modelscope is first-class for model weights, hf-mirror is fallback). Otherwise falls back to plain HF, with WHISPER_HF_ID as the override and `openai/whisper-base` as the default (the smallest variant that still produces useful features for smoke). - Encoder params have requires_grad=False from __init__ so they never appear in the optimizer's param list. Caller does not need to remember to freeze it. - preprocess() runs the feature extractor; forward() takes (B, n_mels, T_mel) and returns last_hidden_state (B, T_enc, d_model). Whisper pads every clip to 30 s, so T_enc is a constant 1500 regardless of input duration — handy for batching, wasteful for short clips. We accept the waste at W1; W2 can switch to streaming-style chunking. - Note for W3+/W5+: last_hidden_state is the most text-semantic layer. When we start caring about timbre / prosody / emotion ("质感感知"), we should expose middle layers or a learnable weighted sum across layers. Projector: 2-layer MLP (in_dim → out_dim → out_dim) with GELU and the nanochat Linear class so master weights stay fp32 while forward runs in the activation dtype (bf16). fc2 is zero-initialized so the model starts ignoring audio entirely, which gives a clean baseline before any training signal flows through (audio path is opt-out by default, opt-in by training).	2026-05-05 22:39:05 +01:00
Fam Zheng	d760915daa	omni: gpt.forward — optional audio_features for soft-token prepend W1 needs the GPT to consume Whisper-projector outputs as a prefix of "soft tokens" sitting in front of the text token embeddings (LLaVA-style). The change is intentionally minimal: - forward() takes a new optional keyword arg `audio_features` (last parameter, default None) of shape (B, T_a, n_embd). The features must already be projected to n_embd by the caller (Projector lives in nanochat.audio, kept out of GPT itself). - The audio rows are normed (matches the post-wte norm convention) and concatenated after smear so smear stays a strictly text-side op (its prev-token semantics aren't defined for soft tokens, and revisiting that belongs to a later phase). - Rotary embeddings are re-sliced for T_a + T_text. Audio gets positions 0..T_a-1, text shifts from 0..T_text-1 to T_a..T_full-1. The 10× over-allocated rotary cache in __init__ already covers this. - value_embeds lookup uses an idx padded with 0 for audio positions. They feed the v residual but the gate (`ve_gate`) is input-dependent and is expected to learn to suppress the dummy rows; for W1 smoke this is fine. - targets are auto-padded with -1 (ignore_index) over audio positions so the LM is only graded on text predictions. Not yet supported: audio_features with kv_cache. The KV-cache path is a prefill+decode protocol that would need its own audio-aware semantics; W1 runs train-style forwards only, so we just assert.	2026-05-05 22:38:49 +01:00
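The ModelScope-first, HF-fallback loading order described for `nanochat/audio.py` can be separated into "which sources to try" and "try them in order". A minimal sketch with injected loader callables, so no real download API is assumed; `resolve_sources` and `load_with_fallback` are hypothetical names, and only the two environment variables and the default id come from the source:

```python
import os


def resolve_sources(env=os.environ, default_hf_id="openai/whisper-base"):
    """Return the ordered list of (backend, model id) pairs to try."""
    sources = []
    ms_id = env.get("WHISPER_MS_ID")
    if ms_id:
        sources.append(("modelscope", ms_id))   # preferred when configured
    sources.append(("hf", env.get("WHISPER_HF_ID", default_hf_id)))
    return sources


def load_with_fallback(loaders):
    """Try (name, callable) pairs in order and return the first result.

    Error messages are collected so the final exception reports every attempt.
    """
    errors = []
    for name, loader in loaders:
        try:
            return loader()
        except Exception as exc:  # broad on purpose: any failure triggers fallback
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all loaders failed: " + " | ".join(errors))
```

Passing the environment as a plain mapping keeps the resolution logic testable without touching the process environment.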