Compare commits: 9cae824aa5...8c7210abeb (4 commits)

| Author | SHA1 | Date |
|---|---|---|
| | 8c7210abeb | |
| | 3c1cc3302f | |
| | 7cc94cf584 | |
| | d760915daa | |
@@ -14,3 +14,6 @@ wandb/
 
 # Claude Code runtime
 .claude/
+
+# W1 audio smoke: regenerated by scripts/audio_smoke_data.py, only manifest is committed
+data/audio_smoke/wavs/
@@ -0,0 +1,5 @@
{"wav": "wavs/sine_0220.wav", "text": "low tone", "sr": 16000}
{"wav": "wavs/sine_0330.wav", "text": "mid low tone", "sr": 16000}
{"wav": "wavs/sine_0440.wav", "text": "middle tone", "sr": 16000}
{"wav": "wavs/sine_0660.wav", "text": "mid high tone", "sr": 16000}
{"wav": "wavs/sine_0880.wav", "text": "high tone", "sr": 16000}
+9
-4
@@ -19,12 +19,17 @@
 
 See the module diagram in research §1.2.
 
-- [ ] `nanochat/audio.py`: WhisperEncoder wrapper (frozen; weights loaded from ModelScope first, e.g. `iic/Whisper-large-v3` / `iic/Whisper-small`, with the HF mirror kept as fallback) + Projector (MLP, output dimension matched to nanochat `model_dim`)
-- [ ] `nanochat/gpt.py`: `GPT.forward()` gains an optional `audio_features` argument, prepended as soft tokens in front of the text embeddings
-- [ ] mini dataset: 1–10 five-second wav clips + transcripts, stored under `data/audio_smoke/` (no audio in git, only a manifest + download script)
-- [ ] `scripts/audio_align_smoke.py`: 50 steps, d6 nanochat base, passes if the loss drops
+- [x] `nanochat/audio.py`: WhisperEncoder wrapper (frozen; ModelScope first via `WHISPER_MS_ID`, HF fallback defaulting to `openai/whisper-base`) + Projector (MLP, output dimension matched to nanochat `n_embd`)
+- [x] `nanochat/gpt.py`: `GPT.forward()` gains an optional `audio_features` argument, prepended as soft tokens in front of the text embeddings (the kv_cache path is not supported yet; targets at audio positions are automatically masked with -1)
+- [x] mini dataset: 5 five-second synthetic sine clips + transcripts, stored under `data/audio_smoke/` (wavs generated by `scripts/audio_smoke_data.py`, excluded via gitignore)
+- [x] `scripts/audio_align_smoke.py`: 50 steps, randomly initialized d6 GPT, byte-level tokenizer, passes if the loss drops (measured on a 4090: ~1 s, 5.55→0.17)
 - [ ] Add an audio smoke job to CI (install ffmpeg on the ailab runner; running Whisper through transformers is enough)
 
+Possible W1 follow-ups (shelved for now, left to the W3+/W5+ texture tasks):
+
+- Currently uses `last_hidden_state` (the layer most biased toward textual semantics); for texture perception, switch to an intermediate layer / a multi-layer weighted sum / w2v-bert
+- The d6 GPT is randomly initialized, so the alignment signal is really training the LM rather than the projector; true weak alignment only happens in W2, once a real base model is in place, the LM is frozen, and only the projector is trained
+
 ## W2: S1 weak-alignment training
 
 - [ ] Pull LibriSpeech 100h (HF mirror), pre-extract Whisper-base encoder features and write them to disk as webdataset
@@ -0,0 +1,153 @@
# W1: audio path forward smoke, summary

> Stage: W1 (see [`doc/todo.md`](todo.md) / [`doc/research_feasibility.md`](research_feasibility.md) §1.2)
> Author: @mochi
> Date: 2026-05-05
> Status: ✅ Path works end to end; CI integration is deferred to W2

---

## 0. Goal

The sole goal of W1 is **proof of plumbing**:

```
wav → WhisperEncoder (frozen) → Projector → prepend to text embedding
→ randomly initialized d6 GPT → CE loss on text only
```

Get this chain running and verify that gradients flow back from the loss to the Projector. Alignment quality and transcription ability are explicitly **not** goals; those belong to W2/W3.

The pass criterion is deliberately loose: after 50 training steps, `loss[-1] < loss[0] - 0.5`.
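
As a minimal sketch of the pass criterion above (the helper name `smoke_passed` is hypothetical and not part of the repository), the check only compares the first and last recorded losses. Note that the smoke script itself uses `>=` rather than a strict inequality; the sketch follows the script.

```python
def smoke_passed(losses, min_drop=0.5):
    """Return True when the last loss is at least `min_drop` below the first."""
    if len(losses) < 2:
        raise ValueError("need at least two loss values")
    return losses[0] - losses[-1] >= min_drop


# Start and end values as reported for the 50-step run in this document.
print(smoke_passed([5.5533, 0.1658]))
# A run whose loss barely moves does not pass.
print(smoke_passed([5.5533, 5.4000]))
```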

---

## 1. Implementation slices

All W1 code changes land in three commits that are logically decoupled:

| Commit | Scope | Core change |
|---|---|---|
| [`d760915`](../../../commit/d760915) | `nanochat/gpt.py` | `forward()` gains an `audio_features` keyword argument |
| [`7cc94cf`](../../../commit/7cc94cf) | `nanochat/audio.py` (new) | `WhisperEncoder` + `Projector` modules |
| [`3c1cc33`](../../../commit/3c1cc33) | `scripts/`, `data/` | Synthetic dataset + 50-step smoke script |

### 1.1 Audio prepend in GPT.forward

LLaVA-style: the soft tokens produced by the projector are concatenated **in front of** the text embeddings, and everything else stays the same. The change is 18 lines. Key points:

- **When to prepend**: after smear, before the transformer trunk. Smear's prev-token semantics are undefined for soft tokens, so smear stays a strictly "text-only" operation.
- **Rotary positions**: cos/sin used to be sliced by text length; with audio present the slice becomes `[0, T_a + T_text)`. The rotary cache is already over-allocated to `sequence_len * 10` in `__init__`, so it covers the longer sequence.
- **Value embedding**: `self.value_embeds[i](idx)` needs token ids, so audio positions are filled with 0. `ve_gate` is input-dependent and will learn to suppress these fake rows on its own; W1 does not worry about it.
- **Target alignment**: when targets are passed in, `-1` (`ignore_index`) is automatically prepended at the audio positions, so the loss only counts text positions.
- **Unsupported path**: the code simply `assert`s when `kv_cache is not None`. The KV cache is a prefill + decode protocol, and giving it audio semantics needs a redesign. W1 only runs a train-style forward, so this does not matter for now.
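
The target alignment described above can be sketched with plain Python lists instead of tensors. This is an illustration only: `prepend_audio_targets` and `count_supervised` are hypothetical helpers, and the actual change operates on `torch` tensors inside `GPT.forward()`.

```python
IGNORE_INDEX = -1  # positions with this target are skipped by the loss


def prepend_audio_targets(targets, num_audio_tokens):
    """Left-pad each row of `targets` so it lines up with
    [audio soft tokens | text tokens]."""
    pad = [IGNORE_INDEX] * num_audio_tokens
    return [pad + list(row) for row in targets]


def count_supervised(targets):
    """Number of positions that actually contribute to the loss."""
    return sum(t != IGNORE_INDEX for row in targets for t in row)


# Two text rows; the second one already has one right-padded slot.
text_targets = [[108, 111, 119], [104, 105, IGNORE_INDEX]]
full = prepend_audio_targets(text_targets, num_audio_tokens=4)
print(full[0])                 # [-1, -1, -1, -1, 108, 111, 119]
print(count_supervised(full))  # 5
```

The number of supervised positions is unchanged by the prepend, which is the property the `-1` padding is meant to guarantee.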

### 1.2 nanochat/audio.py

Two classes, zero implicit state.

**`WhisperEncoder`**

- Weight loading order: if `WHISPER_MS_ID` is set, try ModelScope first (CN mirror policy; see the decisions section of `doc/todo.md`). If that fails or the variable is unset, fall back to HF (`WHISPER_HF_ID`, default `openai/whisper-base`). The HF path automatically honors `HF_ENDPOINT=hf-mirror.com`, which is compatible with the existing env in `scripts/smoke.sh`.
- `__init__` freezes the encoder right away (`requires_grad = False` + `eval()`), so callers do not have to remember to freeze it: one fewer way to trip up.
- `preprocess(audios)` goes through transformers' `WhisperFeatureExtractor`; `forward(input_features)` returns the encoder's `last_hidden_state`.
- **Design compromise**: Whisper pads every clip to 30 s, so the encoder always outputs 1500 frames. A 5 s sample wastes 6× compute. W1 does not optimize this; W2 can switch to streaming chunking.
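
The 6× figure can be checked with a few lines of arithmetic. The sketch below assumes the usual Whisper front-end constants (16 kHz audio, a 160-sample hop per log-mel frame, and a stride-2 convolution in the encoder); these values are assumptions of the sketch, although they agree with the `(5, 80, 3000)` and `(5, 1500, 512)` shapes logged in section 2.

```python
SAMPLE_RATE = 16_000   # Hz, as in the manifest ("sr": 16000)
HOP_LENGTH = 160       # samples per log-mel frame (assumed Whisper default)
CONV_STRIDE = 2        # encoder downsampling factor (assumed)
PAD_SECONDS = 30       # every clip is padded to this length


def encoder_frames(seconds):
    """Encoder output positions produced by `seconds` of audio."""
    mel_frames = int(seconds * SAMPLE_RATE) // HOP_LENGTH
    return mel_frames // CONV_STRIDE


padded = encoder_frames(PAD_SECONDS)  # positions the encoder always emits
useful = encoder_frames(5)            # positions that cover a 5 s clip
print(padded, useful, padded / useful)  # 1500 250 6.0
```

Only 250 of the 1500 soft tokens correspond to real audio for a 5 s clip; the rest encode padding.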

**`Projector`**

- Two-layer MLP: `in_dim → out_dim → out_dim`, GELU activation, no biases.
- Uses `nanochat.gpt.Linear`: master weights stay in fp32 and are cast to the input dtype (bf16) during forward, matching the style of the main nanochat model.
- **`fc2` is zero-initialized**: at step 0 the model "completely ignores" the audio and starts from a clean baseline. The audio path is opt-out by default and only opts in through training. This is debug-friendly: if forward runs but the loss does not move, the problem can be pinned on the projector not learning right away.
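
To make the zero-initialization point concrete, here is a dependency-free sketch of the same two-layer, bias-free MLP. It is not the repository's `Projector` (which uses `nanochat.gpt.Linear` and torch tensors); `make_projector` and `project` are hypothetical names, and only the `fc1` init bound `(3 / in_dim) ** 0.5` and the zeroed `fc2` are taken from `audio.py`.

```python
import math
import random


def gelu(v):
    """Exact GELU: v * Phi(v), with Phi the standard normal CDF."""
    return 0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0)))


def linear(x, weight):
    """Bias-free linear layer; `weight` has shape (out_dim, in_dim)."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weight]


def make_projector(in_dim, out_dim, seed=0):
    rng = random.Random(seed)
    s = (3.0 / in_dim) ** 0.5  # same uniform bound as fc1 in audio.py
    fc1 = [[rng.uniform(-s, s) for _ in range(in_dim)] for _ in range(out_dim)]
    fc2 = [[0.0] * out_dim for _ in range(out_dim)]  # zero-initialized
    return fc1, fc2


def project(x, fc1, fc2):
    hidden = [gelu(h) for h in linear(x, fc1)]
    return linear(hidden, fc2)


fc1, fc2 = make_projector(in_dim=4, out_dim=3)
out = project([0.3, -1.2, 0.7, 2.0], fc1, fc2)
print(out)  # every entry is zero, whatever the input
```

One consequence worth keeping in mind: because the output is `fc2 @ gelu(fc1 @ x)`, a zero `fc2` also zeroes the gradient that reaches `fc1` on the very first step, so `fc2` is the layer that starts moving first.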

### 1.3 W1 smoke

**Synthetic data** (`scripts/audio_smoke_data.py`): five 5 s sine clips (220/330/440/660/880 Hz, plus a second harmonic and 1% Gaussian noise so that a pure tone's log-mel does not degenerate), with the text labels "low / mid low / middle / mid high / high tone" in that order. The files live in `data/audio_smoke/wavs/` (gitignored), and `manifest.jsonl` is checked into git. PCM16 is written with the stdlib `wave` module, so there are no extra dependencies.

Why synthetic rather than LibriSpeech? W1 is a forward proof and does not need realistic data; a network dependency would only make the smoke test flaky. W2 moves to real data.

**Alignment script** (`scripts/audio_align_smoke.py`):

- **Byte-level tokenizer**: UTF-8 bytes plus a separate `<BOS>`, vocab=257. This sidesteps the `tok_train` prerequisite of the nanochat BPE and keeps W1 fully standalone. W2 switches back to the real BPE.
- Flow: load the 5 wavs → pre-extract Whisper features once (the encoder is frozen, so recomputing them every step would be wasted work) → 50 steps of AdamW, training the projector and the LM together.
- Pass criterion: `losses[0] - losses[-1] >= 0.5`, lenient to the point that it cannot catch any real failure.
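
The byte-level tokenizer and the next-token packing can be illustrated without torch. The sketch below mirrors `encode` from the script and re-expresses the batch packing with plain lists; `pack` is a hypothetical stand-in for `pack_text_batch`, not the function itself.

```python
BOS_ID = 256      # ids 0..255 are raw UTF-8 bytes
VOCAB_SIZE = 257


def encode(text):
    return [BOS_ID] + list(text.encode("utf-8"))


def pack(batch_ids, pad_id=0, ignore=-1):
    """Shift by one for next-token prediction and right-pad to the longest row."""
    width = max(len(ids) for ids in batch_ids) - 1
    inputs, targets = [], []
    for ids in batch_ids:
        n = len(ids) - 1
        inputs.append(ids[:-1] + [pad_id] * (width - n))
        targets.append(ids[1:] + [ignore] * (width - n))
    return inputs, targets


texts = ["low tone", "mid low tone", "middle tone", "mid high tone", "high tone"]
inputs, targets = pack([encode(t) for t in texts])
print(len(inputs), len(inputs[0]))  # 5 13, the (5, 13) shape in the run log
print(inputs[0][:4])                # [256, 108, 111, 119] -> <BOS> l o w
print(targets[0][-5:])              # [-1, -1, -1, -1, -1] -> ignored padding
```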

---

## 2. Measured results

cpc-i7 (RTX 4090 24G, bf16, CUDA 12.8):

```
input_features: (5, 80, 3000) # batch × n_mels × T_mel
whisper features: (5, 1500, 512) # batch × T_a × d_audio (whisper-base)
GPT: depth=6 n_embd=384 n_head=6
text idx: (5, 13) # max(transcript) - 1

step 000 | loss 5.5533
step 005 | loss 1.7214
step 010 | loss 0.9479
...
step 049 | loss 0.1658

Done 50 steps in 0.9s | start=5.5533 end=0.1658 drop=5.3875
PASS
```

- 50 training steps take ~0.9 s (excluding the first Whisper download and the encoding pass)
- `tests/`: 13 passed / 10 skipped, so the `forward()` change did not break existing paths
- Peak GPU memory was not measured; whisper-base + d6 + B=5 should be < 2 GB

---

## 3. Known limitations / left for later

In descending order of how much attention they deserve:

1. **A falling loss does not prove that alignment was learned.** The LM is training too, and 5 short strings can easily be memorized by the LM alone. Whether the projector really carries audio information can only be verified in W2 with the LM frozen; at that point the only path that can change the loss is projector → audio.
2. **`last_hidden_state` is Whisper's most text-semantic layer.** The project's "texture perception" positioning requires non-textual signals such as timbre / prosody / emotion to reach the LM. In W3+/W5+ we should switch to an intermediate layer, a multi-layer weighted sum, or simply move to wav2vec2 / w2v-bert (`facebook/w2v-bert-2.0` is already in the cpc-i7 cache).
3. **Forced 30 s padding** makes short audio waste 6× compute. In the short term this is a constant cost of W2 data preparation; in the long term it needs streaming-style chunking or a switch to a non-Whisper backbone.
4. **The CI smoke job is not wired up yet.** W1 ran locally on a 4090. The plan is to add the audio smoke to `scripts/smoke.sh` + `.gitea/workflows/smoke.yml` during W2 and run everything on the ailab runner.

---

## 4. Handoff to W2

The W2 section of [`doc/todo.md`](todo.md) is already written. Key handoff points:

- **Data**: LibriSpeech 100h (HF mirror), with Whisper-base features pre-extracted into webdataset
- **Freezing strategy**: Whisper and the LM are **both** frozen and only the Projector is trained. This is true weak alignment, and also the experiment that tests limitation 1 of W1
- **Visualization**: run PCA on the projector outputs in the LM embedding space and check whether different samples cluster in the text embedding space
- **wandb**: use a separate project name, `nanochat-omni-audio`, so runs do not pollute the `nanochat` project used by the nanochat text base
@@ -0,0 +1,116 @@
"""
Audio modality for nanochat-omni (W1).

Frozen Whisper encoder produces soft tokens; Projector maps them into nanochat's
residual stream (n_embd) so they can be prepended to text token embeddings
LLaVA-style. Output remains text-only.

Weights:
  - ModelScope first when WHISPER_MS_ID is set (e.g. iic/Whisper-small,
    iic/Whisper-large-v3); preferred path on CN boxes (ailab/zy/etc).
  - HuggingFace fallback (honors HF_ENDPOINT for hf-mirror).

The encoder is held frozen; only Projector is trained.
"""

import os

import torch
import torch.nn as nn
import torch.nn.functional as F

from nanochat.gpt import Linear


def _load_whisper_via_modelscope(ms_id):
    from modelscope import snapshot_download
    local_path = snapshot_download(ms_id)
    from transformers import WhisperModel, WhisperFeatureExtractor
    extractor = WhisperFeatureExtractor.from_pretrained(local_path)
    model = WhisperModel.from_pretrained(local_path)
    return extractor, model.encoder


def _load_whisper_via_hf(hf_id):
    from transformers import WhisperModel, WhisperFeatureExtractor
    extractor = WhisperFeatureExtractor.from_pretrained(hf_id)
    model = WhisperModel.from_pretrained(hf_id)
    return extractor, model.encoder


def load_whisper(hf_id="openai/whisper-base", ms_id=None):
    """Load (feature_extractor, encoder). Tries ModelScope if ms_id is given,
    falls back to HuggingFace. Returns the .encoder submodule (no decoder)."""
    ms_id = ms_id or os.environ.get("WHISPER_MS_ID")
    hf_id = os.environ.get("WHISPER_HF_ID", hf_id)
    errors = []
    if ms_id:
        try:
            return _load_whisper_via_modelscope(ms_id)
        except Exception as e:
            errors.append(f"modelscope({ms_id}): {e}")
    try:
        return _load_whisper_via_hf(hf_id)
    except Exception as e:
        errors.append(f"hf({hf_id}): {e}")
    raise RuntimeError("Failed to load Whisper encoder. Tried: " + " | ".join(errors))


class WhisperEncoder(nn.Module):
    """Frozen Whisper encoder. Forward takes log-mel input_features
    (B, n_mels, T_mel) and returns (B, T_enc, d_model)."""

    def __init__(self, hf_id="openai/whisper-base", ms_id=None, device=None, dtype=None):
        super().__init__()
        extractor, encoder = load_whisper(hf_id=hf_id, ms_id=ms_id)
        self.feature_extractor = extractor
        self.encoder = encoder
        for p in self.encoder.parameters():
            p.requires_grad = False
        self.encoder.eval()
        self._d_model = encoder.config.d_model
        self.sampling_rate = extractor.sampling_rate
        if device is not None or dtype is not None:
            self.encoder.to(device=device, dtype=dtype)

    @property
    def d_model(self):
        return self._d_model

    def preprocess(self, audio_arrays):
        """audio_arrays: list of 1D np.float32 (mono, sampling_rate Hz).
        Returns input_features tensor (B, n_mels, T_mel)."""
        out = self.feature_extractor(
            audio_arrays,
            sampling_rate=self.sampling_rate,
            return_tensors="pt",
        )
        return out.input_features

    @torch.no_grad()
    def forward(self, input_features):
        out = self.encoder(input_features=input_features)
        return out.last_hidden_state


class Projector(nn.Module):
    """LLaVA-style 2-layer MLP: audio_d -> hidden -> n_embd.

    Uses nanochat's Linear so master weights stay fp32 while forward runs in
    the activation dtype (typically bf16). Matches the convention in gpt.py.
    """

    def __init__(self, in_dim, out_dim, hidden_dim=None):
        super().__init__()
        hidden_dim = hidden_dim or out_dim
        self.fc1 = Linear(in_dim, hidden_dim, bias=False)
        self.fc2 = Linear(hidden_dim, out_dim, bias=False)
        s = (3.0 / in_dim) ** 0.5
        torch.nn.init.uniform_(self.fc1.weight, -s, s)
        torch.nn.init.zeros_(self.fc2.weight)

    def forward(self, x):
        x = self.fc1(x)
        x = F.gelu(x)
        x = self.fc2(x)
        return x
+18
-2
@@ -413,7 +413,7 @@ class GPT(nn.Module):
             group["initial_lr"] = group["lr"]
         return optimizer
 
-    def forward(self, idx, targets=None, kv_cache=None, loss_reduction='mean'):
+    def forward(self, idx, targets=None, kv_cache=None, loss_reduction='mean', audio_features=None):
         B, T = idx.size()
 
         # Grab the rotary embeddings for the current sequence length (they are of shape (1, seq_len, 1, head_dim/2))
@@ -448,6 +448,22 @@ class GPT(nn.Module):
                 gate = self.smear_lambda.to(x.dtype) * torch.sigmoid(self.smear_gate(x[:, :, :24]))
                 x = x + gate * x_pre_smear
 
+        # Audio soft-token prepend (LLaVA-style): audio_features must already be projected to n_embd.
+        idx_for_ve = idx
+        if audio_features is not None:
+            assert kv_cache is None, "audio_features prepend not supported with kv_cache"
+            audio = norm(audio_features.to(COMPUTE_DTYPE))
+            T_a = audio.size(1)
+            x = torch.cat([audio, x], dim=1)
+            T_full = T_a + T
+            assert T_full <= self.cos.size(1), f"Sequence length grew beyond rotary cache: {T_full} > {self.cos.size(1)}"
+            cos_sin = self.cos[:, :T_full], self.sin[:, :T_full]
+            idx_pad = torch.zeros((B, T_a), dtype=idx.dtype, device=idx.device)
+            idx_for_ve = torch.cat([idx_pad, idx], dim=1)
+            if targets is not None:
+                pad = torch.full((B, T_a), -1, dtype=targets.dtype, device=targets.device)
+                targets = torch.cat([pad, targets], dim=1)
+
         # Forward the trunk of the Transformer
         x0 = x  # save initial normalized embedding for x0 residual
         n_layer = self.config.n_layer
@@ -455,7 +471,7 @@ class GPT(nn.Module):
         x_backout = None
         for i, block in enumerate(self.transformer.h):
             x = self.resid_lambdas[i] * x + self.x0_lambdas[i] * x0
-            ve = self.value_embeds[str(i)](idx).to(x.dtype) if str(i) in self.value_embeds else None
+            ve = self.value_embeds[str(i)](idx_for_ve).to(x.dtype) if str(i) in self.value_embeds else None
             x = block(x, ve, cos_sin, self.window_sizes[i], kv_cache)
             if i == backout_layer:
                 x_backout = x
@@ -0,0 +1,179 @@
"""
W1 smoke: prove the audio path works end-to-end.

Pipeline:
    wav -> WhisperEncoder (frozen) -> Projector -> prepend to text embeddings
        -> tiny d6 GPT (random init) -> CE loss on text tokens only

The model is randomly initialized and the dataset is 5 synthetic sine clips,
so the only thing this validates is that gradients flow through the projector
into a decreasing loss. Pass criterion: end loss < start loss by a clear margin.

Standalone tokenizer (UTF-8 bytes + a single BOS) so the smoke does not depend
on the nanochat BPE tokenizer being trained yet; that prerequisite belongs to
W2 onwards.

Usage:
    python -m scripts.audio_align_smoke
"""

import argparse
import json
import time
import wave
from pathlib import Path

import numpy as np
import torch

from nanochat.audio import Projector, WhisperEncoder
from nanochat.common import (
    COMPUTE_DTYPE,
    autodetect_device_type,
    compute_cleanup,
    compute_init,
)
from nanochat.gpt import GPT, GPTConfig


# Byte-level tokenizer: vocab[0..255] = raw UTF-8 byte, 256 = <BOS>.
BOS_ID = 256
VOCAB_SIZE = 257


def encode(text):
    return [BOS_ID] + list(text.encode("utf-8"))


def load_manifest(data_dir):
    items = []
    with open(Path(data_dir) / "manifest.jsonl") as f:
        for line in f:
            items.append(json.loads(line))
    return items


def load_wav_mono16k(path):
    """Read a mono PCM16 WAV (matches scripts/audio_smoke_data.py output)."""
    with wave.open(str(path), "rb") as w:
        assert w.getnchannels() == 1, f"expected mono, got {w.getnchannels()} channels"
        assert w.getsampwidth() == 2, f"expected pcm16, got sampwidth {w.getsampwidth()}"
        sr = w.getframerate()
        frames = w.readframes(w.getnframes())
    audio = np.frombuffer(frames, dtype=np.int16).astype(np.float32) / 32768.0
    return audio, sr


def build_gpt(depth, head_dim, max_seq_len, device):
    base_dim = depth * 64  # nanochat's default aspect ratio
    model_dim = ((base_dim + head_dim - 1) // head_dim) * head_dim
    n_head = model_dim // head_dim
    config = GPTConfig(
        sequence_len=max_seq_len,
        vocab_size=VOCAB_SIZE,
        n_layer=depth,
        n_head=n_head,
        n_kv_head=n_head,
        n_embd=model_dim,
        window_pattern="L",
    )
    with torch.device("meta"):
        model = GPT(config)
    model.to_empty(device=device)
    model.init_weights()
    return model, config


def pack_text_batch(text_ids_list, device):
    """idx[i, t] is input token; targets[i, t] is the next token (or -1 to ignore).
    Right-pad to the longest sequence with 0/-1.
    """
    in_len = max(len(ids) for ids in text_ids_list) - 1
    B = len(text_ids_list)
    idx = torch.zeros((B, in_len), dtype=torch.long, device=device)
    targets = torch.full((B, in_len), -1, dtype=torch.long, device=device)
    for i, ids in enumerate(text_ids_list):
        L = len(ids) - 1
        idx[i, :L] = torch.tensor(ids[:-1], dtype=torch.long, device=device)
        targets[i, :L] = torch.tensor(ids[1:], dtype=torch.long, device=device)
    return idx, targets


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--data-dir", default="data/audio_smoke")
    parser.add_argument("--depth", type=int, default=6)
    parser.add_argument("--head-dim", type=int, default=64)
    parser.add_argument("--max-seq-len", type=int, default=2048)
    parser.add_argument("--num-iters", type=int, default=50)
    parser.add_argument("--lr", type=float, default=3e-3)
    parser.add_argument("--whisper", default="openai/whisper-base",
                        help="HF Whisper id (override via WHISPER_HF_ID env)")
    parser.add_argument("--loss-drop-min", type=float, default=0.5,
                        help="end loss must be at least this much lower than start loss")
    args = parser.parse_args()

    device_type = autodetect_device_type()
    ddp, _, _, _, device = compute_init(device_type)
    assert not ddp, "smoke is single-process"

    # Synthetic audio + manifest: regenerate if the manifest or any wav is missing so the
    # script is self-contained. The wavs are gitignored while the manifest is committed,
    # so a fresh checkout has the manifest but no audio.
    data_dir = Path(args.data_dir)
    wavs_present = (data_dir / "manifest.jsonl").exists() and all(
        (data_dir / it["wav"]).exists() for it in load_manifest(data_dir)
    )
    if not wavs_present:
        from scripts.audio_smoke_data import generate_synthetic
        generate_synthetic(args.data_dir)

    items = load_manifest(args.data_dir)
    audios = [load_wav_mono16k(Path(args.data_dir) / it["wav"])[0] for it in items]
    texts = [it["text"] for it in items]
    print(f"loaded {len(items)} samples: {texts}")

    # Frozen Whisper encoder + Projector to nanochat n_embd
    print(f"loading Whisper encoder ({args.whisper})...")
    whisper = WhisperEncoder(hf_id=args.whisper, device=device, dtype=COMPUTE_DTYPE)

    # Pre-extract Whisper input_features and encoder outputs once; encoder is frozen
    # so its output never changes across training steps -> hoist out of the loop.
    input_features = whisper.preprocess(audios).to(device=device, dtype=COMPUTE_DTYPE)
    print(f"input_features: {tuple(input_features.shape)}")
    audio_feats = whisper(input_features).detach()
    print(f"whisper features: {tuple(audio_feats.shape)} (T_a soft tokens)")

    # GPT (random init, d6 by default) and Projector
    gpt, config = build_gpt(args.depth, args.head_dim, args.max_seq_len, device)
    print(f"GPT: depth={config.n_layer} n_embd={config.n_embd} n_head={config.n_head}")
    projector = Projector(in_dim=whisper.d_model, out_dim=config.n_embd).to(device=device)

    # Tokenize transcripts and pack into a batch
    text_ids_list = [encode(t) for t in texts]
    idx, targets = pack_text_batch(text_ids_list, device=device)
    print(f"text idx: {tuple(idx.shape)} (max_text_len-1)")

    # Single AdamW over projector + LM. Whisper stays frozen (requires_grad=False
    # was set in WhisperEncoder.__init__, so its params won't appear here anyway).
    train_params = list(projector.parameters()) + [p for p in gpt.parameters() if p.requires_grad]
    optim = torch.optim.AdamW(train_params, lr=args.lr, betas=(0.9, 0.95), weight_decay=0.0)

    losses = []
    t0 = time.time()
    for step in range(args.num_iters):
        soft_tokens = projector(audio_feats)  # (B, T_a, n_embd)
        loss = gpt(idx, targets=targets, audio_features=soft_tokens)
        optim.zero_grad(set_to_none=True)
        loss.backward()
        torch.nn.utils.clip_grad_norm_(train_params, 1.0)
        optim.step()
        losses.append(loss.item())
        if step % 5 == 0 or step == args.num_iters - 1:
            print(f"step {step:03d} | loss {loss.item():.4f}")
    dt = time.time() - t0

    drop = losses[0] - losses[-1]
    print(f"\nDone {args.num_iters} steps in {dt:.1f}s | start={losses[0]:.4f} end={losses[-1]:.4f} drop={drop:.4f}")
    assert drop >= args.loss_drop_min, f"loss did not drop enough: {drop:.4f} < {args.loss_drop_min}"
    print("PASS: audio path forward+backward works, loss is descending.")

    compute_cleanup()


if __name__ == "__main__":
    main()
@@ -0,0 +1,78 @@
"""
Generate the W1 audio smoke dataset: a handful of 5s sine-wave clips paired
with deterministic transcripts.

Why synthetic instead of real speech: W1 only proves the forward path
(WhisperEncoder -> Projector -> GPT prepend) and that the projector's gradient
flows into a decreasing loss on a tiny fixed set. Real speech adds a network
dependency to a step that should be reproducible offline. W2 swaps in
LibriSpeech.

Audio files land under data/audio_smoke/wavs/ (gitignored). The manifest
data/audio_smoke/manifest.jsonl is the only artifact committed.

Usage:
    python -m scripts.audio_smoke_data
"""

import argparse
import json
import wave
from pathlib import Path

import numpy as np


SAMPLES = [
    (220.0, "low tone"),
    (330.0, "mid low tone"),
    (440.0, "middle tone"),
    (660.0, "mid high tone"),
    (880.0, "high tone"),
]
SR = 16000
DURATION_S = 5.0


def synth_sine(freq_hz, duration_s=DURATION_S, sr=SR):
    """Sine + 2nd harmonic + a sliver of noise so Whisper sees non-degenerate
    frames (a pure tone collapses to a near-constant log-mel)."""
    t = np.arange(int(sr * duration_s)) / sr
    x = 0.5 * np.sin(2 * np.pi * freq_hz * t) + 0.25 * np.sin(2 * np.pi * 2 * freq_hz * t)
    rng = np.random.default_rng(int(freq_hz))
    x = x + 0.01 * rng.standard_normal(len(x))
    return x.astype(np.float32)


def write_wav_pcm16(path, audio, sr=SR):
    """Write mono PCM16 WAV using the stdlib (no scipy/soundfile dependency)."""
    pcm = np.clip(audio, -1.0, 1.0)
    pcm = (pcm * 32767.0).astype(np.int16)
    with wave.open(str(path), "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)
        w.setframerate(sr)
        w.writeframes(pcm.tobytes())


def generate_synthetic(data_dir):
    data_dir = Path(data_dir)
    wav_dir = data_dir / "wavs"
    wav_dir.mkdir(parents=True, exist_ok=True)
    manifest_path = data_dir / "manifest.jsonl"
    with open(manifest_path, "w") as f:
        for freq, text in SAMPLES:
            name = f"sine_{int(freq):04d}.wav"
            wav_path = wav_dir / name
            if not wav_path.exists():
                write_wav_pcm16(wav_path, synth_sine(freq))
            f.write(json.dumps({"wav": f"wavs/{name}", "text": text, "sr": SR}) + "\n")
    print(f"Wrote {len(SAMPLES)} samples to {data_dir}")
    return manifest_path


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--data-dir", default="data/audio_smoke")
    args = parser.parse_args()
    generate_synthetic(args.data_dir)