3c1cc3302f
End-to-end smoke proving the audio path:
wav -> WhisperEncoder (frozen) -> Projector -> prepend to text embeddings
-> tiny d6 GPT (random init) -> CE loss on text only
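One way to read "CE loss on text only" is as target masking over the concatenated sequence. The sketch below is an illustration under assumptions, not the code in this commit: build_targets and IGNORE_INDEX are hypothetical names, and -100 is used only because it is the default ignore_index of PyTorch's cross-entropy loss.

```python
IGNORE_INDEX = -100  # assumption: matches the default ignore_index of PyTorch's cross_entropy


def build_targets(n_audio, text_ids):
    """Next-token targets for n_audio audio frames followed by text_ids.

    Position i predicts the element at position i + 1. Only predictions
    of text tokens are scored; every other position is masked.
    """
    seq_len = n_audio + len(text_ids)
    targets = [IGNORE_INDEX] * seq_len
    # Text positions, except the last one, predict the following text token.
    for i, next_id in enumerate(text_ids[1:]):
        targets[n_audio + i] = next_id
    return targets
```

With 3 audio frames and text ids [256, 104, 105], only the two positions that predict 104 and 105 contribute to the loss.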
Pass criterion is a plain "loss drops by at least 0.5". On a 4090 the run
finishes in ~1 s and goes 5.55 -> 0.17 over 50 steps, so the threshold has
plenty of headroom against spurious failures.
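A minimal sketch of such a check, assuming the drop is measured from the first recorded loss to the last (loss_dropped and min_drop are hypothetical names, not taken from the smoke script):

```python
def loss_dropped(losses, min_drop=0.5):
    """Return True when the final loss is at least min_drop below the first."""
    if len(losses) < 2:
        raise ValueError("need at least two loss values")
    return losses[0] - losses[-1] >= min_drop
```

With the numbers quoted above, loss_dropped([5.55, 0.17]) passes with a margin of roughly 4.9 over the threshold.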
Two design calls worth keeping in mind:
1. Synthetic sine clips, not LibriSpeech. W1 is forward-path proof, not
alignment quality, and a deterministic offline dataset means no network
on the smoke path. data/audio_smoke/manifest.jsonl is the only thing
committed; wavs are regenerated by audio_smoke_data.py and gitignored.
W2 swaps in real LibriSpeech.
2. Standalone byte-level tokenizer (UTF-8 bytes + a single BOS, vocab=257).
Avoids depending on a trained nanochat BPE. The d6 GPT is randomly
initialised anyway, so vocab choice doesn't matter for a "does the
gradient flow" smoke. W2 onwards uses the real BPE on a real base.
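Both design calls fit in a few lines of stdlib Python. This is a sketch under assumptions, not the contents of audio_smoke_data.py: the function names, the 16 kHz sample rate (what Whisper models expect), the clip length and amplitude, and BOS sitting at id 256 are all illustrative choices.

```python
import math
import struct
import wave

SAMPLE_RATE = 16_000  # assumption: Whisper models take 16 kHz mono input
BOS_ID = 256          # assumption: BOS takes the one id above the 256 byte values
VOCAB_SIZE = 257      # 256 byte values + BOS


def write_sine_wav(path, freq_hz, seconds=1.0, amplitude=0.5):
    """Write a deterministic mono 16-bit PCM sine clip to path."""
    n = int(SAMPLE_RATE * seconds)
    samples = [
        int(amplitude * 32767 * math.sin(2 * math.pi * freq_hz * t / SAMPLE_RATE))
        for t in range(n)
    ]
    with wave.open(path, "wb") as f:
        f.setnchannels(1)
        f.setsampwidth(2)  # 2 bytes per sample = 16-bit PCM
        f.setframerate(SAMPLE_RATE)
        f.writeframes(struct.pack(f"<{n}h", *samples))


def encode(text):
    """BOS followed by the raw UTF-8 bytes of text."""
    return [BOS_ID] + list(text.encode("utf-8"))


def decode(ids):
    """Drop BOS and decode the remaining ids as UTF-8 bytes."""
    return bytes(i for i in ids if i != BOS_ID).decode("utf-8", errors="replace")
```

Because the clips depend only on frequency and duration, regenerating them gives byte-identical files, which is what makes committing only the manifest workable.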
Caveat documented in doc/todo.md: because the LM is also randomly
initialised and being trained, the loss drop here mostly reflects the LM
memorising 5 short strings, not Whisper-Projector alignment. That's fine
for proving plumbing; W2 freezes the LM so projector-only gradient is the
only path to lower loss.
20 lines
254 B
Plaintext
.venv/
__pycache__/
*.pyc
dev-ignore/
report.md
eval_bundle/

# Secrets
.env

# Local setup
CLAUDE.md
wandb/

# Claude Code runtime
.claude/

# W1 audio smoke: regenerated by scripts/audio_smoke_data.py, only manifest is committed
data/audio_smoke/wavs/