Commit Graph

387 Commits

Author SHA1 Message Date
Fam Zheng 7cc94cf584 omni: nanochat/audio.py — frozen Whisper encoder + Projector
The audio modality module that pairs with the gpt.forward audio_features
hook. Two things live here:

WhisperEncoder: thin wrapper around transformers' WhisperModel.encoder.
- Weight loading prefers ModelScope when WHISPER_MS_ID is set (matches the
  CN-mirror policy in doc/todo.md — modelscope is first-class for model
  weights, hf-mirror is fallback). Otherwise falls back to plain HF, with
  WHISPER_HF_ID as the override and `openai/whisper-base` as the default
  (the smallest variant that still produces useful features for smoke).
- Encoder params have requires_grad=False from __init__ so they never
  appear in the optimizer's param list. Caller does not need to remember
  to freeze it.
- preprocess() runs the feature extractor; forward() takes (B, n_mels,
  T_mel) and returns last_hidden_state (B, T_enc, d_model). Whisper pads
  every clip to 30 s, so T_enc is a constant 1500 regardless of input
  duration — handy for batching, wasteful for short clips. We accept the
  waste at W1; W2 can switch to streaming-style chunking.
- Note for W3+/W5+: last_hidden_state is the most text-semantic layer.
  When we start caring about timbre / prosody / emotion ("texture perception"), we
  should expose middle layers or a learnable weighted sum across layers.
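The weight-source preference and the constant T_enc described above can be sketched in plain Python. This is a minimal illustration, not the code in nanochat/audio.py: the helper names `resolve_whisper_source` and `whisper_encoder_frames` and the `(hub, model_id)` return convention are hypothetical, and the front-end constants (16 kHz audio, hop length 160, one stride-2 convolution) are the standard Whisper values, assumed here rather than read from the repository.

```python
import os

DEFAULT_HF_ID = "openai/whisper-base"


def resolve_whisper_source(env=None):
    """Return (hub, model_id). Hypothetical helper, not the audio.py API."""
    env = os.environ if env is None else env
    ms_id = env.get("WHISPER_MS_ID")
    if ms_id:
        # ModelScope wins whenever it is configured.
        return "modelscope", ms_id
    # Otherwise plain HF, with WHISPER_HF_ID as the override.
    return "hf", env.get("WHISPER_HF_ID") or DEFAULT_HF_ID


def whisper_encoder_frames(clip_seconds=30, sample_rate=16000,
                           hop_length=160, conv_stride=2):
    """Encoder sequence length for a clip padded to clip_seconds."""
    mel_frames = clip_seconds * sample_rate // hop_length  # 3000 for 30 s
    return mel_frames // conv_stride                        # halved by the conv stem
```

Because every clip is padded to 30 s before the mel front-end, `whisper_encoder_frames()` does not depend on the real clip duration, which is where the constant 1500 comes from.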

Projector: 2-layer MLP (in_dim → out_dim → out_dim) with GELU and the
nanochat Linear class so master weights stay fp32 while forward runs in
the activation dtype (bf16). fc2 is zero-initialized so the model starts
ignoring audio entirely, which gives a clean baseline before any training
signal flows through (audio path is opt-out by default, opt-in by
training).
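
A minimal sketch of the zero-initialized projector described above. It assumes plain `torch.nn.Linear` in place of the nanochat Linear class (so the fp32-master / bf16-forward behaviour is not reproduced), assumes biases are present and zeroed on fc2, and uses 768 as an illustrative n_embd; 512 is the whisper-base encoder width.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class Projector(nn.Module):
    """in_dim -> out_dim -> out_dim MLP with GELU (sketch, not audio.py)."""

    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.fc1 = nn.Linear(in_dim, out_dim)
        self.fc2 = nn.Linear(out_dim, out_dim)
        # Zero-init the last layer so the projected audio is exactly zero
        # until training moves fc2 away from zero.
        nn.init.zeros_(self.fc2.weight)
        nn.init.zeros_(self.fc2.bias)

    def forward(self, x):
        return self.fc2(F.gelu(self.fc1(x)))


feats = torch.randn(2, 10, 512)   # (B, T_enc, d_model), T_enc shortened for the demo
proj = Projector(512, 768)        # 768 is an assumed n_embd
out = proj(feats)                 # (B, T_enc, n_embd), all zeros at init
```

With fc2 at zero, fc2 still receives a gradient on the first backward pass (its input is the non-zero GELU output), so the audio path can switch on as soon as the loss rewards it.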
2026-05-05 22:39:05 +01:00
Fam Zheng d760915daa omni: gpt.forward — optional audio_features for soft-token prepend
W1 needs the GPT to consume Whisper-projector outputs as a prefix of "soft
tokens" sitting in front of the text token embeddings (LLaVA-style). The
change is intentionally minimal:

- forward() takes a new keyword-only arg `audio_features` of shape
  (B, T_a, n_embd). They must already be projected to n_embd by the caller
  (Projector lives in nanochat.audio, kept out of GPT itself).
- The audio rows are normed (matches the post-wte norm convention) and
  concatenated *after* smear so smear stays a strictly text-side op (its
  prev-token semantics aren't defined for soft tokens, and revisiting that
  belongs to a later phase).
- Rotary embeddings are re-sliced for T_a + T_text. Audio gets positions
  0..T_a-1 and text is shifted from 0..T_text-1 to T_a..T_full-1. The 10×
  over-allocated rotary cache in __init__ already covers this.
- value_embeds lookup uses an idx padded with 0 for audio positions. They
  feed the v residual but the gate (`ve_gate`) is input-dependent and will
  learn to suppress the dummy rows; for W1 smoke this is fine.
- targets are auto-padded with -1 (ignore_index) over audio positions so
  the LM is only graded on text predictions.
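
The bookkeeping in the list above (norm the audio rows, prepend them, pad idx with 0 and targets with -1, extend positions) can be sketched as a standalone function. This is an illustration under assumptions, not the gpt.forward code: `prepend_audio` is a hypothetical name, the norm is taken to be a parameter-free RMS norm, and smear, rotary lookup, and value embeddings are left out.

```python
import torch


def rms_norm(x, eps=1e-6):
    # Parameter-free RMS norm over the channel dimension (assumed convention).
    return x * torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + eps)


def prepend_audio(text_emb, audio_features, idx, targets=None):
    """text_emb: (B, T_text, C), audio_features: (B, T_a, C), idx/targets: (B, T_text)."""
    B, T_a, _ = audio_features.shape
    x = torch.cat([rms_norm(audio_features), text_emb], dim=1)   # (B, T_a + T_text, C)
    # Dummy token id 0 over audio positions, used only for value-embedding lookup.
    idx_full = torch.cat([idx.new_zeros(B, T_a), idx], dim=1)
    if targets is not None:
        # -1 is the ignore_index, so audio positions never contribute to the loss.
        targets = torch.cat([targets.new_full((B, T_a), -1), targets], dim=1)
    positions = torch.arange(x.size(1))   # audio 0..T_a-1, text T_a..T_full-1
    return x, idx_full, targets, positions
```

Passing the padded targets to `torch.nn.functional.cross_entropy(..., ignore_index=-1)` then grades only the text positions.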

Not yet supported: audio_features with kv_cache. The KV-cache path is a
prefill+decode protocol that would need its own audio-aware semantics; W1
runs train-style forwards only, so we just assert.
2026-05-05 22:38:49 +01:00
fam 9cae824aa5 Merge pull request 'doc: prefer ModelScope for Whisper encoder weights (closes #4)' (#5) from mochi/issue-4 into main
smoke / nanochat-smoke (push) Successful in 32s
Reviewed-on: https://famzheng.me/gitea/fam/nanochat-omni/pulls/5
2026-05-05 21:26:21 +00:00
mochi 62642b805b doc: prefer ModelScope for Whisper encoder weights (closes #4)
smoke / nanochat-smoke (push) Successful in 33s
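The WhisperEncoder entry for audio.py in the W1 todo previously said to pull
weights from the HF mirror, but pulling from HF inside China often stalls,
even through hf-mirror. Switch to ModelScope as the first choice (e.g.
iic/Whisper-large-v3 / iic/Whisper-small) and keep the HF mirror as fallback.
While at it, the infra decision entry now aligns the mirror list along three
lines (pip / model weights / HF datasets) and states clearly that modelscope
is the first choice for model weights.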
2026-05-05 22:25:38 +01:00
Fam Zheng b585e07dc2 omni: CI smoke + docs + README preamble
smoke / nanochat-smoke (push) Successful in 2m30s
- .gitea/workflows/smoke.yml: gitea CI on ailab gpu runner (manual git
  clone since actions/checkout@v4 mis-resolves subpath gitea); injects
  WANDB_API_KEY + CI_RUN_TAG=smoke-$run_number
- scripts/smoke.sh: in-place smoke (uv sync + 1 shard + tokenizer +
  d=6 50-step base_train); idempotent cache at /data/nanochat-smoke/
- doc/research_feasibility.md: voice-first multimodal feasibility study (mochi)
- doc/todo.md: phase-by-phase roadmap (W1 Whisper smoke → W4 MVP)
- README.md: omni preamble pointing at upstream nanochat README
- .gitignore: exclude .claude/ runtime files
2026-05-05 22:21:31 +01:00
Fam Zheng 7939990181 patch: CN mirrors for pytorch-wheels and HF datasets
- pyproject.toml + uv.lock: pytorch-cu128/cpu indexes → mirror.sjtu.edu.cn
  (aliyun lacks 2.9.1, sjtu has it)
- nanochat/dataset.py: climbmix BASE_URL → hf-mirror.com

For ailab (CN, RTX 5090) where direct pytorch.org and huggingface.co
are unreachable. Override at uv-sync time with UV_DEFAULT_INDEX env.
2026-05-05 22:21:21 +01:00
Andrej Karpathy dc54a1a307 tried and failed at DyT 2026-05-05 03:17:21 +00:00
Andrej 0aaca56805 Merge pull request #706 from svlandeg/fix/cpu
Add setuptools for CPU run
2026-04-14 11:33:14 -07:00
Andrej b9b6ce137b Merge pull request #686 from marcinbogdanski/fix/init-smear-backout-lambdas
Initialize smear and backout lambdas in init_weights()
2026-04-13 16:08:04 -07:00
Sofie Van Landeghem a3ca42a678 add comment 2026-04-13 14:17:23 +02:00
Sofie Van Landeghem 9822cc7424 use nn.init and initialize smear gate's weight as well 2026-04-13 14:03:18 +02:00
svlandeg 12839c11e3 update uv lock 2026-04-13 11:20:38 +02:00
svlandeg 8ef90bc154 add setuptools for CPU run 2026-04-13 10:50:57 +02:00
Marcin Bogdanski 94b73ad29a fix: initialize smear and backout lambdas in init_weights 2026-04-03 20:39:55 +00:00
Andrej Karpathy a445144d39 create a group for dev dependencies, there is no need to install all this other stuff just for speedrun and it's exposing people to dependency chain attacks. we need to delete more dependencies. dependencies bad bad bad 2026-03-26 03:41:28 +00:00
Andrej Karpathy 03be953668 delete non-essential deps from legacy use 2026-03-26 03:41:28 +00:00
Andrej 7808dc7159 Merge pull request #595 from svlandeg/fix/typo
Small fixes
2026-03-25 14:40:25 -07:00
Andrej a4ed96687b Merge pull request #634 from 2bitbit/fix-docs-and-comments
fix: correct minor typos in help text, README, and comments
2026-03-25 14:31:49 -07:00
Andrej 7b70f6b411 Merge pull request #639 from mathieu-lacage/master
Verified: both paths return identical data (99,842 rows), and all splits under subset='all' load cleanly.
2026-03-25 14:29:30 -07:00
RoomWithOutRoof 47e983eea7 fix: use meta device in disable_fp8 to avoid VRAM spike (#616)
When swapping Float8Linear to Linear in disable_fp8 context manager,
using device=fp8_module.weight.device directly allocates new tensors
on GPU, causing unnecessary VRAM spike (~1GB for large models).

This fix uses device='meta' to avoid physical memory allocation,
then swaps in the weight tensor reference. This eliminates the
unnecessary VRAM spike during evaluation phase.
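
The meta-device trick can be illustrated with plain `nn.Linear` modules. This is a sketch under assumptions: `linear_sharing_weight` is a hypothetical helper, and an ordinary `nn.Linear` stands in for the Float8Linear source module, so it shows the allocation pattern rather than the actual disable_fp8 code.

```python
import torch
import torch.nn as nn


def linear_sharing_weight(src: nn.Linear) -> nn.Linear:
    """Build a Linear that reuses src's parameters without allocating new ones."""
    # device="meta" creates shape-only parameters: no real memory is reserved.
    dst = nn.Linear(src.in_features, src.out_features,
                    bias=src.bias is not None, device="meta")
    # Swap in references to the existing parameters (no copy).
    dst.weight = src.weight
    if src.bias is not None:
        dst.bias = src.bias
    return dst


src = nn.Linear(16, 4)
dst = linear_sharing_weight(src)
y = dst(torch.randn(2, 16))   # runs on src's device using the shared weight
```

Constructing `dst` directly on `src.weight.device` would instead allocate and initialize a second full-size weight before it is thrown away, which is the spike the commit removes.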

Fixes issue #592

Co-authored-by: RoomWithOutRoof <roomwithoutroof@sparklab.ai>
2026-03-25 14:24:57 -07:00
Andrej Karpathy c0dbf1f3ff use COMPUTE_DTYPE-aware cast in Muon polar express step
The bf16 cast is intentional for speed on Hopper+ GPUs, but should be
skipped on other platforms rather than blindly applied. fp16 is unstable
here due to its limited exponent range, and fp32 platforms don't benefit
from the cast. Now: bf16 when COMPUTE_DTYPE is bf16, no cast otherwise.
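
The rule in the last sentence reduces to a one-line conditional. The sketch below assumes `COMPUTE_DTYPE` is a `torch.dtype` and uses a hypothetical helper name; the surrounding polar express iteration is not reproduced.

```python
import torch


def maybe_cast_for_polar_express(G, compute_dtype):
    """Cast to bf16 only when the platform's compute dtype is bf16."""
    if compute_dtype == torch.bfloat16:
        return G.bfloat16()
    # fp16 and fp32 platforms keep the tensor in its original dtype.
    return G


G = torch.randn(4, 4)
fast = maybe_cast_for_polar_express(G, torch.bfloat16)   # bf16 copy
safe = maybe_cast_for_polar_express(G, torch.float16)    # returned unchanged
```

The exponent-range point can be checked with `torch.finfo`: the largest finite fp16 value is 65504, while bf16 shares the 8-bit exponent of fp32.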

Inspired by PR #667.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-25 20:19:14 +00:00
Andrej Karpathy 4e1694cc95 bunch of ideas tried from openai/parameter-golf, all negative results for nanochat 2026-03-24 22:13:13 +00:00
Andrej Karpathy 1cd94d768f bump D:N ratio to 12 per recent scaling laws re-run 2026-03-24 19:25:50 +00:00
Andrej Karpathy c16db281ff fix small bug with params logging and batch size 2026-03-24 19:25:34 +00:00
svlandeg dfe7d39ce8 Merge branch 'master' into fix/typo 2026-03-18 17:01:45 +01:00
Andrej Karpathy 5019accc5b fix scaling laws scripts after the bigram embeddings were removed 2026-03-17 16:55:56 +00:00
svlandeg 51f42a4406 ~1.5h :-) 2026-03-15 22:29:27 +01:00
svlandeg 1f9e42a855 two more typos, from PR 645 2026-03-15 22:27:18 +01:00
svlandeg bd6e9c8d5f fix numbering 2026-03-15 22:18:18 +01:00
svlandeg 02e865c2ab Merge branch 'master' into fix/typo 2026-03-15 22:18:01 +01:00
Andrej Karpathy 1b1cc3c599 submit new time to GPT-2 leaderboard entry: 99 minutes 2026-03-14 17:15:01 +00:00
Andrej Karpathy a825e63f81 Autoresearch round 2: smear, backout, and hyperparameter tuning
New architectural features:
- Smear: mix previous token embedding into current position via learned
  gate, providing cheap bigram-like info (works in training + KV cache)
- Backout: subtract learned fraction of mid-layer residual before logit
  projection to remove low-level features
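
One way to read the two bullets above, as a rough sketch only. The exact formulation in nanochat is not given in this message, so everything here is an assumption: the gate is taken to be a sigmoid of a linear map of the current embedding, the mix is additive, and backout is a scalar-weighted subtraction.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class Smear(nn.Module):
    """Mix the previous token's embedding into the current one (assumed form)."""

    def __init__(self, n_embd):
        super().__init__()
        self.gate = nn.Linear(n_embd, 1, bias=False)   # hypothetical gate shape

    def forward(self, x):                      # x: (B, T, C)
        # Shift right by one position along time; position 0 sees zeros.
        prev = F.pad(x[:, :-1], (0, 0, 1, 0))
        g = torch.sigmoid(self.gate(x))        # (B, T, 1), input-dependent
        return x + g * prev


def backout(x_final, x_mid, lam):
    """Subtract a learned fraction lam of a mid-layer residual (assumed form)."""
    return x_final - lam * x_mid
```

Because position t only reads positions t and t-1, the op stays causal, which is consistent with the note that it works with a KV cache.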

Hyperparameter tuning:
- Muon momentum warmdown 0.97→0.90 during LR warmdown phase
- Non-uniform per-layer init: resid_lambdas 1.15→1.05, x0_lambdas 0.20→0.05
- c_fc init scale 0.4x, QK norm scale 1.2, sliding window seq_len/4
- Speedrun data:params ratio reduced to 8

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-14 17:03:06 +00:00
svlandeg 6405b26d24 Merge branch 'master' into fix/typo 2026-03-13 13:56:50 +01:00
svlandeg 1052d25d45 we only need to wait 2h now! 2026-03-13 13:46:16 +01:00
Mathieu Lacage a641b6ca96 MMLU main split is named auxiliary_train, not train 2026-03-13 13:19:10 +01:00
2bitbit 2bb93b2ae4 fix: correct minor typos in help text, README, and comments 2026-03-12 17:03:26 +08:00
svlandeg d96558bcb0 fix heading, cf #622 2026-03-10 09:57:30 +01:00
Andrej Karpathy f068604948 new leaderboard entry coming from improvements of autoresearch round 1, time to gpt-2 from 2.02 hours to 1.80 hours 2026-03-10 06:26:39 +00:00
Andrej Karpathy 6ed7d1d82c All of these improvements were developed by Claude running autonomously over ~2 days using autoresearch. I didn't touch anything - incredible. All tuning was done on d12 but generalized easily to larger models (e.g. d24 in particular). This means we will also get a new "Time to GPT-2" Leaderboard entry, which I will push separately.
Optimizer & schedule changes:
- Increase unembedding LR 0.004 -> 0.008, weight decay 0.2 -> 0.28
- Per-group Adam betas and weight decay (instead of shared global betas)
- Muon beta2 0.95 -> 0.9, momentum warmup target 0.95 -> 0.97 over 400 steps
- Warmup: ratio-based -> absolute steps (default 40)
- Warmdown ratio 0.5 -> 0.65, final LR fraction 0.0 -> 0.05
- Weight decay schedule: linear -> cosine decay
- Polar express norm factor 1.02 -> 1.01
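
The warmup and warmdown numbers in this list can be put together into a learning-rate multiplier. This is a sketch: the defaults come from the list (40 absolute warmup steps, warmdown ratio 0.65, final LR fraction 0.05), but the linear ramp shapes and the function name are assumptions, not the training script's code.

```python
def lr_multiplier(step, num_steps, warmup_steps=40,
                  warmdown_ratio=0.65, final_lr_frac=0.05):
    """Multiplier applied to the base LR at a given step (assumed linear ramps)."""
    warmdown_steps = round(warmdown_ratio * num_steps)
    if step < warmup_steps:
        # Absolute-step warmup: independent of the total run length.
        return (step + 1) / warmup_steps
    if step <= num_steps - warmdown_steps:
        return 1.0
    # progress goes 1 -> 0 across the warmdown phase.
    progress = (num_steps - step) / warmdown_steps
    return progress + (1.0 - progress) * final_lr_frac
```

For a 1000-step run this holds the LR flat from step 39 to step 350 and then decays to 5% of the base LR at the final step.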

Architecture & init changes:
- VE gate: channels 32 -> 12, scale range 2x -> 3x, init small positive
- Add post-QK-norm scaling (q,k *= 1.15) for sharper attention
- Embedding init std 1.0 -> 0.8, MLP c_fc init 0.5x smaller
- RoPE base theta 10K -> 100K
- Short attention window: seq_len/2 -> ~seq_len/3 (ceil to 128 tile)
- Logit softcap 20 -> 15
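
Two items in this list are simple arithmetic and can be checked directly. The sketch assumes that "ceil to 128 tile" means rounding the window up to the next multiple of 128 and that the softcap uses the common `cap * tanh(logit / cap)` form; both function names are illustrative.

```python
import math


def short_window(seq_len, divisor=3, tile=128):
    """Short attention window: about seq_len / divisor, rounded up to a tile multiple."""
    return math.ceil(seq_len / divisor / tile) * tile


def softcap(logit, cap=15.0):
    """Squash a logit smoothly into (-cap, cap); near-identity for small values."""
    return cap * math.tanh(logit / cap)
```

For seq_len = 2048 the window moves from 1024 (seq_len/2) to 768 (seq_len/3 is about 683, rounded up to 6 tiles of 128).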
2026-03-09 20:45:17 +00:00
svlandeg f8ff0439b9 two more small typos 2026-03-06 11:03:00 +01:00
Andrej Karpathy 1076f97059 delete autocast, an unnecessary thorn in my side, manage dtypes directly 2026-03-04 23:55:30 +00:00
Sofie Van Landeghem 752abc836e Ensure that inputs and targets are contiguous (#569)
* call reshape instead of view in case the tensors are not contiguous

* fix directly in data loader instead
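
The failure mode behind this fix is easy to reproduce. In the sketch below the buffer layout is an assumption about how a loader might slice inputs and targets out of one token buffer; the variable names are illustrative.

```python
import torch

buf = torch.arange(10).view(2, 5)   # (B, T + 1) token buffer
inputs = buf[:, :-1]                # strided slice: not contiguous
targets = buf[:, 1:]

try:
    inputs.view(-1)                 # view needs a compatible memory layout
    view_failed = False
except RuntimeError:
    view_failed = True

flat_inputs = inputs.reshape(-1)    # reshape copies when it has to
inputs_fixed = inputs.contiguous()  # fixing it once at the source
targets_fixed = targets.contiguous()
```

Making the tensors contiguous in the data loader means every downstream `.view(...)` keeps working without switching each call site to `reshape`.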
2026-03-04 13:58:27 -08:00
Andrej Karpathy 4b4077425b Document new Leaderboard entry congrats @ddudek for pointing out ClimbMix, time to GPT-2 is now 2.01 hours, down from 2.76 previously 2026-03-04 20:02:07 +00:00
Andrej Karpathy 324e69c45d big, breaking change but large upside: swap previous FineWeb-EDU dataset to NVIDIA ClimbMix dataset. Requires people to download the data shards. The upside is that training GPT-2 capability model now only takes ~2 hours, down from 2.76 hours, so this is a huge win data-wise 2026-03-04 19:47:12 +00:00
Andrej Karpathy b07604ebaa document the legacy fineweb100b dataset and the new climbmix400b dataset 2026-03-03 17:24:31 +00:00
Andrej Karpathy aba30cb037 tune logit softcap? 2026-03-03 00:38:53 +00:00
Anish 83dccc20ae Restore completion-only loss masking in SFT dataloader (#582)
* printing steps count

* adding reply only loss for chat

* using the mask by render_conversation function of tokeniser

* undoing some changes

* putting back the comment which got removed accidentally, no functionality change
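
Completion-only masking can be sketched in a few lines. The assumptions here: the tokenizer's render_conversation yields token ids plus a parallel 0/1 mask marking assistant tokens, and -1 is the ignore index; `mask_targets` is a hypothetical helper, not the dataloader code.

```python
def mask_targets(ids, mask, ignore_index=-1):
    """ids, mask: equal-length lists; mask[i] == 1 marks assistant tokens.

    Returns (inputs, targets) for next-token prediction where only
    assistant tokens are graded.
    """
    inputs = ids[:-1]
    # targets[i] is the token that follows inputs[i]; keep it only if that
    # next token belongs to the assistant's reply.
    targets = [tok if m == 1 else ignore_index
               for tok, m in zip(ids[1:], mask[1:])]
    return inputs, targets


inputs, targets = mask_targets([10, 11, 12, 13, 14], [0, 0, 1, 1, 1])
```

Here the first two tokens are user-side, so the only ignored target is the one that would predict the second user token.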
2026-03-02 16:37:47 -08:00
Dipesh Babu c7ba252142 docs: fix typos in experiment log (#547) 2026-02-20 08:03:45 -08:00
Andrej Karpathy 2dffdc8cf6 document MoE exploration 2026-02-19 02:53:47 +00:00
Andrej Karpathy 48804bff3a report negative result on fineweb dataset 2026-02-18 23:45:31 +00:00