nanochat-omni

fam/nanochat-omni

Commit Graph

8c7210abeb doc: W1 audio smoke summary [main] Fam Zheng 2026-05-05 22:40:57 +01:00
3c1cc3302f omni: W1 audio align smoke — synthetic dataset + 50-step script Fam Zheng 2026-05-05 22:39:20 +01:00
7cc94cf584 omni: nanochat/audio.py — frozen Whisper encoder + Projector Fam Zheng 2026-05-05 22:39:05 +01:00
d760915daa omni: gpt.forward — optional audio_features for soft-token prepend Fam Zheng 2026-05-05 22:38:49 +01:00
9cae824aa5 Merge pull request 'doc: prefer ModelScope for Whisper encoder weights (closes #4)' (#5) from mochi/issue-4 into main fam 2026-05-05 21:26:21 +00:00
62642b805b doc: prefer ModelScope for Whisper encoder weights (closes #4) [mochi/issue-4] mochi 2026-05-05 22:25:38 +01:00
b585e07dc2 omni: CI smoke + docs + README preamble Fam Zheng 2026-05-05 22:21:31 +01:00
7939990181 patch: CN mirrors for pytorch-wheels and HF datasets Fam Zheng 2026-05-05 22:21:21 +01:00
9dba7422bf doc: rename docs/ → doc/ and add todo.md [pre-fork] Fam Zheng 2026-05-05 22:17:36 +01:00
3fa1d6adf5 chore: gitignore .claude/ runtime files Fam Zheng 2026-05-05 22:08:50 +01:00
f487ffee80 ci: wire wandb logging via WANDB_API_KEY secret Fam Zheng 2026-05-05 22:08:13 +01:00
94f115094e ci: replace actions/checkout with manual clone Fam Zheng 2026-05-05 22:04:43 +01:00
c1108ae01f ci: nanochat smoke on ailab gpu runner Fam Zheng 2026-05-05 21:57:22 +01:00
b33e6af06c ci: add dummy workflow targeting ailab runner Fam Zheng 2026-05-05 21:37:15 +01:00
7b29fff2c8 docs: add NanoChat-Omni feasibility research (closes #2) mochi 2026-05-05 21:13:25 +01:00
f278524d9d chore: add upstream nanochat as submodule at upstream/nanochat mochi 2026-05-05 21:02:46 +01:00
147b21c31b Initial commit fam 2026-05-05 20:00:43 +00:00
dc54a1a307 tried and failed at DyT Andrej Karpathy 2026-05-05 03:17:21 +00:00
0aaca56805 Merge pull request #706 from svlandeg/fix/cpu Andrej 2026-04-14 11:33:14 -07:00
b9b6ce137b Merge pull request #686 from marcinbogdanski/fix/init-smear-backout-lambdas Andrej 2026-04-13 16:08:04 -07:00
a3ca42a678 add comment Sofie Van Landeghem 2026-04-13 14:17:23 +02:00
9822cc7424 use nn.init and initialize smear gate's weight as well Sofie Van Landeghem 2026-04-13 14:03:18 +02:00
12839c11e3 update uv lock svlandeg 2026-04-13 11:20:38 +02:00
8ef90bc154 add setuptools for CPU run svlandeg 2026-04-13 10:50:57 +02:00
94b73ad29a fix: initialize smear and backout lambdas in init_weights Marcin Bogdanski 2026-04-03 20:39:55 +00:00
a445144d39 create a group for dev dependencies, there is no need to install all this other stuff just for speedrun and it's exposing people to dependency chain attacks. we need to delete more dependencies. dependencies bad bad bad Andrej Karpathy 2026-03-26 03:40:55 +00:00
03be953668 delete non-essential deps from legacy use Andrej Karpathy 2026-03-26 03:10:08 +00:00
7808dc7159 Merge pull request #595 from svlandeg/fix/typo Andrej 2026-03-25 14:40:25 -07:00
a4ed96687b Merge pull request #634 from 2bitbit/fix-docs-and-comments Andrej 2026-03-25 14:31:49 -07:00
7b70f6b411 Merge pull request #639 from mathieu-lacage/master Andrej 2026-03-25 14:29:30 -07:00
47e983eea7 fix: use meta device in disable_fp8 to avoid VRAM spike (#616) RoomWithOutRoof 2026-03-26 05:24:57 +08:00
c0dbf1f3ff use COMPUTE_DTYPE-aware cast in Muon polar express step Andrej Karpathy 2026-03-25 20:19:14 +00:00
4e1694cc95 bunch of ideas tried from openai/parameter-golf, all negative results for nanochat Andrej Karpathy 2026-03-24 22:13:13 +00:00
1cd94d768f bump D:N ratio to 12 per recent scaling laws re-run Andrej Karpathy 2026-03-24 19:25:50 +00:00
c16db281ff fix small bug with params logging and batch size Andrej Karpathy 2026-03-24 19:25:34 +00:00
dfe7d39ce8 Merge branch 'master' into fix/typo svlandeg 2026-03-18 17:01:45 +01:00
5019accc5b fix scaling laws scripts after the bigram embeddings were removed Andrej Karpathy 2026-03-17 16:55:56 +00:00
51f42a4406 ~1.5h :-) svlandeg 2026-03-15 22:29:27 +01:00
1f9e42a855 two more typos, from PR 645 svlandeg 2026-03-15 22:27:18 +01:00
bd6e9c8d5f fix numbering svlandeg 2026-03-15 22:18:18 +01:00
02e865c2ab Merge branch 'master' into fix/typo svlandeg 2026-03-15 22:18:01 +01:00
1b1cc3c599 submit new time to GPT-2 leaderboard entry: 99 minutes Andrej Karpathy 2026-03-14 17:15:01 +00:00
a825e63f81 Autoresearch round 2: smear, backout, and hyperparameter tuning Andrej Karpathy 2026-03-14 17:03:06 +00:00
6405b26d24 Merge branch 'master' into fix/typo svlandeg 2026-03-13 13:56:50 +01:00
1052d25d45 we only need to wait 2h now! svlandeg 2026-03-13 13:46:16 +01:00
a641b6ca96 MMLU main split is named auxiliary_train, not train Mathieu Lacage 2026-03-13 13:19:10 +01:00
2bb93b2ae4 fix: correct minor typos in help text, README, and comments 2bitbit 2026-03-12 17:03:26 +08:00
d96558bcb0 fix heading, cf #622 svlandeg 2026-03-10 09:57:30 +01:00
f068604948 new leaderboard entry coming from improvements of autoresearch round 1, time to gpt-2 from 2.02 hours to 1.80 hours Andrej Karpathy 2026-03-10 06:26:39 +00:00
6ed7d1d82c All of these improvements were developed by Claude running autonomously over ~2 days using autoresearch. I didn't touch anything - incredible. All tuning was done on d12 but generalized easily to larger models (e.g. d24 in particular). This means we will also get a new "Time to GPT-2" Leaderboard entry, which I will push separately. Andrej Karpathy 2026-03-09 20:45:17 +00:00
f8ff0439b9 two more small typos svlandeg 2026-03-06 11:03:00 +01:00
1076f97059 delete autocast, an unnecessary thorn in my side, manage dtypes directly Andrej Karpathy 2026-03-04 23:55:24 +00:00
752abc836e Ensure that inputs and targets are contiguous (#569) Sofie Van Landeghem 2026-03-04 22:58:27 +01:00
4b4077425b Document new Leaderboard entry congrats @ddudek for pointing out ClimbMix, time to GPT-2 is now 2.01 hours, down from 2.76 previously Andrej Karpathy 2026-03-04 20:02:07 +00:00
324e69c45d big, breaking change but large upside: swap previous FineWeb-EDU dataset to NVIDIA ClimbMix dataset. Requires people to download the data shards. The upside is that training GPT-2 capablity model now only takes ~2 hours, down from 2.76 hours, so this is a huge win data-wise Andrej Karpathy 2026-03-04 19:47:12 +00:00
b07604ebaa document the legacy fineweb100b dataset and the new climbmix400b dataset Andrej Karpathy 2026-03-03 17:24:31 +00:00
aba30cb037 tune logit softcap? Andrej Karpathy 2026-03-02 18:19:37 +00:00
83dccc20ae Restore completion-only loss masking in SFT dataloader (#582) Anish 2026-03-03 06:07:47 +05:30
c7ba252142 docs: fix typos in experiment log (#547) Dipesh Babu 2026-02-20 11:03:45 -05:00
2dffdc8cf6 document MoE exploration Andrej Karpathy 2026-02-19 02:53:47 +00:00
48804bff3a report negative result on fineweb dataset Andrej Karpathy 2026-02-18 23:45:31 +00:00
bb5137860e fix comment Andrej Karpathy 2026-02-18 23:26:22 +00:00
458555117b Merge branch 'Chetter2-patch-1' Andrej Karpathy 2026-02-18 23:17:39 +00:00
bac5a35dd7 fix minor bug in fp8 application to skip tiny matmuls Andrej Karpathy 2026-02-18 23:17:29 +00:00
ad55575326 Fix bug in setting precision (#538) George Shakan 2026-02-18 10:42:11 -05:00
cac43e8511 Fix MockModel's device definition (#535) Sofie Van Landeghem 2026-02-18 01:03:46 +01:00
f5fe7925ed update dev log with recent Andrej Karpathy 2026-02-17 15:44:54 +00:00
1415fb7617 tune the data mixture a bit, load optimizer by default when SFT. These were confirmed to be best settings from sweeps of sft Andrej Karpathy 2026-02-16 20:23:04 +00:00
77f8fb8303 a number of upgrades to SFT script to bring it up to date w.r.t. pretraining and tuning some of its kwargs based on sweeps Andrej Karpathy 2026-02-16 14:41:53 +00:00
0a23f87643 Fix bug in setting precision (#538) George Shakan 2026-02-18 10:42:11 -05:00
4800c62f6e Fix MockModel's device definition (#535) Sofie Van Landeghem 2026-02-18 01:03:46 +01:00
4a6e47b0c6 update dev log with recent Andrej Karpathy 2026-02-17 15:44:54 +00:00
8180e1d8c1 tune the data mixture a bit, load optimizer by default when SFT. These were confirmed to be best settings from sweeps of sft Andrej Karpathy 2026-02-16 20:23:04 +00:00
788dadeb88 a number of upgrades to SFT script to bring it up to date w.r.t. pretraining and tuning some of its kwargs based on sweeps Andrej Karpathy 2026-02-16 14:41:53 +00:00
124f49be98 Removed redundant qunatization of gradients Alan 2026-02-15 15:41:33 +00:00
d9678ff0f9 Save FP8 tensors in autograd ctx instead of full-precision inputs Alan 2026-02-15 14:31:54 +00:00
2f09686724 clarify that this is bf16 mfu we're talking about Andrej Karpathy 2026-02-10 23:35:00 +00:00
e569b59f92 delete torchao dependency, create our own exact API-matched version of Float8Linear, document it very well. for some poorly understood reason, the performance is not only ~identical but actually runs 3% faster. despite of it being significantly simpler and much less code. i don't fully understand why/how atm Andrej Karpathy 2026-02-10 18:46:39 +00:00
1ec0a34779 at 28 and above we start to need batch size 8 Andrej Karpathy 2026-02-08 18:26:34 +00:00
ff46300720 tune miniseries just a bit, fairly cosmetic, keep to even depths where the math works out nicely in model sizing Andrej Karpathy 2026-02-08 17:54:12 +00:00
aeff095e97 better comments/flow on all the hyperparameter transfer stuff, and change the WD scaling from my empirical 1/d^2 to a bit more principled version based on Tepoch. All of that theory is based on AdamW and could be suboptimal for Muon Andrej Karpathy 2026-02-06 19:22:28 +00:00
685271dc8d new optimal ratio for d26 training Andrej Karpathy 2026-02-06 19:21:27 +00:00
e527521a3f briefly mention batch ramp experimentation too, too weak to merge in my few attempts Andrej Karpathy 2026-02-05 22:21:03 +00:00
96522798f1 docs docs docs Andrej Karpathy 2026-02-05 20:27:07 +00:00
5fdd5cdb24 new leaderboard record via new auto-calculated optimal batch size. for d26 it is 1M, up from 0.5M that was default earlier Andrej Karpathy 2026-02-05 20:11:32 +00:00
2c062aaa94 nit: don't mutate args, create new var for total_batch_size Andrej Karpathy 2026-02-05 19:59:46 +00:00
f41dd3cbd7 auto-calculate optimal batch size. the original setting of 0.5M was only optimal for d12, but d26 prefers 1M and so on Andrej Karpathy 2026-02-05 19:40:37 +00:00
98eed6df18 bring back an assert guarding against bad param sizing Andrej Karpathy 2026-02-05 18:14:09 +00:00
012da1a78b Typo fixes (#480) Sofie Van Landeghem 2026-02-05 19:12:50 +01:00
75b302f331 fix hash commit on leaderboard and a paragraph clarification Andrej Karpathy 2026-02-05 16:14:28 +00:00
1144d186ed try and fail relu^2 -> swiglu Andrej Karpathy 2026-02-05 02:42:46 +00:00
d63b7ab9ac try and fail relu^2 -> swiglu Andrej Karpathy 2026-02-05 02:41:46 +00:00
718e5e9d67 correctly reference NorMuon and fix misleading terms that i may have hastily ported over from modded-nanogpt Andrej Karpathy 2026-02-05 01:39:26 +00:00
542beb0c8c bump speedrun to be the up to date leaderboard run Andrej Karpathy 2026-02-04 02:12:04 +00:00
d510b1385b quick experiments to log Andrej Karpathy 2026-02-03 23:21:39 +00:00
16b8ac7da3 oops forgot to attach leaderboard file too Andrej Karpathy 2026-02-03 21:06:12 +00:00
fe55b092b8 minor cosmetics for the table Andrej Karpathy 2026-02-03 21:05:28 +00:00
a67eba35dc add feb2 new leaderboard record from upgrading to fp8 training, +4.3% speedup to time to GPT-2 Andrej Karpathy 2026-02-03 20:54:30 +00:00
6079f78fc3 add fp8 training with torchao Andrej Karpathy 2026-02-03 20:51:26 +00:00
8ebc14b348 small touchups to the eval script, re-order items etc, cosmetic Andrej Karpathy 2026-02-03 20:25:48 +00:00
