Document new Leaderboard entry congrats @ddudek for pointing out ClimbMix, time to GPT-2 is now 2.01 hours, down from 2.76 previously
This commit is contained in:
@@ -147,3 +147,47 @@ Minimum validation bpb: 0.74645
|
||||
```
|
||||
|
||||
The big change here is that the batch size was doubled from 0.5M to 1M, which works better for a d26 model and allowed me to decrease the number of optimization steps a bit via `--target-param-data-ratio` from 8.5 to 8.25. The TLDR is that the original batch size of 0.5M was tuned for d12, but bigger models (e.g. d26) prefer larger total batch size. I determined in experiments that d26 prefers 1M. Then I implemented and merged a principled way to calculate the optimal batch size given depth so that all nanochat models of all depths benefit. See [dev/LOG.md](dev/LOG.md) entry "2026-02-05: Auto Batch Size Scaling" for more detail.
|
||||
|
||||
## Run 4
|
||||
|
||||
Achived Mar 3 2026 on commit `324e69c`. The big change is the switch from HuggingFace FineWeb-EDU to NVIDIA ClimbMix dataset. `@karpathy` has tried to swap the dataset many times, each time with a negative result (FineWeb, DCLM, Olmo), but ClimbMix produced clear and immediate gains. Credit to `@ddudek` for originally discovering ClimbMix for nanochat and reporting the improvements, which kicked off the followup investigation.
|
||||
|
||||
To reproduce, use the commit above, download at least 150 data shards, train the tokenizer:
|
||||
|
||||
```
|
||||
python -m nanochat.dataset -n 150
|
||||
python -m scripts.tok_train
|
||||
```
|
||||
|
||||
Then kick off the run in the typical way, using a slightly lower than compute optimal ratio of 9.5 (vs compute optimal 10.5), meaning the d24 is slightly undertrained.
|
||||
|
||||
```
|
||||
OMP_NUM_THREADS=1 torchrun --standalone --nproc_per_node=8 -m scripts.base_train -- \
|
||||
--depth=24 \
|
||||
--run="d24-climbmix" \
|
||||
--model-tag="d24-climbmix" \
|
||||
--sample-every=-1 \
|
||||
--save-every=-1 \
|
||||
--core-metric-max-per-task=-1 \
|
||||
--core-metric-every=999999 \
|
||||
--target-param-data-ratio=9.5 \
|
||||
--device-batch-size=16 \
|
||||
--fp8
|
||||
```
|
||||
|
||||
I ran this command 7 individual times. Because our training is mildly non-deterministic, we get a spread of CORE scores, e.g.:
|
||||
|
||||
```
|
||||
0.25373
|
||||
0.2584
|
||||
0.25489
|
||||
0.2568
|
||||
0.25732
|
||||
0.26765
|
||||
0.25119
|
||||
```
|
||||
|
||||
Mean is 0.25714 (higher than the GPT-2 threshold needed), max-min is 0.01646. Something to investigate in the future is that even slightly better results can be obtained by randomly shuffling the the data shards (i.e. just going in a different order). This is unexpected because the documents were completely fully shuffled during data construction, so one would expect a relatively uniform data distribution. Indeed, the current default order is unfortunately among the worse ("unlucky") ones you can obtain with different shuffle seeds, but it suffices to beat GPT-2 for now so I am merging. TODO investing a bit more later.
|
||||
|
||||
NOTE: The `val_bpb` is as of this run *NOT* comparable due to the data distribution change to the previous 3 runs. This run happens to be at `0.71854` validation bpb. If the dataset is not changed, the `val_bpb` number is a great, smooth metric to track relative performance w.r.t. and has less noise than CORE.
|
||||
|
||||
|
||||
Reference in New Issue
Block a user