Big DataLoader refactor: BOS-aligned dataloaders with epoch tracking for pre/mid-training

The new DataLoader ensures that every token sequence in train/val batches starts with a BOS
token, so no token stream begins abruptly in the middle of a document, which could confuse
the model. Note that this changes the loss scale, because the train/val batches now contain
fewer confusing tokens. The main downside is that cropping now wastes about 35% of tokens.
This is OK because we have a lot of data. See the dev/LOG.md entry for this change for
much more detail.
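
To make the BOS alignment and cropping trade-off concrete, here is a minimal sketch. It is not the DataLoader from this commit: the greedy packing strategy, the function name `pack_bos_aligned`, and the `BOS = 0` token id are all assumptions chosen for illustration. It only shows why cropping at row boundaries guarantees that every row starts with BOS, and how the wasted tokens arise.

```python
BOS = 0  # hypothetical BOS token id, for illustration only


def pack_bos_aligned(docs, row_len):
    """Greedily pack documents into fixed-length rows that each start with BOS.

    Assumes each document is a non-empty list of token ids beginning with BOS.
    When a document does not fit in the space left in the current row, it is
    cropped and the remainder is discarded, so the next row again starts at a
    document boundary. Returns (rows, num_wasted_tokens).
    """
    rows, row, wasted = [], [], 0
    for doc in docs:
        assert doc and doc[0] == BOS, "every document must start with BOS"
        space = row_len - len(row)          # free slots left in the current row
        row.extend(doc[:space])             # keep only what fits
        wasted += max(0, len(doc) - space)  # cropped tail is thrown away
        if len(row) == row_len:             # row is full: emit it, start a new one
            rows.append(row)
            row = []
    wasted += len(row)  # a trailing partial row is dropped in this sketch
    return rows, wasted


if __name__ == "__main__":
    docs = [[BOS, 1, 2], [BOS, 3, 4, 5, 6], [BOS, 7]]
    rows, wasted = pack_bos_aligned(docs, row_len=4)
    total = sum(len(d) for d in docs)
    print(rows, f"wasted {wasted}/{total} tokens")
```

A row is only ever started when it is empty, and the first thing written into an empty row is the start of a document, so the BOS-first property holds by construction. The waste fraction depends on the document length distribution relative to the row length; the 35% figure above is the author's measurement, not something this sketch reproduces.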
Andrej Karpathy
2026-01-13 20:05:47 +00:00
parent 23985413aa
commit 43c29dd9d5
7 changed files with 330 additions and 106 deletions
+3 -2
@@ -17,8 +17,9 @@ if [ -z "$SKIP_SETUP" ]; then
uv sync --extra gpu
source .venv/bin/activate
-# Tokenizer
-python -m nanochat.dataset -n 240
+# Tokenizer, download 1000 shards for pretraining
+# (probably this can be reduced but it's tricky to determine the exact right number, TODO).
+python -m nanochat.dataset -n 1000
python -m scripts.tok_train --max_chars=2000000000 --vocab_size=32768
else
source .venv/bin/activate