Big DataLoader refactor: BOS-aligned dataloaders with epoch tracking for pre/mid-training

The new DataLoader ensures that every token sequence in the train/val batches starts with a BOS token. As a result, no sequence begins abruptly in the middle of a document, which could confuse the model. Note that this changes the loss scale, because the train/val batches now contain fewer confusing tokens. The main downside is that about 35% of tokens are now wasted due to cropping. This is acceptable because we have a lot of data. See the dev/LOG.md entry for this change for much more information.
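
To make the BOS-alignment and cropping trade-off concrete, here is a minimal sketch of one way such rows could be built. It is not the code from this commit: the function name `bos_aligned_rows`, its parameters, and the greedy packing strategy are all assumptions chosen for illustration, and the real DataLoader may pack and crop differently.

```python
from typing import Iterable, List, Sequence, Tuple


def bos_aligned_rows(
    docs: Iterable[Sequence[int]],
    row_len: int,
    bos_id: int,
) -> Tuple[List[List[int]], int]:
    """Hypothetical helper: pack documents into fixed-length rows.

    Every row starts with BOS because a row only ever begins with the
    start of a document. A document that does not fit in the remaining
    space is cropped, and the cropped tail is counted as wasted.
    """
    rows: List[List[int]] = []
    wasted = 0
    row: List[int] = []
    for doc in docs:
        tokens = [bos_id] + list(doc)  # each document is prefixed with BOS
        space = row_len - len(row)
        row.extend(tokens[:space])  # crop to the space left in this row
        wasted += max(0, len(tokens) - space)  # dropped tail of the document
        if len(row) == row_len:
            rows.append(row)
            row = []  # the next row starts with the next document's BOS
    wasted += len(row)  # an incomplete final row is dropped in this sketch
    return rows, wasted


if __name__ == "__main__":
    rows, wasted = bos_aligned_rows(
        [[1, 2, 3], [4, 5, 6, 7, 8], [9]], row_len=4, bos_id=0
    )
    print(rows)    # [[0, 1, 2, 3], [0, 4, 5, 6]]
    print(wasted)  # 4
```

In this toy run, 4 of the 12 tokens (including BOS tokens) are discarded, which shows where the cropping waste comes from: the tails of documents that overflow a row never appear in any batch.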
2026-01-13 20:05:47 +00:00
parent 23985413aa
commit 43c29dd9d5
7 changed files with 330 additions and 106 deletions
@@ -17,8 +17,9 @@ if [ -z "$SKIP_SETUP" ]; then
    uv sync --extra gpu
    source .venv/bin/activate

-    # Tokenizer
-    python -m nanochat.dataset -n 240
+    # Tokenizer, download 1000 shards for pretraining
+    # (probably this can be reduced but it's tricky to determine the exact right number, TODO).
+    python -m nanochat.dataset -n 1000
    python -m scripts.tok_train --max_chars=2000000000 --vocab_size=32768
 else
    source .venv/bin/activate