Big DataLoader refactor: BOS-aligned dataloaders with epoch tracking for pre/mid-training
The new DataLoader ensures that every token sequence in train/val batches has a BOS token at the beginning. Therefore, no token streams start abruptly in the middle of a document, which could be confusing for the model. Note that this changes the loss scale because there are fewer confusing tokens in the train/val batches. The main downside is that we now waste about 35% of tokens due to cropping. This is ok because we have a lot of data. See dev/LOG.md entry for this change for a lot more information.
This commit is contained in:
+3
-2
@@ -17,8 +17,9 @@ if [ -z "$SKIP_SETUP" ]; then
|
||||
uv sync --extra gpu
|
||||
source .venv/bin/activate
|
||||
|
||||
# Tokenizer
|
||||
python -m nanochat.dataset -n 240
|
||||
# Tokenizer, download 1000 shards for pretraining
|
||||
# (probably this can be reduced but it's tricky to determine the exact right number, TODO).
|
||||
python -m nanochat.dataset -n 1000
|
||||
python -m scripts.tok_train --max_chars=2000000000 --vocab_size=32768
|
||||
else
|
||||
source .venv/bin/activate
|
||||
|
||||
Reference in New Issue
Block a user