nanochat-omni/nanochat/dataloader.py at babde18ce1cb59cb3d36f8874d1248983c7ba9c3

Andrej Karpathy 43c29dd9d5 Big DataLoader refactor: BOS-aligned dataloaders with epoch tracking for pre/mid-training

The new DataLoader ensures that every token sequence in train/val batches has a BOS token
at the beginning. As a result, no token stream starts abruptly in the middle of a document,
which could confuse the model. Note that this changes the loss scale, because there
are fewer confusing tokens in the train/val batches. The main downside is that we now waste
about 35% of tokens due to cropping. This is ok because we have a lot of data. See the
dev/LOG.md entry for this change for much more detail.
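
As a rough illustration of the idea described above, here is a minimal sketch of BOS-aligned packing with cropping. It is not the code in dataloader.py: the function name `pack_bos_aligned`, the `BOS = 0` token id, and the simple greedy strategy are all assumptions made for this example. Each row starts at a document boundary, documents are appended while they fit, and whatever overflows the row is cropped and counted as waste.

```python
from typing import Iterable, List, Tuple

BOS = 0  # hypothetical BOS token id, chosen only for this sketch


def pack_bos_aligned(
    docs: Iterable[List[int]], row_len: int, bos: int = BOS
) -> Tuple[List[List[int]], int]:
    """Greedily pack tokenized documents into fixed-length rows.

    Every row begins with a BOS token because a new row is only ever
    started at the beginning of a document. Tokens that do not fit in
    the current row are cropped (dropped), not carried to the next row.
    Returns the complete rows and the number of dropped tokens.
    """
    rows: List[List[int]] = []
    row: List[int] = []
    dropped = 0
    for doc in docs:
        tokens = [bos] + list(doc)
        space = row_len - len(row)      # always >= 1 here
        row.extend(tokens[:space])      # keep what fits
        dropped += max(0, len(tokens) - space)  # crop the rest
        if len(row) == row_len:
            rows.append(row)
            row = []                    # next row starts with the next doc's BOS
    dropped += len(row)                 # an incomplete final row is discarded
    return rows, dropped


# Toy example: four short "documents" packed into rows of 4 tokens.
example_docs = [[1, 2, 3], [4, 5], [6, 7, 8, 9, 10], [11]]
example_rows, example_dropped = pack_bos_aligned(example_docs, row_len=4)
total = sum(len(d) + 1 for d in example_docs)  # +1 for each BOS
print(example_rows, f"wasted {example_dropped}/{total} tokens")
```

The waste fraction in this toy example depends entirely on the made-up document lengths and says nothing about the roughly 35% figure quoted above, which comes from the real data and packing strategy.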

2026-01-13 20:05:47 +00:00

8.3 KiB
