Andrej Karpathy
324e69c45d
big, breaking change but large upside: swap the previous FineWeb-EDU dataset for the NVIDIA ClimbMix dataset. Requires people to download the data shards. The upside is that training a GPT-2 capability model now takes only ~2 hours, down from 2.76 hours, so this is a huge win data-wise
2026-03-04 19:47:12 +00:00
Andrej Karpathy
e8fec97d4c
slightly more efficient dataloader that reduces the number of python objects flying around and causing strain on runtime and garbage collector
2026-02-02 01:17:30 +00:00
Sofie Van Landeghem
43078c347e
clean up original tokenizing_distributed_data_loader (#478)
2026-01-31 19:44:12 -08:00
Andrej Karpathy
6a341f2ecf
contiguous views and a single HtoD transfer for inputs/targets, much cleaner
2026-01-30 00:23:01 +00:00
Yamahammer
e1dafc510f
Reduce token waste in BOS bestfit by cropping shortest doc (#445)
When no document fits the remaining row space, crop the shortest
document in the buffer instead of the first. This minimizes
discarded tokens.
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-16 18:50:34 -08:00
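The cropping rule described in the commit above can be illustrated with a minimal sketch. This is not the repository's implementation: the function name `pack_row`, the BOS id of 0, and the choice to prefer the longest document that fits are all assumptions made for illustration. Only the fallback (crop the shortest buffered document when nothing fits) comes from the commit message.

```python
BOS = 0  # hypothetical BOS token id, chosen only for this sketch


def pack_row(buffer, row_len):
    """Fill one row of row_len tokens from buffer (a list of token lists).

    Each document is assumed to start with BOS. The buffer is mutated:
    documents that are used (fully or cropped) are removed.
    Returns (row, n_discarded_tokens).
    """
    row = []
    discarded = 0
    while len(row) < row_len and buffer:
        remaining = row_len - len(row)
        fitting = [i for i, doc in enumerate(buffer) if len(doc) <= remaining]
        if fitting:
            # Assumed best-fit rule: take the longest document that fits whole.
            idx = max(fitting, key=lambda i: len(buffer[i]))
            row.extend(buffer.pop(idx))
        else:
            # Nothing fits: crop the shortest document so the discarded
            # tail is as small as possible.
            idx = min(range(len(buffer)), key=lambda i: len(buffer[i]))
            doc = buffer.pop(idx)
            row.extend(doc[:remaining])
            discarded += len(doc) - remaining
    return row, discarded


docs = [[BOS, 1, 2, 3, 4, 5, 6, 7], [BOS, 1, 2, 3, 4], [BOS, 1, 2]]
row, discarded = pack_row(docs, row_len=6)
print(row, discarded)
```

In this toy case one slot is left after the 5-token document is placed. Cropping the 3-token document discards 2 tokens, whereas cropping the first buffered document (8 tokens) would discard 7.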
Andrej Karpathy
43c29dd9d5
Big DataLoader refactor: BOS-aligned dataloaders with epoch tracking for pre/mid-training
The new DataLoader ensures that every token sequence in train/val batches has a BOS token
at the beginning. Therefore, no token streams start abruptly in the middle of a document,
which could be confusing for the model. Note that this changes the loss scale because there
are fewer confusing tokens in the train/val batches. The main downside is that we now waste
about 35% of tokens due to cropping. This is ok because we have a lot of data. See dev/LOG.md
entry for this change for a lot more information.
2026-01-13 20:05:47 +00:00
Andrej Karpathy
21608ec51e
allow base_loss to report the loss of any arbitrary huggingface model similar to base_eval. had to change dataloader to be a lot better and just take tokenizer, not load the nanochat one. much better this way anyway
2026-01-12 03:10:13 +00:00
Matěj Kripner
f1bf69d562
feat: pad vocab size to a multiple of 64 for DDP optimizers and efficiency
2025-12-09 12:38:18 +01:00
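Padding a vocabulary size is usually done by rounding up to the next multiple. The sketch below assumes that reading of the commit above; the helper name `pad_vocab_size` is hypothetical and not taken from the repository.

```python
def pad_vocab_size(vocab_size: int, multiple: int = 64) -> int:
    """Round vocab_size up to the nearest multiple (ceiling division)."""
    return ((vocab_size + multiple - 1) // multiple) * multiple


print(pad_vocab_size(50257))  # GPT-2's vocabulary size, as an example input
print(pad_vocab_size(65536))  # already a multiple of 64, so unchanged
```

The extra rows in the embedding and output layers are never produced by the tokenizer, so they are commonly masked or ignored when sampling.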
sunyujun03
01ea71be39
Fix distributed Parquet dataloader resume for multi-epoch training
2025-12-08 00:10:19 -06:00
Andrej Karpathy
c6abcdfe3a
big change: add pretraining resumption logic so that checkpoints can now be approximately resumed and training can continue. this is useful for very long runs when you don't want the anxiety of your run crashing for some reason. alternatively, it's a way to recover training in the event of loss spikes. i mean, this should have been there in v0 but it's ok. the resumption is approximate to control complexity and bloat, but it's possible we want to change that in the future. to use, set --save_every to the step interval at which to write checkpoints, and then use --resume_from_step to resume optimization from a given step. only base model training (pretraining) supports this atm, but it's ok because midtraining is comparatively quite a bit faster.
2025-11-13 15:34:40 +00:00
Luke Stanley
defd1246aa
Fix Torch crash caused by pinning on CPU
2025-10-21 20:28:10 +00:00
Andrej Karpathy
dfcb1c16f1
Merge branch 'master' into cpu-mps-dev
2025-10-21 17:15:53 +00:00
Andrej Karpathy
bb71c64579
fix silly issue in dataloader, this version is much faster and more portable to mps too
2025-10-21 17:12:50 +00:00
Andrej Karpathy
722da4f543
trying to add basic cpu support, will try mps too
2025-10-16 16:14:38 +00:00
karpathy
3a5e0bc50b
initial commit
2025-10-13 06:49:24 -07:00