tune logit softcap?
@@ -4,6 +4,10 @@ A running summary documenting some experiments and findings. Started ~Jan 7 2026
---
## 2026-03-02: SoftCap tuning
Quick experiment to tune the logit softcap at d24 scale. Swept values from 5 to 30: 5 was terrible, and the rest were all about equal except 20, which was the best. Minor but solid improvement: val loss improved by ~1e-3 (0.716 -> 0.715). Setting 20 as the default.
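For reference, a minimal sketch of what a logit softcap does, assuming the common tanh form `cap * tanh(logits / cap)`. This entry does not spell out the exact formula used, so treat the form, the helper name `softcap_logits`, and the sample values as illustrative assumptions rather than the repo's implementation.

```python
import math

def softcap_logits(logits, cap=20.0):
    """Smoothly squash each logit into [-cap, cap] via cap * tanh(x / cap).

    Hypothetical helper for illustration only; assumes the tanh form of softcapping.
    """
    return [cap * math.tanh(x / cap) for x in logits]

# Small logits pass through almost unchanged; large ones saturate near +/- cap.
raw = [-100.0, -1.0, 0.0, 1.0, 100.0]
for cap in (5.0, 20.0):
    capped = softcap_logits(raw, cap)
    print(cap, [round(v, 3) for v in capped])
```

Under this form, a small cap such as 5 already compresses moderately sized logits, while a larger cap leaves typical logits close to their raw values and only clips outliers.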
## 2026-02-19: Mixture of Experts (negative)
Implemented a DeepSeekV3-style Mixture of Experts layer as a drop-in replacement for the dense MLP. The MoE branch works and improves per-step validation loss, but is not a net improvement on wall clock time due to MoE overhead (at least for our scale of interest of approx GPT-2 capability).