add engram-lite, add log, tune scaling laws analysis scripts

2026-01-27 22:31:17 +00:00
parent 59e36cc727
commit c8d93beed2
5 changed files with 346 additions and 35 deletions
@@ -4,6 +4,140 @@ A running summary documenting some experiments and findings. Started ~Jan 7 2026

 ---

+## 2026-01-27: Bigram Hash Embeddings (Engram-lite)
+
+Explored N-gram memory modules inspired by the [DeepSeek Engram paper](https://arxiv.org/abs/2506.08046) and [modded-nanogpt PR #201](https://github.com/KellerJordan/modded-nanogpt/pull/201).
+
+### Background
+
+The Engram paper introduces "conditional memory" as a complement to MoE - using O(1) hash lookups to retrieve static N-gram patterns instead of reconstructing them through computation. Key insight: transformers waste early layers "simulating retrieval through computation" for patterns like named entities and formulaic phrases that could be simple table lookups.
+
+### What We Tried
+
+**1. Full Engram module with context-aware gating (paper design)**
+```python
+# Hash bigrams to retrieve embeddings, then gate with hidden state
+e = embed(hash(prev_token, curr_token))
+q = RMSNorm(h)           # hidden state as query
+k = RMSNorm(W_k @ e)     # projected embedding as key
+v = W_v @ e
+α = sigmoid(q · k / √d)  # scalar gate per position
+output = α * v
+```
+- Injected after block 1 (paper found early injection optimal)
+- Slight improvement, but quite a bit of complexity added.
+
+**2. Early-layer only injection**
+- Only inject bigram signal in first 4 layers (where paper claims static pattern offloading helps most)
+- **Result:** Actually hurt performance. The model seems to need uniform injection across all layers.
+
+**3. Trigrams**
+- Extended to hash both 2-grams and 3-grams, concatenating embeddings
+- **Result:** No improvement over bigrams alone. Dilutes capacity from more frequent 2-gram patterns.
+
+**4. Bigram-only with x0-style injection (modded-nanogpt engram-lite approach)**
+- Simple hash: `(36313 * curr) XOR (27191 * prev) mod table_size`
+- Zero-init embedding table, learned per-layer lambdas
+- Add to residual at every layer: `x = resid_λ[i]*x + x0_λ[i]*x0 + bigram_λ[i]*x0_bigram`
+- **Result:** This simple approach works and provides a consistent improvement.
+
+TLDR The winning approach follows modded-nanogpt's "engram-lite", simply adding the following module and feeding its output into the residual branch (gated by a per-layer learnable \lambda) before every single block:
+
+```python
+class BigramEmbed(nn.Module):
+    def __init__(self, vocab_size, embed_dim, table_multiplier=5):
+        self.embed = nn.Embedding(vocab_size * table_multiplier, embed_dim)
+
+    def forward(self, idx):
+        h = (36313 * idx[:, 1:]) ^ (27191 * idx[:, :-1]) % (table_size - 1)
+        return self.embed(h)
+```
+
+As for optimal hyperparameters:
+
+- **Table size:** `vocab_size * 5` (~164K entries for 32K vocab). Swept a number of settings and 5 was optimal.
+- **Injection:** Every layer via learned `bigram_lambdas` (init 0.1 was better than 0.0).
+- **Normalization:** Also tried adding a `norm()` to the embeddings (mirroring the token embeddings), this was slightly worse.
+- **Init:** Zero-init embedding, so starts as identity (tried small noisy init, it's worse)
+- **Optimizer:** AdamW with same LR as token embeddings
+
+### Key Learnings
+
+1. **Gating didn't help at our scale.** The paper's context-aware gating mechanism (sigmoid dot-product gate) added parameters and complexity without improvement. modded-nanogpt found the same: "simple direct addition to the residual stream outperformed by a decent margin."
+
+2. **Uniform injection beats early-only.** Despite the paper's finding that early layers benefit most, restricting injection to early layers hurt. The x0-style "add everywhere with learned lambda" pattern works better for our architecture/scale.
+
+3. **Bigrams are sufficient.** Trigrams didn't help - the extra context doesn't pay for the diluted capacity.
+
+4. **Scale matters.** The Engram paper's results are at 27B params with MoE. At our ~100M-1B scale, the simpler approach wins. The elaborate gating mechanism may become useful at larger scales where collision handling matters more.
+
+### Parameters Added
+
+For d12 model with `table_multiplier=5`:
+- Bigram embedding: 32768 × 5 × 768 = ~126M params
+- Per-layer lambdas: 12 scalars (negligible)
+
+If you're keeping track, we now have *a lot* of parameters, a significant amount of them in embeddings (token embeddings, bigram embeddings, value embeddings). For example, for a d12 we now have:
+
+```
+Parameter counts:
+wte                     : 25,165,824
+bigram_embed            : 125,829,120
+value_embeds            : 150,994,944
+lm_head                 : 25,165,824
+transformer_matrices    : 84,935,808
+scalars                 : 36
+total                   : 412,091,556
+```
+
+In other words, only about a quarter of parameters are now weight projections and the vast majority is embedding tables.
+
+Still, on all axes (steps, wall clock time, flops), this somewhat parameter-bloated architecture beats the baseline and will now become the default.
+
+After adding the engram-lite, I re-ran the scaling laws to determine the new optimal tokens:params ratio. I swept FLOPs in the range 1e18..1e19, exponentially strided in 4 settings (1e18, 2e18, 5e18, 1e19). I looked at a number of ways of determining the effective parameter count for the purposes of the scaling laws. The results looked like this:
+
+```
+Kaplan-style (all projections including lm_head and no embeddings)
+
+Optimal configurations (from quadratic fits):
+FLOPs        Eff Params      Tokens          Ratio      Val BPB
+-----------------------------------------------------------------
+1e+18        110,678,115     1,241,505,403   11.2       0.8972
+2e+18        167,797,457     1,785,336,422   10.7       0.8616
+5e+18        250,650,865     2,642,234,152   10.8       0.8293
+1e+19        381,758,347     3,806,871,243   10.3       0.7999
+
+N \propto C^0.54, D \propto C^0.49
+
+Chinchilla-style (all parameters, period.)
+
+Optimal configurations (from quadratic fits):
+FLOPs        Eff Params      Tokens          Ratio      Val BPB
+-----------------------------------------------------------------
+1e+18        416,320,605     1,232,157,011   3.0        0.8974
+2e+18        560,239,841     1,763,669,281   3.2        0.8616
+5e+18        741,495,903     2,629,909,368   3.6        0.8291
+1e+19        988,644,331     3,884,841,895   4.0        0.7999
+
+N \propto C^0.37, D \propto C^0.50
+
+Transformer-only-style (only the projections inside the transformer)
+
+Optimal configurations (from quadratic fits):
+FLOPs        Eff Params      Tokens          Ratio      Val BPB
+-----------------------------------------------------------------
+1e+18        80,259,665      1,315,639,547   17.2       0.8966
+2e+18        131,488,566     1,864,134,141   14.5       0.8622
+5e+18        220,985,474     2,595,328,843   12.1       0.8302
+1e+19        401,213,504     3,328,704,512   8.5        0.7994
+
+N \propto C^0.70, D \propto C^0.41
+```
+
+Clearly, the Kaplan-style ratios are most consistent and produce stable ~0.5 exponents for both params and tokens, meaning we can have a single fixed ratio of tokens:params for compute optimal models. This turns out to be about ~10.5, which now becomes the new default.
+
+---
+
 ## 2026-01-19 to 2026-01-22: Optimizer Hyperparameter Sweep

 Ran ~320 experiments across 6 rounds, scaling from d12→d16→d20 to find optimal optimizer hyperparameters. Added granular per-component control to `setup_optimizers()` — separate LRs and betas for embedding, unembedding, value_embeds, resid_lambdas, x0_lambdas, and Muon matrix params.