March 22, 2026
So we can copy arbitrary blocks of layers by reference, wire up the KV cache for the repeated passes, and maybe train adapters. Bigger models without the full VRAM penalty. If we’re lucky, perhaps layer block “circuit edges” align with attention layer patterns in GDN or Mamba hybrid architectures, so the KV cache increase is small, too? Really nice work, @dnhkng!
15/n The duplicated layers use no extra VRAM for weights: they're pointer copies. You pay more compute and a bigger KV cache, yes, but no extra weight memory. And I suspect fine-tuning just the two junction layers (where the loop reconnects) is all you really need to clean it up.
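To make the memory accounting concrete, here is a minimal pure-Python sketch of the pointer-copy idea. It is not the author's implementation. The names `Layer`, `loop_block`, `unique_weight_sets`, and `kv_cache_slots` are hypothetical, the 8-layer "ssm/attn" pattern is made up for illustration, and the KV count assumes that only attention layers keep a per-token KV cache, with one cache per executed pass.

```python
from dataclasses import dataclass, field


@dataclass
class Layer:
    index: int
    kind: str  # "attn" or "ssm" (stand-in for a Mamba/GDN-style layer)
    # Stand-in for the layer's parameter tensors.
    weights: list = field(default_factory=lambda: [0.0] * 4)


def loop_block(layers, start, end, repeats=1):
    """Return an execution order where layers[start:end] runs 1 + repeats times.

    The repeated entries are the same Layer objects (references), not copies,
    so no new weights are allocated.
    """
    block = layers[start:end]
    return layers[:end] + block * repeats + layers[end:]


def unique_weight_sets(stack):
    """Count distinct weight objects, i.e. what actually occupies memory."""
    return len({id(layer.weights) for layer in stack})


def kv_cache_slots(stack):
    """Count executed attention passes; each is assumed to need its own KV cache.

    Recurrent layers would also need per-pass state, but of fixed size,
    so they are ignored here.
    """
    return sum(1 for layer in stack if layer.kind == "attn")


# Hypothetical hybrid pattern: one attention layer after every three SSM layers.
kinds = ["ssm", "ssm", "ssm", "attn"] * 2
base = [Layer(i, k) for i, k in enumerate(kinds)]

crosses_attn = loop_block(base, 2, 5)  # repeats layers 2, 3, 4 (includes attn layer 3)
ssm_only = loop_block(base, 4, 7)      # repeats layers 4, 5, 6 (no attention layer)

for name, stack in [("base", base), ("crosses_attn", crosses_attn), ("ssm_only", ssm_only)]:
    print(name, len(stack), unique_weight_sets(stack), kv_cache_slots(stack))
```

Under these assumptions, both looped stacks run more layer passes than the base model while holding the same number of weight sets. The KV cache grows only when the repeated block contains an attention layer, which is the alignment the reply above is hoping for.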