There is a catch nobody is talking about.
Gemma 4 uses shared KV cache layers: the last layers reuse K/V tensors from earlier layers instead of computing their own. That is why it fits on a laptop.
But that same architecture breaks cache reuse in llama.cpp. Every request
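The sharing idea can be sketched with a toy attention stack. This is purely illustrative: the layer counts, projections, and the `n_kv_producers` split are made-up assumptions, not Gemma's actual configuration.

```python
import numpy as np

def attention(q, k, v):
    # Scaled dot-product attention, single head, no masking.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
d, seq = 8, 4
x = rng.standard_normal((seq, d))

n_layers = 6
n_kv_producers = 4  # hypothetical: only the first 4 layers compute their own K/V

kv_cache = []
for layer in range(n_layers):
    Wq = rng.standard_normal((d, d)) * 0.1
    q = x @ Wq
    if layer < n_kv_producers:
        # A "producer" layer computes and stores its own K/V entry.
        Wk = rng.standard_normal((d, d)) * 0.1
        Wv = rng.standard_normal((d, d)) * 0.1
        kv_cache.append((x @ Wk, x @ Wv))
    # Later layers reuse the last producer's cached K/V instead of
    # computing (and storing) their own.
    k, v = kv_cache[min(layer, n_kv_producers - 1)]
    x = x + attention(q, k, v)

# The cache holds n_kv_producers entries instead of n_layers,
# which is where the memory saving comes from.
print(len(kv_cache))  # 4 instead of 6
```

The saving is exactly the ratio of producer layers to total layers; the trade-off is that a cached K/V entry is no longer private to one layer, which is what complicates per-layer cache bookkeeping in an inference engine.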