Inference is splitting in two — Nvidia’s $20B Groq bet explains its next act

Nvidia’s $20 billion strategic licensing deal with Groq represents one of the first clear moves in a four-front fight over the future AI stack, and 2026 is when that fight becomes obvious to enterprise builders. For the technical decision-makers we talk to every day — the people building AI applications and the data pipelines that drive them — this deal is a signal that the era of the one-size-fits-all GPU as the default answer for AI inference is ending. We are entering the age of the Disaggregated Inference Architecture, in which the silicon itself is being split into two different types to serve a world that demands both massive context and instantaneous reasoning.

Why inference is breaking the GPU architecture in two

To understand why Nvidia CEO Jensen Huang dropped one-third of Nvidia’s reported $60 billion cash pile on a licensing deal, you have to look at the existential threats converging on the company’s reported 92% market share. The industry reached a tipping point in late 2025: For the first time, inference — the phase where trained models actually run — surpassed training in total data center revenue, according to Deloitte. In this new “inference flip,” the metrics have changed. While accuracy remains the baseline, the battle is now being fought over latency and the ability to maintain “state” in autonomous agents. There are four fronts to that battle, and each one points to the same conclusion: Inference workloads are fragmenting faster than GPUs can generalize.

1. Breaking the GPU in two: Prefill vs. decode

Gavin Baker, an investor in Groq (and therefore biased, but also unusually fluent on the architecture), summarized the core driver of the Groq deal cleanly: “Inference is disaggregating into prefill and decode.” Prefill and decode are two distinct phases:

The prefill phase: Think of this as the “prompt” stage. The model must ingest massive amounts of data — whether it’s a 100,000-line codebase or an hour of video — and compute a contextual understanding. This phase is compute-bound, requiring the massive matrix multiplications that Nvidia’s GPUs have historically excelled at.

The decode (generation) phase: Once the prompt is ingested, the model generates one word (or token) at a time, feeding each one back into the system to predict the next. This phase is memory-bandwidth bound: If data can’t move from memory to the processor fast enough, the model stutters, no matter how powerful the GPU is. (This is where Nvidia has been weak, and where Groq’s language processing unit (LPU), with its on-chip SRAM, shines. More on that in a bit.) A back-of-the-envelope calculation below shows why bandwidth, not raw compute, sets the decode ceiling.

Nvidia has announced an upcoming Vera Rubin family of chips that it’s architecting specifically to handle this split. The Rubin CPX component of the family is the designated prefill workhorse, optimized for massive context windows of 1 million tokens or more. To handle that scale affordably, it moves away from the eye-watering expense of high bandwidth memory (HBM) — Nvidia’s current gold-standard memory, which sits right next to the GPU die — and instead uses 128GB of GDDR7, a newer and cheaper class of graphics memory. HBM provides extreme speed (though not as quick as Groq’s static random-access memory, or SRAM), but its supply is limited and its cost is a barrier to scale; GDDR7 offers a more cost-effective way to ingest massive datasets. Meanwhile, the “Groq-flavored” silicon that Nvidia is integrating into its inference roadmap will serve as the high-speed decode engine.
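To see why decode is bandwidth-bound rather than compute-bound, a quick back-of-the-envelope calculation helps. The sketch below uses illustrative assumptions (an 8-billion-parameter model in FP16, roughly 3 TB/s of HBM-class bandwidth, a 2GB KV cache) rather than vendor figures; the point is the shape of the math, not the exact numbers.

```python
# Back-of-the-envelope: why token-by-token decode is memory-bandwidth bound.
# All numbers are illustrative assumptions, not Nvidia or Groq figures.

PARAMS = 8e9              # assumed 8B-parameter model
BYTES_PER_PARAM = 2       # FP16 weights
KV_CACHE_BYTES = 2e9      # assumed ~2GB of cached prompt state
MEM_BANDWIDTH = 3e12      # assumed ~3 TB/s of HBM-class bandwidth

# Each decode step must stream roughly all of the weights, plus the KV cache,
# past the compute units just to produce a single token.
bytes_per_token = PARAMS * BYTES_PER_PARAM + KV_CACHE_BYTES

# Upper bound on generation speed if memory traffic were the only limit.
max_tokens_per_sec = MEM_BANDWIDTH / bytes_per_token

print(f"Bytes moved per generated token: {bytes_per_token / 1e9:.0f} GB")
print(f"Bandwidth-limited ceiling: {max_tokens_per_sec:.0f} tokens/sec")
# ~18 GB per token and a ceiling of roughly 167 tokens/sec on these
# assumptions -- and that ceiling drops as the KV cache grows, regardless of
# how many FLOPS the chip has. Prefill avoids this per-token tax because the
# whole prompt can be batched into large, compute-bound matrix multiplications.
```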
This is about neutralizing a threat from alternative architectures like Google’s TPUs and maintaining the dominance of CUDA, the Nvidia software ecosystem that has served as its primary moat for over a decade. All of this was enough for Baker, the Groq investor, to predict that Nvidia’s move to license Groq will cause all other specialized AI chips to be canceled — with the exceptions of Google’s TPU, Tesla’s AI5, and AWS’s Trainium.

2. The differentiated power of SRAM

At the heart of Groq’s technology is SRAM. Unlike the DRAM found in your PC or the HBM on an Nvidia H100 GPU, SRAM is etched directly into the logic of the processor. Michael Stewart, managing partner of Microsoft’s venture fund M12, describes SRAM as the best option for moving data over short distances with minimal energy. “The energy to move a bit in SRAM is like 0.1 picojoules or less,” Stewart said. “To move it between DRAM and the processor is more like 20 to 100 times worse.”

In the world of 2026, where agents must reason in real time, SRAM acts as the ultimate “scratchpad”: a high-speed workspace where the model can work through symbolic operations and complex reasoning without the wasted cycles of shuttling data to and from external memory. SRAM has a major drawback, however: It is physically bulky and expensive to manufacture, so its capacity is limited compared to DRAM.

This is where Val Bercovici, chief AI officer at Weka, a company whose flash-based storage serves as an extended memory tier for GPUs, sees the market segmenting. Groq-friendly AI workloads — the ones where SRAM has the advantage — are those that use small models of 8 billion parameters and below, Bercovici said. That isn’t a small market, though. “It’s just a giant market segment that was not served by Nvidia, which was edge inference, low latency, robotics, voice, IoT devices — things we want running on our phones without the cloud for convenience, performance, or privacy,” he said.

This 8B “sweet spot” matters because 2025 saw an explosion in model distillation, in which enterprises shrink massive models into highly efficient smaller versions. While SRAM isn’t practical for trillion-parameter “frontier” models, it is well suited to these smaller, high-velocity models.

3. The Anthropic threat: The rise of the ‘portable stack’

Perhaps the most underappreciated driver of this deal is Anthropic’s success in making its stack portable across accelerators. The company has pioneered a portable engineering approach for training and inference — essentially a software layer that lets its Claude models run across multiple AI accelerator families, including Nvidia’s GPUs and Google’s Ironwood TPUs. Until recently, Nvidia’s dominance was protected because running high-performance models outside the Nvidia stack was a technical nightmare. “It’s Anthropic,” Weka’s Bercovici told me. “The fact that Anthropic was able to … build up a software stack that could work on TPUs as well as on GPUs, I don’t think that’s being appreciated enough in the marketplace.” (Disclosure: Weka has been a sponsor of VentureBeat events.)

Anthropic recently committed to accessing up to 1 million TPUs from Google, representing over a gigawatt of compute capacity. This multi-platform approach ensures the company isn’t held hostage by Nvidia’s pricing or supply constraints. So for Nvidia, the Groq deal is equally a defensive move.
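Anthropic hasn’t published the internals of that portability layer, but the pattern it describes is a familiar one: serving code written against a thin hardware-abstraction interface, with one backend per accelerator family. The sketch below is a hypothetical illustration of that general pattern only; the class and method names are invented and do not represent Anthropic’s actual software or any vendor API.

```python
# Generic illustration of a hardware-abstraction ("portable stack") pattern.
# Hypothetical: names and backends are invented for this sketch.

from abc import ABC, abstractmethod

class AcceleratorBackend(ABC):
    """One implementation per accelerator family (GPU, TPU, LPU, ...)."""

    @abstractmethod
    def load_weights(self, checkpoint_path: str) -> None:
        """Load model weights onto this accelerator."""

    @abstractmethod
    def prefill(self, prompt_tokens: list[int]) -> object:
        """Ingest the prompt and return opaque cached state (e.g., a KV cache)."""

    @abstractmethod
    def decode_step(self, state: object) -> int:
        """Generate the next token from the cached state."""

class CudaBackend(AcceleratorBackend):
    # Would dispatch to CUDA kernels; bodies elided in this sketch.
    def load_weights(self, checkpoint_path: str) -> None: ...
    def prefill(self, prompt_tokens: list[int]) -> object: ...
    def decode_step(self, state: object) -> int: ...

class TpuBackend(AcceleratorBackend):
    # Would dispatch to XLA-compiled graphs; bodies elided in this sketch.
    def load_weights(self, checkpoint_path: str) -> None: ...
    def prefill(self, prompt_tokens: list[int]) -> object: ...
    def decode_step(self, state: object) -> int: ...

def generate(backend: AcceleratorBackend, prompt_tokens: list[int], max_new: int) -> list[int]:
    """Serving loop written once against the interface, not against a chip."""
    state = backend.prefill(prompt_tokens)
    return [backend.decode_step(state) for _ in range(max_new)]
```

Once the serving loop sees only the interface, switching accelerator families becomes a backend swap rather than a rewrite, which is precisely the dynamic that erodes a hardware moat.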
By integrating Groq’s ultra-fast inference IP, Nvidia is making sure that the most performance-sensitive workloads — like those running small models or powering real-time agents — can be accommodated within Nvidia’s CUDA ecosystem, even as big customers like Anthropic make it easier to jump ship to Google’s Ironwood TPUs. (CUDA is the software toolkit Nvidia provides so developers can program its GPUs.)

4. The agentic ‘statehood’ war: Manus and the KV cache

The timing of the Groq deal coincides with Meta’s acquisition of the agent pioneer Manus just two days ago. The significance of Manus was partly its obsession with statefulness. If an agent can’t remember what it did 10 steps ago, it is useless for real-world tasks like market research or software development.

The KV cache (key-value cache) is the “short-term memory” an LLM builds during the prefill phase. Manus reported that for production-grade agents, the ratio of input tokens to output tokens can reach 100:1. This means that for every word an agent says, it is “thinking” and “remembering” 100 others. In this environment, the KV cache hit rate is the single most important metric for a production agent, Manus said. If that cache is evicted from memory, the agent loses its train of thought and the model must burn enormous amounts of compute re-processing the prompt. Groq’s SRAM can serve as a scratchpad for these agents — although, again, mostly for smaller models — because it allows near-instant retrieval of that state. Combined with its Dynamo framework and the KV Block Manager (KVBM), Nvidia is building an “inference operating system” that can tier this state across SRAM, DRAM, and flash-based offerings like those from Bercovici’s Weka.

Thomas Jorgensen, senior director of Technology Enablement at Supermicro, which specializes in building GPU clusters for large enterprises, told me in September that compute is no longer the primary bottleneck for advanced clusters; feeding data to the GPUs is, and breaking that bottleneck requires memory. “The whole cluster is now the computer,” Jorgensen said. “Networking becomes an internal part of the beast … feeding the beast with data is becoming harder because the bandwidth between GPUs is growing faster than anything else.”

This is why Nvidia is pushing into disaggregated inference. By separating the workloads, enterprise applications can lean on specialized storage tiers to feed data at memory-class performance while the “Groq-inside” silicon handles high-speed token generation.

The verdict for 2026

We are entering an era of extreme specialization. For decades, incumbents could win by shipping one dominant general-purpose architecture — and their blind spot was often what they ignored on the edges. Intel’s long neglect of low-power computing is the classic example, Stewart told me. Nvidia is signaling it won’t repeat that mistake. “If even the leader, even the lion of the jungle will acquire talent, will acquire technology — it’s a sign that the whole market is just wanting more options,” Stewart said.

For technical leaders, the message is to stop architecting your stack like it’s one rack, one accelerator, one answer. In 2026, advantage will go to the teams that label workloads explicitly — and route them to the right tier:

- Prefill-heavy vs. decode-heavy
- Long-context vs. short-context
- Interactive vs. batch
- Small-model vs. large-model
- Edge constraints vs. data-center assumptions

Your architecture will follow those labels; the sketch below shows what routing on them can look like.
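This is a simplified illustration only: the label fields, pool names, and thresholds below are invented for the example rather than taken from any vendor SDK, and a real router would also fold in live signals such as the KV cache hit rate discussed above.

```python
# Hypothetical sketch: routing labeled inference workloads to serving tiers.
# Field names, pool names, and thresholds are invented for illustration.

from dataclasses import dataclass

@dataclass
class WorkloadLabel:
    prompt_tokens: int      # expected input size (prefill side)
    output_tokens: int      # expected generation length (decode side)
    interactive: bool       # user-facing vs. batch
    model_params_b: float   # model size in billions of parameters
    edge: bool              # must it run outside the data center?

def route(w: WorkloadLabel) -> str:
    """Pick a serving tier from the workload's labels, not from the hardware you own."""
    if w.edge or w.model_params_b <= 8:
        # Small-model, latency-sensitive work: SRAM-style low-latency decode engines.
        return "low-latency-decode-pool"
    if w.prompt_tokens >= 100_000:
        # Long-context ingestion: prefill-optimized hardware with larger, cheaper memory.
        return "prefill-pool"
    if w.prompt_tokens >= 100 * max(w.output_tokens, 1):
        # Agent-style ~100:1 input/output ratios: keep the KV cache pinned and hot.
        return "kv-cache-pinned-pool"
    if not w.interactive:
        return "batch-gpu-pool"
    return "general-gpu-pool"

# Examples: a long-context research agent vs. an on-device voice assistant.
print(route(WorkloadLabel(250_000, 2_000, True, 70, False)))  # -> prefill-pool
print(route(WorkloadLabel(4_000, 200, True, 3, True)))        # -> low-latency-decode-pool
```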
In 2026, “GPU strategy” stops being a purchasing decision and becomes a routing decision. The winners won’t ask which chip they bought — they’ll ask where every token ran, and why.