Qualcomm's HBC architecture stacks DRAM over XPU chips to rival GPUs in AI infra
Qualcomm unveiled its High-Bandwidth Compute (HBC) architecture at its 2026 Investor Day, stacking multiple layers of DRAM directly on top of its XPU chips to create a unified compute-and-memory module for AI inference. The company claims HBC delivers SRAM-level performance with HBM-like density and capacity, making it more economical than current GPU-based systems. It is set to launch in 2027 as part of Qualcomm's AI250-series Dragonfly rack systems.
Full text
Qualcomm is finally getting serious about AI infrastructure, but its push into the datacenter hinges on the success of an ambitious near-memory compute architecture designed to deliver better inference economics than today's GPUs.
Announced during its 2026 investor day last week, the tech will see Qualcomm stack layer upon layer of DRAM on top of its XPUs to form a single unified compute and memory module it's calling high-bandwidth compute (HBC).
“We offer all of the performance advantages of SRAM, but with the density and the memory capacity that HBM (high-bandwidth memory) stacks offer,” Tony Pialis, Qualcomm’s EVP of datacenter, claimed during last week's investor presentation.
This technology is set to launch next year as part of Qualcomm’s AI250-series of Dragonfly rack systems, and marks a distinct shift in Qualcomm’s AI infrastructure strategy.
The handset giant is no stranger to AI accelerators. Essentially every Snapdragon processor sold today ships with an NPU on board. But in the datacenter, the company has struggled to garner the same excitement as Nvidia, AMD, and even startups like Cerebras.
Against the big two’s GPUs, Qualcomm’s AI-series accelerators haven’t compared that favorably, but that could soon change as the company looks to make its mark on the datacenter.
With the AI250, the SoC maker is claiming 768 GB of memory capacity and up to 133 TB/s of effective memory bandwidth per card. For reference, Nvidia's Groq 3 LPUs offer just 500 MB of SRAM and 150 TB/s of bandwidth.
If that seems too good to be true, that’s because it is. Qualcomm is leaning heavily on the word “effective.”
We know that because for the AI200-based Dragonfly systems rolling out this year, it claimed 414 TB/s of “effective” memory bandwidth across all 56 chips. On its face, that seems more realistic, but it works out to roughly 7.4 TB/s per chip, and actually achieving that with 8800 MT/s LPDDR5x alone would require a 6,720-bit-wide bus on each chip, which it almost certainly does not possess.
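The bus-width figure follows from simple arithmetic. Here is a minimal sketch, assuming the 414 TB/s is split evenly across the 56 chips, that "TB" means 10^12 bytes, and that each data pin moves one bit per transfer. The function name is ours for illustration, not anything Qualcomm has published.

```python
def required_bus_width_bits(total_tb_per_s: float, chips: int,
                            mega_transfers_per_s: float) -> float:
    """Bus width in bits that each chip needs to reach its share of the total."""
    # Assumption: the aggregate figure is divided evenly between chips.
    per_chip_bytes_per_s = total_tb_per_s * 1e12 / chips
    # One pin carries one bit per transfer, so bytes/s per pin = MT/s * 1e6 / 8.
    per_pin_bytes_per_s = mega_transfers_per_s * 1e6 / 8
    return per_chip_bytes_per_s / per_pin_bytes_per_s


per_chip_tb_per_s = 414 / 56
width = required_bus_width_bits(414, 56, 8800)
print(f"{per_chip_tb_per_s:.2f} TB/s per chip needs a bus about {width:,.0f} bits wide")
```

At 8800 MT/s each pin delivers 1.1 GB/s, so the per-chip share of about 7.4 TB/s translates into a bus several thousand bits wide, in line with the roughly 6,720-bit figure cited.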
Qualcomm insists that the 414 TB/s figure is the "pure physical bandwidth of the LPDDR interface," but declined to offer any specifics as to how it's somehow managed to achieve what Nvidia needed eight HBM3e stacks to do.
In any case, according to Qualcomm’s marketing materials, with the move to HBC, the AI250 will offer 18x the effective bandwidth of the AI200, while the forthcoming AI300 will deliver 54x the bandwidth. Given the context, these seem like outlandish claims, but these "effective" multipliers are really a feature of Qualcomm's HBC architecture.
Unpacking high-bandwidth compute
Amplifying “effective” bandwidth isn’t the only party trick from these HBC-based accelerators. Qualcomm claims that by moving some of the XPU’s compute under the DRAM, it can significantly reduce the amount of power its chips consume.
On a conventional datacenter GPU, data is rapidly shuffled between HBM and the compute dies. Even using advanced packaging technologies like TSMC’s CoWoS, the power required to move this data back and forth is significant.
By stacking the DRAM directly on top of some of the logic and connecting them using through-silicon vias (TSVs), the path from compute to memory is shortened considerably.
"Imagine working in the same building that you live in so you only travel up and down," Pialis said. "What does that mean for the highways and the roads that connect the suburbs to the city? Guess what? The roads are clear. The value this brings to the industry is lower power consumption, less heat, and that expensive road of silicon interposer that HBM solutions use is no longer needed."
Performing bandwidth-bound operations on the base die also has the benefit of reducing the amount of data that needs to be shuttled between the HBC and the SoC. In effect, memory bandwidth is amplified. This is why Qualcomm is using “effective bandwidth” so liberally.
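The 18x and 54x multipliers can be sanity-checked against the figures Qualcomm has already quoted. This is a rough sketch, assuming the AI200's 414 TB/s is divided evenly over 56 chips and that one AI250 card corresponds to one chip. Neither assumption is confirmed by Qualcomm, and the AI300 number is only what the multiplier implies, not a disclosed spec.

```python
AI200_RACK_TB_PER_S = 414.0  # "effective" bandwidth across the rack, per Qualcomm
AI200_CHIPS = 56             # chips in an AI200-based Dragonfly system


def scaled_bandwidth(multiplier: float) -> float:
    """Per-chip AI200 bandwidth scaled by a claimed generational multiplier, in TB/s."""
    return AI200_RACK_TB_PER_S / AI200_CHIPS * multiplier


ai250 = scaled_bandwidth(18)  # compare with the claimed 133 TB/s per card
ai300 = scaled_bandwidth(54)  # implied by the multiplier only, not a published figure
print(f"AI250: {ai250:.0f} TB/s, AI300 (implied): {ai300:.0f} TB/s")
```

Under those assumptions the 18x result lands on the 133 TB/s Qualcomm quotes for the AI250, which is at least consistent with the per-card number being an "effective" figure derived from the AI200 baseline.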
Compared to doing bandwidth-bound work on a conventional GPU or XPU with distinct HBM and compute dies, the effective bandwidth would be significantly higher, and the design also achieves better density than SRAM-only designs, like Nvidia’s LPUs or Cerebras’ dinner-plate-sized accelerators.
With that said, Qualcomm probably won’t be running its entire AI software stack on HBC. Higher memory bandwidth primarily benefits decode, when the entirety of the model’s active weights are streamed autoregressively from memory one token after another.
Decode isn’t particularly compute-intensive. As such, doing decode partially or entirely in HBC starts to make a lot of sense because it also avoids the thermal constraints associated with burying the compute under multiple layers of DRAM.
Qualcomm tells us that the AI250 can be used as a standalone AI accelerator, but notes it is heavily optimized around addressing bandwidth bottlenecks. So, in addition to being a dedicated inference chip, it can be used in disaggregated inference architectures that use GPUs or other Qualcomm parts for prompt processing and the AI250 to speed up memory-intensive decode operations.
Peak FLOPS are notably missing from Qualcomm’s AI250 disclosures; the company declined to share specifics upon our request.
Is HBC actually a competitive advantage?
While Qualcomm is early among chip designers to make a fuss about near-memory compute or HBC, it’s not the first, nor is the technology beyond the means of Nvidia or AMD.
In fact, both Nvidia and AMD are rumored to be working with HBM suppliers and TSMC to develop custom base dies to boost the performance of their next-gen chips, though it's still not clear how much, if any, compute has been integrated into them.
Qualcomm tells us its HBC "uses LPDDR memory in a purpose-built near-memory computing architecture that combines compute and highly-accelerated memory bandwidth within a 3D-stacked silicon design.
While both HBC and HBM use stacked-memory concepts, HBC is a distinct architecture designed to address AI’s data-movement bottleneck by bringing compute and memory closer together, increasing memory bandwidth efficiency and improving energy efficiency for AI inference workloads. HBM has more stacks of DRAM, uses 2.5D interposer to route more wires, and does not do computing in the base logic die."
AI chip startup d-Matrix is also developing accelerators that will use 3D stacked DRAM to extend their in-memory compute capabilities.
The underlying technology described by Pialis may not be as unique as Qualcomm would like investors to believe, but it shows the company hasn’t missed the boat. However, Qualcomm’s ability to work with Nvidia and AMD may end up doing more to sell customers on its tech than anything else. As we previously wrote, in a disaggregated AI world, Nvidia can be both a friend and an enemy.
Qualcomm finds its Mojo
In addition to teasing its upcoming AI250 and AI300 accelerators, Qualcomm’s investor day also coincided with the acquisition of AI software startup Modular.
Modular was founded by Tim Davis and Chris Lattner, the latter of whom you may recognize as the creator of LLVM, Clang, the Swift programming language, and the multi-level intermediate representation (MLIR) compiler infrastructure.
At Modular, Lattner and crew developed Mojo, a low-level programming interface for GPUs, which offers a high-performance alternative to Nvidia’s CUDA or AMD’s HIP and ROCm stacks. The big idea is that users should be able to write highly performant AI apps that’ll run regardless of the underlying hardware.
For Qualcomm, Mojo presents an opportunity to sidestep the CUDA moat, which has dogged AMD for so long. With Mojo, Qualcomm’s customers won’t need to choose one platform; they can develop their apps and run them on whatever compute is handy at the time. It’s not all or nothing either.
Modular should help to support heterogeneous deployments similar to what Nvidia is doing with Groq’s LPU tech, where GPUs might be used for prefill and AI250s are used for decode in whatever ratio makes the most sense for that specific application.
However, the acquisition doesn’t just buy Qualcomm a vendor-neutral programming model. The folks buying these systems are primarily concerned with one AI workload in particular: LLM serving. For this, Modular developed a serving platform called Max.
Max is a bit like SGLang or vLLM in that it’ll run interchangeably on AMD or Nvidia hardware, but because it’s built atop Mojo, it shouldn’t, at least in theory, require nearly as much hand tuning.
The offering should help Qualcomm compete in a landscape where software has become even more important than the hardware it runs on, provided it manages to close the acquisition this year without regulators stepping in.
In any case, we won’t have to wait much longer to see HBC in action. After launching its AI200-series racks later this year, Qualcomm plans to push its first-gen HBC-based AI250 out the door beginning in 2027, while its second-gen HBC platform is slated for 2028.
While you wait, why not read up on Qualcomm’s new datacenter CPU, which we explored in more detail last week. ®