
TPU vs GPU: What a Former Google Engineer Taught Me


I recently listened to a two-hour podcast (Silicon Valley 101, Episode 228) featuring Henry, a former Google TPU engineer who spent six years on the team and worked on TPU V7 (Ironwood) and V8.

I've always been curious about the TPU vs GPU debate but never had a clear mental model for it. This conversation changed that. The takeaway isn't "which chip is better." It's that they represent two fundamentally different design philosophies — and understanding the difference explains a lot about where the AI hardware market is heading.

Two Kitchens, Two Philosophies

Henry gave an analogy that stuck with me.

GPU uses SIMT — Single Instruction, Multiple Threads. Imagine a kitchen with a thousand independent chefs. Each one thinks for themselves, grabs ingredients from the fridge, cooks, and serves. Massive parallelism. The downside: each chef sometimes stands idle waiting for ingredients to arrive.

TPU is an assembly line. The first person grabs ingredients, hands them to the second who preps, hands to the third who cooks. Each person does one thing, but nobody waits. The data flows through like blood pumped by a heart — every beat pushes computation forward.
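The assembly line Henry describes is, loosely, what a hardware pipeline (and at heart a systolic array) does. Here is a toy Python sketch of my own, not TPU's actual design, showing why a pipeline beats sequential work: once the stages fill, a finished result emerges every cycle.

```python
# Toy synchronous pipeline: data advances one stage per "heartbeat".
def run_pipeline(items, stages):
    """Push items through a lockstep pipeline; return (results, cycles)."""
    latches = [None] * len(stages)   # one latch (work-in-progress slot) per stage
    results = []
    cycles = 0
    pending = list(items)
    while pending or any(v is not None for v in latches):
        # One clock tick: the last stage emits, everything shifts forward.
        out = latches[-1]
        if out is not None:
            results.append(out)
        for i in range(len(stages) - 1, 0, -1):
            latches[i] = stages[i](latches[i - 1]) if latches[i - 1] is not None else None
        latches[0] = stages[0](pending.pop(0)) if pending else None
        cycles += 1
    return results, cycles

# Three stages standing in for "grab ingredients / prep / cook".
stages = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3]
results, cycles = run_pipeline(range(8), stages)
print(results)  # → [-1, 1, 3, 5, 7, 9, 11, 13]
print(cycles)   # → 11
```

Eight items through three stages take about 11 cycles instead of 24, because after the short fill phase all stages work every cycle; nobody waits.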

The key insight: GPU hardware is smart. TPU hardware is dumb — but TPU software is very smart.

Google's XLA compiler acts as an omniscient scheduler. It sees the entire computation graph ahead of time and plans exactly what every processing unit should do at every clock cycle. No runtime prediction needed. No idle periods. The hardware just executes mechanically while the software handles all the complexity.

This is the opposite of Nvidia's approach, where intelligent hardware (branch prediction, dynamic scheduling) compensates for less prescriptive software.

The ASIC Bet: High Reward, High Risk

TPU is fundamentally an ASIC — an Application-Specific Integrated Circuit. It's built for matrix multiplication, optimized for Transformer workloads.

When your workload is known and stable, this is an enormous advantage. Henry described how XLA performs operator fusion (combining multiple operations into one to avoid memory round-trips), global memory management, and system-level optimization across an entire TPU Pod of thousands of chips. The result: near-peak utilization.
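To make operator fusion concrete, here is a hypothetical sketch (the names and the list-based "tensors" are mine, not XLA's): the unfused version materializes two intermediate arrays, costing a memory round-trip each, while the fused version streams through the data once.

```python
# Unfused: each op reads a full array from memory and writes one back.
def unfused(a, b, c):
    t1 = [x * y for x, y in zip(a, b)]         # pass 1: write intermediate t1
    t2 = [x + y for x, y in zip(t1, c)]        # pass 2: read t1, write t2
    return [max(x, 0.0) for x in t2]           # pass 3: read t2, write output

# Fused multiply-add-ReLU: one pass, no intermediates touch memory.
def fused(a, b, c):
    return [max(x * y + z, 0.0) for x, y, z in zip(a, b, c)]

a, b, c = [1.0, -2.0], [3.0, 4.0], [0.5, -1.0]
print(fused(a, b, c))  # → [3.5, 0.0], identical to unfused(a, b, c)
```

Same math, a third of the memory traffic, which is exactly the resource matrix-heavy workloads are starved for.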

But ASICs carry an inherent risk: you have to bet on the right architecture.

During the V4-V5 era, TPU's primary workload was Google's internal recommendation and ranking systems — sparse matrix operations. When ChatGPT triggered the LLM explosion, the demand shifted to dense matrix computation. TPU's paper specs fell behind GPU for a period.

It wasn't until V6-V7 that the team pivoted fully to LLM training workloads and closed the gap. Henry noted that Ironwood (V7) is now roughly on par with Nvidia's Blackwell generation in raw specs.

Here's the interesting part: Google could afford this bet because Transformer was invented at Google. The TPU team had insider knowledge of what the dominant workload would look like before anyone else. That's a structural advantage that's hard to replicate.

But Henry was candid about the risk. A chip takes 2-3 years from design to production. Model architectures iterate every 6 months. If a fundamentally new paradigm replaces Transformer, TPU could be caught flat-footed while GPU adapts with software updates.

System-Level Design: The Real Differentiator

The single biggest insight from this podcast: TPU's competitive advantage isn't the chip. It's the system.

Nvidia sells individual GPUs. To build a training cluster, you buy cards plus NVLink, NVSwitch, and other networking infrastructure. That infrastructure is expensive — Henry called it an "infrastructure tax."

TPU was designed as a cluster from day one. The TPU Pod architecture connects chips directly via copper interconnects, using optical switches only at key junctions. The 3D Torus topology means any chip can communicate with any other chip through software-configurable routing.
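The topology is easier to picture with code. A minimal sketch, assuming a plain 3D torus addressing scheme (coordinates and pod dimensions are illustrative, not Google's actual layout): every chip has exactly six directly wired neighbors, and the edges wrap around.

```python
def torus_neighbors(coord, dims):
    """Six neighbors of a chip in a 3D torus: ±1 per axis, with wraparound."""
    x, y, z = coord
    dx, dy, dz = dims
    return [
        ((x + 1) % dx, y, z), ((x - 1) % dx, y, z),
        (x, (y + 1) % dy, z), (x, (y - 1) % dy, z),
        (x, y, (z + 1) % dz), (x, y, (z - 1) % dz),
    ]

# A chip on the face of a 4x4x4 pod still has six neighbors,
# because the torus wraps: x = 3 + 1 lands back at x = 0.
print(torus_neighbors((3, 0, 0), (4, 4, 4)))
# → [(0, 0, 0), (2, 0, 0), (3, 1, 0), (3, 3, 0), (3, 0, 1), (3, 0, 3)]
```

The wraparound is what keeps hop counts low without a central switch: traffic between distant chips is routed hop-by-hop over these fixed copper links, under software control.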

Two concrete benefits:

  • Lower networking cost. No expensive switches for most of the interconnect fabric.
  • Higher training efficiency. System-level optimization rather than single-card optimization.

Henry's claim: for training a Gemini-class model, TPU's total cost of ownership (TCO) is better than GPU — provided your software stack can fully exploit TPU's architecture.

That "provided" is doing a lot of heavy lifting.

The CUDA Moat Is Real

This might be the most important section of this post.

CUDA's ecosystem is enormous. Every AI researcher defaults to PyTorch + CUDA. Thousands of operators, libraries, and toolchains are built around it. It's the lingua franca of ML engineering.

TPU requires JAX + XLA. XLA is a static compiler — great for global optimization, terrible for debugging. Henry was blunt: external developers cannot independently fix XLA bugs. They need Google's engineers. This is a fundamentally different model from CUDA's open ecosystem where the community self-serves.

The companies that use TPU well all share one thing: they have engineers who came from Google and understand the JAX + XLA stack deeply.

  • Anthropic: Founded by ex-Google researchers. Their engineering team can work directly with TPU at a deep level. They're the only external customer that buys TPU racks directly from Broadcom rather than using Google Cloud.
  • Apple: Pang Ruoming, who led parts of Apple Intelligence, came from Google and brought the entire software stack with him.
  • Meta: Still on Google Cloud for TPU access. Their entire stack is PyTorch-native, making deep TPU integration much harder.

And here's the kicker. If you just use TPU through Google Cloud without deep optimization, Henry estimated you might only hit 50-60% utilization. You're paying for 100% of the hardware but getting half the performance. At that point, the TCO advantage over GPU evaporates.
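The arithmetic behind that evaporation is simple. With made-up numbers (I'm assuming a 30% TCO edge for a fully exploited TPU cluster, which the podcast implies but doesn't quantify), effective cost scales inversely with utilization:

```python
# Hypothetical numbers: what you pay per *useful* unit of compute.
def effective_cost(tco, utilization):
    return tco / utilization

gpu       = effective_cost(1.00, 0.90)  # baseline cluster, well-tuned
tpu_tuned = effective_cost(0.70, 0.90)  # assumed 30% TCO edge, fully exploited
tpu_naive = effective_cost(0.70, 0.55)  # same hardware at ~55% utilization

print(f"{gpu:.2f} {tpu_tuned:.2f} {tpu_naive:.2f}")
# tuned TPU wins (0.78 vs 1.11); naive TPU (1.27) is worse than the GPU baseline
```

Under these assumptions, dropping from 90% to 55% utilization doesn't just shrink the TPU advantage, it flips the sign.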

TPU's value proposition requires human capital that's scarce and concentrated. That's not a scalable moat.

The Supply Chain Bottleneck

Even if you solve the software problem, you hit the supply chain wall.

HBM (High Bandwidth Memory) is controlled by three companies: SK Hynix, Samsung, and Micron. Nvidia is the largest customer by far. Capacity is locked 1-2 years in advance. Google has always been a secondary customer, which makes it hard to secure the latest HBM at scale for TPU.

CoWoS (TSMC's advanced packaging) is another bottleneck. Google can't do it. Broadcom can't do it. Only TSMC can. And TSMC allocates capacity based on order volume — which means Nvidia gets priority.

Yield is a unique problem for TPU. Because TPU is designed for system-level coherence, every chip in a pod needs to perform at roughly the same level. GPU can absorb yield issues by selling lower-binned chips as cut-down SKUs — the classic pipeline of turning a partially defective die into a cheaper product. TPU can't — a defective chip is scrap. That directly constrains production volume.
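A toy model with made-up yield numbers shows why binning matters economically: the salvageable middle tier is pure revenue for a vendor that can bin, and pure scrap for one that can't.

```python
# Illustrative yield model (numbers are invented, not real fab data).
def sellable_fraction(full_yield, partial_yield, can_bin):
    """full_yield: dies that are fully good.
    partial_yield: dies with defects but salvageable as a cut-down SKU."""
    return full_yield + (partial_yield if can_bin else 0.0)

gpu = sellable_fraction(0.60, 0.25, can_bin=True)   # 85% of dies earn revenue
tpu = sellable_fraction(0.60, 0.25, can_bin=False)  # only 60% do; the rest is scrap
print(gpu, tpu)
```

At identical wafer yields, the binning vendor ships ~40% more sellable silicon per wafer in this sketch — a structural cost advantage before architecture even enters the picture.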

Three bottlenecks, all outside Google's control.

Groq: The Compiler-First Alternative

The podcast also covered Groq, now acquired by Nvidia. Its founder Jonathan Ross came from Google's TPU compiler team.

Henry's framing was sharp: Groq is a compiler company, not a chip company. Its hardware is even simpler than TPU — the compiler controls everything down to individual clock cycles. No runtime decisions at all.

Groq caught three waves: the inference market explosion, the ASIC trend, and the agent era. Agent workloads are latency-obsessed — every step in an agent's chain adds latency, and Groq's architecture delivers extremely low per-token latency for single users.

But Groq trades throughput for latency. It's ideal for real-time voice, high-frequency trading, and agent chains where a single user needs sub-millisecond response. It's not designed for the high-throughput, batched inference that Google and Nvidia excel at.
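That tradeoff falls out of simple arithmetic. A sketch with illustrative numbers (a fixed per-step overhead plus a per-item cost — not any vendor's real figures): batching amortizes the fixed cost, raising throughput, but every request in the batch waits for the whole step.

```python
# Toy inference-step model: fixed overhead + cost per batched request.
def step_time_ms(batch, fixed_ms=10.0, per_item_ms=1.0):
    return fixed_ms + per_item_ms * batch

for batch in (1, 8, 64):
    t = step_time_ms(batch)
    throughput = batch / t * 1000  # requests per second
    print(f"batch={batch:3d}  latency={t:5.1f} ms  throughput={throughput:7.1f}/s")
# larger batches: much better throughput, steadily worse per-request latency
```

Groq's design sits at the batch-of-one end of this curve; Google's and Nvidia's high-throughput serving stacks sit at the other.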

This points to a broader trend: the AI chip market is stratifying. Large-scale training and high-throughput inference will remain GPU + TPU territory. Low-latency inference for agents and real-time applications goes to specialized players. Edge deployment is yet another layer.

What This Means for Builders

Here's my practical takeaway from all of this.

The GPU monopoly is cracking, but it's not breaking. TPU is a credible alternative in specific conditions:

  • Large-scale deployment with known, stable workloads
  • Teams with deep JAX/XLA expertise (usually ex-Google engineers)
  • Willingness to accept a less flexible, more opaque software ecosystem

For most teams building AI products today, CUDA + PyTorch remains the safe bet. The ecosystem depth, debuggability, and talent pool are unmatched.

But the cost implications are real. Anthropic's API prices dropped 67%, partly attributed to TPU-based inference. Google Cloud's API pricing has been consistently 1/10th of competitors. If you're a consumer of AI APIs rather than a trainer of models, the TPU vs GPU debate affects you indirectly through pricing.

The future of AI infrastructure is not a single winner. It's a layered market:

  • Training frontier models: GPU and TPU, depending on team capability and scale
  • High-throughput inference: GPU and TPU, with TPU having a cost edge at massive scale
  • Low-latency inference (agents, real-time): Groq and similar specialized architectures
  • Edge/local deployment: A whole different category

My recommendation for fellow builders: don't bet your architecture on any single hardware platform's assumptions. The companies that thrive will be the ones with hardware-agnostic software layers that can shift between platforms as economics and capabilities evolve.


Source: Silicon Valley 101, Episode 228 — featuring host Hong Jun and guest Henry, former Google TPU engineer who worked on V7 and V8.


© Xingfan Xia 2024 - 2026 · CC BY-NC 4.0