1. Latency Is the New Intelligence
Since the rise of LLM serving, the bottlenecks of inference systems have shifted several times.
The earliest challenge was capacity. Models were too large, GPU memory was limited, and the primary systems question was whether models could run at all. Over time, the bottleneck shifted toward throughput. Systems optimized for larger batches, deeper queues, continuous batching, and higher GPU utilization.
These optimizations worked because systems had enough room to accumulate requests and amortize fixed overheads such as kernel launches, synchronization, runtime scheduling, and memory movement.
But real-time AI interaction is steadily shrinking that space.
[Figure labels: ChatGPT; Cursor · Claude Code · Codex; factories · agents · finance · vehicles]
Agents, voice interaction, code completion, tool calling, and Test-Time Scaling increasingly push inference into latency-first workloads, often at or near batch size 1 (BS=1). Users are no longer waiting for aggregate throughput; they are waiting for the next token, the next tool invocation, or the next reasoning step.
Low-latency inference is therefore not simply a smaller version of high-throughput inference.
As batch sizes shrink, overheads previously hidden by batching re-enter the latency-critical path, fundamentally changing the system’s optimal execution strategy.
At the same time, Test-Time Scaling is making system speed directly affect model capability itself. Under a fixed latency budget, faster inference enables more rollouts, deeper reasoning paths, and more self-verification.
Historically, system performance primarily affected cost. Increasingly, however, it also determines how much reasoning a model can complete within bounded time.
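The link between speed and reasoning budget is plain arithmetic. The sketch below is illustrative only: the 10-second budget, the 500-token rollout length, and the two decode speeds are made-up numbers, not measurements from any system.

```python
def reasoning_budget(latency_budget_s, tokens_per_s, tokens_per_rollout):
    """Return (total tokens, complete rollouts) that fit in a latency budget."""
    total_tokens = int(latency_budget_s * tokens_per_s)
    return total_tokens, total_tokens // tokens_per_rollout


# Hypothetical numbers: a 10 s budget and 500-token rollouts.
for speed in (50, 500):
    tokens, rollouts = reasoning_budget(10.0, speed, 500)
    print(f"{speed:>4} tokens/s -> {tokens} tokens, {rollouts} rollouts")
```

Under these assumed numbers, a tenfold faster decoder fits ten rollouts into the window where the slower one fits a single rollout.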
Modern AI systems are therefore starting to behave less like offline compute clusters and more like real-time systems.
This is precisely the problem TileRT was designed to address.
Today, TileRT is powering a model vendor’s launch for the first time: the GLM-5.1 native ultra-fast inference service on the MaaS platform. Moving from an experimental prototype to serving real production traffic is another major step for TileRT, and another occasion to rethink the execution model of large-scale LLM inference itself.
2. The Gap Between Hardware Limits and Real-World Inference
Today, an 8×H200 NVL server provides nearly 38 TB/s of aggregate memory bandwidth.
For GLM-5.1, the activated parameter footprint per decode token is only around 42 GB. From a purely theoretical bandwidth perspective (38 TB/s ÷ 42 GB ≈ 900 tokens/s), decode throughput could approach 1000 tokens/s even without Multi-Token Prediction (MTP) enabled.
Yet real systems often deliver only a few dozen tokens/s.
This is not a 10% optimization gap. It is an order-of-magnitude gap.
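The bandwidth bound can be reproduced as a back-of-the-envelope calculation. The sketch assumes that each decode token streams its activated weights from memory exactly once and that nothing else costs time. The 38 TB/s and 42 GB figures come from the text; the 40 tokens/s "observed" value is a hypothetical stand-in for "a few dozen".

```python
def bandwidth_bound_decode(aggregate_bw_tb_s, active_gb_per_token):
    """Upper bound on decode rate if every token must stream its activated
    weights from memory exactly once and nothing else costs time."""
    bytes_per_s = aggregate_bw_tb_s * 1e12
    bytes_per_token = active_gb_per_token * 1e9
    tokens_per_s = bytes_per_s / bytes_per_token
    ms_per_token = 1e3 / tokens_per_s
    return tokens_per_s, ms_per_token


tokens_per_s, ms_per_token = bandwidth_bound_decode(38.0, 42.0)
print(f"bound: {tokens_per_s:.0f} tokens/s, {ms_per_token:.2f} ms/token")

observed = 40.0  # hypothetical stand-in for "a few dozen tokens/s"
print(f"gap vs. bound: {tokens_per_s / observed:.0f}x")
```

The bound works out to roughly 900 tokens/s, or about 1.1 ms per token, so every microsecond of non-overlapped overhead per layer is a visible share of the budget.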
Initially, we assumed the problem was simply insufficient kernel performance. But profiler traces gradually exposed something more counterintuitive: GPU utilization was high, achieved FLOPS were not particularly low, and yet token latency remained poor.
GPUs are not short on compute. Compute is trapped between execution boundaries.
Runtime Begins Entering the Critical Path
Most inference frameworks still follow a classic execution model:
graph → operator → kernel
Models are decomposed into independent operators, each launched separately, synchronized separately, and responsible for its own memory movement.
This abstraction worked extremely well during the training era because kernels were large enough for compute to amortize launch latency, synchronization, and scheduling overhead.
Decode changes the timescale.
As batch sizes approach 1, attention kernels shrink into the tens-of-microseconds regime. Launch gaps, cross-kernel barriers, memory spills, and tensor-parallel (TP) synchronization all begin re-entering the latency-critical path.
Profiler traces increasingly revealed a strange phenomenon: kernels were finishing before they had even fully “warmed up.”
The GPU repeatedly executes:
launch → load → compute → store → synchronize
before immediately starting over again.
Every kernel boundary interrupts dataflow, destroys locality, and forces re-synchronization. In many cases, the real bottleneck is no longer how fast a GEMM executes, but how quickly the next stage can begin.
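A toy cost model shows why fixed boundary costs stop being negligible as kernels shrink. The 5 µs per-boundary cost is an assumed placeholder rather than a measured launch or synchronization latency, and the kernel durations are chosen only to span the training-to-decode range.

```python
def boundary_overhead_fraction(kernel_us, boundary_us):
    """Fraction of wall time spent at kernel boundaries (launch gap, sync)
    when every kernel of length kernel_us pays a fixed boundary_us."""
    return boundary_us / (kernel_us + boundary_us)


BOUNDARY_US = 5.0  # assumed fixed cost per kernel boundary, not a measurement

for kernel_us in (5000.0, 500.0, 50.0, 10.0):
    frac = boundary_overhead_fraction(kernel_us, BOUNDARY_US)
    print(f"kernel {kernel_us:>7.0f} us -> {frac:6.1%} of time at boundaries")
```

With these assumptions, the same fixed cost that is lost in the noise for a millisecond-scale kernel consumes a third of the time for a 10 µs kernel.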
Historically, runtime orchestration was primarily a convenience layer for GPU programming. Under ultra-low-latency inference, however, it increasingly resembles a performance wall itself.
The problem no longer exists only inside kernels. It increasingly exists between kernels.
That realization became the starting point for TileRT.
3. TileRT: Rethinking the Inference Execution Model
One of TileRT’s core observations is that once runtime orchestration enters the latency-critical path, simply “optimizing the runtime harder” is no longer sufficient.
Traditional inference frameworks execute a sequence of short-lived kernels. Execution is repeatedly interrupted by launches, synchronization points, operator boundaries, and memory round-trips.
TileRT explores a different approach: instead of continuously launching kernels, the GPU continuously executes a persistent pipeline.
From Runtime Scheduling to Persistent Execution
TileRT statically expands the model into a persistent Engine Kernel ahead of time (AOT), at compile time.
Throughout the decode lifecycle:
- the host launches only once,
- execution remains resident on the GPU,
- and much of runtime orchestration moves into compile time.
Traditional systems resemble:
graph → operator → kernel
TileRT instead reorganizes execution into a continuously advancing tile pipeline.
Tiles are not merely finer-grained work partitions. They become the scheduling abstraction itself: compute, communication, and asynchronous IO are all decomposed into tile-level tasks that continuously progress inside the GPU.
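The contrast between per-operator launches and a statically expanded schedule can be mimicked on the host in a few lines. This is a conceptual sketch, not TileRT's implementation: `TileTask`, `expand_ahead_of_time`, and `run_persistent` are hypothetical names, and an ordinary Python loop stands in for the GPU-resident loop.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TileTask:
    layer: int
    op: str    # e.g. "load", "gemm", "reduce" (illustrative labels)
    tile: int


def expand_ahead_of_time(num_layers, ops, tiles_per_op):
    """'Compile time': flatten the whole model into one static task list."""
    return [TileTask(layer, op, tile)
            for layer in range(num_layers)
            for op in ops
            for tile in range(tiles_per_op)]


def launches_operator_by_operator(num_layers, ops):
    """Baseline: the host launches one kernel per operator per layer."""
    return num_layers * len(ops)


def run_persistent(schedule, execute):
    """Persistent style: one host launch, then a resident loop walks the
    precomputed schedule without returning control to the host."""
    host_launches = 1
    for task in schedule:
        execute(task)
    return host_launches


OPS = ("load", "gemm", "reduce")
schedule = expand_ahead_of_time(num_layers=4, ops=OPS, tiles_per_op=8)
executed = []
print("per-operator launches:", launches_operator_by_operator(4, OPS))
print("persistent launches:  ", run_persistent(schedule, executed.append))
print("tile tasks executed:  ", len(executed))
```

The point of the sketch is where the ordering decision lives: in the baseline it is made by the host at run time, once per operator, while in the persistent variant it is fixed in `schedule` before execution starts.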
Warp Specialization and Tile Pipelines
To sustain this persistent execution flow, TileRT adopts aggressive Warp / Block Specialization inside the Engine Kernel.
Different warp groups assume different responsibilities:
- asynchronous data movement,
- tensor compute,
- communication overlap.
Previously, many stages executed serially:
load → barrier → compute → barrier
In TileRT, these stages overlap continuously at tile granularity. Intermediate results remain resident in registers, shared memory, and L2 cache instead of repeatedly spilling back to global memory.
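The benefit of overlapping load and compute at tile granularity can be estimated with a small timing model. The sketch assumes one loader and one compute stage with fixed per-tile costs and a bounded number of tile buffers; the function names and the example timings are illustrative, not taken from TileRT.

```python
def serial_time(n_tiles, load, compute):
    """load -> barrier -> compute -> barrier, one tile at a time."""
    return n_tiles * (load + compute)


def pipelined_time(n_tiles, load, compute, buffers=2):
    """Producer/consumer overlap with a bounded number of tile buffers."""
    load_done = [0.0] * n_tiles
    compute_done = [0.0] * n_tiles
    for i in range(n_tiles):
        prev_load = load_done[i - 1] if i > 0 else 0.0
        # The loader needs a free buffer: tile i - buffers must be consumed.
        slot_free = compute_done[i - buffers] if i >= buffers else 0.0
        load_done[i] = max(prev_load, slot_free) + load
        # Compute starts once its tile is loaded and the previous tile is done.
        prev_compute = compute_done[i - 1] if i > 0 else 0.0
        compute_done[i] = max(load_done[i], prev_compute) + compute
    return compute_done[-1]


# Illustrative timings in arbitrary units: 4 tiles, load = 2, compute = 3.
print("serial:   ", serial_time(4, 2, 3))
print("pipelined:", pipelined_time(4, 2, 3))
print("1 buffer: ", pipelined_time(4, 2, 3, buffers=1))
```

With two buffers the loads hide behind compute and the total falls from 20 to 14 units; with a single buffer the model degenerates back to the serial schedule, which is why the stages need somewhere on chip to hand tiles to each other.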
From the profiler’s perspective, the difference becomes obvious: the GPU no longer behaves like a device repeatedly launching kernels, but more like a continuously running execution pipeline.
What TileRT Is Really Trying to Eliminate
As decode latency compresses further, many of the largest sources of latency turn out not to be compute itself, but idle intervals:
- between kernels,
- between communication and compute,
- and between runtime and device execution.
Persistent kernels, tile pipelines, and warp specialization all target the same problem: keeping compute continuously fed.
From Operator Runtime to GPU-Resident Orchestration
As more orchestration logic moves into kernels themselves, inference systems begin taking on a very different form.
Previously:
- scheduling happened inside the host runtime,
- synchronization happened at operator boundaries,
- communication was orchestrated externally.
Now, orchestration increasingly becomes GPU-resident.
The runtime no longer continuously “drives” the GPU. Instead, it initializes and maintains a continuously running execution pipeline.
4. From Warp Specialization to Heterogeneous Workers
Persistent execution solves a large portion of the idle gaps within a single GPU.
But once the system scales to an 8-GPU NVL topology, another limitation begins to emerge: homogeneous parallelism itself.
Most Tensor Parallel (TP) frameworks assume that all GPU ranks execute identical logic synchronously. This abstraction worked naturally for dense training workloads with regular compute patterns and stable synchronization behavior.
Inference changes that assumption.
As sparse routing, Top-K selection, dynamic indexing, long-context attention, and MTP enter the system, more execution stages stop fitting naturally into homogeneous scale-out.
Many of these stages are not compute-heavy themselves, but depend heavily on global information and synchronization. Forcing every rank to execute identical logic introduces redundant computation, excessive broadcasts, and synchronization amplification.
Extending Specialization Beyond the SM
Eventually, we began asking a simple question: if warps can specialize, why can’t GPUs?
TileRT therefore extends specialization beyond the SM itself:
warp specialization → block specialization → GPU specialization
GPUs are no longer treated as fully symmetric execution units. Different devices assume different responsibilities depending on communication cost, compute density, and data dependencies.
In many ways, Heterogeneous Workers are simply Warp Specialization extended to the scale of the entire NVL domain.
Heterogeneous Execution in GLM-5.1
Inside GLM-5.1, the attention layer is split into two types of workers:
- GPU0: Sparse Indexer Worker
- GPU1–7: MLA Workers
The Sparse Indexer handles Top-K selection, sparse index construction, and routing decisions. MLA Workers execute the more compute-intensive stages such as RMSNorm, GEMM, Flash Sparse Attention, and AllReduce.
The key insight is not merely functional decomposition.
Different stages adopt different scale-out strategies. Some stages are dominated by synchronization and global dependencies, making centralized execution more efficient. Others are naturally tensor-parallel friendly and scale much better.
TileRT no longer forces all stages to share the same execution abstraction.
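The GPU0 / GPU1–7 split described above can be written down as a simple role map. The sketch only counts how many ranks execute Top-K selection under each scheme; it does not model the cost of distributing the resulting indices, and `assign_roles` and `topk_executions` are hypothetical helpers, not TileRT APIs.

```python
def assign_roles(num_gpus=8, indexer_rank=0):
    """Map each rank to a role, following the GPU0 / GPU1-7 split above."""
    return {rank: ("sparse_indexer" if rank == indexer_rank else "mla_worker")
            for rank in range(num_gpus)}


def topk_executions(roles, homogeneous):
    """How many ranks run Top-K selection for one attention layer."""
    if homogeneous:
        return len(roles)  # every rank repeats the same selection
    return sum(role == "sparse_indexer" for role in roles.values())


roles = assign_roles()
print(roles)
print("homogeneous Top-K runs:  ", topk_executions(roles, homogeneous=True))
print("heterogeneous Top-K runs:", topk_executions(roles, homogeneous=False))
```

Whether centralizing a stage pays off depends on how its synchronization cost compares with the cost of shipping its output to the other ranks, which this sketch deliberately leaves out.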
Communication Becomes Part of the Pipeline
Traditional inference systems still treat communication as an external stage — broadcasts, reductions, and synchronization are orchestrated separately by NCCL and the runtime.
In TileRT, communication is pushed directly into the execution pipeline itself.
At the host level, an entire attention layer corresponds to only a single Engine Kernel launch. Broadcasts, reductions, and synchronization execute directly inside the tile pipeline.
Execution therefore shifts from:
compute → sync → compute
toward:
compute ↔ communication ↔ compute
as part of a continuously overlapping pipeline.
5. Production-Ready: From Lab Benchmarks to Real Traffic
Extreme benchmarks are often fragile.
The real challenge is not achieving impressive numbers in controlled environments, but sustaining performance near hardware limits under real production traffic.
As TileRT evolved from prototype to production system, we increasingly realized that ultra-low-latency inference is fundamentally about maintaining execution stability under constantly changing conditions.
Many issues invisible in benchmark environments become dramatically amplified in production.
In benchmarks, sequence lengths are stable, routing patterns are idealized, request arrivals are regular, and KV cache lifetimes are short.
Production traffic is very different.
Short and long contexts continuously interleave. KV caches grow, fragment, and migrate over time. Routing behavior fluctuates across requests. Under MTP workloads, accept/reject divergence can dynamically reshape execution itself.
From “Running Fast” to “Continuously Running Fast”
Over the past several months, TileRT has gone through multiple major execution-model refactors.
Many of these changes were not aimed at increasing peak FLOPS, but at preserving execution stability.
For example, in v0.1.1, we further compressed idle intervals and introduced finer-grained overlap pipelines inside the Engine Kernel. These optimizations do not necessarily improve theoretical throughput dramatically, but they significantly improve tail latency.
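The difference between average and tail behavior is easy to see on synthetic data. The two latency traces below are invented for illustration and are not TileRT measurements; `percentile` uses the nearest-rank definition.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: smallest value with at least p% of the
    samples less than or equal to it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Synthetic per-token latencies in ms, invented for illustration only.
steady = [10.0] * 100
spiky = [9.0] * 98 + [40.0, 78.0]

for name, trace in (("steady", steady), ("spiky", spiky)):
    mean = sum(trace) / len(trace)
    print(f"{name:>6}: mean={mean:.1f} ms  "
          f"p50={percentile(trace, 50):.1f} ms  "
          f"p99={percentile(trace, 99):.1f} ms")
```

Both traces have the same mean, and the spiky one even has the better median, yet its p99 is four times worse. That is the kind of gap that stall-removal work targets and that throughput numbers do not show.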
Later, MTP (Multi-Token Prediction) introduced another challenge: execution itself became dynamic. Accept/reject paths continuously reshape the pipeline, while draft/verify stages introduce new synchronization dependencies.
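To see why accept/reject makes execution dynamic, consider a generic draft-and-verify step. The text does not describe TileRT's verification rule, so the sketch assumes plain greedy verification as used in standard speculative decoding: keep the longest drafted prefix that the main model agrees with, then emit one token from the main model. `accept_draft` and the token values are hypothetical.

```python
def accept_draft(draft_tokens, verified_tokens):
    """Greedy verification (an assumed rule, not TileRT's documented one).

    verified_tokens holds the main model's choice at each drafted position
    plus one extra token, so it is one entry longer than draft_tokens.
    """
    accepted = []
    for drafted, verified in zip(draft_tokens, verified_tokens):
        if drafted != verified:
            break
        accepted.append(drafted)
    # Emit the main model's token at the first mismatch, or its extra
    # token if the whole draft was accepted.
    accepted.append(verified_tokens[len(accepted)])
    return accepted


print(accept_draft([5, 7, 9], [5, 7, 9, 4]))  # whole draft accepted
print(accept_draft([5, 7, 9], [5, 7, 2, 4]))  # mismatch at third position
print(accept_draft([5, 7, 9], [1, 7, 9, 4]))  # mismatch at first position
```

The number of tokens emitted per step varies between one and the draft length plus one, so the amount of work that must be committed or discarded, such as KV cache entries for rejected positions, is only known after verification.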
As GLM-5 FP8 execution paths and ultra-long-context workloads entered production, KV fragmentation, memory locality degradation, and communication amplification emerged as new bottlenecks.
The Hardest Problems Are Increasingly Systemic
As performance approaches hardware limits, the hardest problems increasingly stop being GEMM optimization, kernel tuning, or operator fusion.
Very often, the issue is not that computation is too slow. It is that the execution pipeline can no longer remain stable under real workloads.
Today, TileRT officially serves production traffic for GLM-5 and GLM-5.1. But increasingly, benchmarking feels like one of the easier parts of the problem.
The harder challenges are moving toward:
- runtime architecture,
- distributed scheduling,
- memory system behavior,
- and model-system co-design.
Inference systems themselves are gradually evolving from collections of operator optimizations into true AI execution infrastructure.
6. The Next Stage: Co-Design
As performance approaches hardware limits, many bottlenecks can no longer be solved through local optimizations alone.
Historically, most LLM systems optimization happened at the operator and kernel level: faster GEMMs, more aggressive fusion, more efficient overlap, and increasingly sophisticated scheduling policies.
But many bottlenecks no longer exist inside individual operators. Increasingly, they emerge from the execution pipeline itself.
We repeatedly encounter structural boundaries:
- mismatches between memory hierarchy and model structure,
- conflicts between communication topology and routing patterns,
- tensions between KV cache growth and locality,
- and the breakdown of homogeneous execution under dynamic sparsity.
These problems can no longer be solved purely by the runtime.
In many cases, the system itself is already operating near hardware limits, while model architectures continue introducing new forms of fragmentation.
The next stage of performance gains therefore cannot come purely from inference frameworks themselves, but must emerge from deeper forms of co-design.
Models, Compilers, and Hardware Begin Coupling Again
Historically, models, compilers, and hardware formed a mostly one-directional stack:
model → compiler → hardware
Models were designed first. Compilers handled lowering, runtimes orchestrated execution, and kernel engineers then tried to recover efficiency after the fact.
Under ultra-low-latency inference, this process becomes increasingly fragile.
As execution pipelines compress further, locality disruption, synchronization amplification, and communication overhead can all directly enter the latency-critical path.
Model structure itself increasingly shapes execution behavior at the system level.
TileRT Is Only a Beginning
TileRT is not the endpoint of this direction.
It is better viewed as an attempt to stop thinking about inference systems as “a sequence of launched kernels,” and instead think of them as continuously running execution systems.
Historically, Scaling Laws focused on parameter count, dataset size, and training compute.
As inference becomes the core execution layer of AI products, however, speed itself increasingly becomes part of the scaling equation.
Inference speed now directly affects how much reasoning depth and interaction quality a model can achieve within fixed latency budgets.
7. The Speed Is All You Need
If the primary role of GPUs over the past decade was maximizing parallel compute throughput, the next several years may demand something different: sustaining execution pipelines under extremely tight latency budgets.
That shift will force model architectures, compilers, runtimes, and hardware architectures to evolve together.
Inference speed is no longer just a systems metric. It increasingly defines the reasoning budget itself.
Collaborate with Us
The TileRT team is building and exploring the foundations of high-performance AI systems at the limits of modern hardware.
If you are interested in GPU architecture, compiler optimization, distributed execution, or large-scale inference systems, we would love to talk.
- TileRT GitHub (selected modules open-sourced): github.com/tile-ai/TileRT
- Technical discussion & collaboration: tile-ai@outlook.com
The speed is all you need.