Latency vs Throughput: The Architecture of Fast

"Fast" is the most dangerous word in system design.

When a stakeholder asks for a "fast" API, they rarely realize the word covers two different goals that often pull against each other: low latency and high throughput.

You cannot blindly optimize a system for speed. You must define what "speed" means for your specific workload before you configure your observability dashboards. If you fail to separate latency from throughput, you will measure the wrong metric.

Here is the technical reality of the trade-off.

1. The Two Definitions of Speed

System performance is divided into two distinct metrics that constantly fight for resources.

Latency: The time it takes to process a single request. This is the speed of execution.
Throughput: The number of requests processed over a given period. This is the volume of execution.
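As a rough sketch of how the two definitions differ in practice, both numbers can be computed from the same request log. The RequestRecord type and the sample timings below are hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class RequestRecord:
    start_s: float  # when processing of the request began, in seconds
    end_s: float    # when the response was sent, in seconds


def mean_latency_s(records):
    """Average time spent processing a single request."""
    return sum(r.end_s - r.start_s for r in records) / len(records)


def throughput_rps(records):
    """Requests completed per second over the observed window."""
    window_s = max(r.end_s for r in records) - min(r.start_s for r in records)
    return len(records) / window_s


# Hypothetical sample: four requests that each take 0.5 s, served two at a time.
records = [
    RequestRecord(0.0, 0.5),
    RequestRecord(0.0, 0.5),
    RequestRecord(0.5, 1.0),
    RequestRecord(0.5, 1.0),
]

print(mean_latency_s(records))  # 0.5
print(throughput_rps(records))  # 4.0
```

Serving the same four requests one at a time, at 0.5 s each, would leave the latency figure unchanged while halving the throughput figure. The two numbers move independently.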

Think of moving people across a city. A Ferrari has incredibly low latency, but terrible throughput (two passengers). A city bus has high latency, but massive throughput (eighty passengers). You cannot build a vehicle that is both a Ferrari and a bus.

2. The Cost of Context Switching

High throughput often kills low latency. To handle thousands of concurrent requests, a CPU must constantly switch attention between them. This is called context switching.

Every time the CPU pauses one thread to run another, it spends time saving and restoring state instead of doing useful work. If a server accepts too many connections at once to maximize throughput, it spends so much time juggling that individual request latency degrades significantly.
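The effect can be illustrated with a toy model. This is a minimal sketch, assuming a single core, identical requests, a fixed time slice, and a fixed cost per switch. Every number is hypothetical, and real switch costs vary with cache and scheduler behavior.

```python
def mean_completion_us(n_requests, work_us, quantum_us, switch_us):
    """Simulate one core serving n identical requests round-robin.

    Each request needs work_us of CPU time. The core runs a request for up
    to quantum_us, then pays switch_us to move on, but only when some other
    request still has work left. Returns the mean finish time in microseconds.
    """
    remaining = [work_us] * n_requests
    finished_at = [0] * n_requests
    clock = 0
    while any(r > 0 for r in remaining):
        for i in range(n_requests):
            if remaining[i] == 0:
                continue
            run = min(quantum_us, remaining[i])
            clock += run
            remaining[i] -= run
            if remaining[i] == 0:
                finished_at[i] = clock
            # A switch is only needed when another request is still waiting.
            others_waiting = any(r > 0 for j, r in enumerate(remaining) if j != i)
            if others_waiting:
                clock += switch_us
    return sum(finished_at) / n_requests


# Hypothetical workload: 10 ms of CPU per request, 1 ms time slices.
alone = mean_completion_us(1, 10_000, 1_000, 100)
crowded_free_switch = mean_completion_us(4, 10_000, 1_000, 0)
crowded = mean_completion_us(4, 10_000, 1_000, 100)

print(alone, crowded_free_switch, crowded)
```

In this model a request served alone finishes in 10 ms. Sharing the core four ways pushes the mean finish time far higher even when switching is free, and the per-switch cost adds pure overhead on top of that.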

3. Understanding Little's Law

In queuing theory, Little's Law governs backend concurrency. It states that the average number of requests inside your system equals your average throughput multiplied by your average latency.

If the arrival rate increases but backend capacity stays the same, throughput hits a ceiling, more requests sit inside the system, and latency must go up.
Requests will queue up in the kernel backlog or connection pools.
The system is not "broken"; it is simply bound by the mathematics of queuing.
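In symbols, Little's Law is L = λ × W, where L is the average number of requests in the system, λ is the average throughput, and W is the average latency. A short worked example, with hypothetical numbers and function names chosen only for illustration:

```python
def requests_in_system(throughput_rps, latency_s):
    """Little's Law: L = lambda * W."""
    return throughput_rps * latency_s


def latency_from_load(in_system, throughput_rps):
    """Little's Law rearranged: W = L / lambda."""
    return in_system / throughput_rps


# Hypothetical healthy state: 400 requests/s, each taking 0.25 s on average.
print(requests_in_system(400, 0.25))  # 100.0

# Hypothetical overload: the backend still tops out at 400 requests/s,
# but 300 requests are now inside the system (in flight or queued).
print(latency_from_load(300, 400))    # 0.75
```

With throughput pinned at the backend's ceiling, tripling the number of requests in the system triples the average latency. No tuning of the queue itself changes that relationship.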

The Architectural Takeaway

Optimization is always a sacrifice. Before you configure an alert in Datadog or Prometheus, you must ask: "Are we trying to monitor the time per request, or the volume of requests?"

If you do not choose one, the runtime will choose for you. And you will not like the result.
