Latency vs Throughput: The Architecture of "Fast"
"Fast" is the most dangerous word in system design.
When a stakeholder asks for a "fast" API, they rarely realize they may be asking for two different things that often pull in opposite directions.
You cannot blindly optimize a system for speed. You must define what "speed" means for your specific workload before you configure your observability dashboards. If you fail to separate latency from throughput, you will measure the wrong metric.
Here is the technical reality of the trade-off.
1. The Two Definitions of Speed
System performance is usually described by two distinct metrics that compete for the same resources.
Latency: The time it takes to process a single request. This is the speed of execution.
Throughput: The number of requests processed over a given period. This is the volume of execution.
Think of moving people across a city. A Ferrari has incredibly low latency, but terrible throughput (two passengers). A city bus has high latency, but massive throughput (eighty passengers). You cannot build a vehicle that is both a Ferrari and a bus.
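To make the two definitions concrete, here is a minimal sketch that computes both metrics from the same set of request timings. The sample data and the function names (`latencies`, `throughput`) are assumptions made for illustration, not output from a real system.

```python
from statistics import mean

# Hypothetical sample: (start_time_s, end_time_s) for each completed request.
requests = [
    (0.00, 0.10),
    (0.05, 0.25),
    (0.10, 0.20),
    (0.50, 0.90),
]


def latencies(records):
    """Per-request latency: end time minus start time, in seconds."""
    return [end - start for start, end in records]


def throughput(records):
    """Completed requests per second over the observed window."""
    window = max(end for _, end in records) - min(start for start, _ in records)
    return len(records) / window


avg_latency = mean(latencies(requests))
rps = throughput(requests)
print(f"average latency: {avg_latency:.3f} s")
print(f"throughput: {rps:.2f} req/s")
```

The same records feed both numbers, but each answers a different question: latency describes one request, throughput describes the whole window.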
2. The Cost of Context Switching
High throughput often kills low latency. A CPU core runs one thread at a time, so when thousands of concurrent requests outnumber the available cores, the CPU must keep switching between them. This is called context switching.
Every time the CPU pauses one thread to run another, it has to save and restore execution state, and that costs time. If a server accepts too many connections at once to maximize throughput, it spends so much effort juggling that individual request latency degrades significantly.
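The effect can be sketched with a toy model: one core, round-robin time slices, and a fixed cost per switch. This is a simplified illustration, not how a real scheduler behaves, and every number in it (10 ms of work per request, 1 ms slices, 0.1 ms per switch) is an assumption chosen to keep the arithmetic easy to follow.

```python
from collections import deque


def simulate_round_robin(n_requests, work_ms, slice_ms, switch_ms):
    """Toy single-core model; all requests arrive at t=0 and need work_ms of CPU.

    The core runs one request for at most slice_ms, then pays switch_ms
    whenever it moves on to a different request.
    Returns the completion time (ms) of each request, indexed by request id.
    """
    queue = deque((req_id, work_ms) for req_id in range(n_requests))
    clock = 0.0
    finished = {}
    while queue:
        req_id, remaining = queue.popleft()
        run = min(slice_ms, remaining)
        clock += run
        remaining -= run
        if remaining > 0:
            queue.append((req_id, remaining))
        else:
            finished[req_id] = clock
        # Charge the switch cost only when a different request runs next.
        if queue and queue[0][0] != req_id:
            clock += switch_ms
    return [finished[i] for i in range(n_requests)]


def summarize(n_requests, switch_ms):
    """Average latency (ms) and throughput (req/s) for the hypothetical workload."""
    done = simulate_round_robin(n_requests, work_ms=10, slice_ms=1, switch_ms=switch_ms)
    avg_latency_ms = sum(done) / n_requests
    throughput_rps = n_requests / (max(done) / 1000)
    return avg_latency_ms, throughput_rps


for n in (1, 10, 100):
    for cost in (0.0, 0.1):
        latency, rps = summarize(n, cost)
        print(f"requests={n:>3} switch={cost}ms "
              f"avg latency={latency:8.1f} ms throughput={rps:6.1f} req/s")
```

In this model, average latency grows with concurrency even at zero switch cost, simply because requests share the core. The switch cost adds pure overhead on top of that, which is why throughput drops once it is non-zero.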
3. Understanding Little's Law
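In queuing theory, Little's Law relates concurrency, throughput, and latency. For a system in steady state, it states that the average number of requests inside your system equals your average throughput multiplied by your average latency.
If the arrival rate increases but backend capacity stays the same, more requests must be in flight at once. When that number exceeds what the backend can serve concurrently, the excess waits, and the waiting time is added to latency.
Those requests queue up in the kernel backlog or in connection pools.
The system is not "broken"; it is bound by the arithmetic of queuing.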
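As a worked example, Little's Law is usually written L = λW, where L is the average number of requests in the system, λ the average throughput, and W the average latency. The sketch below applies it to a hypothetical backend; the worker count, latency, and arrival rates are assumed values, and the function names are illustrative.

```python
def in_flight(throughput_rps, latency_s):
    """Little's Law: L = lambda * W (average requests inside the system)."""
    return throughput_rps * latency_s


def max_throughput(concurrency_limit, latency_s):
    """Rearranged: lambda = L / W, the ceiling for a fixed concurrency limit."""
    return concurrency_limit / latency_s


# Hypothetical backend: 10 workers, 50 ms average latency per request.
workers = 10
latency_s = 0.05

ceiling_rps = max_throughput(workers, latency_s)
print(f"throughput ceiling: {ceiling_rps:.0f} req/s")

for arrival_rps in (100, 200, 300):
    needed = in_flight(arrival_rps, latency_s)
    status = "fits" if needed <= workers + 1e-9 else "exceeds capacity, requests queue"
    print(f"{arrival_rps} req/s needs {needed:.1f} in flight: {status}")
```

Under these assumptions, the backend cannot sustain more than its ceiling without either adding workers or reducing per-request latency; any arrival rate above it means a growing queue.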
The Architectural Takeaway
Optimization is always a sacrifice. Before you configure an alert in Datadog or Prometheus, you must ask: "Are we trying to monitor the time per request, or the volume of requests?"
If you do not choose one, the runtime will choose for you. And you will not like the result.