The fastest path between two points is a straight line, unless everyone follows the same line.

Latency vs. Bandwidth vs. Throughput

In computer networks, two metrics are critical to determining performance:

  • Bandwidth: the maximum things you can get done at once.
  • Latency: how long each thing takes.

It feels obvious to optimize bandwidth. Theoretically, you can get everything done at one time and not worry about how long it takes, but the impact of latency multiplies.

Think of a network like a bridge that connects one city to another: let’s say theoretically 50 cars can cross the bridge in one minute (that’s bandwidth), and it takes about 5 minutes to cross the bridge (that’s latency). So realistically, only 10 cars per minute can cross (that’s throughput). Throughput is the actual rate that the network can successfully transmit data.

You could go through the effort of making another similarly sized bridge, but with all that effort, the throughput only increases to 20 cars per minute. Let’s say the slowness is a result of a stoplight at the end of the bridge, causing a frequent backup. If you move the stoplight to a later connecting street (prioritizing the cars leaving the bridge), then the throughput can increase to about 50 cars per minute; that’s 5x faster. The effect of latency compounds, in both positive and negative ways.

Network Failures

Since the foundational concepts are covered, let’s talk about what the book mostly focuses on, network failures. The foundational concepts deal with the base network infrastructure, whereas network failures deal with the dreaded real world with real customer usage patterns. In the bridge analogy, think about what traffic actually looks like: during rush hour, a ton of cars try to cross at the same time, and in the middle of the night, almost no one.

At some point, you’ll need to make a plan to handle failures, in this case, a failure to cross the bridge. If a car is in traffic behind the bridge, and it’s not worth the wait right now, they might turn around and come back later. If enough cars do this, then you have a big problem because cars making several unsuccessful attempts to cross wastes everyone’s time. This can be thought of as a network failure, and it happens so often in real networks, entire fields of study focus on network failure management.

Exponential Backoff

Unlike the car scenario, in a real network, there’s no effort on the user to retry. If a webpage doesn’t load, you just click the refresh button, if it’s still slow, you might keep clicking it. This is where the bridge analogy starts to break down. It’s as if you drove away from the bridge traffic and immediately went to the back of the line, no one would do that.

Another difference is that in the network, all your requests are still sent, clogging up the network: your own requests might interfere with a request you make later. Of course, this is an oversimplification, because loading a simple web page involves many layers working together, so a network failure might not be the only problem causing slowness. But to deal with network failures, one thing you can try is exponential backoff.

Exponential backoff is an algorithm that waits longer after every failed attempt. So you might leave the bridge traffic after a few minutes of waiting, go get a coffee for 30 minutes to see if it gets better. If not, go home and wait a few hours, and so on.

Tail Drop

To get ahead of failures on the network side, you can use tail drop, an algorithm that proactively rejects any new requests when the queue is full. This reduces the time clients have to wait for a failed response, but if there’s no other option to get what they need, then they’ll probably just make another request, so a more sophisticated algorithm would be nice.

Additive Increase, Multiplicative Decrease

A more sophisticated method for dealing with network failures is called Additive Increase, Multiplicative Decrease (AIMD), a fundamental part of the TCP protocol, the backbone of most internet connections we use day to day.

This algorithm controls the overall traffic flow, not just the failures. The idea is that if the network is working well, you add another connection and keep adding more as time goes on until there’s a failure. Once there’s a failure, you reduce your connections by a multiplicative factor, like 1/2. This works well because the algorithm itself is uncovering the real limits of the system, you don’t have to know what your throughput is. It’s also self-healing for many issues because there’s a clear and immediate action to take when there’s a failure.

Quality Over Quantity

Everything has limits: networks, bridges, your time. Using the algorithms described in this post, you can find and work around these limits. When a failure happens from overloading something, it doesn’t make sense to keep pushing it just because you thought the capacity was higher. Acknowledge the failure and reset your expectations.