Pet Hate: MTR

MTR, also known as Matt’s Trace Route, is an enhanced traceroute utility which, after making the initial run, continues to rerun the traceroute and calculates hop-specific packet loss and latencies.

Unfortunately, virtually every time someone calls me and mentions “packet loss” and “MTR” in the same breath, it’s because they do not understand the output.

I’m going to assume you already know what a traceroute is and what it does. MTR runs a traceroute over and over, indefinitely, in order to identify possible faulty routers or links. For example, this is an MTR run from my server to www.linx.net:

My traceroute  [v0.72]
mashed (0.0.0.0)                                       Mon Sep  8 10:14:13 2008
Keys:  Help   Display mode   Restart statistics   Order of fields   quit
                                      Packets               Pings
Host                                Loss%   Snt   Last   Avg  Best  Wrst StDev
1. gw0.potato-people.com             0.0%    11    0.6   1.0   0.5   1.4   0.4
2. gi0-3.br1.heron.bytel.net.uk      0.0%    11   98.8  13.9   0.5  98.8  31.1
3. vlan1.br0.heron.bytel.net.uk      0.0%    11    2.3   1.9   1.0   2.6   0.5
4. collector.linx.net                0.0%    10   19.6  18.4  17.3  20.0   1.0
5. pink.linx.net                     0.0%    10   18.5  18.1  17.2  19.3   0.7

Pretty simple – each hop is identified, and then MTR repeats this (note the “Snt”, or sent packets column) and records the loss and latencies.
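To make the columns concrete, here is a minimal Python sketch of how the per-hop figures could be derived from a series of probe results. The function name `hop_stats`, the use of `None` for a probe that got no reply, and the choice of population standard deviation are all assumptions for illustration; this is not MTR's actual code.

```python
import statistics


def hop_stats(samples):
    """Summarise one hop.

    samples: round-trip times in ms, one per probe sent, with None
    standing in for a probe that never got a reply (an assumption of
    this sketch, not MTR's internal representation).
    """
    replies = [s for s in samples if s is not None]
    sent = len(samples)
    loss_pct = 100.0 * (sent - len(replies)) / sent if sent else 0.0
    return {
        "loss_pct": round(loss_pct, 1),                     # Loss%
        "snt": sent,                                        # Snt
        "last": replies[-1] if replies else None,           # Last
        "avg": round(statistics.fmean(replies), 1) if replies else None,
        "best": min(replies) if replies else None,          # Best
        "wrst": max(replies) if replies else None,          # Wrst
        # Population standard deviation is assumed here.
        "stdev": round(statistics.pstdev(replies), 1) if replies else None,
    }


# Four probes, one of which went unanswered.
print(hop_stats([0.6, 1.0, None, 1.4]))
```

The point to take away is that Loss% only counts probes that this particular hop failed to answer; it says nothing by itself about traffic passing through the hop.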

Packet Loss

If we saw a sudden jump to 50% loss at hop 3 and beyond, then we would know there is a problem between hops 2 and 3, or at hop 3 itself. For example:

My traceroute  [v0.72]
mashed (0.0.0.0)                                       Mon Sep  8 10:14:13 2008
Keys:  Help   Display mode   Restart statistics   Order of fields   quit
                                      Packets               Pings
Host                                Loss%   Snt   Last   Avg  Best  Wrst StDev
1. gw0.potato-people.com             0.0%    11    0.6   1.0   0.5   1.4   0.4
2. gi0-3.br1.heron.bytel.net.uk      0.0%    11   98.8  13.9   0.5  98.8  31.1
3. vlan1.br0.heron.bytel.net.uk     50.0%    11    2.3   1.9   1.0   2.6   0.5
4. collector.linx.net               50.0%    10   19.6  18.4  17.3  20.0   1.0
5. pink.linx.net                    50.0%    10   18.5  18.1  17.2  19.3   0.7
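The rule of thumb above (loss that starts at one hop and carries through to the destination) can be sketched in Python. The regular expression only targets the line layout shown in this post, and the names `parse_hops`, `first_persistent_loss` and the 5% threshold are hypothetical choices for the sketch.

```python
import re

# Matches lines such as "3. vlan1.br0.heron.bytel.net.uk  50.0%  11 ..."
HOP_RE = re.compile(r"^\s*(\d+)\.\s+(\S+)\s+([\d.]+)%\s+(\d+)")

REPORT = """\
1. gw0.potato-people.com             0.0%    11    0.6   1.0   0.5   1.4   0.4
2. gi0-3.br1.heron.bytel.net.uk      0.0%    11   98.8  13.9   0.5  98.8  31.1
3. vlan1.br0.heron.bytel.net.uk     50.0%    11    2.3   1.9   1.0   2.6   0.5
4. collector.linx.net               50.0%    10   19.6  18.4  17.3  20.0   1.0
5. pink.linx.net                    50.0%    10   18.5  18.1  17.2  19.3   0.7
"""


def parse_hops(text):
    """Return (hop_number, host, loss_pct, sent) for each hop line."""
    hops = []
    for line in text.splitlines():
        m = HOP_RE.match(line)
        if m:
            hops.append((int(m.group(1)), m.group(2),
                         float(m.group(3)), int(m.group(4))))
    return hops


def first_persistent_loss(hops, threshold=5.0):
    """Hop number where loss begins and continues through every later
    hop, including the destination; None if the destination is clean."""
    start = None
    # Walk back from the destination while loss stays above the threshold.
    for number, _host, loss, _sent in reversed(hops):
        if loss < threshold:
            break
        start = number
    return start


print(first_persistent_loss(parse_hops(REPORT)))  # 3
```

A result of 3 points at hop 3 or the link between hops 2 and 3, which is exactly the reading given above.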

Measuring Routers

Unfortunately, what I see more often than not is something like this:

My traceroute  [v0.72]
mashed (0.0.0.0)                                       Mon Sep  8 10:14:13 2008
Keys:  Help   Display mode   Restart statistics   Order of fields   quit
                                      Packets               Pings
Host                                Loss%   Snt   Last   Avg  Best  Wrst StDev
1. gw0.potato-people.com             0.0%    11    0.6   1.0   0.5   1.4   0.4
2. gi0-3.br1.heron.bytel.net.uk      9.0%    11   98.8  13.9   0.5  98.8  31.1
3. vlan1.br0.heron.bytel.net.uk      3.0%    11    2.3   1.9   1.0   2.6   0.5
4. collector.linx.net                0.0%    10   19.6  18.4  17.3  20.0   1.0
5. pink.linx.net                     0.0%    10   18.5  18.1  17.2  19.3   0.7

This example shows lost packets at hops 2 and 3 but – and here’s the important part – not beyond hops 2 or 3. In this case, the MTR is measuring the CPU load of the router at those hops, not the packet loss on the connection. Check hop 5 – no packets have been dropped at the actual destination.
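As a rough illustration of that reading, the sketch below flags intermediate hops that report more loss than the destination does. The function name `transit_only_loss` and the input shape (a list of hop number and loss percentage pairs) are assumptions for this example, and the loss figures are the ones from the output above.

```python
def transit_only_loss(hops):
    """hops: (hop_number, loss_pct) pairs ordered source to destination.

    Returns the intermediate hops whose reported loss is higher than
    the loss seen at the final hop, i.e. loss that does not carry
    through to the destination.
    """
    if not hops:
        return []
    destination_loss = hops[-1][1]
    return [number for number, loss in hops[:-1] if loss > destination_loss]


# Loss% column from the example above.
example = [(1, 0.0), (2, 9.0), (3, 3.0), (4, 0.0), (5, 0.0)]
print(transit_only_loss(example))  # [2, 3]
```

Hops 2 and 3 are flagged, yet hop 5 is clean, so the loss belongs to those routers' own replies and not to the path.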

You see, nearly all routers, much like computers, have a list of priorities of things they have to deal with. Forwarding packets between ports is the highest priority. Things such as routing protocols come second, and the management interface (whether it be by web, telnet or serial console) comes third. Responding to packets sent directly to the router comes long after everything else.

So, if a router is particularly busy and has a lot of packets to forward, it’ll drop the lowest priority things to get a few more CPU cycles. This means the first thing to get dropped from its list of things to do, when under stress, is responding to packets sent directly to the router.
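A toy model of that priority list, in Python, shows why a busy router can forward everything and still ignore most probes aimed at it. The task names, the single "capacity" number and the `serve` function are invented for this sketch; real routers are far more complicated, and many handle forwarding in dedicated hardware.

```python
# Highest priority first, following the order described above.
PRIORITIES = [
    "forwarding",
    "routing protocols",
    "management",
    "replies to direct packets",
]


def serve(demand, capacity):
    """Hand out a fixed capacity to tasks in strict priority order.

    demand: units of work wanted per task; capacity: units available.
    Returns the units actually served per task.
    """
    served = {}
    remaining = capacity
    for task in PRIORITIES:
        done = min(demand.get(task, 0), remaining)
        served[task] = done
        remaining -= done
    return served


busy = {
    "forwarding": 95,
    "routing protocols": 3,
    "management": 1,
    "replies to direct packets": 10,
}
print(serve(busy, capacity=100))
```

In this made-up scenario all 95 units of forwarding are served, but only 1 of the 10 direct replies is, so a probe aimed at the router would mostly go unanswered while traffic through it is untouched.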

ICMP is lossy

Ping, traceroute and MTR all rely on ICMP, and ICMP is very, very lossy. That means packets will drop, and this should be expected. In the example below, we can see a level of packet loss across all hops. However, check the “Snt” column: this MTR has been running for some time and has sent over 1,400 packets to each hop. It measures nothing more than the lossy nature of ICMP over a long time period. Pure background noise.

My traceroute  [v0.72]
mashed (0.0.0.0)                                       Mon Sep  8 10:38:03 2008
Keys:  Help   Display mode   Restart statistics   Order of fields   quit
                                      Packets               Pings
Host                                Loss%   Snt   Last   Avg  Best  Wrst StDev
1. gw0.potato-people.com             0.8%  1439    2.6   1.0   0.5  12.0   1.1
2. gi0-3.br1.heron.bytel.net.uk      1.4%  1439    1.1   4.3   0.3 217.2  21.5
3. vlan1.br0.heron.bytel.net.uk      1.3%  1438    1.6   6.0   0.6 208.5  24.2
4. collector.linx.net                1.3%  1438   18.3  27.5  16.2 395.3  36.0
5. pink.linx.net                     1.5%  1438   18.9  18.3  16.2  27.9   1.0
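Turning those percentages back into packet counts helps show how small the numbers are. The sketch below uses the Loss% and Snt columns from the output above; the counts are approximate because the percentages are already rounded, and the helper names `lost_packets` and `loss_spread` are made up for this example.

```python
# (loss_pct, sent) per hop, copied from the output above.
HOPS = [(0.8, 1439), (1.4, 1439), (1.3, 1438), (1.3, 1438), (1.5, 1438)]


def lost_packets(loss_pct, sent):
    """Approximate number of unanswered probes at one hop."""
    return round(loss_pct / 100 * sent)


def loss_spread(hops):
    """Gap between the highest and lowest Loss% across all hops."""
    losses = [pct for pct, _sent in hops]
    return round(max(losses) - min(losses), 1)


print([lost_packets(pct, sent) for pct, sent in HOPS])  # [12, 20, 19, 19, 22]
print(loss_spread(HOPS))  # 0.7
```

Roughly a dozen to two dozen unanswered probes out of more than 1,400 at every hop, with under one percentage point between the best and worst, is a flat profile with no hop where the loss jumps.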

Turning off ICMP

It’s for these very reasons that an increasing number of ISPs are disabling the ability to do traceroutes across their network. It used to be that this was done for security – it’s much harder to hack into someone’s network if you do not know the addresses of any of the routers or switches – but now it’s done for a combination of security and to stop calls from customers who don’t know how to interpret the results of a tool that, for example, some VoIP company said they should run.
