On the Netflix Tech Blog I recently published Linux Performance Analysis in 60,000 Milliseconds, showing the commands we use in the first 60 seconds of a performance investigation. Most of the time we don't get this far, since we solve most issues using the Atlas and Vector open source GUIs. But this should be useful to share anyway, since it involves standard Linux commands that you can easily try out.
I just made a short video showing this command sequence in action (on YouTube).
It's not just about what the commands find, but also what they don't find, which directs follow-up investigation. In that video, this is what I learned:
- Load appeared steady
- No unusual system errors (dmesg)
- Heavy user-mode CPU time, evenly distributed at over 90% on all CPUs, and still some idle
- Main memory availability looked fine
- Network throughput looked low, and unlikely to be near any limits
- TCP retransmits were zero
- There was a non-zero rate of active (locally initiated) TCP connections
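
As a rough sketch, the checklist behind those findings could be scripted as below. The ten commands are the ones from the Netflix Tech Blog post; the function names, the sample counts of 5, and the batch-mode `top` are assumptions added here so that each command exits on its own, whereas in practice the interval commands are left running until interrupted.

```shell
#!/bin/sh
# The first-60-seconds checklist, one command per line. The trailing
# counts (5) and "top -b -n 1" are assumptions so that nothing runs forever.
first60_commands() {
    cat <<'EOF'
uptime
dmesg | tail
vmstat 1 5
mpstat -P ALL 1 5
pidstat 1 5
iostat -xz 1 5
free -m
sar -n DEV 1 5
sar -n TCP,ETCP 1 5
top -b -n 1 | head -n 20
EOF
}

# Run each command in turn under a header line. Tools that are not
# installed are skipped (mpstat, pidstat, iostat, and sar ship in sysstat).
first60_run() {
    first60_commands | while IFS= read -r cmd; do
        tool=${cmd%% *}
        printf '\n=== %s ===\n' "$cmd"
        if command -v "$tool" >/dev/null 2>&1; then
            # Redirect stdin so the command cannot consume the command list.
            sh -c "$cmd" </dev/null
        else
            echo "skipped: $tool not found"
        fi
    done
}
```

After sourcing the file, `first60_run` prints each command's output in order. Load comes from `uptime`, CPU balance from `mpstat -P ALL`, network throughput from `sar -n DEV`, and the retransmit and active connection rates from `sar -n TCP,ETCP`.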
If I'm investigating a performance issue, my leads from these findings would be:
- Profile CPU usage using Linux perf and flame graphs
- Check those active connections: who they're for, and their latency
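
A minimal sketch of how those two leads might be followed up from the shell. The function names, the `FLAMEGRAPH_DIR` variable, and the `perf.svg` output name are hypothetical; the 99 Hertz, 30 second sampling is a common choice rather than something taken from the video. Note that `ss` lists established connections in both directions, so it only approximates "active" connections.

```shell
#!/bin/sh
# Lead 1: sample stacks on all CPUs at 99 Hertz for 30 seconds, then render
# a flame graph. FLAMEGRAPH_DIR is an assumed path to a checkout of the
# FlameGraph repository; perf usually needs root privileges.
profile_cpu() {
    fg_dir=${FLAMEGRAPH_DIR:-./FlameGraph}
    perf record -F 99 -a -g -- sleep 30 &&
    perf script | "$fg_dir/stackcollapse-perf.pl" | "$fg_dir/flamegraph.pl" > perf.svg
}

# Lead 2: count established TCP connections per remote address:port.
# Reads `ss -tn state established` output on stdin; with a state filter,
# column 4 is the peer address. The first line (the header) is skipped.
count_peers() {
    awk 'NR > 1 { n[$4]++ } END { for (p in n) print n[p], p }' | sort -rn
}
```

For example, `ss -tn state established | count_peers` shows which peers hold the most connections, and `ss -tni` adds per-connection details such as round-trip time for a first look at latency.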
I wouldn't dig deeper on memory usage, disk, or file system I/O, until I'd taken a good look at those two.
As a follow-on to the first 60 seconds, you can check out my 90-minute Linux Performance Tools tutorial from Velocity 2015; the video is online. It's the best and most complete summary I've given on the topic.