I have been using Linux for several years now and although I have looked at the load averages from time to time (either using top or uptime), I never really understood what they meant. All I knew was that the three different numbers stood for averages over three different time spans (1, 5, and 15 minutes) and that under normal operation the numbers should stay under 1.00 (which I now know is only true for single-core CPUs).
Earlier this week at work I needed to figure out why a box was running slow. I was put in charge of determining the cause, whether it be excessive heat, low system resources, or something else. Here's what I saw for load averages when I ran the top command on the box:
load average: 2.86, 3.00, 2.89
I knew that looked high, but I had no idea how to explain what "normal" was and why. I quickly realized that I needed a better understanding of what I was looking at before I could confidently explain what was going on. A quick Google search turned up this very detailed article about Linux load averages, including a look at some of the C functions that actually do the calculations (this was particularly interesting to me because I'm currently learning C).
To keep this post shorter than the aforementioned article, I'll simply quote the two sentences that gave me a clear-as-day explanation of how to read Linux load averages:
The point of perfect utilization, meaning that the CPUs are always busy and, yet, no process ever waits for one, is the average matching the number of CPUs. If there are four CPUs on a machine and the reported one-minute load average is 4.00, the machine has been utilizing its processors perfectly for the last 60 seconds.
The machine I was checking at work was a single-core Celeron machine. This meant that with a continuous load of almost 3.00 the CPU was being stressed much higher than it should be. Theoretically, the same load spread across a dual-core machine would work out to around 1.50 per core, and across a quad-core machine to around 0.75 per core.
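To make that arithmetic concrete, here is a minimal Python sketch. The helper name per_core_load is my own invention for illustration, and the 3.00 figure is just the rounded load from the top output above. It only divides the load average by the core count, which is a simplification of how a real scheduler spreads work:

```python
def per_core_load(load_average: float, cores: int) -> float:
    """Divide a load average by the number of cores (hypothetical helper)."""
    if cores < 1:
        raise ValueError("cores must be at least 1")
    return load_average / cores


# The same load of roughly 3.00 on single-, dual-, and quad-core machines.
for cores in (1, 2, 4):
    print(f"{cores} core(s): {per_core_load(3.00, cores):.2f} per core")
```

Anything above 1.00 per core suggests that, on average, processes were waiting for their turn.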
There is a lot more behind truly understanding the Linux load averages, but the most important thing to understand is that they do not represent CPU usage. Rather, they represent the demand on the CPU: the average number of processes that are either running or waiting for their chance to use the CPU. If you still can't get your brain away from thinking in terms of percentages, consider 1.00 to be 100% load for single-core CPUs, 2.00 to be 100% load for dual-core CPUs, and so on.
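If thinking in percentages helps, the conversion can be sketched in a few lines of Python. This is only a rough illustration: load_as_percent and current_load_percentages are hypothetical names of my own, while os.getloadavg() and os.cpu_count() are standard library calls (the former is only available on Unix-like systems). Note that os.cpu_count() reports logical CPUs, which may differ from physical cores.

```python
import os


def load_as_percent(load_average: float, cores: int) -> float:
    """Express a load average as a percentage of total core capacity."""
    return 100.0 * load_average / cores


def current_load_percentages() -> dict:
    """Read the 1, 5, and 15 minute load averages and convert each one."""
    cores = os.cpu_count() or 1  # cpu_count() can return None
    one, five, fifteen = os.getloadavg()
    return {
        "1 min": load_as_percent(one, cores),
        "5 min": load_as_percent(five, cores),
        "15 min": load_as_percent(fifteen, cores),
    }


if __name__ == "__main__":
    try:
        for window, percent in current_load_percentages().items():
            print(f"{window}: {percent:.0f}% of capacity")
    except (OSError, AttributeError):
        # getloadavg() is unavailable or failed on this platform.
        print("load averages are not available here")
```

A result over 100% does not mean the CPU is doing more than it can; it means there was more work queued than the cores could run at once.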
Update: John Gilmartin had some insightful feedback and shared a link to Understanding Load Averages where there's a nice graphical description for how load averages work.
Hello,
Thanks for explaining it shortly, helpful.
You’re welcome! 🙂
I’ve been searching for a simplified explanation of this, and I found this page. Now it makes sense 🙂
Thank you!!
You’re welcome, Jake!
Thanks for the short and understandable explanation 🙂
You’re very welcome, Tejas! Thanks for stopping by! 🙂
Great explanation! Found pages full of text and graphs about this, your article beats them all.
Thanks!! That’s exactly the problem I hoped this post would solve! 🙂
I don’t agree with your understanding of load average; the man page doesn’t say that the load average stands only for the load on the CPU.
According to the man page, the load average is the system load average.
A high load average may be caused by bottlenecks in CPU, memory, disk I/O, or network. In some cases the CPU utilization is low but the system load is high.
Thanks, just a little different understanding of load average. Mike
Raam, I’m afraid your statement, “with a continuous load of almost 3.00 the CPU was being stressed much higher than it should be,” is misleading, or at least mistaken. (For a single core system) a load average of 3.00 means that there were on average 3.00 times as many jobs on the ‘run-queue’ as there was CPU capacity to run them. The run-queue is simply the total number of jobs on the CPU plus those waiting to get on the CPU. In other words, there were some jobs queueing up before the CPU could process them. The point is, the CPU (itself) isn’t, “stressed much higher than it should be.” I think wording it that way could imply that it could overheat or wear out or something along those lines! Yes, the system may be slower to respond, but it’s not ‘bad’ for the system itself, although it may be viewed as less than desirable by any users of it.
Another way of considering it is demand versus capacity. A load average of 3.00 means 3.00 times as much demand as there is capacity. (Capacity meaning the capacity of one CPU core – see below.)
You do touch on an important point, which is that load average needs to be considered alongside the number of CPU cores in the system. When considering load averages, 4x single-core CPUs is equivalent to 2x dual-core CPUs is equivalent to 1x quad-core CPU is equivalent to four cores. A load average of 3.00 on a quad-core system would mean there was some CPU time going unused: 3.00/4=0.75 or 75%.
However, another important point is that jobs may be slowed down for other reasons, such as I/O. e.g. A job may be waiting for a read from or write to disk to complete, or similarly for I/O on a network interface, etc.
I find the following page to be a very good explanation with helpful analogies: http://blog.scoutapp.com/articles/2009/07/31/understanding-load-averages
John,
Thanks so much for your explanation and for the link. I love the graphical description with cars/traffic. 🙂 I’ll go ahead and link to that blog post and this comment in the main body of the post.
Cheers!
Raam
You’re welcome Raam, I’m glad you found it helpful and worthwhile. John
Just yesterday my 1, 5, and 15 minute load averages were all close to or over 25.00
Wow, that’s crazy! What’s causing it?
I had no idea at all; my top was showing nothing abnormal. I couldn’t even really log on to a TTY session, I kept getting timeouts before the password prompt came up. I think something related to LXDE bugged out or something. The computer it was on has an Athlon X2 (somewhere between 800 MHz and 1 GHz per core, depending on the source) and normally has a load average of under 0.10, with 1.00 on the high end.
I had a load of 12 on a server, a 4-core Xeon at 2.4 GHz. It was sending mails with a 100K attachment, 6 mails to maybe 30 recipients each, and amavis was checking them all for viruses. But I’ve sent 2000 mails and the load stays at 5. This 12 thing was weird.
That explanation was very helpful to me.
Thank you for sharing.
You’re most welcome, Eric! Thanks for the comment. 🙂
nice article
Thank you! 🙂