Slow Linux system? Perf to the rescue!

Perf at a glance

Quoting the original page, Perf is a profiler tool for Linux 2.6+ based systems that abstracts away CPU hardware differences in Linux performance measurements and presents a simple command-line interface. Perf is based on the perf_events interface exported by recent versions of the Linux kernel. Perf notation is somewhat similar to that of version control tools and strace combined. For example, perf stat will run a command and gather performance counter statistics. Perf record will run a command and record its profile into a log file (perf.data by default). Then, perf report will read the log file created by the perf record run and display the profile. You may also want to analyze lock events, define probes, and more. There’s even a top mode.
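To make this a bit more concrete, here is a minimal sketch of such a session; the traced command (ls) is arbitrary and chosen purely for illustration:

perf stat -- ls /tmp      # run the command and print counter statistics
perf record -- ls /tmp    # run the command and record samples into perf.data
perf report               # read perf.data and display the profile
perf top                  # live, top-like view of the hottest symbols system-wide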

All in all, I am bluffing like a pro, because it is impossible to learn perf in five minutes of reading, especially if you do not have significant prior Linux knowledge. You will need to understand Linux system behavior fairly well, and be somewhat proficient in the concepts of scheduling, memory management, various interfaces, and CPU architecture. Only then will perf make good sense. Still, let us explore a scenario where this tool proves its awesome merit.

The problem we have

Say you have a Linux system with multiple users working on it, against a remote file system, e.g. NFS. One of the users reports a certain performance degradation in accessing the network objects, like files and directories. Other users have no such woes. As a smart lad, you immediately start a proper investigation, and you grab your favorite tool, strace.
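Against the same hypothetical NFS object used in the rest of this article, that first comparison run, done once as each user, would look like this:

strace -c /usr/bin/stat /nfs/object.txt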

You execute a summary run, with the -c flag, and in both cases the quantity of system calls and errors, as well as their order, is identical. The only difference is that a specific system call takes longer for one user than for the other. You narrow down your tests using the -e flag, which lets you trace individual calls. With the so-called problematic user, you get a consistently slow result for the lstat system call:

strace -tt -T -e lstat /usr/bin/stat /nfs/object.txt
14:04:28.659680 lstat("/nfs/object.txt", {st_mode=S_IFREG|0755,
st_size=291, ...}) = 0 <0.011364>

On the other hand, the so-called healthy user has no such worries:

strace -tt -T -e lstat /usr/bin/stat /nfs/object.txt
14:04:54.032616 lstat("/nfs/object.txt", {st_mode=S_IFREG|0755,
st_size=291, ...}) = 0 <0.000069>

There is no difference in the environment settings of the two users. They both use identical shells. Everything is the same except for that one small difference in the lstat system call time. With your usual set of tools, you have just reached a dead end.

Call the cavalry

Now, let us see what perf can do for us. Let us run perf stat wrapped around the stat command. It is a very good way to start your investigation, because you will get a neat summary of what happens in the kernel, which can then help point out the next lead in your troubleshooting.
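The invocation is as simple as it gets:

perf stat /usr/bin/stat /nfs/object.txt

Indeed, for the healthy system, we get: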

Performance counter stats for '/usr/bin/stat /nfs/object.txt':

       3.333125  task-clock-msecs         #      0.455 CPUs
              6  context-switches         #      0.002 M/sec
              0  CPU-migrations           #      0.000 M/sec
            326  page-faults              #      0.098 M/sec
        3947536  cycles                   #   1184.335 M/sec
        2201811  instructions             #      0.558 IPC
          45294  cache-references         #     13.589 M/sec
          11828  cache-misses             #      3.549 M/sec

    0.007327727  seconds time elapsed

For the baddie:

Performance counter stats for '/usr/bin/stat /nfs/object.txt':

      14.167143  task-clock-msecs         #      0.737 CPUs
              7  context-switches         #      0.000 M/sec
              0  CPU-migrations           #      0.000 M/sec
            326  page-faults              #      0.023 M/sec
       17699949  cycles                   #   1249.366 M/sec
        4424158  instructions             #      0.250 IPC
         304109  cache-references         #     21.466 M/sec
          60553  cache-misses             #      4.274 M/sec

    0.019216707  seconds time elapsed

There is a marked difference between the two, as you can observe. While the CPU speed is the same, the number of migrations and page faults is identical, and the context switch count is nearly so, the bad user spins approximately five times longer, using more cycles and instructions across the board, resulting in more total time needed for the command to complete. This already shows us that something is wrong.
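As an aside, if you want to keep an eye on just a few of these counters over repeated runs, perf stat accepts an explicit event list, along these lines:

perf stat -e cycles,instructions,cache-references,cache-misses /usr/bin/stat /nfs/object.txt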

Let us explore a little deeper. Let us now record the run with perf record and then analyze the data with perf report. This will give us a far more detailed understanding of what really happened.
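Keeping the same test command, the sequence is roughly:

perf record /usr/bin/stat /nfs/object.txt
perf report

So here's the report for the good user: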

# Samples: 56
# Overhead  Command      Shared Object  Symbol
# ........  .......  .................  ......
5.36%     stat  [kernel]           [k] page_fault
5.36%     perf  /usr/bin/perf      [.] 0x0000000000d099
3.57%     stat  [kernel]           [k] flush_tlb_others_ipi
3.57%     stat  [kernel]           [k] handle_mm_fault
3.57%     stat  [kernel]           [k] find_vma
3.57%     stat  /lib64/libc-2…   [.] __GI_strcmp
3.57%     stat  /lib64/ld-2.11…  [.] _dl_lookup_symbol_x
3.57%     perf  [kernel]           [k] _spin_lock
1.79%     stat  [kernel]           [k] flush_tlb_mm
1.79%     stat  [kernel]           [k] finish_task_switch
1.79%     stat  [kernel]           [k] ktime_get_ts

And for our naughty misbehaver – do note the extra-long symbol names coming from the sunrpc kernel module. Anyhow, the baddie:

# Samples: 143
# Overhead  Command      Shared Object  Symbol
# ........  .......  .................  ......
57.34%     stat  [kernel]           [k] rpcauth_lookup_credcache [sunrpc]
2.80%     stat  [kernel]           [k] generic_match [sunrpc]
2.10%     stat  [kernel]           [k] clear_page_c
1.40%     stat  [kernel]           [k] flush_tlb_others_ipi
1.40%     stat  [kernel]           [k] __do_fault
1.40%     stat  [kernel]           [k] zap_pte_range
1.40%     stat  [kernel]           [k] handle_mm_fault
1.40%     stat  [kernel]           [k] __link_path_walk
1.40%     stat  [kernel]           [k] page_fault
1.40%     stat  /lib64/libc-2…   [.] __GI_strcmp
1.40%     stat  /lib64/ld-2.11…  [.] _dl_load_cache_lookup

Understanding the report

Before we discuss the results, we need to spend a moment or two to overview the report format. The first column indicates the percentage of the overall samples collected in the corresponding function, similar to what strace -c reports. The second column reports the process from which the samples were collected. In the per-thread/per-process mode, this will always be the name of the monitored command. In the CPU-wide mode, the command name can vary.

The third column shows the name of the ELF image where the samples came from. If a program is dynamically linked, then this may show the name of a shared library. When the samples come from the kernel, then the pseudo ELF image name [kernel.kallsyms] is used. The fourth column indicates the privilege level at which the sample was taken, i.e. the level at which the program was running when it was interrupted. The following levels exist:

  • [.]: user level
  • [k]: kernel level
  • [g]: guest kernel level (virtualization)
  • [u]: guest OS user space
  • [H]: hypervisor

The final column shows the symbol name. If you’re interested in learning more about what the specific symbol name does, you can consult The Linux Cross Reference site and search for the specific string entry, or if you have the kernel sources available and installed for your distro, normally under /usr/src/linux, then consult there. Don’t do this for fun, please. There are better ways to spend the afternoon.
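If you do go down that road, a plain grep over the source tree is usually enough to locate a symbol; for instance, for one of the entries in the healthy report above:

grep -rn "handle_mm_fault" /usr/src/linux/mm/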

Now, the results. We can see the difference immediately. The bad user wastes a lot of time fiddling with rpcauth_lookup_credcache, which is linked inside the sunrpc kernel module. At this point, you have all the information you need to go to the Internet and do a very narrow and intelligent search. Just by punching in the name of the symbol, you will find two or three mailing list references pointing to this phenomenon, which in turn points to a bug.

Your problem is NOT immediately solved, but with solid information in hand, you know it can now be handled. This means reporting the issue to the community, the vendor, or whoever is in charge of your distro, patiently waiting for a fix, or perhaps searching for a workaround. But the whole point is that we used perf, which exposed information we could not have obtained otherwise, and thus allowed us to move forward in our troubleshooting.
