Measuring and Improving Application Performance with PerfSuite

Get a realistic view of how your program runs on real hardware, so you can find small changes that make a big performance difference.

Using Performance Counters to Measure Application Characteristics

Let's say that you do have an application that isn't cache-friendly. What might happen? In the worst-case scenario, rather than loading a line of data into the cache and operating repeatedly on the data contained in that line, the program may use only one piece of data and then be done with it. The next piece of data it needs may require another cache line to be loaded, and so forth. Each of these cache loads is relatively expensive and can reduce performance, because the processor spends much of its time waiting for the data it needs to become available. Each time the next piece of data is required, the processor first looks for it among the data already resident in the cache. If it's not found, a cache miss occurs and a corresponding hardware event is signaled. The higher the ratio of cache misses to hits, the more likely it is that the overall performance of the software degrades.

Listing 1 shows a basic but concrete example of how this might occur. The listing shows a loop that initializes each element of a matrix using the sum of the corresponding element of another matrix and a vector. Because the C language stores data in row-major order, the loop as written does not access neighboring data elements in the two matrices. Fortunately, this problem has a simple solution: interchange the nested loops so the matrices are processed on a row-by-row basis. This pattern of array access also is referred to as stride-one access. Many optimizing compilers perform this type of loop-interchange optimization automatically, depending on the optimization level you select.
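For readers who want something to experiment with, the following is a minimal sketch of the kind of loop described above and its interchanged form. It is not a reproduction of Listing 1: the function names, the dimension N and the choice of indexing the vector by row are all assumptions made for illustration.

```c
#include <stddef.h>

#define N 256  /* hypothetical matrix dimension, chosen only for illustration */

/* Column-by-column traversal: the inner loop walks down the rows, so
 * successive accesses to a[i][j] and b[i][j] are N doubles apart in
 * memory (C stores each row contiguously). */
void init_column_order(double a[N][N], double b[N][N], const double c[N])
{
    for (size_t j = 0; j < N; j++)
        for (size_t i = 0; i < N; i++)
            a[i][j] = b[i][j] + c[i];
}

/* Same computation with the loops interchanged: the inner loop now walks
 * along a row, so successive accesses touch adjacent elements
 * (stride-one access). */
void init_row_order(double a[N][N], double b[N][N], const double c[N])
{
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++)
            a[i][j] = b[i][j] + c[i];
}
```

Both functions produce identical results; only the order in which memory is touched differs. Calling each one from a small driver program and timing it gives a rough way to see the effect on a particular machine, although an optimizing compiler may perform the interchange itself at higher optimization levels.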

Test cases containing these two versions of the loop were compiled with a recent release of Intel's ICC compiler, run on a Pentium III computer and timed. This simple change sped up the loop by a factor of ten. Not unexpectedly, the overall level 2 cache miss count decreased considerably for the optimized version of the loop (from 212,665,026 to 25,287,572; see the next section for more information).

Often, it's useful to combine the raw hardware performance counts into a derived metric that can provide a normalized view of performance. For example, one of the most widely used metrics for performance measurement describes the average number of cycles required to complete an instruction (CPI). By counting the total number of cycles and instructions retired (both of which are available as hardware events), we easily can obtain this metric. Similarly, we might be interested in knowing, on average, how often a piece of data was reused once it was resident in the cache. By counting the appropriate cache-related events and combining them into a single metric, we can obtain an approximation of this information as well.
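As a rough illustration of how raw counts turn into derived metrics, the sketch below computes CPI and one plausible approximation of cache reuse (accesses per miss). The function names are hypothetical, and the reuse formula is an assumption made for this example; it is not necessarily the definition PerfSuite uses in its reports.

```c
#include <stdint.h>

/* Average cycles per instruction: total cycles divided by instructions
 * retired. Returns 0.0 if no instructions were counted, to avoid a
 * division by zero. */
double cycles_per_instruction(uint64_t cycles, uint64_t instructions)
{
    return instructions ? (double)cycles / (double)instructions : 0.0;
}

/* Assumed approximation of data reuse: cache accesses divided by cache
 * misses. Each miss brings a line into the cache, so this estimates how
 * many accesses each loaded line served on average. Returns 0.0 if no
 * misses were counted. */
double accesses_per_miss(uint64_t accesses, uint64_t misses)
{
    return misses ? (double)accesses / (double)misses : 0.0;
}
```

For instance, a run that retired 2,000 instructions in 3,000 cycles has a CPI of 1.5. A lower CPI generally means the processor is stalling less often, and a higher accesses-per-miss figure suggests that data is being reused while it is still in the cache.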

PerfSuite's hardware performance counter tools and libraries provide easy access to both the raw measurement data and a large number of derived metrics that you can use to learn about and, hopefully, improve the performance of your application. In its most basic use, PerfSuite requires nothing more than a slight modification to the command you execute to run your program. If your executable is in the file myprog, then instead of running myprog directly, you would enter psrun myprog. If all goes well, the output of psrun is an XML document that contains the counts for a standard set of hardware events along with additional information about the CPU. You can translate this XML document into a comprehensive performance report with the command psprocess, supplying it with the name of the XML file.

PerfSuite Basics

The current release of PerfSuite includes the following four tools for accessing and working with performance data:

  • psrun: a utility for hardware performance event counting and profiling of single-threaded, POSIX threads-based and MPI applications.

  • psprocess: a utility that assists with a number of common tasks related to pre- and post-processing of performance measurements.

  • psinv: a utility that provides access to information about the characteristics of a machine (for example, processor type, cache information and available performance counters).

  • psconfig: a graphical tool for easy creation and management of PerfSuite configuration files.

This section demonstrates the two commands psrun and psprocess. Visit the PerfSuite Web site for more information about and examples of the use of psinv and psconfig.

The easiest way to learn to use the basic PerfSuite tools is to try them out on your own programs. Here is a sequence of commands you might enter to run the simple cache example discussed earlier with performance measurement enabled. The contents of the current directory are listed after each psrun run to show that XML documents are created:

1% ls

2% psrun badcache

3% ls

4% psrun goodcache

5% ls

6% psprocess psrun.22865.xml
7% psprocess psrun.22932.xml

Listings 2 and 3 show the output of the psprocess command for the unoptimized and optimized versions of the test program; these listings have been edited slightly to fit in the available space. As you can see, a substantial amount of information is gathered during the course of the measurement, and the report includes not only the raw event counts measured using PAPI but also a series of metrics derived from those counts.