qlist
You are free to download one of my Perl scripts, qlist,
but always remember:
What comes for free has no guarantee.
qlist estimates the efficiency and performance (MFLOP/s) of loops
written in C or FORTRAN and compiled with IBM's compilers
(xlf for FORTRAN, xlc for C) using the -qlist option. The estimates assume no cache misses.
qlist reads its information from the .lst file that the compiler generates with -qlist.
How to use qlist on an IBM RS/6000 with a 120 MHz clock:
1. Identify the subroutine/function you want to look at.
2. Put the source code in a separate file (not necessary, but recommended).
3. Compile the code with -qlist (e.g. xlf -c -qlist a32tx.f).
4. Run qlist with the base name of the file and the clock rate in MHz: qlist a32tx 120
Example output:
%qlist a32tx 120
Loop-summary of instructions, cycles and flops in a32tx.lst
Loop CL.32 in object a32tix
          no.   cyc.   cyc.
Instr.     of  spent   lost  flops
fma        39     13     -6     78
fnms       15      6     -1     30
fa          7      3      2      7
fs          4      1      0      4
fm          1      0      0      1
lfq         8      5      5      0
lfqu        2      1      1      0
lfd        10      2      2      0
lfdu        2      1      1      0
lfdx        7      2      2      0
lm          1      4      4      0
stfq        1      0      0      0
stfd        6      2      2      0
stfdu       1      0      0      0
cal         8      3      3      0
cax         1      0      0      0
rlinm       2      1      1      0
bl          1      0      0      0
bc          1      0      0      0
b           1      0      0      0
mtspr       2      1      1      0
mfspr       1      1      1      0
neg         1      0      0      0
------   ----  -----  -----  -----  MFLOP/s  %-eff.  Ld's  St's
Total     122     46     18    120   313.04    65.2    30     8
%
The first column lists the assembly instructions found in the loop.
The second column is how many times each instruction is executed during one iteration.
The third column is how many cycles this takes, assuming no cache misses.
The fourth column is how many cycles are lost on this operation, assuming that each cycle
should produce 4 FLOPs (2 multiply-adds or the like per cycle).
The fifth column is how many FLOPs the instruction produces during one iteration.
The last line sums these numbers and reports the MFLOP/s rate (assuming no cache misses),
what percentage this is of the theoretical peak performance of the computer
(if no MHz rate is given, 135 MHz is assumed), and how many loads and stores
are executed during one iteration.
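
The figures on the summary line can be reproduced from the totals with a few lines of arithmetic. The sketch below is written in Python and is not taken from the qlist source; the function name loop_summary and its parameters are made up for illustration. It assumes a peak of 4 FLOPs per cycle and the 135 MHz default mentioned above, and it agrees with the example output for 120 FLOPs in 46 cycles at 120 MHz.

```python
def loop_summary(flops, cycles, mhz=135.0, peak_flops_per_cycle=4):
    """Return (MFLOP/s, percent of peak) for one loop iteration.

    Hypothetical helper, not part of qlist. Assumes no cache misses
    and a peak rate of peak_flops_per_cycle FLOPs per clock cycle.
    """
    mflops = flops / cycles * mhz          # FLOPs per cycle times cycles per microsecond
    peak = peak_flops_per_cycle * mhz      # theoretical peak in MFLOP/s
    return mflops, 100.0 * mflops / peak


# Totals from the example output: 120 FLOPs in 46 cycles on a 120 MHz machine.
mflops, eff = loop_summary(flops=120, cycles=46, mhz=120)
print(f"{mflops:.2f} MFLOP/s, {eff:.1f} % of peak")  # 313.04 MFLOP/s, 65.2 % of peak
```

Note that the efficiency does not depend on the clock rate: it is simply the FLOPs per cycle divided by 4.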
qlist currently recognizes the following floating-point instructions:
fma, fnma, fms, fnms, fa, fs, fm and fd.
If your code results in other floating-point instructions, these will not be
counted in the performance figures generated by qlist.
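
As a rough illustration of what this means for the flops column, here is a Python sketch (not the qlist code itself) that credits FLOPs per recognized instruction and ignores everything else. The weights for fma, fnms, fa, fs and fm are read off the example output above; the weights for fnma, fms and fd are assumptions, as are the names FLOPS_PER_INSTR and count_flops.

```python
# FLOPs credited per instruction. Values for fma, fnms, fa, fs and fm match
# the example output; values for fnma, fms and fd are assumptions.
FLOPS_PER_INSTR = {
    "fma": 2, "fnma": 2, "fms": 2, "fnms": 2,
    "fa": 1, "fs": 1, "fm": 1, "fd": 1,
}


def count_flops(instr_counts):
    """Sum FLOPs over a {mnemonic: executions per iteration} mapping.

    Mnemonics that are not in FLOPS_PER_INSTR contribute nothing.
    """
    return sum(FLOPS_PER_INSTR.get(name, 0) * n
               for name, n in instr_counts.items())


# A subset of the example loop: the loads and stores add no FLOPs.
example = {"fma": 39, "fnms": 15, "fa": 7, "fs": 4, "fm": 1, "lfd": 10, "stfd": 6}
print(count_flops(example))  # 120
```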