Cornell Theory Center
Tools for HPF Programmers
Donna Bergmark
SC97 Tutorial on HPF
Table of Contents
The Challenge
- HPF is a Relatively New Language
- Vendor Tool Groups are Remote from Compiler Groups
- Little Experience in the Field
- Different Kinds of Tools are Needed than for Message Passing
However, at Cornell we have put together a selection of tools that helps
HPF programming to some degree.
These are described next, but first
let us consider the HPF programming life cycle.
The Programming Cycle
This is the programming life cycle for writing programs in languages
ranging from C to Java to HPF. The HPF programmer can use tools
at just about any possible stage shown here.
Kinds of Tools Used
Creating HPF Source
- Generally you start with existing source
- Generally you start with the serial code
- Sometimes it is really old and moldy F77
- You rarely convert MPI to HPF
Cleaning up the Source
- Generally you know where your major loop is
- If it runs over your data, that is the loop you want to parallelize
- The key is to get rid of parallelization inhibitors
- Replace IF with WHERE (see the sketch after this list)
- Use F90 intrinsic functions
- Get Rid of Data Dependences
- What Large Array is Stored into by the Loop?
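For instance, here is a minimal sketch of the kind of cleanup meant above. The arrays a and b and the threshold are hypothetical; the point is that a masked IF loop becomes a WHERE construct and a hand-coded reduction becomes the SUM intrinsic.

   program cleanup_sketch
     real :: a(1000), b(1000), total
     integer :: i
     call random_number(a)

     ! Before: element-by-element loop with an IF inside it
     total = 0.0
     do i = 1, 1000
        if (a(i) > 0.5) then
           b(i) = 2.0 * a(i)
        else
           b(i) = 0.0
        end if
        total = total + b(i)
     end do

     ! After: the same work as a masked array assignment plus an intrinsic
     where (a > 0.5)
        b = 2.0 * a
     elsewhere
        b = 0.0
     end where
     total = sum(b)

     print *, total
   end program cleanup_sketch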
Tools for Fixing up the Source
- Find the Loop[s] (Serial Profiling, e.g. Forge)
- Analyze Data Dependences (FORGE, Pghpf)
- Random Number Generation (PRNG)
- Choose a Data Distribution
- Seed the Program (XHPF)
Often you know what loop you want to run in parallel (and you'd better
have one!), but if not, use any available profiling tool to pick out
the "hot spots". Some tools for choosing a data distribution exist in Europe,
but they are mostly research tools related to Vienna Fortran. The other
three categories have tools that can help.
Analyzing Data Dependencies
Fortunately there are a number of helpful tools for picking out
data dependencies between iterations of a loop you might want to
parallelize. A few of them are:
- kapf (Kuck & Associates)
- Forge Explorer (Applied Parallel Research)
- PAT (Obsolete Tool from Georgia Tech)
All of these analyze loops and classify variables.
Seeding the Program
- Often easy to pick out one key distribution
- Tedious to hunt them all down
- xhpf (from Applied Parallel Research) is handy for this:
- First, insert one or more DISTRIBUTE and ALIGN directives by hand (a sketch follows below)
- Load up all of your source into Forge Explorer as a package
- Run your code through xhpf as follows:
xhpf -p pkg -Auto=1 -BenignUnknown -ohpf=pkg_hpf.f
- The output has many more ALIGN and DISTRIBUTE directives added!
Plus, if you use -plist to save the parallelization report, you'll get
more information about parallel loop inhibitors.
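As a rough illustration of the seeding step above, the one or two directives you insert by hand might look like the following minimal sketch (the arrays pressure and temp are hypothetical); xhpf then propagates matching ALIGN and DISTRIBUTE directives through the rest of the package.

   program seed_sketch
     real :: pressure(1024), temp(1024)
!HPF$ DISTRIBUTE pressure(BLOCK)
!HPF$ ALIGN temp(i) WITH pressure(i)
     pressure = 1.0
     temp = 300.0
     print *, sum(pressure), sum(temp)
   end program seed_sketch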
Random Numbers
Many high performance codes deal with simulation of physical phenomena.
If these codes are parallelized without regard for maintaining independent
streams of random numbers, you could wind up replacing one lengthy stream
with many smaller, identical ones.
A handy way to get independent streams is to associate a different stream
with each index of the loop you wish to parallelize. We have a locally
developed tool at Cornell to do just that:
real*8 function prng_next (i)
integer i
Suppose that "i" is your do loop index. Then
x = prng_next(i)
would have a per-iteration sequence
of numbers. This makes your output repeatable whether run 1-way or 7-way.
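A minimal sketch of that usage (the array x and loop bounds are hypothetical; prng_next itself comes from the package noted below):

   real*8 x(1000), prng_next
   external prng_next
   integer i
   ! The loop you intend to parallelize: keying the stream on the
   ! loop index i keeps the numbers independent of how the
   ! iterations are eventually split across processors.
   do i = 1, 1000
      x(i) = prng_next(i)
   end do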
Available from ftp.tc.cornell.edu as pub/utilities/prng.tar.Z.
Compiling
- Right now, the more HPF compilers you can run, the better
- Different reports, different error messages
- Different features
- Different possibilities for instrumentation
- All ultimately produce an MPI program
Example: Same error, different messages
Example: Different report formats
Debugging
Then you run the program.
- 1-way: test accuracy of program cleanup
- answers should match the serial run
- execution time should be within 5% of the serial run
- 2-way: test parallelism
- answers should still match
- execution time should probably be horrible
- ... and get worse when you go 3-way. (Don't worry).
- You should hope the answers are right, because most parallel debuggers are clueless about HPF
If the Answers Aren't Right ...
- Analyze variables in INDEPENDENT loops again
- Use print statements (see the sketch below)
- Use a parallel debugger
- pdbx, a generic linemode parallel debugger
- TotalView with PGHPF (wave of the future)
- Use markers containing information
Example: using print statements
Example: an HPF debugger
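As a rough sketch of the print-statement approach (the array a and the phase label are hypothetical), print a single reduced value after each program phase rather than whole distributed arrays; it should agree between the 1-way and n-way runs:

   ! One line of output per phase, easy to compare across runs
   print *, 'after phase 1: sum(a)    = ', sum(a)
   print *, 'after phase 1: maxval(a) = ', maxval(a)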
Performance Tools
Usually the answers will be right until you begin adding INDEPENDENT
directives in front of loops that the compiler has declined to parallelize.
So what do you do if you have speed-down?
- profile (statement, loop, routine level) - any useful concurrency?
- compiler report - did the loops really parallelize?
- trace - serial bottlenecks?
- monitor - I/O a problem?
Also, program markers are used to delimit program phases in
the trace.
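For reference, the INDEPENDENT directive mentioned above is an assertion you add by hand in front of a loop the compiler declined to parallelize; a minimal sketch with hypothetical arrays:

   program independent_sketch
     real :: a(1000), b(1000)
     integer :: i
!HPF$ DISTRIBUTE a(BLOCK)
!HPF$ ALIGN b(:) WITH a(:)
     call random_number(b)
     a = 0.0
     ! The assertion is yours: no iteration reads what another one writes.
!HPF$ INDEPENDENT
     do i = 2, 999
        a(i) = 0.5 * (b(i-1) + b(i+1))
     end do
     print *, sum(a)
   end program independent_sketch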
Examples
Note: parsing tools like AIMS have not yet learned to deal with HPF.
How to Use These Tools
- Excessive speed-down: usually due to excessive communication
- Check profile or trace
- Reorganize or replicate your data to reduce communication (see the sketch after this list)
- If it is I/O, parallelize it (i.e. stripe it)
- Check if loops are replicated but working with distributed data
- Flat performance (no speedup)
- Check to see if each machine is executing the whole program
- Trace can show one processor executing on its data while
other processors wait
- Check compiler generated code to see if loop is serialized
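As one small illustration of the "reorganize your data" item above (the array grid and the stencil are hypothetical): if the trace shows communication coming from neighbor references along the first dimension, collapsing that dimension onto each processor and distributing only the second one removes that traffic.

   real :: grid(512, 512)
!HPF$ DISTRIBUTE grid(*, BLOCK)
   ! With (*, BLOCK) each processor holds whole columns, so a stencil
   ! reference such as grid(i-1, j) stays on-processor; only references
   ! that cross a j boundary, such as grid(i, j+1), still communicate.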
Summary
In summary, you will probably be using trace performance tools more than
debugging tools to get your HPF codes to work well. Why this is the case
is illustrated by a diagram I like to show, which compares the MPI programming
experience with the HPF one:
As with all parallel codes, your first attempt is almost certain
to get slowdown. And there comes a point beyond which it doesn't
pay to optimize more. That is shown in the two curves here.
Unlike MPI, where your effort goes into getting the right answer
(or even getting it to run), with HPF your effort goes into getting
acceptable performance.
Ultimate HPF performance is a little lower than MPI's, and the speedup knee
comes a little sooner, but you get there with the same effort (or less)
as with MPI.