Cornell Theory Center
Tools for HPF Programmers
Donna Bergmark
SC97 Tutorial on HPF
Table of Contents
The Challenge
- HPF is a Relatively New Language
- Vendor Tool Groups are Remote from Compiler Groups
- Little Experience in the Field
- Different Kinds of Tools are Needed than for Message Passing
However, at Cornell we have put together a selection of tools that helps
HPF programming to some degree.
These are described next, but first
let us consider the HPF programming life cycle.
The Programming Cycle
This is the programming life cycle for writing programs in languages
ranging from C to Java to HPF. The HPF programmer can use tools
at just about any possible stage shown here.
Kinds of Tools Used
Creating HPF Source
- Generally you start with existing source
- Generally you start with the serial code
- Sometimes it is really old and moldy F77
- You rarely convert MPI to HPF
Cleaning up the Source
- Generally you know where your major loop is
- If it runs over your data, that is the loop you want to parallelize
- The key is to get rid of parallelization inhibitors
- Replace IF with WHERE (see the sketch after this list)
- Use F90 intrinsic functions
- Get Rid of Data Dependences
- What Large Array is Stored into by the Loop?
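For instance, here is a minimal sketch of the kind of cleanup meant above. The arrays a and b and the threshold are hypothetical; the point is that a masked IF loop becomes a WHERE construct and a hand-coded reduction becomes the SUM intrinsic.

   program cleanup_sketch
     real :: a(1000), b(1000), total
     integer :: i
     call random_number(a)

     ! Before: element-by-element loop with an IF inside it
     total = 0.0
     do i = 1, 1000
        if (a(i) > 0.5) then
           b(i) = 2.0 * a(i)
        else
           b(i) = 0.0
        end if
        total = total + b(i)
     end do

     ! After: the same work as a masked array assignment plus an intrinsic
     where (a > 0.5)
        b = 2.0 * a
     elsewhere
        b = 0.0
     end where
     total = sum(b)

     print *, total
   end program cleanup_sketch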
Tools for Fixing up the Source
- Find the Loop[s] (Serial Profiling, e.g. Forge)
- Analyze Data Dependences (FORGE, Pghpf)
- Random Number Generation (PRNG)
- Choose a Data Distribution
- Seed the Program (XHPF)
Often you know what loop you want to run in parallel (and you'd better
have one!), but if not, use any available profiling tool to pick out
the "hot spots". Some tools for choosing a data distribution exist in Europe,
but they are mostly research tools related to Vienna Fortran. The other
three categories have tools that can help.
Analyzing Data Dependencies
Fortunately there are a number of helpful tools for picking out
data dependencies between iterations of a loop you might want to
parallelize. A few of them are:
- kapf (Kuck & Associates)
- Forge Explorer (Applied Parallel Research)
- PAT (Obsolete Tool from Georgia Tech)
All of these analyze loops and classify variables.
Seeding the Program
- Often easy to pick out one key distribution
- Tedious to hunt them all down
- xhpf (from Applied Parallel Research) is handy for this:
- First, insert one or more DISTRIBUTE and ALIGN directives by hand (a sketch follows below)
- Load up all of your source into Forge Explorer as a package
- Run your code through xhpf as follows:
xhpf -p pkg -Auto=1 -BenignUnknown -ohpf=pkg_hpf.f
- The output has many more ALIGN and DISTRIBUTE directives added!
Plus, if you use -plist to save the parallelization report, you'll get
more information about parallel loop inhibitors.
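As a rough illustration of the seeding step above, the one or two directives you insert by hand might look like the following minimal sketch (the arrays pressure and temp are hypothetical); xhpf then propagates matching ALIGN and DISTRIBUTE directives through the rest of the package.

   program seed_sketch
     real :: pressure(1024), temp(1024)
!HPF$ DISTRIBUTE pressure(BLOCK)
!HPF$ ALIGN temp(i) WITH pressure(i)
     pressure = 1.0
     temp = 300.0
     print *, sum(pressure), sum(temp)
   end program seed_sketch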
Random Numbers
Many high performance codes deal with simulation of physical phenomena.
If these codes are parallelized without regard for maintaining independent
streams of random numbers, you could wind up replacing one lengthy stream
with many smaller, identical ones.
A handy way to get independent streams is to associate a different stream
with each index of the loop you wish to parallelize. We have a locally
developed tool at Cornell to do just that:
real*8 function prng_next (i)
integer i
Suppose that "i" is your do loop index. Then
x = prng_next(i)
would have a per-iteration sequence
of numbers. This makes your output repeatable whether run 1-way or 7-way.
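A minimal sketch of that usage (the array x and loop bounds are hypothetical; prng_next itself comes from the package noted below):

   real*8 x(1000), prng_next
   external prng_next
   integer i
   ! The loop you intend to parallelize: keying the stream on the
   ! loop index i keeps the numbers independent of how the
   ! iterations are eventually split across processors.
   do i = 1, 1000
      x(i) = prng_next(i)
   end do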
Available from ftp.tc.cornell.edu as pub/utilities/prng.tar.Z.
Compiling
- Right now, the more HPF compilers you can run, the better
- Different reports, different error messages
- Different features
- Different possibilities for instrumentation
- All ultimately produce an MPI program
Example: Same error, different messages
Example: Different report formats
Debugging
Then you run the program.
- 1-way: test accuracy of program cleanup
- answers should match the serial run
- execution time should be within 5% of the serial run
- 2-way: test parallelism
- answers should still match
- execution time should probably be horrible
- ... and get worse when you go 3-way. (Don't worry).
- You should hope the answers are right, because most parallel debuggers are clueless about HPF
If the Answers Aren't Right ...
- Analyze variables in INDEPENDENT loops again
- Use print statements (see the sketch below)
- Use a parallel debugger
- pdbx, a generic linemode parallel debugger
- TotalView with PGHPF (wave of the future)
- Use markers containing information
Example: using print statements
Example: an HPF debugger
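As a rough sketch of the print-statement approach (the array a and the phase label are hypothetical), print a single reduced value after each program phase rather than whole distributed arrays; it should agree between the 1-way and n-way runs:

   ! One line of output per phase, easy to compare across runs
   print *, 'after phase 1: sum(a)    = ', sum(a)
   print *, 'after phase 1: maxval(a) = ', maxval(a)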
Performance Tools
Usually the answers will be right until you begin adding INDEPENDENT
directives in front of loops that the compiler has declined to parallelize.
So what do you do if you have speed-down?
- profile (statement, loop, routine level) - any useful concurrency?
- compiler report - did the loops really parallelize?
- trace - serial bottlenecks?
- monitor - I/O a problem?
Also, program markers are used to delimit program phases in
the trace.
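For reference, the INDEPENDENT directive mentioned above is an assertion you add by hand in front of a loop the compiler declined to parallelize; a minimal sketch with hypothetical arrays:

   program independent_sketch
     real :: a(1000), b(1000)
     integer :: i
!HPF$ DISTRIBUTE a(BLOCK)
!HPF$ ALIGN b(:) WITH a(:)
     call random_number(b)
     a = 0.0
     ! The assertion is yours: no iteration reads what another one writes.
!HPF$ INDEPENDENT
     do i = 2, 999
        a(i) = 0.5 * (b(i-1) + b(i+1))
     end do
     print *, sum(a)
   end program independent_sketch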
Examples
Note: parsing tools like AIMS have not yet learned to deal with HPF.
How to Use These Tools
- Excessive speed-down: usually due to excessive communication
- Check profile or trace
- Reorganize or replicate your data to reduce communication (see the sketch after this list)
- If it is I/O, parallelize it (i.e. stripe it)
- Check if loops are replicated but working with distributed data
- Flat performance (no speedup)
- Check to see if each machine is executing the whole program
- Trace can show one processor executing on its data while
other processors wait
- Check compiler generated code to see if loop is serialized
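As one small illustration of the "reorganize your data" item above (the array grid and the stencil are hypothetical): if the trace shows communication coming from neighbor references along the first dimension, collapsing that dimension onto each processor and distributing only the second one removes that traffic.

   real :: grid(512, 512)
!HPF$ DISTRIBUTE grid(*, BLOCK)
   ! With (*, BLOCK) each processor holds whole columns, so a stencil
   ! reference such as grid(i-1, j) stays on-processor; only references
   ! that cross a j boundary, such as grid(i, j+1), still communicate.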
Summary
In summary, you will probably be using trace performance tools more than
debugging tools to get your HPF codes to work well. Why this is the case
is illustrated by a diagram I like to show, which compares the MPI programming
experience with the HPF one:
As with all parallel codes, your first attempt is almost certain
to get slowdown. And there comes a point beyond which it doesn't
pay to optimize more. That is shown in the two curves here.
Unlike MPI, where your effort goes into getting the right answer
(or even getting it to run), with HPF your effort goes into getting
acceptable performance.
Ultimate HPF performance is a little lower than MPI's, and the speedup knee
comes a little sooner, but you get there with the same effort (or less)
as with MPI.