

This chapter describes some of the techniques the HPF language and the PGHPF compiler use to handle data distribution among processors on a parallel system. This chapter also describes some data distribution limitations for the current version of PGHPF. The PGHPF compiler distributes data and generates necessary communications with the assistance of the PGHPF runtime library. Depending on the data type, distribution specification, and alignment specification, as well as each computation's required data accesses, data is communicated for a particular expression involving some computation. Data distribution is based on the data layout specified in the HPF program, the design of the parallel computer system and the layout and number of processors used. Data distributions for some arrays are specified by the programmer, other distributions are determined by the PGHPF compiler.
The PGHPF compiler targets an SPMD programming model. In the SPMD model, each processor executes the same program, but operates primarily on local data. This is implemented by loading the same program image into each processor. Each processor allocates and operates on its own local portion of distributed arrays, according to the distributions, array sizes and number of processors as determined at runtime.
Special attention is required to address the unique communication characteristics of many parallel systems. The PGHPF Runtime library handles HPF data distribution tasks in a generic manner so that HPF programs will work on distributed memory and shared memory systems. Some parallel systems use shared memory and others use distributed memory; there are also hybrid systems. Lower levels of the PGHPF runtime library are customized for different parallel architectures.
The PGHPF runtime library takes into account the communications to be performed and is optimized at two levels. The transport-independent level where efficient communications are generated based on the type and pattern of data access for computations, and the transport-dependent level where the runtime library's communication is performed using a communications mechanism and the system's hardware. To generate efficient code, data locality, parallelism, and communications must be managed by the compiler. This chapter describes the principles of data distribution that HPF and the PGHPF compiler use; the HPF programmer needs to be aware of some details of data distribution in order to write efficient parallel code.
The data distribution phase of the PGHPF compiler has two important tasks that map data to a parallel system's memory and enable computations on that data:
The PGHPF compiler distributes data for several classes of variables:
The compiler uses HPF directives as a guide for distributing the data that has a user-specified distribution. Data without distribution directives is by default replicated across all processors. Compiler-created temporaries are distributed corresponding to their required usage. The following sections describe these tasks in more detail.
The PGHPF compiler's default distribution is replication. All unspecified data, i.e. data without an explicit HPF distribution, is replicated across the available processors. For example, if the integer array BARRAY is used in a program and no HPF directives are supplied for distributing or aligning BARRAY, the default distribution is used and BARRAY is replicated. PROG1 and PROG2 in Example 5-1 show the default distribution. In PROG1, the compiler generates code using the default distribution because BARRAY is specified without a distribution, PROG2 shows an equivalent user specified distribution where BARRAY is also replicated.
! distribution directive is not supplied - replication
PROG1
INTEGER BARRAY(100)
! distribution directive equivalent to default
! is supplied - replication
PROG2
INTEGER BARRAY(100)
!HPF$ DISTRIBUTE BARRAY(*)
As described in Chapters 4 and 5 of the High Performance Fortran Handbook, PGHPF distributes data according to the supplied HPF directives. The ALIGN and DISTRIBUTE directives allow data to be distributed over processors in a variety of patterns. For example, the following code represents a distribution where a computation is partitioned over the available processors. With the given ALIGN directive, this computation involves no communication.
REAL X(15), Y(16)
!HPF$ DISTRIBUTE Y(BLOCK)
!HPF$ ALIGN X(I) WITH Y(I+1)
FORALL(I=1:15) X(I)=Y(I+1)
The next example is very similar, but uses a CYCLIC distribution. A cyclic distribution divides data among processors in a round-robin fashion. A block distribution divides data into evenly distributed chunks (as evenly as possible) over the available processors. A cyclic distribution divides data over processors so that each processor gets one element from each group of n elements, where n is the number of processors.
Figure 5-2 shows block and cyclic distributions for a one-dimensional array. Depending on the computation performed, different data distributions may be advantageous. For this computation a CYCLIC distribution would involve communication for each element computed.
REAL X(15), Y(16)
!HPF$ DISTRIBUTE Y(CYCLIC)
!HPF$ ALIGN X(I) WITH Y(I+1)
FORALL(I=1:15) X(I)=Y(I+1)
In the next example, a similar distribution represents a computation that would be partitioned over the available processors, (for the example we call the processors processor one and processor two). Because of the alignment specified in these ALIGN and DISTRIBUTE directives, the computation involves communication since the value for Y(9) when I is 8 needs to be communicated to assign it to X(8). X(8) is stored on processor one and Y(9) is stored on processor two.
REAL X(15), Y(16)
!HPF$ DISTRIBUTE Y(BLOCK)
!HPF$ ALIGN X(I) WITH Y(I)
FORALL(I=1:15) X(I)=Y(I+1)
The following example shows an erroneous distribution that programmers should avoid. According to the HPF specification, the value of a dummy index variable, I in this example, must be valid for all subscript values possible for the data, X in this example. When the ALIGN dummy index ranges for all possible value of I, 1 to 16 for this example, there would be an invalid access to the value Y(16+1). This results in a runtime error.
REAL X(16), Y(16)
!HPF$ DISTRIBUTE Y(BLOCK)
!HPF$ ALIGN X(I) WITH Y(I+1)
FORALL(I=1:15) X(I)=Y(I+1)
This code produces the following runtime error message:
0: set_aligment: invalid alignment
1: set_aligment: invalid alignment
For more details on different data distributions and examples showing more HPF data mapping directives, refer to Chapter 4 of The High Performance Fortran Handbook.

The "generalized" block distribution, GEN_BLOCK, allows contiguous segments of an array, possibly of unequal sizes, to be mapped onto processors. The GEN_BLOCK mapping array must have size equal to the rank of the target processor dimension, and the sum of its elements must be greater than or equal to the extent of the corresponding distributed array dimension. Moreover, each element in the GEN_BLOCK mapping array must be non-negative. For example:
integer, parameter::n=1000
integer A(n,n)
integer gb(4)
data gb /500, 208, 159, 133/
!hpf$ distribute A(gen_block(gb),*)
In the example above, the GEN_BLOCK mapping array gb of size four is used to map the elements of array A onto four processors. The distribution is determined at the point where the array A is allocated. Any modification to the GEN_BLOCK array prior to allocation of A will affect the distribution. However, any modification afterwards will not affect the distribution in the current scope unless the array is re-allocated or re-distributed.
GEN_BLOCK distributions can be used to balance the workload for algorithms that exhibit uneven computation. One simple problem is the addition of two triangular matrices in parallel (see example code below). Because a column i in the triangular matrix has i rows, a simple BLOCK distribution causes an uneven workload. For example, if we use BLOCK distribution for array A, each of the four processors will receive 250 columns. However, columns 1-250 on processor 1 will have 31,375 computable elements, columns 251-500 on processor 2 will have 93,875 computable elements, columns 501-750 will have 156,375 computable elements, and columns 751-1000 will have 218,875 computable elements.
The following example adds two triangular matrices, B and C, and stores the result in A
. . .
!hpf$ align with A :: B,C
!hpf$ independent
do i = 1,n
do j = 1,i
A(i,j) = B(i,j) + C(i,j)
enddo
enddo
The workload can be balanced by noting that the number of elements in a triangular matrix is equal to n*(n+1)/2, where n equals the number of columns in the matrix. Using this simple formula, the gb array above can be computed as follows:
integer,dimension(number_of_processors()) :: gb
integer,parameter :: n=1000
! Compute amount of work to be done
iwork = (n*(n+1)/2)
! Initialize gb array to zero
gb = 0
! Determine how much work is available for each processor
ieach = iwork / number_of_processors()
! Start filling at processor 1
igb = 1
! How much work for this processor
ithis = 0
do i = 1,n
! How much work for this column
ii = I
! Add this work to this processor
ithis = ithis + ii
gb(igb) = gb(igb) + 1
if( ithis .ge. ieach )then
! Next column goes to next processor
igb = igb + 1
ithis = 0
endif
enddo
All PGHPF intrinsics that accept as arguments arrays with the BLOCK distribution format also accept arrays with the GEN_BLOCK format. Also, the overlap shift optimization available for BLOCK-distributed arrays is also performed for GEN_BLOCK-distributed arrays. The size of the shadow region can be specified with the -Moverlap=size:n compiler switch.
NOTE 1: Programs compiled with the -Msmp switch cannot use GEN_BLOCK disributions. Because GEN_BLOCK distributions can result in asymmetrical memory allocation, they are not currently supported under SMP architectures. However, compiling an HPF program unit that uses GEN_BLOCK distributions with the -Mnogenblock switch causes PGHPF to treat any GEN_BLOCK distributions as simple BLOCK distributions. Although this allows one to build and run a GEN_BLOCK program on an SMP machine, the benefits of GEN_BLOCK load balancing will not be available.
NOTE 2: Program units with GEN_BLOCK-distributed arrays that make HPF_LOCAL or F77_LOCAL extrinsic calls may not work properly when compiled with the -Mnogenblock switch.
Allocatable arrays can be distributed in a manner similar to standard arrays (arrays without the ALLOCATABLE attribute). The directives that determine the distribution and alignment of an allocatable array are evaluated on entry to the allocatable array's scoping unit and are used throughout the scoping unit for creation of the array, although the arrays may later be realigned or redistributed.
Using allocatable arrays, it is important to keep in mind that an object that is being aligned with another object must exist. Thus, in the following example, the order of the ALLOCATE statements is correct:
REAL, ALLOCATABLE :: A(:), B(:)
!HPF$ ALIGN B(I) WITH A(I)
!HPF$ DISTRIBUTE A(BLOCK)
ALLOCATE (A(16))
ALLOCATE (B(16))
However, an incorrect ordering, when B is allocated before A, will produce a runtime alignment error:
0: TEMPLATE: invalid align-target descriptor.
The distribution of procedure arguments is described in detail in Chapter 5 of The High Performance Fortran Handbook. In an HPF program, the alignment of an actual argument when a procedure is called is maintained when the procedure returns, regardless of the distribution of the corresponding dummy argument within the procedure. Thus, the compiler may need to redistribute the variable upon entry to the procedure, and again when exiting the procedure.
The PGHPF compiler creates a distribution for compiler-created temporary variables. Compiler-created temporaries are distributed corresponding to the required usage. The compiler creates temporaries for several reasons:
Distribution of temporaries and user variables are performed identically; the use of temporaries is transparent from the HPF programmer's point of view (the temporaries are visible in the intermediate code).
The algorithm PGHPF uses to determine distribution of temporaries takes into account the statement in which the temporary is used. Temporaries are allocated before the statement in which they are used and de-allocated immediately after that statement. For example, an array assignment:
INTEGER, DIMENSION(100,100):: A,B,C,D
A = SUM(B) + MATMUL(C,D)
would generate intermediate code using a temporary array.
For this class of temporaries, distribution is based on the usage of the temporary. If a temporary is used as the argument to an intrinsic, the compiler tries to determine the distribution based on the other intrinsic arguments. Otherwise, it tries to assign a distribution based on the value assigned to the temporary. Otherwise, the temporary is replicated across all processors.
Numerous factors including array alignment, array distribution, array subsection usage and argument usage need to be taken into account in determining temporary distribution. For example, consider the following :
A(1:m:3) = SUM(B(1:n:2,:) + C(:,1:n:4), dim = 2)
The section of A is passed directly to the SUM intrinsic to receive the result. A temporary is needed to compute the argument to SUM. The distribution of that temporary has two possibly conflicting goals: minimize communication in the B+C expression, or minimize communication in the SUM computation and in the assignment to A.
Computations are partitioned when PGHPF applies the owner-computes rule. This rule causes the computation to be partitioned according to the distribution of the assigned portion of the computation and involves localization based on the left-hand-side (lhs) of an array assignment statement.
The bounds of a FORALL statement are localized according to the array elements owned by the left-hand-side. For BLOCK partitioned dimensions, the loop bounds are adjusted to index the slice of data owned by the current processor. For CYCLIC partitioning, two loops are required. The outer loop iterates over the cycles of the data, and the inner loop iterates over the data items in the cycle.
The
PGHPF compiler has an optional Inter-Procedural Analysis (IPA) phase.
Using the option
-Mipa, the compiler checks routines across
subroutine boundaries and reports a number of errors and warnings that can not
be detected otherwise. Optimizations are performed across procedure boundaries
where possible.
The format of this option is:
The IPA phase performs the following types of analysis:
* Interprocedural MOD analysis - When a formal argument is not modified, it is treated as an INTENT(IN) argument in the routine's callers.
* Interprocedural constant propagation - When a formal argument has the same constant value at all its call sites, the formal is replaced by the constant value in the routine.
* Interprocedural Propagation of alignments and distributions - When all call sites have the same distribution for a formal argument, IPA will change a !hpf$ inherit attribute to the actual distribution. Moreover, if two formal arguments with inherited distributions are aligned to each other at all call sites, the dummy arguments will be treated as aligned in the subprogram. Even if the caller and callee have the same alignment, the compiler optimizes the code.
* Common block variable analysis - The IPA phase eliminates the common block initialization code in routines if the common was initialized in the main program.
Once IPA checking is complete, the IPA lib directory will contain a number of files. It is the programmer's task to remove these files. If they are not removed they will remain in the directory and will be used in future compilations when the same lib directory is selected for the IPA phase (see the following subsections).
The compiler runs three phases to support Inter-Procedural Analysis. The command:
% pghpf -Mipa=lib source.hpf
is equivalent to running all three phases. The following subsections describe these phases.
Phase 1: Analysis
This phase analyzes each procedure and creates the following files:
To run only this phase, use the -ca IPA option, for example:
% pghpf -Mipa=lib -ca source.hpf
Phase 2: Propagation
This is the propagation phase, which analyzes the entire program.
This phase creates the following files:
To run only this phase, use the -cp IPA option, for example:
% pghpf -Mipa=lib -cp source.hpf
Phase 3: Inheriting
This phase compiles the routines in source.hpf and creates no new files. To run only this phase, include a previously generated .ipa file on the command line, for example:
% pghpf -Mipa=lib source.ipa
Data that is distributed with a !HPF$ DISTRIBUTE directive, and initialized with a DATA statement is valid with PGHPF. A new option, -Mkeepstatic keeps the intermediate file which is normally removed by the compiler. This option has been added to the definition of -Mg, so that the intermediate file is retained when flags are set for debugging.
To support certain HPF features, including static data initialization, PGHPF implements a pre-link phase. The pre-linker collects the following information about the program being linked and generates a new subroutine (pghpf$static$init) to implement them:
Necessary information about the routines in source.hpf is saved by PGHPF in a file named source.d. The pre-linker reads the appropriate .d files to generate the initialization subroutine pghpf$static$init. If -Mkeepstatic is set, this subroutine is written to the file pghpf.prelink.f and saved.
The prelink phase can be disabled by compiling with the compiler switch -Mnoprelink. Using the option -Mnoprelink, the following features will not work:
Additionally, with -Mnoprelink, distributed arrays in modules or common blocks will generate less efficient code.
A new option has been added to support variations with the pre-link phase. The option, -W9, will pass switches to the pre-link phase, but not to the regular Fortran compilation.
For block-distributed dimensions of arrays declared in global HPF procedures, shadow areas are allocated on the lower and upper ends of each processor's local block. See section 6.1.2, Overlap Shift Communications, for a more detailed discussion of how shadow areas are used by PGHPF. The shadow size is determined by the SHADOW directive if one is present for the array. Otherwise a default shadow size is computed by the compiler. The lower and upper shadow sizes may be specified (or computed) independently and need not be the same.
The HPF 2.0 SHADOW directive is fully recognized in PGHPF. A typical usage might appear like this:
!hpf$ distribute (*,block) :: a, b, c
!hpf$ shadow (0,1:2) :: a, b, c
This specifies that there should be no shadow regions in the first dimension, which is collapsed; a 1 element shadow region is allocated at the lower end of the local block in the second dimension and a 2 element shadow region is allocated at the upper end of the local block in the second dimension. If the shadow region is of uniform size at either end of the local block, say 2 on both ends, the following syntax can be used:
!hpf$ distribute (*,block) :: a, b, c
!hpf$ shadow (0,2) :: a, b, c
The default shadow size limit is set using the -Moverlap=size:<n> command-line option. If this option is not used, then the default shadow size limit is 4. This is the upper limit on the shadow size computed by the compiler in the absence of a shadow directive. Differencing codes which require a shadow region larger than 4 must either have shadow directives inserted or use the -Moverlap option in order for the PGHPF compiler to generate code using the overlap shift optimization (see Chapter 6 for more information on this optimization).
If an array is passed as an actual argument in a procedure call, then the default lower and upper shadow size for each of its block distributed dimensions is set to the default shadow size limit, regardless of any shadow directive for the dummy argument. Otherwise, the default shadow sizes are determined by analysis of the subscript expressions in array references in FORALL statements, array assignments, and DO INDEPENDENT loops. The computed default shadow size will be less than or equal to the default shadow size limit.
The shadow directive for a dummy array argument must appear in the specification part of a procedure if either the procedure is contained in a module or if it is a global HPF procedure and the -Mhpf2 command-line option is not used. Otherwise, the shadow directive for a dummy array argument must appear in an INTERFACE block visible to the caller and any shadow directive for the dummy argument appearing in the procedure body is ignored.
On entry to an HPF procedure, including an HPF_LOCAL extrinsic, a dummy array will have shadow areas no smaller than the size specified in the shadow directive, if present. If the actual argument has larger shadow areas and there is no other reason for the argument to be copied, then the larger shadow areas will be preserved. This minimizes argument copying at the procedure interface. The values in the shadow area of a dummy array are undefined on entry to and exit from a procedure.
In an HPF_LOCAL extrinsic procedure, the values returned by the lbound, ubound, shape, and size intrinsics exclude the shadow areas of a dummy argument. Thus these intrinsics do not indicate the actual allocated bounds or shape of a dummy array. It is safe to under-index and over-index a dummy array to the extent specified in the interface shadow directive, as long as the HPF_LOCAL extrinsic procedure was called from a global HPF program unit.
To explicitly make all shadow regions coherent, for example just prior to entering an HPF_LOCAL extrinsic in which the shadow regions will be manipulated directly using over- and under-indexing, you can use a simple code fragment such as the following. For example, a, b, and c in the second example above could be made coherent using the following fragment:
work = cshift (a, 2, dim=2) + cshift (a, -2, dim=2) +
& cshift (b, 2, dim=2) + cshift (b, -2, dim=2) +
& cshift (c, 2, dim=2) + cshift (c, -2, dim=2)
After the above fragment is executed, all of the shadow regions will be up-to-date with the current values of the corresponding array locations on neighboring processors.
NOTE: The above technique will only work correctly if the size of the distributed dimension(s) is an even multiple of the number of processors in the corresponding dimension(s) of the processor grid. This restriction will be lifted in a future release of PGHPF.

