Chapter 1. Introduction

This manual describes the SGI Scientific Computing Software Library (SCSL), which runs on SGI IRIX and Linux systems. It supplements the man pages provided with SCSL with details about the implementation and usage of the library routines.

SCSL contains the following groups of routines:

The SCSL routines are linked into a program by specifying the -lscs option or the -lscs_mp option on the compiler command line. The -lscs_mp option directs the linker to use the multiprocessor version of the library.

When linking with SCSL, the default integer size is 4 bytes (32 bits). A second version of SCSL is available in which integers are 8 bytes (64 bits); this version gives users access to larger memory sizes. It is linked by specifying the -lscs_i8 option or the -lscs_i8_mp option. A program can use only one of the two versions; 4-byte integer and 8-byte integer library calls cannot be mixed.
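To make the size limit concrete, the following C sketch checks whether the element count of a rows-by-cols array can be expressed as a 4-byte signed integer. The function name fits_in_4_byte_int is hypothetical and is not part of SCSL. The sketch also assumes that the element count is the limiting quantity, which may not match every routine's internal indexing.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper, not an SCSL routine: returns 1 if the element
 * count of a rows x cols array fits in a 4-byte signed integer, and 0
 * if it would need an 8-byte integer. */
static int fits_in_4_byte_int(int64_t rows, int64_t cols)
{
    assert(rows >= 0 && cols >= 0);
    if (rows == 0 || cols == 0)
        return 1;
    /* Divide instead of multiplying so the check itself cannot overflow. */
    return rows <= INT32_MAX / cols;
}
```

Under this assumption, a 46340 x 46340 array still fits in a 4-byte integer, while a 50000 x 50000 array does not and would call for the 8-byte integer version of the library.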

Many SCSL routines are multitasked or multithreaded. A program that calls a multitasked routine runs in parallel mode and takes advantage of multiple processors whenever possible, even if the program has not specifically requested multitasking. If the program spends a significant percentage of its run time in such routines, this feature can significantly reduce wall-clock time.

Note that most LAPACK routines do not perform multiprocessing themselves, but almost all of them call Level 2 and Level 3 BLAS routines, which do.
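As a rough illustration of how the share of run time spent in multitasked routines affects wall-clock time, the following C sketch uses an idealized model often referred to as Amdahl's law. The function name estimated_wall_clock and its parameters are hypothetical. The model assumes that the library portion scales perfectly across CPUs and it ignores all parallel overhead, so it gives an optimistic bound rather than a prediction for SCSL.

```c
#include <assert.h>

/* Idealized estimate (an assumption of this sketch, not a measured SCSL
 * result): the part of the run spent in multitasked routines is divided
 * evenly among ncpus, and the rest of the program stays serial. */
static double estimated_wall_clock(double serial_seconds,
                                   double fraction_in_library,
                                   int ncpus)
{
    assert(serial_seconds >= 0.0);
    assert(fraction_in_library >= 0.0 && fraction_in_library <= 1.0);
    assert(ncpus >= 1);

    double parallel_part = serial_seconds * fraction_in_library;
    double serial_part = serial_seconds - parallel_part;
    return serial_part + parallel_part / ncpus;
}
```

For a 100-second run on 4 CPUs, the model gives about 32.5 seconds when 90% of the time is spent in multitasked routines, but about 92.5 seconds when only 10% is.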

This manual includes the following sections:

Parallel Processing Issues

Parallel processing is a method of splitting a computational task into subtasks, and then simultaneously performing the subtasks. In many cases, the use of specialized libraries, such as SCSL, is a key component of parallel processing.

Parallel processing can eliminate idle CPU time because the workload is divided among all CPUs; therefore, the amount of work performed per unit time (the throughput) increases. However, parallel processing also introduces some overhead into program execution. In some cases, you can reduce wall-clock time only at the cost of extra CPU time, because the job uses more machine resources in total.
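The trade-off between wall-clock time and CPU time can be sketched with a simplified cost model in C. The names run_cost and parallel_cost are hypothetical. The model assumes that overhead is a fixed fraction of the unitasked CPU time and that the work is spread evenly over the CPUs, which real jobs only approximate.

```c
#include <assert.h>

struct run_cost {
    double wall_seconds; /* elapsed time seen by the user */
    double cpu_seconds;  /* CPU time summed over all processors */
};

/* Simplified model (assumption): parallel overhead is a fixed fraction
 * of the unitasked CPU time, and the total work, including overhead,
 * is spread evenly over ncpus. */
static struct run_cost parallel_cost(double unitasked_cpu_seconds,
                                     double overhead_fraction,
                                     int ncpus)
{
    assert(unitasked_cpu_seconds >= 0.0);
    assert(overhead_fraction >= 0.0);
    assert(ncpus >= 1);

    struct run_cost cost;
    cost.cpu_seconds = unitasked_cpu_seconds * (1.0 + overhead_fraction);
    cost.wall_seconds = cost.cpu_seconds / ncpus;
    return cost;
}
```

With an assumed overhead of 5%, a job that needs 100 CPU seconds when unitasked would consume about 105 CPU seconds on 4 CPUs but finish in about 26.25 seconds of wall-clock time.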

By using parallel processing, you can alleviate some of the following common problems:

  • Maximum-memory jobs: if the memory is occupied by a few large-memory jobs, one or more of the CPUs might be idle even though there are other jobs to run.

  • Dedicated machine: if the computer is running a single unitasked job, all CPUs but one are idle.

  • Light workload: if the number of jobs waiting for a CPU is smaller than the total number of CPUs, one or more of the CPUs become idle.

With parallel processing, the additional CPUs reduce the wall-clock time instead of sitting idle. Even when very little idle time exists, using additional CPUs can still lead to benefits.

Parallel processing also has costs. The common types of overhead that it introduces into program execution include the following:

  • Multitasked programs require more memory than unitasked programs: they can contain more code and more temporary variables, and they can require additional stack space.

  • Multitasked jobs can be swapped more often, and remain swapped longer, on a heavily loaded production system.

  • Processors are forced to wait on semaphores during synchronization.

  • Overhead is incurred when slave processors are acquired (on entry to a parallel region) and at synchronization points within parallel regions. Tests show that the overhead of executing extra autotasking code adds a nominal 0% to 5% to the overall execution time.

  • If inner-loop autotasking is used, vector performance can decrease because of shorter vector lengths and more vector loop startups.

  • Processors are sometimes held for the next parallel region to improve efficiency. While holding processors can save time, acquiring and holding them also costs time.

Because overhead is associated with work distribution, jobs with large granularity incur less partitioning overhead than jobs with small granularity. Large-granularity jobs, however, may have problems with load balancing.
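The tension between partitioning overhead and load balancing can be illustrated with a small C sketch. The names partition_result, partition_work, and MAX_CPUS are hypothetical, and the round-robin assignment of chunks to CPUs is an assumption chosen for simplicity; it is not a description of how SCSL or the compiler distributes work.

```c
#include <assert.h>

#define MAX_CPUS 64 /* arbitrary limit for this sketch */

struct partition_result {
    long chunks;       /* number of pieces the work was split into */
    long max_cpu_load; /* iterations assigned to the busiest CPU */
};

/* Splits `iterations` into chunks of at most `chunk_size` and hands them
 * out round-robin (chunk k goes to CPU k % ncpus). More chunks mean more
 * partitioning overhead; a larger max_cpu_load means worse balance. */
static struct partition_result partition_work(long iterations,
                                              long chunk_size,
                                              int ncpus)
{
    assert(iterations >= 0 && chunk_size >= 1);
    assert(ncpus >= 1 && ncpus <= MAX_CPUS);

    long load[MAX_CPUS] = {0};
    struct partition_result result = {0, 0};

    for (long start = 0; start < iterations; start += chunk_size) {
        long left = iterations - start;
        long size = left < chunk_size ? left : chunk_size;
        int cpu = (int)(result.chunks % ncpus);

        load[cpu] += size;
        if (load[cpu] > result.max_cpu_load)
            result.max_cpu_load = load[cpu];
        result.chunks++;
    }
    return result;
}
```

For 1000 iterations on 4 CPUs, a chunk size of 400 produces only 3 chunks, but the busiest CPU receives 400 iterations and one CPU receives none. A chunk size of 10 produces 100 chunks, and every CPU receives the ideal 250 iterations.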

Parallel processing implementation strategies are discussed in detail in the following books:

  • Linux Application Tuning Guide

  • Origin 2000 and Onyx2 Performance Tuning and Optimization Guide

In addition to these books, other documents in the MIPSpro compiler documentation set discuss parallel processing issues that are specific to compiler use. See the Guide to SGI Compilers and Compiling Tools for information about those books.