Syllabus for CS 471/Math 471: High Performance Scientific Computing

Brian T. Smith, CS Dept./HPCERC/ARC
Andrew C. Pineda, HPCERC/ARC
Richard C. Allen, Sandia National Laboratories
Fall, 1997

Lectures:
  Section I:  M, W 4:00-4:50 PM, TAPY Rm. 217
  Section II: T, Th 4:30-5:20 PM, ME Rm. 220

Laboratory:
  Section I:  M, W 5:20-6:50 PM, ESCP Rm. 109
  Section II: M, W 5:20-6:50 PM, ESCP Rm. 110

Office Hours:
  M, T, W, Th 2:00-3:30 PM (Brian Smith)
  ??? (Andy Pineda)

Prerequisites

Programming knowledge and experience (Fortran or C), knowledge of calculus, introductory knowledge of ODEs, linear algebra, and numerical methods, and knowledge of and experience with UNIX.

Note: most of this material is reviewed in the course. Although expertise and current knowledge in all of these areas is neither required nor assumed, the student is responsible for reviewing or gaining enough familiarity with them to cope with the material. I will assist in any way I can those who are unfamiliar with a topic or who have not used it for some time. Lack of current knowledge in these areas will make it difficult to keep up with the lectures and laboratories for this course.

Course Material

Text Book: An Introduction to High Performance Scientific Computing, Fosdick et al., MIT Press.
Laboratory Exercises: ~acpineda/CS471/... on both the CIRT and AHPCC machines.
Auxiliary Material: CSEP Web Text Book, http://csep1.phy.ornl.gov/csep.html
Getting Started: ~acpineda/CS471/GettingStarted.txt

Offices

Prof. Brian T. Smith
2nd floor, Galles Building Showroom
1601 Central Ave. NE
Albuquerque, NM 87131
Phone: 505-277-8249 or 8337
Email: bsmith@arc.unm.edu

Dr. Richard C. Allen
Org. 9205
SNL
Albuquerque, NM 87185-1110
Phone: 505-845-7825
Email: rcallen@cs.sandia.gov

Dr. Andrew C. Pineda
1st Floor, Galles Building Showroom
1601 Central Ave. NE
Albuquerque, NM 87131
Phone: 505-277-8249
Email: acpineda@arc.unm.edu

Support

CS 471/Math 471 is being offered this fall semester both as a required course in UNM's Scientific and Engineering Computational Program for graduate students and as part of a new non-degree graduate program in Computational Science. We thank and acknowledge the support of SNL in helping to establish this non-degree graduate program. To better accommodate the backgrounds of the students, we are offering two sections this semester: Section I, for the post-degree students, treats more advanced application topics and involves a major project; Section II, for the regular UNM students, provides one or two more introductory applications and involves more directed laboratory material. The material presented in the lectures, except for the applications, will be the same.

Lecturers

Lectures will be given by several UNM faculty and staff, and by SNL staff. Your primary instructors are Prof. Brian T. Smith (CS/UNM), Dr. Andrew C. Pineda, and Dr. Richard Allen (SNL).

Acknowledgments

Much of this course is based on material developed under NSF Educational Infrastructure grant CDA-9017953 at the University of Colorado's Computer Science Department by investigators Prof. Lloyd Fosdick, Prof. Elizabeth Jessup, Ms. Carolyn Schauble, and Prof. Gitta Domik. Although it has been modified to suit the needs of the various programs at UNM, its unique methodology of using Computer Science to inform students of issues in scientific computing is due to the ideas and teaching expertise of Lloyd Fosdick and others at the University of Colorado: many thanks for a very interesting and effective teaching methodology for this cross-disciplinary subject.

Introduction

The principal objective of this course is to introduce you to the use of high performance computing systems in science and engineering.
High performance computing systems are systems that deliver computational results at a rate of over a billion floating point operations per second, sustained over a few minutes to a few hours or longer; it is expected that, within the next few years, applications that sustain over a trillion floating point operations per second for a few minutes will become commonplace. High performance scientific computing uses such machines to solve science and engineering problems, and to model and simulate complex physical systems to gain insight and understanding into how they behave. This course is designed to expose students to many of the tools and techniques used to solve advanced and complex scientific problems. The course surveys basic tools such as programming languages, problem solving environments, and visualization environments, and basic computing methodologies from modeling and simulation techniques to visualization techniques. The course emphasizes hands-on experience through the use of computational laboratories that require experiments with computational models and the reporting of these experiments through logbooks to be maintained throughout the semester.

What We Will Study

The fastest computers today operate at peak speeds of around several hundred gigaflops (several hundred billion floating point operations per second). Recently, Intel and Sandia announced that they were able to sustain a computation at the rate of around one teraflop (10^12 floating point operations per second) on the ASCI Red machine! You can get some feeling for this speed when you realize that a machine operating at 100 gigaflops can do about 100 arithmetic operations in the time it takes light to travel one foot. Today's high speeds have been achieved by improved chip technology and by new computing architectures that support parallel computing.
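The light-travel comparison above is easy to check with a few lines of arithmetic. The sketch below (in Python, purely for illustration; the course's labs use Fortran) takes only the speed of light and the 100-gigaflop rate as inputs:

```python
# Check: at 100 gigaflops, roughly how many operations fit in the
# time light takes to travel one foot?
C = 2.998e8     # speed of light, m/s
FOOT = 0.3048   # one foot, in meters
RATE = 100e9    # 100 gigaflops = 1e11 operations per second

t_one_foot = FOOT / C        # about one nanosecond
ops = RATE * t_one_foot      # operations completed in that time
print(f"light travels one foot in {t_one_foot * 1e9:.2f} ns")
print(f"operations in that time: {ops:.0f}")
```

The result is on the order of 100 operations, confirming the figure quoted above.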
These new architectures are a challenge to the algorithm designer and the programmer because of their complexity: failure to take into account their special characteristics, such as individual central processing unit (CPU) performance and the interconnection network, can result in significant loss of performance. Therefore, we will study algorithms and software designed to achieve good performance on these systems. This will be done in the context of solving problems that are typical of those solved in the "real world". Problems of practical interest arise in such diverse areas of science as medicine, celestial mechanics, the study of fluid flows, the design of buildings and airplanes, and quantum mechanics. The problems to be solved are computationally intensive and tend to require large amounts of data. We will see that data access patterns on modern CPUs and on parallel machines are critically important to achieving high performance.

A good understanding of algorithms for high performance computing requires a knowledge of numerical analysis. Without this knowledge you cannot understand why one algorithm is more accurate than another, or why one will produce a result of given accuracy in less time. Furthermore, and of great practical importance, you cannot be confident of the results of a computation without an understanding of the numerical methods used to produce them and of how numerical rounding errors propagate (imagine how rounding errors can accumulate when you are performing 10^12 computations per second and your program runs for several days for each result). Thus, we will study aspects of numerical analysis, with particular attention to linear algebra and the solution of differential equations. Another important issue in high performance computing is how to present the results of a computation.
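Before turning to the presentation of results, the parenthetical point above about accumulating rounding errors can be made concrete with a small experiment (sketched in Python rather than Fortran; the compensated loop is the standard Kahan summation algorithm, not anything specific to the course materials):

```python
# Accumulate 0.1 one million times. 0.1 has no exact binary
# representation, so a naive running sum drifts; Kahan (compensated)
# summation carries the lost low-order bits in a correction term.
N = 1_000_000

naive = 0.0
for _ in range(N):
    naive += 0.1

kahan, c = 0.0, 0.0
for _ in range(N):
    y = 0.1 - c            # subtract the error carried from the last step
    t = kahan + y
    c = (t - kahan) - y    # low-order bits lost in the addition
    kahan = t

print(f"naive sum: {naive!r}")   # visibly off from 100000.0
print(f"kahan sum: {kahan!r}")   # much closer
```

Even at only 10^6 additions, the naive sum is measurably wrong; at 10^12 operations per second over days, such drift is why the numerical analysis covered in this course matters.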
In most cases this must be done in graphic form, not only because it is the most natural way to do it and leads to insights about the physical system, but also because the sheer quantity of numbers produced (as many as 10^12 per second) makes it impractical, if not impossible, to print them all. We will study how to present the results of computation in graphic form, not only as simple x-y plots but as realistic pictures of the physical system. This subject is often called "scientific visualization".

Parallel computing is becoming the only way to reach the high performance needed to solve the challenging modeling and simulation problems of our times. Of the many programming paradigms and tools that support parallel program development, we study just two: the message passing paradigm, as illustrated by the Message Passing Interface (MPI), and data parallelism, as illustrated by High Performance Fortran. Each of these parallel programming tools also illustrates a major difference in approach: the hands-on, detailed control of MPI versus the "compiler-knows-best" strategy of High Performance Fortran. We will study the pros and cons of each approach.

Finally, we will study computer performance itself: how to measure performance, how to characterize it in terms of performance parameters, and how to use tools to enhance performance.

The Approach of This Course

The approach taken in this course is to use laboratory exercises and case studies to learn the important aspects of high performance scientific computing, and in particular the use of this computing tool to model and simulate the most significant scientific and engineering problems you will face in the future. The basic material is presented in lectures, followed by time during the lab periods to perform the exercises and experiments described in a series of laboratory manuals. At least one person will be present at all labs to assist you with the assigned exercises.
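The message passing style described above can be previewed, well before the MPI labs, with a small analogy in Python: two workers that share no data and cooperate only through explicit sends and receives. This is only a sketch of the paradigm; queue.Queue stands in for the interconnect, and nothing here is MPI's actual API:

```python
import threading
import queue

# Two "ranks" that share no variables; each owns half the data and
# they exchange partial results by explicit send/receive, in the same
# spirit as an MPI program's MPI_Send/MPI_Recv.
to_rank0 = queue.Queue()
to_rank1 = queue.Queue()
result = {}

def rank0(data):
    partial = sum(data)      # local work on rank 0's half
    to_rank1.put(partial)    # "send" to rank 1
    other = to_rank0.get()   # "receive" rank 1's partial sum
    result[0] = partial + other

def rank1(data):
    partial = sum(data)
    to_rank0.put(partial)
    other = to_rank1.get()
    result[1] = partial + other

data = list(range(100))      # global problem: sum 0..99
t0 = threading.Thread(target=rank0, args=(data[:50],))
t1 = threading.Thread(target=rank1, args=(data[50:],))
t0.start(); t1.start(); t0.join(); t1.join()
print(result)                # both ranks end up holding the global sum
```

The design point to notice is the one stressed in the lectures: with message passing, every data movement is explicit in the program, whereas a data parallel compiler (as in High Performance Fortran) decides the communication for you.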
Log books describing your work in the lab periods must be kept by you; they will be handed in and marked three times during the semester. A Fortran 90 quiz is set for week ??? to allow you to assess your performance in the class. A second quiz will be scheduled ??? (week 11). Depending on the section, a further quiz will be given or projects will be assigned; these are to be determined later in the semester.

Programming Languages

In scientific computing, Fortran is the predominant language, mainly due to its development as the first high level programming language to be officially standardized, but also due to its strong support for numerical computing. Fortran 90 is introduced because of its strong support for numeric precision control and array processing. In addition, it is the basis for the data parallel programming paradigm used in this course, namely standard High Performance Fortran. Fortran 95, now a standard, has adopted many of the array extensions to Fortran 90 used in High Performance Fortran. You are welcome to use other languages such as C and data parallel C, now just becoming available, but as stated below, the laboratory material is based on Fortran. The course uses much sample code written in Fortran 77 that needs to be modified. The student is free to rewrite the exercises and perform the laboratory experiments in another language of his/her choice, but little if any sample code in other languages is available at this time. (Much of the example code is being rewritten in Fortran 90 and will be rewritten at the same time in C and C++, but it is unlikely that much of it will be available during this semester.)

About The Computing Machines

The computers used are those available for general use at UNM and Sandia National Laboratories. The UNM computers are those at the Albuquerque High Performance Computing Center (AHPCC) and the central computing facility, CIRT.
The AHPCC machines currently include a 32-node IBM SP1 with the SP2 high performance switch, an HP 735, a Sun Sparc 20, a Sun Ultrasparc clone, an IBM RS/6000 model 370, several IBM PowerPCs (models 25T, 43P), an SGI Onyx Reality, an SGI Onyx Reality 2, and several Pentiums running Linux. The CIRT machines include a 4-node IBM SP2 with an SP3 high performance switch, a large collection of IBM RS/6000 370 machines, an IBM RS/6000 J30 4-processor shared memory machine, and an SGI Onyx Reality. The Sandia machine is a 54-node Paragon. All of these machines run a version of UNIX compatible with either System V or BSD 4.2 Unix. CIRT maintains a very large modem bank, accessed via the phone number 277-9990, from which both CIRT and AHPCC machines are accessible. AHPCC has a small modem bank (approximately 10 lines), available from the phone number 277-????.

About The Software And Software Tools

Software provided on the local UNM machines includes several Fortran 90 compilers, Matlab Version 4.3, and various parallel computing libraries such as MPI, PVM, and P4. The compilers behave differently and have different support and debugging tools available from the vendor. For example, the NAG f90 compiler is a good debugging compiler with various debugging flags like -C (check subscripts) and -P (check pointers), but is a very poor code optimizer (at least on IBM workstations; however, there is an optimizing NAG f90 compiler for Sun platforms). On the other hand, the IBM xlf, xlhpf, and xlc compilers are very good optimizing compilers with often surprising optimizations (specified by options on the compiler) that drastically improve the performance of programs used in the laboratory experiments; these performance improvements are not described in the laboratory manuals on computer performance. Beware that many of the UNIX tools such as make, vi, debuggers, etc. behave differently depending on which version is being used.
The versions may be GNU software, vendor supplied (IBM, HP, Sun), or Linux, all of which can have slightly different behaviors. Before spending a lot of time looking for bugs, ask me, the lecturer, or your colleagues for help. This course relies on these tools, but they are not the subject of the course. See the document GettingStarted in ~acpineda/CS471 for further information on some of these tools.

Classes And Laboratory Sessions

The classes and laboratories are held in the rooms listed at the beginning of this document. The lecture material is presented before it is needed in the laboratory sessions. The laboratory sessions are available for performing the exercises and experiments, with someone present at all times to answer questions about the lectures, laboratory exercises and experiments, and projects. In some cases, particularly the computer performance laboratory, the results described in the laboratory material are based on the use of a DECstation 5000/240. This machine has a RISC (reduced instruction set computer) architecture, like the IBM RS/6000 machines, but is considerably slower, with much less sophisticated hardware. Thus, the descriptions in the lab manuals for the DECstation are similar to, but not the same as, what you will see on the IBM machines; they still remain appropriate. The major difference arises from the fact that the IBM RS/6000 has multiple functional units and relies on cache storage more than the DECstation does.

Quizzes, Exams, Laboratory Log Books, And Projects

The schedule for the quizzes, submission of the log books, and final projects is as follows:

Item                              Date Held or Due
Fortran 90 Quiz                   ????
Log Book 1                        ????
Log Book 2                        ????
Project Proposals                 ????
Quiz                              ????
Log Book 3                        ????
Project Presentations and Report  ????

All quizzes are open book and are expected to take an hour.

Logbooks

The logbooks describe the observations and results obtained from the exercises performed in the laboratory manuals.
The answers to any questions in the exercises should be provided in the logbooks. Comments on the exercises, what you have learned from them, and questions that remain in your mind should be recorded in the logbook. The purpose of the logbook is to record your thoughts and questions as you proceed through the exercises. To give the logbooks more structure and a more detailed specification, each entry should reference the exercise it is commenting on and should include the following points, as appropriate for each exercise:

· the results of performing the exercise
· whether the exercise worked as expected or the results were anomalous
· any questions you had about the results
· what you learned from the exercise
· analysis of the results

The logbook can either be a written record in an exercise book or a file created as you work on the exercises in the lab period. The logbook should be printed and, if you wish, formatted with a text formatter of your choice. For example, suppose the following exercise was given after a lecture on IEEE floating point number representations and how the programming language does its arithmetic (that is, what precision it uses).

Exercise 6.1: Write a program (C or Fortran) that executes the following statements:

Assign to the variable x the value of the constant 1.1.
Print out x with lots of digits.
Assign to y the value 1.0 and to z the value 3.0.
Assign to zz the value of y/z.
Print out the value of the expression 1.0 - 3.0*(y/z).
Print the machine precision for the type used for x.
Repeat the program, making sure x is represented at a higher precision and the division is performed at this higher precision.
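The exercise asks for C or Fortran; purely to illustrate what it probes, here is a language-neutral Python sketch that emulates IEEE single precision by round-tripping values through the struct module. The helper f32 is our own device, not part of the course materials:

```python
import struct

def f32(x):
    """Round a Python float (an IEEE double) to the nearest IEEE single."""
    return struct.unpack('f', struct.pack('f', x))[0]

x = f32(1.1)                 # 1.1 has no exact binary representation
print(x)                     # prints a value slightly larger than 1.1

y, z = f32(1.0), f32(3.0)    # 1.0 and 3.0 are exactly representable
zz = f32(y / z)              # single-precision approximation to one third
expr = 1.0 - 3.0 * zz
print(expr)                  # a tiny nonzero residue, on the order of 2**-25

eps_single = 2.0 ** -23      # machine epsilon for IEEE single precision
print(eps_single)            # about 1.19e-07
```

Note that the residue from 1.0 - 3.0*zz comes out as exactly -2^-25 in this emulation, which agrees with the significant digits (though not the printed exponent) reported in the logbook entry below.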
Logbook entry: I wrote the following Fortran program for exercise 6.1:

x = 1.1
print '(a,e20.12)', 'x = ', x
y = 1.0; z = 3.0; zz = y/z
print '(a,e20.12)', 'The result of the expression 1.0-3.0*zz is ', 1.0 - 3.0*zz
print *, 'The epsilon (machine precision) for the type used to represent x is ', epsilon(x)
end

The results were 1.10000002384 and -2.98023223877E-06 (these were obtained on a PC using the Salford f90 compiler; what is printed is very dependent on the hardware and software being used). To my surprise, 1.1 did not print as 1.1 but as a number slightly larger than 1.1. On second thought, 1.1 is not representable, so the input/output conversion routines print the decimal expansion of the binary representation of 1.1, which when listed to 12 digits is larger than 1.1. The second result is not the zero I expected but a small number. It looks reasonable because the division by 3.0 will produce a number near one-third, but not exactly one-third, since one-third cannot be represented; when multiplied by 3.0, it produces a number near 1.0 but slightly larger. The size of the result is reasonable because the precision of the arithmetic with this compiler and machine is 32 bits (standard IEEE binary arithmetic), which has a precision of 23 binary digits, approximately 10^(-7). (Actually, 2^(-23) is 1.1920929E-07.) For the second part of the exercise, I declared x, zz, and y to be double precision and repeated the calculation. This time, the result for x was the same and the result for the expression was 0.55e-16. I initially expected x to be a number close to 1.1 within 16 digits, but on second thought about how Fortran represents literal constants, I realized that 1.1 is a single precision constant, which approximates 1.1 only to single precision. To obtain what I expected, I would have to change 1.1 to 1.1d0.
For the second result, 1.0 and 3.0 are represented exactly in double precision and the division is done in double precision, so the result is accurate to double precision, that is, to approximately 16 digits.

Case Studies

Depending on the section of the course, two or three case studies will be presented. For both sections, the first case study will be an introduction to molecular dynamics taken from your text book (IHPSC), followed by the corresponding lab. For Section I, the next lab will be a follow-on in the area of molecular dynamics, using a more advanced algorithm in the case study, and will be followed by the case study on advection. Instead of the follow-on molecular dynamics lab and advection case study, Section II will be presented with case studies from two guest lecturers, to be determined later. Each case study will involve three lectures on the topic, with exercises to be worked on during the lab sessions.