Programming Fundamentals/Understanding High Performance Computing

An explanation of the difference between sequential programming and parallel programming concepts with examples of each. A historical sketch of computers with examples of high performance computing solving problems using parallel programming concepts. Suggestions for various groups of learners to explore high performance computing. Includes software programs, source code files and several Internet links.

Preface – November 13, 2009

This module was created as an entry for the 2008-'09 Open Education Cup: High Performance Computing competition. The competition was supervised by Dr. Jan Erik Odegard, Executive Director of the Ken Kennedy Institute for Information Technology at Rice University. It was submitted to the "Parallel Algorithms and Applications" category and specifically designed as an introduction to the subject targeting intermediate grade school students to collegiate undergraduates who have little knowledge of High Performance Computing (HPC).

This module received the "Best Module" award for the "Parallel Algorithms and Applications" category which included a US $500 prize.

Those who reviewed the entries for the competition made some suggestions for improvement, and most have been incorporated into this revised edition of the module. As always, my thanks to them and to all others who make suggestions for improving educational materials.

Kenneth Leroy Busbee

Introduction to High Performance Computing

Grouping multiple computers or multiple computer processors to accomplish a task more quickly is referred to as High Performance Computing (HPC). We want to explain how this is accomplished using parallel programming algorithms or concepts.

The Shift from a Single Processor to Parallel

We are going to start our explanation by giving two simple examples.

Example 1

After eating all you can, you toss your chicken leg bone out of the car window (shame on you for trashing up the highway), but in short order an ant finds your tossed chicken bone. One single ant could bite off the leftovers on the bone and transport them to the colony, one bite at a time, but it might take him 1 whole day (24 hours) of work. But what if he gets help? He signals some buddies and, being a small colony of ants, they allocate a total of 10 ants to do the task. Ten times the workers take one tenth the time. The ten ants do the task in 2 hours and 24 minutes.

I toss another bone out the window. An ant finds it and the colony allocates 50 ants to do the task of picking the bone clean. In less than 30 minutes (28.8 to be exact) the 50 ants working in parallel complete the task.

Example 2

One painter might take 8 hours to paint the exterior of an average sized house. But if he can put a crew of 10 painters to work simultaneously (or in other words, in parallel), it takes only 48 minutes. What about a crew of 50 painters, assuming that they can do the work and not get in the way of each other? How about less than 10 minutes (9.6 to be exact).

Now let's make sure we understand that the same amount of work was done in the examples given. The work was only completed in a shorter amount of time because we put more workers on the task. Not all tasks can be divided up in this way, but when a task can be divided between multiple workers, we can take advantage of the workers doing their sub parts of the task in parallel. Let's look at another example.
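The arithmetic in Examples 1 and 2 can be checked with a few lines of Python. This is only a sketch under the same idealized assumption the examples make, namely that the work divides evenly and the workers never get in each other's way. The function name ideal_time_minutes is made up for this illustration.

```python
def ideal_time_minutes(one_worker_hours, workers):
    """Minutes needed when the work divides perfectly among the workers."""
    if workers < 1:
        raise ValueError("need at least one worker")
    return one_worker_hours * 60 / workers

# Example 1: one ant needs 24 hours to clean the bone.
print(ideal_time_minutes(24, 10))   # 10 ants
print(ideal_time_minutes(24, 50))   # 50 ants

# Example 2: one painter needs 8 hours to paint the house.
print(ideal_time_minutes(8, 10))    # 10 painters
print(ideal_time_minutes(8, 50))    # 50 painters
```

The four calls give 144, 28.8, 48 and 9.6 minutes, matching the figures in the examples (144 minutes is 2 hours and 24 minutes).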

Example 3

I want to drive from Houston, Texas to Dallas, Texas; a distance of about 250 miles. For easy calculations let's say I can travel 50 miles in one hour. It would take me 5 hours. Well, I could divide the task between 5 cars and have each car travel 50 miles and arrive in Dallas in 1 hour. Right?

Well, wrong. The task of driving from Houston to Dallas cannot be divided into tasks that can be done in parallel. The task can only be done by one person driving in a line from Houston to Dallas in 5 hours. I used the word "line" because it helps connect us to the word: linear. A linear task cannot be broken up into smaller tasks to be done in parallel by multiple workers. Within the computer world, the term associated with this linear concept is sequential processing. I must drive one mile at a time in sequence to get to Dallas.

Our natural tendency is to share the work, that is, to work in parallel whenever it is possible. As a group, we can accomplish many tasks that can be done in parallel in less time.
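The contrast between a task that can be divided and one that cannot may be easier to see in code. The Python sketch below is an illustration only: the function names are invented here, and threads on one machine stand in for the workers. It shows the shape of the idea and is not a speed test.

```python
from concurrent.futures import ThreadPoolExecutor

def split(items, workers):
    """Cut a list into at most `workers` roughly equal pieces."""
    size = max(1, -(-len(items) // workers))  # ceiling division
    return [items[i:i + size] for i in range(0, len(items), size)]

def parallel_total(items, workers):
    """Divisible task: each worker totals its own piece, then the totals are combined."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_sums = list(pool.map(sum, split(items, workers)))
    return sum(partial_sums)

def drive(miles):
    """Sequential task: every mile starts where the previous one ended."""
    position = 0
    for _ in range(miles):
        position += 1
    return position

bites = list(range(1, 1001))
print(parallel_total(bites, 10) == sum(bites))  # same answer, work shared by 10
print(drive(250))                               # no way to hand out the miles
```

In parallel_total the pieces do not depend on each other, so any worker can take any piece. In drive each step needs the result of the step before it, which is what makes the task sequential.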

The Birth of Computers – A "Parallel" to Central Processing Unit (CPU) Story

“ENIAC, short for Electronic Numerical Integrator And Computer, was the first general-purpose electronic computer (July 1946). It was the first Turing-complete, digital computer capable of being reprogrammed to solve a full range of computing problems. ENIAC had twenty ten-digit signed accumulators which used ten's complement representation and could perform 5,000 simple addition or subtraction operations between any of them and a source (e.g., another accumulator, or a constant transmitter) every second. It was possible to connect several accumulators to run simultaneously, so the peak speed of operation was potentially much higher due to parallel operation.” (ENIAC from Wikipedia)

Often not understood by many today, the first computer used base 10 arithmetic in the electronics and was a parallel processing machine by using several accumulators to improve the speed. However, this did not last for long. During its construction:

“The First Draft of a Report (commonly shortened to First Draft) on the EDVAC – Electronic Discrete Variable Automatic Computer was an incomplete 101 page document written by John von Neumann and distributed on June 30, 1945 by Herman Goldstine, security officer on the classified ENIAC project. It contains the first published description of the logical design of a computer using the stored-program concept, which has come to be known as the von Neumann architecture.” (First Draft of a Report on the EDVAC from Wikipedia)

“The von Neumann architecture is a design model for a stored-program digital computer that uses a [central] processing [unit] and a single separate storage structure to hold both instructions and data. It is named after the mathematician and early computer scientist John von Neumann. Such computers implement a universal Turing machine and have a sequential architecture.” (Von Neumann architecture from Wikipedia)

Von Neumann also proposed using a binary (base 2) numbering system for the electronics. One of the characteristics of the von Neumann architecture was the trade-off from multiple processors using base 10 electronics to a single central processor using base 2 (binary) electronics. To compare to our ant example, the idea was to use one real fast ant versus 10 slow ants. If one real fast ant can do 1,000 tasks in an hour, it would be more powerful (be able to do more tasks) than 10 ants each doing 10 tasks an hour, or the equivalent of 100 tasks per hour.

The rest is history – most commercially built computers for about the first forty years (1951 to 1991) followed the von Neumann architecture. The electronic engineers kept building more reliable and faster electronics, moving from vacuum tube, to transistor, to integrated circuit, to what we call today "chip" technology. This transformation made computers more reliable (they broke down less frequently), physically smaller, less power hungry and faster. Personal computers were introduced in the late 1970's and within ten years became more commonly available and used.

One shortcoming was that most programming efforts were directed towards improving the linear (or sequential) way of thinking or solving a problem. After all, the computer electronic engineers would be making a faster computer next year. Everyone understood that the computer had only one central processing unit (CPU). Right?

The Need for Power

Well, wrong. Computer scientists and electronic engineers had been experimenting with multi-processor computers and parallel programming since 1946. But it was not until the 1980's that we saw the first parallel processing computers (built by Cray and other computer companies) being sold as commercially built computers. It's time for another example.

Example 4

The circus traveling by train from one city to the next has an elephant that dies. They decide to toss the elephant off the train (shame on them for trashing up the countryside), but in short order a "super" ant (faster than most regular ants) finds the elephant. This project is much larger than your tossed chicken bone. One single "super" ant could do the task (bite off a piece of the elephant and transport it to the colony, one bite at a time), but it might take one whole year. After all, this requires a lot more work than a chicken bone. But what if he gets help? He signals some buddies and, being a large colony of "super" ants, they allocate a total of 1,460 ants to do the task. Wow, they devour the elephant in six hours.

This elephant example is exactly where the computer scientists had arrived. The electronic engineers were going to continue to make improvements in the speed of a single central processing unit computer, but not soon enough to satisfy the "need for power" to be able to solve tasks requiring immense computing power. Some of the new tasks that would require immense computer power included the human genome project, searching for oil and gas by creating three-dimensional images of geological formations, and the study of gravitational forces in the universe, just to mention a few. The solution: parallel processing to the rescue. Basically, the only way to get this immense computer power was to implement parallel processing techniques. During the late 1970's and early 1980's scientists saw the need to explore the parallel processing paradigm more fully, and thus came the birth of High Performance Computing. Various national and international conferences started during the 1980's to further the cause of High Performance Computing. For example, in November of 2008 the "SC08" supercomputing conference celebrated its 20th anniversary.

Weather prediction is a good example of the need for High Performance Computing. Using the fastest single central processing unit computer, it might take a year to predict tomorrow's weather. The information would be correct but 365 days late. Using parallel processing techniques and a powerful "high performance computer", we might be able to predict tomorrow's weather in 6 hours. Not only correct, but in time to be useful.

Measuring Computer Power

Most people are familiar with the gigahertz (billions of clock cycles per second) measure used to describe how fast a single CPU is running. Most microcomputers of today run at around 3 GHz, or 3 billion cycles a second. Although 3 billion sounds fast, many of the instructions carried out in those cycles are simple operations.

Supercomputing uses a measurement involving floating point arithmetic calculations as the benchmark for comparing computer power. "In computing, FLOPS (or flops or flop/s) is an acronym meaning FLoating point Operations Per Second." and again "On May 25, 2008, an American military supercomputer built by IBM, named 'Roadrunner', reached the computing milestone of one petaflop by processing more than 1.026 quadrillion calculations per second." (FLOPS from Wikipedia) For those of us not familiar:

Example 5: Getting a Sense of Power

3 billion or 3 GHz is:                  3,000,000,000
1 quadrillion or 1 petaflop is: 1,000,000,000,000,000

You should also realize that your personal computer is not doing 3 gigaflops' worth of calculations, but something slower when measured in FLOPS.
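To get a feel for the gap in Example 5, the two numbers can simply be divided. The snippet below does only that division. It treats the 3 billion figure and the petaflop figure as plain counts per second, which glosses over the fact that they measure different things, so the result is a sense of scale rather than a benchmark. The variable names are invented for this illustration.

```python
personal_computer = 3_000_000_000          # 3 billion per second (3 GHz)
petaflop = 1_000_000_000_000_000           # 1 quadrillion per second

ratio = petaflop / personal_computer
print(f"1 quadrillion is about {ratio:,.0f} times larger than 3 billion")
```

The ratio comes out at roughly 333,333, so the petaflop figure is several hundred thousand times the 3 GHz figure.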

High Performance Computing Made Personal

It took several years (about 30) to get computers to a personal level (1951 to 1981). It took about twenty years (late 1980's to present 2009) to get multi-processor computers to the personal level. Currently available to the general public are computers with "dual core" and "quad core" processors. In the near future, microcomputers will have 8 to 16 core processors. People ask, "Why would I need that much computer power?" There are dozens of applications, but I can think of at least one item that almost everyone wants: high quality voice recognition. That's right! I want to talk to my computer. Toss your mouse, toss your keyboard, no more touch pad – talk to it.

Again, one shortcoming is that most programming efforts have been directed towards teaching and learning the sequential processing way of thinking or solving a problem. Educators will now need to teach, and programmers will now need to develop, skills in programming using parallel concepts and algorithms.

Summary

We have bounced you back and forth between sequential and parallel concepts. We covered our natural tendency to do work in parallel. But with the birth of computers the parallel concepts were set to the side and the computer industry implemented a faster single processor approach (sequential). We explained the limitations of sequential processing and the need for computing power. Thus, the birth of High Performance Computing. Parallel processing computers are migrating into our homes. With that migration, there is a great need to educate the existing generation and develop the next generation of scientists and programmers to be able to take advantage of High Performance Computing.

Learner Appropriate Activities

High Performance Computing is impacting how we do everything. Learning, working, even our relaxation and entertainment are impacted by HPC. To help more people understand HPC, I have listed appropriate activities based on where a learner is in relation to their programming skills.

Computer Literacy but No Programming Skills

We have provided two computer programs that help students see the impact of parallel processing. The first is a "Linear to Parallel Calculator" in which the student enters how long it would take one person to complete a task and how many people will work as a group on the task; the program then calculates how long it will take the group to complete the task. The second is a "Parallel Speed Demonstration Program" that simulates parallel processing. It displays on the monitor the first 60 factorial numbers in 60 seconds, then shows them as if 10 processors were doing it in 6 seconds, then as if 100 processors were doing it in less than 1 second. Both are compiled and ready for use on an Intel CPU machine (compiled for use on Windows OS).
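For readers who cannot run the Windows executables, the timing idea behind the speed demonstration can be sketched in Python. This is not the downloadable program and makes no claim about how it is written. It assumes that "the first 60 factorial numbers" means 1! through 60!, that one processor displays one number per second, and that the work divides evenly; all names are invented for the sketch.

```python
import math

NUMBERS = 60            # how many factorials to display
SECONDS_PER_NUMBER = 1  # assumed pace of a single processor

def simulated_seconds(processors):
    """Ideal time to display all factorials if the work divides evenly."""
    return NUMBERS * SECONDS_PER_NUMBER / processors

for processors in (1, 10, 100):
    print(f"{processors:>3} processor(s): {simulated_seconds(processors):g} seconds")

# The numbers being displayed, under the 1! through 60! assumption.
factorials = [math.factorial(n) for n in range(1, NUMBERS + 1)]
print(factorials[:5])
```

Under these assumptions the three cases take 60, 6 and 0.6 seconds, in line with the 60 seconds, 6 seconds and "less than 1 second" described above.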

Download the executable file from Connexions: Linear to Parallel Calculator

Download the executable file from Connexions: Parallel Speed Demonstration Program

An interesting activity would be to join a group that is using thousands of personal microcomputers via Internet connections for parallel processing. Several distributed processing projects are listed in the "FLOPS" article on Wikipedia. One such group is the "Great Internet Mersenne Prime Search - GIMPS".

A link to the GIMPS web site is: http://www.mersenne.org/

Another activity is to "Google" some keywords. Be careful: "Googling" can be confusing, and it is often difficult to focus on the precise subject that you want.

high performance computing
computational science
supercomputing
distributed processing

Learning Programming Fundamentals

Students learning to program who are currently taking courses in Modular/Structured programming and/or Object Oriented programming might want to review the source code files for the demonstration programs listed above. These programs do not do parallel programming, but the student could modify or improve them to better explain parallel programming concepts.

You may need to right click on the link and select "Save Target As" in order to download these source code files.

Download the source code file from Connexions: Linear to Parallel Calculator

Download the source code file from Connexions: Parallel Speed Demonstration Program

Another appropriate activity is to "Google" some of the key words listed above. With your fundamental understanding of programming, you will understand more of the materials than those with no programming experience. You should get a sense that parallel programming is becoming a more important part of a computer professional’s work and career.

Review the "Top 500 Super Computers" at: http://www.top500.org/

Look at the source code listings provided in the next section, but remember, you cannot compile or run these on your normal computer.

Upper Division Under-Graduate College Students

The challenge is to try parallel computing, not just talk about it.

During the week of May 21st to May 26th in 2006, this author attended a workshop on Parallel and Distributed Computing. The workshop was given by the National Computational Science Institute and introduced parallel programming using multiple computers (a group of micro computers grouped or clustered into a super-micro computer). The conference emphasized several important points related to the computer industry:

During the past few years super-micro computers have become more powerful and more available.
Desktop computers are starting to be built with multiple processors (or cores) and we will have multiple (10 to 30) core processors within a few years.
Use of super-micro computing power is widespread and growing in all areas: scientific research, engineering applications, 3D animation for computer games and education, etc.
There is a shortage of educators, scientific researchers, and computer professionals that know how to manage and utilize this developing resource. Computer professionals needed include: Technicians that know how to create and maintain a super-micro computer; and Programmers that know how to create computer applications that use parallel programming concepts.

This last item deserves emphasis for those of you beginning a career in computer programming: as you progress in your education, you should be aware of the changing nature of computer programming as a profession. Within a few years all professional programmers will have to be familiar with parallel programming.

During the conference this author wrote a program that sorts an array of 150,000 integers using two different approaches. The first way was without parallel processing. When it was compiled and executed using a single machine, it took 120.324 seconds to run (about 2 minutes). The second way was to redesign the program so parts of it could be run on several processors at the same time. When it was compiled and executed using 11 machines within a cluster of microcomputers, it took 20.974 seconds to run. That's approximately 6 times faster. Thus, parallel programming will become a necessity to be able to utilize the multi-processor hardware of the near future.
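The "approximately 6 times faster" figure can be checked directly from the two timings. The short Python calculation below also divides the result by the number of machines, a ratio often called parallel efficiency. That term and the variable names are additions made for this illustration and do not come from the workshop materials.

```python
sequential_seconds = 120.324   # one machine
parallel_seconds = 20.974      # 11 machines in the cluster
machines = 11

speedup = sequential_seconds / parallel_seconds
efficiency = speedup / machines

print(f"speedup:    {speedup:.2f}x")
print(f"efficiency: {efficiency:.0%}")
```

The speedup comes out near 5.74, which rounds to the 6 quoted above. Spread over 11 machines that is roughly 52 percent of the ideal, a reminder that real programs rarely match the perfect division of labor in the ant and painter examples.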

A distributed computing environment was set up in a normal computer lab using a Linux operating system stored on a CD. After several computers are booted with the CD, they can communicate with each other with the support of "Message Passing Interface" or MPI commands. This model, known as the Bootable Cluster CD (BCCD), is available from:

Bootable Cluster CD – University of Northern Iowa at: http://www.bccd.net/

The source code files used during the above workshop were modified to a version 8, thus an 8 is in the filename. The non-parallel processing "super" code was named nonps8.cpp and the parallel processing "super" code was named ps8.cpp. (Note: The parallel processing code contains comments that identify the part of the code run by a machine called the "SERVER_NODE" and the part run by the 10 other machines (the Clients). The client machines communicate critical information to the server node using "Message Passing Interface" or MPI commands.)
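The files nonps8.cpp and ps8.cpp are not reproduced here. As a rough illustration of the server and client roles just described, the Python sketch below has a "server" split an array into pieces, hand each piece to a "client" to sort, and then merge the sorted pieces. It is an assumption about the general shape such a program might take, not a description of ps8.cpp, and it uses threads on one machine in place of MPI messages between machines. All function names are invented for the sketch.

```python
import heapq
import random
from concurrent.futures import ThreadPoolExecutor

CLIENTS = 10  # matches the 10 client machines mentioned above

def client_sort(piece):
    """Work done by one client: sort its own piece of the array."""
    return sorted(piece)

def server_sort(numbers, clients=CLIENTS):
    """Work done by the server: split, hand out, collect, and merge."""
    size = max(1, -(-len(numbers) // clients))  # ceiling division
    pieces = [numbers[i:i + size] for i in range(0, len(numbers), size)]
    with ThreadPoolExecutor(max_workers=clients) as pool:
        sorted_pieces = list(pool.map(client_sort, pieces))
    # Merging already-sorted pieces is cheaper than sorting from scratch.
    return list(heapq.merge(*sorted_pieces))

random.seed(2006)  # fixed seed so the run is repeatable
data = [random.randint(0, 1_000_000) for _ in range(150_000)]
result = server_sort(data)
print(result == sorted(data))
```

The final merge is the part that stays on the server and cannot be handed out, which is one reason a program like this does not get a full tenfold gain from ten clients.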

You may need to right click on the link and select "Save Target As" in order to download these source code files.

Download the source code file from Connexions: nonps8.cpp

Download the source code file from Connexions: ps8.cpp

Two notable resources with super computer information were provided by presenters during the workshop:

University of Oklahoma – Supercomputing Center for Education & Research at: http://www.oscer.ou.edu/education.php

Contra Costa College – High Performance Computing at: http://contracosta.edu/hpc/resources/presentations/

You can also "Google" the topic's key words and spend several days reading and experimenting with High Performance Computing.

Consider reviewing the "Educator Resources" links provided in the next section.

Educator Resources

There are many sites that provide materials and assistance to those teaching the many aspects of High Performance Computing. A few of them are:

Shodor – A National Resource for Computational Science Education at: http://www.shodor.org/home/

CSERD – Computational Science Education Reference Desk at: http://www.shodor.org/refdesk/

National Computational Science Institute at: http://www.computationalscience.org/

Association for Computing Machinery at: http://www.acm.org/

Super Computing – Education at: http://sc09.sc-education.org/about/index.php

Simple Definitions

high performance computing: Grouping multiple computers or multiple computer processors to accomplish a task in less time.

sequential processing: Using only one processor and completing the tasks in a sequential order.

parallel processing: Dividing a task into parts that can utilize more than one processor.

central processing unit: The electronic circuitry that actually executes computer instructions.

parallel programming: Involves developing programs that utilize parallel processing algorithms that take advantage of multiple processors.
