Parallel Computing and Computer Clusters/Theory
Typical real-world applications
What do people do with clusters of computers?
- The web
- Search engine indexes
- Scientific and engineering work
- Weather modeling and prediction
- Molecular modeling
- Product design and simulation
Classes of problems
Embarrassingly parallel problems
Parallel Programming Models
There are two general models for building parallel architectures, and they can be used separately or combined. In the shared memory model, all processors share a single memory and address space: every processor can access all of the attached main memory. In the message passing model, information is packed into messages, discrete units of data that can be passed between the various CPUs in the system. Both models have their advantages and disadvantages.
In the shared memory model, every CPU can address the machine's main memory, which is generally accessible through a bus. Because all CPUs share one address space, memory addresses are meaningful everywhere and can be passed between CPUs. Routing all memory traffic through a bus introduces a number of issues, chiefly bandwidth: if the bus is narrow and the number of CPUs is large, keeping the CPUs usefully busy is difficult to achieve. Bus communication is therefore usually augmented with local caches (the processors' L1 and L2 caches), which introduces the challenge of cache coherence: ensuring that the data in main memory and in each cache stay sufficiently up to date that every CPU sees the appropriate values. There are several strategies for making this operation both correct and fast.
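The shared memory model can be sketched in a few lines. This is a minimal illustration, not a hardware simulation: several threads run in one address space and update a single counter directly, and the lock plays a role analogous to the coherence machinery described above, keeping concurrent updates from clobbering each other.

```python
import threading

counter = 0
lock = threading.Lock()

def worker(increments):
    """Each worker reads and writes the same shared counter."""
    global counter
    for _ in range(increments):
        with lock:          # without this, the read-modify-write races
            counter += 1

threads = [threading.Thread(target=worker, args=(10_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter)  # 40000: every thread's updates are visible to all
```

Note that the threads never copy data to one another; sharing the address space is what makes direct access possible, and it is also what makes synchronization necessary.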
In message passing systems, memory is either not directly connected to a given CPU or may even be spread across various geographic locations. Programs are built from explicit sends and receives: data is packed into a message and sent across a network to other machines in the system.
History of distributed computing
The Eighties and Nineties: PVM and MPI
Today: GFS, MapReduce, Hadoop
Hadoop at its core is a combination of two open-source frameworks: MapReduce and HDFS, both implementations based on white papers published by Google. MapReduce is a processing framework modeled on the Google framework of the same name; it takes the data stored in HDFS and processes it on each node in the cluster. A MapReduce job consists of two procedures defined by the programmer: the map and the reduce. Mappers take input lines in the form of key/value pairs, and they also emit keys and values. Each reducer then receives as input a key together with a vector of all the values that the mappers emitted for that key. HDFS, the Hadoop Distributed File System, is an implementation of Google's distributed filesystem: a software layer that distributes the storage of files across multiple nodes in a cluster, with a default replication factor of three to ensure with a high degree of certainty that no data is lost.
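The mapper/reducer contract described above can be sketched in memory. This is not the Hadoop API itself, just the same shape under assumed names (`mapper`, `reducer`, `map_reduce`): a mapper that emits key/value pairs, a shuffle that groups values by key, and a reducer that sees each key with the vector of all its values, here counting words.

```python
from collections import defaultdict

def mapper(line):
    """Emit a (key, value) pair for every word in the line."""
    for word in line.split():
        yield word, 1

def reducer(key, values):
    """Collapse the vector of values emitted for one key."""
    return key, sum(values)

def map_reduce(lines):
    groups = defaultdict(list)
    for line in lines:                      # map phase
        for key, value in mapper(line):
            groups[key].append(value)       # shuffle: group by key
    return dict(reducer(k, v) for k, v in groups.items())  # reduce phase

counts = map_reduce(["hello world", "hello cluster"])
print(counts)  # {'hello': 2, 'world': 1, 'cluster': 1}
```

In Hadoop proper, the map and reduce phases run on different nodes and the shuffle moves data across the network, but the programmer writes only the two procedures shown here.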
Mathematics of parallel processing
Parallel processing is the simultaneous execution of the same task (split up and specially adapted) on multiple processors in order to obtain faster results. The parallel nature can come from a single machine with multiple processors or multiple machines connected together to form a cluster.
Amdahl's law is a demonstration of the law of diminishing returns: while one could speed up part of a computer a hundred-fold or more, if the improvement only affects 12% of the overall task, the best the speedup could possibly be is 1/(1 − 0.12) ≈ 1.14 times faster.
More technically, the law is concerned with the speedup achievable from an improvement to a computation that affects a proportion P of that computation where the improvement has a speedup of S. (For example, if an improvement can speedup 30% of the computation, P will be 0.3; if the improvement makes the portion affected twice as fast, S will be 2.) Amdahl's law states that the overall speedup of applying the improvement will be

    1 / ((1 − P) + P/S)
To see how this formula was derived, assume that the running time of the old computation was 1, for some unit of time. The running time of the new computation will be the length of time the unimproved fraction takes (which is 1 − P) plus the length of time the improved fraction takes. The length of time for the improved part of the computation is the length of the improved part's former running time divided by the speedup, making the length of time of the improved part P/S. The final speedup is computed by dividing the old running time by the new running time, which is what the above formula does.
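The derivation above can be checked directly. The function name `amdahl` is an assumption for illustration; the body is a direct transcription of the formula, evaluated with the worked numbers from the text (P = 0.3, S = 2).

```python
def amdahl(P, S):
    """Overall speedup when a fraction P of the work is sped up S-fold."""
    # old running time = 1; new running time = (1 - P) + P/S
    return 1 / ((1 - P) + P / S)

speedup = amdahl(0.3, 2)
print(round(speedup, 3))  # 1.176: doubling the speed of 30% of the
                          # work gains less than 18% overall
```

This makes the diminishing-returns point concrete: even an infinite S with P = 0.3 can never do better than 1/(1 − 0.3) ≈ 1.43.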
In the special case of parallelization, Amdahl's law states that if F is the fraction of a calculation that is sequential (i.e. cannot benefit from parallelization), and (1 − F) is the fraction that can be parallelized, then the maximum speedup that can be achieved by using N processors is

    1 / (F + (1 − F)/N)
In the limit, as N tends to infinity, the maximum speedup tends to 1/F. In practice, price/performance ratio falls rapidly as N is increased once (1 − F)/N is small compared to F.
As an example, if F is only 10%, the problem can be sped up by only a maximum of a factor of 10, no matter how large the value of N used. For this reason, parallel computing is only useful for either small numbers of processors, or problems with very low values of F: so-called embarrassingly parallel problems. A great part of the craft of parallel programming consists of attempting to reduce F to the smallest possible value.
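The example above can be computed for a range of cluster sizes. This sketch assumes the parallel form of the law with F = 0.1 sequential work (the helper name `parallel_speedup` is hypothetical):

```python
def parallel_speedup(F, N):
    """Maximum speedup on N processors when fraction F is sequential."""
    return 1 / (F + (1 - F) / N)

for N in [1, 10, 100, 1000]:
    print(N, round(parallel_speedup(0.1, N), 2))

# The speedup approaches, but never reaches, 1/F = 10 as N grows,
# which is why shrinking F matters more than adding processors.
```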