- Hardware architecture (parallel computing)
- Part 1: The Parallel Computing Environment
- Parallel Computing
- 1 Introduction
Parallel machines today consist of off-the-shelf chips or, in some cases, chips modified from the off-the-shelf version to include additional caches and other support structures that suit them for being combined into a large parallel computer.
Initial shipments of the EV5 chip gave it a peak performance of hundreds of millions of floating point operations per second (megaflops, or MFlops) in IEEE standard single- and double-precision modes. Continual upgrades of the chip increased its clock speed several times. In some instances, the chips implemented in MPPs run about one year behind their implementation in workstations, since the MPP support circuitry must be made compatible with the new chip.
The SP series is available in configurations that we informally term "standard" or "hybrid": the standard configuration has one processor per node, and the hybrid versions have an SMP on each node. The carefully balanced architecture of the P2SC processors allows them to obtain high levels of performance on a wide variety of scientific and technical applications. In the class of "hybrid" configurations, a recent SP version has a 4-way SMP on each node, with each PowerPC processor in the node running at a clock rate that gives the node a peak performance rating in the GFlops range.
Two of these machines have been placed at the Lawrence Livermore National Laboratory as part of a U.S. government program. A larger machine, with more nodes and 4 processors per node, is currently being installed. In Figure 4, we show a greatly simplified schematic diagram of a generic hybrid architecture that demonstrates distributed memory across nodes and shared memory within a single node.
The optimal programming model for such hybrid architectures is often a combination of message passing across the nodes and some form of threads within a node to take full advantage of the shared memory. This type of programming is discussed in some detail in the section on programming models.

Figure 4. A generic hybrid architecture.

Crucial to the performance of an MPP application is the memory hierarchy. Even in computers classified as uniform memory access machines, there are levels of memory associated with the architecture.
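The combination just described, message passing across nodes plus threads within a node, can be imitated in miniature using only the Python standard library. This is an illustrative stand-in, not real MPI or OpenMP code: a `Queue` plays the role of the message-passing layer, and threads share a list as their "node memory".

```python
# Toy sketch of the hybrid model. In a real code, the inter-node layer
# would be MPI and the intra-node layer OpenMP or pthreads; here a Queue
# and threading stand in for both, so the example is self-contained.
import threading
from queue import Queue

N_NODES, THREADS_PER_NODE = 2, 4

def node(rank, results):
    partial = [0] * THREADS_PER_NODE    # shared memory within the "node"
    def work(t):
        # each thread sums its strided share of the range 0..99
        partial[t] = sum(range(t, 100, THREADS_PER_NODE))
    threads = [threading.Thread(target=work, args=(t,))
               for t in range(THREADS_PER_NODE)]
    for th in threads: th.start()
    for th in threads: th.join()
    results.put((rank, sum(partial)))   # "message" sent across nodes

results = Queue()
for rank in range(N_NODES):
    node(rank, results)
total = sum(results.get()[1] for _ in range(N_NODES))
print(total)   # 2 nodes x sum(0..99) = 9900
```

The point of the sketch is the two-level structure: shared mutable state inside a node, explicit messages between nodes.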
As an example of this hierarchy, consider the isosceles trapezoid of memory shown in Figure 5. It is useful in understanding the following procedure, which describes the process of moving data between memory and the microprocessor. The microprocessor requests the value of B(1) from data cache. Data cache does not have B(1), so it requests B(1) from secondary cache. Secondary cache does not have B(1) either; this is called a secondary cache miss. It retrieves a line (8 words, the line size for secondary cache) from local memory.
This line includes elements B(1) through B(8). Data cache then receives a line (4 words, the line size for data cache) from secondary cache; this is elements B(1) through B(4). Finally, the microprocessor receives B(1) from data cache.
When the microprocessor needs B(2) through B(4), it need only go to data cache. When the microprocessor needs B(5), data cache does not have it. Data cache requests B(5) through B(8) from secondary cache, which has them and passes them on. Data cache passes B(5) through B(8) on to the microprocessor as it gets requests for them. When the microprocessor finishes with them, it requests B(9) from data cache. Data cache requests a new line of data elements from secondary cache, which does not have them. This is the second secondary cache miss, and it is the signal to the system to begin streaming data.
Secondary cache requests another 8-word line from local memory and puts it into another of its three-line compartments.
It may end up in any of the three lines, since the selection process is random. A 4-word line is passed from secondary cache to data cache, and a single value is moved to the microprocessor. When the value of B(9) gets to the microprocessor, the situation is as illustrated in Figure 7. Because streaming has begun, data is now prefetched: secondary cache anticipates the microprocessor's continuing need for consecutive data and begins retrieving B(17) through B(24) from memory before they are requested.
Data cache requests data elements from secondary cache before it receives requests from the microprocessor.
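The walk-through above can be condensed into a toy two-level cache model. The line sizes (4 words for data cache, 8 words for secondary cache) come from the text; the three-line compartments and random replacement are omitted, so this sketch only counts line fills for a sequential sweep over B.

```python
# Toy model of the two-level cache walk-through: sequential access to
# B(1)..B(n), with 4-word data-cache lines and 8-word secondary-cache
# lines (sizes taken from the text; replacement policy ignored).

def simulate_sequential_access(n_elements, l1_line=4, l2_line=8):
    """Count line fills while streaming B(1)..B(n) sequentially."""
    l1_resident = set()      # line numbers resident in data cache
    l2_resident = set()      # line numbers resident in secondary cache
    l1_misses = l2_misses = 0
    for i in range(n_elements):              # i indexes element B(i+1)
        if i // l1_line not in l1_resident:  # data-cache miss
            l1_misses += 1
            l1_resident.add(i // l1_line)
            if i // l2_line not in l2_resident:  # secondary-cache miss
                l2_misses += 1
                l2_resident.add(i // l2_line)
    return l1_misses, l2_misses

print(simulate_sequential_access(32))   # (8, 4)
```

For 32 consecutive elements the model fills 8 data-cache lines but takes only 4 secondary cache misses, which is why streaming consecutive data is so effective.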
As long as the microprocessor continues to request consecutive elements of B, the data will be ready with a minimum of delay. The process of streaming data between local memory and the functional units in the microprocessor continues until the DO loop is completed for all N values.

In general, one can obtain a larger amount of total memory on an MPP machine than on an SMP machine, and the cost per megabyte is usually much cheaper.
At the time it was released, its memory was larger than the memory available on shared memory machines. As discussed in a number of the applications, the need for large memory is frequently a major reason for moving to an MPP machine. Another important part of the supercomputer is the system by which the processors share data: the interconnect network. Interconnection networks can be classified as static or dynamic. In a static network, processors are connected directly to other processors via a specific topology. In a dynamic network, processors are connected dynamically, using switches and other links that can establish paths between processors and banks of memory.
In the case of static interconnects, the data transfer characteristics depend on the explicit topology. Some popular configurations for the interconnect include 2D meshes, linear arrays, rings, hypercubes, and so on.
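To make the topology dependence concrete, here is a small sketch (the helper names are this example's own) computing hop counts for two of the static topologies just listed. On a ring, the distance is the shorter way around; on a hypercube, it is the Hamming distance between the binary node labels.

```python
def ring_hops(a, b, p):
    """Shortest hop count between processors a and b on a p-processor ring."""
    d = abs(a - b) % p
    return min(d, p - d)      # go whichever way around is shorter

def hypercube_hops(a, b):
    """Hop count on a hypercube: Hamming distance of the node labels."""
    return bin(a ^ b).count("1")

print(ring_hops(0, 5, 8))             # 3 (the short way around)
print(hypercube_hops(0b000, 0b101))   # 2 (two bits differ)
```

The two formulas already show why data layout mattered: on the ring, worst-case distance grows linearly with the machine size, while on the hypercube it grows only logarithmically.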
For some of these configurations, part of the job of the parallel programmer was to design the data layout to fit the machine. This was because the non-uniform nature of the interprocessor communication made an enormous difference in the performance of a given application.
As technology improved, certain parallel computer designs masked these communication latencies from the user or significantly reduced them with improved interconnects. The Cray T3D and T3E lines of parallel computers greatly reduced this obstacle to performance by arranging the processors in a 3-dimensional torus; a diagram of the torus is given in Figure 8. In the toroidal configuration, each processing element is connected by a bi-directional interconnect not only to its neighboring processors, but also by additional connections that join the ends of the array in each dimension, hence the name 3D torus.
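The wraparound neighbor structure is easy to state precisely. The sketch below (a generic illustration, not Cray-specific) lists the six neighbors of a processing element on a 3D torus, with modular arithmetic providing the end-around connections.

```python
def torus_neighbors(coord, dims):
    """Six neighbors of a processing element on a 3D torus.

    coord: (x, y, z) position; dims: (nx, ny, nz) torus extents.
    The modulo provides the wraparound links that join the array ends.
    """
    x, y, z = coord
    nx, ny, nz = dims
    offsets = [(1, 0, 0), (-1, 0, 0), (0, 1, 0),
               (0, -1, 0), (0, 0, 1), (0, 0, -1)]
    return [((x + dx) % nx, (y + dy) % ny, (z + dz) % nz)
            for dx, dy, dz in offsets]

print(torus_neighbors((0, 0, 0), (4, 4, 4)))
```

Note that the corner element (0, 0, 0) has the full six neighbors, including (3, 0, 0) across the wraparound link; on a plain 3D mesh it would have only three.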
The HPS offers the highly attractive feature that the available bandwidth between any two nodes in a system remains nearly constant regardless of the location of the nodes or the size of the overall system, up to the maximum size of the switch available on current SP configurations. This is achieved by a multistage interconnect network, which adds switching stages to increase aggregate bandwidth as the number of nodes increases.
Switches use various techniques for directing the data from one port of the switch (connected to a node) to another. These techniques include temporal division, spatial division, and frequency-domain division. In a multistage network such as the HPS, additional links are added as nodes are added. The number of hops, H, increases logarithmically with the number of nodes, but this is compensated by a logarithmic increase in the number of links, N, so the bandwidth between any two nodes remains constant. A common and realistic indicator of aggregate network capacity is the bisection bandwidth: the maximum possible bandwidth across a minimum network bisection, where the bisection cuts the network into two equal parts.
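These scaling claims can be illustrated with the idealized textbook formulas for bisection size; real machines differ in detail, so treat this as a sketch under those standard assumptions (a square 2D mesh, a full hypercube of 2^k nodes).

```python
import math

def bisection_links(topology, p):
    """Links cut by a minimal bisection of a p-node network (idealized)."""
    if topology == "ring":
        return 2                    # constant, independent of p
    if topology == "2d-mesh":
        return int(math.isqrt(p))   # one row of links in a sqrt(p) x sqrt(p) mesh
    if topology == "hypercube":
        return p // 2               # one dimension's worth of links
    raise ValueError(topology)

def multistage_hops(p):
    """Hops through a multistage network grow as log2(p)."""
    return max(1, math.ceil(math.log2(p)))

for p in (16, 64, 256):
    print(p, bisection_links("ring", p), bisection_links("2d-mesh", p),
          bisection_links("hypercube", p), multistage_hops(p))
```

Running the loop shows the contrast the text describes: the ring's bisection stays at 2 links, the mesh grows as the square root of the node count, and the hypercube grows linearly, while hop count through a multistage network grows only logarithmically.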
For large systems, this provides an important advantage over direct-connect networks such as rings and meshes, where the bisection bandwidth increases much more slowly. The constant bandwidth between any two nodes on an SP system also eliminates concerns about "near" and "far" nodes; as far as an application programmer is concerned, all nodes on an SP are "equidistant" from each other.

The first level of understanding performance issues requires a few definitions. The basic measure of computer performance is peak performance, namely the maximum performance rate possible for a given operation, usually expressed in operations per second with a prefix such as mega (10^6).
The performance in operations per second (OPS) is based on the clock rate of the processor and the number of operations per clock cycle. At the chip level, there are various ways to increase the number of operations per clock cycle and thus the overall chip OPS capability. This measure of supercomputing performance is often jokingly referred to, in Jack Dongarra's phrase, as "the speed that the vendor is guaranteed never to exceed."
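The peak-performance arithmetic is simply clock rate times operations per cycle; a one-line sketch, using a hypothetical chip rather than any machine from the table, makes the definition concrete.

```python
def peak_mflops(clock_mhz, flops_per_cycle):
    """Theoretical peak = clock rate (MHz) x floating point ops per cycle."""
    return clock_mhz * flops_per_cycle

# Hypothetical chip: 300 MHz clock, 2 floating point operations per cycle
# (e.g. a fused multiply-add retired every cycle).
print(peak_mflops(300, 2))   # 600 MFlops peak
```

Every real application runs below this number; how far below is the subject of the sections that follow.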
Theoretical peak performance for some machines used in the applications chapters. (Some machines are available with more processors and faster clocks than given here.)

Perhaps the next most important hardware factor affecting performance is communication speed: on the chip, between local memory and the chip, and between nodes.