Chapter 1 Parallel Computer Models

Prof. D. P. Theng, GHRCE

TAE Components and Dates of Submission

TAE        TAE Component                    Date of Submission
TAE - I    Quiz Test                        Second week - July 2013
TAE - II   Assignment                       Sept 2013
TAE - III  Technical Presentation           Fourth week - July 2013
TAE - IV   Attendance                       Sept 2013
TAE - V    PPT on Paper Review              First week - Aug 2013
TAE - VI   Chapter Review                   Fourth week - Aug 2013
TAE - VII  Guest Lecture/Industrial Visit   Sept 2013



Early computing was entirely mechanical:
- abacus (about 500 BC)
- mechanical adder/subtracter (Pascal, 1642)
- difference engine design (Babbage, 1827)
- binary mechanical computer (Zuse, 1941)
- electromechanical decimal machine (Aiken, 1944)

Mechanical and electromechanical machines have limited speed and reliability because of their many moving parts. Modern machines use electronics for most information transmission.

Computing is normally thought of as being divided into generations. Each successive generation is marked by sharp changes in hardware and software technologies. With some exceptions, most of the advances introduced in one generation are carried through to later generations. We are currently in the fifth generation.

First Generation

Technology and Architecture
- Vacuum tubes and relay memories
- CPU driven by a program counter (PC) and accumulator
- Machines had only fixed-point arithmetic

Software and Applications
- Machine and assembly language
- Single user at a time
- No subroutine linkage mechanisms
- Programmed I/O required continuous use of the CPU

Representative systems: ENIAC, Princeton IAS, IBM 701

Second Generation

Technology and Architecture
- Discrete transistors and core memories
- I/O processors, multiplexed memory access
- Floating-point arithmetic available
- Register Transfer Language (RTL) developed

Software and Applications
- High-level languages (HLL): FORTRAN, COBOL, ALGOL, with compilers and subroutine libraries
- Still mostly single user at a time, but in batch mode

Representative systems: CDC 1604, UNIVAC LARC, IBM 7090

Third Generation

Technology and Architecture
- Integrated circuits (SSI/MSI)
- Microprogramming
- Pipelining, cache memories, lookahead processing

Software and Applications
- Multiprogramming and time-sharing operating systems
- Multi-user applications

Representative systems: IBM 360/370, CDC 6600, TI ASC, DEC PDP-8

Fourth Generation

Technology and Architecture
- LSI/VLSI circuits, semiconductor memory
- Multiprocessors, vector supercomputers, multicomputers
- Shared or distributed memory
- Vector processors

Software and Applications
- Multiprocessor operating systems, languages, compilers, and parallel software tools

Representative systems: VAX 9000, Cray X-MP, IBM 3090, BBN TC2000

Fifth Generation

Technology and Architecture
- ULSI/VHSIC processors, memory, and switches
- High-density packaging
- Scalable architectures
- Vector processors

Software and Applications
- Massively parallel processing
- Grand challenge applications
- Heterogeneous processing

Representative systems: Fujitsu VPP500, Cray MPP, TMC CM-5, Intel Paragon

Elements of a Modern Computer System

The hardware, software, and programming elements of modern computer systems can be characterized by looking at a variety of factors, including:
- Computing problems
- Algorithms and data structures
- Hardware resources
- Operating systems
- System software support
- Compiler support

Numerical computing
- complex mathematical formulations
- tedious integer or floating-point computation

Transaction processing
- accurate transactions
- large database management
- information retrieval

Logical reasoning
- logic inferences
- symbolic manipulations

Traditional algorithms and data structures are designed for sequential machines. New, specialized algorithms and data structures are needed to exploit the capabilities of parallel architectures. These often require interdisciplinary interactions among theoreticians, experimentalists, and programmers.

The architecture of a system is shaped only partly by the hardware resources. The operating system and applications also significantly influence the overall architecture. Not only must the processor and memory architectures be considered, but also the architecture of the device interfaces (which often include their own advanced processors).

Operating systems manage the allocation and deallocation of resources during user program execution. UNIX, Mach, and OSF/1 provide support for:
- multiprocessors and multicomputers
- multithreaded kernel functions
- virtual memory management
- file subsystems
- network communication services

An OS plays a significant role in mapping hardware resources to algorithms and data structures.

Compilers, assemblers, and loaders are the traditional tools for developing programs in high-level languages. Together with the operating system, these tools determine the binding of resources to applications, and the effectiveness of this binding determines the efficiency of hardware utilization and the system's programmability. Most programmers still employ a sequential mindset, abetted by a lack of popular parallel software support.

Parallel software can be developed using entirely new languages designed specifically with parallel support as their goal, or by using extensions to existing sequential languages. New languages have obvious advantages (like new constructs specifically for parallelism), but they require additional programmer education and system software. The most common approach is to extend an existing language.



Preprocessors
- use existing sequential compilers and specialized libraries to implement parallel constructs

Precompilers
- perform some program flow analysis, dependence checking, and limited parallel optimizations

Parallelizing Compilers
- require full detection of parallelism in source code, and transformation of sequential code into parallel constructs

Compiler directives are often inserted into source code to aid the compiler's parallelizing efforts, as in the sketch below.
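The directive-based style can be illustrated with OpenMP, a widely used directive extension of C that is not named in these notes; the loop, array sizes, and compile flag below are illustrative assumptions, not material from the course.

```c
/* Illustrative sketch only: OpenMP is one common directive-based extension of C.
   Compile with, e.g., gcc -fopenmp saxpy.c                                     */
#include <stdio.h>

#define N 1000000

int main(void) {
    static double x[N], y[N];
    double a = 2.0;

    for (int i = 0; i < N; i++) {   /* initialize the vectors */
        x[i] = 1.0;
        y[i] = 2.0;
    }

    /* The directive tells a parallelizing compiler that the iterations of this
       loop are independent and may be distributed across processors.          */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        y[i] = a * x[i] + y[i];

    printf("y[0] = %f\n", y[0]);
    return 0;
}
```

Without the directive (or without the OpenMP flag), the same source still compiles and runs sequentially, which is exactly the appeal of extending an existing language rather than inventing a new one.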

Six layers for computer system development, based on a classification by Lionel Ni (1990).

Hardware configurations differ from machine to machine, even among those of the same model. The address space of a processor in a computer system varies among different architectures. The HLL and communication models depend on the architectural choice. From the programmer's viewpoint, these two layers should be architecture-transparent.


Hardware configurations differ from machine to machine (even with the same Flynn classification). Address spaces of processors vary among different architectures, depend on the memory organization, and should match the target application domain. The communication model and language environments should ideally be machine-independent, to allow porting to many computers with minimal conversion costs. Application developers prefer architectural transparency.

Architecture has gone through evolutionary, rather than revolutionary, change. Sustaining features are those that have been proven to improve performance. Starting with the von Neumann architecture (strictly sequential), architectures have evolved to include lookahead processing, parallelism, and pipelining.

Architectural Evolution (figure): scalar sequential execution, then I/E overlap, lookahead, and functional parallelism; functional parallelism leads to multiple functional units and pipelining, then to implicit and explicit vector processing (memory-to-memory and register-to-register), and finally to SIMD machines (associative processor, processor array) and MIMD machines (multiprocessor, multicomputer).



- Single instruction, single data stream (SISD): conventional sequential machines
- Single instruction, multiple data streams (SIMD): vector computers with scalar and vector hardware
- Multiple instructions, multiple data streams (MIMD): parallel computers
- Multiple instructions, single data stream (MISD): systolic arrays

Among parallel machines, MIMD is most popular, followed by SIMD, and finally MISD.

Only SIMD and MIMD are applicable to parallel computers.



Single Instruction, Single Data (SISD)
A single processor with a single instruction stream, operating sequentially on a single data stream. SISD machines are the traditional single-processor, sequential computers, also known as the von Neumann architecture, as opposed to "non-von" parallel computers.



Intrinsic parallel computers execute in MIMD mode. Two classes:
- Shared-memory multiprocessors
- Message-passing multicomputers

Processor communication:
- Shared variables in a common memory (multiprocessor)
- Each node in a multicomputer has a processor and a private local memory, and communicates with other processors through message passing (a minimal sketch follows below)
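As an illustration of the message-passing model, here is a minimal sketch using MPI, a message-passing library standard that these notes do not mention; the ranks, tag, and value are arbitrary choices for the example.

```c
/* Minimal message-passing sketch using MPI (an assumption: MPI is not named in
   these notes).  Node 0 sends one integer from its private memory to node 1.
   Build with mpicc, run with: mpirun -np 2 ./a.out                            */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, value;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* which node am I? */

    if (rank == 0) {
        value = 42;                         /* data held in node 0's local memory */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("node 1 received %d from node 0\n", value);
    }

    MPI_Finalize();
    return 0;
}
```

The key contrast with a multiprocessor is that node 1 cannot simply read node 0's variable; the data must be moved explicitly over the interconnection network.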



Multiple Instruction, Multiple Data (MIMD)
Each processor can independently execute its own instruction stream on its own local data stream. MIMD machines are asynchronous, with more coarse-grained parallelism: they run a smaller number of parallel processes, one for each processor, operating on the large chunks of data local to each processor.




SIMD architecture
- A single instruction is applied to a vector (one-dimensional array) of operands.
- Two families:
  - Memory-to-memory: operands flow from memory to the vector pipelines and back to memory
  - Register-to-register: vector registers are used to interface between memory and the functional pipelines



Single Instruction, Multiple Data (SIMD)
A single instruction stream is broadcast to every processor; all processors execute the same instructions in lock-step on their own local data streams. SIMD machines are synchronous, with more fine-grained parallelism: they run a large number of parallel processes, one for each data element in a parallel vector or array.




- Provide synchronized vector processing
- Utilize spatial parallelism instead of temporal parallelism
- Achieved through an array of processing elements (PEs), as in the conceptual sketch below
- Can be implemented using associative memory
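A plain C loop can only hint at what a PE array does; the sketch below is a conceptual illustration (the array size and values are made up), with comments marking where a real SIMD machine would apply one broadcast instruction to all elements at once.

```c
/* Conceptual sketch only: on a real SIMD array, the control unit broadcasts the
   single instruction "c = a + b" and every PE applies it to its own local
   element simultaneously; here the loop visits the elements one by one.       */
#include <stdio.h>

#define NUM_PE 8   /* imagine one processing element per array slot */

int main(void) {
    int a[NUM_PE] = {1, 2, 3, 4, 5, 6, 7, 8};
    int b[NUM_PE] = {8, 7, 6, 5, 4, 3, 2, 1};
    int c[NUM_PE];

    for (int pe = 0; pe < NUM_PE; pe++)   /* lock-step in hardware, serial here */
        c[pe] = a[pe] + b[pe];            /* same instruction, different local data */

    for (int pe = 0; pe < NUM_PE; pe++)
        printf("PE %d: %d\n", pe, c[pe]);
    return 0;
}
```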

The ideal performance of a computer system demands a perfect match between machine capability and program behavior. Performance depends on:
- hardware technology
- architectural features
- efficient resource management
- algorithm design
- data structures
- language efficiency
- programmer skill
- compiler technology

Turnaround time depends on:
- disk and memory accesses
- input and output
- compilation time
- operating system overhead
- CPU time

Since I/O and system overhead frequently overlap processing by other programs, it is fair to consider only the CPU time used by a program, and the user CPU time is the most important factor.

The CPU is driven by a clock with a constant cycle time τ (usually measured in nanoseconds). The inverse of the cycle time is the clock rate (f = 1/τ, measured in megahertz). The size of a program is determined by its instruction count, Ic, the number of machine instructions to be executed by the program. Different machine instructions require different numbers of clock cycles to execute. CPI (cycles per instruction) is thus an important parameter.

It is easy to determine the average number of cycles per instruction for a particular processor if we know the frequency of occurrence of each instruction type. Of course, any estimate is valid only for a specific set of programs (which defines the instruction mix), and then only if there is a sufficiently large number of instructions. In general, the term CPI is used with respect to a particular instruction set and a given program mix.

The time required to execute a program containing Ic instructions is just T = Ic × CPI × τ. Each instruction must be fetched from memory, decoded, its operands fetched from memory, the instruction executed, and the results stored. The time required to access memory is called the memory cycle time, which is usually k times the processor cycle time τ. The value of k depends on the memory technology and the processor-memory interconnection scheme.

The processor cycles required for each instruction (CPI) can be attributed to cycles needed for instruction decode and execution (p), and cycles needed for memory references (m × k).

The total time needed to execute a program can then be rewritten as T = Ic × (p + m × k) × τ.

The five performance factors (Ic, p, m, k, τ) are influenced by four system attributes:
- instruction-set architecture (affects Ic and p)
- compiler technology (affects Ic, p, and m)
- CPU implementation and control (affects p × τ)
- cache and memory hierarchy (affects the memory access latency, k × τ)




Total CPU time can be used as a basis for estimating the execution rate of a processor.

If C is the total number of clock cycles needed to execute a given program, then the total CPU time can be estimated as T = C × τ = C / f. Other relationships are easily observed:
- CPI = C / Ic
- T = Ic × CPI × τ
- T = Ic × CPI / f

Processor speed is often measured in terms of millions of instructions per second, frequently called the MIPS rate of the processor:

MIPS rate = Ic / (T × 10^6) = f / (CPI × 10^6) = (f × Ic) / (C × 10^6)
The MIPS rate is directly proportional to the clock rate and inversely proportional to the CPI. All four system attributes (instruction set, compiler, processor, and memory technologies) affect the MIPS rate, which also varies from program to program.

The system throughput, Ws, is the number of programs a system can execute per unit time, in programs per second. CPU throughput, Wp, is defined as

Wp = f / (Ic × CPI)

In a multiprogrammed system, the system throughput is often less than the CPU throughput.
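Continuing with the same assumed numbers as the earlier sketch, the MIPS rate and CPU throughput formulas can be evaluated as follows.

```c
/* MIPS rate = f / (CPI * 10^6)  and  CPU throughput Wp = f / (Ic * CPI).
   All input values are assumed for illustration only.                         */
#include <stdio.h>

int main(void) {
    double f   = 100e6;    /* clock rate: 100 MHz            */
    double cpi = 2.8;      /* average cycles per instruction */
    double Ic  = 50e6;     /* instructions per program       */

    double mips = f / (cpi * 1e6);   /* millions of instructions per second */
    double Wp   = f / (Ic * cpi);    /* programs executed per second        */

    printf("MIPS rate = %.2f\n", mips);    /* 100e6 / 2.8e6 = 35.71  */
    printf("Wp = %.4f programs/s\n", Wp);  /* 100e6 / 1.4e8 = 0.7143 */
    return 0;
}
```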

Example: VAX 11/780 vs. IBM RS/6000

Machine        Clock     Performance   CPU Time
VAX 11/780     5 MHz     1 MIPS        12x seconds
IBM RS/6000    25 MHz    18 MIPS       x seconds

- The instruction count of the code on the RS/6000 is 1.5 times that of the code on the VAX.
- The average CPI on the VAX is assumed to be 5.
- The average CPI on the RS/6000 is assumed to be 1.39.
- The VAX has a typical CISC architecture; the RS/6000 has a typical RISC architecture.

These figures are worked through in the sketch below.
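A quick check of the table using T = Ic × CPI / f; the absolute instruction count is left symbolic (one arbitrary unit), since the table fixes only the ratio of the two CPU times.

```c
/* Check of the VAX 11/780 vs IBM RS/6000 comparison using T = Ic * CPI / f.   */
#include <stdio.h>

int main(void) {
    double Ic_vax = 1.0;           /* arbitrary unit of instruction count       */
    double Ic_rs  = 1.5 * Ic_vax;  /* RS/6000 executes 1.5x as many instructions */

    double T_vax = Ic_vax * 5.0  / 5e6;    /* CPI 5,    clock 5 MHz             */
    double T_rs  = Ic_rs  * 1.39 / 25e6;   /* CPI 1.39, clock 25 MHz            */

    printf("MIPS (VAX)     = %.1f\n", 5e6  / (5.0  * 1e6));  /*  1.0            */
    printf("MIPS (RS/6000) = %.1f\n", 25e6 / (1.39 * 1e6));  /* ~18.0           */
    printf("T_vax / T_rs   = %.1f\n", T_vax / T_rs);         /* ~12, as in table */
    return 0;
}
```

Even though the RISC machine executes 1.5 times as many instructions, its much lower CPI and faster clock make it finish roughly 12 times sooner, which is the point of the comparison.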

Programmability depends on the programming environment provided to the users. Conventional computers are used in a sequential programming environment, with tools developed for a uniprocessor computer. Parallel computers need parallel tools that allow specification or easy detection of parallelism, and operating systems that can perform parallel scheduling of concurrent events, shared memory allocation, and shared peripheral and communication links.

Implicit parallelism: use a conventional language (like C, Fortran, Lisp, or Pascal) to write the program, and use a parallelizing compiler to translate the source code into parallel code. The compiler must detect parallelism and assign target machine resources. Success relies heavily on the quality of the compiler. Kuck (U. of Illinois) and Kennedy (Rice U.) used this approach.

Explicit parallelism: the programmer writes explicit parallel code using parallel dialects of common languages. The compiler has less need to detect parallelism, but must still preserve existing parallelism and assign target machine resources. Seitz (Caltech) and Dally (MIT) used this approach.

Parallel extensions of conventional high-level languages. Integrated environments that provide:
- different levels of program abstraction
- validation, testing, and debugging
- performance prediction and monitoring
- visualization support to aid program development, performance measurement, and graphics display and animation of computational results

Two categories:
- Multiprocessors
- Multicomputers

They are distinguished by having a shared memory or unshared, distributed memories.

Three shared-memory multiprocessor models:
- Uniform memory access (UMA)
- Nonuniform memory access (NUMA)
- Cache-only memory architecture (COMA)

UMA:
- The physical memory is uniformly shared by all the processors.
- All processors have equal access time to all memory words.
- Suitable for general-purpose and time-sharing applications.
- Synchronization is done through shared variables in the common memory.
- Two types: symmetric and asymmetric.

NUMA:
- Access time varies with the location of the memory word.
- Shared memory is distributed to all processors.
