Chameli Devi Group of Institutions
Department of Computer Science and Engineering
Subject Notes
Subject Code: CS-6001    Subject Name: Advanced Computer Architecture

UNIT-1

Flynn's Classification
Flynn's classification distinguishes multiprocessor computer architectures along two independent dimensions: the instruction stream and the data stream. An instruction stream is the sequence of instructions executed by the machine, and a data stream is the sequence of data (inputs and partial or temporary results) used by that instruction stream. Each dimension can be in only one of two states: Single or Multiple. Flynn's classification depends on the distinction between the control unit and the data processing unit rather than on operational and structural interconnections. The four categories of Flynn's classification and their characteristic features are given below.

a) Single Instruction Stream, Single Data Stream (SISD)
Figure 1.1 represents the organization of a simple SISD computer, which has one control unit, one processor unit and a single memory unit.
Figure 1.1: SISD processor organization
• SISD machines are also called scalar processors: they execute one instruction at a time, and each instruction has only one set of operands.
• Single instruction: only one instruction stream is acted on by the CPU during any one clock cycle.
• Single data: only one data stream is used as input during any one clock cycle.

b) Single Instruction Stream, Multiple Data Stream (SIMD)
• A type of parallel computer.
• Single instruction: all processing units execute the same instruction issued by the control unit at any given clock cycle, as shown in figure 1.2, where multiple processors execute the instruction given by one control unit.
• Multiple data: each processing unit can operate on a different data element; the processors are connected to shared memory or to an interconnection network that supplies multiple data streams to the processing units.
Figure 1.2: SIMD processor organization

c) Multiple Instruction Stream, Single Data Stream (MISD)
• A single data stream is fed into multiple processing units.
• Each processing unit operates on the data independently via an independent instruction stream, as shown in figure 1.3: a single data stream is forwarded to different processing units, each connected to its own control unit and executing the instructions that control unit issues.
Figure 1.3: MISD processor organization
• Thus, in these computers the same data flows through a linear array of processors executing different instruction streams.

d) Multiple Instruction Stream, Multiple Data Stream (MIMD)
• Multiple instructions: every processor may be executing a different instruction stream.
• Multiple data: every processor may be working with a different data stream; as shown in figure 1.4, the multiple data streams are provided by shared memory.
• MIMD machines can be categorized as loosely coupled or tightly coupled depending on how data and control are shared.
• Execution can be synchronous or asynchronous, deterministic or non-deterministic.
• Examples: most current supercomputers, networked parallel computer "grids" and multiprocessor SMP computers, including some types of PCs.
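To make the SISD/SIMD contrast concrete, here is a minimal sketch (not part of the original notes) comparing a scalar, one-element-at-a-time loop with a data-parallel operation in NumPy. The array names and sizes are arbitrary choices for illustration.

```python
# Illustrative only: contrasting a scalar (SISD-style) loop with a data-parallel
# (SIMD-style) operation using NumPy.
import numpy as np

a = np.arange(1_000_000, dtype=np.float64)
b = np.arange(1_000_000, dtype=np.float64)

# SISD-style: one instruction operates on one pair of operands per step.
c_scalar = np.empty_like(a)
for i in range(len(a)):
    c_scalar[i] = a[i] + b[i]

# SIMD-style: a single (vector) operation is applied to many data elements;
# NumPy dispatches it to vectorized code under the hood.
c_vector = a + b

assert np.allclose(c_scalar, c_vector)
```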
• Memory access across the link is slower.
• If cache coherency is maintained, the machine may also be called CC-NUMA (Cache Coherent NUMA).
Figure 1.6: Shared Memory (NUMA)

The COMA model (Cache Only Memory Access): the COMA model is a special case of a NUMA machine in which the distributed main memories are converted into caches. All the caches form a global address space, and there is no memory hierarchy at each processor node.

Advantages:
• The global address space provides a user-friendly programming perspective on memory.
• Data sharing between tasks is both fast and uniform due to the proximity of memory to the CPUs.

Disadvantages:
• The primary disadvantage is the lack of scalability between memory and CPUs. Adding more CPUs geometrically increases traffic on the shared memory-CPU path and, for cache coherent systems, geometrically increases the traffic associated with cache/memory management.
• The programmer is responsible for the synchronization constructs that ensure "correct" access to global memory.
• Expense: it becomes increasingly difficult and expensive to design and produce shared memory machines with ever increasing numbers of processors.

Distributed Memory
Like shared memory systems, distributed memory systems vary widely but share a common characteristic: they require a communication network to connect inter-processor memory.
Figure 1.7: Distributed Memory Systems
• Processors have their own local memory. Memory addresses in one processor do not map to another processor, so there is no concept of a global address space across all processors.
• Because each processor has its own local memory, it operates independently.

Advantages:
• Memory is scalable with the number of processors: increase the number of processors and the size of memory increases proportionately.
• Each processor can rapidly access its own memory without interference and without the overhead incurred in maintaining cache coherency.
• Cost effectiveness: commodity, off-the-shelf processors and networking can be used.

Disadvantages:
• The programmer is responsible for many of the details associated with data communication between processors.
• It may be difficult to map existing data structures, based on global memory, onto this memory organization.

Multi-vector and SIMD Computers
A vector operand contains an ordered set of n elements, where n is called the length of the vector. Each element in a vector is a scalar quantity, which may be a floating point number, an integer, a logical value or a character. A vector processor consists of a scalar processor and a vector unit, which can be thought of as an independent functional unit capable of efficient vector operations.

Vector Supercomputer
Vector computers have hardware to perform vector operations efficiently. Operands cannot be used directly from memory; they are loaded into registers and are put back into registers after the operation. Vector hardware has the special ability to overlap or pipeline operand processing.
Figure 1.8: Architecture of a Vector Supercomputer

SIMD Computer
Synchronous parallel architectures coordinate concurrent operations in lockstep through global clocks, central control units, or vector unit controllers. A synchronous array of parallel processors is called an array processor. These processors are composed of N identical processing elements (PEs) under the supervision of a single control unit (CU). The control unit is a computer with high speed registers, local memory and an arithmetic logic unit.
An array processor is basically a single instruction, multiple data (SIMD) computer. There are N data streams, one per processor, so different data can be used in each processor. Figure 1.9 shows a typical SIMD or array processor.
Figure 1.9: Configurations of SIMD Computers
These processors contain a number of memory modules, which can be either global or dedicated to each processor; the main memory is the aggregate of the memory modules. The processing elements and memory units communicate with each other through an interconnection network. SIMD processors are especially designed for performing vector computations. SIMD has two basic architectural organizations:
a. Array processors using random access memory.
b. Associative processors using content addressable memory.

Data and Resource Dependence
Data dependence: the ordering relationship between statements is indicated by data dependence. Five types of data dependence are defined below:
1. Flow dependence: a statement S2 is flow dependent on S1 if an execution path exists from S1 to S2 and if at least one output (variable assigned) of S1 feeds in as input (operand to be used) to S2. Also called a RAW hazard and denoted S1 -> S2.
2. Anti-dependence: statement S2 is anti-dependent on statement S1 if S2 follows S1 in program order and if the output of S2 overlaps the input to S1. Also called a WAR hazard and denoted S1 |-> S2.
3. Output dependence: two statements are output dependent if they produce (write) the same output variable. Also called a WAW hazard and denoted S1 O-> S2.
4. I/O dependence: read and write are I/O statements. I/O dependence occurs not because the same variable is involved but because the same file is referenced by both I/O statements.
5. Unknown dependence: the dependence relation between two statements cannot be determined at compile time.

Consider the following fragment of a program:
S1: Load R1, A
S2: Add R2, R1
S3: Move R1, R3
S4: Store B, R1
• Flow dependence: S1 to S2, S3 to S4
• Anti-dependence: S2 to S3
• Output dependence: S1 to S3

It corresponds to the execution of essentially independent jobs or programs on a parallel computer. This is practical for a machine with a small number of powerful processors, but impractical for a machine with a large number of simple processors (since each processor would take too long to process a single job).

Communication Latency
Balancing granularity and latency can yield better performance. Various latencies are attributed to the machine architecture, technology, and communication patterns used. Latency imposes a limiting factor on machine scalability. For example, memory latency increases as memory capacity increases, limiting the amount of memory that can be used with a given tolerance for communication latency.

Inter-processor Communication Latency
• Needs to be minimized by the system designer.
• Affected by signal delays and communication patterns.
For example, n communicating tasks may require n(n - 1)/2 communication links; the complexity grows quadratically, effectively limiting the number of processors in the system.

Communication Patterns
• Determined by the algorithms used and the architectural support provided.
• Patterns include permutations, broadcast, multicast and conference.
• Tradeoffs often exist between the granularity of parallelism and communication demand.

Program Graphs and Packing
A program graph is similar to a dependence graph. Nodes = {(n, s)}, where n is the node name and s is the size (larger s means larger grain size). Edges = {(v, d)}, where v is the variable being "communicated" and d is the communication delay.
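Since a program graph is built from exactly these dependence edges, a small sketch may help. The code below (not from the notes) derives the flow (RAW), anti (WAR) and output (WAW) dependences of the fragment S1-S4 above; the instruction encoding as (name, registers read, registers written) is an assumption made for the illustration.

```python
# Detect RAW / WAR / WAW dependences in the straight-line fragment S1-S4.
instrs = [
    ("S1", set(),        {"R1"}),   # Load  R1, A
    ("S2", {"R1", "R2"}, {"R2"}),   # Add   R2, R1
    ("S3", {"R3"},       {"R1"}),   # Move  R1, R3
    ("S4", {"R1"},       set()),    # Store B, R1
]

last_writer = {}           # reg -> instruction that last wrote it
readers_since_write = {}   # reg -> instructions that read it since that write
deps = []
for name, reads, writes in instrs:
    for r in reads:
        if r in last_writer:
            deps.append((last_writer[r], name, "flow (RAW)"))
        readers_since_write.setdefault(r, []).append(name)
    for w in writes:
        for rd in readers_since_write.get(w, []):
            if rd != name:
                deps.append((rd, name, "anti (WAR)"))
        if w in last_writer:
            deps.append((last_writer[w], name, "output (WAW)"))
        last_writer[w] = name
        readers_since_write[w] = []

for src, dst, kind in deps:
    print(f"{src} -> {dst}: {kind}")
# Prints S1->S2 flow, S2->S3 anti, S1->S3 output, S3->S4 flow,
# matching the dependence list given above.
```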
Packing two (or more) nodes produces a node with a larger grain size and possibly more edges to other nodes. Packing is done to eliminate unnecessary communication delays or to reduce overall scheduling overhead.

Scheduling
A schedule is a mapping of nodes to processors and start times such that communication delay requirements are observed and no two nodes execute on the same processor at the same time. Some general scheduling goals:
• Schedule all fine-grain activities in a node to the same processor to minimize communication delays.
• Select grain sizes for packing to achieve better schedules for a particular parallel machine.

Node Duplication
Grain packing may potentially eliminate interprocessor communication, but it may not always produce a shorter schedule. By duplicating nodes (that is, executing some instructions on multiple processors), we may eliminate some interprocessor communication and thus produce a shorter schedule.

Program Flow Mechanisms
Conventional machines use a control flow mechanism, in which the order of program execution is explicitly stated in the user programs. In dataflow machines, instructions are executed by determining operand availability.

Data Flow Features
No need for:
• shared memory
• a program counter
• a control sequencer
Special mechanisms are required to:
• detect data availability
• match data tokens with the instructions needing them
• enable the chain reaction of asynchronous instruction execution

A Dataflow Architecture
The Arvind machine (MIT) has N PEs and an N-by-N interconnection network. Each PE has a token-matching mechanism that dispatches only instructions whose data tokens are available. Each datum is tagged with:
• the address of the instruction to which it belongs
• the context in which the instruction is being executed
Tagged tokens enter the PE through a local path (pipelined) and can also be communicated to other PEs through the routing network. Instruction addresses effectively replace the program counter of a control flow machine, and the context identifier effectively replaces the frame base register. Since the dataflow machine matches the data tags from one instruction with its successors, synchronized instruction execution is implicit.

Demand-Driven Mechanisms
Data-driven machines select instructions for execution based on the availability of their operands; this is essentially a bottom-up approach. Demand-driven machines take a top-down approach, attempting to execute the instruction (a demander) that yields the final result. This triggers the execution of the instructions that yield its operands, and so forth. The demand-driven approach matches naturally with functional programming languages (e.g. LISP and SCHEME).

Pattern-driven computers: an instruction is executed when a particular data pattern is obtained as output. There are two types of pattern-driven (reduction) models:
String-reduction model: each demander gets a separate copy of the expression string to evaluate; each reduction step has an operator and embedded references to demand the corresponding operands; each operator is suspended while its arguments are evaluated.
Graph-reduction model: the expression graph is reduced by evaluation of branches or sub-graphs, possibly in parallel, with demanders given pointers to the results of the reductions. It is based on sharing of pointers to arguments; traversal and reversal of pointers continues until constant arguments are encountered.

System Interconnect Architecture
Various types of interconnection networks have been suggested for SIMD computers.
These are classified on the basis of network topology into two categories:
• Static Networks
• Dynamic Networks

Static versus Dynamic Networks
The topological structure of an SIMD array processor is mainly characterized by the data routing network used to interconnect the processing elements.

Network Properties and Routing
The goals of an interconnection network are to provide low latency, a high data transfer rate and wide communication bandwidth. Analysis includes latency, bisection bandwidth, data-routing functions and scalability of the parallel architecture. These networks are usually represented by a graph with a finite number of nodes linked by directed or undirected edges.
Number of nodes in the graph = network size. Number of edges (links or channels) incident on a node = node degree d (in and out degrees are counted separately when the edges are directed). The node degree reflects the number of I/O ports associated with a node and should ideally be small and constant. A network is symmetric if the topology looks the same from every node; symmetric networks are easier to implement and to program.

Diameter: the maximum distance between any two processors in the network; in other words, the maximum number of (routing) processors through which a message must pass on its way from source to destination. The diameter therefore measures the maximum delay for transmitting a message from one processor to another; since it determines communication time, the smaller the diameter, the better the network topology.

Connectivity: the number of paths possible between any two processors, i.e. the multiplicity of paths between two processors. Higher connectivity is desirable because it minimizes contention.

Arc connectivity of the network: the minimum number of arcs that must be removed to break the network into two disconnected networks. The arc connectivity of various networks is as follows:
• 1 for linear arrays and binary trees
• 2 for rings and 2-D meshes
• 4 for a 2-D torus
• d for a d-dimensional hypercube
The larger the arc connectivity, the lower the congestion and the better the network topology.

Channel width: the number of bits that can be communicated simultaneously by the interconnection bus connecting two processors.

Bisection Width and Bandwidth: to divide the network into equal halves, some communication links must be removed. The minimum number of such communication links that have to be removed is called the bisection width. The bisection width indicates the largest number of messages that can be sent simultaneously (without needing to use the same wire or routing processor at the same time and so delaying one another), no matter which processors are sending to which other processors; the larger the bisection width, the better the network topology is considered. Bisection bandwidth is the minimum volume of communication allowed between any two halves of the network with equal numbers of processors.

Data Routing Functions: a data routing network is used for inter-PE data exchange. It can be static, as in the hypercube routing network, or dynamic, such as a multistage network. Various types of data routing functions are shifting, rotating, permutation (one to one), broadcast (one to all), multicast (many to many), personalized broadcast (one to many), shuffle, exchange, etc.
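The properties above are easy to compute for small topologies. Here is a short sketch (not from the notes) that builds a ring and a d-dimensional hypercube as adjacency dictionaries and measures node degree and diameter with breadth-first search; the sizes chosen are arbitrary.

```python
# Compute node degree and diameter for a ring and a d-dimensional hypercube.
from collections import deque

def ring(n):
    return {i: [(i - 1) % n, (i + 1) % n] for i in range(n)}

def hypercube(d):
    n = 1 << d
    # Neighbours of node i differ from i in exactly one address bit.
    return {i: [i ^ (1 << b) for b in range(d)] for i in range(n)}

def diameter(adj):
    def farthest(src):
        dist = {src: 0}
        q = deque([src])
        while q:
            u = q.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    q.append(v)
        return max(dist.values())
    return max(farthest(s) for s in adj)

r, h = ring(8), hypercube(4)
print("8-node ring:    degree", len(r[0]), "diameter", diameter(r))   # 2, 4
print("4-D hypercube:  degree", len(h[0]), "diameter", diameter(h))   # 4, 4
```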
Factors Affecting Performance
Functionality – how the network supports data routing, interrupt handling, synchronization, request/message combining, and coherence.
Network latency – worst-case time for a unit message to be transferred.
Bandwidth – maximum data rate.
Hardware complexity – implementation costs for wire, logic, switches, connectors, etc.
Scalability – how easily the scheme adapts to an increasing number of processors, memories, etc.

Crossbar Switches
A crossbar switch is a circuit that enables many interconnections between elements of a parallel system at a time. A crossbar switch has a number of input and output data pins and a number of control pins. In response to control instructions set on its control input, the crossbar switch implements a stable connection of a determined input with a determined output. The diagrams of a typical crossbar switch are shown in the figures below.
Figure 1.22: Crossbar switch
Figure 1.23: Crossbar switch: a) general scheme, b) internal structure
Control instructions can request reading the state of specified input and output pins, i.e. their current connections in the crossbar switch. Crossbar switches are built with multiplexer circuits, controlled by latch registers, which are set by control instructions.

Multiport Memory
In a multiport memory system, the different memory modules and CPUs have separate buses. Each module has internal control logic to determine which port will have access to memory at any given time. Priorities are assigned to each memory port to resolve memory access conflicts.
Advantages: because of the multiple paths, a high transfer rate can be achieved.
Disadvantages: it requires expensive memory control logic and a large number of cables and connections.
Figure 1.24: Multiport memory organization

Multistage and Combining Networks
Multistage connection networks are built from small elementary crossbar switches (usually with two inputs) connected in multiple layers. The elementary crossbar switches can implement four types of connections: straight, crossed, upper broadcast and lower broadcast. All elementary switches are controlled simultaneously. A network like this is an alternative to a crossbar switch when a large number of connections (over 100) has to be switched, and the extension cost for such a network is relatively low. In such networks there is no full freedom in implementing arbitrary connections when some connections have already been set in the switch; because of this property, these networks belong to the category of so-called blocking networks.
Figure 1.25: A multistage connection network for parallel systems
To obtain nonblocking properties in a multistage connection network, the redundancy level in the circuit must be much higher. To build a nonblocking multistage n x n network, the elementary two-input switches have to be replaced by three layers of switches of sizes n x m, r x r and m x n, where m ≥ 2n - 1 and r is the number of elementary switches in layers 1 and 3. Such a switch was designed by Clos and is called the Clos network; it is commonly used to build large integrated crossbar switches.

UNIT-2

Instruction Set Architectures
The instruction set, also called the instruction set architecture (ISA), is the part of a computer that pertains to programming; it is essentially the machine language. The instruction set provides the commands that tell the processor what to do.
The instruction set consists of addressing modes, instructions, native data types, registers, the memory architecture, interrupt and exception handling, and external I/O.
Examples of instructions:
• ADD - Add two numbers together.
• COMPARE - Compare numbers.
• IN - Input information from a device, e.g., keyboard.
• JUMP - Jump to a designated RAM address.
• LOAD - Load information from RAM to the CPU.
• OUT - Output information to a device, e.g., monitor.
• STORE - Store information to RAM.
Computers are classified on the basis of the instruction set they have as follows.

CISC Scalar Processors
CISC (Complex Instruction Set Computer): a CISC-based computer will have shorter programs, written in symbolic machine language. A Complex Instruction Set Computer supplies a large number of complex instructions at the assembly language level. During the early years, memory was slow and expensive and programming was done in assembly language. Since instructions could be retrieved up to 10 times faster from a local ROM than from main memory, designers tried to put as many instructions as possible into microcode.
Figure 2.1: (a) CISC Architecture (b) RISC Architecture

RISC Scalar Processors
RISC (Reduced Instruction Set Computer): RISC is a type of microprocessor that has a relatively limited number of instructions. It is designed to perform a smaller number of types of computer instructions so that it can operate at a higher speed (perform more millions of instructions per second). Studies found that programs used only about 20% of the available instructions, making the other 80% largely unnecessary. One advantage of reduced instruction set computers is that they can execute their instructions very fast because the instructions are so simple.
Advantages:
• Speed: since a simplified instruction set allows for a pipelined, superscalar design, RISC processors often achieve 2 to 4 times the performance of CISC processors using comparable semiconductor technology and the same clock rates.
• Simpler hardware: because the instruction set of a RISC processor is so simple, it uses much less chip space; extra functions, such as memory management units or floating point arithmetic units, can also be placed on the same chip. Smaller chips allow a semiconductor manufacturer to place more parts on a single silicon wafer, which can lower the per-chip cost dramatically.

Difference between CISC and RISC

VLIW Architecture
Very long instruction word (VLIW) describes a computer processing architecture in which a language compiler or pre-processor breaks program instructions down into basic operations that can be performed by the processor in parallel (that is, at the same time). VLIW is sometimes viewed as the next step beyond the reduced instruction set computing (RISC) architecture, which also works with a limited set of relatively basic instructions and can usually execute more than one instruction at a time (a characteristic referred to as superscalar). The main advantage of VLIW processors is that complexity is moved from the hardware to the software, which means that the hardware can be smaller, cheaper, and require less power to operate.
Figure 2.2: A VLIW processor architecture and instruction format
Figure 2.3: Pipeline execution

Pipelining in VLIW Processors
Decoding of instructions is easier in VLIW than in superscalars, because each "region" of an instruction word is usually limited as to the type of instruction it can contain.
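As a rough illustration of the VLIW idea (not from the notes, and not modelled on any real machine), the sketch below treats each "long instruction word" as a bundle of operations that the compiler has already guaranteed to be independent; the register names and the 3-slot bundle format are invented for the example.

```python
# Toy model of VLIW execution: each bundle is issued as a unit, and dependences
# are resolved by the compiler (bundle ordering), not by the hardware.
regs = {"R1": 1, "R2": 2, "R3": 3, "R4": 0, "R5": 0, "R6": 0}

def alu(op, dst, a, b):
    regs[dst] = {"add": a + b, "mul": a * b, "sub": a - b}[op]

# Each bundle holds up to three operations; None marks an empty (no-op) slot.
program = [
    (("add", "R4", "R1", "R2"), ("mul", "R5", "R2", "R3"), None),
    (("sub", "R6", "R5", "R4"), None,                      None),
]

for cycle, bundle in enumerate(program, start=1):
    # Read all source operands first so the slots behave as if issued in parallel.
    reads = [(op, dst, regs[a], regs[b]) for slot in bundle if slot
             for (op, dst, a, b) in [slot]]
    for op, dst, va, vb in reads:
        alu(op, dst, va, vb)
    print(f"cycle {cycle}: {regs}")
# R6 = R5 - R4 must sit in a later bundle because it depends on both results of
# bundle 1; the compiler, not the hardware, enforces this ordering.
```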
VLIW Opportunities
"Random" parallelism among scalar operations is exploited in VLIW, instead of the regular parallelism of a vector or SIMD machine. The efficiency of the machine is entirely dictated by the success, or "goodness," of the compiler in planning the operations to be placed in the same instruction words. Different implementations of the same VLIW architecture may not be binary-compatible with each other, resulting in different latencies.

Memory Hierarchy
The requirement that copies of data items at successive memory levels be consistent is called the "coherence property."

Coherence Strategies
Write-through: as soon as a data item in Mi is modified, immediate update of the corresponding data item(s) in Mi+1, Mi+2, ..., Mn is required. This is the most aggressive (and expensive) strategy.
Write-back: the data item in Mi+1 corresponding to a modified item in Mi is not updated until it (or the block/page in Mi that contains it) is replaced or removed. This is the most efficient approach, but it cannot be used (without modification) when multiple processors share Mi+1, ..., Mn.

Memory Capacity Planning
The performance of a memory hierarchy is determined by the effective access time (Teff) to any level in the hierarchy. It depends on the hit ratios and access frequencies at successive levels.
Hit Ratio (h): a concept defined for any two adjacent levels of a memory hierarchy. When an information item is found in Mi, it is a hit; otherwise, a miss. The hit ratio hi at Mi is the probability that an information item will be found in Mi; the miss ratio at Mi is 1 - hi. The access frequency to Mi is
    fi = hi (1 - h1)(1 - h2) ... (1 - hi-1)
Effective Access Time (Teff): in practice, we wish to achieve as high a hit ratio as possible at M1. Every time a miss occurs, a penalty must be paid to access the next higher level of memory. The Teff of a memory hierarchy is given by
    Teff = Σ (i = 1 to n) fi · ti
         = h1·t1 + (1 - h1)h2·t2 + (1 - h1)(1 - h2)h3·t3 + ... + (1 - h1)(1 - h2) ... (1 - hn-1)·tn
Hierarchy Optimization: the total cost of a memory hierarchy is estimated as
    Ctotal = Σ (i = 1 to n) Ci · Si

Interleaved Memory Organization (Memory Interleaving)
Memory interleaving is a technique for compensating for the relatively slow speed of DRAM (dynamic RAM). In this technique, the main memory is divided into memory banks which can be accessed individually, without any dependency on the others.

High-Order Interleaving
Arguably the most "natural" arrangement would be to use bus lines A26-A27 as the module determiner. In other words, we would feed these two lines into a 2-to-4 decoder, the outputs of which would be connected to the Chip Select pins of the four memory modules. With this arrangement, consecutive addresses are placed within the same module.

Low-Order Interleaving
An alternative is to use the low bits for that purpose. In our example, this would entail feeding bus lines A0-A1 into the decoder, with bus lines A2-A27 tied to the address pins of the memory modules. This means that consecutive addresses are stored in consecutive modules, with the understanding that this is mod 4, i.e. we wrap back to M0 after M3.

Bandwidth
The memory bandwidth B of an m-way interleaved memory is lower-bounded by 1 and upper-bounded by m. Hellerman's approximation of B is
    B ≈ m^0.56
where m denotes the number of interleaved memory modules. This equation indicates that the effective memory bandwidth is approximately twice that of a single module when four memory modules are used.
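The following sketch (not from the notes) simply evaluates the formulas above for a hypothetical three-level hierarchy; the hit ratios, access times and per-byte costs are made-up illustrative numbers.

```python
# Evaluate f_i, T_eff, C_total and Hellerman's bandwidth approximation.
h = [0.95, 0.90, 1.0]              # hit ratios h1..h3 (last level always hits)
t = [2.0, 20.0, 200.0]             # access times t1..t3 in ns
cost_per_byte = [1e-6, 1e-8, 1e-10]
size_bytes = [64 * 2**10, 8 * 2**20, 4 * 2**30]

# Access frequency f_i = h_i * (1 - h_1) * ... * (1 - h_{i-1})
f, miss_so_far = [], 1.0
for hi in h:
    f.append(hi * miss_so_far)
    miss_so_far *= (1.0 - hi)

t_eff = sum(fi * ti for fi, ti in zip(f, t))
c_total = sum(c * s for c, s in zip(cost_per_byte, size_bytes))
print("access frequencies:", [round(x, 4) for x in f])   # sums to 1.0
print(f"T_eff  = {t_eff:.2f} ns")
print(f"C_total = ${c_total:.2f}")

# Hellerman's approximation of interleaved-memory bandwidth: B ~ m^0.56
for m in (2, 4, 8):
    print(f"m = {m}: B ~ {m ** 0.56:.2f}")   # m = 4 gives roughly 2
```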
Fault Tolerance
To obtain various interleaved memory organizations, low-order and high-order interleaving can be combined. In high-order interleaved memory, sequential addresses are allocated within each memory module. This makes it simple to isolate a faulty memory module in a memory bank of m memory modules: if one module failure is detected, the remaining modules can still be used by opening a window in the address space. This fault isolation cannot be performed in low-order interleaved memory, where a module failure may paralyze the complete memory bank; hence low-order interleaved memory is not fault tolerant.

Backplane Buses
A backplane bus interconnects processors, data storage and peripheral devices in a tightly coupled hardware configuration. The system bus must be designed to allow communication between devices on the bus without disturbing the internal activities of all the devices attached to it. These are typically "intermediate" buses, used to connect a variety of other buses to the CPU-Memory bus. They are called backplane buses because they are restricted to the backplane of the system.

Backplane Bus Specification
Backplane buses are generally connected to the CPU-Memory bus by a bus adaptor, which handles translation between the buses; commonly this is integrated into the CPU-Memory bus controller logic. While these buses can be used to control devices directly, they are mostly used as bridges to other buses. For example, AGP bus devices (i.e. video cards) act as bridges between the CPU-Memory bus and the actual display device, the monitor.
• Allow processors, memory and I/O devices to coexist on a single bus.
• Balance the demands of processor-memory communication with the demands of I/O device-memory communication.
• Interconnect the circuit boards containing processor, memory and I/O interfaces as an interconnection structure within the chassis.
• Data, address and control lines form the data transfer bus (DTB) in the VME bus.
• The DTB arbitration bus provides control of the DTB to a requester using the arbitration logic.
• The interrupt and synchronization bus is used for handling interrupts.
Figure 2.6: VME bus system
The backplane bus is made of signal lines and connectors. A special bus controller board houses the backplane control logic, such as the system clock driver, arbiter, bus timer and power driver.
Functional module: a functional module is a collection of electronic circuitry that resides on one functional board and works to achieve a special bus control function.

Asynchronous Data Transfer
All operations in a digital system are synchronized by a clock generated by a pulse generator. The CPU and an I/O interface can be designed independently or they can share a common clock; if the CPU and I/O interface share a common clock, the transfer of data between the two units is said to be synchronous.

Strobe Control
In strobe control, a control signal called the strobe pulse, supplied by one unit to the other, indicates that a data transfer has to take place. Thus, for each data transfer, a strobe is activated either by the source or by the destination unit.

Handshaking
The handshaking technique has one more control signal, used for acknowledgement. As in strobe control, one control line runs in the same direction as the data flow and indicates the validity of the data; the other control line runs in the reverse direction and indicates whether the destination has accepted the data.

Advantages of asynchronous bus transactions
• They are not clocked.
• They can accommodate a wide range of devices.
Arbitration Transaction (Bus Arbitration)
Since only one device can transmit over the bus at a time, an important issue is deciding who should access the bus. Bus arbitration is the process of determining the bus master, i.e. which device has control of the bus at a given time when there is a request for the bus from one or more devices.

Fig. 3.1: Basic structure of a pipeline

Types of Pipeline
Pipelines can be divided into two classes:
a) Static or Linear Pipelines: these pipelines can perform only one operation (e.g. addition or multiplication) at a time. The operation of a static pipeline can only be changed after the pipeline has been drained. (A pipeline is said to be drained when the last input data leave the pipeline.) For example, consider a static pipeline that is able to perform addition and multiplication: each time the pipeline switches from a multiplication operation to an addition operation, it must be drained and set up for the new operation.
b) Dynamic or Non-Linear Pipelines: a dynamic pipeline can perform more than one operation at a time. To perform a particular operation on an input data item, the data must go through a certain sequence of stages. For example, Figure 3.2 shows a three-stage dynamic pipeline that performs addition and multiplication on different data at the same time.

Mechanism for Instruction Pipelining
In the von Neumann architecture, the process of executing an instruction involves several steps. First, the control unit of a processor fetches the instruction from the cache (or from memory). Then the control unit decodes the instruction to determine the type of operation to be performed. When the operation requires operands, the control unit also determines the address of each operand and fetches them from the cache (or memory). Next, the operation is performed on the operands, and finally the result is stored in the specified location.
An instruction pipeline increases the performance of a processor by overlapping the processing of several different instructions. As shown in Figure 3.3, an instruction pipeline often consists of five stages:
• Instruction fetch (IF): retrieval of instructions from the cache (or main memory).
• Instruction decoding (ID): identification of the operation to be performed.
• Operand fetch (OF): decoding and retrieval of any required operands.
• Execution (EX): performing the operation on the operands.
• Write-back (WB): updating the destination operands.
Fig. 3.2: Stages of an instruction pipeline
An instruction pipeline overlaps these stages for different instructions to achieve a much lower total completion time, on average, for a series of instructions. A timing sketch follows below.

Improving the Throughput of an Instruction Pipeline
Three sources of architectural problems may affect the throughput of an instruction pipeline: fetching, bottleneck, and issuing problems. Some solutions are given for each.
The Fetching Problem: in general, supplying instructions rapidly through a pipeline is costly in terms of chip area. Buffering the instructions to be sent to the pipeline is one simple way of improving the overall utilization of a pipeline. The utilization of a pipeline is defined as the percentage of time that the stages of the pipeline are used over a sufficiently long period of time. A pipeline is utilized 100% of the time when every stage is used (utilized) during each clock cycle.
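The sketch below (not from the notes) prints an ideal 5-stage pipeline timing diagram, assuming one instruction issued per cycle and no stalls; the instruction names are arbitrary. It also shows the familiar k + n - 1 cycle count for n instructions in a k-stage pipeline.

```python
# Ideal 5-stage pipeline timing with no hazards or stalls.
stages = ["IF", "ID", "OF", "EX", "WB"]
instructions = ["i1", "i2", "i3", "i4"]

total_cycles = len(stages) + len(instructions) - 1   # k + n - 1
print("cycle:", "  ".join(f"{c:>3}" for c in range(1, total_cycles + 1)))
for idx, name in enumerate(instructions):
    row = []
    for cycle in range(1, total_cycles + 1):
        stage_no = cycle - 1 - idx          # which stage this instruction is in
        # stage_no < 0: not yet issued; >= k: already completed
        row.append(stages[stage_no] if 0 <= stage_no < len(stages) else "  ")
    print(f"{name}:   ", "  ".join(f"{s:>3}" for s in row))
print(f"total cycles = {total_cycles} "
      f"(vs {len(stages) * len(instructions)} without pipelining)")
```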
The Bottleneck Problem: the bottleneck problem relates to the amount of load (work) assigned to a stage in the pipeline. If too much work is applied to one stage, the time taken to complete an operation at that stage can become unacceptably long. This relatively long time spent by the instruction at one stage will inevitably create a bottleneck in the pipeline system. In such a system it is better to remove the bottleneck that is the source of congestion. One solution to this problem is to further subdivide the stage; another is to build multiple copies of that stage into the pipeline.

Pipelining Hazards
If an instruction is available but cannot be executed for some reason, a hazard exists for that instruction. These hazards create issuing problems: they prevent issuing an instruction for execution. Three types of hazard are discussed here, called structural hazards, data hazards, and control hazards. A structural hazard refers to a situation in which a required resource is not available (or is busy) for executing an instruction. A data hazard refers to a situation in which there exists a data dependency (operand conflict) with a prior instruction. A control hazard refers to a situation in which an instruction, such as a branch, causes a change in the program flow. Each of these hazards is explained next.

Structural Hazards: a structural hazard occurs as a result of resource conflicts between instructions. One type of structural hazard is due to the design of execution units. If an execution unit that requires more than one clock cycle (such as multiply) is not fully pipelined or is not replicated, then a sequence of instructions that uses that unit cannot be issued back to back (one per clock cycle). Replicating and/or pipelining execution units increases the number of instructions that can be issued simultaneously.

Data Hazards: in a non-pipelined processor, instructions are executed one by one, and the execution of an instruction is completed before the next instruction is started; in this way, the instructions are executed in the same order as the program. However, this may not be true in a pipelined processor, where instruction executions are overlapped: an instruction may be started and completed before the previous instruction is completed. The data hazard, also referred to as the data dependency problem, comes about as a result of overlapping (or changing the order of) the execution of data-dependent instructions. For example, in Figure 3.5 instruction i2 has a data dependency on i1 because it uses the result of i1 (i.e., the contents of register R2) as input data. If the instructions were sent to the pipeline in the normal manner, i2 would be in the OF stage before i1 passed through the WB stage. This would result in using the old contents of R2 for computing a new value for R5, leading to an invalid result. To have a valid result, i2 must not enter the OF stage until i1 has passed through the WB stage. In this way, as shown in Figure 3.6, the execution of i2 will be delayed for two clock cycles; in other words, instruction i2 is said to be stalled for two clock cycles. Often, when an instruction is stalled, the instructions positioned after the stalled instruction are also stalled; however, the instructions before the stalled instruction can continue execution. The delay can be accomplished in two ways. One way is to delay the OF or IF stages of i2 for two clock cycles.
There are three primary types of data hazards:
• RAW (Read After Write)
• WAR (Write After Read)
• WAW (Write After Write)

RAW: this type of data hazard was discussed previously; it refers to the situation in which i2 reads a data source before i1 writes to it. This may produce an invalid result, since the read must be performed after the write in order to obtain a valid result. For example, in the sequence
i1: Add R2, R3, R4   -- R2 = R3 + R4
i2: Add R5, R2, R1   -- R5 = R2 + R1
an invalid result may be produced if i2 reads R2 before i1 writes to it.

WAR: this refers to the situation in which i2 writes to a location before i1 reads it. For example, in the sequence
i1: Add R2, R3, R4   -- R2 = R3 + R4
i2: Add R4, R5, R6   -- R4 = R5 + R6
an invalid result may be produced if i2 writes to R4 before i1 reads it; that is, instruction i1 might use the wrong value of R4.

WAW: this refers to the situation in which i2 writes to a location before i1 writes to it. For example, in the sequence
i1: Add R2, R3, R4   -- R2 = R3 + R4
i2: Add R2, R5, R6   -- R2 = R5 + R6
the value of R2 is recomputed by i2. If the order of execution were reversed, that is, if i1 wrote to R2 after i2, then R2 would be left holding a stale value.

Tomasulo's Algorithm
The Tomasulo algorithm was first implemented in the IBM 360/91 floating point unit, which came out three years after the CDC 6600. The scheme was intended to address several issues:
• A small number of available floating point registers: the 360/91 had 4 double precision registers.
• Long memory latency: this was just prior to the introduction of caches as a standard part of the memory hierarchy.
• The cost effectiveness of functional unit hardware: with multiple copies of the same functional unit, some units were often underutilized.
• The performance penalties of name dependencies, which lead to WAW and WAR hazards.

The FPU consists of:
• Instruction buffer.
• Load and store buffers: entries in these buffers consist of
  a) Busy bit: indicating the buffer element contains an outstanding load or store operation.
  b) Tag: indicating the destination (or source, for a store) of the data for the operation.
  c) Address (not shown), provided by the integer unit.
  d) Data.
• FP register file: entries in this register file consist of
  a) Valid bit: indicating the register contains the current value of the register.
  b) Tag: indicating the current source of the register value if it is not present.
  c) Value: the register value, if present.
• FP functional units with associated reservation stations.

Fig 3.3: Floating Point Adder
In the first stage, the mantissas M1 and M2 are aligned based on the difference in the exponents E1 and E2. If |E1 - E2| = k > 0, then the mantissa with the smaller exponent is right shifted by k digit positions. In the second stage, the mantissas are added (or subtracted). In the third stage, the result is normalized so that the final mantissa has a nonzero digit after the fraction point. When necessary, this normalization is done by shifting the result mantissa and adjusting the exponent.
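A small sketch of the three adder stages just described (not from the notes), using decimal mantissas of the form 0.dddd × 10^E; the 4-digit precision is an arbitrary choice for the illustration.

```python
# Align, add, normalize: the three conceptual stages of the floating point adder.
DIGITS = 4

def fp_add(m1, e1, m2, e2):
    # Stage 1: align - right-shift the mantissa with the smaller exponent by
    # k = |E1 - E2| digit positions.
    if e1 < e2:
        m1, e1, m2, e2 = m2, e2, m1, e1
    k = e1 - e2
    m2 = m2 / (10 ** k)

    # Stage 2: add the aligned mantissas.
    m, e = m1 + m2, e1

    # Stage 3: normalize so the mantissa has a nonzero digit just after the
    # fraction point (0.1 <= |m| < 1), adjusting the exponent as needed.
    while abs(m) >= 1.0:
        m /= 10.0
        e += 1
    while 0 < abs(m) < 0.1:
        m *= 10.0
        e -= 1
    return round(m, DIGITS), e

# 0.9504 x 10^3 + 0.8200 x 10^2 = 0.10324 x 10^4 -> (0.1032, 4) after rounding
print(fp_add(0.9504, 3, 0.8200, 2))
```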
Multifunctional Arithmetic Pipelines
An arithmetic pipeline is similar to an assembly line in a factory. Data enters a stage of the pipeline, which performs some arithmetic operation on it; the results are then passed to the next stage, which performs its operation, and so on until the final computation has been performed. Each stage performs only its specific function; it does not have to be capable of performing the task of any other stage. An individual stage might be an adder, a multiplier or other hardware that performs some arithmetic function. The main variations on the arithmetic pipeline are:
Fixed arithmetic pipeline
• It is not very useful: unless the exact function performed by the pipeline is required, the CPU cannot use a fixed arithmetic pipeline.
Configurable arithmetic pipeline
• It is better suited, as it uses multiplexers at its inputs. The control unit of the CPU sets the select signals of the multiplexers to control the flow of data (i.e. the pipeline is configurable).
Vectored arithmetic unit
• A CPU may include a vectored arithmetic unit, which contains multiple functional units (for addition, multiplication, shifting, division, etc.) that perform different arithmetic operations in parallel.
• It is used to implement floating point operations, multiplication of fixed point numbers and similar computations encountered in scientific applications.

UNIT-IV

Cache Coherence
In a multiprocessor system, data inconsistency may occur among adjacent levels or within the same level of the memory hierarchy. For example, the cache and the main memory may have inconsistent copies of the same object. As multiple processors operate in parallel and multiple caches may independently hold different copies of the same memory block, a cache coherence problem arises. Cache coherence schemes help to avoid this problem by maintaining a uniform state for each cached block of data.
Figure 4.1: Cache coherence in a multiprocessor system
Let X be an element of shared data which has been referenced by two processors, P1 and P2. In the beginning, the three copies of X are consistent. If processor P1 writes new data X1 into its cache, then under a write-through policy the same copy is written immediately into the shared memory; in this case the copy in P2's cache becomes inconsistent with the memory. When a write-back policy is used, the main memory is updated only when the modified data in the cache is replaced or invalidated.
In general, there are three sources of the inconsistency problem:
• Sharing of writable data
• Process migration
• I/O activity

Sharing of writable data
When two processors (P1 and P2) have the same data element X in their local caches and one processor (P1) writes to it, then, because P1's local cache is write-through, the main memory is also updated. Now when P2 tries to read data element X, it does not get the new value, because the copy of X in P2's cache has become outdated.
Figure 4.2: Sharing of writable data

Process migration
In the first stage, the cache of P1 has data element X, whereas P2 does not have anything. A process on P2 first writes X and then migrates to P1; when the process starts reading data element X, processor P1 has an outdated copy and the process cannot read the correct value. Similarly, if a process on P1 writes to data element X and then migrates to P2, after migration the process on P2 starts reading data element X but finds an outdated version of X in the main memory.
Figure 4.3: Process migration

I/O activity
As illustrated in the figure, an I/O device is added to the bus of a two-processor multiprocessor architecture. In the beginning, both caches contain the data element X. When the I/O device receives a new element X, it stores the new element directly in the main memory. Now, when either P1 or P2 (say P1) tries to read element X, it gets an outdated copy. Likewise, if P1 writes to element X and the I/O device then tries to transmit X, the device gets an outdated copy.
Figure 4.4: I/O activity

Snoopy Bus Protocols
Snoopy protocols achieve data consistency between the cache memory and the shared memory through a bus-based memory system.
Write-invalidate and write-update policies are used for maintaining cache consistency.
Figure 4.5: Consistent copies of block X in shared memory and three processor caches
Figure 4.6: After a write-invalidate operation by P1
In this case we have three processors, P1, P2 and P3, with a consistent copy of data element X in their local cache memories and in the shared memory (Figure 4.5). Processor P1 writes X1 into its cache memory using the write-invalidate protocol, so all other copies are invalidated via the bus; an invalidated copy is denoted by 'I' (Figure 4.6) and should not be used. The write-update protocol instead updates all the cache copies via the bus; with a write-back cache, the memory copy is also updated (Figure 4.7).
Figure 4.7: After a write-update operation by P1

Cache Events and Actions
The following events and actions occur on the execution of memory-access and invalidation commands:
• Read-miss − When a processor wants to read a block that is not in its cache, a read-miss occurs. This initiates a bus-read operation. If no dirty copy exists, then the main memory, which has a consistent copy, supplies a copy to the requesting cache. If a dirty copy exists in a remote cache, that cache will restrain the main memory and send a copy to the requesting cache. In both cases, the cache copy enters the valid state after a read-miss.
• Write-hit − If the copy is in the dirty or reserved state, the write is done locally and the new state is dirty. If the copy is in the valid state, a write-invalidate command is broadcast to all caches, invalidating their copies. When the shared memory is written through, the resulting state is reserved after this first write.
• Write-miss − If a processor fails to write in its local cache, the copy must come either from the main memory or from a remote cache with a dirty block. This is done by sending a read-invalidate command, which invalidates all cache copies; the local copy is then updated and ends in the dirty state.
• Read-hit − A read-hit is always performed in the local cache memory without causing a transition of state or using the snoopy bus for invalidation.
• Block replacement − When a copy is dirty, it has to be written back to the main memory by the block replacement method. However, when the copy is in the valid, reserved or invalid state, no write-back takes place on replacement.

Directory-Based Protocols
When a multistage network is used to build a large multiprocessor with hundreds of processors, the snoopy cache protocols need to be modified to suit the network capabilities. Broadcasting is very expensive to perform in a multistage network, so the consistency commands are sent only to those caches that keep a copy of the block. This is the reason for the development of directory-based protocols for network-connected multiprocessors.
In a directory-based protocol system, the data to be shared are placed in a common directory that maintains coherence among the caches. Here the directory acts as a filter through which the processors must ask permission to load an entry from primary memory into their cache memory; if an entry is changed, the directory either updates it or invalidates the other caches holding that entry. Directory-based coherence uses a special directory to serve instead of the shared bus of the bus-based coherence protocols. Both designs use the corresponding medium (i.e. the directory or the bus) as a tool to facilitate communication between the different nodes and to guarantee that the coherence protocol works properly across all the communicating nodes. In directory-based cache coherence, this is done by using the directory to keep track of the status of every cache block, including which caches hold a copy of it.
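The sketch below (not from the notes) models the directory idea just described: a directory records which caches hold a copy of each block and, on a write, invalidates every other sharer rather than broadcasting on a bus. The class and method names, the write-through simplification and the single-value blocks are all assumptions made for the illustration.

```python
# A toy directory that tracks sharers per block and write-invalidates them.
class Directory:
    def __init__(self, n_caches):
        self.sharers = {}                                  # block -> set of cache ids
        self.caches = [dict() for _ in range(n_caches)]    # cache id -> {block: value}
        self.memory = {}                                   # block -> value

    def read(self, cache_id, block):
        if block not in self.caches[cache_id]:             # read miss
            self.caches[cache_id][block] = self.memory.get(block, 0)
            self.sharers.setdefault(block, set()).add(cache_id)
        return self.caches[cache_id][block]

    def write(self, cache_id, block, value):
        # Invalidate every other cache the directory lists as a sharer.
        for other in self.sharers.get(block, set()) - {cache_id}:
            self.caches[other].pop(block, None)
        self.sharers[block] = {cache_id}
        self.caches[cache_id][block] = value
        self.memory[block] = value                          # write-through for simplicity

d = Directory(n_caches=3)
print(d.read(0, "X"), d.read(1, "X"))   # both caches now share X
d.write(0, "X", 42)                      # P0 writes: P1's copy is invalidated
print("X" in d.caches[1])                # False - P1 must re-fetch
print(d.read(1, "X"))                    # 42
```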
Figure 4.15: Simple buffer and buffer with virtual channels
Buffers are commonly operated as FIFO queues. Therefore, once a message occupies a buffer for a channel, no other message can access the physical channel, even if that message is blocked. Alternatively, a physical channel may support several logical or virtual channels multiplexed across the physical channel. Each unidirectional virtual channel is realized by an independently managed pair of message buffers. Logically, each virtual channel operates as if it were using a distinct physical channel running at half the speed. This representation can be seen in figure 4.16. Virtual channels were originally introduced to solve the problem of deadlock in wormhole-switched networks [3, ch. 2]. Deadlock is a network state in which no messages can advance because each message requires a channel occupied by another message.
Figure 4.16: Communication lines with virtual channels
Virtual channels can also be used to improve message latency and network throughput. By allowing messages to share a physical channel, messages can make progress rather than remain blocked. For example, in figure 4.17 we see two messages crossing the physical channel between routers R1 and R2. With no virtual channels, message A prevents message B from advancing until the transmission of message A has been completed. By partitioning the buffer into virtual channels, both messages continue to make progress. The rate at which each message is forwarded is nominally one-half the rate achievable when the channel is not shared.
Figure 4.17: Packets advance with the use of virtual channels
The approach described here places no restrictions on the use of the virtual channels; when used in this manner, these buffers are referred to as virtual lanes. Virtual channels were originally introduced as a mechanism for deadlock avoidance in networks with physical cycles, and as such routing restrictions are placed on their use. Virtual channels can also have different classes, meaning that each virtual channel can have its own priority depending on the characteristics we want to provide.

Vector Processing Principles
In computing, a vector processor or array processor is a central processing unit (CPU) that implements an instruction set containing instructions that operate on one-dimensional arrays of data called vectors, in contrast to scalar processors, whose instructions operate on single data items. Vector processors can greatly improve performance on certain workloads, notably numerical simulation and similar tasks. Other CPU designs may include some multiple instructions for vector processing on multiple (vectorised) data sets, typically known as MIMD (Multiple Instruction, Multiple Data) and realized with VLIW (Very Long Instruction Word). Such designs are usually dedicated to a particular application and not commonly marketed for general-purpose computing. The Fujitsu FR-V VLIW/vector processor combines both technologies.
Vector Instruction Types
• Vector-Vector Instructions
• Vector-Scalar Instructions
• Vector-Memory Instructions
• Vector Reduction Instructions
• Gather and Scatter Instructions
• Masking Instructions
Figure 4.18: Vector instruction format
The values of A and B are either in memory or in processor registers. Each floating point adder and multiplier unit is assumed to have 4 segments, and all segment registers are initially set to zero; therefore the output of the adder is zero for the first 8 cycles, until both pipes are full. Ai and Bi are brought in and multiplied at a rate of one pair per cycle. After 4 cycles the products start being added to the output of the adder; during the next 4 cycles, zero is added to them. At the end of the 8th cycle, the first four products A1B1 through A4B4 are in the four adder segments and the next four products A5B5 through A8B8 are in the multiplier segments. From the 9th cycle onwards, the summation is effectively broken down into four partial sums.

SIMD Organization: Distributed Memory Model and Shared Memory Model
Various architectures supporting parallel processing exist; they are broadly classified as multiprocessors and multicomputers. The common classification is:
Shared-Memory Multiprocessor models, which include UMA: uniform memory access (all SMP servers), NUMA: non-uniform memory access (Stanford DASH, SGI Origin 2000, Cray T3E) and COMA: cache-only memory architecture (KSR); these have very low remote memory access latency.
Figure 4.19: Shared memory multiprocessor
The Distributed-Memory Multicomputer model must have a message-passing network and is highly scalable, like the NORMA model (no-remote-memory-access): IBM SP2, Intel Paragon, TMC CM-5, Intel ASCI Red, PC clusters.
Figure 4.20: Distributed memory multicomputer

Principles of Multithreading: Multithreading Issues and Solutions
In computer architecture, multithreading is the ability of a central processing unit (CPU), or of a single core in a multi-core processor, to execute multiple processes or threads concurrently, appropriately supported by the operating system. This approach differs from multiprocessing: with multithreading, the processes and threads share the resources of a single core or of multiple cores (the computing units, the CPU caches, and the translation lookaside buffer, TLB). Whereas multiprocessing systems include multiple complete processing units, multithreading aims to increase utilization of a single core by using thread-level as well as instruction-level parallelism. As the two techniques are complementary, they are sometimes combined in systems with multiple multithreading CPUs and in CPUs with multiple multithreading cores.
The multithreading paradigm has become more popular as efforts to further exploit instruction-level parallelism have stalled since the late 1990s. This allowed the concept of throughput computing to re-emerge from the more specialized field of transaction processing: even though it is very difficult to further speed up a single thread or single program, most computer systems are actually multitasking among multiple threads or programs. Thus, techniques that improve the throughput of all tasks result in overall performance gains. The two major techniques for throughput computing are multithreading and multiprocessing.
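A minimal sketch of the throughput argument (not from the notes): while one thread waits on a long-latency event (simulated here with time.sleep standing in for a cache miss or I/O stall), other threads keep the processor busy. The task count and delay are arbitrary.

```python
# Compare serial execution of stall-heavy tasks with thread-level overlap.
import time
from concurrent.futures import ThreadPoolExecutor

def task(i):
    time.sleep(0.2)          # stand-in for a long-latency stall
    return i * i

t0 = time.perf_counter()
serial = [task(i) for i in range(8)]
t1 = time.perf_counter()
with ThreadPoolExecutor(max_workers=8) as pool:
    threaded = list(pool.map(task, range(8)))
t2 = time.perf_counter()

assert serial == threaded
print(f"one thread at a time: {t1 - t0:.2f} s")   # ~1.6 s
print(f"8 threads overlapped: {t2 - t1:.2f} s")   # ~0.2 s
```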
Advantages
If a thread gets a lot of cache misses, the other threads can continue taking advantage of the unused computing resources, which may lead to faster overall execution, since these resources would have been idle if only a single thread were executing. Also, if a thread cannot use all the computing resources of the CPU (because its instructions depend on each other's results), running another thread may prevent those resources from becoming idle.

Disadvantages
Multiple threads can interfere with each other when sharing hardware resources such as caches or translation lookaside buffers (TLBs). As a result, execution times of a single thread are not improved and can even be degraded, even when only one thread is executing, due to lower frequencies or additional pipeline stages that are necessary to accommodate thread-switching hardware.

Multiple-Context Processors
Multiple-context processor (MCP) architectures increase performance and reduce overhead by reducing the frequency of full context switches. In the study cited in these notes, this is accomplished through hardware support for inter-process communication (IPC) and scheduling. Conventional scheduling techniques for single-context processors do not adapt well to multiple-context platforms. A scheduling algorithm designed for multiple-context processors, called IPC-directed scheduling, uses information about the interaction between independent tasks (inter-process communication) to schedule tasks more efficiently on an MCP architecture. Software simulation of the processor hardware was used to explore the efficacy of the design. Experimental results demonstrate the improved performance of MCP architectures over single-context processors, and of IPC-directed scheduling compared with conventional scheduling techniques.

UNIT-V

Parallel Programming Models
The model of a parallel algorithm is developed by considering a strategy for dividing the data and the processing method, and applying a suitable strategy to reduce interactions. Parallel programming models are specifically designed for multiprocessor, multicomputer or vector/SIMD computers. Parallel computing is the simultaneous use of multiple compute resources to solve a computational problem:
• A problem is broken into discrete parts that can be solved concurrently.
• Each part is further broken down into a series of instructions.
In functional and logic models:
• Two language-oriented programming approaches for parallel processing have been proposed.
• Functional programming models, such as LISP, SISAL and Strand 88.
• Logic programming models, such as Prolog.
• Based on predicate logic, logic programming is suitable for solving large database queries.

Parallel Languages and Compilers
The environment for parallel computers is much more demanding than that for sequential computers. Software for driving parallel computers is still at an early developmental stage, and users are still forced to spend a lot of time programming hardware details instead of concentrating on program parallelism using high-level abstractions. To break this hardware/software barrier, we need a parallel software environment which provides better tools for users to implement parallelism and to debug programs. Most of the recently developed software tools are still in the research and testing stage, and only a few have become commercially available. A parallel language is one able to express programs that are executable on more than one processor. High-level languages are used for the source code, as this has become a necessity in modern computing.
Figure 5.5: Compilation phases in parallel code generation

Role of the Compiler
The role of the compiler is to remove the burden of program optimization and code generation from the programmer. A parallelizing compiler consists of three major phases:
Role of Compiler
The role of the compiler is to remove from the programmer the burden of program optimization and code generation. A parallelizing compiler consists of three major phases:
• Flow Analysis
• Analyzes the program flow pattern in order to determine data and control dependences in the source code.
• Flow analysis is conducted at different execution levels on different parallel computers.
• Optimization
• The transformation of user programs in order to exploit the hardware capabilities as much as possible.
• Transformations can be applied at the loop level, the locality level or the prefetching level.
• The ultimate goal of program optimization is to maximize the speed of code execution.
• Code Generation
• Code generation usually involves transformation from one representation to another, called an intermediate form.
• It is even more demanding for parallel machines because parallel constructs must be included.
• Code generation is closely tied to the instruction scheduling policies used.
Language Features for Parallelism
Language features are classified into six categories:
• Optimization features
• These features convert sequentially coded programs into parallel form; the purpose is to match the software parallelism with the hardware parallelism of the target machine.
• Automated parallelizer
• Semi-automated parallelizer (with programmer interaction)
• Interactive restructuring support (static analyzer, run-time statistics, data-flow graph)
• Availability features
• These features enhance user-friendliness, make the language portable to a larger class of parallel computers and expand the applicability of software libraries.
• Scalability – scalable to the number of processors available and independent of the hardware topology
• Compatibility – compatible with established sequential languages
• Portability – portable to shared-memory multiprocessors, message-passing multicomputers, or both
• Synchronization/communication features (see the barrier sketch below)
• Single-assignment languages
• Remote procedure call
• Data-flow languages such as Id
• Send/receive for message passing
• Barriers, mailboxes, semaphores, monitors
• Control of parallelism
• Coarse, medium or fine grain
• Explicit versus implicit parallelism
• Global parallelism in the entire program
• Task-split parallelism
• Shared task queue
• Data parallelism features
• Used to specify how data are accessed and distributed in either SIMD or MIMD computers
• Run-time automatic decomposition
• Mapping specification
• Virtual processor support
• Direct access to shared data
• SPMD (single program, multiple data)
• Process management features
• Needed to support the efficient creation of parallel processes and the implementation of multithreading or multitasking.
• Dynamic process creation at run time
Parallel Programming Environment
An environment for parallel programming consists of hardware platforms, the languages supported, the OS and software tools, and application packages. The hardware platforms vary from shared-memory, message-passing, vector-processing and SIMD machines to data-flow computers.
Key parallel programming steps:
• Find the concurrency in the given problem.
• Structure the algorithm so that the concurrency can be exploited.
• Implement the algorithm in a suitable programming environment.
• Execute and tune the performance of the code on a parallel system.
Software Tools and Environments
• Tools support individual process tasks such as checking the consistency of a design, compiling a program, comparing test results, etc. Tools may be general-purpose, stand-alone tools (e.g. a word processor) or may be grouped into workbenches.
• Workbenches support process phases or activities such as specification, design, etc. They normally consist of a set of tools with some greater or lesser degree of integration.
• Environments support all, or at least a substantial part, of the software process. They normally include several different integrated workbenches.
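The barrier sketch referred to under the synchronization/communication features above is given here. It is a minimal C example using POSIX barriers (a primitive chosen for illustration and available on Linux; the thread count and phase structure are assumptions), showing how threads wait for one another between phases of a parallel computation.

/* Sketch of barrier synchronization with POSIX threads.
 * All threads complete phase 1 before any thread begins phase 2,
 * so the combining step in phase 2 sees every partial result.
 * Thread count and phase contents are illustrative assumptions.
 */
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4

static pthread_barrier_t phase_barrier;
static double partial[NTHREADS];

static void *worker(void *arg)
{
    long id = (long)arg;

    /* Phase 1: each thread produces a partial result independently. */
    partial[id] = (id + 1) * 10.0;

    /* Barrier: no thread proceeds until all have written their entry. */
    pthread_barrier_wait(&phase_barrier);

    /* Phase 2: one thread combines the partial results. */
    if (id == 0) {
        double total = 0.0;
        for (int i = 0; i < NTHREADS; i++)
            total += partial[i];
        printf("combined result = %f\n", total);
    }
    return NULL;
}

int main(void)
{
    pthread_t threads[NTHREADS];

    pthread_barrier_init(&phase_barrier, NULL, NTHREADS);
    for (long i = 0; i < NTHREADS; i++)
        pthread_create(&threads[i], NULL, worker, (void *)i);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(threads[i], NULL);
    pthread_barrier_destroy(&phase_barrier);
    return 0;
}

The same coordination could be written with semaphores or with a mutex and condition variable in the monitor style; the barrier is simply the most direct match for the phase-by-phase structure listed among the language features above.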
The diagram below illustrates this classification and shows some examples of these different classes of CASE support; many types of tool and workbench have been left out of the diagram.
Figure 5.6: Tools, workbenches and environments
The term 'workbench' is no longer much used, and the term 'environment' has been extended to cover sets of tools focused on a specific process phase, e.g. a programming environment or a requirements engineering environment. In the diagram above, environments are classified as integrated environments or process-centered environments. Integrated environments provide infrastructure support for integrating different tools but are not concerned with how these tools are used. Process-centered environments are more general: they include software process knowledge and a process engine which uses this process model to advise engineers on which tools or workbenches to apply and when they should be used. In practice, the boundaries between these classes are blurred. Tools may be sold as a single product but may embed support for several different activities; for example, most word processors now provide a built-in diagram editor. CASE workbenches for design usually support programming and testing, so they are more akin to environments than to specialized workbenches.