Circuit-Aware Architectural Simulation: Enhancing High-Performance Analysis, Lab Reports of Health sciences

This paper addresses the limitations of traditional architectural simulators in handling the recent trend toward system designs that react to circuit-level phenomena. The authors propose an architectural simulator design that incorporates circuit modeling capabilities, enabling simulations that react to circuit characteristics on a cycle-by-cycle basis. The paper details the design, related work, circuit simulation methodology, and performance optimizations of the simulator.

Typology: Lab Reports

Pre 2010

Uploaded on 09/02/2009

koofers-user-l6v-1
Circuit-Aware Architectural Simulation

Seokwoo Lee, Shidhartha Das, Valeria Bertacco, Todd Austin, David Blaauw, and Trevor Mudge
Advanced Computer Architecture Lab
The University of Michigan
1301 Beal Ave, Ann Arbor, MI 48109
razor@eecs.umich.edu

ABSTRACT

Architectural simulation has achieved a prominent role in the system design cycle by providing designers the ability to quickly examine a wide variety of design choices. However, the recent trend in system design toward architectures that react to circuit-level phenomena has outstripped the capabilities of traditional cycle-based architectural simulators. In this paper, we present an architectural simulator design that incorporates a circuit modeling capability, permitting architectural-level simulations that react to circuit characteristics (such as latency, energy, or current draw) on a cycle-by-cycle basis. While these additional capabilities slow simulation speed, we show that the careful application of circuit simulation optimizations and simulation sampling techniques permits high levels of detail with sufficient speed to examine entire workloads.

Categories and Subject Descriptors

B.8.2 [Performance Analysis and Design Aids]: Architectural Simulation; B.5.2 [Register-Transfer-Level Implementation]: Design Aids—Simulation

General Terms

Computer system simulation

Keywords

Architectural simulation, High-performance simulation, Circuit simulation

1. INTRODUCTION

To accelerate the hardware design cycle, architects often employ architectural simulators of the hardware they are designing. They implement these models in traditional programming languages or hardware description languages, and then execute programs on them to validate the performance and correctness of a proposed hardware design.
Although software models run slower than their hardware counterparts, architects can construct hardware models in minutes or hours rather than in the months needed to manufacture real hardware. The faster build time speeds up the hardware design cycle, giving architects the ability to evaluate more designs before committing to a single solution for fabrication.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. DAC 2004, June 7–11, 2004, San Diego, California, USA. Copyright 2004 ACM 1-58113-828-8/04/0006 ...$5.00.

In the traditional approach to architectural simulation, a software model of the architecture is constructed by first identifying the major components of the system and determining their operation latency as a function of the expected clock cycle of the machine. For example, ALUs and register files typically have a latency of one cycle, while the latency of the caches depends on the size of the cache. With the latency of components defined, an architectural model of the complete machine is constructed by quantifying the number of each component, their datapath connections within the microarchitecture, and the hazards (or stalls) that can be experienced by instructions using the various components in the design. Once the model is defined, architectural simulation becomes the process of determining the total number of cycles it takes to execute a program, based on instructions executed, availability of resources, and the hazards experienced during execution.

There is a recent trend in computer architecture design toward systems that can adapt to circuit-level phenomena.
In these highly adaptable systems, it is possible for the architecture to influence circuit operation and vice versa. Examples of these types of systems include di/dt throttling [6], thermal throttling [14], and Razor clocking [7]. Throttling techniques monitor current and temperature characteristics of the underlying circuit implementation. If current demands get too high (which induces noise on the supply lines) or if temperatures rise too high, an architectural-level system controller is invoked to throttle down instruction fetch bandwidth. With fewer instructions entering the microarchitecture, current demands and temperature are quickly reduced. Razor clocking is a technique to reduce circuit energy levels below the point required by worst-case computation paths [7]. In the event a computation fails due to extraordinary energy requirements, an error recovery mechanism restores correct state. With prudent energy tuning, the approach can greatly reduce circuit energy demands with little impact on computational speed.

These novel circuit-aware architectural optimizations share the requirement that the architectural simulator must accurately gauge detailed circuit phenomena to correctly simulate the operation of the machine under study. For example, the throttling techniques must count the total number of devices switching during each cycle of operation, and simulation of Razor clocking requires detailed timing information of pipeline stages on a per-cycle basis. The approach that has been taken to analyze much of this work has been to utilize extremely simplistic analytical circuit models of microarchitectural components. The primary advantages of the analytical circuit models are flexibility and speed. They also have minimal performance impact, with typically less than a 100% slowdown. However, researchers have begun to question the accuracy of simple analytical circuit models [8].
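To make the throttling feedback loop described above concrete, the following is a minimal sketch of an architectural-level controller that lowers fetch bandwidth when a per-cycle current or temperature reading crosses a threshold. The class name, the thresholds, the units, and the halving policy are all hypothetical and chosen only for illustration; the cited throttling work may use quite different control policies.

```python
from dataclasses import dataclass


@dataclass
class FetchThrottle:
    """Hypothetical controller; names and thresholds are illustrative only."""
    max_fetch_width: int = 4      # instructions fetched per cycle at full speed
    current_limit: float = 10.0   # per-cycle current threshold (arbitrary units)
    temp_limit: float = 85.0      # temperature threshold (arbitrary units)

    def fetch_width(self, current: float, temperature: float) -> int:
        """Return the fetch bandwidth allowed for the next cycle."""
        if current > self.current_limit or temperature > self.temp_limit:
            # Throttle down: fewer instructions enter the microarchitecture.
            # Halving is an assumed policy, not one taken from the paper.
            return max(1, self.max_fetch_width // 2)
        return self.max_fetch_width


throttle = FetchThrottle()
# (current, temperature) readings that a circuit model would supply each cycle
samples = [(6.0, 70.0), (12.5, 70.0), (7.0, 90.0), (5.0, 60.0)]
widths = [throttle.fetch_width(i, t) for i, t in samples]
print(widths)
```

The point of the sketch is the data dependency: the architectural decision in each cycle needs a circuit-level measurement from that same cycle, which is what a purely cycle-based simulator cannot supply.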
In this paper, we present an architectural simulation modeling infrastructure that incorporates circuit simulation capabilities. The approach is quite accurate because we analyze detailed circuit-level phenomena, including individual gate delay and energy characteristics. Performance, while considerably slower than architectural simulation, is maintained using an effective combination of circuit-level and architecture-level simulation optimizations. The optimizations we implement include i) early circuit simulation termination based on architectural constraints, ii) circuit timing memoization, and iii) fine-grained instruction sampling. Using our optimized circuit-aware architectural simulator, we are able to examine the performance of a large program in detail in under 5 hours of simulation.

The remainder of this paper is organized as follows. Section 2 details related work in architectural simulation, circuit simulation, and simulator performance optimization. Section 3 details our circuit simulation methodology and its integration into an architectural simulation model. Section 4 describes the optimizations we implemented to further improve the performance of the simulator. Section 5 demonstrates the use of our simulator with a case study of Razor clocking. Finally, Section 6 summarizes the paper and suggests future directions.

2. BACKGROUND AND RELATED WORK

A number of popular architectural simulation infrastructures exist that are widely used in academia and industry. One of the most notable examples is the SimpleScalar tool set [2], a collection of simulation models capable of running programs compiled for the PISA, Alpha and ARM instruction sets. At the core of the simulator infrastructure is an emulation mechanism to execute programs of interest. The models also include event management routines, resource tracking mechanisms, and statistical analysis packages.
The resulting models are at a high enough abstraction level that they execute fairly efficiently. The most detailed SimpleScalar models execute programs at a rate of about 100,000 instructions per second (IPS), permitting architects to examine seconds of real-time execution (billions of instructions) in a few hours of simulation.

The approach that has been taken to incorporate circuit characteristics (such as power demands) into microarchitectural simulators has been to utilize analytical circuit models of components and drive them with the events in the architectural simulator. In the Wattch power analyzer [3], circuit-level power behavior was characterized with analytical models, including those developed for the CACTI on-chip cache model [17]. The microarchitectural simulator records the switching characteristics of components such as caches, register files, branch target buffers, and translation lookaside buffers. The dynamic power consumption can then be calculated directly.

A similar approach was adopted by SimplePower [10]. It incorporated register transfer level (RTL) power models based on look-up tables (LUT) into a microarchitectural simulator. Finally, the Cai-Lim power-performance simulator introduced empirical power density models derived from Intel internal design data for major microarchitectural functional blocks [4]. In all of these tools, microprocessor power is estimated by multiplying the frequency of accesses to the microarchitectural functional blocks by their lumped capacitance, which in turn is derived from the circuit model of the block.

The primary advantages of these analytical circuit models are flexibility and speed. The models are sufficiently high level that they can be readily reconfigured to different component configurations (e.g., cache size, branch predictor configuration) and fabrication technologies.
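The estimation scheme described above (access frequency multiplied by lumped capacitance) can be sketched in a few lines. This sketch assumes the standard CMOS dynamic power relation, power = activity x C x Vdd^2 x f, which the text does not spell out. The block names, capacitance values, and supply voltage are hypothetical and are not taken from Wattch, SimplePower, or the Cai-Lim simulator.

```python
def dynamic_power(access_counts, cycles, lumped_cap_farads, vdd, clock_hz):
    """Estimate dynamic power in watts from per-block access counts.

    Assumes P = activity * C * Vdd^2 * f for each block, where activity is
    the average number of accesses per cycle recorded by the simulator.
    """
    total = 0.0
    for block, accesses in access_counts.items():
        activity = accesses / cycles  # average accesses per cycle
        total += activity * lumped_cap_farads[block] * vdd ** 2 * clock_hz
    return total


# Hypothetical numbers, for illustration only
caps = {"icache": 2.0e-10, "regfile": 5.0e-11}   # lumped capacitance in farads
counts = {"icache": 1000, "regfile": 500}        # accesses seen in 1000 cycles
power = dynamic_power(counts, cycles=1000, lumped_cap_farads=caps,
                      vdd=1.8, clock_hz=200e6)
print(f"estimated dynamic power: {power:.4f} W")
```

Because every block contributes one fixed capacitance per access, the model is cheap to evaluate and easy to reconfigure, but it cannot capture data-dependent switching, which is the accuracy concern raised in the text.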
These models also have minimal simulation performance impact, with less than a 100% slowdown for all of the models discussed. Recently, however, the accuracy of analytical models in architectural simulators has come into question. Ghiasi and Grunwald [8] noted that even at a 90% confidence level, two analytical power models (Wattch and Cai-Lim) failed to agree on the benefits of a variety of power-based optimizations. The results of the two models were uncorrelated, suggesting that at least one of the tools was based on incomplete or inaccurate models. The authors concluded that the accuracy of circuit-level models must be improved to detect anything but the grossest power savings. In our previous Razor work (by Ernst et al. [7]), we further refined this approach to architectural modeling to allow direct measurement of module latency, based on input vectors from a live architectural simulation. Our model was built by hand, thus limiting our experimental studies to examining the effects of Razor clocking on a single Kogge-Stone adder circuit [7].

Ideally, for circuit-aware architectural designs, we would like to leverage an analysis framework with the accuracy of circuit-level simulation and the flexibility and speed of architectural-level simulation. This is not possible with state-of-the-art circuit simulation tools alone. When running the Razor clocking experiment (detailed in Section 5.2) on Synopsys VCS, a compile-based Verilog simulator which we configured to use SDF back-annotation timing information, simulations ran at rates of about 50 instructions per second on a Sun Blade 1000 workstation. Typical architectural simulations examine up to 1 billion instructions, which would require at least six months of simulation!
In addition, the framework was not sufficiently flexible to accomplish the necessary analysis; e.g., the tool could not accommodate voltage changes during simulation, as all logic and wires were characterized as voltage-derived delays bound in the simulation code. Hence, this particular tool is not sufficiently flexible to examine dynamic voltage scaling (DVS), an optimization of intense interest in the architecture literature.

In the domain of circuit-level simulation, SPICE has been the industry standard for the past 25 years. During this time, much research [1, 11] has been devoted to improving simulation performance to handle the increasing complexity of circuit designs without sacrificing simulation accuracy. In this context, SPECS2 [15] marked a change of pace by proposing one of the first table-based approaches to timing simulation, which is also the technique of choice for our simulator and for many of the current tools in this arena. Today, state-of-the-art commercial simulators can simulate designs of millions of transistors at a speed a thousand times faster than SPICE with an accuracy trade-off of just a few percent. However, architectural designs require an analysis

[...]

circuit-level module:

$(\mathit{vector}_{state}, \mathit{vector}_{in}, V_{dd}) \rightarrow (\mathit{delay}, \mathit{energy})$

where $\mathit{vector}_{state}$ represents the current state of the circuit, $\mathit{vector}_{in}$ is the current input vector, and $V_{dd}$ is the current operating voltage. The hash table returns the circuit evaluation latency and the circuit evaluation energy. We index the hash table with a combination of $\mathit{vector}_{state}$ and $\mathit{vector}_{in}$ because $\mathit{vector}_{state}$ encodes the current state of the circuit and $\mathit{vector}_{in}$ indicates the input transitions. Combined with the current operating voltage, $V_{dd}$, the inputs to the hash table fully encode the factors that determine delay and energy.

Whenever the hash table does not include the requested entry, full-scale circuit simulation is performed to compute the delay and energy of the circuit computation.
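The lookup described above can be sketched as a memoization table keyed on the circuit state, the input vector, and the operating voltage. This is a minimal illustration rather than the authors' implementation: the class name, the integer bit-vector encoding, and the toy_simulator stand-in for full-scale circuit simulation are all assumptions. On a miss, the sketch runs the simulator and caches the result.

```python
from typing import Callable, Dict, Tuple

Key = Tuple[int, int, float]      # (vector_state, vector_in, Vdd)
Result = Tuple[float, float]      # (delay, energy)


class CircuitMemo:
    """Hypothetical memo table in front of an expensive circuit simulator."""

    def __init__(self, simulate: Callable[[int, int, float], Result]):
        self._simulate = simulate             # stand-in for full circuit simulation
        self._table: Dict[Key, Result] = {}
        self.hits = 0
        self.misses = 0

    def evaluate(self, vector_state: int, vector_in: int, vdd: float) -> Result:
        key = (vector_state, vector_in, vdd)
        if key in self._table:
            self.hits += 1
            return self._table[key]
        # Miss: fall back to the expensive simulation and remember the result
        self.misses += 1
        result = self._simulate(vector_state, vector_in, vdd)
        self._table[key] = result
        return result


def toy_simulator(vector_state: int, vector_in: int, vdd: float) -> Result:
    """Placeholder model: delay and energy scale with the number of toggled bits."""
    toggles = bin(vector_state ^ vector_in).count("1")
    return (toggles / vdd, toggles * vdd ** 2)


memo = CircuitMemo(toy_simulator)
first = memo.evaluate(0b1010, 0b0110, 1.8)    # miss: runs the simulator
second = memo.evaluate(0b1010, 0b0110, 1.8)   # hit: served from the table
print(first == second, memo.hits, memo.misses)
```

Including the voltage in the key matters: the same state and input vectors at a different Vdd produce a different delay and energy, so they must occupy a separate entry.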
The result is then inserted into the hash table with the expectation that later portions of the program will generate similar vectors. In our implementation, the size of the hash table is limited to 256 MB. In addition, we found better performance when we dynamically reorder the hash bucket chains by bringing the most recently referenced element to the head of the chain. The latter optimization further exploits fine-grained program value locality.

For our baseline hash table implementation, we achieved a hit rate of typically less than 50%, which still rendered a sizeable speedup. Investigation into the access stream quickly revealed that hashing the entire input vector to pipeline stage logic is overly restrictive. For example, load instructions that pass through the execute (EX) stage of the pipeline include two input register operands in their input vectors, yet the second operand is ignored during execution of the load (the instruction offset field is used instead). When the second operand is included in the input vector, multiple hash table entries are required to memoize the same load address computation. To alleviate this problem, a per-opcode input vector filtering mechanism was developed. Each instruction opcode indicates with a mask which inputs do not influence stage logic evaluation. These inputs are masked off before attempting to memoize the circuit simulation. The optimization resulted in a much higher hash table hit rate of 70-85% on average. Simulation speedups due to memoization were quite noticeable, with most experiments experiencing 3-5x improvements.

4.3 SimPoint Analysis

Typical architectural simulations in the literature analyze dynamic program lengths of 1 billion or more instructions. Even after deploying all of the previous optimizations, we only reach simulation speeds on the order of 1,000 instructions per second, which would require more than a week of simulation time to complete a single program run.
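The run-time estimate above follows from simple arithmetic, sketched here for the two rates quoted in the paper: about 50 instructions per second for the Verilog simulator (Section 2) and roughly 1,000 instructions per second after the circuit-level optimizations. The function name is illustrative, and the 1-billion-instruction workload is the typical length cited in the text.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def simulation_days(instructions: int, instructions_per_second: float) -> float:
    """Wall-clock days needed to simulate a workload at a fixed rate."""
    return instructions / instructions_per_second / SECONDS_PER_DAY


WORKLOAD = 1_000_000_000  # typical dynamic program length cited in the text

verilog_days = simulation_days(WORKLOAD, 50)       # compile-based Verilog simulation
optimized_days = simulation_days(WORKLOAD, 1000)   # after pruning and memoization

print(f"50 IPS:   {verilog_days:.1f} days")
print(f"1000 IPS: {optimized_days:.1f} days")
```

At 1,000 instructions per second a full run still takes well over a week, which is why the remaining reduction has to come from simulating fewer instructions rather than simulating each one faster.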
Fortunately, we can draw on a recent result in computer simulation sampling to relax the performance demands on the circuit-aware architectural simulator. SimPoint analysis was recently proposed as a technique to dramatically reduce the number of instructions simulated to characterize a program's performance on a complex microarchitecture [13]. SimPoint uses basic block distribution analysis along with several techniques from clustering analysis to concisely summarize the behavior of an arbitrary section of execution in a program. This information summarizes whole-program behavior and greatly reduces simulation time by using only representative samples.

In our work, we use samples of 10 million instructions (called Early Multiple SimPoints) [12]. The SimPoints indicate a collection of sample starting points to simulate within the program, the length of the samples, and the weight to use when combining simulation statistics (e.g., IPC). With this technique, even our slowest simulation was able to analyze a complete program in just over 5 hours (at 554 instructions/second). Error analysis of these SimPoints indicates an error of less than 10% (typically less than 3%) for a wide variety of benchmarks [5].

5. EXAMPLE CASE STUDY

To evaluate the quality of our circuit-aware architectural simulator, we modeled the Razor clocking technique proposed by Ernst et al. [7]. In Razor designs, the latency of an instruction (in cycles) may vary based on the latency of the circuit evaluation within a pipeline stage. In this section, we present a high-level overview of the Razor clocking technology and demonstrate its evaluation using our circuit-aware architectural simulator.

5.1 Razor Timing Speculation

The key observation underlying the design of Razor is that the worst-case conditions that drive traditional design are improbable conditions.
Thus, by building error detection and correction mechanisms into the Razor design, it becomes possible to tune voltage to typical energy requirements, rather than worst case. The resulting design has significantly lower energy requirements, even in the presence of added energy processing demands due to occasional error recoveries. The Razor design utilizes an in-situ timing error detection and correction mechanism implemented within the Razor flip-flop. Razor flip-flops double-sample pipeline stage values, once with an aggressive fast clock and again with a delayed clock that guarantees a reliable second sample. A metastability-tolerant error detection circuit is employed to check the validity of all values latched on the fast Razor clock. In the event of a timing error, a modified pipeline flush mechanism restores the correct stage value into the pipeline, flushes earlier instructions, and restarts the next instruction after the erroneous computation. For additional background on Razor timing verification and related dynamic verification work in general, see references [7, 16].

5.2 Experimental Framework

To model Razor clocking, we implemented an architectural model of a baseline 64-bit Alpha processor. The processor architecture is a simple in-order pipeline consisting of instruction fetch, instruction decode, execute, and memory/writeback stages, with 8 Kbytes of I-cache and D-cache. In addition, the entire processor was described in Verilog and synthesized using Synopsys Design Analyzer (version 2003.03-2). Global routing capacitances were estimated by performing global place and route using Cadence Silicon Ensemble (version 5.4.126) and Mentor Graphics Xcalibre (version 9.1 5.6). The processor was mapped to a 0.18um TSMC process, and it was validated to operate at 200 MHz.
After careful performance analysis, it was found that only the instruction decode and execute stages were critical at the worst-case voltage and frequency settings; hence, only these stages are incorporated into the circuit-aware architectural simulations.

5.3 Simulation Case Study

Table 1 shows that the baseline performance of our circuit-aware simulator is comparable to the compile-based Verilog VCS simulator, which simulated the Razor design at about 50 instructions per second. However, after applying all optimizations, it reaches a speed of 887 instructions per second, more than 8 times faster than its barebone counterpart.

Optimization options        instructions/sec
None                        102
Pruning                     347
Pruning and Memoization     887

Table 1: Benefits of Circuit Simulation Optimizations (when simulating GCC)

Figure 5: Case Study of a Razor Design

Figure 5 demonstrates the performance of Razor clocking as measured by our circuit-aware architectural simulator. The top graph shows the relative energy of the pipeline with decreasing voltage. As voltage decreases, simulated pipeline energy decreases, even in the presence of expensive timing error recoveries. Because our circuit-aware architectural simulator can accurately gauge the per-cycle stage evaluation latency, it is possible to assess the voltage (around 1.2V) at which the cost of Razor timing error recovery outweighs the benefits of further decreasing voltage. In addition, the bottom graph demonstrates the measurement capability of our circuit simulator. The figure illustrates the latency through the EX stage logic as a probability distribution function for all input vectors produced during a simulation of the GNU GCC compiler. Given that the worst-case latency through this stage is over 4ns, it is clear that typical-case latencies are much less, allowing Razor to lower voltage with only small increases in circuit timing error rates.

6. CONCLUSIONS

In this paper we have shown that it is possible to combine circuit simulation with an architectural simulator and still achieve significant simulation throughput rates. By identifying those events that repeat or are not critical, we can still capture delay information that is voltage- or data-dependent in the simulation. In the past, this sort of analysis required a clumsy coupling of architectural simulation and selective SPICE simulation. Our framework not only provides an automated solution to the specific problem of voltage- and data-dependent delays, but can also be extended in a natural way to other run-time dependencies, such as process variation and noise coupling.

Acknowledgements

This work is supported by grants from ARM Ltd., the National Science Foundation, and the Gigascale Systems Research Center.

7. REFERENCES

[1] E. Acuna, J. Dervenis, A. Pagones, and R. Saleh. iSPLICE3: a new simulator for mixed analog/digital circuits. In IEEE Custom Integrated Circuits Conference, pages 13.1/1–13.1/4, May 1989.
[2] T. Austin, E. Larson, and D. Ernst. SimpleScalar: An infrastructure for computer system modeling. In IEEE Computer, Feb. 2002.
[3] D. Brooks, V. Tiwari, and M. Martonosi. Wattch: A framework for architectural-level power analysis and optimizations. In Proc. 27th Int. Symp. on Computer Architecture (ISCA27), May 2000.
[4] G. Cai and C. H. Lim. Architectural level power/performance optimization and dynamic power estimation. In Cool Chips Tutorial in conjunction with the 32nd Int. Symp. on Microarchitecture (MICRO-32), Nov. 1999.
[5] B. Calder. Simpoint website. In http://www.cse.ucsd.edu/ calder/simpoint/, 2003.
[6] W.-K. Chen. The VLSI handbook. In CRC Press publisher, 2000.
[7] D. Ernst, N. S. Kim, S. Das, S. Pant, T. Pham, R. Rao, C. Ziesler, D. Blaauw, T. Austin, T. Mudge, and K. Flautner. Razor: A low-power pipeline based on circuit-level timing speculation. In 36th Annual International Symposium on Microarchitecture (MICRO-36), Dec. 2003.
[8] S. Ghiasi and D. Grunwald. A comparison of two architectural power models. In Workshop on Power Aware Computing Systems (PACS-2000), Dec. 2000.
[9] M. H. Lipasti and J. P. Shen. Exploiting value locality to exceed the dataflow limit. In 29th International Symposium on Microarchitecture (MICRO-29), Dec. 1996.
[10] N. Vijaykrishnan et al. Energy-driven integrated hardware-software optimizations using SimplePower. In Proc. 27th Int. Symp. on Computer Architecture (ISCA27), May 2000.
[11] C. Ratzlaff, N. Gopal, and L. Pillage. RICE: Rapid interconnect circuit evaluator. In DAC, Proceedings of Design Automation Conference, pages 555–560, June 1991.
[12] T. Sherwood, E. Perelman, and B. Calder. Basic block distribution analysis to find periodic behavior and simulation points in applications. In International Conference on Parallel Architectures and Compilation Techniques, Sept. 2001.
[13] T. Sherwood, E. Perelman, G. Hamerly, and B. Calder. Automatically characterizing large scale program behavior. In International Conference on Architectural Support for Programming Languages and Operating Systems, Oct. 2002.
[14] K. Skadron, M. Stan, and T. Abdelzaher. Control-theoretic techniques and thermal-RC modeling for accurate and localized dynamic thermal management. In 8th International Symposium on High-Performance Computer Architecture (HPCA-8), Feb. 2002.
[15] C. Visweswariah and R. Rohrer. SPECS2: An integrated circuit timing simulator. In ICCAD, Proceedings of the International Conference on Computer Aided Design, pages 94–97, Nov. 1987.
[16] C. Weaver and T. Austin. A fault tolerant approach to microprocessor design. In IEEE International Conference on Dependable Systems and Networks (DSN-2001), June 2001.
[17] S. Wilton and N. Jouppi. An enhanced access and cycle time model for on-chip caches. In Western Research Laboratory Research Report 93/5, July 1993.