CS 533: Final Report

John Criswell, Patrick Meredith, Alex Papakonstantinou, Apeksha Godiyal

May 6, 2008

1 Introduction

The increase in transistor capacity per die and the need to achieve higher performance in a power-efficient way have led to the emergence of Chip Multi-Processor (CMP) architectures. These architectures are based on multiple processing cores on a single chip and have the potential to provide higher peak throughput, easier design scalability, and greater performance/power ratios than monolithic designs. Even though current implementations [3, 12, 27] use a modest number of cores (two to four), the trend is towards larger numbers of cores on each die [14]. In order for applications to adapt quickly to this parallel multiprocessing environment, the proposed architectures use shared memory and private caches for fast memory accesses. This raises the need for cache coherency.

Cache coherence (CC) is a key aspect of the design of shared-memory parallel systems. It is the mechanism that guarantees correctness by maintaining system-wide agreement about the value of a memory location at any point in program execution. The overhead of maintaining coherency should be kept as small as possible in order to realize the performance gains available in multiprocessing. There has been extensive research on cache coherence schemes for parallel architectures. Parallel architectures can be divided by their interconnection scheme into two major categories: broadcast and point-to-point. Different protocols are used for the two types of multi-processor systems: bus-snooping protocols are used for broadcast-based systems, while directory-based protocols are suitable for point-to-point interconnected processors. The latter offer better scalability for the thousand-core systems of the foreseeable future.

The traditional CC protocols are designed for multi-chip multiprocessors. Implementing those protocols on a CMP system would potentially prove inefficient due to the significant differences in topological configurations [25]. For example, in a traditional multiprocessor the greatest communication latencies are between nodes, while in a CMP the communication latency between nodes is small compared to off-chip memory accesses [25]. Moreover, the effect of technology scaling on intra-die communication speed emphasizes the non-uniform latency characteristics of core communication for different proximities.

As the number of cores on the die is expected to continue increasing, we focus on directory-based protocols. We explore the potential speedup gains of a technique for reducing the upgrade miss (write hit to a shared cache line) overhead. This technique is called invalidate-parallel coherence and is based on the work of Acacio et al. [7], albeit with some modifications that increase its efficiency in the CMP topology. The idea of invalidate-parallel coherence is to overlap invalidates and exclusive ownership requests to the home node for upgrade misses. This is done by using the past history of sharers to proactively send invalidates to the expected current sharers on a write hit to a shared line. Thus, upgrade misses can be accelerated in strongly consistent processors. Previous studies [7] show that upgrade misses are an important fraction of the L2 miss rate (more than 30% in some cases). The protocol is implemented on a CMP with multiple L2 caches and a scalable point-to-point interconnect.
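As a rough illustration of where the savings come from (a back-of-the-envelope latency model of ours, not taken from the report): for an upgrade miss the data is already present locally, so the baseline protocol's critical path is three serialized network traversals (requester to directory, directory to sharers, sharers back to the requester), whereas with invalidate-parallel coherence the invalidations travel directly from the requester, so a correct prediction shortens the critical path to roughly one round trip:

\[
T_{\text{baseline}} \approx t_{\text{req}\rightarrow\text{dir}} + t_{\text{dir}\rightarrow\text{sharer}} + t_{\text{sharer}\rightarrow\text{req}},
\qquad
T_{\text{parallel}} \approx t_{\text{req}\rightarrow\text{sharer}} + t_{\text{sharer}\rightarrow\text{req}} .
\]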
2 Related Work

Even though most commercial implementations of multi-processor systems use bus-snooping cache coherence protocols, a great amount of research and related literature exists on directory-based implementations [16, 9]. During the last decade, the work on cache coherence has been shifting towards the CMP domain, as Moore's law allows a continuing increase in the number of cores that fit on a chip. Some of the existing literature proposes shared L2 caches [8], while other work uses private L2 caches for greater scalability. Since our work is focused on the core-rich CMPs of the near future, we assume private L2 caches. Minimizing the overhead of cache coherency is of great concern in the research community, as the performance benefit achievable in a multicore system can be severely impacted by increased L2 misses and the directory indirection.

Directory caches have been proposed as a means of reducing the memory overhead entailed in accessing the off-chip directory. Caching the directory state was proposed by Michael and Nanda [22], who studied the performance impact of caching directory entries and showed that it can lead to a 40% or greater improvement in the execution time of applications. A similar approach is used in the proximity-aware coherence work [5], which uses a mesh of cores on a single chip, each with a private L2 cache and a directory controller with a directory cache. Proximity-aware coherence tries to minimize off-chip memory accesses by serving them from on-chip core caches when the home node cache cannot fulfill them without sending an off-chip request. Our work is based on a similar CMP architecture configuration (though without the directory caches), but we aim to parallelize the on-chip communication for upgrade misses. Alternatively, other works [6, 20] have tried to explore parallelization of coherence messages through [...]

[...] of sharers of the line to the requester. The requester waits for ACKs from all the sharers. On receiving all the ACKs, it notifies the directory of completion of the request by sending an 'Exclusive Unblock' message. The requester then transitions to the modified state.

Directory: If the requested line is in the invalid state in the directory, the directory satisfies the ownership request from memory. If the line is in the modified state, the read-exclusive request is forwarded to the exclusive owner. If the line is in the owned state, the read-exclusive request is forwarded to the exclusive owner and the list of sharers is invalidated. If the line is in the shared state, the data is sent from memory to the requester and the list of sharers is invalidated. The directory then waits for an 'Exclusive Unblock' message from the requester, which tells it that the write miss is complete. On receiving this acknowledgment, it changes the state to modified.

Remote node: The remote node sends the data from its cache to the requester. The remote node then transitions to the invalid state.
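To make the baseline directory behaviour concrete, the following is an illustrative C++ sketch of the read-exclusive handling just described. It is our own rendering, not the project's actual GEMS/SLICC protocol code; all types, names, and helper functions are invented for the example.

```cpp
// Illustrative sketch only (not the project's actual protocol code) of how the
// directory handles a read-exclusive (write-miss) request in the baseline
// protocol described above.  All names, types, and helpers here are ours.
#include <cstdio>
#include <cstdint>
#include <set>

enum class DirState { Invalid, Shared, Owned, Modified };

struct DirEntry {
    DirState state = DirState::Invalid;
    int owner = -1;            // exclusive owner, if any
    std::set<int> sharers;     // current sharer list
};

// Stand-ins for the simulator's network interface: in a real simulator these
// would enqueue coherence messages on the interconnect.
static void sendDataFromMemory(int dst, uint64_t addr) {
    std::printf("mem data -> node %d (addr %#lx)\n", dst, (unsigned long)addr);
}
static void forwardReadExclusive(int owner, int requester, uint64_t addr) {
    std::printf("fwd RdX to owner %d for requester %d (addr %#lx)\n",
                owner, requester, (unsigned long)addr);
}
static void sendInvalidates(const std::set<int>& sharers, int requester, uint64_t addr) {
    for (int s : sharers)
        if (s != requester)
            std::printf("INV -> node %d (addr %#lx)\n", s, (unsigned long)addr);
}

// Directory action on a read-exclusive request; afterwards the directory blocks
// until the requester's 'Exclusive Unblock' arrives, at which point the entry
// becomes Modified with the requester as the new owner.
void handleReadExclusive(DirEntry& e, int requester, uint64_t addr) {
    switch (e.state) {
    case DirState::Invalid:
        sendDataFromMemory(requester, addr);             // satisfied from memory
        break;
    case DirState::Modified:
        forwardReadExclusive(e.owner, requester, addr);  // owner supplies the data
        break;
    case DirState::Owned:
        forwardReadExclusive(e.owner, requester, addr);  // owner supplies the data
        sendInvalidates(e.sharers, requester, addr);     // invalidate the sharer list
        break;
    case DirState::Shared:
        sendDataFromMemory(requester, addr);             // data comes from memory
        sendInvalidates(e.sharers, requester, addr);     // invalidate the sharer list
        break;
    }
}
```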
3.3 Parallel-Invalidate Coherence

Our parallel-invalidate coherence optimization (suggested by previous work [25]) accelerates upgrade misses (write hits on a shared line). These misses correspond to memory locations referenced by store instructions for which a read-only copy of the line is found in the local cache. In the baseline coherence protocol, the request for ownership is sent to the directory first, and the directory then sends invalidations to all the sharers. This two-step protocol results in high latencies.

We propose the use of prediction as a means of reducing the latency of upgrade misses [7]. The predicted sharers of a line are sent invalidates in parallel with the exclusive ownership request to the directory. To achieve this, a list of potential sharers of each line is maintained in each node. We assign a small associative memory to each core for storing the sharer list and call it a 'hint cache' [25]. The hint cache uses an LRU replacement policy and is read on every upgrade miss. The sharers of a line are updated during invalidation requests from the directory; the motivation is that after a write (which causes the invalidations) the cores that previously shared the line will continue to access it.

The protocol steps for read misses and write misses are the same as in the baseline protocol. The steps for an upgrade miss [25] are detailed below:

1. Requester: The requesting core looks up the hint cache entry corresponding to the line and sends invalidates to the potential sharers. It also generates a read-exclusive request and forwards it to the directory. If there is a miss in the hint cache, the read-exclusive request is simply sent to the directory.

2. Directory: The directory acts in the same way as in the baseline protocol. It sends invalidates to all sharers (even to the nodes to which the requester has already sent invalidates).

3. Remote node: When the parallel invalidation request arrives, the cache line is marked invalid, and an ACK is returned to the requester if and only if the remote node had the line in the shared state. In all other states, the parallel invalidate is ignored by removing it from the message buffer with no other action taken.

4. Requester: When the requester has received at least one ACK from each of the sharers, write ownership is granted. We keep track of which nodes have sent an ACK to make sure that we do not count multiple ACKs from one node and thus unblock the directory before it is valid to do so.

5. Additional concerns: Because of the parallel invalidations and the ACKs they generate, ACKs may arrive while a line is in a state that does not expect them (e.g., modified or owned). If these ACKs were not ignored by the protocol, they would cause a deadlock, so we remove them from the message buffers and take no action.

The potential speedup achieved by parallel invalidation comes from the fact that, when the prediction at the requester is correct (i.e., it includes all sharers), write ownership is granted based on the ACKs in response to the parallel invalidations, much earlier than if the requester had to wait for all ACKs from the normal invalidations.
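As a concrete (and purely illustrative) rendering of these steps, the sketch below shows the requester-side logic in C++: a small LRU hint cache, the parallel dispatch of invalidates alongside the read-exclusive request, and per-node ACK tracking so that duplicate ACKs cannot unblock the directory early. The class and function names, and the hint-cache update policy of recording the writer that triggered an invalidation, are our assumptions rather than the project's actual implementation.

```cpp
// Purely illustrative sketch (ours, not the project's GEMS implementation) of
// the requester-side upgrade-miss path: look up predicted sharers in a small
// LRU "hint cache", send invalidates to them in parallel with the
// read-exclusive request to the directory, and count at most one ACK per node.
#include <cstddef>
#include <cstdint>
#include <list>
#include <set>
#include <unordered_map>

using NodeSet = std::set<int>;

// Hint cache: cache-line address -> predicted sharers, with LRU replacement.
class HintCache {
public:
    explicit HintCache(std::size_t capacity) : capacity_(capacity) {}

    // Returns true and fills 'out' with the predicted sharers on a hit.
    bool lookup(uint64_t line, NodeSet& out) {
        auto it = entries_.find(line);
        if (it == entries_.end()) return false;
        touch(line);
        out = it->second;
        return true;
    }

    // Update policy (our assumption): when an invalidation for 'line' arrives,
    // remember the writer that caused it as a probable sharer of our next write.
    void recordInvalidation(uint64_t line, int writer) {
        if (!entries_.count(line) && entries_.size() == capacity_) {
            entries_.erase(lru_.back());               // evict the LRU line
            lru_.pop_back();
        }
        entries_[line].insert(writer);
        touch(line);
    }

private:
    void touch(uint64_t line) {                        // move to the MRU position
        lru_.remove(line);
        lru_.push_front(line);
    }
    std::size_t capacity_;
    std::list<uint64_t> lru_;                          // front = most recently used
    std::unordered_map<uint64_t, NodeSet> entries_;
};

// Stand-ins for the simulator's network interface.
static void sendParallelInvalidate(int node, uint64_t line) { (void)node; (void)line; }
static void sendReadExclusiveToDirectory(uint64_t line)     { (void)line; }

// Bookkeeping for one outstanding upgrade miss (step 4): the real sharer list
// is assumed to come from the directory; 'acked' records nodes already counted.
struct PendingUpgrade {
    NodeSet sharers;
    NodeSet acked;
    bool complete() const { return acked == sharers; }
};

// Step 1: fire invalidates at the predicted sharers and the ownership request
// at the directory at the same time (on a hint-cache miss, only the latter).
void onUpgradeMiss(HintCache& hints, uint64_t line) {
    NodeSet predicted;
    if (hints.lookup(line, predicted))
        for (int n : predicted) sendParallelInvalidate(n, line);
    sendReadExclusiveToDirectory(line);
}

// Steps 4-5: count at most one ACK per expected sharer; unexpected or duplicate
// ACKs are simply dropped.  Once complete() holds, write ownership can be
// granted and the 'Exclusive Unblock' sent to the directory.
void onAck(PendingUpgrade& p, int fromNode) {
    if (p.sharers.count(fromNode)) p.acked.insert(fromNode);
}
```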
4 Experimental Setup

4.1 Benchmarks

For our project, we decided to use the SPLASH2 [28] multi-threaded benchmark suite. SPLASH2 has the advantage of being designed for multiprocessor systems, which extends well to a CMP platform. We had to overcome several problems to run SPLASH2 on a modern, multiprocessor x86 system. As it stands, we still only have seven of the benchmarks running (though for two of the benchmarks there are two versions, and in each case we have both versions working).

The first issue was that SPLASH2 uses an M4 [10] macro file for all of the multiprocessing functions, such as thread/process creation and lock/unlock. The macro file provided with SPLASH2 is a null macro set that causes the benchmarks to be built without multiprocessor support. Fortunately, we were able to find a pthreads-specific [17] macro file, written by Bastiaan Stougie for his Master's thesis [4], which required only minor modification to compile the benchmarks.
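For illustration, the snippet below sketches the kind of mapping such a macro file provides, using the usual PARMACS-style macro names that SPLASH2 code calls (CREATE, WAIT_FOR_END, LOCK, UNLOCK). It is written with the C preprocessor rather than M4 for brevity, assumes locks are declared elsewhere as pthread_mutex_t, and is not a reproduction of Stougie's actual macro file.

```cpp
// Sketch (ours) of a pthreads backing for SPLASH2's PARMACS-style macros.
#include <pthread.h>

#define MAX_THREADS 16
static pthread_t parmacs_threads[MAX_THREADS];
static int       parmacs_nthreads = 0;

/* CREATE(func): spawn one worker thread executing func() */
#define CREATE(func)                                                \
    pthread_create(&parmacs_threads[parmacs_nthreads++], NULL,      \
                   (void *(*)(void *))(func), NULL);

/* WAIT_FOR_END(n): join the n worker threads created so far */
#define WAIT_FOR_END(n)                                             \
    for (int parmacs_i = 0; parmacs_i < (n); parmacs_i++)           \
        pthread_join(parmacs_threads[parmacs_i], NULL);

/* LOCK/UNLOCK map directly onto pthread mutex operations */
#define LOCK(mtx)    pthread_mutex_lock(&(mtx));
#define UNLOCK(mtx)  pthread_mutex_unlock(&(mtx));
```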
Even with the macro file, several of the benchmarks required minor modifications to compile with modern versions of GCC; several of them are actually written in K&R C. The following is a list of the SPLASH2 benchmarks that we have working:

• Ocean, contiguous and non-contiguous. Ocean is a distributed simulation of large-scale ocean movements based on eddies and boundary currents. The non-contiguous version results in non-contiguous grid partitions, which causes it to be slower than the contiguous version when running on a multiprocessor, as there is less data locality.

• Water-nsquared. Water-nsquared solves a molecular dynamics N-body problem. It is an O(n²) algorithm because it checks for collisions in a pair-wise fashion between every molecule in the simulation at a given point in time.

• Water-spatial. Water-spatial solves the same molecular dynamics N-body problem as Water-nsquared. Water-spatial differs in that the data structure used for the computation is a 3-D grid of boxes, each containing a linked list of the molecules present in that box at a given time. A process that owns a box in the grid only needs to look at the neighboring boxes for molecules that might be within the collision radius of a molecule in the box. The algorithm is thus (roughly) O(n) rather than O(n²).

• FFT. FFT is a complex one-dimensional Fast Fourier Transform. It exhibits near-linear scaling with the number of processors.

• LU, contiguous and non-contiguous. LU performs LU decomposition of dense matrices. LU decomposition is primarily used for solving linear systems of equations. LU also scales very well with the number of processors. As in Ocean, the contiguous version has better data locality.

• Radix. Radix performs an integer radix sort on an array of integers.

Additionally, Cholesky works, but the input files are unrecognizable to the benchmark, making it effectively unusable. Barnes cannot be compiled due to a missing ulocks.h header file. Radiosity and Raytrace have no easy means of checking their output, so we did not attempt to compile them. Fmm and Volrend require fairly extensive modifications to compile, and we decided that seven benchmarks (two with two versions) would be acceptable for the scope of the project.

4.2 Simulators

Our project requires a simulator that can simulate a CMP using a directory-based cache coherence system. Our team examined several candidate simulators [...]

| Program | Baseline Simulated Time (s) | Parallel Invalidate Simulated Time (s) | Baseline (Ruby Cycles) | Parallel Invalidate (Ruby Cycles) | Speedup | Comments |
|---|---|---|---|---|---|---|
| FFT | 51 | 53 | 251,130,894 | 265,466,250 | 0.95 | -m20 -p16 -n65536 -l4 |
| Ocean Contiguous | 31 | 33 | 153,308,732 | 164,160,387 | 0.93 | 514 x 514 matrix |
| Ocean Non-contiguous | 38 | 38 | 188,139,004 | 190,558,067 | 0.99 | 514 x 514 matrix |
| Water NSquared | 11 | 11 | 55,645,000 | 58,025,500 | 0.96 | Default SPLASH2 inputs except for 16 processors and 9 simulated timesteps |
| Water Spatial | 7 | 7 | 33,344,000 | 33,142,500 | 1.01 | Default SPLASH2 inputs except for 16 processors and 9 simulated timesteps |
| LU Contiguous | 52 | 53 | 257,247,107 | 262,106,146 | 0.98 | -n1024 -p16 -b16 |
| LU Non-contiguous | 34 | 35 | 170,080,111 | 178,329,169 | 0.95 | -n512 -p16 -b16 |
| Radix | 34 | 34 | 169,518,115 | 169,069,922 | 1.00 | -p16 -n9466080 -r1024 -m524288 |

Table 1: Performance Numbers. Speedup is calculated using Ruby Cycles.

[...] sharers. To avoid this, we extended the buffer used in the original protocol for holding pending upgrade misses to also keep a list of the nodes that have replied with ACKs. Thus, duplicate ACKs were ignored by checking the list of previously received ACKs. Moreover, modifications were added to ignore ACKs that may arrive in states not expecting them. For the same purpose, an ID generator was put in place to tag each upgrade miss with a unique ID, so as to avoid accepting delayed ACKs from previous upgrade misses (even though this case was never observed in our simulations).

6 Results

Our simulated system is a CMP Pentium 4 processor at 20 MHz running Fedora Core 5 Linux. The simulated machine has a single 16-core processor with 256 MB of RAM. We chose a two-dimensional torus network topology so that processors would not be equidistant from all other processors.

We were able to run the SPLASH2 benchmarks [28] described in Section 4.1. We configured all the benchmark programs to use 16 threads. The input parameters for the benchmarks are given in Table 1. We shut down and restarted the simulator for each run of each benchmark. Restarting the simulator ensures that each program begins execution in the same hardware state as all other benchmarks; failure to do this in earlier experiments gave us better, but misleading, performance results. We adjusted the inputs to each program to increase simulated execution time; we aimed to have programs execute for at least 10 simulated seconds.

The benchmarks have their own code to calculate their execution time; this is done by reading the system clock at the beginning and end of the benchmark. Since we wanted to measure execution time in simulated processor cycles, we added code to the benchmarks to cause a simulator breakpoint when measuring the beginning and end times. When a breakpoint occurred, we recorded a value called Ruby cycles, which is the number of simulated processor cycles since the beginning of the simulation [1]. In this way, we could accurately measure the number of processor cycles needed to run each benchmark. We placed the breakpoints so that the number of cycles recorded includes both the initialization time of the benchmark and its main computation. This allows our results to take into account all of the processing time needed to calculate a result, and it factors in serial execution time.
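Concretely, the speedups in Table 1 are the ratio of baseline Ruby cycles to parallel-invalidate Ruby cycles, so values below 1 indicate a slowdown under parallel invalidation. For FFT, for example:

\[
\text{Speedup}_{\text{FFT}} = \frac{\text{Ruby cycles}_{\text{baseline}}}{\text{Ruby cycles}_{\text{parallel}}} = \frac{251{,}130{,}894}{265{,}466{,}250} \approx 0.95 .
\]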
The results, summarized in Table 1, indicate that our implementation of parallel invalidation provides no significant performance improvement, but it does not hurt system performance significantly either.

To determine why the performance did not improve, we made a closer examination of the LU non-contiguous benchmark. We reran LU non-contiguous using an infinitely sized Hint-Cache. The speedup of LU non-contiguous improved from 0.95 (shown in Table 1) to 1.03. From this, we can conclude that either the size of our Hint-Cache was too small or our LRU replacement policy is not optimal in selecting entries to eject.

7 Future Work

To improve the performance of our system, we need to determine both the optimal size of the Hint-Cache and the best strategy for ejecting lines from it. Finding the optimal size is merely a matter of experimentation. If an optimal yet reasonable size cannot be found, then we will need to examine the sharing patterns more closely to determine whether a better replacement strategy is possible.

In future work, we would also like to measure and, if necessary, improve the Hint-Cache's accuracy. One approach to improving hint cache accuracy is to examine other points in the protocol's state diagram at which to update the list of potential sharers. For example, in the case of a producer-consumer application where the producer is always the same node, our Hint-Cache updating scheme will not ensure timely updates of the producer's Hint-Cache. One way of handling this case would be to have the directory send updated lists of the sharers within its replies to upgrade miss requests, either directly to the requester or through the owner node. From a preliminary implementation of this idea using the LU non-contiguous benchmark, we observed a significant decrease (around 6.5M) in simulation cycles.
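To make the last idea concrete, here is a small illustrative C++ sketch (ours, not the preliminary implementation mentioned above) of the directory piggy-backing its current sharer list on the reply to an upgrade-miss request so the requesting core can refresh its hint-cache entry. The types, and the reuse of the recordInvalidation() hook from the earlier sketch, are assumptions for illustration.

```cpp
// Illustrative sketch of a directory reply that carries the sharer list so the
// requester can refresh its hint cache (the HintCache sketched in Section 3.3).
#include <cstdint>
#include <set>

struct UpgradeReply {
    uint64_t line;
    std::set<int> currentSharers;   // sharer list as currently seen by the directory
};

template <typename HintCache>
void onUpgradeReply(HintCache& hints, const UpgradeReply& reply) {
    for (int node : reply.currentSharers)
        hints.recordInvalidation(reply.line, node);   // refresh the hint entry
}
```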
References

[1] GEMS Frequently Asked Questions. http://www.cs.wisc.edu/gems/doc/gems-wiki/moin.cgi/Frequently Asked Questions.

[2] Protocols - Multifacet GEMS Documentation Wiki. http://www.cs.wisc.edu/gems/doc/gems-wiki/moin.cgi/Protocols.

[3] AMD. http://www.amd.com/us-en/processors/productinformation/0 30 118 94 84 15184 00.html.

[4] Bastiaan Stougie. Optimization of a Data Race Detector. Master's thesis, Delft University of Technology, 2003.

[5] Jeffery A. Brown, Rakesh Kumar, and Dean Tullsen. Proximity-aware directory-based coherence for multi-core processor architectures. In SPAA '07: Proceedings of the Nineteenth Annual ACM Symposium on Parallel Algorithms and Architectures, pages 126–134, New York, NY, USA, 2007. ACM.

[6] D. Dai and D. K. Panda. Reducing cache invalidation overheads in wormhole routed DSMs using multidestination message passing. Proc. of the International Conference on Parallel Processing, (1):138–145, 1996.

[7] M. E. Acacio et al. The use of prediction for accelerating upgrade misses in CC-NUMA multiprocessors. PACT, 2002.

[8] L. Barroso et al. Piranha: A scalable architecture based on single-chip multiprocessing. ISCA, (27), 2000.

[9] D. Lenoski et al. The Stanford DASH multiprocessor. IEEE Computer, 1992.

[10] Free Software Foundation. GNU M4 webpage. www.gnu.org/software/m4/.

[11] K. Gharachorloo, M. Sharma, S. Steely, and S. V. Doren. Architecture and design of AlphaServer GS320. Proc. of the International Conference on Architectural Support for Programming Languages and Operating Systems, pages 13–24, 2000.

[12] Intel. http://www.intel.com/products/processor/coreduo/.

[13] S. Kaxiras and C. Young. Coherence communication prediction in shared-memory multiprocessors. Proc. of the 6th International Symposium on High Performance Computer Architecture, pages 156–167, 2000.