Memory Systems

Doug Burger, James R. Goodman, and Gurindar S. Sohi
University of Wisconsin-Madison

0.1 Introduction

The memory system serves as the repository of information (data) in a computer system. The processor (also called the central processing unit, or CPU) accesses (reads or loads) data from the memory system, performs computations on them, and stores (writes) them back to memory. The memory system is a collection of storage locations. Each storage location, or memory word, has a numerical address. A collection of storage locations forms an address space. Figure 1 shows the essentials of how a processor is connected to a memory system via address, data, and control lines.

When a processor attempts to load the contents of a memory location, the request is very urgent. In virtually all computers, the work soon comes to a halt (in other words, the processor stalls) if the memory request does not return quickly. Modern computers are generally able to continue briefly by overlapping memory requests, but even the most sophisticated computers will frequently exhaust their ability to process data and stall momentarily in the face of long memory delays. Thus, a key performance parameter in the design of any computer, fast or slow, is the effective speed of its memory.

Ideally, the memory system must be both infinitely large, so that it can contain an arbitrarily large amount of information, and infinitely fast, so that it does not limit the processing unit. Practically, however, this is not possible. There are three properties of memory that are inherently in conflict: speed, capacity, and cost. In general, technology tradeoffs can be employed to optimize any two of the three factors at the expense of the third. Thus it is possible to have memories that are (1) large and cheap, but not fast; (2) cheap and fast, but small; or (3) large and fast, but expensive. The last of the three is further limited by physical constraints. A large-capacity memory that is very fast is also physically large, and speed-of-light delays place a limit on the speed of such a memory system.

The latency (L) of the memory is the delay from when the processor first requests a word from memory until that word arrives and is available for use by the processor. The latency of a memory system is one attribute of performance. The other is bandwidth (BW), which is the rate at which information can be transferred from the memory system. The bandwidth and the latency are closely related. If R is the number of requests that the memory can service simultaneously, then

    BW = R / L    (1)

From Eq. (1) we see that a decrease in the latency will result in an increase in bandwidth, and vice versa, if R is unchanged. We can also see that the bandwidth can be increased by increasing R, if L does not increase proportionately. For example, we can build a memory system that takes 20 ns to service the access of a single 32-bit word. Its latency is 20 ns per 32-bit word, and its bandwidth is 32 bits / (20 x 10^-9 s), or 200 Mbytes/s. If the memory system is modified to accept a new (still 20 ns) request for a 32-bit word every 5 ns by overlapping requests, then its bandwidth is 32 bits / (5 x 10^-9 s), or 800 Mbytes/s. This memory system must be able to handle four requests at a given time.
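The arithmetic behind Eq. (1) and the 20 ns example can be checked with a few lines of code. The sketch below is illustrative only; the function name and the choice of units are assumptions, not part of any system described in the text.

```c
#include <stdio.h>

/* Bandwidth per Eq. (1): BW = R / L, scaled by the word size to give
 * bytes per second.  latency_s is L in seconds, requests_in_flight is R. */
static double bandwidth_bytes_per_s(double latency_s, int requests_in_flight,
                                    int word_bits)
{
    return (requests_in_flight * (word_bits / 8.0)) / latency_s;
}

int main(void)
{
    /* One outstanding 32-bit request, 20 ns latency: 200 Mbytes/s. */
    printf("R=1: %.0f Mbytes/s\n",
           bandwidth_bytes_per_s(20e-9, 1, 32) / 1e6);

    /* Accepting a new request every 5 ns means four requests overlap
     * within the 20 ns latency, so R = 4: 800 Mbytes/s. */
    printf("R=4: %.0f Mbytes/s\n",
           bandwidth_bytes_per_s(20e-9, 4, 32) / 1e6);
    return 0;
}
```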
Building an ideal memory system (infinite capacity, zero latency and infinite bandwidth, with affordable cost) is not feasible. The challenge is, given the cost and technology constraints, to engineer a memory system whose abilities match the abilities that the processor demands of it. That is, engineering a memory system that performs

0.3 Cache Memories

The basic unit of construction of a semiconductor memory system is a module or bank. A memory bank, constructed from several memory chips, can service a single request at a time. The time that a bank is busy servicing a request is called the bank busy time. The bank busy time limits the bandwidth of a memory bank. Both caches and main memories are constructed in this fashion, although caches have significantly shorter bank busy times than do main memory banks.

The hardware can dynamically allocate parts of the cache memory for addresses deemed most likely to be accessed soon. The cache contains only redundant copies of the address space, which is wholly contained in the main memory. The cache memory is associative, or content-addressable. In an associative memory, the address of a memory location is stored along with its content. Rather than reading data directly from a memory location, the cache is given an address and responds by providing data which may or may not be the data requested. When a cache miss occurs, the memory access is then performed with respect to the backing storage, and the cache is updated to include the new data.

The cache is intended to hold the most active portions of the memory, and the hardware dynamically selects portions of main memory to store in the cache. When the cache is full, bringing in new data must be matched by deleting old data. Thus a strategy for cache management is necessary. Cache management strategies exploit the principle of locality. Spatial locality is exploited by the choice of what is brought into the cache. Temporal locality is exploited by the choice of which block is removed. When a cache miss occurs, hardware copies a large, contiguous block of memory into the cache, which includes the requested word. This fixed-size region of memory, known as a cache line or block, may be as small as a single word, or up to several hundred bytes. A block is a set of contiguous memory locations, the number of which is usually a power of two. A block is said to be aligned if the lowest address in the block is exactly divisible by the block size. That is to say, for a block of size B beginning at location A, the block is aligned if

    A modulo B = 0    (3)

Conventional caches require that all blocks be aligned.
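As a concrete illustration of Eq. (3), the following sketch finds the aligned base address of the block containing an arbitrary address, assuming the block size is a power of two so that the modulo operation reduces to masking the low-order bits. The address and block size used here are illustrative assumptions.

```c
#include <stdio.h>

#define BLOCK_SIZE 32u   /* bytes per cache block; must be a power of two */

int main(void)
{
    unsigned addr = 0x12345678u;

    /* Eq. (3): a block beginning at address A is aligned if A modulo B = 0.
     * For a power-of-two block size B, A modulo B is just the low log2(B) bits. */
    unsigned offset = addr % BLOCK_SIZE;   /* equivalently: addr & (BLOCK_SIZE - 1) */
    unsigned base   = addr - offset;       /* aligned base of the containing block  */

    printf("address 0x%08x -> block base 0x%08x, offset %u\n", addr, base, offset);
    printf("block base aligned: %s\n", (base % BLOCK_SIZE == 0) ? "yes" : "no");
    return 0;
}
```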
When a block is brought into the cache, it is likely that another block must be evicted. The selection of the evicted block is based on an attempt to capture temporal locality. Since prescience is difficult to achieve, other methods are generally used to predict future memory accesses. A least-recently-used (LRU) policy is often the basis for the replacement choice. Other replacement policies are sometimes used, particularly because true LRU replacement requires extensive logic and hardware bookkeeping.

The cache often comprises two conventional memories: the data memory and the tag memory, shown in Figure 3. The address of each cache line contained in the data memory is stored in the tag memory, as well as other information (state information), particularly the fact that a valid cache line is present. The state also keeps track of which cache lines the processor has modified. Each line contained in the data memory is allocated a corresponding entry in the tag memory to indicate the full address of the cache line.

The requirement that the cache memory be associative (content-addressable) complicates the design. Addressing data by content is inherently more complicated than by its address. All the tags must be compared concurrently, of course, because the whole point of the cache is to achieve low latency. The cache can be made simpler, however, by introducing a mapping of memory locations to cache cells. This mapping limits the number of possible cells in which a particular line may reside. The extreme case is known as direct mapping, in which each memory location is mapped to a single location in the cache. Direct mapping makes many aspects of the design simpler, since there is no choice of where the line might reside, and no choice as to which line must be replaced. Direct mapping, however, can result in poor utilization of the cache when two memory locations are alternately accessed and must share a single cache cell.

A hashing algorithm is used to determine the cache address from the memory address. The conventional mapping algorithm consists of a function with the form

    A_cache = (A_memory mod cache_size) / cache_line_size    (4)

where A_cache is the address within the cache for main memory location A_memory, cache_size is the capacity of the cache in addressable units (usually bytes), and cache_line_size is the size of the cache line in addressable units. Since the hashing function is simple bit selection, the tag memory need only contain the part of the address not implied by the hashing function. That is,

    A_tag = A_memory div cache_size    (5)

where A_tag is stored in the tag memory and div is the integer divide operation. In testing for a match, the complete address of a line stored in the cache can be inferred from the tag and its storage location within the cache.

A two-way set-associative cache maps each memory location into either of two locations in the cache and can be constructed essentially as two identical direct-mapped caches. However, both caches must be searched at each memory access, and the appropriate data selected and multiplexed on a tag match (hit). On a miss, a choice must be made between the two possible cache lines as to which is to be replaced. A single LRU bit can be saved for each such pair of lines to remember which line has been accessed more recently. This bit must be toggled to the current state each time either of the cache lines is accessed. In the same way, an M-way associative cache maps each memory location into any of M locations in the cache and can be constructed from M identical direct-mapped caches. The problem of maintaining the LRU ordering of M cache lines quickly becomes hard, however, since there are M! possible orderings, so it takes at least

    log2(M!)    (6)

bits to store the ordering. In practice, this requirement limits true LRU replacement to three- or four-way set associativity.

Figure 4 shows how a cache is organized into sets, blocks, and words. The cache shown is a 2-Kbyte, four-way set-associative cache, with 16 sets. Each set consists of four blocks. The cache block size in this example is 32 bytes, so each block contains eight 4-byte words. Also depicted at the bottom of Figure 4 is a four-way interleaved main memory system (see the next section for details). Each successive word in the cache block maps into a different main memory bank.
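The geometry of the cache in Figure 4 can be made concrete by decomposing an address the way the lookup hardware would. The sketch below extends the bit-selection idea of Eqs. (4) and (5) to the four-way set-associative example (2 Kbytes, 16 sets, 32-byte blocks); the specific address and the variable names are assumptions for illustration.

```c
#include <stdio.h>

/* Geometry of the example cache from the text: 2 Kbytes, four-way
 * set-associative, 16 sets, 32-byte blocks (eight 4-byte words). */
#define CACHE_SIZE  2048u
#define ASSOC       4u
#define BLOCK_SIZE  32u
#define NUM_SETS    (CACHE_SIZE / (ASSOC * BLOCK_SIZE))   /* = 16 */

int main(void)
{
    unsigned addr = 0x0003A7C4u;   /* arbitrary illustrative byte address */

    /* Bit selection in the spirit of Eqs. (4) and (5): the low-order bits
     * give the offset within the block, the next bits select the set, and
     * the remaining high-order bits form the tag stored in the tag memory. */
    unsigned offset = addr % BLOCK_SIZE;
    unsigned set    = (addr / BLOCK_SIZE) % NUM_SETS;
    unsigned tag    = addr / (BLOCK_SIZE * NUM_SETS);

    printf("address 0x%08x -> tag 0x%x, set %u, offset %u\n",
           addr, tag, set, offset);

    /* On a lookup, all four tags in the selected set are compared with
     * 'tag' concurrently; a match is a hit, otherwise one block in the
     * set must be replaced (for example, by an LRU choice). */
    return 0;
}
```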
Because of the cache's mapping restrictions, each cache block obtained

Cache sizes have been steadily increasing on personal computers and workstations. Intel Pentium-based personal computers come with 8 Kbyte each of instruction and data caches. Two of the Pentium chip sets, manufactured by Intel and OPTi, allow level-two caches ranging from 256 to 512 Kbyte and 64 Kbyte to 2 Mbyte, respectively. The newer Pentium Pro systems also have 8 Kbyte, first-level instruction and data caches, but they also have either a 256 Kbyte or a 512 Kbyte second-level cache on the same module as the processor chip. Higher-end workstations, such as DEC Alpha 21164-based systems, are configured with substantially more cache. The 21164 also has 8 Kbyte, first-level instruction and data caches. Its second-level cache is entirely on-chip, and is 96 Kbyte. The third-level cache is off-chip, and can have a size ranging from 1 Mbyte to 64 Mbyte. For all desktop machines, cache sizes are likely to continue to grow, although the rate of growth relative to increases in processor speed and main memory size is unclear.

0.4 Parallel and Interleaved Main Memories

Main memories are composed of a series of semiconductor memory chips. A number of these chips, like caches, form a bank. Multiple memory banks can be connected together to form an interleaved (or parallel) memory system. Since each bank can service a request, an interleaved memory system with K banks can service K requests simultaneously, increasing the peak bandwidth of the memory system to K times the bandwidth of a single bank. In most interleaved memory systems, the number of banks is a power of two, that is, K = 2^k. An n-bit memory word address is broken into two parts: a k-bit bank number and an m-bit address of a word within a bank, where m = n - k. Though the k bits used to select a bank number could be any k bits of the n-bit word address, typical interleaved memory systems use the low-order k address bits to select the bank number; the higher-order bits of the word address are used to access a word in the selected bank. The reason for using the low-order k bits will be discussed shortly. An interleaved memory system which uses the low-order k bits to select the bank is referred to as a low-order or a standard interleaved memory.

There are two ways of connecting multiple memory banks: simple interleaving and complex interleaving. Sometimes simple interleaving is also referred to as interleaving, and complex interleaving as banking.
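A minimal sketch of low-order (standard) interleaving as just described: with K = 2^k banks, the low-order k bits of the word address select the bank and the remaining m bits address a word within that bank. The four-bank configuration and the address range are illustrative assumptions.

```c
#include <stdio.h>

#define K_BANKS 4u   /* K = 2^k banks; here k = 2 (illustrative) */

int main(void)
{
    /* Eight consecutive word addresses, e.g. the words of one cache block. */
    for (unsigned word_addr = 0x100; word_addr < 0x108; word_addr++) {
        unsigned bank         = word_addr % K_BANKS;  /* low-order k bits */
        unsigned addr_in_bank = word_addr / K_BANKS;  /* remaining m bits */
        printf("word 0x%03x -> bank %u, address %u within bank\n",
               word_addr, bank, addr_in_bank);
    }
    /* Consecutive words fall in different banks, which is why a cache
     * block can be fetched from all K banks in parallel. */
    return 0;
}
```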
Figure 5 shows the structure of a simple interleaved memory system. m address bits are simultaneously supplied to every memory bank. All banks are also connected to the same read/write control line (not shown in Figure 5). For a read operation, the banks start the read operation and deposit the data in their latches. Data can then be read from the latches, one by one, by appropriately setting the switch. Meanwhile, the banks could be accessed again, to carry out another read or write operation. For a write operation, the latches are loaded, one by one. When all the latches have been written, their contents can be written into the memory banks by supplying m bits of address (they will be written into the same word in each of the different banks). In a simple interleaved memory, all banks are cycled at the same time; each bank starts and completes its individual operations at the same time as every other bank; a new memory cycle can start (for all banks) once the previous cycle is complete. Timing details of the accesses can be found in The Architecture of Pipelined Computers [Kogge, 1981].

One use of a simple interleaved memory system is to back up a cache memory. To do so, the memory must be able to read blocks of contiguous words (a cache block) and supply them to the cache. If the low-order k bits of the address are used to select the bank number, then consecutive words of the block reside in different banks; they can all be read in parallel and supplied to the cache one by one. If some other address bits are used for bank selection, then multiple words from the block might fall in the same memory bank, requiring multiple accesses to the same bank to fetch the block.

Figure 6 shows the structure of a complex interleaved memory system. In such a system, each bank is set up to operate on its own, independent of the other banks' operation. In this example, Bank 1 could carry out a read operation on a particular memory address, while Bank 2 carries out a write operation on a completely unrelated memory address. (Contrast this with the operation in a simple interleaved memory, where all banks are carrying out the same operation, read or write, and the locations accessed within each bank represent a contiguous block of memory.) Complex interleaving is accomplished by providing an address latch and a read/write command line for each bank. The memory controller handles the overall operation of the interleaved memory. The processing unit submits the memory request to the memory controller, which determines the bank that needs to be accessed. The controller then determines if the bank is busy (by monitoring a busy line for each bank). The controller holds the request if the bank is busy, submitting it later when the bank is available to accept the request. When the bank responds to a read request, the switch is set by the controller to accept the data from the bank and forward it to the processing unit. Timing details of the accesses can be found in The Architecture of Pipelined Computers [Kogge, 1981].

A typical use of a complex interleaved memory system is in a vector processor. In a vector processor, the processing units operate on a vector, for example a portion of a row or a column of a matrix. If consecutive elements of a vector are present in different memory banks, then the memory system can sustain a bandwidth of one element per clock cycle. By arranging the data suitably in memory and using standard interleaving (for example, storing the matrix in row-major order will place consecutive elements in consecutive memory banks), the vector can be accessed at the rate of one element per clock cycle as long as the number of banks is greater than the bank busy time.
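The claim that a stride-1 vector can be streamed at one element per clock cycle whenever the number of banks exceeds the bank busy time can be illustrated with a toy timing model. The sketch below assumes one in-order request issued per cycle and a fixed busy time per bank; it is a simplification for illustration, not a model of any particular memory controller.

```c
#include <stdio.h>

/* Cycles needed to stream n_elems consecutive vector elements through an
 * interleaved memory with k_banks banks (k_banks <= 64), each bank busy for
 * 'busy' cycles per access.  The controller issues at most one in-order
 * request per cycle and must wait if the target bank is still busy. */
static long stream_cycles(int n_elems, int k_banks, int busy)
{
    long bank_free[64] = {0};   /* cycle at which each bank becomes free */
    long cycle = 0;
    for (int i = 0; i < n_elems; i++) {
        int bank = i % k_banks;              /* low-order interleaving        */
        if (bank_free[bank] > cycle)
            cycle = bank_free[bank];         /* stall until the bank is free  */
        bank_free[bank] = cycle + busy;      /* bank busy for 'busy' cycles   */
        cycle++;                             /* one issue slot per cycle      */
    }
    return cycle;
}

int main(void)
{
    int n = 1000, busy = 6;
    /* With more banks than the bank busy time, the memory sustains one
     * element per cycle; with fewer banks, accesses stall. */
    printf("8 banks, busy=6: %ld cycles for %d elements\n", stream_cycles(n, 8, busy), n);
    printf("4 banks, busy=6: %ld cycles for %d elements\n", stream_cycles(n, 4, busy), n);
    return 0;
}
```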
Memory systems that are built for current machines vary widely, the price and purpose of the machine being the main determinant of the memory system design. The actual memory chips, which are the components of the memory systems, are generally commodity parts built by a number of manufacturers. The major commodity DRAM manufacturers include (but are certainly not limited to) Hitachi, Fujitsu, LG Semicon, NEC, Oki, Samsung, Texas Instruments, and Toshiba.

The low end of the price/performance spectrum is the personal computer, presently typified by Intel Pentium systems. Three of the manufacturers of Pentium-compatible chip sets (which include the memory controllers) are Intel, OPTi, and VLSI Technologies. Their controllers provide for memory systems that are simply interleaved, all with minimum bank depths of 256 Kbyte, and maximum system sizes of 192 Mbyte, 128 Mbyte, and 1 Gbyte, respectively. Both higher-end personal computers and workstations tend to have more main memory than the lower-end systems, although they usually have similar upper limits. Two examples of such systems are workstations built with the DEC Alpha 21164, and servers built with the Intel Pentium Pro. The Alpha systems, using the 21171 chip set, are limited to 128 Mbyte of main memory using 16 Mbit DRAMs, although they will be expandable to

along with other information about the page. In most implementations the page offset is the same for a virtual address and the physical address to which it is mapped.

The virtual memory hierarchy is different from the cache/main memory hierarchy in a number of respects, resulting primarily from the fact that there is a much greater difference in latency between accesses to the disk and to main memory. While a typical latency ratio for cache and main memory is one order of magnitude (main memory has a latency ten times larger than the cache), the latency ratio between disk and main memory is often four orders of magnitude or more. This large ratio exists because the disk is a mechanical device, with a latency partially determined by velocity and inertia, whereas main memory is limited only by electronic and energy constraints. Because of the much larger penalty for a page miss, many design decisions are affected by the need to minimize the frequency of misses. When a miss does occur, the processor could be idle for a period during which it could execute tens of thousands of instructions. Rather than stall during this time, as may occur upon a cache miss, the processor invokes the operating system and may switch to a different task. Because the operating system is being invoked anyway, it is convenient to rely on the operating system to set up and maintain the page table, unlike cache memory, where this is done entirely in hardware. The fact that this accounting occurs in the operating system enables the system to use virtual memory to enforce protection on the memory. This ensures that no program can corrupt the data in memory that belong to any other program.

Hardware support provided for a virtual memory system generally includes the ability to translate the virtual addresses provided by the processor into the physical addresses needed to access main memory. Thus, only upon a virtual address miss is the operating system invoked. An important aspect of a computer that implements virtual memory, however, is the necessity of freezing the processor at the point at which a miss occurs, servicing the page table fault, and later returning to continue the execution as if no page fault had occurred. This requirement means either that it must be possible to halt execution at any point, including possibly in the middle of a complex instruction, or that it must be possible to guarantee that all memory accesses will be to pages resident in main memory.
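The address arithmetic of a paged virtual memory can be sketched in a few lines. The example below assumes an illustrative 4-Kbyte page size and a toy in-memory page table (a real page table also holds valid and protection bits and is maintained by the operating system); all names and values here are assumptions for illustration.

```c
#include <stdio.h>

#define PAGE_SIZE 4096u          /* illustrative 4-Kbyte pages */

/* Toy page table covering only 8 virtual pages: virtual page number ->
 * physical page frame number.  Real tables are larger, hold state bits,
 * and are set up by the operating system. */
static const unsigned page_table[8] = { 5, 2, 7, 0, 3, 6, 1, 4 };

int main(void)
{
    unsigned vaddr = 0x00003ABCu;

    unsigned vpn    = vaddr / PAGE_SIZE;   /* virtual page number      */
    unsigned offset = vaddr % PAGE_SIZE;   /* unchanged by translation */

    unsigned pfn   = page_table[vpn];             /* one memory access in a real system */
    unsigned paddr = pfn * PAGE_SIZE + offset;    /* page offset carried over verbatim  */

    printf("virtual 0x%08x -> vpn %u, offset 0x%03x -> physical 0x%08x\n",
           vaddr, vpn, offset, paddr);
    return 0;
}
```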
As described above, virtual memory requires two memory accesses to fetch a single entry from memory, one into the page table to map the virtual address into the physical address, and the second to fetch the actual data. This process can be sped up in a variety of ways. First, a special-purpose cache memory to store the active portion of the page table can be used to speed up the first access. This special-purpose cache is usually called a translation lookaside buffer (TLB). Second, if the system also employs a cache memory, it may be possible to overlap the access of the cache memory with the access to the TLB, ideally allowing the requested item to be accessed in a single cache access time. The two accesses can be fully overlapped if the virtual address supplies sufficient information to fetch the data from the cache before the virtual-to-physical address translation has been accomplished. This is true for an M-way set-associative cache of capacity C if the following relationship holds:

    page_size >= C / M    (7)

For such a cache, the index into the cache can be determined strictly from the page offset. Since the virtual page offset is identical to the physical page offset, no translation is necessary, and the cache can be accessed concurrently with the TLB. The physical address must be obtained before the tag can be compared, of course.

An alternative method applicable to a system containing both virtual memory and a cache is to store the virtual address in the tag memory instead of the physical address. This technique introduces consistency problems in virtual memory systems that either permit more than a single address space, or allow a single physical page to be mapped to more than one virtual page. This problem is known as the aliasing problem.

Chapter 102 is devoted to virtual memory, and contains significantly more material on this topic for the interested reader.

Research Issues

Research is occurring on all levels of the memory hierarchy. At the register level, researchers are exploring techniques to provide more registers than are architecturally visible to the compiler. A large volume of work exists (and is ongoing) on cache optimizations and alternate cache organizations. For instance, modern processors now commonly split the top level of the cache into separate physical caches, one for instructions (code) and one for program data. Due to the increasing cost of cache misses (in terms of processor cycles), some research trades off increased cache complexity for a reduced miss rate. Two examples of cache research from opposite ends of the hardware/software spectrum are blocking [Lam, 1991] and skewed-associative caches [Seznec, 1993]. Blocking is a software technique in which the programmer or compiler reorganizes algorithms to work on subsets of data that are smaller than the cache, instead of streaming entire large data structures repeatedly through the cache. This reorganization greatly improves temporal locality. The skewed-associative cache is one example of a host of hardware techniques that map blocks into the cache differently, with the goal of reducing misses from set conflicts. In skewed-associative caches, either one of two hashing functions may determine where a block should be placed in the cache, as opposed to just the one hashing function (low-order index bits) that traditional caches use.
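Blocking is easiest to see in code. The sketch below shows a generic tiled matrix multiply in the spirit of the blocked algorithms studied in [Lam, 1991], not that paper's exact kernel; the matrix dimension and tile size are arbitrary illustrative choices, with the tile size picked so that the working set of a tile fits comfortably in a typical cache.

```c
#include <stdio.h>

#define N  256   /* matrix dimension (illustrative) */
#define BS 32    /* tile size; a BS x BS tile of doubles is 8 Kbytes */

static double a[N][N], b[N][N], c[N][N];

/* Blocked (tiled) matrix multiply: the loops are reorganized so the
 * computation works on BS x BS sub-matrices, reusing each tile many
 * times while it is resident in the cache (better temporal locality)
 * instead of streaming whole rows and columns through the cache. */
static void matmul_blocked(void)
{
    for (int ii = 0; ii < N; ii += BS)
        for (int kk = 0; kk < N; kk += BS)
            for (int jj = 0; jj < N; jj += BS)
                for (int i = ii; i < ii + BS; i++)
                    for (int k = kk; k < kk + BS; k++) {
                        double aik = a[i][k];
                        for (int j = jj; j < jj + BS; j++)
                            c[i][j] += aik * b[k][j];
                    }
}

int main(void)
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) { a[i][j] = 1.0; b[i][j] = 1.0; c[i][j] = 0.0; }

    matmul_blocked();
    printf("c[0][0] = %.1f (expected %d)\n", c[0][0], N);
    return 0;
}
```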
An important cache-related research topic is prefetching [Mowry, 1992], in which the processor issues requests for data well before the data are actually needed. Speculative prefetching is also a current research topic. In speculative prefetching, prefetches are issued based on guesses as to which data will be needed soon. Other cache-related research examines placing special structures in parallel with the cache, trying to optimize for workloads that do not lend themselves well to caches. Stream buffers [Jouppi, 1990] are one such example. A stream buffer automatically detects when a linear access through a data structure is occurring. The stream buffer issues multiple sequential prefetches upon detection of a linear array access.

Much of the ongoing research on main memory involves improving the bandwidth from the memory system without greatly increasing the number of banks. Multiple banks are expensive, particularly with the large and growing capacity of modern DRAM chips. Rambus [Rambus Inc., 1992] and RamLink [IEEE Computer Society, 1993] are two such examples.

Research issues associated with improving the performance of the virtual memory system fall under the domain of operating system research. One proposed strategy for reducing page faults allows each running program to specify its own page replacement algorithm, enabling each program to optimize the choice of page replacements based on its reference pattern [Engler et al., 1995]. Other recent research focuses on improving the performance of the TLB. Two techniques for doing this are the use of a two-level TLB (the motivation is similar to that for a two-level cache) and the use of superpages [Talluri, 1994]. With superpages, each TLB entry may

switches to a different task while the needed page is read from the disk. Every memory request issued by the CPU requires an address translation, which in turn requires an access to the page table stored in memory. A translation lookaside buffer (TLB) is used to reduce the number of page table lookups. The most frequent virtual-to-physical mappings are kept in the TLB, which is a small associative memory tightly coupled with the CPU. If the needed mapping is found in the TLB, the translation is performed quickly and no access to the page table need be made. Virtual memory allows systems to run larger or more programs than are able to fit in main memory, enhancing the capabilities of the system.

Defining Terms

Bandwidth: The rate at which the memory system can service requests.
Cache memory: A small, fast, redundant memory used to store the most frequently accessed parts of the main memory.
Interleaving: Technique for connecting multiple memory modules together in order to improve the bandwidth of the memory system.
Latency: The time between the initiation of a memory request and its completion.
Memory hierarchy: Successive levels of different types of memory, which attempt to approximate a single large, fast, and cheap memory structure.
Virtual memory: A memory space implemented by storing the more-frequently-accessed parts in main memory and less-frequently-accessed parts on disk.

References

Denning, P. J. 1970. "Virtual memory," Computing Surveys, vol. 2, no. 3, pp. 153-170.
Engler, D. R., Kaashoek, M. F., and O'Toole, J., Jr. 1995. "Exokernel: An Operating System Architecture for Application-Level Resource Management," Proc. 15th Symposium on Operating Systems Principles, pp. 251-266.
Hennessy, J. L. and Patterson, D. A. 1990. Computer Architecture: A Quantitative Approach, 1st ed. Morgan Kaufmann Publishers, San Mateo, CA.
Hill, M. D. 1988. "A case for direct-mapped caches," IEEE Computer, vol. 21, no. 12.
IEEE Computer Society. 1993. IEEE Standard for High-Bandwidth Memory Interface Based on SCI Signaling Technology (RamLink), Draft 1.00, IEEE P1596.4-199X.
Jouppi, N. 1990. "Improving Direct-Mapped Cache Performance by the Addition of a Small Fully-Associative Cache and Prefetch Buffers," Proc. 17th Annual International Symposium on Computer Architecture, pp. 364-373.
Kogge, P. M. 1981. The Architecture of Pipelined Computers, New York: McGraw-Hill.
Kroft, D. 1981. "Lockup-Free Instruction Fetch/Prefetch Cache Organization," Proc. 8th Annual International Symposium on Computer Architecture, pp. 81-87.
Lam, M. S., Rothberg, E. E., and Wolf, M. E. 1991. "The Cache Performance and Optimizations of Blocked Algorithms," Proc. 4th Annual Symposium on Architectural Support for Programming Languages and Operating Systems, pp. 63-74.
Mowry, T. C., Lam, M. S., and Gupta, A. 1992. "Design and Evaluation of a Compiler Algorithm for Prefetching," Proc. 5th Annual Symposium on Architectural Support for Programming Languages and Operating Systems, pp. 62-73.
Rambus, Inc. 1992. Rambus Architectural Overview, Mountain View, CA.
Seznec, A. 1993. "A case for two-way skewed-associative caches," Proc. 20th International Symposium on Computer Architecture, pp. 169-178.
Smith, A. J. 1986. "Bibliography and readings on CPU cache memories and related topics," ACM SIGARCH Computer Architecture News, vol. 14, no. 1, pp. 22-42.
Smith, A. J. 1991. "Second bibliography on cache memories," ACM SIGARCH Computer Architecture News, vol. 19, no. 4, pp. 154-182.
Talluri, M. and Hill, M. D. 1994. "Surpassing the TLB Performance of Superpages with Less Operating System Support," Proc. Sixth International Symposium on Architectural Support for Programming Languages and Operating Systems, pp. 171-182.

Further Information

Some general information on the design of memory systems is available in High-Speed Memory Systems by A. V. Pohm and O. P. Agarwal. Computer Architecture: A Quantitative Approach by John Hennessy and David Patterson contains a detailed discussion on the interaction between memory systems and computer architecture. For information on memory system research, the recent proceedings of the International Symposium on Computer Architecture contain annual research papers in computer architecture, many of which focus on the memory system. To obtain copies, contact the IEEE Computer Society Press, at 10662 Los Vaqueros Circle, P.O. Box 3014, Los Alamitos, CA 90720-1264.

FIGURE 2  The memory hierarchy: registers (very fast, electronic speeds; semiconductor SRAM; tiny, 128 bytes - 4 Kbytes), cache (very fast, electronic speeds; semiconductor SRAM; small, 32 Kbytes - 4 Mbytes), main memory (fast, electronic speeds; semiconductor DRAM; large, 4 Mbytes - 512 Mbytes), and virtual memory (very slow, mechanical speeds; magnetic/optical; very large, 40 Mbytes - 8 Gbytes).

FIGURE 3  Cache organization: the incoming address is decoded into tag, index, and offset; the incoming and stored tags are compared, the data word is selected from the cache block (frame), and a hit/miss indication is produced from the stored state and tag.

FIGURE 4  A four-way set-associative cache with 16 sets, four cache blocks per set, and eight 4-byte words per block, backed by four main memory banks; each successive word in a block maps to a different main memory bank.