Memory Hierarchy and Cache Systems: Processor-Main Memory Interface and Virtual Memory

An overview of the memory hierarchy at two main interface levels: processor-main memory and main memory-secondary memory. It covers the introduction of caches, replacement algorithms, writing in a cache, main write options, classifying cache misses, cache performance, and improving cache performance. Questions that arise at each level are also discussed.

Principle of Locality: Memory Hierarchies

• Text and data are not accessed randomly
• Temporal locality
  – Recently accessed items will be accessed in the near future (e.g., code in loops, top of stack)
• Spatial locality
  – Items at addresses close to the addresses of recently accessed items will be accessed in the near future (sequential code, elements of arrays)
• Leads to a memory hierarchy at two main interface levels:
  – Processor – main memory -> introduction of caches
  – Main memory – secondary memory -> virtual memory (paging systems)

Processor - Main Memory Hierarchy

• Registers: those visible to the ISA + those renamed by hardware
• (Hierarchy of) caches, plus their enhancements
  – Write buffers, victim caches, etc.
• TLBs and their management
• Virtual memory system (O.S. level) and hardware assists (page tables)
• Inclusion of information (or space to gather information) level per level
  – Almost always true

Questions that Arise at Each Level (cont'd)

• What happens if there is no room for the item we bring in?
  – Replacement algorithm; depends on the organization
• What happens when we change the contents of the info?
  – i.e., what happens on a write?

Caches (on-chip, off-chip)

• Caches consist of a set of entries where each entry has:
  – a line (or block) of data: the information contents
  – a tag: allows one to recognize whether the block is there
  – status bits: valid, dirty, state for multiprocessors, etc.
• Cache geometries
  – Capacity (or size) of a cache: number of lines * line size, i.e., the cache metadata (tag + status bits) is not counted
  – Associativity
  – Line size

Cache Organizations

• Direct-mapped
• Set-associative
• Fully-associative

Replacement Algorithm

• None for direct-mapped
• Random, LRU, or pseudo-LRU for set-associative caches
  – Not an important factor for performance at low associativity; can become important for large associativity and large caches

Writing in a Cache

• On a write hit, should we write:
  – in the cache only (write-back policy), or
  – in the cache and in main memory (or the higher-level cache) (write-through policy)?
• On a write miss, should we
  – allocate a block as on a read (write-allocate), or
  – write only in memory (write-around)?

The Main Write Options

• Write-through (aka store-through)
  – On a write hit, write both in the cache and in memory
  – On a write miss, the most frequent option is write-around
  – Pro: consistent view of memory (better for I/O); no ECC required for the cache
  – Con: more memory traffic (can be alleviated with write buffers)
• Write-back (aka copy-back)
  – On a write hit, write only in the cache (requires a dirty bit)
  – On a write miss, most often write-allocate (fetch on miss), but variations are possible
  – Pros and cons are the reverse of write-through

Cache Performance

• CPI contributed by the cache = CPIc = miss rate * number of cycles to handle the miss
• Another important metric:
  Average memory access time (AMAT) = cache hit time * hit rate + miss penalty * (1 - hit rate)
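A worked illustration of the formula on the slide above; the numbers (1-cycle hit time, 95% hit rate, 40-cycle miss penalty) are hypothetical, not taken from the slides, and the miss penalty is read as the total time to service a miss:

```latex
% Hypothetical numbers, not from the slides.
\[
\text{AMAT} = \underbrace{1 \times 0.95}_{\text{hit time} \times \text{hit rate}}
            + \underbrace{40 \times (1 - 0.95)}_{\text{miss penalty} \times \text{miss rate}}
            = 0.95 + 2.0 = 2.95 \text{ cycles}
\]
% Using the CPI formula above with the same numbers (and, as an
% assumption, one memory reference per instruction):
% CPIc = 0.05 x 40 = 2 cycles added per instruction.
```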
Improving Cache Performance

• To improve cache performance:
  – Decrease the miss rate without increasing the time to handle a miss (more precisely: without increasing the average memory access time)
  – Decrease the time to handle a miss without increasing the miss rate
• A slew of techniques: hardware and/or software
  – Increase capacity, associativity, etc.
  – Hardware assists (victim caches, write buffers, etc.)
  – Tolerating memory latency: prefetching (hardware and software), lock-up free caches

Improving L1 Cache Access Time

• The processor generates virtual addresses
• Can the cache have virtual address tags?
  – What happens on a context switch?
• Can the cache and the TLB be accessed in parallel?
  – Needs a correspondence between the page size and the cache size + associativity
• What about virtually addressed, physically tagged caches?

Illustration of Page Table

[Figure: page tables for Programs A and B map virtual pages to physical frames, with a valid bit per entry. Note: v.p. 2 of Program A used to be mapped to physical frame m but has been replaced by v.p. 1 of Program A; v.p. 0 of Program B was never mapped.]

Virtual Address Translation

[Figure: the virtual page number indexes the page table to obtain the physical frame number; the page offset is passed through unchanged.]

From Virtual Address to Memory Location (highly abstracted)

[Figure: the ALU issues a virtual address, the page table translates it to a physical address, which is then presented to the memory hierarchy.]

TLB Organization

[Figure: the virtual page number is split into an index and a tag; each TLB entry holds the tag, the physical frame number, and valid/dirty/protection bits — a copy of the PTE.]

From Virtual Address to Memory Location (highly abstracted; revisited)

[Figure: the ALU issues a virtual address, which the TLB translates to a physical address on a hit; the physical address goes to the cache, and on a cache miss to main memory; a TLB miss requires a page-table access first.]

Speeding up L1 Access

• The cache can be (speculatively) accessed in parallel with the TLB if its indexing bits are not changed by the virtual-to-physical translation
• Cache access (for reads) is pipelined:
  – Cycle 1: access the TLB and the L1 cache (read data at the given index)
  – Cycle 2: compare tags and, on a hit, send the data to the register

Synonyms

[Figure: v.p. x of process A and v.p. y of process B map to the same physical page but to different (synonym) lines in the cache; to avoid synonyms, the O.S. or hardware forces the virtual-page-number bits used in the cache index to be the same.]

Obvious Solutions to Decrease Miss Rate

• Increase cache capacity
  – Yes, but the larger the cache, the slower the access time
  – Solution: cache hierarchies (even on-chip)
  – Increasing L2 capacity can be detrimental on multiprocessor systems because of the increase in coherence misses
• Increase cache associativity
  – Yes, but with a "law of diminishing returns" (after 4-way for small caches; not sure of the limit for large caches)

What about Cache Line Size?

• For a given application, cache capacity, and associativity, there is an optimal cache line size
• Long cache lines
  – Good for spatial locality (code, vectors)
  – Reduce compulsory misses (implicit prefetching)
  – But take more time to bring in from the next level of the memory hierarchy (can be compensated by "critical word first" and subblocks)
  – Increase the possibility of fragmentation (only a fraction of the line is used – or reused)

[Figure: victim cache organization — a small victim cache sits next to the L1 cache: (1) hit in L1; (2) miss in L1, hit in the victim cache: send data to the register and swap; (3) miss in both: the line comes from the next level of the memory hierarchy and the evicted L1 line goes to the victim cache.]

Operation of a Victim Cache

• 1. Hit in L1: nothing else needed
• 2. Miss in L1 for the line at location b, hit in the victim cache at location v: swap the contents of b and v (takes an extra cycle)
• 3. Miss in L1, miss in the victim cache: load the missing line from the next level and put it in L1; put the entry replaced in L1 in the victim cache; if the victim cache is full, evict one of its entries
• A victim buffer of 4 to 8 entries for a 32 KB direct-mapped cache works well
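A minimal sketch of the three cases above, assuming a direct-mapped L1 and a small fully-associative victim cache with round-robin replacement; all names and sizes (Line, l1, vc, fetch_from_next_level, the 8-entry victim cache) are invented for illustration and are not the lecture's implementation:

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define L1_LINES   1024   /* e.g., 32 KB / 32-byte lines */
#define VC_ENTRIES 8      /* 4-8 entries work well per the slides */
#define LINE_BYTES 32

typedef struct {
    bool     valid;
    uint64_t tag;                 /* here simply the full block address */
    uint8_t  data[LINE_BYTES];
} Line;

static Line l1[L1_LINES];
static Line vc[VC_ENTRIES];
static unsigned vc_next;          /* round-robin victim-cache replacement */

/* Stand-in for the next level of the memory hierarchy (assumption). */
static void fetch_from_next_level(uint64_t block_addr, uint8_t *dst)
{
    (void)block_addr;
    memset(dst, 0, LINE_BYTES);
}

/* Returns the L1 line holding block_addr, filling it on a miss. */
Line *access_with_victim_cache(uint64_t block_addr)
{
    unsigned idx = (unsigned)(block_addr % L1_LINES);
    Line *b = &l1[idx];

    /* 1. Hit in L1: nothing else needed. */
    if (b->valid && b->tag == block_addr)
        return b;

    /* 2. Miss in L1, hit in the victim cache: swap the two lines
     *    (costs an extra cycle in the real hardware). */
    for (unsigned v = 0; v < VC_ENTRIES; v++) {
        if (vc[v].valid && vc[v].tag == block_addr) {
            Line tmp = *b;
            *b = vc[v];
            vc[v] = tmp;
            return b;
        }
    }

    /* 3. Miss in both: the line evicted from L1 goes to the victim cache
     *    (overwriting an entry, round-robin, if it is full) and the
     *    missing line is loaded from the next level. */
    if (b->valid) {
        vc[vc_next] = *b;
        vc_next = (vc_next + 1) % VC_ENTRIES;
    }
    b->valid = true;
    b->tag = block_addr;
    fetch_from_next_level(block_addr, b->data);
    return b;
}
```

A real victim cache compares all tags and selects a victim in parallel hardware; the sequential loop here is only for clarity.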
Bringing more Associativity -- Column-Associative Caches

• Split (conceptually) a direct-mapped cache into two halves
• Probe the first half according to the index; on a hit, proceed normally
• On a miss, probe the second half; if it hits there, send the data to the register and swap it with the entry in the first half (takes an extra cycle)
• On a miss in both halves, go to the next level, load into the second half, and swap
• Slightly more complex than that (need one …)

Options for Page Coloring

• Option 1: assume that the faulting process is using the whole cache
  – Attempt to map the page such that the cache will access data as if it were accessed by virtual addresses
• Option 2: do the same thing but hash with bits of the PID (process identification number)
  – Reduces inter-process conflicts (e.g., prevents the pages holding the stacks of various processes from mapping to the same area in the cache)
• Implemented by keeping "bins" of free pages

Tolerating/Hiding Memory Latency

• One particular technique: prefetching
• Goal: bring data into the cache just in time for its use
  – Not too early, otherwise cache pollution
  – Not too late, otherwise "hit-wait" cycles
• Under the constraints of (among others):
  – Imprecise knowledge of the instruction stream
  – Imprecise knowledge of the data stream
• Hardware/software prefetching
  – Works well for regular-stride data access

Why, What, When, Where

• Why?
  – cf. the goals: hide memory latency and/or reduce cache misses
• What?
  – Ideally, a semantic object
  – Practically, a cache line, or a sequence of cache lines
• When?
  – Ideally, just in time
  – Practically, depends on the prefetching technique

Sequential & Stride Prefetching in Power 4/5

• When prefetching line i from L2 to L1:
  – Prefetch lines (i+1) and (i+2) from L3 to L2
  – Prefetch lines (i+3), …, (i+6) from main memory to L3

Software Prefetching

• Use of special instructions (cache hints: touch in PowerPC, load into register 31 for Alpha, prefetch in Intel micros)
• Non-binding prefetch (in contrast with proposals to prefetch into registers)
  – If an exception occurs, the prefetch is ignored
• Must be inserted by software (compiler analysis)
• Advantage: no special hardware
• Drawback: more instructions executed
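A concrete sketch of a non-binding, software-inserted prefetch, here using GCC/Clang's __builtin_prefetch intrinsic as a stand-in for the ISA-specific hint instructions named on the slide (it typically lowers to instructions such as x86 PREFETCHT0 or PowerPC dcbt); the function, loop, and prefetch distance are invented for the example:

```c
#include <stddef.h>

#define PREFETCH_DISTANCE 16   /* tuned to hide the expected miss latency */

double sum_with_prefetch(const double *a, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (i + PREFETCH_DISTANCE < n)
            /* Non-binding hint: read access (0), high temporal locality (3).
               If the address would fault, the hint is simply dropped. */
            __builtin_prefetch(&a[i + PREFETCH_DISTANCE], 0, 3);
        sum += a[i];    /* the real load still happens here */
    }
    return sum;
}
```

The extra instruction per iteration is the drawback mentioned on the slide; the benefit is that by the time `a[i]` is actually loaded, the line is (ideally) already in the cache.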
Metrics for Prefetching

• Coverage: useful prefetches / number of misses without prefetching
• Accuracy: useful prefetches / number of prefetches
• Timeliness: related to the number of hit-wait prefetches
• In addition, the usefulness of prefetching is related to how critical the prefetched data was

Write Buffers (cont'd)

• Writes from the write buffer to the next level of the memory hierarchy can proceed in parallel with computation
• Now loads must check the contents of the write buffer; also more complex for cache coherency in multiprocessors
  – Allow read misses to bypass the writes in the write buffer

Critical Word First

• Send first, from the next level of the memory hierarchy, the word for which there was a miss
• Send that word directly to the CPU register (or the IF buffer if it is an I-cache miss) as soon as it arrives
• Need a one-line buffer to hold the incoming line (and shift it) before storing it in the cache

Sectored (or Subblock) Caches

• The first cache ever (IBM 360/85, late 60s) was a sector cache
  – On a cache miss, send only a subblock, change the tag, and invalidate all other subblocks
  – Saves memory bandwidth
• Reduces the number of tags, but requires good spatial locality in the application
• Requires status bits (valid, dirty) per subblock
• Might reduce false sharing in multiprocessors
  – But requires metadata status bits for each subblock

MSHRs (Miss Status Holding Registers)

• Outstanding misses do not necessarily come back in the order in which they were detected
  – For example, miss 1 can percolate from L1 to main memory while miss 2 is resolved at the L2 level
• Each MSHR must hold information about the particular miss it will handle, such as:
  – Info relative to its placement in the cache
  – Info relative to the "missing" item (word, byte) and where to forward it (CPU register)

Implementation of MSHRs

• Quite a variety of alternatives
  – MIPS R10000, Alpha 21164, Pentium Pro, III and 4
• One particular way of doing it:
  – Valid (busy) bit (limited number of MSHRs – structural hazard)
  – Address of the requested cache block
  – Index in the cache where the block will go
  – Comparator (to prevent using the same MSHR for a miss to the same block)
  – If data is to be forwarded to the CPU at the same time as to the cache, needs the addresses of the registers (one …)

Cache Hierarchy

• Two, and even three, levels of caches in most systems
• L2 (or L3, i.e., board-level) is very large, but since L1 filters many references, its "local" hit rate might appear low (maybe 50%) (compulsory misses still happen)
• In general, L2 caches have longer cache lines and larger associativity
• In general, L2 caches are write-back, write-allocate
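A minimal sketch of one MSHR entry with the fields listed on the "Implementation of MSHRs" slide above; the field names, widths, and the 4-entry forwarding table are assumptions for illustration, not the layout used by any of the processors cited:

```c
#include <stdbool.h>
#include <stdint.h>

#define MAX_TARGETS 4   /* loads waiting on the same missing block (assumption) */

typedef struct {
    bool     valid;          /* "busy" bit: all MSHRs in use -> structural hazard */
    uint64_t block_addr;     /* address of the requested cache block; compared
                                against new misses so a second miss to the same
                                block reuses this entry instead of a new one */
    uint32_t cache_index;    /* where the block will be placed in the cache */
    struct {
        uint8_t dest_reg;    /* CPU register to forward the data to */
        uint8_t offset;      /* which word/byte of the block is needed */
        uint8_t size;        /* access size in bytes */
    } target[MAX_TARGETS];
    uint8_t  num_targets;
} MSHR;
```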
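To make the last slide's point about low "local" L2 hit rates concrete, the AMAT formula from earlier can be applied at each level; the numbers (1-cycle L1 hit, 95% L1 hit rate, 10-cycle L2 hit, 50% local L2 hit rate, 100-cycle main memory) are hypothetical, not from the slides:

```latex
% The L1 miss penalty is itself an average over L2 hits and misses,
% using the slide's formula at each level.
\[
P_{L1} = 10 \times 0.5 + 100 \times (1 - 0.5) = 55 \text{ cycles}
\]
\[
\text{AMAT} = 1 \times 0.95 + 55 \times (1 - 0.95) = 0.95 + 2.75 = 3.7 \text{ cycles}
\]
% Without the L2 (miss penalty 100 cycles), the same formula gives
% 0.95 + 100 x 0.05 = 5.95 cycles: even a 50% local L2 hit rate cuts
% the average L1 miss penalty from 100 to 55 cycles.
```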