Optimization of Mass Storage Hierarchies

Bruce Jacob
Advanced Computer Architecture Lab
EECS Department, University of Michigan
blj@umich.edu

Tech Report CSE-TR-228-95
May 12, 1994

Optimization is often a question of where one should put one's money in improving performance. As far as large storage hierarchies go, intuition suggests (and common practice supports) adding as much as is affordable of the fastest technology available. Many cache hierarchy studies have shown that this is often not the optimal approach, and we show that for mass storage hierarchies it always tends to be the wrong approach. For large data sets, as is the case for network file servers, a machine with no RAM and several gigabytes of disk performs 30% faster than a machine with no disk and a cost-equivalent amount of RAM. This paper presents a mathematical analysis of the optimization of an I/O hierarchy, as well as trace-driven simulations of a network file server in support of the analysis.

1.0 Introduction

Over the past several years, there has been a substantial increase in the speed and capacity demands placed on computer memory systems. Great strides have been made in the capacity of mass storage devices such as magnetic disk, tape, and optical media, but improvements in the speed of these devices have not kept up with the improvements in CPU speeds and semiconductor memory. Caching is used to hide the deficiency, but the widening gap between semiconductor memory and magnetic storage makes it easy to lose a great deal of performance through poor system configuration. Larger miss costs need to be offset by higher hit rates, inducing system administrators to buy more and more main memory for workstations; this is not necessarily the correct thing to do, especially in the case of file servers. In this environment, it is very important to spend one's money wisely, as the cost figures are staggering and $5,000 misplaced can result in a difference in performance of more than a factor of two.

Historically, much emphasis and attention has been paid to optimizing cache hierarchies [5][6][11][14][19], to managing large stores of data, and to having enough memory in a system [9]; however, little attention has been paid to how all the optimizations fit together. Previous studies have demonstrated that the optimal size and number of cache levels increase with an increasing data set size [5][6][14], and yet as our systems grow larger we continue to pour RAM into them instead of adding to the caches on disk, or adding even more levels to the hierarchy.

Typically, system administrators play a reactive game, adding more RAM when the system is thrashing and adding more disk when the users complain of too little file space. We demonstrate that this technique can cost over 30% in performance, and most system configurations tend to sit in the worst-case region.

1.1 Direction

The goal is to optimize the mass storage hierarchy under given cost constraints as well as reality constraints; for instance, DRAMs are not available in a continuum of speeds and sizes, but instead come in a few different speeds and only a few different cost-effective sizes. Much research has already gone into optimizing single-user, workstation-class machines on a network, and the results largely amount to "buy more RAM."
The focus of this paper is instead on how to optimize the machines connected to mass storage. These few but important machines are typically file servers, with allotted budgets large enough that an administrator may have too many configurations to choose from.

1.2 Background

Caching is used to mask the unacceptable latency of accessing slow devices. The hard disk, which was once used as the mass store, is now used as a cache for even larger tertiary storage, and the semiconductor memory becomes merely a higher-level cache. The enormous size of the I/O space is changing the way computer systems are used, designed, and optimized.

There have been a number of research projects exploring this new frontier of computing. Nakagomi et al. [13] suggested replacing the entire hierarchy with a single large, fast magneto-optical array, trading the top-level access time for a much higher bandwidth. In addition, less burden would be placed on the server machine due to a reduction in the overhead of copying data back and forth between several levels of storage devices. Several papers [3][7][20] suggest the use of non-volatile RAM to improve I/O performance by providing a reliable write buffer to reduce physical write traffic and improve response time. Finally, future devices such as holostore [17] are projected to fit nicely between semiconductor memory and magnetic disk in the storage hierarchy. Holostore technology offers capacity that approaches the density of magnetic disks with access times closer to semiconductor memory.

Another class of research involves the reconfiguring of traditional I/O devices in novel ways, rather than adding new types. Combining devices such as magnetic disk drives into disk arrays exploits parallelism, yielding an increase in bandwidth [10]. Tape striping performs a similar optimization for magnetic tape drives [8]. However, in the many papers on making large storage systems faster, there has been little mention of the enormous wealth of research done to optimize cache hierarchies. Most of the research mentioned looks at improving a single level of the hierarchy, or at reducing the number of levels in the system. It is perhaps an obvious point, but one which has been largely ignored: the work done in finding optimal memory hierarchies is very applicable to the area of mass storage hierarchies.

1.3 Results

This paper presents a new twist on old models for cache hierarchies, as applicable to mass storage hierarchies. The model uses a measure of program traces that is independent of cache sizes and configurations, something lacking in many cache analysis reports. The model is used to predict hierarchy configuration behavior in the face of cost constraints, and simulations of a network file server are used to check the model.

The results are that finding an optimal configuration at some cost point is partly a function of finding the optimal configuration at a smaller cost point, i.e., the optimization problem exhibits optimal substructure. This is good news for system administrators; it suggests that if a certain amount of money is spent creating an optimal system configuration, that money will not be wasted when it comes time to upgrade the system (as long as the technologies involved do not change radically).

2.5 Magnetic Tape

Traditionally, magnetic tape has been used as a backup file system, not very well integrated with the rest of the devices in the hierarchy.
Its large storage capacity and low price per bit make this medium very attractive for tertiary storage as user file spaces continue to grow. Since tape is a serial device, the latency for accessing data on a tape can be very long due to the time to serially search through the tape for the correct data. In addition, writes may only be done in an append-only fashion. If properly placed into a storage hierarchy, however, magnetic tape can be used effectively. Assuming the higher levels of the hierarchy service most of the file accesses, the average access time for a storage system backed by magnetic tape can remain reasonable. As with magnetic disks, tape striping has been proposed to improve the bandwidth of magnetic tape devices [8]. Bandwidth numbers for cartridge tape drives range from 100 KB/s to 50 MB/s; average seek times are around 10 to 40 seconds, depending upon the technology and capacity of the cartridge [8].

2.6 Optical and Magneto-Optical Disk

Slow to become mainstream for a variety of reasons, optical disks nevertheless have the potential to replace magnetic tape as the tertiary storage medium of choice. Optical disks have a price per bit similar to magnetic tape and allow efficient random access to large volumes of data, and could very well prove to be the more effective of the two [15]. In fact, Nakagomi et al. [13] suggest replacing everything from the magnetic disk array on down to the tape backup devices with one large (terabyte-sized) magneto-optical device. Optical disk bandwidth figures range from less than 1 MB/s to just over 10 MB/s; average seek times range from 60 ms for smaller drives to 7 seconds for large jukeboxes [16].

3.0 Mathematical Analysis

Quite a bit has been published on the optimization of cache hierarchies. Chow showed that the optimum number of cache levels scales with the logarithm of the capacity of the cache hierarchy [5][6]. Rege and Garcia-Molina demonstrated that it is often better to have more of a slower device than less of a faster device [9][18]. Welch showed that the optimal size of each level must be proportional to the amount of time spent servicing requests out of that level [19].

This paper builds upon previous cache studies with one major twist. Previous studies have been able to find solutions for optimal hierarchy configurations, but the results contained dependencies upon the cache configuration: the number of levels, the sizes of the levels, or the hit rates of the levels. This paper presents a model for a cache hierarchy that depends upon a characterization of the technologies that make up the hierarchy, a characterization of the application workloads placed upon the hierarchy, and nothing else. This makes it very easy to find closed-form solutions to the optimization problem.

3.1 Welch's Analysis

Welch took a model of a memory hierarchy from Chow [5] and concluded that if you have a fixed budget and a known set of probabilities of hitting any given level of a memory hierarchy, you can find the optimal appropriation of funds to the various hierarchy levels so that the average effective access time is minimized. This optimal appropriation yields a balanced memory hierarchy, in which the proportion of money spent at a given level is equal to the proportion of time spent at that level servicing requests [19].
Formally, if every hierarchy level $i$ has probability $P_i$ of being accessed, every level has a total cost $B_i$ (equal to cost per byte times capacity), and the time to access hierarchy level $i$ is $t_i$, then for every fixed total system cost $S = \sum_i B_i$ the average time per access $T_{avg} = \sum_i P_i t_i$ is minimized when

$$\frac{B_i}{S} = \frac{P_i\,t_i}{T_{avg}},$$

which is when the proportion of dollars spent at each level is equal to the proportion of time spent at each level. We would like to be able to rearrange this to get

$$\frac{B_i}{P_i\,t_i} = \frac{S}{T_{avg}},$$

which would suggest that to minimize the average access time $T_{avg}$, the fraction $S/T_{avg}$ should be conserved across all levels of the hierarchy, for a given system cost $S$. If this were true, then to maintain a balanced memory hierarchy when you increase the capacity of the hierarchy, the amount spent at each level needs to increase, proportional to the technology's performance and inversely proportional to the amount of time spent at the level. Assuming that the probability distributions do not change by an enormous amount for small changes in $S$, the simple result is that one should add a little to every level in the hierarchy when spending money on upgrading.

The problem is that the probability distributions do change; each $P_i$ is a function of the memory hierarchy configuration and of the workload placed upon it, which is why these probabilities show up in the results; you cannot get rid of them. The way Welch's theorem stands, you can know when you have reached an optimal configuration, but solving for one gets a little tricky.

3.2 Present Work: The System Model

The problem with previous analytical cache studies is that, in order to make the analysis tractable, they often make a number of assumptions which render the analyses less realistic. These assumptions include
• the availability of a continuum of device technologies,
• that the fault probabilities (miss ratios) of the individual hierarchy levels obey a power function,
• that the cost per bit of a technology is derivable from its speed, and
• that every technology obeys the same cost-speed relation.

This analysis depends only upon the cost per bit and access times of the technologies that make up the hierarchies, and a characterization of program traces. The assumptions are that cold-start and compulsory misses can be ignored for the moment (compulsory misses in a network file server on a scale of months are constant, and so disappear when finding the minimum), and so can write behavior. This last assumption is a rather large one, and is the subject of our ongoing research.

3.2.1 Stack Distance Curves

In this analysis, we make use of two related characterizations of program traces: the stack distance curves (the cumulative reference curve and its derivative). They give an insight into the locality of a program trace by measuring the LRU stack distance between successive references to the same item; i.e., if an LRU stack were being maintained (as it would be in the case of a cache), how far down the stack the item would be on its next reference. This gives a good indication of how well any given trace would perform with a cache of a specified size; if you know that 80% of all requests are within 64K of their previous reference, then a 64K cache would have a hit rate of 80% on that trace. This example makes use of the first curve, the cumulative reference curve; it plots the cumulative number of references against the stack depth (it becomes a cumulative probability curve if it is normalized by the number of references).
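As an illustration of how such curves can be obtained from a reference trace, here is a minimal sketch (hypothetical Python, not the instrumentation used for the traces in this paper): it maintains an LRU stack of items weighted by their sizes, records the byte distance between successive references to the same item, and turns the distances into a cumulative hit-rate curve. The function names, the toy trace, and the byte granularity are all assumptions for the example.

```python
from collections import OrderedDict

def stack_distances(trace):
    """LRU stack distances, in bytes, for a trace of (item_id, size_bytes).

    For each re-reference, record how many bytes of other items were touched
    since the item's previous reference -- the smallest cache size for which
    this reference would have been a hit.  First references are skipped.
    """
    stack = OrderedDict()                  # item_id -> size, most recent last
    distances = []
    for item, size in trace:
        if item in stack:
            depth = 0
            for other in reversed(stack):  # walk from most recent downward
                if other == item:
                    break
                depth += stack[other]
            distances.append(depth)
            del stack[item]                # move the item back to the top
        stack[item] = size
    return distances

def cumulative_curve(distances, cache_sizes):
    """Fraction of re-references that hit at each candidate cache size."""
    return [sum(d <= c for d in distances) / len(distances) for c in cache_sizes]

# Toy trace: with an 8 KB cache the re-reference to "a" hits, "b" does not.
trace = [("a", 4096), ("b", 4096), ("a", 4096), ("c", 8192), ("b", 4096)]
print(stack_distances(trace))                                    # [4096, 12288]
print(cumulative_curve(stack_distances(trace), [8192, 16384]))   # [0.5, 1.0]
```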
The second graph is the derivative of the first; it effectively plots the change in hit rate as a function of the cache size. At small values of x (cache size), small changes make large differences; farther out, larger changes in cache size are necessary to account for the same difference in hit rate. The result is the number of references at a given stack depth (alternatively, the probability of each stack depth, if normalized) plotted against the stack depth in bytes, and the area under the curve represents the number of references that are hits for a given cache size. We expect the graphs to look something like the following.

[Figure 1 — two panels: "Cumulative Reference Curve" and "Differential Curve".]

FIGURE 1. The two stack distance curves: the cumulative probability curve and its differential, the byte distance probability curve. Each plots stack distance against number of references or, if normalized by the total number of references, the probability of each reference. If the curves are normalized, the cumulative probability tops out at a value of one, meaning that the area under the differential curve is defined to be 1.

In this paper, we will use normalized graphs, so that the y-coordinate can be interpreted as a probability per reference, and the area under the differential can be defined to be 1.

3.2.2 The Analytical System Model

The stack distance curves are useful in the following way: the differential curve represents the number of references at any given stack depth, so a cache hierarchy can be modeled in the following manner. The L1 cache has a size of $s_{L1}$ and a time to reference of $t_{L1}$. It is hit on every reference (whether the reference is a hit or not), and so each reference requires at least time $t_{L1}$. The total number of accesses is the area under the curve, so the total time spent in the L1 cache is given by

$$t_{L1}\int_{0}^{\infty} p(x)\,dx,$$

where $p(x)$ is the differential. The number of hits to the L1 cache is equal to the area under the curve from 0 to $s_{L1}$, so the number of misses is the rest of the area. This, then, is the number of requests that will be seen by the L2 cache. If the size and access time of the L2 cache are defined as $s_{L2}$ and $t_{L2}$, the total time spent by the L2 cache will be

$$t_{L2}\int_{s_{L1}}^{\infty} p(x)\,dx.$$

Since we have normalized the graphs to represent probability, the area from $s_{L1}$ to infinity is the L1 miss ratio rather than the miss count, and the equations become a measure of time per reference instead of the total running time of the entire trace. The average time per reference of the whole hierarchy is the sum of the average times spent in each of the hierarchy levels; the general time formula is given by

$$T_{avg} \;=\; \sum_{i} t_{i}\int_{s_{i-1}}^{\infty} p(x)\,dx, \qquad s_{0} = 0.$$

3.3.1 Finding Optimal Configurations

We can now find the optimal configurations for all values of the system budget. In the following sections, let $s_{RAM}$, $s_{Disk}$, and $s_{Jukebox}$ represent the sizes of the levels in the hierarchy (units = MB); let $t_{RAM}$, $t_{Disk}$, and $t_{Jukebox}$ represent the access times for those technologies (units = sec); and let $c_{RAM}$ and $c_{Disk}$ represent the costs of those technologies (units = dollars/MB). $B$ represents the total system budget, in dollars.
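To make the model concrete, the sketch below evaluates the general time formula above for a RAM/disk/tertiary hierarchy given a normalized differential $p(x)$. The parameter values and the exponential form of $p(x)$ are illustrative assumptions, not measured quantities from the traces.

```python
import math

def tail_area(p, s, upper=50_000.0, n=20_000):
    """Trapezoid-rule approximation of the integral of p(x) from s to `upper`
    (standing in for the integral to infinity once p has decayed to ~0)."""
    if s >= upper:
        return 0.0
    h = (upper - s) / n
    return h * (0.5 * (p(s) + p(upper)) + sum(p(s + i * h) for i in range(1, n)))

def avg_time(p, s_ram, s_disk, t_ram, t_disk, t_jukebox):
    """Average time per reference: every reference pays t_ram, references
    deeper than s_ram also pay t_disk, and references deeper than s_disk
    also pay t_jukebox."""
    return (t_ram * tail_area(p, 0.0)
            + t_disk * tail_area(p, s_ram)
            + t_jukebox * tail_area(p, s_disk))

# Illustrative numbers only (sizes in MB, times in seconds, all assumed).
alpha = 0.004                                   # hypothetical locality parameter
p = lambda x: alpha * math.exp(-alpha * x)      # a normalized differential
print(avg_time(p, s_ram=64, s_disk=4096, t_ram=1e-6, t_disk=0.015, t_jukebox=1.0))
```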
Let us assume that the differential curve has no local minima. This is intuitively realistic for average workloads; remember that we are intent on optimizing a server, so even if the original full trace has a curve that looks like the following, we only see the portion of the traces which reaches the server: that portion which spills out of the client-side cache.

[Figure — a full-trace locality curve; the vertical line marks the size of the client cache.]

The spillover portion is represented by the tail of the curve to the right of the vertical line in the graph. This has some interesting implications concerning the size of client-side caches; for example, if the client's cache is too small and the portion of the I/O requests that spills out to the server has a locality curve with local maxima and minima, then the server could believe there to be global minima in access time where in fact there are only local minima.

We have an optimal configuration when $T_{avg}$ is minimized. Using Equation 2 and Equation 3, remembering that we allow $s_{RAM}$ to vary with each given system budget, and using the fact that $\frac{d}{da}\int_a^b f(x)\,dx = -f(a)$, we find minima when $\partial T_{avg} / \partial s_{RAM} = 0$, where

$$\frac{\partial T_{avg}}{\partial s_{RAM}} \;=\; -\,t_{Disk}\,p(s_{RAM}) \;+\; t_{Jukebox}\cdot\frac{c_{RAM}}{c_{Disk}}\cdot p\!\left(\frac{B - s_{RAM}\,c_{RAM}}{c_{Disk}}\right),$$

giving solutions where

$$\frac{p(s_{RAM})}{p(s_{Disk})} \;=\; \frac{t_{Jukebox}\,c_{RAM}}{t_{Disk}\,c_{Disk}}. \qquad \text{(EQ 6)}$$

The ratio on the right-hand side turns out to be a recurring theme, so we assign it the label $\Psi$, the performance ratio:

$$\Psi \;=\; \frac{t_{Jukebox}\,c_{RAM}}{t_{Disk}\,c_{Disk}}. \qquad \text{(EQ 7)}$$

The ratio of the values of the differential at $s_{RAM}$ and $s_{Disk}$ is a measure of the difference in hit rates; each value represents the number of references that would hit in the cache if the cache were increased by a small (essentially zero) amount. The performance ratio gets larger as disk performance increases relative to tertiary storage, and as the price of disk decreases relative to the price of RAM. We find the optimal configuration point where the number of reference hits gained by increasing the amount of RAM by some small amount, compared to the number of reference hits gained by increasing the amount of disk by some small amount, is equal to the performance ratio. $\Psi$ is a measure of the effectiveness of the disk cache level; it represents the time saved by adding more disk to the system (by taking that time away from the tertiary storage system), and it represents the cost-effectiveness of disk as compared to RAM.

Equation 6 is therefore a method of finding the optimal solution: given a probability curve that represents the expected workload, for any value of $s_{RAM}$ we can find the corresponding optimal amount of disk that should be in the system, and for any value of $s_{Disk}$ we can find the corresponding optimal amount of RAM that should be in the system. We shall see that it is not quite this simple, because at small values of disk the optimal amount of RAM is often a negative value, but this is easily overcome.
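Equation 6 can also be applied numerically. The sketch below (an illustration, with an assumed exponential $p(x)$ and assumed prices and access times) scans candidate RAM sizes under a fixed budget, spends the remainder on disk, and keeps the split whose hit-rate ratio $p(s_{RAM})/p(s_{Disk})$ comes closest to $\Psi$; below the crossover the scan simply pins $s_{RAM}$ at zero.

```python
import math

# Illustrative parameters (assumed for the example, not measured values).
c_ram, c_disk = 32.0, 0.5            # dollars per MB
t_disk, t_jukebox = 0.015, 1.0       # seconds per access
alpha = 0.004                        # hypothetical locality parameter
p = lambda x: alpha * math.exp(-alpha * x)

psi = (t_jukebox * c_ram) / (t_disk * c_disk)    # performance ratio (Equation 7)

def optimal_split(budget, step=0.25):
    """Scan RAM sizes (MB); spend the rest of the budget on disk and keep the
    split whose ratio p(s_ram)/p(s_disk) comes closest to psi."""
    best = None
    s_ram = 0.0
    while c_ram * s_ram <= budget:
        s_disk = (budget - c_ram * s_ram) / c_disk
        err = abs(p(s_ram) / p(s_disk) - psi)
        if best is None or err < best[0]:
            best = (err, s_ram, s_disk)
        s_ram += step
    return best[1], best[2]

for budget in (1024, 2048, 4096):
    ram, disk = optimal_split(budget)
    print(f"${budget}: {ram:.0f} MB RAM, {disk:.0f} MB disk")
```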
3.3.2 The Crossover Point

One of the simplest questions to ask at this point is: where should the first dollar go? It is not quite as simple as comparing Equation 4 to Equation 5, as we also need to account for the addition of a new level in the hierarchy (at dollar $0, the only thing in the system is the tertiary storage). The amount of time saved per reference by adding anything to the system is equal to the time to reference the jukebox multiplied by the expected number of references that will be hits in the new level added (those that will not have to go to the tertiary storage), less the additional expense of having to access the new technology on every reference. We shall see that the first dollar will always be spent on disk. This agrees with results from [10], [18], and [19]; when the amount of money to spend on the cache is small, capacity counts for more than speed.

Assuming for the moment that RAM will not be part of the optimal configuration until system budget numbers get large, another question that we can ask is: where is the crossover point? We will answer the question, and in doing so, find an alternate way of arriving at (and explaining) Equation 6. At what cost point does it make sense to add the first dollar of RAM? At what system budget is the performance gained by adding a dollar of RAM equal to the performance gained by adding a dollar of disk? The crossover point is the cost point at which it makes just as much sense to add the next bit of RAM as it does to add the next bit of disk.

Consider the following scenario: we have a system with only disk in it; the disk has size $s_{Disk}$, and for the same amount of money we can add an amount $s_{RAM}$ of RAM or an amount $\Delta_{Disk}$ of disk. We wish to find the point at which the area under the curve from 0 to $s_{RAM}$ is equal to the area under the curve from $s_{Disk}$ to $s_{Disk}+\Delta_{Disk}$, where the area from 0 to $s_{RAM}$ is scaled by the time to reference the disk level, and the area from $s_{Disk}$ to $s_{Disk}+\Delta_{Disk}$ is scaled by the time to reference the tertiary storage level. We have the following relations:

$$T_{saved\ by\ adding\ RAM} \;=\; t_{Disk}\int_{0}^{s_{RAM}} p(x)\,dx \;-\; t_{RAM} \qquad \text{(EQ 8)}$$

$$T_{saved\ by\ adding\ Disk} \;=\; t_{Jukebox}\int_{s_{Disk}}^{s_{Disk}+\Delta_{Disk}} p(x)\,dx \qquad \text{(EQ 9)}$$

$$c_{RAM}\,s_{RAM} \;=\; c_{Disk}\,\Delta_{Disk} \qquad \text{(EQ 10)}$$

We find the crossover point when Equation 8 and Equation 9 are equal, and if we let the sizes of each incremental amount to add to the system approach zero, we find a solution when

$$\lim_{s_{RAM}\to 0}\left( t_{Disk}\int_{0}^{s_{RAM}} p(x)\,dx - t_{RAM} \right) \;=\; \lim_{\Delta_{Disk}\to 0}\left( t_{Jukebox}\int_{s_{Disk}}^{s_{Disk}+\Delta_{Disk}} p(x)\,dx \right).$$

This yields

$$\lim_{s_{RAM}\to 0}\bigl( t_{Disk}\,p(s_{RAM})\,s_{RAM} - t_{RAM} \bigr) \;=\; \lim_{\Delta_{Disk}\to 0}\bigl( t_{Jukebox}\,p(s_{Disk})\,\Delta_{Disk} \bigr). \qquad \text{(EQ 11)}$$

Since $t_{RAM} \ll t_{Disk}$, we lose the $t_{RAM}$ term and approximate this as

$$\lim_{s_{RAM},\,\Delta_{Disk}\to 0} \frac{p(s_{RAM})}{p(s_{Disk})} \;=\; \lim_{s_{RAM},\,\Delta_{Disk}\to 0} \frac{t_{Jukebox}}{t_{Disk}}\cdot\frac{\Delta_{Disk}}{s_{RAM}}. \qquad \text{(EQ 12)}$$

Using Equation 10, this yields something that looks very similar to Equation 6:

$$\frac{p(0)}{p(s_{Disk})} \;=\; \frac{t_{Jukebox}\,c_{RAM}}{t_{Disk}\,c_{Disk}} \;=\; \Psi. \qquad \text{(EQ 13)}$$

The right-hand side is the performance ratio $\Psi$, and the relation says that the crossover point occurs farther out when the effectiveness of disk is high, and closer in when the effectiveness of disk is low.

In general, every optimal point is a kind of crossover point; it is a point at which (as the amounts approach zero) it makes as much sense to add an amount of RAM as it does to add an amount of disk to the system. If it were otherwise, the solution would not be optimal; if there were an advantage to adding zero RAM over adding zero disk, then the optimal solution at that cost point would instead have more RAM and less disk in it. This being the case, we can think of the entire crossover point discussion as a method for finding any optimal point after the crossover point. The starting value for RAM would be $s_{RAM}$ instead of 0, and the $t_{RAM}$ term would not appear in Equation 8 (which is fine, since the term was dispensed with in Equation 12). The result would be an alternate derivation of Equation 6.

3.3.3 Modeling the Differential as an Exponential

If we assume that $p(x)$ looks like $\alpha e^{-\alpha x}$ for some $\alpha$, we have

$$s_{Disk} - s_{RAM} \;=\; \frac{\ln(\Psi)}{\alpha}. \qquad \text{(EQ 14)}$$

Combining with Equation 3, we conclude that for the optimal configurations at each budget $B$, the sizes of RAM and disk are given by

$$s_{RAM} \;=\; \frac{B}{c_{RAM}+c_{Disk}} \;-\; \frac{c_{Disk}}{c_{RAM}+c_{Disk}}\cdot\frac{\ln(\Psi)}{\alpha}, \qquad s_{Disk} \;=\; s_{RAM} + \frac{\ln(\Psi)}{\alpha}.$$
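These closed forms follow from Equation 14 together with the budget constraint; a brief derivation, assuming the constraint takes the form $B = c_{RAM}\,s_{RAM} + c_{Disk}\,s_{Disk}$ (an assumed form, chosen to be consistent with the closed forms above):

```latex
% Assumed budget constraint: B = c_RAM * s_RAM + c_Disk * s_Disk.
\begin{align*}
  s_{Disk} - s_{RAM} &= \frac{\ln\Psi}{\alpha} && \text{(Equation 14)}\\
  B &= c_{RAM}\,s_{RAM} + c_{Disk}\,s_{Disk}
     = (c_{RAM}+c_{Disk})\,s_{RAM} + c_{Disk}\,\frac{\ln\Psi}{\alpha}\\
  \Rightarrow\;
  s_{RAM} &= \frac{B}{c_{RAM}+c_{Disk}}
             - \frac{c_{Disk}}{c_{RAM}+c_{Disk}}\cdot\frac{\ln\Psi}{\alpha},
  \qquad
  s_{Disk} = s_{RAM} + \frac{\ln\Psi}{\alpha}.
\end{align*}
```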
3.3.3 Modeling the Differential as an Exponential If we assume that looks like for some , we have (EQ 14) Combining with Equation 3, we conclude that for that optimal configurations at each budget , the sizes of RAM and disk are given by TsavedbyaddingDisk t Jukebox p x ( ) dx s Disk s Disk ∆ Disk+ ∫ = cRAMsRAM cDisk∆Disk= tDisk p x( )dx 0 sRAM ∫ tRAM−     s RAM 0→ lim t Jukebox p x ( ) dx s Disk s Disk ∆ Disk + ∫       ∆ disk 0 → lim = tDiskp sRAM( )∆Disk tRAM−( ) sRAM 0→ lim tJukeboxp ∆Disk( )∆Disk ( ) ∆Disk 0→ lim = tRAM tDisk« tRAM p sRAM( ) p sDisk( )      ∆Disk sRAM, 0→ lim sDisk sRAM t Jukebox t Disk ⋅       ∆Disk sRAM, 0→ lim = p 0( ) p sDisk( ) tJukeboxcRAM tDiskcDisk = Ψ sRAM tRAM− p x( ) αe αx− α sDisk sRAM− ln Ψ( ) α = B Optimization of Mass Storage Hierarchies 15 There is also the term scaling the ratio in Equation 15 downward as gets larger; this makes sense, as a large value for makes the probability curve steeper near the y-axis and decays to zero very rapidly, allowing a smaller amount of RAM to cover a larger number of references. On the other hand, a small value for ( ) makes the function decay very slowly, with very little area under the curve near the y-axis; this makes it difficult for a small amount of memory to cover a substantial amount of references, and so large values of cheap storage become necessary. will always be greater than 1, so we do not need to worry about Equation 20 giving us invalid solutions for . How- ever, Equation 15 can have solutions that lie outside the valid range. This occurs when is the amount of disk one could buy if the entire budget is spent on disk. The inequality suggests the same conclu- sion that was drawn from the exponential model; that the crossover point scales with the effectiveness of the disk system. The crossover point is where the minimum solution for goes from being negative to positive; when the amount of disk you can buy with the budget is greater than or equal to the root of . If we assume for the moment that , that RAM costs about 60 times as much as disk and that magnetic disk devices are 100 times faster than tertiary storage, the crossover point is when ; when the amount of disk in the system reaches roughly 6 gigabytes. Unlike the exponential example, the polynomial does not have identical slopes for the addition of RAM and disk to the system; here the amount added will always be in the same ratio but will favor disk. 3.3.5 Conclusions The following table summarizes the results of the curve fitting expedition: A few things seem to be independent of our choice for a well-defined approximation of the locality curve: • the amount of disk in the optimal solution will never be less than the amount of RAM, • the amount of disk will never be zero, • the disparity between disk and RAM increases as increases (whether it be in a ratio between the two or in a constant difference), and, of course, • the ratio of the values of the locality function at the size of RAM and disk are equal to the performance ratio. A few things seem at least at first glance to be very dependent upon our choice of approximations, and therefore weaken the model somewhat: • the rate at which RAM and Disk levels grow (parallel lines in exponential approximation, diverging lines in polynomial approximation), and TABLE 1. Comparison of modeling the differential as an exponential and a polynomial curve. 
4.0 Simulations

In order to check the validity of the model, we present a trace-driven comparison of the effectiveness of a number of different I/O hierarchies, given a set of cost values. We decided that the minimum increment to spend on an upgrade would be $256, which should buy roughly 1/2 GB of disk space or 8 MB of RAM.

4.1 Workload Description

The data used for the workload in the trace-driven hierarchy simulations was collected by CITI via their logging AFS server [4]. The only data that the server sees are those accesses not serviced from the client's local cache. We use approximately one month's worth of trace data captured from a server named "marge" from April 14, 1992 through May 8, 1992. The full traces represent over 20 million records of several different types of AFS server requests. The commands that we are interested in are fetchdata, storedata, removefile, createfile, removedir, and makedir. The rest of the commands do not actually read and write data; they are used to synchronize the local cache with the server, as in the case of fetchstatus, etc.

The environment in which the traces were captured was the University of Michigan Institutional File System (IFS). The IFS consists of AFS servers and clients running on various platforms scattered across the U-M campus. The types of applications span a wide range, including general-purpose computing such as e-mail and text editing as well as software archives and traditional engineering jobs. The file sizes range from very small (less than one KByte) to fairly large (several megabytes). We believe that the traces accurately portray typical file server activity.

4.2 Simulator Description

Our simulator is a combination of a number of similar modules that implement objects such as memory, disk drives, and automated tertiary storage devices such as MO jukeboxes or cartridge tape autoloaders, and a skeleton frame that connects the object modules together. The object modules follow the same interface, so they are completely interchangeable, and new modules can be implemented and inserted into the hierarchy with minimal effort.

TABLE 2. Summary of specs used for the simulator.

Technology        | Capacity                              | Block Size | Cost per MB               | Latency                                | Bandwidth
DRAM file buffers | 8 MB-64 MB, in increments of 8 MB     | 8 KBytes   | $32.00 ($256 buys 8 MB)   | negligible                             | 160 MB/sec
Magnetic Disk     | 512 MB-6 GB, in increments of 512 MB  | 16 KBytes  | $0.50 ($256 buys 0.5 GB)  | 10 ms, plus a rotational delay of 5 ms | 10 MB/sec
Tertiary Storage  | 100 GB                                | 64 KBytes  | $0.001                    | 1 sec                                  | 1 MB/sec
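The interchangeable-module design described above can be pictured with a small sketch (hypothetical Python, not the simulator's actual code): each level charges its own latency and transfer time, keeps per-level statistics, and forwards misses to the level below. The class and method names are assumptions, and capacity limits and replacement are omitted for brevity; the latency and bandwidth figures are taken from Table 2.

```python
class StorageLevel:
    """Assumed shape of an interchangeable object module: each level charges
    its own latency and transfer time, keeps usage statistics, and forwards
    misses to the level below.  Capacity limits and replacement are omitted."""

    def __init__(self, name, latency_s, bandwidth_mb_s, lower=None):
        self.name = name
        self.latency_s = latency_s
        self.bandwidth = bandwidth_mb_s
        self.lower = lower                  # next level down, or None
        self.resident = set()               # crude record of cached blocks
        self.time_spent = 0.0               # per-level statistics
        self.hits = self.misses = 0

    def access(self, block_id, size_mb):
        t = self.latency_s + size_mb / self.bandwidth
        self.time_spent += t
        if block_id in self.resident or self.lower is None:
            self.hits += 1                  # the bottom level never misses
        else:
            self.misses += 1
            t += self.lower.access(block_id, size_mb)   # fill from below
            self.resident.add(block_id)
        return t

# Wire a hierarchy with the Table 2 figures (RAM latency treated as ~0).
tape = StorageLevel("tertiary", 1.0, 1.0)
disk = StorageLevel("disk", 0.015, 10.0, lower=tape)
ram = StorageLevel("ram", 0.0, 160.0, lower=disk)
total = sum(ram.access(b, 16 / 1024) for b in (1, 2, 1, 3, 2))   # 16 KB blocks
print(total, ram.time_spent, disk.time_spent, tape.time_spent)
```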
The object modules are responsible for implementing the various caches and keeping track of the usage statistics. At each level, if the item requested is not present, it is requested from the next level down. The cost per megabyte, transfer time, and latency figures are the same as those described in Section 2.0. The running times reported by the simulator are an upper bound, in that the time taken at a given level depends upon the number of hits as well as the number of misses; if there is a miss at a level, time is taken to fill and prefetch; it is not the case that misses are handled entirely as a hit at the next lower level. In order to simplify implementation, writes were treated similarly to reads, in a read-modify-write manner. Future studies will deal more with this issue. Przybylski ignores writes altogether in his cache analysis [14], but it is not the case that writes can simply be dispensed with in an I/O hierarchy, as the cost must be accounted for somewhere.

4.2.1 Memory Object

The memory module keeps lists of blocks of user data in a roughly LRU-type set-associative cache configuration. For all accesses, the time spent is equal to the number of blocks requested divided by the transfer rate.

4.2.2 Disk Object

The disk module assumes a fixed number of cylinder groups, with an essentially infinite number of platters: no matter how large the disk is defined to be, it is still considered one physical disk drive, and the size of each cylinder group (the number of blocks allowed in each group's linked list) is determined by the disk's capacity. The group in which a given file is placed is determined by a hashing function based upon the user number and file number, so that a user's files tend to be in or near the same cylinder group. The latency is the approximate time to move the "head" of the disk to the requested cylinder group from the group of the previous request. It is calculated to be the distance in cylinder groups times the given latency, divided by the number of groups. The total transfer time is the latency plus the number of blocks requested divided by the given transfer rate.

4.2.3 Tertiary Storage Object

The tertiary storage module is an approximation of a cartridge tape autoloader or magneto-optical jukebox. Each storage unit is defined to hold 5 GB of data, and the module can read several units at once. The total transfer time is calculated similarly to the disk object, in that each reader remembers where the last request was and calculates latency by multiplying distance by the given latency divided by the number of regions. The total transfer time is this latency plus the number of blocks requested divided by the given transfer rate. The tertiary storage module is always the lowest level of the storage hierarchy, and thus a data access never misses there. The only reason this module is instantiated in the hierarchy is to compute miss times when the data is not present in the next higher level of the hierarchy.

Figure 4 shows the effect of adding more and more cost to the hierarchy, and shows within each cost point the performance breakdown among RAM, disk, and tertiary storage. This shows the expected asymptotic behavior of adding memory to the hierarchy. It is interesting to point out that the running time drops by almost a factor of two when 3 gigabytes of disk are added to the system, at a cost of about $1,500. From then on, three times that dollar amount worth of RAM and disk is added to the system for a performance increase of only another 10-15%.

[Figure 4 — stacked bars, "Performance Breakdown Across Hierarchy Levels"; x-axis: Total System Budget (dollars), y-axis: Running Time (secs); components: Jukebox, Disk, RAM.]

FIGURE 4. The total running times of optimal configurations for different system costs. The running times are broken down by technology; for instance, at cost point $256, the total system running time is decreased by roughly 15% by adding 512 MB of disk, which now accounts for roughly one fourth of the total time. The running time of any given level decreases whenever an amount is added to the level immediately above. The slivers that represent RAM time are difficult to see, but can be detected because of this behavior; for instance, between cost points $1536 and $1792 the Jukebox time does not decrease, but the Disk time does. Here is where the first 8 MB of RAM is added.
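The disk object's timing rule from Section 4.2.2 is simple enough to sketch directly; the class below is an assumed illustration, not the simulator's code, and the choice to add the rotational delay on every request, as well as the hash function, block size, and group count, are assumptions.

```python
# A sketch of the cylinder-group timing rule described in Section 4.2.2.
class DiskModel:
    def __init__(self, groups=100, seek_s=0.010, rotation_s=0.005,
                 bandwidth_mb_s=10.0):
        self.groups = groups            # fixed number of cylinder groups
        self.seek_s = seek_s            # full-stroke seek figure (Table 2)
        self.rotation_s = rotation_s    # added per request (an assumption)
        self.bandwidth = bandwidth_mb_s
        self.head_group = 0             # group of the previous request

    def group_of(self, user_id, file_id):
        # Hash on user and file number so a user's files cluster together.
        return hash((user_id, file_id)) % self.groups

    def request_time(self, user_id, file_id, blocks, block_mb=16 / 1024):
        group = self.group_of(user_id, file_id)
        distance = abs(group - self.head_group)
        self.head_group = group
        latency = distance * self.seek_s / self.groups + self.rotation_s
        return latency + blocks * block_mb / self.bandwidth

disk = DiskModel()
print(disk.request_time(user_id=42, file_id=7, blocks=4))
```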
Figure 5 plots the RAM costs of the optimal configurations for each cost point, as well as the scaled performance of the system (proportional to one over the running time). The step-wise nature of the cost and size curves is due to our restriction that we only allow additions of 8 MB of RAM or 512 MB of disk at a time.

At the start, there is a question of whether to add disk or RAM. Since the client AFS caches have filtered out much of the available locality before it gets to the server, a small amount (8 MB) of RAM will not achieve a high enough hit rate to offset the fact that we would be going to tape often. The half-gigabyte disk drive, however, will show a much better hit rate, and this serves to offset the disk's much slower access time. The crossover point is seen to be somewhere between $512 and $1536.

The graph demonstrates an effect not predicted by the analysis: at a certain point, the disk curves top out and no longer increase, while the RAM curves increase rapidly, since every dollar after that point is spent on RAM. This is due to the fact that the traces are finite and so cannot be approximated by a real function that asymptotes to zero; here, the traces actually reach zero, and as a result, once the disk size increases to the value where the differential curve hits the x-axis, there are no more references that can be used to reduce the running time of the tertiary device (by turning them into hits at the disk level). We believe that this is what is happening around the $4,000 cost point; from here on, it seems that there is no more benefit to adding any more disk. This would suggest that the effective working set size of the AFS traces is somewhere around six gigabytes; for larger working sets, the point would be appropriately pushed off to the right.

[Figure 5 — "Optimal Size-Cost Relationships"; x-axis: Total System Budget (dollars); curves: Amount RAM (MB), Amount RAM ($), Amount Disk (MB), Amount Disk ($), Performance - scaled.]

FIGURE 5. The sizes and costs of RAM and disk for the optimal configurations. Note that the MBytes RAM curve is a constant multiple of the Cost RAM curve; the disk curves are similarly related. Thus the MBytes RAM curve, which is lost down near the x-axis, is shaped exactly like the Cost RAM curve higher up. This is where the linear nature of the size/cost relation is evident; while it still makes sense to add disk to the system (while the size of the disk is still less than the size of the data set in the traces), the optimal amounts of RAM and disk both increase linearly as a function of system cost. As soon as there is enough disk to cover the data in the traces, no more disk needs to be added to the system, and all future increments in the system budget go toward RAM. It is very important to note that this curve contains the effect of all cold start misses as well as all compulsory misses; nothing has been removed; the performance is very pessimistic.
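The step-wise curves in Figures 4 and 5 come from evaluating configurations in $256 increments. A simplified sketch of that kind of search is shown below; it uses an assumed exponential locality curve as an analytic stand-in for the simulator, and its accounting (disk covering only what RAM does not) is a simplification, so it illustrates the shape of the search rather than reproducing the measured results.

```python
import math

# Enumerate every way to split a budget into $256 steps of RAM (8 MB) or
# disk (512 MB) and keep the split with the lowest modeled time per reference.
STEP = 256                         # dollars per increment
RAM_MB, DISK_MB = 8, 512           # what one increment buys
T_DISK, T_JUKE = 0.015, 1.0        # seconds per access (RAM time ~0)
alpha = 0.004                      # hypothetical locality parameter
tail = lambda s: math.exp(-alpha * s)     # integral of p(x) from s to infinity

def modeled_time(ram_mb, disk_mb):
    return T_DISK * tail(ram_mb) + T_JUKE * tail(ram_mb + disk_mb)

def best_config(budget):
    steps = budget // STEP
    return min((modeled_time(r * RAM_MB, (steps - r) * DISK_MB),
                r * RAM_MB, (steps - r) * DISK_MB)
               for r in range(steps + 1))

for budget in (512, 1536, 3072, 4608):
    t, ram_mb, disk_mb = best_config(budget)
    print(f"${budget}: {ram_mb} MB RAM, {disk_mb} MB disk, {t * 1000:.2f} ms/ref")
```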
5.0 Comparison of Simulated and Analytical Results

The traces used in Section 4.0 were analyzed for their locality behavior, producing the curves shown in Figure 6. The analysis produced the cumulative probability curve and its differential, and the differential was fit by the two functions used in Section 3.3.1.

[Figure 6 — two panels plotting Probability against LRU Distance between Successive References (MBytes), each showing the Cumulative Probability (scaled), the Differential of the Probability Curve, and a Curve Fit to the Differential. The exponential fit $\alpha e^{-\alpha x}$ gives $\alpha = 0.004206$; the polynomial fit $(\alpha-1)(x+1)^{-\alpha}$ gives $\alpha = 1.009946$.]

FIGURE 6. Fitting curves to the observed differentials. The graphs plot the locality behavior observed in the simulator. The first graph is fit with an exponential function, the second with a polynomial. The fitted curves are used for the values of $\alpha$ that they produce, in order to verify the analysis.

Second, and maybe more importantly, the simulator made a number of assumptions that the analytical model did not. For instance, the model ignores the effects of cold start and compulsory misses. The simulator did not, and the graphs demonstrate this (the performance curves are pretty bad). Also, the model ignores writes for the moment; this allows for a cleaner set of equations, but causes problems when the simulated results (which do not ignore writes, but rather treat them like reads) are compared against the calculated ones.
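The fits in Figure 6 can be reproduced in outline with standard least-squares tooling. The sketch below assumes SciPy and uses synthetic data in place of the measured differential, so the fitted values are illustrative only.

```python
import numpy as np
from scipy.optimize import curve_fit

def exponential(x, a):
    return a * np.exp(-a * x)

def polynomial(x, a):
    return (a - 1.0) / (x + 1.0) ** a

# Synthetic stand-in for the measured differential (MB on the x-axis).
x = np.linspace(0.0, 5120.0, 512)
observed = exponential(x, 0.0042)

(alpha_exp,), _ = curve_fit(exponential, x, observed, p0=[0.01])
(alpha_poly,), _ = curve_fit(polynomial, x, observed, p0=[1.1])
print(f"exponential alpha = {alpha_exp:.6f}, polynomial alpha = {alpha_poly:.6f}")
```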
6.0 Conclusions and Future Work

We have demonstrated the comparative effectiveness of adding disk buffers or RAM buffers to a given system. The stack distance curves are an invaluable tool for doing experimental cache work; they are independent of the cache size and configuration and yet can be used to predict cache performance figures, as they are differentiable. We have found a closed-form solution for determining the optimal configuration of a two-level hierarchy.

Since we have restricted our focus to machines connected to mass storage devices, we are typically dealing with servers, whose file request streams exhibit a lower degree of locality than most people are used to thinking about. This arrangement yields a number of surprising results, such as the fact that disk is far more important to systems than most people believe; a system with a few gigabytes of disk acting solely as a cache for the file system will perform better than a system with a cost-equivalent amount of RAM instead.

In the derivation of the solution for optimality, we present the performance ratio, a characterization of the effectiveness of the disk level. The ratio is used in calculating the optimal configuration for a given workload, scaling upward the point at which RAM should be added to the system as the value of the ratio increases.

As far as future work goes, there are a number of items to work upon. Since the hierarchy model does not consider writes, this needs to be added and its effect upon the results needs to be measured. Along the same lines, the simulator needs to be rewritten to ignore compulsory and cold start misses in order to make the comparisons fair.

There have been a number of studies that suggest the number of levels in the hierarchy must grow as the size of the hierarchy grows. This makes intuitive sense and should be investigated.

Acknowledgments

The present work is based upon the work done by myself and Seth Silverman for a class project in the Winter of 1994 (Peter Chen's EECS 598). I wrote the simulator, Seth ran the simulations and made the resulting graphs, and Eric Sprangle pointed us in the right direction to begin with. Enormous thanks go to the men of CITI: Peter Honeyman, Saar Blumson, and Dan Muntz, for graciously allowing us to use their AFS traces. Without their data, most of this work would not have been done.

References

1. Advertisements, Computer Shopper, April 1994.
2. Antonelli, C. J. and Honeyman, P., "Integrating Mass Storage and File Systems," Twelfth IEEE Symposium on Mass Storage Systems, pp. 133-138, April 1993.
3. Baker, M., Asami, S., Deprit, E., Ousterhout, J. and Seltzer, M., "Non-Volatile Memory for Fast Reliable File Systems," Fifth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS-V), pp. 20-22, October 1992.
4. Blumson, S., Honeyman, P., Ragland, T. E. and Stolarchuk, M. T., "AFS Server Logging," CITI Technical Report 93-10, November 1993.
5. Chow, C. K., "On Optimization of Storage Hierarchies," IBM Journal of Research and Development, Vol. 18, No. 3, May 1974.
6. Chow, C. K., "Determination of Cache's Capacity and its Matching Storage Hierarchy," IEEE Transactions on Computers, Vol. C-25, No. 2, February 1976.
7. Copeland, G., Keller, T., Krishnamurthy, R. and Smith, M., "The Case for Safe RAM," Proceedings of the Fifteenth International Conference on Very Large Databases, pp. 327-335, August 1989.
8. Drapeau, A. L. and Katz, R. H., "Striped Tape Arrays," Twelfth IEEE Symposium on Mass Storage Systems, pp. 257-265, April 1993.
9. Garcia-Molina, H., Park, A. and Rogers, L., "Performance Through Memory," Proceedings of the 1987 ACM SIGMETRICS Conference, pp. 122-131, May 1987.
10. Katz, R. H., Gibson, G. A. and Patterson, D. A., "Disk System Architectures for High Performance Computing," Proceedings of the IEEE, pp. 1842-1858, December 1989.
11. MacDonald, J. E. and Sigworth, K. L., "Storage Hierarchy Optimization Procedure," IBM Journal of Research and Development, Vol. 19, No. 2, March 1975.
12. Mitsubishi Electric, Data Book: Semiconductor Memories/RAM, 1990.
13. Nakagomi, T., Holzbach, M., Van Meter, R. and Ranade, S., "Re-Defining the Storage Hierarchy: An Ultra-Fast Magneto-Optical Disk Drive," Twelfth IEEE Symposium on Mass Storage Systems, April 1993.
14. Przybylski, S. A., Cache and Memory Hierarchy Design: A Performance-Directed Approach, Morgan Kaufmann, San Mateo, CA, 1990.
15. Quinlan, S., "A Cached WORM File System," Software Practice and Experience, pp. 1289-1299, December 1991.
16. Ranade, S., Mass Storage Technologies, Meckler Publishing, Westport, CT, 1991.
17. Redfield, R. and Willenbring, J., "Holostore Technology for Higher Levels of Memory Hierarchy," Eleventh IEEE Symposium on Mass Storage Systems, pp. 155-159, October 1991.
18. Rege, S. L., "Cost, Performance, and Size Tradeoffs for Different Levels in a Memory Hierarchy," Computer, Vol. 9, No. 4, April 1976.
19. Welch, T., "Memory Hierarchy Configuration Analysis," IEEE Transactions on Computers, Vol. C-27, No. 5, May 1978.
20. Wu, M. and Zwaenepoel, W., "eNVy: A Non-Volatile, Main Memory Storage System," March 1994.