Exploiting Redundancy to Conserve Energy in Storage Systems∗

Eduardo Pinheiro (Rutgers University, edpin@cs.rutgers.edu), Ricardo Bianchini (Rutgers University, ricardob@cs.rutgers.edu), Cezary Dubnicki (NEC Labs America, dubnicki@nec-labs.com)

ABSTRACT

This paper makes two main contributions. First, it introduces Diverted Accesses, a technique that leverages the redundancy in storage systems to conserve disk energy. Second, it evaluates the previous (redundancy-oblivious) energy conservation techniques, along with Diverted Accesses, as a function of the amount and type of redundancy in the system. The evaluation is based on novel analytic models of the energy consumed by the techniques. Using these energy models and previous models of reliability, availability, and performance, we can determine the best redundancy configuration for new energy-aware storage systems. To study Diverted Accesses for realistic systems and workloads, we simulate a wide-area storage system under two file-access traces. Our modeling results show that Diverted Accesses is more effective and robust than the redundancy-oblivious techniques. Our simulation results show that our technique can conserve 20-61% of the disk energy consumed by the wide-area storage system.

Categories and Subject Descriptors
D.4 [Operating systems]: Storage management

General Terms
Design, experimentation

Keywords
Energy management, energy modeling, disk energy

1. INTRODUCTION

Large storage systems, such as those of popular Internet services, outsourced storage services, and wide-area storage utilities, consume significant amounts of energy. For example, one report indicates that the storage subsystem can represent 27% of the energy consumed in a data center [16]. Even worse, this fraction tends to increase as storage requirements are rising by 60% annually [17].

∗This research has been supported by NSF under grant #CCR-0238182 (CAREER award).

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. SIGMetrics/Performance'06, June 26–30, 2006, Saint Malo, France. Copyright 2006 ACM 1-59593-320-4/06/0006 ...$5.00.

Because the energy consumption of storage systems is reflected in their electricity bills, several research groups have been seeking to reduce it [2, 3, 9, 14, 15, 19, 27, 28, 29]. However, only a few of these efforts [14, 15, 27] have explicitly leveraged redundancy, and they did so in a limited context. Redundancy is present in all practical storage systems, since it is mostly through redundancy that these systems achieve high reliability, availability, and throughput. Redundancy is typically implemented by replicating the "original data" (as in mirrored disk arrays, cluster-based storage systems [13], or wide-area storage systems [21, 23]) or by storing additional information that can be used to reconstruct the original data in case of disk failures (as in RAID 5 or erasure-code-based wide-area storage systems [8, 10, 12]). We refer to the replicas and the additional information as "redundant data".
Our approach is to leverage this redundancy to conserve disk en- ergy without performance degradation. In particular, we propose a technique called Diverted Accesses that segregates original and re- dundant data on different disks. We refer to these disks as original and redundant disks, respectively. The segregation allows the sys- tem to concentrate the requests on the original disks (under light or moderate demand for disk bandwidth), leaving the redundant disks idle. During the idle periods, the disks can be sent to low-power mode. The redundant disks only need to be activated in three cases: (1) when the demand for bandwidth is high; (2) when one or more disks fail; and (3) periodically to reflect changes made to the origi- nal data. In this last case, the writes to the original disks need to be logged until the corresponding redundant disks are activated. Because the benefits of our technique vary with the amount and type of redundancy built into the system, our evaluation analytically quantifies the effect of redundancy on several system characteris- tics, including disk energy consumption and the potential of differ- ent techniques to conserve energy. More specifically, we develop energy models for Diverted Accesses and previous redundancy- oblivious techniques, and couple them with well-known models of reliability, availability, and throughput. Our modeling results show that Diverted Accesses can provide substantial energy sav- ings across a wide range of redundancy, request rate, and write per- centage parameters. Other techniques are only useful in small parts of this parameter space. Designers can use our models to determine the best redundancy configuration for new storage systems. Our approach is to select the configuration that achieves the required throughput, reliability, and availability but consumes the least amount of energy. Our re- sults show that non-intuitive redundancy configurations are often the best ones when all metrics are considered. Finally, to demonstrate our technique for a system that requires high redundancy, we simulate a wide-area storage system with sta- ble nodes and data replication under two realistic file-access traces. The goal is to mimic a world-wide corporation that owns and oper- ates its dedicated, distributed storage resources. Our results show that Diverted Accesses can reduce disk energy consumption by 20- 61%. These results are close to those predicted by our models. We conclude that considering redundancy can provide significant disk energy savings beyond those of previous techniques. Further- more, we conclude that designing a storage system requires quan- tifying several metrics, which are all affected by the redundancy configuration. Our models are key in this design process. The remainder of this paper is organized as follows. The next section discusses the related work and our contributions. Section 3 describes Diverted Accesses. Section 4 describes the energy mod- els for the conservation techniques we study. Section 5 overviews previously proposed models for throughput, reliability, and avail- ability, and discusses the selection of the best redundancy configu- ration. Section 6 presents our modeling results. Section 7 presents our real-trace results. Section 8 concludes the paper. 2. BACKGROUND AND RELATED WORK 2.1 Redundancy Redundancy is typically implemented in storage systems through replication, parity schemes, or erasure codes. 
These methods can be defined in terms of their redundancy configurations by (n, m) tuples, where each block of data is striped, replicated, or encoded into n fragments, but only m fragments (m ≤ n) are needed to reconstruct the data. For instance, a RAID 1 storage system is rep- resented by (n = 2, m = 1), since there are two copies of each block but only one copy is enough to reconstruct the block. A re- cent wide-area storage system based on erasure codes [10] used (n = 48, m = 5) to resist massive correlated failures. Several papers have studied these redundancy approaches, e.g. [1, 26]. Replication typically requires more bandwidth and stor- age space than the parity schemes. However, parity schemes can only tolerate small numbers of concurrent failures. Erasure codes require less bandwidth and storage than replication (for the same levels of reliability and availability), can tolerate more failures than parity schemes, but involve coding and decoding overheads. Contributions. Our work complements these previous studies as we consider the impact of redundancy configuration on disk energy consumption and conservation. 2.2 Disk Energy Conservation Several techniques have been proposed for disk energy conser- vation in storage systems. Threshold-Based Techniques. The simplest threshold-based tech- nique is Fixed Threshold (FT). In FT, a disk is transitioned to low- power mode after a fixed threshold time has elapsed since the last access. Inspired by competitive policies, the threshold is usually set to the break-even time, i.e. the time a disk would have to be in low-power mode to conserve the same energy consumed by transi- tioning the disk down and back up to active mode. FT is one of the techniques to which we compare Diverted Accesses. Data-Movement Techniques. In this category are those techniques that migrate or copy data across disks. The Massive Array of Idle Disks (MAID) technique [3] uses extra cache disks to cache recently accessed data. On each access miss in the cache disks, the accessed block is copied to one of the cache disks. If all cache disks are full, one of them evicts its LRU block to make space for the incoming block. The goal is to concentrate the accesses on the cache disks, so that the non-cache disks can remain mostly idle and, thus, be transitioned to low-power mode. In contrast to the copy-based approach of MAID, Popular Data Concentration (PDC) [19] migrates data across disks according to frequency of access or popularity. The goal is to lay data out in such a way that popular and unpopular data are stored on different disks. This layout leaves the disks that store unpopular data mostly idle, so that they can be transitioned to low-power mode. MAID and PDC use FT for power management. Redundancy-aware Techniques. Only three works have exploited redundancy to conserve energy in storage systems: EERAID [14], eRAID [15], and RIMAC [27]. (EERAID is actually the only one that pre-dates our technical report on Diverted Accesses [20].) The most closely related work is eRAID, which is similar to Diverted Accesses but only considered RAID 1 storage. EERAID and RI- MAC were targeted at RAID 5 organizations and as such are only capable of conserving 1/N of the energy of an array with N disks. In contrast with these three systems, we are interested in a broader range of storage systems, including those that are based on erasure codes, in which n may not be equal to m + 1 as in RAIDs 1 and 5. EERAID, eRAID, and RIMAC represent only a couple of points in this spectrum. 
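To make the (n, m) notation concrete, the short sketch below (our own illustration, not code from the paper or any of the cited systems) computes two basic properties of a redundancy configuration: its storage overhead n/m and the number of concurrent disk failures it tolerates, n − m.

```python
# Illustrative sketch of the (n, m) redundancy notation: each block is stored
# as n fragments, and any m of them suffice to reconstruct it.

def storage_overhead(n: int, m: int) -> float:
    """Bytes stored per byte of original data."""
    return n / m

def failures_tolerated(n: int, m: int) -> int:
    """Maximum number of concurrent fragment (disk) losses that can be survived."""
    return n - m

configs = {
    "RAID 1 mirroring":             (2, 1),
    "RAID 5 over 5 disks":          (5, 4),
    "erasure-coded wide-area [10]": (48, 5),
}

for name, (n, m) in configs.items():
    print(f"{name:30s} overhead = {storage_overhead(n, m):.1f}x, "
          f"tolerates {failures_tolerated(n, m)} failures")
```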
Other Techniques. Carrera et al. [2] and Gurumurthi et al. [9] proposed disks with more than one speed and showed that they can provide significant energy savings for different server workloads. Zhu et al. [28] exploited intelligent disk speed setting and data migration to conserve energy in arrays comprised of multi-speed disks without degrading response time. Carrera et al. also showed that a combination of laptop and SCSI disks can be even more beneficial in terms of energy, but only for over-provisioned servers. Papathanasiou and Scott [18] propose replacing server-class disks with larger arrays of laptop disks. Zhu et al. [29] proposed storage cache replacement algorithms that selectively keep blocks of data in memory, so that certain disks can stay in low-power mode for longer periods.

Contributions. Our work introduces Diverted Accesses, a novel and effective technique for leveraging redundancy. Further, our work presents energy models for FT, MAID, PDC, and Diverted Accesses that take redundancy configurations into account, and a case study of the application of Diverted Accesses in the context of a realistic wide-area storage system. Finally, ours is the first study of these techniques as a function of redundancy.

2.3 Storage System Design

Anderson et al. [5] proposed Ergastulum, a tool that quickly evaluates the space of possible data layouts and storage system configurations and finds a near-optimal design. Ergastulum uses performance models to determine whether a potential design is acceptable. Minerva [7] is similar but not as efficient as Ergastulum. Hippodrome [6] uses Ergastulum to adjust the design of the storage system as a result of dynamic changes in the workload.

Contributions. We extend the previous work on Ergastulum, Minerva, and Hippodrome by considering the reliability, availability, and energy consumption of different designs. Since we only focus on selecting the redundancy configuration (n, m), our space of possible designs is much more constrained than in these systems. For our purposes, exhaustive search works fine.

3. DIVERTED ACCESSES

Our approach to reducing energy consumption in storage systems leverages their redundancy. The key observation is that the redundant data is only read in two scenarios: (1) during periods of high demand for disk bandwidth, to increase performance; and (2) when disk failures occur, to guarantee reliability and availability.

[Figure 1: I ≥ T + Tu. Accesses arrive at times A, B, and C. Accesses are actually performed at times A, B’, and C’.]

[Figure 2: T ≤ I < T + Tu. Accesses arrive at times A, B, and C. Accesses are actually performed at times A, B’, and C.]

Since the average number of disk accesses per second is (npw + m(1 − pw))/t, the idle time at each disk is simply the inverse of the access rate times the number of disks. The average power for each idle time is then defined by the three cases below:

\[
\begin{cases}
P_h N, & \text{case 1} \\
(T P_h + (I - T - T_t) P_l + E_t)\, N / I, & \text{case 2} \\
((I - T_u + T) P_h + (I - T - T_d) P_l + E_t)\, N / (2I), & \text{case 3}
\end{cases}
\qquad (2)
\]

where case 1 represents the scenario in which I < T, case 2 represents I ≥ T + Tu, and case 3 represents all other cases (i.e., T ≤ I < T + Tu). (To avoid more cases, our modeling extends idle times that end during a transition to low-power mode until the end of the transition.) More intuitively, the top equation represents the scenario in which idle times are too short and do not trigger a transition to low-power mode.
The energy is simply that of all disks on, so the average power will be PhN . The middle equation represents the scenario in which there is enough time for the disks to transition to low- power mode, perhaps spend some time in low-power mode, and transition back. In this case, all idle times except the first are effec- tively reduced by the spin up time. Figure 1 illustrates this scenario. The figure shows accesses arriving at times A, B, and C; actual disk accesses occur at times A, B’, and C’. Note that, after the first idle time, the behavior between B and C will consistently be repeated while this idle time is in effect. Thus, we use the period between B and C in computing the average power. The last equation represents the scenario in which there is not enough time for the full transi- tion to and from low-power mode. In this case, the next idle time is shortened to a point that no power-mode transitions can happen. Figure 2 illustrates this scenario. The behavior between A and C will be consistently repeated while this idle time is in effect. For this reason, we use the entire period (two idle times) in computing the average power. MAID. In this strategy, Dmaid extra disks are used to cache recently- accessed files. Upon each block request, if the block is not yet on one of the cache disks, the corresponding fragments are accessed at the non-cache disks and also copied to one of the cache disks. To avoid high latencies, requests may bypass the cache disks during periods of high load. Thus, the maximum throughput of a MAID system is that of N + Dmaid disks. In terms of energy, this high load scenario leads to extremely short idle times and an average power of (N + Dmaid)Ph. Below, we model MAID under light and moderate loads. Modeling MAID requires knowledge of the temporal locality of accesses to files. As an approximation, we assume that we know the fragment cache miss ratio mmaid of the cache disks. We can then estimate the idle times of the cache disks and the non-cache disks. To conserve energy, we can leverage the cache disks to ac- cumulate the writes to a non-cache disk until a cache disk miss (a read) accesses it. At that point, the accumulated writes can be performed on the non-cache disk. This approach promotes energy conservation at the cost of lower reliability. We need to calculate two distinct idle times: for the MAID caches (Icache) and for the non-cache disks (IN ). Icache = (Dmaidt)/((npw) + (m(1 − pw))) IN = (Nt)/(m(1 − pw)mmaid) (3) The average power consumed by the cache disks can be com- puted as in FT (equations 2), except that we need to replace Icache for I and Dmaid for N . The average power of the non-cache disks can also be computed using those equations, as long as we replace IN for I . The overall average power is the sum of the cache and non-cache powers. Note that our modeling of MAID is simplistic. In the extreme cases in which there are no read accesses (pw = 1) or all read accesses hit the cache disks (mmaid = 0), the idle time of the non-cache disks is modeled as infinite. In other words, we assume the cache disks to have infinite write buffering capacity. Further, we assume that the energy spent in copying data to the cache disks is negligible. Our goal with these simplifications is to provide an upper bound on the energy conservation potential of MAID. PDC. Like FT and MAID, the original PDC proposal [19] did not consider redundancy explicitly. 
However, unlike the other techniques, PDC can hurt reliability significantly if it is applied to all fragments arbitrarily. To avoid this problem, we modify PDC to migrate data in such a way that the n fragments of each block remain on different disks. Over time, the most popular fragments would then be stored on the "first" set of n disks, the second most popular fragments would be stored on the "second" set of n disks, and so on.

In order to model PDC, we need to introduce the notions of disk popularity and file system coverage. Previous research has shown that the popularity distribution of various Web and network workloads follows a power law. In particular, Zipf's power law with coefficient α is commonly used to describe the popularity of files. Zipf's law states that the probability of a file being accessed is proportional to 1/r^α, where r is the rank (popularity) of the file and α is the degree of skewness of popularity. For example, when α = 0, all files are equally likely to be accessed. The larger α is, the more heavy-tailed the distribution is. Based on a similar idea, we use Zipf's power law with coefficient β to describe the popularity of the groups of n disks in steady state, i.e. when all fragments have been migrated to their best locations. File system coverage represents the percentage of blocks that are actually accessed in a given period of time (a day, a week, or the length of a workload).

To compute the idle times, we need to take the disk popularity and the coverage c into account. We do so using an "idleness weight" w for each set i of n disks:

\[
w_i = \frac{Nc}{n}\left(1 - \frac{1/i^{\beta}}{\sum_{j=1}^{\lceil Nc/n \rceil} 1/j^{\beta}}\right)
\qquad (4)
\]

The idleness weight of a group of disks ranges from 0 to Nc/n (the number of groups) and is proportional to the fraction of accesses that is not directed to the group (the term inside parentheses on the right of equation 4). In other words, the most popular group will have a weight that tends to 0, whereas the least popular group will have a weight that approaches Nc/n. These factors can be used to weight the idle time I from equation 1 in computing the average power consumed by the disks for each idle time in PDC:

\[
\sum_{i=1}^{\lceil Nc/n \rceil}
\begin{cases}
w_i P_h n, & c_1 \\
(T P_h + (I w_i - T - T_t) P_l + E_t)\, n / I, & c_2 \\
((I w_i - T_u + T) P_h + (I w_i - T - T_d) P_l + E_t)\, n / (2I), & c_3
\end{cases}
\qquad (5)
\]

where c1 represents the scenario in which Iwi < T, c2 represents Iwi ≥ T + Tu, and c3 represents all other cases. Intuitively, this equation sums up the average power consumed by each group of disks for each idle time, noting that each group may actually fall in a different case.

Our modeling of PDC is optimistic for three reasons. First, our coverage parameter assumes that all write accesses are updates of existing data, rather than writes of new data. Second, we assume the energy consumed in data migration to be negligible. Third, due to the complexity of its data layout, PDC is essentially impractical in the presence of redundancy; it would be very hard to implement in single-node storage systems, and even harder in distributed storage systems. Nevertheless, our goal is to compute an upper bound on the potential benefits of PDC.

DIV. Diverted Accesses stores the original data on D disks and the redundant data on R = N − D disks. The redundant disks can be sent to low-power mode, until they are required for reliability or to provide higher bandwidth under high load. Again, idle times are short under high load, leading to an average power consumption of NPh. Next, we model DIV under light and moderate loads.
First, we compute the idle times on the original disks, ID:

\[
I_D = \frac{D t}{m}
\qquad (6)
\]

Note that all (read and write) requests translate into accesses to the D disks. Writes are also buffered, so the expected idle time on the redundant disks (IR) is the expected time for the write buffer to fill up times R:

\[
I_R = \frac{R t \cdot wbSize}{blockSize \cdot (n - m)\, p_w},
\qquad (7)
\]

such that wbSize ≥ blockSize.

With these idle times, the average power for DIV can be computed as the sum of the power consumed by the original and the redundant disks. These average powers can be computed the same way as in FT (equations 2) with minor differences. For the original disks, the average power is:

\[
\begin{cases}
P_h D, & c_1 \\
(T P_h + (I_D - T - T_t) P_l + E_t)\, D / I_D, & c_2 \\
((I_D - T_u + T) P_h + (I_D - T - T_d) P_l + E_t)\, D / (2 I_D), & c_3
\end{cases}
\qquad (8)
\]

where c1 represents the scenario in which ID < T, c2 represents ID ≥ T + Tu, and c3 represents all other cases. For the redundant disks, the average power is:

\[
\begin{cases}
P_h R, & c_1 \\
(T P_h + (I_R - T - T_t) P_l + E_t)\, R / I_R, & c_2 \\
((I_R - T_u + T) P_h + (I_R - T - T_d) P_l + E_t)\, R / (2 I_R), & c_3
\end{cases}
\qquad (9)
\]

where c1 represents the scenario in which IR < T, c2 represents IR ≥ T + Tu, and c3 represents all other cases.

MAID+DIV. MAID can be combined with DIV. In MAID+DIV, the idea is to place a few cache disks in front of DIV-structured disks. Dmaid extra disks are used to cache recently accessed blocks like in MAID, whereas the D + R = N non-cache disks are organized as in DIV.

Given the extra cache disks, we can accumulate writes aggressively on them, as in MAID. The writes are propagated to the non-cache disks on read misses on the cache disks (original disks) or periodically in a large batch (redundant disks). The resulting idle times, Icache, ID, and IR, for MAID+DIV are slight variations of these times in MAID and DIV. Specifically, Icache is defined exactly as in equation 3; ID is also defined as in equation 3, except that N is replaced by D; and IR is defined as in equation 7, except that wbSize is replaced by batchSize.

The average power consumed by the cache disks in MAID+DIV is computed as in MAID, whereas the average power of the non-cache disks is computed as in DIV. The overall average power is the sum of these components, as the energy used by data copying is assumed negligible.

PDC+DIV. Here, we combine PDC with DIV. The idea is to segregate original and redundant fragments and only migrate the original ones according to popularity. Migration is performed in such a way that the m fragments remain on different disks. Over time, the most popular fragments would then be stored on the "first" set of m original disks, the second most popular fragments would be stored on the "second" set of m original disks, and so on.

Due to the concentration of accesses, the computation of the idle times uses idleness weights w for each group i of m original disks, just as in our modeling of PDC (equation 4) except that N is replaced by D. These weighting factors can be used to weight the idle time ID from equation 6 in computing the average power consumed by the original disks for each idle time in PDC+DIV. This power can be computed as in equation 5, except that n is replaced by m and I is replaced by ID.

The average power consumed by the redundant disks can be computed exactly as in DIV. The overall average power is the sum of these two original and redundant powers. Again, we assume all writes to be updates to existing data and the energy of data migration to be negligible.
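To make the DIV model concrete, the following sketch (our own hedged reconstruction, not the authors' code) computes the idle times of equations 6 and 7 and plugs them into the three-case average-power expression shared by equations 2, 8, and 9. The total transition time Tt is assumed here to be Td + Tu, and the numeric defaults are those of Table 2.

```python
# Hedged sketch of the DIV energy model (equations 6-9) with Table 2 defaults.
P_h, P_l = 10.2, 2.5                       # active / low-power disk power (W)
T, T_u, T_d, E_t = 19.2, 10.9, 1.5, 148.0  # threshold (s), spin up/down (s), transition energy (J)
T_t = T_d + T_u                            # total transition time (assumption)

def avg_power(I, disks):
    """Three-case average power of equations 2/8/9 for one group of disks."""
    if I < T:                                   # case 1: idle time too short to spin down
        return P_h * disks
    if I >= T + T_u:                            # case 2: full transition to low power fits
        return (T * P_h + (I - T - T_t) * P_l + E_t) * disks / I
    # case 3: transition does not fully fit; average over two idle periods
    return ((I - T_u + T) * P_h + (I - T - T_d) * P_l + E_t) * disks / (2 * I)

def div_power(n, m, D, t, p_w, wb_size, block_size):
    """Average DIV power: original disks (eqs. 6, 8) plus redundant disks (eqs. 7, 9)."""
    N = D * n // m                                        # total disks for configuration (n, m)
    R = N - D                                             # redundant disks
    I_D = D * t / m                                       # equation 6
    I_R = R * t * wb_size / (block_size * (n - m) * p_w)  # equation 7
    return avg_power(I_D, D) + avg_power(I_R, R)

# Example: (n, m) = (3, 1), 64 original disks, 64 reqs/s, 33% writes, 4 MB buffer, 8 KB blocks
print(f"{div_power(3, 1, 64, 1 / 64.0, 0.33, 4 * 2**20, 8 * 2**10):.1f} W")
```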
5. DESIGNING REAL SYSTEMS

In this section, we present well-known models for reliability, availability, and performance. Using these models along with our energy models, we can select the best redundancy configuration for new storage systems.

5.1 Reliability and Availability

We define the reliability r(x) of each disk as the probability that the disk does not lose or damage the data it stores (e.g., due to a permanent mechanical drive failure) within time x. To simplify the notation, we refer to r(x) simply as r. We define the availability a of each disk as the probability that the disk is not temporarily inaccessible (e.g., due to a loose cable or a period of offline maintenance) at a given time. We assume disk faults to be independent.

Given these assumptions, the combinatorial models that quantify reliability and availability as a function of the redundancy configuration (n, m) are similar [24]. Equation 10 is the availability model when at least m fragments must be available upon a disk access. A is typically close to 1, so availability is often referred to in terms of the number of nines after the decimal period. The reliability model is the same, except that a is replaced by r.

\[
A = \sum_{i=0}^{n-m} \binom{n}{i} a^{n-i} (1 - a)^{i}
\qquad (10)
\]

Note that, although our independence assumption and simple combinatorial models produce only rough approximations for RAID systems, they are more accurate for distributed storage systems, which are less likely to experience correlated faults.

Diverted Accesses. The reliability and availability definitions above apply to DIV as well, even though it uses NVRAM for temporary storage of recently written redundant data. The reason is that battery-backed RAM can achieve similar reliability to disks in real storage systems, as long as administrators periodically replace batteries. Along the same lines, NVRAM and disks should exhibit similar availabilities, since downtimes are likely to be dominated by the unavailability of their supporting components, such as hosts and cabling. The key observation is that the use of NVRAM in DIV does not reduce the level of redundancy in typical systems, as mentioned in Section 3.

5.2 Performance: Maximum Throughput

Our performance model is the aggregate maximum throughput of the disks in the system, given a fixed configuration. Recall that the number of disks is a function of the redundancy configuration, N = Dn/m.

The maximum throughput is defined by the redundancy configuration and the disk bandwidth for the workload:

\[
\begin{aligned}
fragSize &= blockSize / m \\
delay &= S + R + fragSize / X \\
width_r &= N \times fragSize / delay \\
width_w &= N \times (fragSize / delay) \times m / n \\
P = totalBW &= p_w\, width_w + (1.0 - p_w)\, width_r
\end{aligned}
\qquad (11)
\]

where blockSize is the maximum between the block size exported by the storage system interface and the weighted mean of a distribution of request sizes (each size being a multiple of the block size); fragSize is the size of each fragment; widthr and widthw represent the effective bandwidth for reads and writes, respectively; as previously defined, S, R, and X are the average seek time, the average rotational delay, and the disk transfer rate, respectively; and pw is the probability of writes.

These equations apply to all energy conservation techniques, even though disks that are in low-power mode limit the maximum throughput of the system. However, all techniques activate all disks when the offered load requires it.
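Both models above are straightforward to code. The sketch below is a minimal illustration (ours, not the authors' implementation) of the availability model of equation 10 and the throughput model of equation 11; treating a megabyte as 2^20 bytes is an assumption.

```python
from math import comb

def availability(n, m, a):
    """Equation 10: probability that at least m of the n fragments are accessible.
    The reliability model is identical, with a replaced by r."""
    return sum(comb(n, i) * a**(n - i) * (1 - a)**i for i in range(n - m + 1))

def max_throughput(n, m, D, block_size, S, R, X, p_w):
    """Equation 11: aggregate maximum throughput P (totalBW), in bytes/s."""
    N = D * n / m                               # number of disks for this configuration
    frag_size = block_size / m
    delay = S + R + frag_size / X               # seek + rotation + transfer time (s)
    width_r = N * frag_size / delay             # effective read bandwidth
    width_w = N * (frag_size / delay) * m / n   # effective write bandwidth
    return p_w * width_w + (1.0 - p_w) * width_r

# Defaults of Table 2, with D = 20 as in the example of Section 6.2
a, blk, S, R, X, p_w = 0.99, 8 * 2**10, 0.0034, 0.002, 55 * 2**20, 0.33
print(f"(8, 5): A = {availability(8, 5, a):.7f}, "
      f"P = {max_throughput(8, 5, 20, blk, S, R, X, p_w) / 2**20:.1f} MB/s")
```

With these defaults, the (8, 5) configuration comes out at roughly 8 MB/s and about six nines of availability.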
5.3 Putting It All Together

Determining the best redundancy configuration for a storage system involves meeting its storage capacity, reliability, availability, and throughput requirements for the least amount of energy. More specifically, we need to explore the space of potential configurations (n, m) and energy conservation techniques, trying to minimize energy (or average power), subject to the constraints on storage capacity, reliability, availability, and throughput. Given the relatively small search space of n ∗ m possible configurations, for example in n ∈ [1..16] and m ∈ [1..16], the problem becomes easy to solve by enumeration (and modeling, of course).

Parameter                              Default Value
Request Rate (requestRate)             64 reqs/sec
Write Ratio (pw)                       33%
Disk Popularity (β, PDC)               1.0
File System Coverage (c, PDC)          70%
# MAID Cache Disks (Dmaid)             0.1N
MAID Fragment Miss Ratio (mmaid)       40%
MAID+DIV Batch Size (batchSize)        1 GB
DIV Write Buffer Size (wbSize)         4 MB
Block size (blockSize)                 8 KB
# Disks without Redundancy (D)         64
# Disks with Redundancy (N)            Varies
Disk reliability (r)                   0.999
Disk availability (a)                  0.99
High Power (Ph)                        10.2 W
Low Power (Pl)                         2.5 W
Avg. Seek Time (S)                     3.4 ms
Avg. Rot. Time (R)                     2.0 ms
Transfer Rate (X)                      55.0 MB/sec
Idleness Threshold (T)                 19.2 secs
Spin up Time (Tu)                      10.9 secs
Spin down Time (Td)                    1.5 secs
Energy transition down+up (Et)         148.0 J

Table 2: Configurable parameters and their default values.

6. MODELING RESULTS

In this section, we analyze the tradeoffs between different (n, m) configurations, in terms of energy, reliability, availability, and performance. Because the parameter space has at least 6 dimensions – n, m, energy, reliability, availability, and performance – it is impossible to visualize it all at the same time. Thus, we plot 2-D graphs showing the interesting parts of the space.

We computed the energy results in this section using a synthetic workload generated as follows. We draw 10,000 inter-arrival times from a Pareto distribution with a default average of 64 requests per second and an infinite variance. Requests are for 8-KB blocks. The disk parameters are based on the IBM 36Z15 Ultrastar model. The disk reliability assumes an exponential distribution of faults during one year and an MTTF of 2 million hours. We assume a low disk availability of 0.99 to encompass not only the availability of the disk itself, but also that of its supporting components, such as the power supply, controllers, and cabling. The modeling of DIV assumes the same reliability and availability values for disks and NVRAM. Table 2 summarizes the default parameter values we used. Note that the values for Ph, Pl, Tu, Td, and Et were actually measured, rather than taken from our disk's datasheet.

Although we study a wide range of parameter values, we carefully selected the default values for the workload-related parameters. In particular, the defaults for requestRate and pw lie within the ranges created by our two realistic access traces (see Section 7) for these parameters. Further, we selected the default for Dmaid based on simulations of the cache disk miss rate under our traces; 0.1N cache disks leads to the best tradeoff between miss rate and number of disks for our traces. The default value for mmaid lies in the middle of the range created by our two traces for the default number of cache disks. Our traces do not include information about coverage, so we arbitrarily chose 70% as its default value but quantify the effect of changes in this parameter explicitly. Finally, we do not have definitive information about β either; we set it to 1.0, but have found that this parameter has a negligible impact on the energy gains achieved by PDC and PDC+DIV.
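The synthetic workload just described can be generated in a few lines. The sketch below is our own illustration (the paper does not show its generator): it draws Pareto-distributed inter-arrival times whose shape parameter lies in (1, 2), so the mean matches the target request rate while the variance is infinite; the shape value 1.5 is an assumption.

```python
import random

def pareto_interarrivals(num_requests=10_000, req_rate=64.0, shape=1.5, seed=0):
    """Inter-arrival times (s) from a Pareto distribution with mean 1/req_rate.
    A shape in (1, 2) yields a finite mean but infinite variance."""
    rng = random.Random(seed)
    scale = (1.0 / req_rate) * (shape - 1) / shape   # x_m such that E[X] = 1/req_rate
    return [scale * rng.paretovariate(shape) for _ in range(num_requests)]

gaps = pareto_interarrivals()                        # defaults of Table 2
print(f"empirical request rate: {len(gaps) / sum(gaps):.1f} reqs/s")
```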
6.1 Energy

We now evaluate the effectiveness of the energy conservation techniques, as a function of the redundancy configuration. We start

(n, m)    N     P (MB/s)   A          Tech       E (W)
(8, 5)    32    8.1        0.999999   EO         326
(8, 5)    32    8.1        0.999999   FT         326
(8, 5)    32    9.1        0.999999   MAID       367
(8, 5)    32    8.1        0.999999   PDC        265
(8, 5)    32    8.1        0.999999   DIV        267
(8, 5)    32    9.1        0.999999   MAID+DIV   275
(8, 5)    32    8.1        0.999999   PDC+DIV    228
(3, 1)    60    66.0       0.999999   EO         612
(3, 1)    60    66.0       0.999999   FT         612
(3, 1)    60    72.6       0.999999   MAID       674
(3, 1)    60    66.0       0.999999   PDC        473
(3, 1)    60    66.0       0.999999   DIV        326
(3, 1)    60    72.6       0.999999   MAID+DIV   365
(3, 1)    60    66.0       0.999999   PDC+DIV    280
(8, 1)    160   160.4      1.000000   EO         1632
(8, 1)    160   160.4      1.000000   FT         1633
(8, 1)    160   176.5      1.000000   MAID       1774
(8, 1)    160   160.4      1.000000   PDC        1262
(8, 1)    160   160.4      1.000000   DIV        631
(8, 1)    160   176.5      1.000000   MAID+DIV   717
(8, 1)    160   160.4      1.000000   PDC+DIV    585

Table 3: Sample candidate solutions for simple example.

6.2 Defining a Redundancy Configuration

We illustrate the use of our models in the design of a redundancy configuration with a simple example. Suppose you need to design a system that requires: at least 20 disks to store all the data, at least 5 MB/s of throughput, at least 6 nines of reliability, and at least 5 nines of availability.

We evaluated all combinations of n, m ∈ [1..16] for this example and our default model parameters. Table 3 summarizes some of the candidate combinations. From left to right, the table lists the redundancy configuration, the number of disks, its throughput, its availability, the energy conservation technique, and the average power consumption. We do not list reliabilities, as all configurations that meet the other requirements easily meet the reliability requirement.

In the table, the first group of rows shows the optimal configuration, (8, 5) with PDC+DIV for energy conservation. In this configuration, the two best techniques, DIV and PDC+DIV, consume 18% and 30% less energy than EO, respectively. It is interesting to note that the optimal configuration is somewhat unintuitive; the intuitive ones are either invalid or sub-optimal. For example, the simplest redundant configuration ((2, 1), not shown) uses 40 disks, delivers enough throughput, but provides insufficient availability. The second group of rows shows results under another simple and intuitive mirrored configuration, (3, 1). For this configuration, DIV and PDC+DIV conserve 47% and 54% of the energy consumed by EO, respectively. The last group shows results for yet another intuitive configuration, (8, 1). In this scenario, DIV and PDC+DIV consume 61% and 64% less energy than EO, respectively. Note that the maximum throughput of (8, 1) is substantially higher than that of (8, 5), because the former configuration uses more disks and larger fragments than the latter.

6.3 Summary

From these results, it is clear that DIV (independently or in combination with other techniques) is an effective technique. In most of the parameter space, the DIV energy savings are large and consistent. DIV is particularly effective for high n, low m, and read-mostly workloads. Wide-area storage utility, digital library, and file sharing systems, for example, exhibit these ideal properties.
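The enumeration of Section 5.3, which produced the example above, can be sketched as follows. This is our own illustration: availability follows equation 10, the storage-capacity constraint is omitted for brevity, and power_model stands in for the energy and throughput models of Sections 4 and 5.2; it is an assumed callback, not an interface from the paper.

```python
from math import comb

def availability(n, m, a):
    # Equation 10; the reliability model is the same with a replaced by r.
    return sum(comb(n, i) * a**(n - i) * (1 - a)**i for i in range(n - m + 1))

TECHNIQUES = ("EO", "FT", "MAID", "PDC", "DIV", "MAID+DIV", "PDC+DIV")

def best_configuration(bw_min, rel_nines, avail_nines, power_model,
                       a=0.99, r=0.999, n_max=16):
    """Exhaustive search over (n, m) and techniques, minimizing average power
    subject to the throughput, reliability, and availability constraints."""
    best = None
    for n in range(1, n_max + 1):
        for m in range(1, n + 1):
            if availability(n, m, a) < 1 - 10**-avail_nines:   # availability constraint
                continue
            if availability(n, m, r) < 1 - 10**-rel_nines:     # reliability constraint
                continue
            for tech in TECHNIQUES:
                power, bw = power_model(n, m, tech)             # assumed: returns (W, MB/s)
                if bw >= bw_min and (best is None or power < best[0]):
                    best = (power, (n, m), tech)
    return best
```

For the requirements of the example above (at least 5 MB/s, 6 nines of reliability, 5 nines of availability), this kind of search selects (8, 5) with PDC+DIV, as reported in Section 6.2.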
DIV is the very reason why MAID+DIV and PDC+DIV behave well; MAID and PDC independently are neither robust nor energy- efficient in most cases. MAID+DIV behaves better than DIV in part of the space, mostly due to our highly favorable modeling of write accesses in MAID+DIV (and MAID). However, in other parts, the cache disks contribute little besides energy overhead; in those sce- narios, MAID+DIV consumes more energy than EO. PDC+DIV conserves more energy than DIV when redundancy is limited. However, we also modeled PDC+DIV (and PDC) un- der favorable assumptions: perfect popularity categorization and no migration costs. Furthermore, PDC+DIV has one major draw- back: it is very complex to implement in practice, especially in the context of a distributed storage system. In fact, determining the best data layout for energy and bandwidth is clearly NP-hard. Based on these observations, we argue that DIV is the only effec- tive, robust, and practical redundancy-aware energy conservation technique. As we mentioned before, the other redundancy-aware techniques, EERAID, eRAID, and RIMAC, provide more limited savings than DIV as they only apply to RAID systems. It is also clear from our results that the task of a storage system designer is not simple. Choosing the right redundancy configura- tion requires making informed decisions based on all the system requirements. Our simple example showed that non-intuitive re- dundancy configurations may actually lead to the best results. 7. CASE STUDY: WIDE-AREA STORAGE We now evaluate DIV in the context of a realistic system under both real and synthetic workloads. In particular, we study a sim- ple wide-area storage system with stable nodes and data replication using simulation. The idea is to mimic a storage system owned and operated by a single world-wide institution, enterprise, or data utility on dedicated machines. We simulate a storage system comprised by geographically dis- tributed nodes. Each file stored in the system is broken down into 8-KB blocks, each of them replicated at k randomly selected nodes. A block-read request is routed directly to a randomly chosen replica. Block writes are routed to all replicas. To model this simple stor- age system, we describe it using n = k, m = 1, since k copies are always available and only one is required to retrieve the orig- inal data. Even though we could consider power-managing entire storage nodes, we continue focusing on disks (one disk per node, for simplicity). The simulator is trace-driven and selects a random node to re- ceive each client request in the trace. The request is then routed to the destination node and the reply is routed back. The network latency is assumed fixed at 50 ms. (We do not experiment with vari- able network latency to avoid adding sources of noise to the energy computation, which would make it hard to isolate where benefits come from.) We simulate DIV, FT, and EO. The DIV simulation keeps our technique active at all times (a real system would include a separate mechanism to turn DIV on/off). We simulate the same IBM disks we have been studying. 7.1 Comparing Modeling and Simulation In this section, we compare the results of our most important energy models (FT and DIV) against those of our simulator using synthetic workloads. Although technically not a “validation” of the models, this comparison is intended to build confidence in our main modeling results, as the simulator eliminates several of the modeling assumptions. 
Note however that our modeling of MAID, PDC, and their combinations with DIV is an upper bound on their energy conservation potential, so we do not consider them here.

Our synthetic workload generator takes the request rate and percentage of writes as input and produces a trace with 10,000 request arrivals drawn from a Pareto distribution with infinite variance. Each request is directed to a different disk (in round-robin fashion) and accesses a block of 8 KB. Note that the generation of our synthetic traces differs markedly from our modeling approach. In particular, each request arrival corresponds to a single disk access in our synthetic traces.

k    N    ReqRate (reqs/s)   pw     Buffer (MB)   Savings Sim (%)   Savings Model (%)   Error (%)
3    6    10                 0.50   0             0.0               0.0                 0.0
3    6    10                 0.50   8             45.7              46.1                0.6
3    6    10                 0.50   ∞             49.3              50.3                2.0
3    15   100                0.75   0             0.0               0.0                 0.0
3    15   100                0.75   8             24.5              24.8                -0.3
3    15   100                0.75   ∞             49.6              50.1                1.0
5    20   100                0.75   0             0.0               0.0                 0.0
5    20   100                0.75   8             21.7              22.2                0.8
5    20   100                0.75   ∞             59.8              60.1                0.8
3    3    0.1                0.0    0             13.0              12.8                -0.3
3    15   0.1                0.0    0             57.8              58.1                0.7
3    3    10                 0.0    0             0.0               0.0                 0.0
3    15   10                 0.0    0             0.0               0.0                 0.0

Table 4: Sample DIV (top) and FT (bottom) comparison results.

We executed a large number of simulations varying six system and workload parameters: pw ∈ [0, 0.25, 0.5, 0.75, 1], k ∈ [3, 5], req rate ∈ [0.01, 0.1, 10, 100, 1000], energy conservation technique ∈ [FT, DIV], N ∈ [3, 6, 9, 10, 15, 20], and wbSize ∈ [0, 1, 8, ∞]. Parameter combinations that do not make sense (e.g., k > N) or require more bandwidth than that of N disks were discarded. We then compared the energy results of the remaining 795 simulations with the corresponding modeling results.

Table 4 shows a fraction of our validation results. The last column shows the percentage difference between the energy consumption predicted by model and simulator. Our modeling results match the simulation results closely; the average error is 1.3%, the standard deviation is 2.8%, and the maximum error is 18%. If requests are directed to disks randomly (rather than in round-robin fashion), these values become 3.4%, 8.1%, and 28%, respectively.

The simulation and modeling trends match very closely. Again, DIV is most effective for high redundancy, read-mostly workloads, and larger write buffers. Also, FT is again only effective for very low request rates. These results build confidence in our parameter space study (Section 6).

7.2 Real Workload Results

We also wanted to simulate our system for real traces. Unfortunately, the real file-system traces available publicly are not appropriate to evaluate wide-area storage systems such as the one we simulate; these traces are typically for local-area systems, which are amenable to small data and meta-data accesses. To approximate the characteristics of the accesses to wide-area systems, we used two proxy traces from AT&T and the IRCache project. Proxy traces log the file accesses of a large set of clients to a large content base, i.e., the same characteristics as our wide-area system. The AT&T trace was collected between 01/16/99 and 01/22/99, whereas the IRCache trace was collected at three locations from 09/29/04 to 10/05/04. To mimic a system in which files are stored on disk and later retrieved, we pre-processed the traces to transform all file accesses into file read operations. Since the traces do not include information about file creation, we also introduced a write access for each unique file at a random time before the first access to it.
After pre-processing, the AT&T trace exhibits 21,150,244 block requests, a 34% write percentage, an average request rate of 35 reqs/s, and a peak rate of 2266 reqs/s. The IRCache trace includes 42,976,431 block requests, a 34% write percentage, an average request rate of 71 reqs/s, and a peak rate of 10635 reqs/s.

Buffer Size (MB)   Energy (MJ)   Spin Downs   Idle Time (s)   Reqs Delayed (%)   Energy Savings (%)
0                  120.3         5068         0.3             0.0                2.5
1                  99.0          168540       27.9            0.8                19.7
2                  78.1          110760       60.2            0.5                36.7
8                  55.9          29120        265.9           0.2                54.7
32                 50.3          8000         1088.5          0.1                59.3
∞                  48.4          1168         34551.4         0.0                60.8
0                  499.7         2134         0.9             0.0                0.0
1                  307.6         220746       89.9            0.5                38.4
2                  278.2         111846       186.7           0.3                44.3
8                  255.8         28254        766.8           0.1                48.8
32                 250.1         7194         3070.7          0.0                49.9
∞                  248.3         390          80632.6         0.0                50.3

Table 5: DIV results: AT&T (top) and IRCache (bottom) traces.

Before simulating, we need to define the ideal redundancy configuration for these workloads. First, we set the maximum throughput requirements as their peak request rates. Second, we set the target availability for the system at two different levels: 6 nines (IRCache) and 9 nines (AT&T). Third, we set the target reliabilities two nines higher than the target availabilities.

Assuming these constraints and parameters, our optimization procedure finds N = 20 disks and n = k = 5 (AT&T) and N = 81 disks and n = k = 3 (IRCache) as the best configurations. We assess the DIV behavior on these realistic traces by simulating the system with this configuration and different write buffer sizes. We list these results in Table 5. From left to right, the table lists the write buffer sizes, the amount of energy consumed during the trace, the number of disk spin downs, the average idle times, the percentage of requests that were delayed (due to contention or disk spin ups), and the DIV energy savings with respect to EO. The table shows that DIV conserves between 20% and 61% of the energy, depending on the size of the write buffers. Small buffers (e.g., 8 MB) can achieve most of the energy savings, but when we cripple DIV by eliminating its write buffers, it conserves little if any energy. Because the periods of high load are short, simulating DIV all the time only leads to serious performance degradation when writes are delayed by disk spin ups. However, as the table also shows, these delays are extremely infrequent. Although we do not include this information in the tables, our DIV model matches the simulations closely; the average errors are 3.4% (AT&T) and 1.8% (IRCache), whereas the maximum errors are 10% (AT&T) and 8% (IRCache).

8. CONCLUSIONS

In this paper, we introduced Diverted Accesses, a novel and effective energy conservation technique designed to leverage the redundancy in storage systems. We also introduced models that predict the disk energy consumption of Diverted Accesses and the previously proposed techniques, as a function of the system's redundancy configuration. Our evaluation coupled a wide parameter space exploration with simulations of a real storage system under two realistic workloads. This study was the first to consider the previous techniques in the presence and as a function of redundancy. Our modeling and simulation results showed that Diverted Accesses is very effective and robust throughout most of the parameter space; other techniques are either not robust or impractical.
Furthermore, we found non-intuitive redundancy configura- tions to be ideal in a simple example, showing that designing a storage system requires quantifying and trading off several metrics; our energy models are key in this design process. For our realistic system and workloads, and ideal configuration, Diverted Accesses was able to conserve 20-61% of the disk energy consumed by an energy-oblivious system. We conclude that Diverted Accesses should be extremely useful for large-scale storage systems, such as outsourced storage services or wide-area storage utility, digital library, and file sharing systems. In fact, we believe that our technique would be even more beneficial (in absolute energy consumption terms) if applied to entire nodes rather than just their disks. Acknowledgements We would like to thank our shepherd Arif Merchant, as well as Enrique V. Carrera, Uli Kremer, Athanasios Papathanasiou, Anand Sivasubramaniam, and Yuanyuan Zhou for comments that helped us significantly improve the paper. We would also like to thank Prof. Arthur Goldberg from New York University for giving us the AT&T trace. 9. REFERENCES [1] R. Bhagwan, D. Moore, S. Savage, and G. M. Voelker. Replication Strategies for Highly Available Peer-to-Peer Storage. In Proceedings of International Workshop on Future Directions in Distributed Computing, May 2002. [2] E. V. Carrera, E. Pinheiro, and R. Bianchini. Conserving Disk Energy in Network Servers. In Proceedings of the 17th International Conference on Supercomputing, June 2003. [3] D. Colarelli and D. Grunwald. Massive Arrays of Idle Disks For Storage Archives. In Proceedings of the 15th High Performance Networking and Computing Conference, November 2002. [4] Data Domain. Data Domain DD400 Enterprise Series. http://www.datadomain.com/, 2005. [5] E. Anderson et al. Ergastulum: Quickly Finding Near-Optimal Storage System Designs. Technical Report HPL-SSP-2001-05, HP Laboratories SSP, June 2002. [6] E. Anderson et al. Hippodrome: Running Circles Around Storage Administration. In Proceedings of the International Conference on File and Storage Technology, pages 175–188, January 2002. [7] G. A. Alvarez et al. Minerva: An Automated Resource Provisioning Tool for Large-Scale Storage Systems. ACM Transactions on Computer Systems, 19(4):483–518, November 2001. [8] A. Goldberg and P. N. Yianilos. Towards an Archival Intermemory. In Proceedings of IEEE Advances in Digital Libraries, ADL 98, 1998. [9] S. Gurumurthi, A. Sivasubramaniam, M. Kandemir, and H. Franke. DRPM: Dynamic Speed Control for Power Management in Server Class Disks. In Proceedings of the International Symposium on Computer Architecture, June 2003. [10] A. Haeberlen, A. Mislove, and P. Druschel. Glacier: Highly Durable, Decentralized Storage Despite Massive Correlated Failures. In Proceedings of the 2nd Symposium on Networked Systems Design and Implementation, May 2005. [11] Y. Hu and Q. Yang. DCD – Disk Caching Disk: A New Approach for Boosting I/O Performance. In Proceedings of the 23rd International Symposium on Computer Architecture, June 1995. [12] J. Kubiatowicz et al. OceanStore: An Architecture for Global-scale Persistent Storage. In Proceedings of the 9th International Conference on Architectural Support for Programming Languages and Operating Systems, November 2000. [13] E. K. Lee and C. A. Thekkath. Petal: Distributed Virtual Disks. In Proceedings of the 7th International Conference on Architectural Support for Programming Languages and Operating Systems, 1996. [14] D. Li and J. Wang. 
EERAID: Energy-Efficient Redundant and Inexpensive Disk Array. In Proceedings of the 11th ACM SIGOPS European Workshop, Sept 2004. [15] D. Li and J. Wang. Conserving Energy in RAID Systems with Conventional Disks. In Proceedings of the International Workshop on Storage Network Architecture and Parallel I/Os, Sept 2005. [16] Maximum Throughput, Inc. Power, Heat, and Sledgehammer, April 2002. [17] Fred Moore. More Power Needed, November 2002. Energy User News. [18] A. Papathanasiou and M. Scott. Power-efficient Server-class Performance from Arrays of Laptop Disks. Technical Report 837, Department of Computer Science, University of Rochester, May 2004. [19] E. Pinheiro and R. Bianchini. Energy Conservation Techniques for Disk Array-Based Servers. In Proceedings of the 18th International Conference on Supercomputing (ICS’04), June 2004. [20] E. Pinheiro, R. Bianchini, and C. Dubnicki. Exploiting Redundancy to Conserve Energy in Storage Systems. Technical Report DCS-TR-570, Rutgers University, March 2005. [21] A. Rowstron and P. Druschel. Storage Management and Caching in PAST, a Large-Scale, Persistent Peer-to-Peer Storage Utility. In Proceedings of the International Symposium on Operating Systems Principles, 2001. [22] S. Gurumurthi et al. Interplay of Energy and Performance for Disk Arrays Running Transaction Processing Workloads. In Proceedings of the International Symposium on Performance Analysis of Systems and Software, March 2003. [23] Y. Saito, C. Karamonolis, M. Karlsson, and M. Mahalingam. Taming Aggressive Replication in the Pangaea Wide-Area File System. In Proceedings of the 5th Symposium on Operating Systems Design and Implementation, Dec 2002. [24] D. Siewiorek and R. Swarz. Reliable Computer Systems Design and Evaluation. A K Peters, third edition, 1998. [25] Sun Microsystems. Sun StorEdge 3320. http://www.sun.com/storage/, 2005. [26] H. Weatherspoon and J. Kubiatowicz. Erasure Coding vs. Replication: A Quantitative Comparison. In Proceedings of the 1st International Workshop on Peer-to-Peer Systems, March 2002. [27] X. Yao and J. Wang. RIMAC: A Redundancy-based, Hierarchical I/O Architecture for Energy-Efficient Storage Systems. In Proceedings of the 1st ACM EuroSys Conference, Apr 2006. [28] Q. Zhu, Z. Chen, L. Tan, Y. Zhou, K. Keeton, and J. Wilkes. Hibernator: Helping Disk Arrays Sleep Through the Winter. In Proceedings of the 20th ACM Symposium on Operating Systems Principles, Oct 2005. [29] Q. Zhu and Y. Zhou. Power-Aware Storage Cache Management. IEEE Transactions on Computers, 54(5), 2005.