
INSIDER: Designing In-Storage Computing System for Emerging High-Performance Drive

Zhenyuan Ruan∗ Tong He Jason Cong
University of California, Los Angeles
∗ Corresponding author.

Abstract

We present INSIDER, a full-stack redesigned storage system to help users fully utilize the performance of emerging storage drives with moderate programming efforts. On the hardware side, INSIDER introduces an FPGA-based reconfigurable drive controller as the in-storage computing (ISC) unit; it is able to saturate the high drive performance while retaining enough programmability. On the software side, INSIDER integrates with the existing system stack and provides effective abstractions. For the host programmer, we introduce virtual file abstraction to abstract ISC as file operations; this hides the existence of the drive processing unit and minimizes the host code modification to leverage the drive computing capability. By separating out the drive processing unit to the data plane, we expose a clear drive-side interface so that drive programmers can focus on describing the computation logic; the details of data movement between different system components are hidden. With the software/hardware co-design, INSIDER runtime provides crucial system support. It not only transparently enforces the isolation and scheduling among offloaded programs, but it also protects the drive data from being accessed by unwarranted programs.

We build an INSIDER drive prototype and implement its corresponding software stack. The evaluation shows that INSIDER achieves an average 12X performance improvement and 31X accelerator cost efficiency when compared to the existing ARM-based ISC system. Additionally, it requires much less effort when implementing applications. INSIDER is open-sourced [5], and we have adapted it to the AWS F1 instance for public access.

1 Introduction

In the era of big data, computer systems are experiencing an unprecedented scale of data volume. Large corporations like Facebook have stored over 300 PB of data at their warehouse, with an incoming daily data rate of 600 TB [62] in 2014. A recent warehouse-scale profiling [42] shows that data analytics has become a major workload in the datacenter. Operating on such a data scale is a huge challenge for system designers. Thus, designing an efficient system for massive data analytics has increasingly become a topic of major importance [23, 27].

The drive I/O speed plays an important role in the overall data processing efficiency—even for the in-memory computing framework [68]. Meanwhile, for decades the improvement of storage technology has been continuously pushing forward the drive speed. The two-level hierarchy (i.e., channel and bank) of the modern storage drive provides a scalable way to increase the drive bandwidth [41]. Recently, we witnessed great progress in emerging byte-addressable non-volatile memory technologies which have the potential to achieve near-memory performance. However, along with the advancements in storage technologies, the system bottleneck is shifting from the storage drive to the host/drive interconnection [34] and host I/O stacks [31, 32]. The advent of such a "data movement wall" prevents the high performance of the emerging storage from being delivered to end users—which puts forward a new challenge to system designers.
Rather than moving data from drive to host, one natural idea is to move computation from host to drive, thereby avoid- ing the aforementioned bottlenecks. Guided by this, existing work tries to leverage drive-embedded ARM cores [33,57,63] or ASIC [38, 40, 47] for task offloading. However, these ap- proaches face several system challenges which make them less usable: 1) Limited performance or flexibility. Drive- embedded cores are originally designed to execute the drive firmware; they are generally too weak for in-storage comput- ing (ISC). ASIC, brings high performance due to hardware customization; however, it only targets the specific workload. Thus, it is not flexible enough for general ISC. 2) High pro- gramming efforts. First, on the host side, existing systems develop their own customized API for ISC, which is not com- patible with an existing system interface like POSIX. This requires considerable host code modification to leverage the drive ISC capability. Second, on the drive side, in order to access the drive file data, the offloaded drive program has to understand the in-drive file system metadata. Even worse, the developer has to explicitly maintain the metadata consistency between host and drive. This approach requires a significant programming effort and is not portable across different file systems. 3) Lack of crucial system support. In practice, the drive is shared among multiple processes. Unfortunately, ex- isting work assumes a monopolized scenario; the isolation and resource scheduling between different ISC tasks are not explored. Additionally, data protection is an important con- cern; without it, offloaded programs can issue arbitrary R/W requests to operate on unwarranted data. To overcome these problems, we present INSIDER, a full- stack redesigned storage system which achieves the following design goals. Saturate high drive rate. INSIDER introduces the FPGA- based reconfigurable controller as the ISC unit which is able to process the drive data at the line speed while retaining pro- grammability (§3.1). The data reduction or the amplification pattern from the legacy code are extracted into a drive program which could be dynamically loaded into the drive controller on demand (§3.2.2). To increase the end-to-end throughput, IN- SIDER transparently constructs a system-level pipeline which overlaps drive access time, drive computing time, bus data transferring time and host computing time (§3.5). Provide effective abstractions. INSIDER aims to provide effective abstractions to lower the barrier for users to leverage the benefits of ISC. On the host side, we provide virtual file abstraction which abstracts ISC as file operations to hide the existence of the underlying ISC unit (§3.3). On the drive side, we provide a compute-only abstraction for the offloaded task so that drive programmers can focus on describing the computation logic; the details of underlying data movement between different system components are hidden (§3.4). Provide necessary system support. INSIDER separates the control and data planes (§3.2.1). The control plane is trusted and not user-programmable. It takes the responsibili- ties of issuing drive access requests. By performing the safety check in the control plane, we protect the data from being accessed by unwarranted drive programs. The ISC unit, which sits on the data plane, only intercepts and processes the data between the drive DMA unit and storage chips. 
This compute- only interface provides an isolated environment for drive pro- grams whose execution will not harm other system compo- nents in the control plane. The execution of different drive programs is hardware-isolated into different portions of FPGA resources. INSIDER provides an adaptive drive bandwidth scheduler which monitors the data processing rates of differ- ent programs and provides this feedback to the control plane to adjust the issuing rates of drive requests accordingly (§3.6). High cost efficiency. We define cost efficiency as the ef- fective data processing rate per dollar. INSIDER introduces a new hardware component into the drive. Thus, it is critical to validate the motivation by showing that INSIDER can achieve not only better performance, but also better cost efficiency when compared to the existing work. We build an INSIDER drive prototype (§4.1), and imple- ment its corresponding software stack, including compiler, host-side runtime library and Linux kernel drivers (§4.2). We could mount the PCIe-based INSIDER drive as a normal stor- age device in Linux and install any file system upon it. We use a set of widely used workloads in the end-to-end sys- tem evaluation. The experiment results can be highlighted as follows: 1) INSIDER greatly alleviates the system inter- connection bottleneck. It achieves 7X∼11X performance compared with the host-only traditional system (§5.2.1). In most cases, it achieves the optimal performance (§5.2.2). 2) INSIDER achieves 1X∼58X (12X on average) performance and 2X∼150X (31X on average) cost efficiency compared to the ARM-based ISC system (§5.5). 3) INSIDER only re- quires moderate programming efforts to implement applica- tions (§5.2.3). 4) INSIDER simultaneously supports multiple offloaded tasks, and it can enforce resource scheduling adap- tively and transparently (§5.3). 2 Background and Related Work 2.1 Emerging Storage Devices: Opportunities and Challenges Traditionally, drives are regarded as a slow device for the sec- ondary persistent storage, which has the significantly higher access latency (in ms scale) and lower bandwidth (in hundreds of MB per second) compared to DRAM. Based on this, the classical architecture for storage data processing presented in Fig. 3a has met users’ performance requirements for decades. The underlying assumptions of this architecture are: 1) The interconnection performance is higher than the drive perfor- mance. 2) The execution speeds of host-side I/O stacks, includ- ing the block device driver, I/O scheduler, generic block layer and file system, are much faster than the drive access. While these were true in the era of the hard-disk drive, the landscape has totally changed in recent years. The bandwidth and la- tency of storage drives have improved significantly within the past decade (see Fig. 1 and Fig. 2). However, meanwhile, the evolution of the interconnection bus remains stagnant: there have been only two updates between 2007 and 2017.1 For the state-of-the-art platform, PCIe Gen3 is adopted as the interconnection [66], which is at 1 GB/s bidirectional transmission speed per link. Due to the storage density2 and due to cost constraints, the four-lane link is most commonly used (e.g., commercial drive products from Intel [7] and Sam- sung [14]), which implies the 4 GB/s duplex interconnec- tion bandwidth. However, this could be easily transcended by the internal bandwidth of the modern drive [24, 33, 34]. 
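To make the bandwidth mismatch concrete, a back-of-the-envelope comparison can be written down. The per-channel rate used here is an assumption chosen to be consistent with the sixteen-channel, 6.4 GB/s example in the next paragraph; it is not stated explicitly in the paper:

$$ BW_{ext} \approx 4~\text{lanes} \times 1~\text{GB/s per lane} = 4~\text{GB/s}, \qquad BW_{int} \approx 16~\text{channels} \times 0.4~\text{GB/s per channel} = 6.4~\text{GB/s} > BW_{ext} $$

Any further growth in channel count or per-channel rate widens this gap, which is exactly the "data movement wall" described in §1.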
Their internal storage units are composed of multiple chan- nels, and each channel equips multiple banks. Different from the serial external interconnection, this two-level architec- ture is able to provide scalable internal drive bandwidth—a sixteen-channel, single-bank SSD (which is fairly common now) can easily reach 6.4 GB/s bandwidth [46]. The grow- ing mismatch between the internal and external bandwidth prevents us from fully utilizing the drive performance. The mismatch gets worse with the advent of 3D-stacked NVM- based storage which can deliver comparable bandwidth with DRAM [35, 54]. On the other hand, the end of Dennard scal- ing slows down the performance improvement of CPU, mak- ing it unable to catch the ever-increasing drive speed. The long-established block layer is now reported to be a major 1Although the specification of PCIe Gen 4 was finalized at the end of 2017, there is usually a two-year waiting period for the corresponding motherboard to be available in the market. Currently there is no motherboard supporting PCIe 4.0, and we do not include it in the figure. 2CPU has limited PCIe slots (e.g., 40 lanes for an Xeon CPU) exposed due to the pin constraint. Using more lanes per drive leads to low storage density. In practice, a data center node equips 10 or even more storage drives. it prevents the drive data from being accessed by unwarranted drive programs. In addition, the compute-only abstraction brings an isolated environment for the accelerator cluster; its execution will not harm the execution of other system compo- nents in the control plane. The execution of different offloaded tasks in the accelerator cluster is further hardware-isolated into different portions of FPGA resources. 3.2.2 Accelerator Cluster As shown in the rightmost portion of Fig. 4, the accelerator cluster is divided into two layers. The inner layer is a pro- grammable region which consists of multiple application slots. Each slot can accommodate a user-defined application accel- erator. Different than the multi-threading in CPU, which is time multiplexing, different slots occupy different portions of hardware resources simultaneously, thus sharing FPGA in spa- tial multiplexing. By leveraging partial reconfiguration [44], host users can dynamically load a new accelerator to the spec- ified slot. The number of slots and slot sizes are chosen by the administrator to meet the application requirements, i.e., number of applications executing simultaneously and the re- source consumption of applications. The outer layer is the hardware runtime which is responsible for performing flow control (§3.5) and dispatching data to the corresponding slots (§3.6). The outer layer is set to be user-unprogrammable to avoid safety issues. 3.3 The Host-Side Programming Model In this section we introduce virtual file abstraction which is the host-side programming model of INSIDER. A virtual file is fictitious, but pretends to be a real file from the perspective of the host programmer—it can be accessed via a subset of the POSIX-like file I/O APIs shown in Table 2. The access to virtual file will transparently trigger the underlying system data movement and the corresponding ISC, creating an illu- sion that this file does really exist. By exposing the familiar file interface, the effort of rewriting the traditional code into the INSIDER host code is negligible. 
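To see how small that rewriting effort is in practice, the sketch below contrasts a plain POSIX scan loop with its INSIDER counterpart. It is a hypothetical example assembled only from the APIs in Table 2; the header name "insider.h", the accelerator ID "my_filter" and the process() consumer are illustrative assumptions, not code taken from the paper.

// Hypothetical sketch: "insider.h" stands in for whatever header exposes the
// Table 2 APIs (reg_virt_file, vopen, vread, vclose).
#include <fcntl.h>
#include <unistd.h>
#include <string>
#include "insider.h"

void process(const char *buf, ssize_t n);   // application-defined consumer

// Traditional host-only scan: read the real file and do all filtering on the CPU.
void scan_host_only(const char *real_path, char *buf, size_t buf_size) {
  int fd = open(real_path, O_RDONLY);
  for (ssize_t n; (n = read(fd, buf, buf_size)) > 0; )
    process(buf, n);
  close(fd);
}

// INSIDER version: the real file is registered as a virtual file bound to an
// in-drive accelerator ("my_filter" is a made-up ID), and the v-prefixed calls
// are used instead; only the accelerator's reduced output crosses the bus.
void scan_insider(const std::string &real_path, char *buf, size_t buf_size) {
  std::string virt = reg_virt_file(real_path, "my_filter");
  int fd = vopen(virt.c_str(), O_RDONLY);
  for (ssize_t n; (n = vread(fd, buf, buf_size)) > 0; )
    process(buf, n);
  vclose(fd);
}

The only changes are the one-time reg_virt_file registration and the v-prefixed calls; the loop structure, buffers and downstream processing are untouched, which is the sense in which the porting effort is negligible.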
We would like to point out that INSIDER neither imple- ments the full set of POSIX IO operations nor provides the full POSIX semantics, e.g., crash consistency. The argument here is similar to the GFS [37] and Chubby [29] papers: files provide a familiar interface for host programmers, and ex- posing a file-based interface for ISC can greatly alleviate the programming overheads. Being fully POSIX-compliant is not only expensive but also unnecessary in most use cases. 3.3.1 Virtual File Read Listing 1 shows a snippet of the host code that performs virtual file read. We will introduce the design of virtual file read based on the code order. Fig. 5 shows the corresponding diagram. System startup During the system startup stage, IN- SIDER creates a hidden mapping file .USERNAME.insider in the host file system for every user. The file is used to store the virtual file mappings (which will be discussed soon). For security concerns, INSIDER sets the owner of the mapping file to the corresponding user and sets the file permission to 0640. // register a virtual file string virt = reg_virt_file(real_path ,acc_id); // open the virtual file int fd = vopen(virt.c_str(),O_RDONLY); if (fd != -1) { // send drive program parameters (if there are any) send_params(fd, param_buf , param_buf_len); int rd_bytes = 0; // read virtual file while (rd_bytes = vread(fd, buf, buf_size)) { // user processes the read data process(buf, rd_bytes); } // close virtual file, release resources vclose(fd); } Listing 1: Host-side code of performing virtual file read. 1). int vopen(const char *path, int flags) 2). ssize_t vread(int fd, void *buf, size_t count) 3). ssize_t vwrite(int fd, void *buf, size_t count) 4). int vsync(int fd) 5). int vclose(int fd) 6). int vclose(int fd, size_t *rfile_written_bytes) 7). string reg_virt_file(string file_path, string acc_id) 8). string reg_virt_file(tuple<string, uint, uint> file_sg_list, string acc_id) 9). bool send_params(int fd, void *buf, size_t count) Table 2: INSIDER host-side APIs. vwrite, vsync will be discussed in §3.3.2 while others will be discussed in §3.3.1. Registration. The host program determines the file data to be read by the in-drive accelerator by invoking reg_virt_file (method 7 in Table 2); it takes the path of a real file plus an application accelerator ID, and then maps them into a virtual file. Alternatively, reg_virt_file (method 8) accepts a vector of <file name, offset, length> tuples to support the gather- read pattern.3 This allows us to create the virtual file based on discrete data from multiple real files. During the registra- tion phase, the mapping information will be recorded into the corresponding mapping file, and the specified accelerator will be programmed into an available slot of the in-drive re- configurable controller. INSIDER currently enforces a simple scheduling policy: it blocks when all current slots are busy. File open. After registration, the virtual file can be opened via vopen. The INSIDER runtime will first read the mapping file to know the positions of the mapped real file(s). Next, the runtime issues the query to the host file system to retrieve the accessing permission(s) and the ownership(s) of the real file(s). Then, the runtime performs the file-level permission check to find out whether the vopen caller has the correct ac- cess permission(s); in INSIDER, we regard the host file system and INSIDER runtime as trusted components, while the user programs are treated as non-trusted components. 
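The check itself is not shown in the paper; the fragment below is a minimal sketch of what the trusted runtime could do for a vopen(..., O_RDONLY) caller, assuming a plain stat()-based owner/group/other test is sufficient (a production implementation would also consult supplementary groups and ACLs, e.g., via access(2)).

#include <sys/stat.h>
#include <unistd.h>

// Sketch of the file-level permission check performed by the trusted INSIDER
// runtime before any extent information is handed to the drive.
static bool may_read_real_file(const char *real_path) {
  struct stat st;
  if (stat(real_path, &st) != 0)
    return false;                      // the mapped real file must exist
  if (st.st_uid == getuid())
    return st.st_mode & S_IRUSR;       // caller owns the file
  if (st.st_gid == getgid())
    return st.st_mode & S_IRGRP;       // caller is in the file's group
  return st.st_mode & S_IROTH;         // everybody else
}

Only if this check passes does the runtime go on to send the slot index and extent list to the drive, which is what keeps unwarranted programs away from drive data.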
If it is an unauthorized access, vopen will return an invalid file descriptor. Otherwise, the correct descriptor will be returned, and the corresponding accelerator slot index (used in §3.6) will be sent to the INSIDER drive. After that, the INSIDER runtime asks the INSIDER kernel module to set the append-only attribute (if it is not already set by users before) on the mapped real file(s); this is used to guarantee that the current blocks of the real file(s) will not be released or replaced during the virtual file read.4 Later on, INSIDER retrieves the locations of real file extents via the filefrag tool and transfers them to the drive. Finally, the host program sends runtime parameters of the accelerator program (discussed in §3.4), if there are any, via send_params to the drive.

3 Currently INSIDER operates drive data at the granularity of 64 B, therefore the offset and length fields have to be multiples of 64 B. It is a limitation of our current implementation rather than the design.

[Figure 5: The system diagram of performing virtual file read, covering 1) system startup, 2) registration, 3) file open (real file info, permission check, append-only attribute, extents info from filefrag, runtime parameters), 4) file read (vread, drive program intercepts, output DMA) and 5) file close (unset append-only, release host resources, reset). Only major steps are shown in the figure; see the text description in §3.3.1 for details.]

File read. Now the host program can sequentially read the virtual file via vread. It first triggers the INSIDER drive to read the corresponding real file extents. The accelerator intercepts the results from the storage chips and invokes the corresponding data processing. Its output will be transferred back to the host via DMA, creating an illusion that the host is reading a normal file (which actually turns out to be a virtual file). The whole process is deeply pipelined without stalling. The detailed design of pipelining is discussed in §3.5. It seems to be meaningless to read the ISC results randomly, thus we do not implement a vseek interface.

File close. Finally, the virtual file is closed via vclose. In this step, the INSIDER runtime will contact the INSIDER kernel module to clear the append-only attribute if it was previously set in vopen. The host-side resource (e.g., file descriptor, the host-side buffer for DMA, etc.) will be released. Finally, the runtime sends the command to the INSIDER drive to reset the application accelerator to its initial state.

Virtual file read helps us to alleviate the bandwidth bottleneck in the drive → host direction. For example, for the feature selection application [64], the user registers a virtual file based on the preselected training data and the corresponding accelerator. The postselected result could be automatically read via vread without transferring the large preselected file from drive to host. Thus, the host program can simply use the virtual file as the input file to run the ML training algorithm.

3.3.2 Virtual File Write

Virtual file write works mostly in the same way but reverses the data path direction. We focus on describing the difference.

Registration.
Virtual write requires users to preallocate enough space for the real file(s) to store the write output. If users leverage the fallocate system call to preallocate the file, they have to make sure to clear the unwritten flag on the file 4With the append-only attribute, ftruncate will fail to release blocks, and the file defragmentation tool, e.g., xfs_fsr will ignore these blocks [21]. extents.5 Otherwise, later updates on the real file may only be perceived via the INSIDER interface but not the OS interface. File open. Besides the steps in §3.3.1, INSIDER runtime invokes f sync to flush dirty pages of the real file(s) to drive if there are any. This guarantees the correct order between previous host-initiated write requests and the upcoming IN- SIDER drive-initiated write requests. File write. In the file write stage, users invoke vwrite to write data to the virtual file. The written data is transferred to INSIDER drive through DMA, and then will be intercepted and processed by the accelerator. The output data will be written into the corresponding real file blocks. INSIDER also provides vsync (method 4 in Table 2), which can be used by users to flush in-core vwrite data to the INSIDER drive. File close. Besides the steps in §3.3.1, INSIDER runtime will drop the read cache of the real file(s), if there are any, to guarantee that the newly drive-written data can be perceived by the host. This is conducted via calling posix_fadvise with POSIX_FADV_DONTNEED. Via invoking a variant of vclose (method 6 in Table 2), users can know the number of bytes written to the real file(s) by the underlying INSIDER drive. Based on the returned value, users may further invoke ftrun- cate to truncate the real file(s). Virtual file write helps us alleviate the bandwidth bottle- neck in the host→ drive direction, since less data needs to be transferred through the bus (they then gets amplified in drive). For example, the user can register a virtual file based on a compressed real file and a decompression drive program. In this scenario, only compressed data needs to be transferred through the bus, and the drive performs in-storage decompres- sion to materialize the decompressed file. Since the virtual file write is mostly symmetric to the virtual file read, in the following we will introduce other system designs based on the direction of read to save space. 3.3.3 Concurrency Control In INSIDER, a race condition might happen in the following cases: 1) Simultaneously a single real file is being vwrite and vread; 2) Simultaneously a real file is being vwrite by different processes; 3) A single real file is being vread, and meanwhile it is being written by a host program. In these cases, the users may encounter non-determinate results. 5In Linux, some file systems, e.g., ext4, will put the unwritten flag over the file extents preallocated by fallocate. Any following read over the extents will simply return zero(s) without actually querying the underlying drive; this is designed for security considerations since the preallocated blocks may contain the data from other users. Figure 6: A simple example of the INSIDER drive accelerator code. The problem also applies to Linux file systems: for example, different host processes may write to a same file. Linux file systems do not automatically enforce the user-level file concur- rency control and leave the options to users. INSIDER makes the same decision here. 
When it is necessary, users can reuse the Linux file lock API to enforce the concurrency control by putting the R/W lock to the mapped real file. 3.4 The Drive-Side Programming Model In this section we introduce the drive-side programming model. INSIDER defines a clear interface to hide all details of data movements between the accelerator program and other system components so that the device programmer only needs to focus on describing the computation logic. INSIDER pro- vides a drive-side compiler which allows users to program in-drive accelerators with C++ (see Fig. 6 for a sample pro- gram). Additionally, the INSIDER compiler also supports the traditional RTL (e.g., Verilog) for experienced FPGA pro- grammers. As we will see in §5.2, only C++ is used in the evaluation, and it can already achieve near-optimal perfor- mance in our scenario (§5.2.2). Drive program interface consists of three FIFOs—data in- put FIFO, data output FIFO and parameter FIFO, as shown in the sample code. Input FIFO stores the intercepted data which is used for the accelerator processing. The output data of the accelerator, which will be sent back to host and acquired by vread, is stored into output FIFO. The host-sent runtime parameters are stored in parameter FIFO. The input and the output data are wrapped into a sequence of flits, i.e., struct APP_Data (see Fig. 6). The concept of flit is similar to the "word size" in host programs. Each flit contains a 64-byte payload, and the eop bit is used for marking the end of the input/output data. The length of data may not be multiples of 64 bytes, the len field is used to indicate the length of the last flit. For example, 130-byte data is composed by three flits; the last flit has eop = true and len = 2. The sample program first reads two parameters, upper bound and lower bound, from the parameter FIFO. After that, in each iteration, the program reads the input record from the input FIFO. Then the program checks the filtering condition and writes the matched record into the output FIFO. Users can define stateful variables which are alive across iterations, e.g., line 11 - line 13 in Fig. 6, and stateless variables as well, e.g., line 22. These variables will be matched into FPGA reg- isters or block RAMs (BRAMs) according to their sizes. The current implementation does not allow placing variables into FPGA DRAM, but it is trivial to extend. INSIDER supports modularity. The user can define mul- tiple sub-programs chained together with FIFOs to form a complete program, as long as it exposes the same drive accel- erator interface shown above. Chained sub-programs will be compiled as separate hardware modules by the INSIDER com- piler, and they will be executed in parallel. This is very similar to the dataflow architecture in the streaming system, and we can build a map-reduce pipeline in drive with chained sub- programs. In fact, most applications evaluated in §5.2 are im- plemented in this way. Stateful variables across sub-programs could also be passed through the FIFO interface. 3.5 System-Level Pipelining Logically, in INSIDER, vread triggers the drive controller to fetch storage data, perform data processing, and transfer the output back to host. After that, the host program can finally start the host-side computation to consume the data. A naive design leads to the execution time t = tdrive_read +tdrive_comp.+ tout put_trans.+ thost_comp. As we will see in §5.2, this leads to a limited performance. 
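Before moving on to the pipeline, it helps to make the drive-side model of §3.4 concrete. Figure 6 itself is not reproduced in this preview, so the following is a hypothetical reconstruction of the kind of range filter it describes; the Fifo type, the exact APP_Data field names and the assumption of one 8-byte key per 64-byte flit are illustrative, and only the behavior (read the upper and lower bounds from the parameter FIFO, then forward matching records from the input FIFO to the output FIFO) follows the text.

#include <cstdint>
#include <cstring>
#include <deque>

// One flit of the drive-side stream: a 64-byte payload, an eop bit marking the
// last flit, and len giving the valid bytes of that last flit (130 B of data is
// three flits; the last one has eop = true and len = 2).
struct APP_Data {
  bool eop = false;
  uint8_t len = 64;
  uint8_t payload[64] = {};
};

// Software stand-in for the compiler-provided FIFO interface; on the real
// device these are hardware streams, not std::deque.
template <typename T>
struct Fifo {
  std::deque<T> q;
  T read() { T v = q.front(); q.pop_front(); return v; }
  void write(const T &v) { q.push_back(v); }
};

// Drive program in the spirit of Fig. 6: read the two bounds from the
// parameter FIFO (stateful variables, kept in FPGA registers/BRAM across
// iterations), then keep only the flits whose leading 8-byte key lies in
// [lower, upper]. A real program would also need to emit a final eop flit even
// when the last input record is filtered out; that detail is omitted here.
void range_filter(Fifo<APP_Data> &data_in,
                  Fifo<APP_Data> &data_out,
                  Fifo<uint64_t> &params) {
  const uint64_t upper = params.read();
  const uint64_t lower = params.read();
  bool done = false;
  while (!done) {
    APP_Data flit = data_in.read();              // intercepted storage data
    uint64_t key;
    std::memcpy(&key, flit.payload, sizeof key); // assumed record layout
    if (key >= lower && key <= upper)
      data_out.write(flit);                      // matched records return to the host via DMA
    done = flit.eop;
  }
}

Chained sub-programs keep this same three-FIFO shape, which is what lets the INSIDER compiler build them as separate hardware modules running in parallel, as described above.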
INSIDER constructs a deep system-level pipeline which includes all system components involved in the end-to-end processing. It happens transparently for users; they simply use the programming interface introduced in §3.3 and §3.4. With pipelining, the execution time is decreased to max(t_drive_read, t_drive_comp, t_output_trans, t_host_comp).

Overlap t_drive_read with t_drive_comp. We carefully design the INSIDER hardware logic to ensure that it is fully pipelined, so that the storage read stage, computation stage and output DMA stage overlap one another.

Overlap drive, bus and host time. We achieve this by (1) pre-issuing the file access requests during vopen, which would trigger the drive to perform the precomputation; (2) allocating the host memory in the INSIDER runtime to buffer the drive precomputed results. With (1), the drive has all the position information of the mapped real file, and it can perform computation at its own pace. Thus, the host-side operation is decoupled from the drive-side computation. (2) further decouples the bus data transferring from the drive-side computation. Now, each time that the host invokes vread, it simply pops the precomputed result from host buffers. To prevent the

[Figure 9: Speedup of optimized host-only versions and the INSIDER version compared to the host-only baseline (§5.2.1); bars for Host-bypass/x8, Host-bypass/x16, Host-bypass-pipeline/x8, Host-bypass-pipeline/x16, INSIDER/x8 and INSIDER/x16 across the seven applications.]

[Figure 10: The breakdown of the speedup achieved by INSIDER compared with the host-only baseline (§5.2.1) into customized I/O stack, pipeline & offload, and data reduction; (a) INSIDER/x8 (i.e., the bus-limited case), (b) INSIDER/x16 (i.e., the bus-ample case).]

this corresponds to Host-bypass-pipeline in Fig. 9. Finally, we leverage the ISC capability to offload computing tasks to the drive. For this version we largely reuse code from the baseline version since the virtual file abstraction allows us to stay at the traditional file accessing interface (§3.3) and INSIDER transparently constructs the system-level pipeline (§3.5). This corresponds to INSIDER in Fig. 9. Note that the end-to-end execution time here includes the overheads of INSIDER APIs like vopen, vclose, but it does not include the overhead of reconfiguring FPGA, which is in the order of hundreds of milliseconds and is proportional to the region size [67]. We envision that in practice the application execution has time locality so that the overheads of reconfiguring will be amortized by multiple following calls. The speedup of version INSIDER is derived from three aspects: 1) customized I/O stack (§4.2), 2) task offloading (§3.4) and system-level pipelining (§3.5), and 3) reduced data volume (which leads to lower bus time). See Fig. 10 for the speedup breakdown in these three parts. In the x8 setting, which has lower bus bandwidth, data reduction is the major source of the overall speedup. By switching from x8 to x16, the benefit of data reduction decreases, which makes sense since now we use a faster interconnection bus. Nevertheless, it still accounts for a considerable speedup. Meanwhile, pipelining and offloading contribute to a major part of the speedup.
As we discussed in §2.1, four-lane (the most common) and eight-lane links are used in real life because of storage density and cost constraints. INSIDER/x16 does not represent a practical scenario at this point. The motivation for showing both the results of x8 and x16 is to compare the benefits of data reduction in both bus-limited and bus-ample cases. 5.2.2 Optimality and Bottleneck Analysis Table 5 shows the performance bottleneck of different exe- cution schemes for seven applications. For Host-bypass, lim- Host- bypass/x8 Host- bypass/x16 INSIDER/x8 INSIDER/x16 Grep PCIe PCIe Drive Drive KNN PCIe Comp. Drive Drive Statistics PCIe PCIe Drive Drive SQL query PCIe Comp. Comp. Comp. Integration PCIe PCIe Drive Drive Feature selec- tion Comp. Comp. PCIe Drive Bitmap de- compression PCIe PCIe Drive Drive Table 5: The end-to-end performance bottleneck of different executing schemes over seven different applications. Here PCIe, Drive and Comp. indicate that the bottleneck is PCIe performance, drive chip performance and the host-side computation performance, respectively (§5.2.2). ited PCIe bandwidth is the major bottleneck for the overall performance. In contrast, after enabling the in-storage pro- cessing, even in the PCIe x8 setting, there is only one case in which PCIe becomes the bottleneck (see INSIDER/x8). For most cases in INSIDER, the overall performance is bounded by the internal drive speed, which indicates that the optimal performance has been achieved. For some cases, like KNN and feature selection, host-side computation is the perfor- mance bottleneck for Host-bypass. This is alleviated in IN- SIDER since FPGA has better computing capabilities for the offloaded tasks. For INSIDER, SQL query is still bottlenecked by the host-side computation of the non-offloaded part. 5.2.3 Development Efforts Table 3 also presents the developing efforts of implementing these applications in terms of lines of code (column LoC) and the developing time (column Devel. Time). With virtual file abstraction, all host programs here only require less than half an hour to be ported to the INSIDER; The main development time is spent on implementing the drive accelerator which requires drive programmers to tune the performance. This time is expected to be reduced in the future with continuous improvements on the FPGA programming toolchain. Addi- 0 2 4 6 8 10 12 14 16 0.5 1 1.5 2 2.5 B an dw id th ( G B /s ) Time (s) statistics SQL pass-through Figure 11: Data rates of accelerators that are executed simultaneously in drive. The drive bandwidth is 16 GB/s, and the bandwidth requested by statis- tics, SQL and pass-through are 12 GB/s, 6.4 GB/s and 8 GB/s, respectively. statistics starts before time 0 s and ends at about time 1.5 s. SQL starts at about time 0.4 s and ends at about time 2.4 s. Pass-through starts at about time 0.8 s and ends at about time 2.6 s. LUT FF BRAM DSP Grep 34416 24108 1 0 KNN 9534 11975 0.5 0 Statistics 14698 15966 0 0 SQL query 9684 14044 1 0 Integration 40112 6497 14 0 Feature selection 41322 44981 24 48 Bitmap decompression 60837 13676 0 0 INSIDER framework 68981 120451 309 0 DRAM and DMA IP cores 210819 245067 345.5 12 XCVU9P [19] 1181768 2363536 2160 6840 XC7A200T [2] 215360 269200 365 740 Table 6: The top half shows the FPGA resource consumption in our experi- ments. Generally, an FPGA chip contains four types of resources: look-up tables (LUTs), flip-flops (FFs), block RAMs (BRAMs, which are SRAM- based), digital signal processors (DSPs). 
The bottom half shows the initial available resource in FPGA XCVU9P and XC7A200T. tionally, since INSIDER provides a clear interface to separate the responsibilities between host and drive, drive programs could be implemented as a library by experienced FPGA de- velopers. This can greatly lower the barrier for host users when it comes to realizing the benefits of the INSIDER drive. Still, the end-to-end developing time is much less compared to an existing work. Table 1 in work [61] shows that WILLOW requires thousands of LoC and one-month development time to implement some basic drive applications like simple drive I/O (1500 LoC, 1 month) and file appending (1588 LoC, 1 month). WILLOW is definitely an excellent work, and here the main reason is that WILLOW was designed at a lower layer to extend the semantics of the storage drive, while IN- SIDER focuses on supporting ISC by exposing a compute-only interface at drive and file APIs at host. 5.3 Simultaneous Multiprocessing In this section we focus on evaluating the effectiveness of the design in §3.6. We choose statistics, SQL query, and pass-through as our offloaded applications. On the drive accelerator side, we throttle their computing speeds below the drive internal bandwidth so that each of them cannot fully saturate the high drive rate: BWdrive = 16 GB/s,BWstat = 12 GB/s,BWSQL = 6.4 GB/s,BWPT = 8 GB/s. The host-side task scheduling has already been enforced by the host OS, and our goal here is to evaluate the effectiveness of the drive-side bandwidth scheduling. Hence, we modify the host programs so that they only invoke INSIDER APIs without doing the host-side computation. In this case, the application execution time is a close approximation of the drive-side accelerator execution time. Therefore, the data processing rate for each accelerator can be calculated as rate = ∆size(data)/∆time. Fig. 11 presents the runtime data rate of three accelera- tors that execute simultaneously in drive. As we can see, IN- SIDER will try best to accommodate the bandwidth requests of offloaded applications. When it is not possible to do so, i.e., the sum of total requested bandwidth is higher than the drive bandwidth, INSIDER will schedule bandwidth for applications in a fair fashion. 5.4 Analysis of the Resource Utilization Table 6 presents the FPGA resource consumption in our ex- periments. The end-to-end resource usage consists of three parts: 1 User application logic. Row Grep to row Bitmap de- compression correspond to this part. 2 INSIDER framework. Row INSIDER framework corresponds to this part. 3 I/O IP cores. This part mainly comprises the resource for the DRAM controller and the DMA controller. Row DRAM and DMA IP cores correspond to this part. We note that 3 takes the major part of the overall resource consumption. However, these components actually already ex- ist (in the form of ASIC hard IP) in modern storage drives [33], which also have a built-in DRAM controller and need to inter- act with host via DMA. Thus, 3 only reflects the resource use that would only occur in our prototype due to our limited eval- uation environment. The final resource consumption should be measured as 1 + 2 . Row XCVU9P [19] and row XC7A200T show the available resource of a high-end FPGA (which is used in our evaluation) and a low-end FPGA 7, respectively. We notice that in the best case, the low-end FPGA is able to simultaneously accommodate five resource-light applications (grep, KNN, statistics, SQL, integration). 
The key insight here is that, for the ISC purpose, we only need to offload code snip- pet involving data reduction (related to the virtual file read) or data amplification (related to the virtual file write), therefore the drive programs are frugal in the resource usage. 5.5 Comparing with the ARM-Based System Methodology. We assume that only the FPGA-based ISC unit is replaced by the ARM CPU, and all other designs remain unchanged. We extract the computing programs from the traditional host-only implementation used in §5.2. Since we assume the system-level pipelining (§3.5) is also deployed here, the final end-to-end time of the ARM-based platform could be calculated as Te2e = max(Thost ,Ttrans,TARM), where Thost denotes the host-side processing time and Ttrans denotes the host/drive data transferring time. Here, Thost and Ttrans are taken from the measured data of INSIDER at §5.2. We target Cortex-A72 (using parameters in [12]), which is a high- end quad-core three-way superscalar ARM processor. We conduct runtime profilings over an ARM machine to extract 7We do not directly use XC7A200T in the evaluation since we cannot find a low-end FPGA board with large DRAM, which forces us to use XCVU9P. 101 102 103 104 Grep KNN Statistics SQL Integration Feature Sel. Bitmap T hr ou gh pu t ( M iB /s ) ARM-1C ARM-2C ARM-3C ARM-4C INSIDER Figure 12: End-to-end data processing rates of INSIDER and the ARM-based platforms. ARM-NC means to use N core(s). 0 100 200 300 400 Grep KNN Statistics SQL Feature Sel. Integration Bitmap C os t E ffi ci en cy ( M iB /$ ) ARM INSIDER Figure 13: Cost efficiency (defined as data processing rates per dollar) of INSIDER and the ARM-based platforms. We do not include the cost of storage drive, whose price varies significantly across configurations. the number of program instructions. The program execution time is then calculated optimistically by assuming that it has perfect IPC and perfect parallelism over multiple cores. Fig. 12 (in log scale) shows the end-to-end data processing rates of INSIDER and the ARM-based platform. The speedup of INSIDER is highly related to the computation intensity of examined applications, but on average, INSIDER could achieve 12X speedup. For KNN, which is the most compute-intensive case, INSIDER could achieve 58X speedup; while for SQL query, which has the least computation intensity, the ARM- based platform could achieve the same performance. We further present the cost efficiency of the ARM and INSIDER platforms, which is defined as the data processing rate per dollar. As discussed in §5.4, FPGA XC7A200T is already able to meet our resource demand; thus we use it in this evaluation. The wholesale price of FPGA is much less compared to its retail price according to the experience of Microsoft [36]. For a fair comparison, we use the wholesale prices of FPGA XC7A200T ($37 [20]) and ARM cortex-A72 ($95 [12]). We did not include the cost of storage drive in this comparison. Fig. 13 shows the cost efficiency results. Compared with the ARM-based platform, INSIDER achieves 31X cost efficiency on average. Specifically, it ranges from 2X (in SQL query) to 150X (in KNN). 6 Future Work In-storage computing is still in its infancy. INSIDER is our initial effort to marry this architectural concept with a practical system design. There is a rich set of interesting future work, as we summarize in the following. Extending INSIDER for a broader scenario. 
First, from the workload perspective, an extended programming model is desired to better support the data-dependent applications like key-value store. The current programming model forces host to initiate the drive access request, thus it cannot bypass the interconnection latency. Second, from the system perspective, it would be useful to integrate INSIDER with other networked systems to reduce the data movement overheads. Compared to PCIe, performance of the network is further constrained, which creates yet another scenario for INSIDER [45]. The design of INSIDER is mostly agnostic to the underlying interconnection. By changing the DMA part into RDMA (or Ethernet), INSIDER can support the storage disaggregation case, helping cloud users to cross the “network wall” and take advantage of the fast remote drive. Other interesting use cases include offloading computation to HDFS servers and NFS servers. Data-centric system architecture. Traditionally, the com- puter system is designed to be computing-centric, in which the data from IO devices are transferred and then processed by CPU. However, the traditional system is facing two main chal- lenges. First, the data movement between IO devices and CPU has proved to be very expensive [53], which can no longer be ignored in the big data era. Second, due to the end of Den- nard Scaling, general CPUs can no longer catch up with the ever-increasing speed of IO devices. Our long-term vision is to refactor the computer system into being data-centric. In the new architecture, CPU is only responsible for control plane processing, and it offloads data plane processing directly into the customized accelerator inside of IO devices, including storage drives, NICs [50, 52], memory [51], etc. 7 Conclusion To unleash the performance of emerging storage drives, we present INSIDER, a full-stack redesigned storage system. On the performance side, INSIDER successfully crosses the “data movement wall” and fully utilizes the high drive performance. On the programming side, INSIDER provides simple but effec- tive abstractions for programmers and offers necessary system support which enables a shared executing environment. Acknowledgements We would like to thank our shepherd, Keith Smith, and other anonymous reviewers for their insightful feedback and com- ments. We thank Wencong Xiao and Bojie Li for all technical discussions and valuable comments. We thank the Amazon F1 team for AWS credits donation. We thank Janice Wheeler for helping us edit the paper draft. This work was supported in part by CRISP, one of six centers in JUMP, a Semiconductor Research Corporation (SRC) program sponsored by DARPA, the NSF NeuroNex award #DBI-1707408, and the funding from Huawei, Mentor Graphics, NEC and Samsung under the Center for Domain-Specific Computing (CDSC) Industrial Partnership Program. Zhenyuan Ruan is also supported by a UCLA Computer Science Departmental Fellowship. Proceedings of the 43rd International Symposium on Computer Architecture, ISCA ’16, pages 153–165, Pis- cataway, NJ, USA, 2016. IEEE Press. [39] F. T. Hady, A. Foong, B. Veal, and D. Williams. Plat- form Storage Performance With 3D XPoint Technology. Proceedings of the IEEE, 105(9):1822–1833, Sep. 2017. [40] Insoon Jo, Duck-Ho Bae, Andre S. Yoon, Jeong-Uk Kang, Sangyeun Cho, Daniel D. G. Lee, and Jaeheon Jeong. YourSQL: A High-performance Database System Leveraging In-storage Computing. Proc. VLDB Endow., 9(12):924–935, August 2016. [41] Myoungsoo Jung and Mahmut Kandemir. 
Revisiting Widely Held SSD Expectations and Rethinking System- level Implications. In Proceedings of the ACM SIG- METRICS/International Conference on Measurement and Modeling of Computer Systems, SIGMETRICS ’13, pages 203–216, New York, NY, USA, 2013. ACM. [42] Svilen Kanev, Juan Pablo Darago, Kim Hazelwood, Parthasarathy Ranganathan, Tipp Moseley, Gu-Yeon Wei, and David Brooks. Profiling a Warehouse-scale Computer. In Proceedings of the 42Nd Annual Interna- tional Symposium on Computer Architecture, ISCA ’15, pages 158–169, New York, NY, USA, 2015. ACM. [43] Kimberly Keeton, David A. Patterson, and Joseph M. Hellerstein. A Case for Intelligent Disks (IDISKs). SIG- MOD Rec., 27(3):42–52, September 1998. [44] Ahmed Khawaja, Joshua Landgraf, Rohith Prakash, Michael Wei, Eric Schkufza, and Christopher J. Ross- bach. Sharing, Protection, and Compatibility for Re- configurable Fabric with AmorphOS. In 13th USENIX Symposium on Operating Systems Design and Implemen- tation (OSDI 18), pages 107–127, Carlsbad, CA, 2018. USENIX Association. [45] Byungseok Kim, Jaeho Kim, and Sam H. Noh. Man- aging Array of SSDs When the Storage Device Is No Longer the Performance Bottleneck. In 9th USENIX Workshop on Hot Topics in Storage and File Systems (HotStorage 17), Santa Clara, CA, 2017. USENIX Asso- ciation. [46] Sungchan Kim, Hyunok Oh, Chanik Park, Sangyeun Cho, and Sang-Won Lee. Fast, energy efficient scan inside flash memory SSDs. In Proceeedings of the Inter- national Workshop on Accelerating Data Management Systems (ADMS), 2011. [47] Sungchan Kim, Hyunok Oh, Chanik Park, Sangyeun Cho, Sang-Won Lee, and Bongki Moon. In-storage Processing of Database Scans and Joins. Inf. Sci., 327(C):183–200, January 2016. [48] Gunjae Koo, Kiran Kumar Matam, Te I, H. V. Kr- ishna Giri Narra, Jing Li, Hung-Wei Tseng, Steven Swan- son, and Murali Annavaram. Summarizer: Trading Com- munication with Computing Near Storage. In Proceed- ings of the 50th Annual IEEE/ACM International Sympo- sium on Microarchitecture, MICRO-50 ’17, pages 219– 231, New York, NY, USA, 2017. ACM. [49] Philip Kufeldt, Carlos Maltzahn, Tim Feldman, Chris- tine Green, Grant Mackey, and Shingo Tanaka. Eusocial Storage Devices: Offloading Data Management to Stor- age Devices that Can Act Collectively. ;login:, 43(2), 2018. [50] Bojie Li, Zhenyuan Ruan, Wencong Xiao, Yuanwei Lu, Yongqiang Xiong, Andrew Putnam, Enhong Chen, and Lintao Zhang. KV-Direct: High-Performance In- Memory Key-Value Store with Programmable NIC. In Proceedings of the 26th Symposium on Operating Sys- tems Principles, SOSP ’17, pages 137–152, New York, NY, USA, 2017. ACM. [51] Shuangchen Li, Dimin Niu, Krishna T. Malladi, Hongzhong Zheng, Bob Brennan, and Yuan Xie. DRISA: A DRAM-based Reconfigurable In-Situ Accelerator. In Proceedings of the 50th Annual IEEE/ACM Interna- tional Symposium on Microarchitecture, MICRO-50 ’17, pages 288–301, New York, NY, USA, 2017. ACM. [52] Yuanwei Lu, Guo Chen, Zhenyuan Ruan, Wencong Xiao, Bojie Li, Jiansong Zhang, Yongqiang Xiong, Peng Cheng, and Enhong Chen. Memory efficient loss re- covery for hardware-based transport in datacenter. In Proceedings of the First Asia-Pacific Workshop on Net- working, APNet 2017, Hong Kong, China, August 3-4, 2017, pages 22–28, 2017. [53] Onur Mutlu, Saugata Ghose, Juan Gómez-Luna, and Rachata Ausavarungnirun. Processing data where it makes sense: Enabling in-memory computation. Micro- processors and Microsystems, 2019. [54] Sanketh Nalli, Swapnil Haria, Mark D. Hill, Michael M. 
Swift, Haris Volos, and Kimberly Keeton. An Analysis of Persistent Memory Use with WHISPER. In Proceed- ings of the Twenty-Second International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS ’17, pages 135–148, New York, NY, USA, 2017. ACM. [55] Jian Ouyang, Shiding Lin, Zhenyu Hou, Peng Wang, Yong Wang, and Guangyu Sun. Active SSD Design for Energy-efficiency Improvement of Web-scale Data Analysis. In Proceedings of the 2013 International Sym- posium on Low Power Electronics and Design, ISLPED ’13, pages 286–291, Piscataway, NJ, USA, 2013. IEEE Press. [56] Jian Ouyang, Shiding Lin, Song Jiang, Zhenyu Hou, Yong Wang, and Yuanzheng Wang. SDF: Software- defined Flash for Web-scale Internet Storage Systems. In Proceedings of the 19th International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS ’14, pages 471–484, New York, NY, USA, 2014. ACM. [57] D. Park, J. Wang, and Y. S. Kee. In-Storage Comput- ing for Hadoop MapReduce Framework: Challenges and Possibilities. IEEE Transactions on Computers, PP(99):1–1, 2016. [58] A. Putnam. (Keynote) The Configurable Cloud - Ac- celerating Hyperscale Datacenter Services with FPGA. In 2017 IEEE 33rd International Conference on Data Engineering (ICDE), pages 1587–1587, April 2017. [59] Erik Riedel, Garth A. Gibson, and Christos Faloutsos. Active Storage for Large-Scale Data Mining and Mul- timedia. In Proceedings of the 24rd International Con- ference on Very Large Data Bases, VLDB ’98, pages 62–73, San Francisco, CA, USA, 1998. Morgan Kauf- mann Publishers Inc. [60] Z. Ruan, T. He, B. Li, P. Zhou, and J. Cong. St-accel: A high-level programming platform for streaming applica- tions on fpga. In 2018 IEEE 26th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM), pages 9–16, April 2018. [61] Sudharsan Seshadri, Mark Gahagan, Sundaram Bhaskaran, Trevor Bunker, Arup De, Yanqin Jin, Yang Liu, and Steven Swanson. Willow: A User- Programmable SSD. In 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI 14), pages 67–80, Broomfield, CO, 2014. USENIX Association. [62] Cassidy R. Sugimoto, Hamid R. Ekbia, and Michael Mattioli. Big Data Is Not a Monolith. The MIT Press, 2016. [63] Devesh Tiwari, Simona Boboila, Sudharshan Vazhkudai, Youngjae Kim, Xiaosong Ma, Peter Desnoyers, and Yan Solihin. Active Flash: Towards Energy-Efficient, In- Situ Data Analytics on Extreme-Scale Machines. In Presented as part of the 11th USENIX Conference on File and Storage Technologies (FAST 13), pages 119– 132, San Jose, CA, 2013. USENIX. [64] Ryan J Urbanowicz, Melissa Meeker, William LaCava, Randal S Olson, and Jason H Moore. Relief-based fea- ture selection: introduction and review. arXiv preprint arXiv:1711.08421, 2017. [65] Louis Woods, Zsolt István, and Gustavo Alonso. Ibex - An Intelligent Storage Engine with Support for Ad- vanced SQL Off-loading. PVLDB, 7(11):963–974, 2014. [66] Qiumin Xu, Huzefa Siyamwala, Mrinmoy Ghosh, Tameesh Suri, Manu Awasthi, Zvika Guz, Anahita Shayesteh, and Vijay Balakrishnan. Performance Analy- sis of NVMe SSDs and Their Implication on Real World Databases. In Proceedings of the 8th ACM International Systems and Storage Conference, SYSTOR ’15, pages 6:1–6:11, New York, NY, USA, 2015. ACM. [67] Jiansong Zhang, Yongqiang Xiong, Ningyi Xu, Ran Shu, Bojie Li, Peng Cheng, Guo Chen, and Thomas Mosci- broda. The feniks fpga operating system for cloud com- puting. 
In Proceedings of the 8th Asia-Pacific Workshop on Systems, APSys ’17, pages 22:1–22:7, New York, NY, USA, 2017. ACM. [68] Peipei Zhou, Zhenyuan Ruan, Zhenman Fang, Megan Shand, David Roazen, and Jason Cong. Doppio: I/O- Aware Performance Analysis, Modeling and Optimiza- tion for In-Memory Computing Framework. In IEEE International Symposium on Performance Analysis of Systems and Software, ISPASS ’18, 2018. [69] H. R. Zohouri, N. Maruyama, A. Smith, M. Matsuda, and S. Matsuoka. Evaluating and Optimizing OpenCL Kernels for High Performance Computing with FPGAs. In SC ’16: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pages 409–420, Nov 2016.