# GPU Erasure Coding for Parallel File Systems

This research explores the effectiveness of GPU erasure coding for parallel file systems. Using GPUs to perform erasure coding can meet the performance and capacity requirements of exascale computing, especially for campaign storage. In this architecture, the highest performance requirements of exascale computing are met by more expensive, lower-capacity systems located between the campaign storage and the compute nodes, while the campaign storage holds a project's files for the duration of its active research. Campaign storage is expected to hold trillions of files, and individual projects may store millions of files in single directories. The long lifetime of files in campaign storage and its high availability requirements make erasure coding a cost-effective solution for this tier; these requirements exceed the availability capabilities of traditional RAID-6 with two parity shards. When computing higher numbers of parity shards, the GPU becomes a viable contender. Typical campaign storage is expected to use 4 parity shards for 16 data shards, or even 8 parity shards for 32 data shards. The scope of our research is the performance of erasure coding on GPUs, compared with approaches that use general-purpose CPUs to perform the erasure coding.

## Research Systems

The CCL lab includes two storage clusters, both equipped with Nvidia GPUs. The Red Mountain cluster contains 16 servers with a total of 320 TB of raw storage; each server includes an Nvidia C1020 GPU, and the servers are interconnected with QDR InfiniBand. The Everest cluster contains 4 storage servers with a total of 2.304 PB of raw storage; each server has an Nvidia K40 GPU, and the servers are interconnected with dual QDR InfiniBand and 10 GbE. The clusters share auxiliary systems that support testing and collaborative research projects.

## Experiments

We have conducted several tests designed to measure the performance capabilities of the system, covering three areas: individual disk bandwidth, network bandwidth, and erasure coding throughput. The results of these experiments are presented below.

### Disk Bandwidth

We measured disk bandwidth on single disk drives using the Unix program dd, with various read and write sizes to compare the range of performance based on data size. The results are shown in Table 1. We measured several sizes of write and read operations to arrive at the maximum sequential performance for a single disk drive in the Dell C8000 chassis. For the 512 byte write and read, the commands are:

```shell
sudo dd if=/dev/zero of=/dev/sdaa bs=512 count=100 oflag=direct
sudo dd if=/dev/sdaa of=/dev/null bs=512 count=100
```

We also measured the disk performance using the fio tool so we could compare the results from the two methods. The write bandwidth measured with fio is on the order of 10 times that measured with dd; fio has a configuration option for maintaining a queue of outstanding reads/writes, which we set to 4 in these measurements. This shows that fio more accurately measures the operating performance capability of the disk subsystem by keeping the queue filled. The difference diminishes as the size of the operation increases to 1 MB.
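The effect of queue depth on measured throughput can be sketched with Little's law. The 5 ms latency below is an assumed, illustrative figure, not a measurement from these experiments:

```python
# Little's law sketch: sustained IOPS ~= outstanding I/Os / per-I/O latency.
# latency_s is an assumed service time for one small write, for illustration.
latency_s = 0.005
iops_qd1 = 1 / latency_s   # dd keeps one I/O in flight at a time
iops_qd4 = 4 / latency_s   # fio with a queue depth of 4
print(iops_qd4 / iops_qd1) # -> 4.0: queue depth alone gives a 4x gap
```

Queue depth explains part of the gap between dd and fio; the remainder comes from dd's small synchronous direct writes paying full per-I/O latency.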

Table 1: Disk bandwidth measured with dd and fio.

### Bandwidth and IOPS Performance Using Multiple Disks

To determine the performance of the disk subsystem, we measured several disk configurations using fio. All of the tests were configured to write a total of 8 GB using 20 jobs with a time limit of 240 seconds. The larger operation sizes demonstrate the maximum capabilities of the system. Table 2 shows the bandwidth measured with 512 byte, 4 KB, 8 KB, 128 KB, 256 KB and 1 MB operations; each operation size was measured using the 1, 2, 4, 8, 24, 48 and 96 disk configurations. The disk drives were selected to use IO data paths that provided the greatest resources for the test, e.g., the 2 disk test used one disk on the first HBA and the second disk on the second HBA. Table 3 shows the IOPS measured at the same time as the bandwidth shown in Table 2.

The tests show that IOPS peak with the 4 KB read operations. Write IOPS are more modest, with maximums occurring at the 2 disk configuration, where there is one disk drive per HBA. The tests were configured to use the libaio driver, a queue depth of 4, and direct IO, with all writes appending to a single file per disk (the raw device, e.g., /dev/sdf; there are no filesystems). Bandwidth peaks at 3.4 GB/s with the 1 MB reads and writes. The read tests hit this 3.4 GB/s maximum with 24 disk drives; the maximum read bandwidth for 8 drives is 3.1 GB/s, so the system appears to reach its upper bandwidth limit at around 10 disk drives, and adding more drives does not increase the bandwidth. For the smaller operation sizes, additional disks do make a difference, and bandwidth increases proportionally to the number of disks for some configurations. For others (the 4 KB and 8 KB writes and the 512 byte reads) this does not hold.

The 8 KB writes deliver approximately twice the bandwidth of the 4 KB writes, so we can conclude that the system performs the same number of IOPS at these operation sizes and doubling the operation size doubles the bandwidth. This relationship does not hold between the 512 byte and 4 KB operation sizes, however. The 1 disk 512 byte write runs at 80 KB/s while the 1 disk 4 KB write runs at 7 MB/s, a factor of 87; at 24 disks, the 512 byte write bandwidth is 382 KB/s compared with 20 MB/s for the 4 KB write, a ratio of 52. For the 512 byte write operation to produce the same bandwidth as the 4 KB operation, the system would need to perform 13,671 IOPS, but the limit on the Everest node is around 8,000 IOPS, reached with the 4 KB write operation on 2 disks (one per HBA). We compared these results with the IO performance of another CCL system, the Red Mountain cluster, which delivers more than 24,000 IOPS on 512 byte writes and nearly 22,000 IOPS on 4 KB writes (Table 4). The Red Mountain cluster has 2 TB SATA disk drives, whereas the Everest cluster has 6 TB SAS 3 disk drives. We expected the Everest system to perform as well as or better than the Red Mountain system and will further investigate the cause of this discrepancy.
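The arithmetic above can be checked directly (assuming decimal units, 1 MB = 10^6 bytes, as the reported figures suggest):

```python
# Required IOPS for a 512 B workload to match the 4 KB write bandwidth.
bw_4k = 7e6                    # 7 MB/s, the 1-disk 4 KB write result
iops_needed = bw_4k / 512      # -> 13671.875, the ~13,671 IOPS in the text

# Bandwidth ratios between the 4 KB and 512 B writes.
ratio_1_disk = 7e6 / 80e3      # -> 87.5, reported as a factor of 87
ratio_24_disk = 20e6 / 382e3   # -> ~52.4, reported as a ratio of 52
```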

Table 2: Disk bandwidth

Table 3: Disk IOPS; configurations of 1–48 disks measured with one Dell C8000 chassis, 96 with two

Table 4: Comparison between Red Mountain nodes and Everest nodes, single disk

### Bandwidth of Data Path

The bandwidth for data transfer between the disk drives and memory is limited by the components it must pass through. The path consists of the PCIe 3.0 bus, the two LSI 9300-8e SAS 3 controllers, the Dell C8000XD SAS 2 expanders, and the SAS interface on each disk drive. The theoretical bandwidth for each of these components is shown in Table 5.

Table 5: Theoretical maximum data path bandwidth
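The usable bandwidth of the path is the minimum over its components. As a sketch of that reasoning, the GB/s figures below are illustrative estimates derived from the interface specifications (PCIe 3.0, SAS3, SAS2), not the measured values in Table 5:

```python
# Each entry is an assumed aggregate capacity for one stage of the data path.
path_gb_s = {
    "PCIe 3.0 x8 slot": 7.9,       # 8 GT/s * 8 lanes, 128b/130b encoding
    "two SAS3 HBAs": 19.2,         # 2 HBAs * 8 lanes * ~1.2 GB/s per lane
    "SAS2 expander uplinks": 4.8,  # assumed two 4-wide 6 Gb/s ports
}
bottleneck = min(path_gb_s, key=path_gb_s.get)
print(bottleneck, path_gb_s[bottleneck])  # -> SAS2 expander uplinks 4.8
```

Under these assumed figures the SAS2 expanders would set the ceiling, which would be consistent with the measured 3.4 GB/s plateau sitting well below the HBA and PCIe limits.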

### Gibraltar Erasure Coding Plugin for Ceph

Ceph provides a standard interface for integrating erasure coding libraries into the product. This provides an opportunity to incorporate novel and effective components into the Ceph parallel file system. The plugin architecture is modularized into two functional areas: registration of erasure coding profiles and the interface to the library's encoding and decoding services. Gibraltar can provide up to 256 data and parity strips in a stripe, so Ceph erasure coded profiles can be constructed with many combinations of k and m. In practice, the number of strips in an erasure coded stripe is limited by the number of failure domains in the cluster.
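These two constraints (Gibraltar's 256-strip limit and the failure-domain bound) can be expressed as a small check; the function name is ours for illustration, not part of the Ceph or Gibraltar APIs:

```python
def profile_is_practical(k, m, failure_domains, max_strips=256):
    """True if a (k, m) profile fits Gibraltar's 256-strip limit and
    each of the k + m strips can land in its own failure domain."""
    return k + m <= max_strips and k + m <= failure_domains

print(profile_is_practical(10, 4, failure_domains=16))  # -> True
print(profile_is_practical(10, 4, failure_domains=12))  # -> False: needs 14 domains
```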

The Ceph ErasureCodePlugin class is subclassed in our work to activate a Gibraltar instance that is configured according to the parameters provided by the Ceph administrator command to create an erasure code profile. Gibraltar uses the NVIDIA CUDA library to offload computation and retrieve results from the K40 GPU in our system. The subclass ErasureCodePluginGibraltar calls the Gibraltar gib_cuda_driver function to initialize a CUDA context for the profile. The profile can then be used to create an erasure coded pool in Ceph.

The Ceph ErasureCode class is subclassed in our work to implement the Ceph ErasureCodeInterface functions for the Gibraltar library. We have modified the Gibraltar erasure code library to make it compatible with the Ceph architecture. Ceph aggregates k + m buffers for each erasure coded stripe, where each strip is referenced by a C pointer to the head of its buffer. The call to Gibraltar has been modified to accept an array of these pointers, and the logic copies each block of data onto a contiguously allocated GPU memory block. The products of the encoding or decoding are copied back to the Ceph data structures in a similar way. Figure 1 shows how data is passed to the plugin. Ceph uses a Bufferlist object to store object data as it moves through the system. The plugin divides the Bufferlist into k strips by creating an array of C pointers into the Bufferlist object, appends m parity buffers onto the Bufferlist object, and includes the C pointers to these buffers in the array. We modified the Gibraltar library to use this array of pointers into the Ceph Bufferlist object, creating new versions of the encode and regenerate functions that accept the new interface. The original functions operated on a contiguous data block passed in the call; this was sufficient for the RAID system the library originally targeted, and it reduced register pressure in the GPU by reducing the number of variables.

Figure 1: Ceph calls the erasure coding module with a Bufferlist object. The plugin divides the data into k strips and adds m parity strips.
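The data flow described above and shown in Figure 1 can be sketched in Python. XOR parity (which supports only m = 1) stands in for Gibraltar's Reed-Solomon coding, and the gather into one contiguous buffer mirrors the copy onto the contiguous GPU allocation:

```python
def encode(strips):
    """Gather k separately referenced strips (the plugin's pointer array)
    into one contiguous buffer, then compute a single XOR parity strip."""
    size = len(strips[0])
    assert all(len(s) == size for s in strips)
    contiguous = b"".join(strips)   # analogue of the copy into GPU memory
    parity = bytearray(size)
    for off in range(0, len(contiguous), size):
        for j in range(size):
            parity[j] ^= contiguous[off + j]
    return bytes(parity)            # appended to the Bufferlist as a parity strip

def regenerate(survivors, parity):
    """Rebuild one lost data strip by XORing the survivors with the parity."""
    out = bytearray(parity)
    for s in survivors:
        for j in range(len(out)):
            out[j] ^= s[j]
    return bytes(out)

# k = 3 data strips; losing one strip is recoverable from the other two + parity.
strips = [b"aaaa", b"bbbb", b"cccc"]
p = encode(strips)
assert regenerate([strips[0], strips[2]], p) == strips[1]
```

Reed-Solomon generalizes this to m > 1 parity strips by replacing XOR with Galois-field arithmetic, which is the computation Gibraltar offloads to the GPU.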

### Erasure Coding Measurements

We are interested in the performance of erasure coding algorithms and hardware to determine what configurations provide the optimal solution. We have measured three libraries for erasure coding performance on the Everest cluster using the Ceph plugin interface and benchmarking tools. Ceph includes erasure coding plugins using the Jerasure library [1] and the Intel Intelligent Storage Acceleration Library (ISA-L) [2]. We have provided a plugin for Ceph that uses the Gibraltar library [3]. Our experiments measure the encoding and repair bandwidth for several configurations of data and redundancy. The Gibraltar library used the Nvidia K40m GPUs installed on the Everest cluster storage nodes.

Four experiments were conducted to understand the benefits of GPU erasure coding for Ceph. The first two experiments were also run using the Jerasure and ISA-L erasure coding plugins included in Ceph v10.2.2 to provide a baseline for comparison with our Gibraltar plugin. The Reed-Solomon algorithm was selected because it is well suited to implementation on a GPU. We selected an economical configuration: 10 data chunks and 4 parity chunks, with a total data size of 1 MB (a typical CephFS setting), and we present measurements for 10 data chunks with 2, 3, and 4 parity chunks for both experiments. The experiments used the erasure coding benchmark tool included with v10.2.2 of Ceph. The benchmark instantiates an erasure coding profile as specified and then runs a series of encoding generations and reconstructions over a set of data. We set the CPU core affinity to a single core on the second CPU in the system because the Nvidia GPU is connected to the PCI bus of that CPU socket; all three plugins were run on the same core. A set of four 10 GB runs was made and the average reported for each. The results in Figure 2 show that the Gibraltar library generates parity at 59% of Jerasure's rate in the 10+2 configuration, at 80% for 10+3, and at practically the same rate for 10+4. Gibraltar runs at 39% of ISA-L's rate for 10+2, 48% for 10+3, and 52% for 10+4.

Figure 2: Erasure encoding bandwidth results with k=10, m=4, 1MiB object size encoding 10 GiB of data.

In Figure 3 we show the performance of reconstructing erasures in our second experiment. Times are shown for reconstructing 1, 2, 3 and 4 erasures in the configurations tested. Gibraltar's performance is 53% of Jerasure's for 10+4 with 1 erasure, and 19% faster than Jerasure for 10+4 with 4 erasures. Gibraltar's performance is 33% of ISA-L's for 10+4 with 1 erasure and 62% with 4 erasures.

Figure 3: Erasure recovery bandwidth results with k=10, m=[2,3,4], 1MiB object size recovering 10 GiB of data. Results are shown with 1, 2, 3 and 4 erasures where the number of erasures is less than or equal to m.

The third experiment measured the profile of the Gibraltar encoding execution using the CUDA nvprof program, with the same parameters as the first experiment. In the 10+4 configuration, 47% of the time was spent copying data to GPU memory, 35% generating the parity, and the remaining 18% copying the parity back to the host. The fourth experiment measured the profile of reconstruction, using the same parameters as experiment 2. In the Gibraltar reconstruction, the Galois inversion matrix for the specific erasures present in the data is computed on the host and copied to the GPU; we added this time to the data transfer to the GPU. Also, to reconstruct parity, Gibraltar must first reconstruct the data and then regenerate the parity, so we added these two times together. In the 10+4 test with 4 erasures, 57% of the time was spent copying data to GPU memory, 30% reconstructing the data and parity, and 13% copying the reconstructed data from GPU memory back to the host.
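A back-of-envelope implication of the encode profile: if the host-to-GPU and GPU-to-host copies could be overlapped with kernel execution (e.g., with CUDA streams, which is an assumption and not something the current implementation does), the pipeline would be bound by its largest stage rather than the sum:

```python
# Fractions of encode time from the nvprof measurements above (10+4 config).
copy_in, compute, copy_out = 0.47, 0.35, 0.18
serial = copy_in + compute + copy_out         # ~1.00 of current runtime
pipelined = max(copy_in, compute, copy_out)   # copy-in would dominate
print(round(serial / pipelined, 2))           # -> 2.13x potential speedup
```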

## People

University of Alabama at Birmingham

• Purushotham Bangalore
• Sagar Thapaliya
• Walker Haddock

Auburn University

• Anthony Skjellum
• Jared Ramsey

## Related Publications

• Sagar Thapaliya, Purushotham Bangalore, Jay Lofstead, Kathryn Mohror and Adam Moody, Managing I/O Interference in a Shared Burst Buffer System, Parallel Processing (ICPP), 2016 45th International Conference on, Philadelphia PA, 2016.
• Sagar Thapaliya, Purushotham Bangalore, Jay Lofstead, Kathryn Mohror and Adam Moody, IO-Cop: Managing Concurrent Accesses to Shared Parallel File System, 2014 Workshop on Interfaces and Architecture for Scientific Data Storage, In Conjunction with ICPP 2014, Minneapolis MN, Sept 2014.
• Sagar Thapaliya, Purushotham Bangalore, Kathryn Mohror, Adam Moody, Capturing I/O Dynamics in HPC Applications, Peer Reviewed Research Poster (LLNL-POST-646170), PDSW 2013, Denver, CO, November 2013.
• Sagar Thapaliya, Adam Moody, Kathryn Mohror, and Purushotham Bangalore, Inter-application Coordination for Reducing I/O Interference, Peer Reviewed Research Poster (LLNL-POST-641538), Supercomputing 2013, Denver, CO, November 2013.

## References

1. J. S. Plank. Fast Galois Field Arithmetic Library in C/C++. Technical Report CS-07-593, University of Tennessee, April 2007. URL http://web.eecs.utk.edu/~plank/plank/papers/CS-07-593.html
2. Greg Tucker. ISA-L open source v2.14 API doc, April 2014. URL https://01.org/sites/default/files/documentation/isa-l_open_src_2.10.pdf.
3. Matthew L. Curry, Anthony Skjellum, H. Lee Ward, and Ron Brightwell. Arbitrary dimension Reed-Solomon coding and decoding for extended RAID on GPUs. pages 1–3. IEEE, November 2008. ISBN 978-1-4244-4208-9. doi: 10.1109/PDSW.2008.4811887. URL http://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=4811887.

This project is supported by the National Science Foundation under Grants Nos. CNS-0821497 and CNS-1229282. Any opinions, findings, and conclusions or recommendations expressed on this website are those of the authors and do not necessarily reflect the views of the National Science Foundation.