TY - GEN
T1 - Accurate Matrix Multiplication on Binary128 Format Accelerated by Ozaki Scheme
AU - Mukunoki, Daichi
AU - Ozaki, Katsuhisa
AU - Ogita, Takeshi
AU - Imamura, Toshiyuki
N1 - Funding Information:
This research was supported by the Japan Society for the Promotion of Science (JSPS) KAKENHI Grant 19K20286.
Publisher Copyright:
© 2021 Owner/Author.
PY - 2021/8/9
Y1 - 2021/8/9
N2 - Although IEEE 754-2008 binary128 (with a 15-bit exponent and 113-bit significand, i.e., quadruple-precision) is not currently implemented on x86 in hardware, software emulation is available on some compilers. However, the performance is significantly slower compared to the binary64 operation, which is supported natively in hardware. This study proposes a fast implementation of matrix multiplication on matrices stored in the binary128 format on x86 CPUs. The proposed implementation utilizes the Ozaki scheme, which is an accurate matrix multiplication algorithm proposed by Ozaki et al. in 2012. This scheme enables one to perform most computations using the binary64 matrix multiplication (the DGEMM routine in Basic Linear Algebra Subprograms (BLAS)); it can exploit the high-performance of highly-optimized vendor BLAS. Although the achievable performance depends on the input matrices (the inner-product dimension, the absolute range, and the significand bit length), the proposed implementation can achieve better performance and accuracy compared to naive matrix multiplication performed using the GCC's binary128 emulation in many cases. In addition, we discuss GPU acceleration, performance on reduced precision inputs, an implementation based on binary32 matrix multiplication (SGEMM), application to memory-intensive operations, and the possibility of a distributed parallel implementation.
AB - Although IEEE 754-2008 binary128 (with a 15-bit exponent and 113-bit significand, i.e., quadruple-precision) is not currently implemented on x86 in hardware, software emulation is available on some compilers. However, the performance is significantly slower compared to the binary64 operation, which is supported natively in hardware. This study proposes a fast implementation of matrix multiplication on matrices stored in the binary128 format on x86 CPUs. The proposed implementation utilizes the Ozaki scheme, which is an accurate matrix multiplication algorithm proposed by Ozaki et al. in 2012. This scheme enables one to perform most computations using the binary64 matrix multiplication (the DGEMM routine in Basic Linear Algebra Subprograms (BLAS)); it can exploit the high-performance of highly-optimized vendor BLAS. Although the achievable performance depends on the input matrices (the inner-product dimension, the absolute range, and the significand bit length), the proposed implementation can achieve better performance and accuracy compared to naive matrix multiplication performed using the GCC's binary128 emulation in many cases. In addition, we discuss GPU acceleration, performance on reduced precision inputs, an implementation based on binary32 matrix multiplication (SGEMM), application to memory-intensive operations, and the possibility of a distributed parallel implementation.
KW - Accurate
KW - BLAS
KW - Binary128
KW - FP128
KW - Linear algebra
KW - Matrix multiplication
KW - Quadruple precision
UR - http://www.scopus.com/inward/record.url?scp=85117246242&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85117246242&partnerID=8YFLogxK
U2 - 10.1145/3472456.3472493
DO - 10.1145/3472456.3472493
M3 - Conference contribution
AN - SCOPUS:85117246242
T3 - ACM International Conference Proceeding Series
BT - 50th International Conference on Parallel Processing, ICPP 2021 - Main Conference Proceedings
PB - Association for Computing Machinery
T2 - 50th International Conference on Parallel Processing, ICPP 2021
Y2 - 9 August 2021 through 12 August 2021
ER -