TY - GEN
T1 - Reproducible BLAS routines with tunable accuracy using Ozaki scheme for many-core architectures
AU - Mukunoki, Daichi
AU - Ogita, Takeshi
AU - Ozaki, Katsuhisa
N1 - Funding Information:
Acknowledgment. This research was partially supported by MEXT as “Exploratory Issue on Post-K computer” (Development of verified numerical computations and super high-performance computing environment for extreme researches) and the Japan Society for the Promotion of Science (JSPS) KAKENHI Grant Number 19K20286.
PY - 2020
Y1 - 2020
N2 - Floating-point computations generally involve rounding errors; results may be inaccurate, and may differ between runs or environments (non-reproducible). Heterogeneous computing, in particular, introduces many factors that affect reproducibility. The loss of accuracy and reproducibility can be a crucial issue both when debugging complex codes and for the reliability of computations. In this paper, we propose high-performance implementations of reproducible basic linear algebra subprograms (BLAS) routines with tunable accuracy for many-core architectures. Our approach is based on an accurate matrix-multiplication method, the Ozaki scheme, which can be built on level-3 BLAS routines that perform standard floating-point operations. We demonstrate the performance of three routines: inner product (DOT), matrix-vector multiplication (GEMV), and matrix multiplication (GEMM) on NVIDIA's Volta GPU, comparing them with the standard routines provided by the vendor. Furthermore, we demonstrate reproducibility between CPU and GPU, as well as the achieved accuracy.
AB - Floating-point computations generally involve rounding errors; results may be inaccurate, and may differ between runs or environments (non-reproducible). Heterogeneous computing, in particular, introduces many factors that affect reproducibility. The loss of accuracy and reproducibility can be a crucial issue both when debugging complex codes and for the reliability of computations. In this paper, we propose high-performance implementations of reproducible basic linear algebra subprograms (BLAS) routines with tunable accuracy for many-core architectures. Our approach is based on an accurate matrix-multiplication method, the Ozaki scheme, which can be built on level-3 BLAS routines that perform standard floating-point operations. We demonstrate the performance of three routines: inner product (DOT), matrix-vector multiplication (GEMV), and matrix multiplication (GEMM) on NVIDIA's Volta GPU, comparing them with the standard routines provided by the vendor. Furthermore, we demonstrate reproducibility between CPU and GPU, as well as the achieved accuracy.
KW - Accurate
KW - BLAS
KW - Reproducible
UR - http://www.scopus.com/inward/record.url?scp=85083972945&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85083972945&partnerID=8YFLogxK
U2 - 10.1007/978-3-030-43229-4_44
DO - 10.1007/978-3-030-43229-4_44
M3 - Conference contribution
AN - SCOPUS:85083972945
SN - 9783030432287
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 516
EP - 527
BT - Parallel Processing and Applied Mathematics - 13th International Conference, PPAM 2019, Revised Selected Papers
A2 - Wyrzykowski, Roman
A2 - Karczewski, Konrad
A2 - Deelman, Ewa
A2 - Dongarra, Jack
PB - Springer
T2 - 13th International Conference on Parallel Processing and Applied Mathematics, PPAM 2019
Y2 - 8 September 2019 through 11 September 2019
ER -