Abstract: Modern NVIDIA GPU architectures offer dot-product instructions (DP2A and DP4A), with the aim of accelerating machine learning and scientific computing applications. These dot-product ...