
Share
A routine that once took six hours to multiply matrices has been slashed to a lightning-fast one second through clever optimization techniques, showcasing the power of performance engineering in computer science.
Matrix multiplication is a fundamental operation in many areas of computer science, including machine learning and scientific computing. Recently, I had the opportunity to optimize a matrix multiplication routine that initially took an excruciating 6 hours to complete. After some performance engineering wizardry, we managed to reduce this time to just 1 second. Let’s dive into how we achieved this significant improvement.
The initial implementation was straightforward and naive, using triple nested loops to perform the matrix multiplication. This approach is simple but highly inefficient due to poor memory access patterns and lack of parallelism. The experiments were conducted on an AWS c4.8xlarge instance, which has:
While this might seem like overkill for a simple experiment, the high-end specs were necessary to showcase the performance gains clearly.
Blocking, also known as tiling, involves dividing the matrix into smaller submatrices to improve cache utilization. This technique reduces the number of cache misses by keeping frequently accessed data in the cache.
Parallelizing the code using OpenMP allows us to take advantage of multiple cores. This can be done by adding a few pragmas to the existing code.
#pragma omp parallel for
for (int i = 0; i < N; ++i) {
for (int j = 0; j < M; ++j) {
for (int k = 0; k < K; ++k) {
C[i][j] += A[i][k] * B[k][j];
}
}
}

SIMD instructions allow a single operation to be applied to multiple data points simultaneously. This is particularly useful for operations like matrix multiplication.
Storing matrices in an appropriate format can improve memory access patterns. For example, using column-major order for certain operations can lead to better cache performance.
After applying these optimizations, the matrix multiplication routine was benchmarked against the initial naive implementation:
This represents a speedup of approximately 21,600 times. The optimized code not only runs faster but also scales better with increasing matrix sizes.
Performance engineering is indeed a lost art, as Professor Charles Leiserson often emphasizes. By understanding the underlying hardware and applying optimization techniques like blocking, parallelization, SIMD, and memory layout optimization, we can achieve significant performance improvements. In this case, reducing the computation time from 6 hours to 1 second demonstrates the power of these techniques.
If you’re working on computationally intensive tasks, consider these strategies to optimize your code and make the most out of your hardware.
Tags
Original Sources
About the author
Kai built ML infrastructure at a Bay Area startup before developing an obsession with transformer architectures and inference optimisation that eventually pulled him out of product work entirely. A stint at a compute research lab sharpened his instinct for what actually matters in a model release versus what is marketing. He writes from the inside — from the perspective of someone who has debugged the systems he is describing at three in the morning. He is allergic to hype and instinctively drawn to the unglamorous plumbing questions that everyone else skips over.
More from The Engineer →This Week's Edition
24 January 2024
88 articles
Related Articles
Related Articles
More Stories