
dimension, because one of the matrices will be accessed in an order that 
allows little  reuse of data in the cache. However, if we decompose the 
problem  into  smaller  matrix  multiplication  problems,  we  can  capture 
locality, reusing each word fetched from memory many times. 
Suppose  we  have  a  memory  capable  of  holding  256  kB  (32  kW)  1 
mm from our floating-point unit. The local memory is large enough to 
hold three 100 × 100 submatrices, one for each input operand and one 
for the partial result. We can perform a 100 × 100 matrix multiplication 
entirely out of the local memory, performing 2 × 10⁶ operations with only 4 × 10⁴ memory references, a ratio of 50 operations per reference. We can
apply this blocking recursively. If there is aggregate on-chip memory of 
32 MB (4 MW), we can hold three 1,000 × 1,000 submatrices at this level 
of the storage hierarchy. In a seminal paper by Hong and Kung, that idea 
was proved to be optimal for matrix multiplication in the sense that this 
kind  of  blocked  algorithm  moves  the  minimum  amount  of  data  pos-
sible between processor and memory system.⁹ Other array computations,
including convolutions and fast Fourier transforms, can be blocked
in this manner—although with different computation-to-communication 
ratios—and there are theoretical results on the optimality of communi-
cation  for  several  linear  algebra  problems  for  both  parallel  and  serial 
machines. 
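As a concrete, if simplified, illustration of this kind of blocking, the following C sketch computes C += A × B one 100 × 100 tile at a time; the function name, the fixed tile size, and the row-major layout are assumptions made for the example rather than anything prescribed above.

    #include <stddef.h>

    #define BLOCK 100   /* tile edge: three 100 x 100 tiles of doubles
                           (about 240 kB) fit in the 256-kB local memory */

    /* C += A * B for n x n matrices in row-major order, computed one tile
     * at a time so that each word brought into the local memory is reused
     * on the order of BLOCK times before it is evicted. */
    void blocked_matmul(size_t n, const double *A, const double *Bmat, double *C)
    {
        for (size_t ii = 0; ii < n; ii += BLOCK)
            for (size_t jj = 0; jj < n; jj += BLOCK)
                for (size_t kk = 0; kk < n; kk += BLOCK)
                    /* multiply the (ii,kk) tile of A by the (kk,jj) tile of
                       Bmat, accumulating into the (ii,jj) tile of C */
                    for (size_t i = ii; i < ii + BLOCK && i < n; i++)
                        for (size_t k = kk; k < kk + BLOCK && k < n; k++) {
                            double a_ik = A[i*n + k];
                            for (size_t j = jj; j < jj + BLOCK && j < n; j++)
                                C[i*n + j] += a_ik * Bmat[k*n + j];
                        }
    }

Within each tile the arithmetic-to-memory-reference ratio is roughly 2 × 100³ operations to 4 × 10⁴ references, the 50-operations-per-reference figure computed above.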
The recursive nature of the blocked algorithm also led to the notion 
of “cache-oblivious” algorithms, in which the recursive subdivision pro-
duces successively smaller subproblems that eventually fit into a cache or 
other fast memory layer.¹⁰ Whereas other blocked algorithms are imple-
mented to match the size of a cache, the oblivious algorithms are opti-
mized for locality without having specific constants, such as cache size, in 
Locality optimizations for irregular codes, such as
graph algorithms, can be much more difficult because the data structures 
are built with pointers or indexed structures that lead to random memory 
accesses. Even some graph algorithms have considerable locality that can be realized by partitioning the graph into subgraphs that fit into a local memory and reorganizing the computations to operate on each subgraph with reuse before moving on to the next subgraph.
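The sketch below is a hypothetical illustration of that reorganization for a simple neighbor-accumulation sweep over a graph in compressed sparse row (CSR) form; it assumes the partitioning and vertex renumbering have already been done, which is precisely the service the libraries mentioned next provide.

    #include <stddef.h>

    /* One sweep that, for every vertex, sums the values of its neighbors.
     * The graph has been partitioned and its vertices renumbered so that
     * partition p owns the contiguous range part_ptr[p] .. part_ptr[p+1]-1,
     * sized to fit (with its edges) in the local memory. */
    void partitioned_sweep(size_t nparts, const size_t *part_ptr,
                           const size_t *row_ptr, const size_t *col_idx,
                           const double *old_val, double *new_val)
    {
        for (size_t p = 0; p < nparts; p++) {      /* one subgraph at a time */
            for (size_t v = part_ptr[p]; v < part_ptr[p + 1]; v++) {
                double acc = 0.0;
                /* with a good partition most neighbors lie in the same
                 * vertex range, so these indexed reads stay within the
                 * subgraph that is already resident in fast memory */
                for (size_t e = row_ptr[v]; e < row_ptr[v + 1]; e++)
                    acc += old_val[col_idx[e]];
                new_val[v] = acc;
            }
        }
    }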
There are many algorithms and software libraries for performing graph partition-
⁹ See Hong Jia-Wei and H.T. Kung, 1981, I/O complexity: The red-blue pebble game, Proceedings of the Thirteenth Annual ACM Symposium on Theory of Computing, Milwaukee, Wis., May 11-13, 1981, pp. 326-333.
¹⁰ Matteo Frigo, Charles E. Leiserson, Harald Prokop, and Sridhar Ramachandran, 1999, Cache-oblivious algorithms, Proceedings of the 40th IEEE Symposium on Foundations of Computer Science, New York, N.Y., October 17-19, 1999, pp. 285-297.