Slide 7 of 16
EXAMPLE: QuantLib's correlated random-number generation for financial models: rearranging loops to reduce stride gave roughly a 20x to 40x speedup.
The working set
- Smaller is almost always better
- Cache misses go undetected by many profiling tools, yet they have a huge impact on performance
- Use unit or small stride in data structures
Mind your memory hierarchy
- Registers, instruction pipeline
- Level 1, 2, ... cache
- Main memory
- Secondary and remote storage (if available)
- Do you have good tools to measure the effect of changes (in CPU cycles vs real time)?
- Can you estimate working set size?
- Do you know anything about target hardware/OS and its interaction with your memory-access patterns?
- Will you have to compromise because you have more than one hardware target, for example?
- Modern languages present a more-or-less uniform memory model, but are there target-specific improvements you can make use of? Is ROMised code and data faster or slower than code and data loaded into RAM, for example?