Slide 7 of 16
Notes:
EXAMPLE: QuantLib's correlated random-number generation for financial models: rearranging loops (reducing the stride) gave roughly a 20x--40x speedup.
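A minimal sketch of the kind of loop rearrangement meant here (this does not reproduce the QuantLib code; it just shows unit-stride versus large-stride traversal of a row-major matrix):

```cpp
#include <vector>
#include <cstddef>

// Sum an n x n matrix stored row-major in a flat vector.
// i-outer walks contiguous memory (stride 1 double);
// j-outer jumps n doubles per access and thrashes the cache.
double sum_unit_stride(const std::vector<double>& m, std::size_t n) {
    double s = 0.0;
    for (std::size_t i = 0; i < n; ++i)        // rows
        for (std::size_t j = 0; j < n; ++j)    // contiguous within a row
            s += m[i * n + j];
    return s;
}

double sum_large_stride(const std::vector<double>& m, std::size_t n) {
    double s = 0.0;
    for (std::size_t j = 0; j < n; ++j)        // columns first
        for (std::size_t i = 0; i < n; ++i)    // stride of n doubles
            s += m[i * n + j];
    return s;
}
```

Both loops compute the same sum; only the memory-access order differs, which is exactly why this class of bug is invisible to correctness tests.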
The working set
Smaller is almost always better
Cache misses go undetected by many tools – but they have a huge impact on performance
Use unit or small stride in data structures
Mind your memory hierarchy
Registers, instruction pipeline
Level 1, 2, ... cache
Main memory
Secondary and remote storage (if available)
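The working-set point can be made concrete with a back-of-envelope estimate. The figures below (a Monte-Carlo-style run of 10,000 paths with 252 daily steps) are hypothetical, not from the QuantLib case above:

```cpp
#include <cstddef>

// Back-of-envelope working-set estimate: `paths` simulated paths,
// each storing `steps` doubles. Compare the result against your
// cache sizes (e.g. 32 KiB L1, 1 MiB L2, tens of MiB L3).
std::size_t working_set_bytes(std::size_t paths, std::size_t steps) {
    return paths * steps * sizeof(double);  // bytes touched per sweep
}
// 10,000 paths x 252 steps x 8 bytes = ~19.2 MiB: far larger than
// L1/L2 on typical hardware, so access order and stride matter.
```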
- Do you have good tools to measure the effect of changes (in CPU cycles vs real time)?
- Can you estimate working set size?
- Do you know anything about target hardware/OS and its interaction with your memory-access patterns?
- Will you have to compromise because you have more than one hardware target, for example?
- Modern languages have a more-or-less uniform memory model---are there target-specific improvements you can make use of? Is ROMised data and code faster or slower than data and code loaded into RAM, for example?
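On the measurement question: a portable wall-clock timer is a reasonable starting point, sketched below with std::chrono. Cycle-accurate counts need platform tools (perf, VTune, rdtsc) and are not shown here:

```cpp
#include <chrono>

// Time a callable and return elapsed wall-clock milliseconds.
// steady_clock is monotonic, so it is safe for interval timing.
template <typename F>
double time_ms(F&& work) {
    auto t0 = std::chrono::steady_clock::now();
    work();
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}
```

Run each variant several times and compare medians; a single wall-clock sample is noisy, which is part of why cycle counters exist.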