Last modified on 06/12/2010 01:02:38 PM

Performance Tuning for LSST DM

So far, nearly everything on this page concerns the C++ layer of our code. We are also developing tools for Python performance analysis.

Performance analysis is easiest under SimpleStageTester at the moment.

Tools

  • valgrind --tool=callgrind python something.py, followed by kcachegrind callgrind.out.<pid> (callgrind appends the process ID to the output file name)
    • Add options --dump-instr=yes and --trace-jump=yes for more detailed analysis
    • Don't blindly trust its timing estimates (which are based on an x86 simulation). Measure real performance before/after changes.
    • Consider cache simulation options in callgrind after you have understood the call structure and have identified key hotspots to study in more detail. Cache optimizations are somewhat platform-dependent and fragile, so it's usually OK to wait to do them until things have settled down.
    • Experienced people in the group are willing/eager to pass on tips on how to use the tool. Ask for help (good candidates: gpdf, ktl, rhl).
  • Use strace (on Linux) to trace system calls. Learn how to use its options. Watch both CPU and real time used. This is relevant both to Python and C++.
  • Consider doing performance analysis on any non-trivial unit tests and other test cases. When using callgrind, see if you can understand how the number of actual calls to functions in your code relates to the amount of "real work" the code is doing.
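
A quick way to confirm a callgrind finding is to time the code directly, before and after a change. The sketch below is one possible helper, assuming a C++11 compiler; the names Timing, timeOnce and timeBestOf are invented for illustration and are not part of any LSST package. It records both wall-clock and CPU time, so that a gap between the two (for example, time spent waiting on I/O) is visible.

```cpp
#include <cassert>
#include <chrono>
#include <ctime>

// Hypothetical helper type: time consumed by one run of a piece of work.
struct Timing {
    double wallSeconds;  // elapsed real time
    double cpuSeconds;   // processor time charged to this process
};

// Run `work` once and report the wall-clock and CPU time it took.
template <typename Func>
Timing timeOnce(Func work) {
    std::clock_t const cpuStart = std::clock();
    std::chrono::steady_clock::time_point const wallStart = std::chrono::steady_clock::now();
    work();
    std::chrono::steady_clock::time_point const wallEnd = std::chrono::steady_clock::now();
    std::clock_t const cpuEnd = std::clock();

    Timing timing;
    timing.wallSeconds = std::chrono::duration<double>(wallEnd - wallStart).count();
    timing.cpuSeconds = static_cast<double>(cpuEnd - cpuStart) / CLOCKS_PER_SEC;
    return timing;
}

// Run `work` several times and keep the run with the smallest wall-clock time,
// which reduces noise from other activity on the machine.
template <typename Func>
Timing timeBestOf(int repeats, Func work) {
    assert(repeats > 0);
    Timing best = timeOnce(work);
    for (int i = 1; i < repeats; ++i) {
        Timing const current = timeOnce(work);
        if (current.wallSeconds < best.wallSeconds) {
            best = current;
        }
    }
    return best;
}
```

Calling timeBestOf(5, someCallable) on the old and the new version of a routine, with the same input, gives a direct comparison to set beside callgrind's estimate.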

Lessons Learned

  • Rule #0: Pick the right algorithm first, before worrying about details of coding efficiency. Learn how to reason about the scaling properties of algorithms ("big-O" notation, e.g. O(N)).
  • Rule #1: Get the code right before you worry about optimising. If your routine doesn't show up in callgrind, you may be done (at least for now; as other code gets faster, yours may become the bottleneck). This does not mean that you should ignore questions of efficiency: it's often no harder to do the right thing the first time around.
  • Follow coding standards for stereotyped actions such as iterating over pixels in an image (#1319): http://dev.lsstcorp.org/doxygen/release/afw/current/imageiterators.html .
  • Avoid conversions between char* and std::string as much as possible; try to use just one or the other, consistently. Use std::string if the length of the string needs to be used frequently. Generally prefer std::string. The use of (const) char* is sometimes still appropriate when the code deals only, or almost only, with quoted string literals, since constants of type std::string are not quite properly supported in C++98.
    • If strings are being used as keys or for some other control function deep within science algorithms, you probably don't want to use just bare strings of any type. Please consult the SAT for advice.
  • Learn the invariants and guarantees of STL algorithms, especially for mutating sequence and sort operations (#1320).
  • STL containers, even if allocated on the stack as autos, tend to do heap allocations (#1332).
  • Don't inline excessively (causes code bloat and dependency problems) but do inline trivial forwarding functions (#1321) unless doing so will produce undesirable dependencies. Consult the SAT for advice about dependencies.
    • The presence of a "throw" may prevent a function from being inlined (the details are compiler-dependent), because exceptions require a stack frame.
  • Exceptions are faster than they were in the 1990s, but they are still vastly slower than conditional branches. Use exceptions for "exceptional conditions", not for things that you know will happen often (#1324). Use them when the alternate path is really different from the basic path and/or when non-trivial cleanup is needed after the failure, e.g., of allocated memory.
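
The point about exceptions versus conditional branches can be illustrated with a lookup of a key that is usually absent. This is a minimal sketch, assuming a C++11 compiler (for std::map::at); Settings and both functions are hypothetical stand-ins built on std::map, not the pex_policy API. The branching variant plays the role that ticket #1324 assigns to Policy::exists: test first, and never throw for the common case.

```cpp
#include <cassert>
#include <map>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for a dictionary of configuration values.
typedef std::map<std::string, int> Settings;

// Costly when keys are usually absent: every miss throws and unwinds the stack.
int getOrDefaultThrowing(Settings const& settings, std::string const& key, int fallback) {
    try {
        return settings.at(key);  // std::map::at throws std::out_of_range on a miss
    } catch (std::out_of_range const&) {
        return fallback;
    }
}

// Cheap on a miss: a single lookup followed by an ordinary conditional branch.
int getOrDefaultBranching(Settings const& settings, std::string const& key, int fallback) {
    Settings::const_iterator const found = settings.find(key);
    return found == settings.end() ? fallback : found->second;
}

// Usage sketch: both variants return the same values; only the cost of a miss differs.
void demonstrateLookup() {
    Settings settings;
    settings["nIterations"] = 5;
    assert(getOrDefaultThrowing(settings, "nIterations", 1) == 5);
    assert(getOrDefaultBranching(settings, "nIterations", 1) == 5);
    assert(getOrDefaultThrowing(settings, "absentKey", 1) == 1);
    assert(getOrDefaultBranching(settings, "absentKey", 1) == 1);
}
```

How much slower the throwing variant is depends on the compiler and platform, so it is worth measuring on a realistic mix of present and absent keys.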

Other Suggestions

  • Use const well. It helps compilers figure out what optimizations they can perform.
  • Understand the caching implications of how you iterate through images and other multidimensional data structures.
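
As an illustration of the two suggestions above, the sketch below sums the pixels of a small row-major image in two ways. SimpleImage is a hypothetical stand-in written for this example (it is not the afw Image class), and a C++11 compiler is assumed. Walking along rows touches memory sequentially, while walking down columns jumps by a full row width at every step, which tends to be much harder on the cache for large images.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical image with pixels stored row by row in one contiguous buffer.
class SimpleImage {
public:
    SimpleImage(std::size_t width, std::size_t height, float value)
        : _width(width), _height(height), _pixels(width * height, value) {}

    std::size_t getWidth() const { return _width; }
    std::size_t getHeight() const { return _height; }

    float const* rowBegin(std::size_t y) const {
        assert(y < _height);
        return _pixels.data() + y * _width;
    }
    float const* rowEnd(std::size_t y) const { return rowBegin(y) + _width; }

private:
    std::size_t _width;
    std::size_t _height;
    std::vector<float> _pixels;
};

// Cache-friendly: the inner loop moves through adjacent memory, and the
// end-of-row pointer is computed once per row rather than once per pixel.
double sumByRows(SimpleImage const& image) {
    double sum = 0.0;
    for (std::size_t y = 0; y != image.getHeight(); ++y) {
        float const* const end = image.rowEnd(y);
        for (float const* ptr = image.rowBegin(y); ptr != end; ++ptr) {
            sum += *ptr;
        }
    }
    return sum;
}

// Same result, but the inner loop strides by a whole row between accesses.
double sumByColumns(SimpleImage const& image) {
    double sum = 0.0;
    for (std::size_t x = 0; x != image.getWidth(); ++x) {
        for (std::size_t y = 0; y != image.getHeight(); ++y) {
            sum += image.rowBegin(y)[x];
        }
    }
    return sum;
}
```

Both functions take the image by const reference, so the compiler and the reader can see that the pixels are not modified. Timing the two on an image larger than the cache is a simple way to see the effect on a given machine.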

Partial List of Changes Made

Note that some of these are in code written by experienced programmers (e.g. r*l); it's the job of profilers to tell us where we need to put more effort.

  • Tickets: search for keyword "performance"
  • Revisions to the LSST wrapper for CFITSIO to reduce key parsing and the number of lseeks and I/Os (#1315) produced speedups of 30% in PT1 production and 70% in ISR alone.
  • Removal of slow tracing in LsstImpl_DC3, in changeset [15299], produced speedups of 10% in PT1 production and 70% in narrower test cases.

Performance-related tickets:

Ticket | Component | Summary | Status
#1315 | afw | afw fits I/O is very slow | closed
#1319 | afw | Unnecessary repeated calls, per-pixel, to row_end() | closed
#1320 | afw | Inefficient use of nth_element in afw::math::Statistics::_percentile | closed
#1321 | afw | Suggested inlines of call-throughs to boost::gil methods in ImageBase | closed
#1322 | afw | Commonly used trivial getters in MaskedImage (getImage, getMask, getVariance) fail to inline | closed
#1323 | afw | Constant regex definitions should be "static const" to avoid reparsing | closed
#1324 | pex_policy | Tests for the existence of mostly-absent keys in policy dictionaries should use Policy::exists | closed
#1332 | afw | Avoid using heap on every call to PolynomialFunction2::operator() | closed
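
Ticket #1320 is an example of why the guarantees of STL algorithms matter. The sketch below is not the afw code; percentile is a hypothetical function written for illustration, assuming a C++11 compiler and linear interpolation between neighbouring order statistics. It relies on the documented postcondition of std::nth_element: afterwards, no element following the nth position is smaller than the element at that position. The next order statistic is therefore simply the minimum of the tail, and neither a full sort nor a second nth_element call is needed.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper: the value at the given fraction (0 to 1) of the sorted data.
// The data are taken by value because nth_element reorders its input.
double percentile(std::vector<double> values, double fraction) {
    assert(!values.empty());
    assert(fraction >= 0.0 && fraction <= 1.0);

    double const position = fraction * static_cast<double>(values.size() - 1);
    std::size_t const lower = static_cast<std::size_t>(std::floor(position));
    double const weight = position - static_cast<double>(lower);

    // O(N) on average: places the lower-th smallest element at index `lower`.
    std::vector<double>::iterator const nth = values.begin() + static_cast<std::ptrdiff_t>(lower);
    std::nth_element(values.begin(), nth, values.end());
    double const lowerValue = *nth;
    if (weight == 0.0 || lower + 1 == values.size()) {
        return lowerValue;
    }

    // Everything after nth is >= *nth, so the next order statistic is the
    // smallest element of the tail; a single linear scan finds it.
    double const upperValue = *std::min_element(nth + 1, values.end());
    return lowerValue + weight * (upperValue - lowerValue);
}
```

For the data {1, 2, 3, 4} and a fraction of 0.5, the function interpolates halfway between the second and third values.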