wiki:DC3APostMortemMeeting-Performance
Last modified on 08/04/2009 10:08:46 PM

[Return to the DC3a post-mortem / DC3b scoping and planning meeting, 2009.5.18-20]

Parallel session on Performance, Scalability, Reliability

Wednesday 2009.5.20, 11:15, Room 505a; convener: Gregory Dubois-Felsmann

  • Performance testing and further development within DC3a / alert production
    • Requirements for performance analysis tools
      • Nagios/Ganglia-style cluster monitoring during production
      • Logging of CPU and memory data as part of the self-monitoring of the pipelines
      • Cache and ILP utilization analysis / usability of tools such as callgrind
      • Analysis of time lost to sequence points, I/O
      • strace / DTrace
    • Responding to results of performance analysis on DC3a - how much effort can we (afford to) invest?
  • Quantitative performance modeling
    • Principally for the alert production latency analysis...
    • Study tradeoffs in increasing parallelism
    • Evaluate whether very fast single cores (non-commodity, e.g., Power6) could have a role
  • Scaling tests toward the final LSST configuration
    • Test the Nehalem version of hyperthreading (reanimated after its death in the P4 era)
    • Test 16 (or more) core boxes as soon as they are available
    • Try to identify when memory bandwidth and external interface bandwidth might become limitations
  • Advanced implementation and advanced architectures
    • Motivations:
      • Much recent effort in CPU design going into improvements in ILP and short-vector units (e.g., SSE*): are we able to take advantage of these? Will compilers just do this for us?
      • Try turning on chip-specific optimization in compilation; try the Intel compiler
      • Need access to a cluster of late-model CPUs (e.g., SSE4, Nehalem HT)
      • Simple homogeneous computations on large amounts of data: GPUs, Cell, ...
      • Large-memory problems (worst-case crosstalk correction, all-filter Multifit, extended object fitting): shared-memory systems
      • Optimizing use of Blue Waters
    • Organizing R&D - what will be part of DC3b?
    • Modeling the benefits of possible areas of application (remember Amdahl's Law)
  • Performance analysis education
    • Code reviews
    • Formal or informal training
  • Scalable database architecture
  • Fault tolerance
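
The Amdahl's Law reminder in the agenda above can be made concrete with a small model: if only a fraction p of total runtime is in the accelerated stage, the overall speedup is bounded no matter how fast that stage becomes. A minimal sketch (the 20% fraction and 50x factor below are illustrative numbers, not DC3a measurements):

```python
def amdahl_speedup(parallel_fraction, factor):
    """Overall speedup when only `parallel_fraction` of the runtime
    is accelerated by `factor` (Amdahl's Law)."""
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / factor)

# Even a 50x GPU speedup of a stage that is 20% of total time
# yields only about 1.24x overall:
print(round(amdahl_speedup(0.20, 50.0), 2))  # 1.24
```

This is why modeling the benefits of each candidate area of application matters before investing effort in it.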

Notes from the meeting

Some level of performance analysis and mitigation is needed in time for PDR, to help us make the case that we have a path toward meeting the design requirements for throughput and latency.

While DC3a per se will end at the end of June, it is understood that work on the performance of the Alert Production can/will continue throughout DC3b, in the context of the cloning of that image analysis code within the Data Release Production.

Cluster-level monitoring. 1. For the LSST cluster, relevant people have Nagios experience and can try to bring it up on the cluster. It is unlikely, though possible, that this will be available before the end of DC3a. 2. For the "Abe" cluster, on which we will get time for larger-scale tests, it is believed that Ganglia may already be installed. NCSA will check on this.

Profiling. No good canned solution for detailed profiling of the C++ applications code within the context of the MPI execution framework. Not clear that valgrind/callgrind can be used at all in that context. Probably need standalone tests in order to support detailed profiling.
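
For the standalone tests, a hedged sketch of what a callgrind run could look like (the binary and output file names are placeholders, not actual LSST targets; callgrind cannot easily wrap the MPI-launched pipeline, so a single-process harness is profiled instead):

```shell
# Profile a standalone single-process test harness under callgrind.
valgrind --tool=callgrind --callgrind-out-file=stage.callgrind ./standaloneStageTest

# Summarize the hottest functions from the collected data.
callgrind_annotate stage.callgrind | head -40
```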

Need to define what the "15% of DR1" goal for DC3b means in the context of scaling tests. gpdf to solicit recommendations from the various experts for what this means in their respective areas of expertise.

Need plan for acquisition of access to resources for leading-edge commodity architectures (high core density, late-model CPUs with hyperthreading, etc.). Prefer evaluation systems and borrowed resources. Need to understand when this is not adequate and when we might have to request project funds to support an evaluation system.

Get Intel compiler licensed for LSST use. Priority is to do this at NCSA. It may already be site-licensed. The non-commercial license for the compiler probably does not apply.

SSE (and ILP and related issues): DC3b plan is to build with Intel compiler, run on late-generation CPUs with all advanced code generation switches on, evaluate performance by stage, and spot-check use of SSE instructions. Prefer not to invest in detailed tuning / code improvements to better use SSE/ILP/etc. at this time (assumption is that this is not the way we are going to get large factors).
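
A hedged sketch of what the build-and-spot-check step could look like (the source file name is a placeholder; the flags are standard Intel compiler options of that era):

```shell
# Build one stage with host-specific code generation and a vectorization report.
icpc -O3 -xHost -vec-report2 -c convolveStage.cc

# Spot-check that packed SSE arithmetic instructions were actually emitted.
objdump -d convolveStage.o | grep -cE 'mulp[sd]|addp[sd]'
```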

GPUs: high priority project to get a basic demonstration, and to develop relationships with external experts and experience within the team. In order to get basic results by PDR, need to keep scope limited. Plan to use image convolution (from alert production) as the initial focus (even though it's known that it doesn't amount to a large enough fraction of the total time to make a major difference on its own). Need to produce a stripped-down test case that can readily be shared outside.
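
As a starting point for the shared test case, a hedged sketch of a stripped-down convolution benchmark, written here in plain Python for clarity (the production code is C++, and the function and sizes are illustrative, not taken from the LSST stack); on a GPU, each output pixel would map naturally to one thread:

```python
def convolve2d(image, kernel):
    """Direct 2D convolution, "valid" mode: the output shrinks by
    (kernel size - 1) in each dimension. Inputs are lists of lists."""
    ih, iw = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    oh, ow = ih - kh + 1, iw - kw + 1
    out = [[0.0] * ow for _ in range(oh)]
    for y in range(oh):          # each (y, x) is independent work:
        for x in range(ow):      # the natural unit of GPU parallelism
            s = 0.0
            for ky in range(kh):
                for kx in range(kw):
                    s += image[y + ky][x + kx] * kernel[ky][kx]
            out[y][x] = s
    return out

# 3x3 box filter over a 4x4 image of ones: every output pixel sums 9 ones.
img = [[1.0] * 4 for _ in range(4)]
box = [[1.0] * 3 for _ in range(3)]
print(convolve2d(img, box))  # [[9.0, 9.0], [9.0, 9.0]]
```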

Database scalability: don't need it just for the data volumes in DC3b. PDR: point is to demonstrate scalability for its own sake ("establish technical feasibility"). DC3b: reduce risk, reduce margins. Need R&D plan from JB/KTL for both PDR and DC3b.


Decisions

  • Evaluating the usefulness of GPUs, and developing some team expertise in this area, is a high priority for DC3b. It's not expected that within the time frame of DC3b we would actually reach the full promise of this technology.
  • Detailed ILP/SSE-level coding efforts do not offer a chance for a breakthrough in performance and so don't provide enough leverage for DC3b. (They will definitely be appropriate during construction, though.) A quick evaluation in this area should be done simply by trying the Intel compiler at a high optimization level specific to a current-model x64 platform (e.g., Core i7/Nehalem).
  • Working with other alternative architectures is out of scope for DC3b. (Cell, large-memory SMP, etc.)
  • Detailed understanding of the performance of, and the opportunities for performance improvement in, alert production is a goal for DC3b. First priority is to assess the consequences of the current timings for the design. This is also needed for PDR, and a meaningful effort at performance improvement by PDR would be highly advisable.
  • Need to demonstrate database scalability within the DC3b time frame. Unclear whether this is a technology demonstration or whether it needs to be incorporated into the main body of DC3b (e.g., in the science validation of DC3b).
  • A basic demonstration of the compute-node failover aspect of the fault-tolerance design is a goal.

Issues

  • What does "15% of DR1" mean? This question has to be answered separately for many of the components and goals of DC3b.
  • Do we need a scalable database in order to meet DC3b goals other than simply demonstrating database scalability?
  • How can we get a) advanced test hardware (for GPUs, SSE4, etc.), and b) large-scale clusters for scalability tests?

Actions

(near-term)

  • Define the specific interpretation of the "15% of DR1" goal for DC3b. How does it apply to rates? Volumes? Catalog sizes? Data transfer rates? Storage capacity and file management? Gregory will circulate this question among the various team leaders.
  • Identify sites at which the Intel C++ compiler could be made available.
  • Need R&D plan from JB/KTL for database scalability, for both DC3b and PDR.

Tasks

(longer-term, subject to final DC3b scoping decisions)

  • Develop an updated timing model for alert production based on the results of DC3a.
    • Comment from Ray: I'm not sure what this means; that is, what are the motivation, the goal, and the requirements of this model?
  • Develop design alternatives for meeting the throughput and latency requirements of alert production assuming the observed performance in DC3a is not improved. [Dubois-Felsmann]
    • Assess the cost implications of these alternatives. [Freemon]
    • Develop a target for improved performance [Dubois-Felsmann]
      • this might allow falling back to a simpler design and significantly less hardware.
  • Bring up Nagios or an equivalent monitoring package on the LSST cluster at NCSA. [Baker]
    • Define the data to be collected and its sampling requirements
    • Ensure that an appropriate set of metrics for CPU and memory utilization is included.
  • Identify existing monitoring tools on Abe cluster and ensure that they provide comparable insight into CPU utilization. [Daues]
  • Evaluate performance of our software stack built with the Intel Compiler [Plante]
    • Build software stack using the Intel compiler
    • Execute comparison runs using stacks built with the Intel and GCC compilers
    • Incorporate these results into the timing model.
  • Devise and document methods for profiling LSST software using callgrind, gprof, or equivalent.
    • It's understood that this will have negative consequences for cross-slice synchronization.
    • Either intrusive or non-intrusive tools may be acceptable to meet this requirement.
  • Apply profiling to alert production [Daues]
    • Identify opportunities for optimization
    • Extend application to other pipelines if possible
  • Review the algorithm design for alert production and identify possible less-CPU-intensive alternatives that still meet the science requirements. [Dubois-Felsmann]
  • Perform scaling tests of alert production at the specified "15% of DR1" level (after it's been determined what that is). [Plante]
  • Perform scaling tests of data release production at the specified "15% of DR1" level (after it's been determined what that is). [Plante]
  • Perform scaling tests of middleware/infrastructure/networking at the specified "15% of DR1" level (after it's been determined what that is). [Daues, Pietrowicz]
  • Perform tests of scalable database architecture at the "15% of DR1" level. Requires resolution of the issue of how coupled this is to DC3b's other goals. [Becla]
  • Run tests on leading-edge hardware to allow refinement of compute capacity estimates. Test effect of aggressive platform-specific optimization switches. [Plante]
  • Acquire (by borrowing or purchasing) evaluation units of leading-edge commodity hardware for evaluation of performance on new CPUs, memory architectures, etc. [Freemon]
  • Acquire a test platform with recent-generation GPUs.
    • Comment from Ray: Note that the Lincoln cluster at NCSA features GPUs
  • Use image convolution (from alert production) as an initial application of GPUs. Starting point is a stripped-down test case that can be shared with outside experts for advice. [Lupton]
  • Carry out design and code reviews oriented toward performance improvement and toward training developers in a performance orientation. [Dubois-Felsmann]
  • Demonstrate fault-tolerance features at a basic level. [Daues]