Last modified on 09/23/2009 11:11:07 AM

DC3b GPU Plan

One of the agreed-upon goals for DC3b is a prototype-level port of at least one key stage-level algorithm to a GPU platform.

The motivations are twofold: (a) on the time scale of LSST, the use of GPU technology will be nearly unavoidable on cost and scaling grounds; and (b) in DC3a the shortfalls in pipeline latency and throughput were large enough that, unmitigated, they would require substantial redesign and expansion of the computing architecture to provide the parallelism needed to meet SRD-based performance goals.

The expectation is that, on the time scale of DC3b, this work will not produce an implementation that can be used as part of the DC3b large-scale production. The goal for DC3b is simply to port one or two key algorithms and compare their performance and data quality against the mainstream DC3 LSST version of the algorithm(s).
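As a rough illustration of the kind of comparison meant here, the sketch below times two implementations of the same stage on identical input and reports the largest per-pixel difference between their outputs. Everything in it is an assumption made for illustration: the function name `compare_implementations`, the choice of best-of-N wall-clock timing, and the use of maximum absolute difference as the data quality measure are not taken from the DC3b plan, and the two callables are hypothetical stand-ins for the mainstream and ported versions.

```python
import time


def compare_implementations(reference, candidate, image, repeats=3):
    """Time two implementations on the same input and compare their outputs.

    reference, candidate: callables that take a 2-D list of floats and return
    a 2-D list of floats of the same shape (hypothetical stand-ins for the
    mainstream and ported versions of a stage).
    repeats: number of timed runs per implementation (must be >= 1).
    """
    def best_time(func):
        # Keep the fastest of several runs to reduce timing noise.
        best = float("inf")
        result = None
        for _ in range(repeats):
            start = time.perf_counter()
            result = func(image)
            best = min(best, time.perf_counter() - start)
        return best, result

    ref_seconds, ref_out = best_time(reference)
    cand_seconds, cand_out = best_time(candidate)

    # Largest per-pixel disagreement between the two outputs.
    max_abs_diff = max(
        abs(a - b)
        for ref_row, cand_row in zip(ref_out, cand_out)
        for a, b in zip(ref_row, cand_row)
    )
    return {
        "reference_seconds": ref_seconds,
        "candidate_seconds": cand_seconds,
        "max_abs_diff": max_abs_diff,
    }
```

A real evaluation would presumably use science-driven metrics rather than a single pixel-difference number; this only shows the shape of a side-by-side run.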

The initial decision has been to follow emerging industry standards and use the OpenCL programming environment to build the code. At the moment, the only fully functional instance of that environment is in Apple's Snow Leopard (Mac OS X version 10.6), running on a modern portable or desktop Mac with a supported GPU (roughly, Macs from the last year and a half or so, plus some older models).

  • [comment by TSA] This runs counter to the advice we got from Mario Juric. He strongly suggested that OpenCL is not sufficiently developed for us to use, and that we should use CUDA instead. He further suggested that an eventual move from CUDA to OpenCL would be an easy one. If we take his advice, which seems sensible to me, this removes the platform limitation: CUDA supports Linux, Windows, and Mac. Mario's detailed comments are here.

The project steps required are as follows:

  • Identify a simple but important algorithm for the initial port (Lupton; done): convolution on MaskedImage
  • Identify the hardware platform for development and execution (Dubois-Felsmann, done): any recent Apple Mac running Snow Leopard
  • Ensure that the platform is fully supported for the LSST stack (Middleware group): the Mac has not reached this level yet, even for the previous version of the OS (10.5 Leopard), though it is close. It has not yet been attempted for Snow Leopard.
    • Ensure that the Middleware group has access to a Snow Leopard machine for stack porting purposes
  • Identify a developer to learn OpenCL GPU programming and perform the initial algorithm port (Dodd, Lupton to find someone)
  • Perform the initial port (? developer ?)
  • Evaluate the time required to perform the port and decide whether a second algorithm can be ported within the scope of DC3b (Lupton and developer)
  • Evaluate the computational and scientific performance of the ported algorithm (Lupton and developer)
    • Decide on the scope of data required to be run through the algorithm to perform a reasonable level of scientific validation (Lupton and Axelrod)
    • Identify the dataset (Lupton); it is assumed that this will be data that was also used in DC3b production
    • Run the ported algorithm on the dataset (developer); possibly also re-run just the equivalent mainstream algorithm stage on the same dataset, with additional performance instrumentation
    • Collect and report performance metrics (developer)
    • Assess scientific data quality and report (Lupton, Axelrod, may be delegated)
  • If time permits, perform the second port and the associated computational and scientific performance evaluation (developer, et al.)
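To make the chosen algorithm concrete, here is a minimal pure-Python sketch of convolving a masked image with a small kernel. It is not the LSST stack implementation. It assumes that a masked image consists of an image plane, an integer bit-mask plane, and a variance plane; that the variance is propagated with the squared kernel; and that the output mask is the bitwise OR of the mask pixels under the kernel footprint. The function name `convolve_masked_image` and the list-of-rows representation are hypothetical. Only the interior region, where the kernel fits entirely inside the input, is computed.

```python
def convolve_masked_image(image, mask, variance, kernel):
    """Convolve the three planes of a masked image with a 2-D kernel.

    image, variance: 2-D lists of floats; mask: 2-D list of ints (bit flags).
    All three planes must share one shape. kernel: 2-D list of floats no
    larger than the planes. Returns (image, mask, variance) for the interior
    region only, so each output plane is smaller than the input.
    """
    rows, cols = len(image), len(image[0])
    krows, kcols = len(kernel), len(kernel[0])
    out_rows, out_cols = rows - krows + 1, cols - kcols + 1

    out_image = [[0.0] * out_cols for _ in range(out_rows)]
    out_mask = [[0] * out_cols for _ in range(out_rows)]
    out_var = [[0.0] * out_cols for _ in range(out_rows)]

    # Every output pixel depends only on the inputs, never on other outputs,
    # so the body of these two loops is what a GPU work-item would execute.
    for r in range(out_rows):
        for c in range(out_cols):
            img_sum, var_sum, mask_bits = 0.0, 0.0, 0
            for i in range(krows):
                for j in range(kcols):
                    # Index the kernel flipped in both axes (true convolution).
                    k = kernel[krows - 1 - i][kcols - 1 - j]
                    img_sum += k * image[r + i][c + j]
                    # Assumption: variance propagates with the squared kernel.
                    var_sum += k * k * variance[r + i][c + j]
                    # Assumption: mask bits are OR-ed over the whole footprint.
                    mask_bits |= mask[r + i][c + j]
            out_image[r][c] = img_sum
            out_mask[r][c] = mask_bits
            out_var[r][c] = var_sum
    return out_image, out_mask, out_var
```

Because each output pixel is computed independently, the two outer loops map naturally onto a grid of GPU threads in either OpenCL or CUDA, which is part of what makes convolution a convenient first port.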

Step durations remain to be assigned.

(Discussion of the possible connection with the Image Simulation collaboration between the University of Washington and the University of Split to perform a GPU port of the ImSim code for production purposes will follow.)

(Additional detail and discussion to follow.)

Discussion and updates