
Design Document for Gaussian Process

Overview

GaussianProcess will be a class within lsst.afw.math. It will operate by reading in N D-dimensional data points and the corresponding values of some function f, and returning the values of f interpolated at unknown query points. GaussianProcess will carry a template parameter controlling the datatype (usually float or double) used for both the coordinate data and the function values. Users may want to choose float for exceptionally large data sets to conserve memory.
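For orientation, a schematic of the class along these lines might look like the sketch below; this is an outline with assumed details (the member arrays are described under "Storing the Data"), not the actual header.

    #include "ndarray.h"  // LSST ndarray, used for all array arguments

    namespace lsst { namespace afw { namespace math {

    // T sets the numeric type for both the coordinates and the function
    // values: double ordinarily, float to conserve memory on very large
    // data sets.
    template <typename T>
    class GaussianProcess {
    public:
        // constructors and interpolation methods are described in the
        // sections below
    private:
        ndarray::Array<T, 2, 2> _data;      // N x D data points (deep copy)
        ndarray::Array<T, 1, 1> _function;  // N function values (deep copy)
    };

    }}}  // namespace lsst::afw::math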

The Theoretical Algorithm

Following equation (2.19) of Rasmussen and Williams (2006; http://gaussianprocess.org/gpml/), GaussianProcess returns f_q (the value of the function at the query point) as

  • f_q=f_av + Cov_qi Cov^-1_ij (f_j - f_av)

where f_av is the algebraic mean of all of the f_i associated with the data points and Cov_qi and Cov_ij are covariance matrices associating the query point with the data points and the data points with each other, respectively. Repeated indices are summed over. Cov_qi and Cov_ij take the form

  • Cov_ij= I_ij * lambda + C(d_i,d_j)

where C(d_i,d_j) is a user-specified function (the "covariogram") describing the covariance relation between f at the data points d_i and d_j and lambda is a "hyper parameter" which also needs to be user-specified (or somehow optimized using the data).

The variance Var_q (the square of the uncertainty in f_q) is

  • Var_q = K * [C(d_q,d_q) + lambda - Cov_qi Cov^-1_ij Cov_qj]

K is the "Kriging parameter", another hyper parameter that must be user-specified.

Note that the hyper parameter lambda is meant to represent the intrinsic variation or uncertainty in the values of f_i. At the moment, the code only permits a single value of lambda for all data points. It would be very straightforward to make lambda an array so that each data point can have its own characteristic uncertainty in f. I have never actually considered what effect such heteroskedasticity would have on the interpolation.
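To make the above concrete, here is a minimal, self-contained sketch of the two interpolation equations using Eigen. The names and signatures are illustrative only, not the GaussianProcess API, and a squared exponential covariogram is assumed.

    #include <Eigen/Dense>
    #include <cmath>

    // Covariance between two points; ell2 is the hyper parameter l^2.
    double covariogram(Eigen::VectorXd const &a, Eigen::VectorXd const &b,
                       double ell2) {
        return std::exp(-0.5 * (a - b).squaredNorm() / ell2);
    }

    // data: N x D matrix (one row per data point); f: the N function values.
    // Returns f_q and writes Var_q through *varq.  With the M-nearest-neighbor
    // option described below, n here would shrink from N to M.
    double interpolateOne(Eigen::MatrixXd const &data, Eigen::VectorXd const &f,
                          Eigen::VectorXd const &query,
                          double ell2, double lambda, double kriging,
                          double *varq) {
        int const n = data.rows();
        double const fav = f.mean();

        Eigen::MatrixXd cov(n, n);   // Cov_ij = I_ij*lambda + C(d_i, d_j)
        Eigen::VectorXd covq(n);     // Cov_qi = C(d_q, d_i)
        for (int i = 0; i < n; ++i) {
            covq(i) = covariogram(query, data.row(i).transpose(), ell2);
            for (int j = 0; j < n; ++j) {
                cov(i, j) = covariogram(data.row(i).transpose(),
                                        data.row(j).transpose(), ell2);
            }
            cov(i, i) += lambda;
        }

        // Solve rather than invert (LDLT, as recommended in the review
        // comments below).  For many query points, alpha can be computed
        // once and reused; that is the saving batchInterpolate exploits.
        Eigen::LDLT<Eigen::MatrixXd> ldlt(cov);
        Eigen::VectorXd alpha =
            ldlt.solve(f - Eigen::VectorXd::Constant(n, fav));

        double const fq = fav + covq.dot(alpha);
        if (varq) {
            // Var_q = K * [C(d_q,d_q) + lambda - Cov_qi Cov^-1_ij Cov_qj]
            *varq = kriging * (covariogram(query, query, ell2) + lambda
                               - covq.dot(ldlt.solve(covq)));
        }
        return fq;
    }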

Storing the Data

GaussianProcess will contain a two-dimensional member array _data and a one-dimensional member array _function. These will contain deep copies of the input data points and function values, respectively. _data will be arranged so that the ith row corresponds to the ith data point. The columns of _data will correspond to the dimensions of the data point space (i.e. there will be two columns for a traditional image).

While the matrix inversions in the algorithm can be replaced by Eigen's linear algebraic "solve" functions, the multiple levels of matrix multiplication can still slow the interpolation down considerably. Therefore, GaussianProcess will allow the user to request that interpolation be done using only the M nearest neighbors of the query point. To find those nearest neighbors rapidly, GaussianProcess organizes the data points into a KD tree using the KdTree class, also implemented in GaussianProcess.cc. The member _data of GaussianProcess will point to the member data of KdTree, where the actual data will be stored.

There will be two constructors for GaussianProcess. One constructor will allow the user to supply 1-dimensional ndarrays of length D containing the minimum and maximum allowed values in each dimension of coordinate space. If this constructor is invoked, each coordinate will be rescaled by the span between the provided maximum and minimum in its dimension, so that the provided range maps onto [0, 1], i.e. each input coordinate value is transformed to

  • x -> (x-x_min)/(x_max-x_min).

This will alter the distances between data points as reckoned by KdTree and the GaussianProcess covariogram. Users can invoke this feature to compensate for cases in which one dimension has a much smaller span than another, even though both dimensions exert equal influence on the function being interpolated. The second constructor does not require the input of minimum and maximum values and leaves the distances between data points as they are input.
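A minimal sketch of that rescaling (Eigen stands in for ndarray here; names are illustrative):

    #include <Eigen/Dense>

    // Map each column of data so that the user-supplied [min, max] interval
    // in that dimension becomes [0, 1].
    void rescale(Eigen::MatrixXd &data,            // N x D data points
                 Eigen::VectorXd const &minVal,    // length-D minima
                 Eigen::VectorXd const &maxVal) {  // length-D maxima
        for (int d = 0; d < data.cols(); ++d) {
            data.col(d) = ((data.col(d).array() - minVal(d)) /
                           (maxVal(d) - minVal(d))).matrix();
        }
    }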

Interpolating the Function

There are two main ways that a user can interpolate f using GaussianProcess.

  • The member function interpolate will read in a single query point (stored in a 1-dimensional ndarray) and a number of nearest neighbors, and will use that many neighbors to calculate both f_q and Var_q.
  • The member function batchInterpolate will read in a list of many query points and use all of the data points to calculate f_q and Var_q at each of them. This routine gains speed from the fact that Cov^-1_ij (f_j - f_av) need only be calculated once for all of the query points. Unfortunately, this simplification does not apply to the calculation of Var_q, so there is also a version of batchInterpolate that does not calculate Var_q, for cases where Var_q is unnecessary and speed is of the essence.

Note: interpolate generally takes as much time to calculate Var_q as it does to calculate f_q. If a version of interpolate that does not return Var_q would be useful, that can be done.
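Hypothetical call patterns for the two paths (the real signatures are in GaussianProcess.h; the argument types and order shown here are guesses):

    // Assume gp is a GaussianProcess<double> built from the data, and that
    // query/queries, fOut/varOut are ndarrays of the appropriate shapes.
    double variance;
    double fq = gp.interpolate(query, 10, &variance);  // f_q from 10 neighbors

    // All data points are used for every query; the overload without the
    // variance array skips the per-query Var_q solve entirely.
    gp.batchInterpolate(queries, fOut, varOut);
    gp.batchInterpolate(queries, fOut);                // no Var_q, fastest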

I ran a rough test of the algorithm using data supplied to me by Yusra. The data consisted of 189 points on a two-dimensional image and a scalar function distributed across them. The test consisted of interpolating the function at 1,000,000 points not in the original data.

Interpolating the data points one at a time using 10 nearest neighbors (calculating the variance at each) took 23 seconds. Note: in this case, calculating the variance for all 1,000,000 points added only about one second. Most of the time was spent finding the nearest neighbor points and iterating over matrix indices.

Interpolating the data points one at a time using 50 nearest neighbors took 190 seconds. Calculating the variance added 11 seconds relative to not calculating the variance.

Interpolating using batchInterpolate without calculating the variance took 100 seconds.

Interpolating using batchInterpolate with the variance took 184 seconds.

Adding to the Data

In the event that new information is learned about the function, the member function addPoint can add a new data point to those stored in GaussianProcess and KdTree.

Specifying the Covariogram

There is a great degree of freedom in specifying the functional form of the covariogram C(d_i,d_j). Therefore, GaussianProcess will include a function pointer _covariogram, which can be redirected to any one of the covariogram functions supplied in the code. GaussianProcess will also include an array _hyperParameters, which will be allocated to accommodate however many hyper parameters are associated with the chosen covariogram. Initially, I will write code to support the decaying (squared) exponential covariogram

  • C(d_i,d_j)=exp[-0.5 * ||d_i-d_j||^2/l^2]

with l^2 a hyper parameter and (see Rasmussen and Williams equation 4.29)

  • C(d_i,d_j) = (2/pi) * arcsin{ 2 (1,d_i)^T S (1,d_j) / sqrt[ (1 + 2 (1,d_i)^T S (1,d_i)) * (1 + 2 (1,d_j)^T S (1,d_j)) ] }

with S a diagonal matrix containing two hyper parameters.
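For concreteness, illustrative C++ versions of these two covariograms are sketched below. The names, signatures, and the assumed layout of S (bias hyper parameter s0 plus a single input hyper parameter s1 over the augmented vectors (1,d)) are my assumptions, not the actual code's.

    #include <Eigen/Dense>
    #include <cmath>

    // Squared exponential: hyper(0) = l^2.
    double expCovariogram(Eigen::VectorXd const &a, Eigen::VectorXd const &b,
                          Eigen::VectorXd const &hyper) {
        return std::exp(-0.5 * (a - b).squaredNorm() / hyper(0));
    }

    // Neural network covariogram (Rasmussen & Williams eq. 4.29), assuming
    // the two hyper parameters enter as (1,a)^T S (1,b) = s0 + s1 * a.b.
    double nnCovariogram(Eigen::VectorXd const &a, Eigen::VectorXd const &b,
                         Eigen::VectorXd const &hyper) {
        auto sdot = [&hyper](Eigen::VectorXd const &u,
                             Eigen::VectorXd const &v) {
            return hyper(0) + hyper(1) * u.dot(v);
        };
        double const num = 2.0 * sdot(a, b);
        double const den = std::sqrt((1.0 + 2.0 * sdot(a, a)) *
                                     (1.0 + 2.0 * sdot(b, b)));
        return (2.0 / M_PI) * std::asin(num / den);
    }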

The user will be able to toggle between these covariograms using the member function setCovariogramType.

These covariogram functions will live in the namespace lsst::afw::math::detail::GaussianProcess. I will also use that namespace for a merge sort algorithm necessary for the nearest neighbor search within KdTree.

Setting Hyper Parameters

In many cases, hyper parameters can be set using knowledge of f gleaned from the data itself. GaussianProcess will facilitate this practice with the member function selfInterpolate, which interpolates the value of f at one of the data points using a specified number of nearest neighbors, excluding the point itself. (If you want to test GaussianProcess's performance on the ith data point with k nearest neighbors, selfInterpolate finds the k+1 nearest neighbors and ignores the 0th, which is the ith data point itself.) The user can experiment with different values of the hyper parameters and choose the combination for which selfInterpolate returns the most accurate values of f at the known points.
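For example, a tuning loop along these lines could be built on top of selfInterpolate. This is a hedged sketch: selfInterpolate and setHyperParameters exist in the design, but the signatures used here, and the candidate grid of l^2 values, are guesses.

    #include <cmath>
    #include <limits>
    #include <vector>

    // Pick the l^2 that minimizes the leave-one-out squared error over the
    // known data points f; gp is the GaussianProcess under discussion.
    template <typename T>
    double tuneEll2(GaussianProcess<T> &gp, std::vector<T> const &f,
                    int nNeighbors) {
        double best = 0.0;
        double bestErr = std::numeric_limits<double>::max();
        for (double ell2 : {0.01, 0.1, 1.0, 10.0, 100.0}) {
            gp.setHyperParameters(ell2);  // guessed signature
            double err = 0.0;
            for (std::size_t i = 0; i < f.size(); ++i) {
                T variance;
                // Interpolate f at data point i from its nNeighbors nearest
                // neighbors, excluding point i itself.
                T fi = gp.selfInterpolate(i, nNeighbors, &variance);
                err += std::pow(static_cast<double>(fi - f[i]), 2);
            }
            if (err < bestErr) {
                bestErr = err;
                best = ell2;
            }
        }
        gp.setHyperParameters(best);
        return best;
    }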

API

The class and method definitions for GaussianProcess and KdTree can be found at

Additionally, this class depends on the linear algebra package Eigen and the mergeSort method from

Comments

Comment by price on Tue 19 Mar 2013 09:50:43 AM MST

I would prefer not to add yet another kd-tree implementation to the stack (we have two or three already, specialised for different uses), unless this is a general, public implementation that can be used for multiple purposes.

Could you provide your proposed class definition, with the class inheriting from an abstract base class that defines the API?

What is the origin of the code used here? Is it compatible with GPL?

Am I to understand that interpolating a background takes 100 sec for a mere 1k square image?

Comment by danielsf on Wed 20 Mar 2013 11:31:52 AM MST

The kd-tree implementation for GaussianProcess is general. It does not currently have the ability to prune, but that can be added.

The source code for GaussianProcess is in ticket 2604. The class definition is presently in $AFW_DIR/include/lsst/afw/math/GaussianProcess.h. GaussianProcess is currently completely stand-alone; it does not inherit from any other class.

I'm not sure what GPL is. This code has all been written from scratch.

Regarding the timing: the 100 seconds included time spent optimizing the hyper parameters using selfInterpolate with 188 nearest neighbors. Since each call to selfInterpolate requires a neighbor search and a matrix inversion, this is an expensive process. If I neglect the hyper parameter optimization, it only took 12 seconds to call batchInterpolate on 1,000,000 points without the variance calculation and 95 seconds with it. Hyper parameter optimization can be sped up if the user withholds some data points as a validation set and optimizes the hyper parameters by trying to interpolate at those known points.

Comment by price on Wed 20 Mar 2013 11:41:11 AM MST

GPL is the GNU General Public License, under which the LSST code is licensed. We can't incorporate any code taken from sources that aren't compatible with the GPL (Numerical Recipes is a common one that isn't compatible).

12 sec for a 1k square image is still very slow: that's over 3 minutes for a 4k square image. Are there any clear optimisation paths?

Comment by danielsf on Wed 20 Mar 2013 12:01:13 PM MST

I did re-implement the mergeSort routine from Numerical Recipes, but my reading of their license is that, as long as you don't copy their actual code verbatim, you are free to use their algorithms. Other than that, everything is original, so it is GPL compatible.

I will look for ways to speed up the algorithm. Is there a speed we need to attain?

Comment by price on Wed 20 Mar 2013 12:12:40 PM MST

You should 'never' have to write your own sort function --- use std::sort (http://www.cplusplus.com/reference/algorithm/sort/).

I think you need to aim to shave around two orders of magnitude off your time: it shouldn't take much more than a few seconds to interpolate a 4k square image background.

Comment by price on Wed 20 Mar 2013 12:19:34 PM MST

In fact, if you're only sorting to get the median (as is common for kd-trees), or similar, then use std::nth_element instead: see http://stackoverflow.com/questions/10352442/whats-the-practical-difference-between-stdnth-element-and-stdsort

Comment by danielsf on Fri 22 Mar 2013 01:23:45 PM MST

The sorting algorithm only comes into play when initially building the kd-tree for GaussianProcess. Presently (for 189 input data points), that process takes less than a second, so changing the sort algorithm won't buy us any speed.

We can reduce the time spent on an image by a factor of two if we replace the decaying exponential covariogram with a polynomial covariogram. I just spent some time playing around with a 1/(d^2+epsilon) covariogram. It was reasonably accurate. This covariogram has no obvious theoretical justification but, heuristically, it behaves the way we want it to (nearby points are correlated; distant points are not).

After that, the only way left to increase the speed is to reduce the number of points used for each interpolation (i.e. use interpolate instead of batchInterpolate). Presently, there is no advantage to that, because using interpolate means performing a nearest neighbor search and then inverting a covariance matrix for each pixel in the interpolated image. However, if the input data roughly approximates a grid, we could do away with the kd-tree nearest neighbor search and just find neighbors using the interpolated pixel's position on the grid. If we then used only 3-5 neighbor points for each interpolated pixel, we could invert the covariance matrix by hand and hard-code that in. I suspect this would give us a substantial speed up.

Is it reasonable to assume that, in these high-volume use cases, the input data will be a grid (or an approximate grid)?

Comment by price on Mon 25 Mar 2013 01:59:13 PM MST

A trivial optimisation would be to have euclideanDistance return the square of the distance, since expCovariogram (and your proposed polynomial covariogram) promptly squares it anyway.

You've got a lot of i++ instead of ++i; dunno how much of a difference that makes with modern optimising compilers, but something to look at. You should also use C++ iterators when practical. ndarray seems to support that.

Not sure if there are speed differences between the various types available in ndarray (e.g., allowing for the possibility of non-unit strides may add unnecessary overhead); Jim Bosch could say.

I expect that when interpolating on a grid, significant gains could be made by splitting out the individual dimensions and some caching:

exp(a*(d0^2 + d1^2)) = exp(a*d0^2) * exp(a*d1^2)

Also, having looked through the code, it really needs a good edit for style. Please read the coding standards and look at other code for examples on style. There's also a lot of code duplication that should be factored out.

Comment by price on Tue 26 Mar 2013 07:21:40 AM MST

P.S. Yes, the most common use case will be to construct an image based on a sample (which may or may not be regularly sampled). The API should allow this (e.g., return an NxM image).

Comment by jbosch on Tue 26 Mar 2013 09:31:14 AM MST

On grids: there are some use cases in which the points we evaluate will be on a grid (e.g. backgrounds), and some where we won't (e.g. PSFs and Kernels). It'd be good to have an API for both, even if that means having two classes.

Comment by jbosch on Tue 26 Mar 2013 09:35:42 AM MST

On KDTree: instead of taking a function pointer in the constructor to define the distance, I'd recommend adding an extra template parameter for a functor class. That could default to a Euclidean distance functor, which would be by far the most common case.

Comment by jbosch on Tue 26 Mar 2013 09:39:26 AM MST

I'm torn between liking the fact that you made this work in an arbitrary number of dimensions, and the fact that the interface could be a lot more convenient (and maybe the code could be more efficient?) if it was limited to two - I can't think of any use cases for greater than two dimensions, and it would be very nice to be able to use existing geometry classes in some places.

Comment by jbosch on Tue 26 Mar 2013 09:46:28 AM MST

If possible, I'd prefer for the different types of covariograms and hyperparameters to be handled by different classes, ideally a polymorphic hierarchy held by a GaussianProcess data member. In particular, I'd like to get rid of the setHyperParameters function in its present form, and replace it with an API in which it's more clear how the parameters are to be interpreted and how many there are.

Comment by jbosch on Tue 26 Mar 2013 09:55:35 AM MST

For many use cases, we'd really want to interpolate a vector-valued function. While we could do that by having a different GaussianProcess instance for each function value, I imagine it'd be a lot more efficient to have a GaussianProcess object that can handle those directly. I'm guessing that would be a major design change, though, so I'm not sure how to fit it into the discussion here.

Comment by jbosch on Tue 26 Mar 2013 10:16:14 AM MST

A few minor, code-review level notes:

  • I'd strongly recommend Eigen::LDLT instead of Eigen::LLT.
  • You should either inherit from boost::noncopyable or define explicit copy constructors and assignment operators; the implementations implicitly created by the compiler will be dangerously wrong for these classes.
  • Typedefs should be CamelCase starting with a capital letter.
  • Function argument lists longer than one line need to be indented.

Comment by jbosch on Tue 26 Mar 2013 11:14:39 AM MST

Looking more closely at the algorithmic side of this, I'm beginning to suspect that we'll never want to run the GaussianProcess interpolation directly on all of the pixels of an image; it's just intrinsically too slow. I think it's much more likely we'll want to use GaussianProcess to interpolate on a coarser grid, and then use something like splines or polynomials to get a value at each pixel position. That doesn't mean we need to build that functionality into this class, of course. And there are other use cases in which we'll only want to interpolate at each source position, not each pixel position, and for that we may want to use GaussianProcess directly.


Design Review on 26-March-2013

Present:

  • SAT: Andy Connolly, Robyn Allsman, K-T Lim
  • Others: Scott Daniel, Andy Becker, Paul Price, Russell Owen, Jim Bosch, Yusra AlSayyad

Algorithm and Requirements

Use cases -- almost anything we do spatial modeling on:

  • Backgrounds
  • Zeropoints
  • Parameters for PSFs/kernels
  • WCSs
  • Over CRs and saturated pixels

Irregularly sampled datasets are key usage

  • If inputs are not a grid, code is slower
  • If inputs are on a grid, possible to optimize (finding nearest-neighbors more quickly)

API is too low-level for end-users, but can wrap to make more friendly

Do we need more than 2 dimensions?

  • If we want to re-use k-d tree, should be based on Points (2-D or 3-D)
    • Can template on container (Point/ndarray)
  • Using full ndarrays for coordinates probably has 2X overhead

Can be used for 1 dimension (code should work)

  • But could be made more efficient

Leave writing a Point-based wrapper for GaussianProcess and KdTree for another ticket, which may also refactor them by specializing/optimizing for Points

Vector-valued functions:

  • Use ndarray for values
  • Not too hard to do
  • Can implement covariance between result values
    • But far away from making use of it, so don't do yet

Implement vector values; make sure scalar values are still easy to use

Interface

KdTree class

Other implementations of k-d trees?

  • Serge has one in source association
  • Dustin has one in astrometry.net (but we treat that as an external package)
  • meas_mosaic (would like to use a more generic one)

Goal of this implementation is just to find number of neighbors

  • Tests do exist to make sure it finds right neighbors

AJC: If we write a generic one, we should look at fast algorithms for treewalk from N-body simulation groups

GaussianProcess class

Objection to setHyperParameters():

  • Each covariogram should be a class that can be evaluated
  • Parameters should be part of the class
  • Could be a functor

Do we need to change covariograms in same class?

  • Probably not, so could be templated for efficiency (do not use virtual functions in covariogram class)
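
(For illustration, the templated-functor design discussed above might look like the following sketch; all names are hypothetical.)

    #include <Eigen/Dense>
    #include <cmath>

    // The covariogram is a functor that owns its hyper parameters.
    struct SquaredExpCovariogram {
        double ell2;  // the l^2 hyper parameter
        double operator()(Eigen::VectorXd const &a,
                          Eigen::VectorXd const &b) const {
            return std::exp(-0.5 * (a - b).squaredNorm() / ell2);
        }
    };

    // GaussianProcess templated on the covariogram type: no virtual calls
    // in the inner loop, and parameters are set through the functor itself.
    template <typename T, typename Covariogram = SquaredExpCovariogram>
    class GaussianProcess {
    public:
        explicit GaussianProcess(Covariogram const &cov) : _covariogram(cov) {}
    private:
        Covariogram _covariogram;
    };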

Different lambda for each point?

  • Just change to ndarray

In most use cases, points and values will not be naturally grouped, so having them separate in arguments is OK

Should not pass in sizes of ndarrays; already available from the ndarray itself

For vector-valued functions, should fill an array rather than return one

ndarray template parameters should be const when appropriate

addPoint() could be useful, but then removePoint() should be there, too

  • Somewhat dangerous because tree can become unbalanced
  • Add warning to documentation, but leave method

Make getNeighbors(), getCovarianceRow(), testKdTree() private, remove, or move to a friend test class

Why getCovarianceRow() and not getCovarianceMatrix()?

  • Not sure

Times are currently public members; move into a separate (private) struct returned by getTimes()

Many (private) data members; could be function-local variables?

  • Class members should only be necessary state information (or caches)
    • Otherwise interface has to be changed when only implementation is changing
  • Arrays are there primarily for caching the size of array so it doesn't need to be reallocated
  • Try creating temporary arrays of known size instead

Methods should be const wherever possible

Interpolate methods:

  • Things that get modified should be first in call order
  • Either switch to vector-valued functions or change variances to be scalars

No tab characters in source code