# Design Document for Gaussian Process


## Overview

`GaussianProcess` will be a class within `lsst.afw.math`. It will operate by reading in N D-dimensional data points and the corresponding values of some function `f` and returning the values of `f` interpolated at unknown query points. `GaussianProcess` will take a template parameter controlling the datatype (usually `float` or `double`) used for both the coordinate data and the function values. Users may want to choose `float` in the case of exceptionally large data sets to conserve memory.

## The Theoretical Algorithm

Following Rasmussen and Williams (2006) http://gaussianprocess.org/gpml/ equation (2.19), `GaussianProcess` returns `f_q` (the value of the function associated with the query point) as

`f_q=f_av + Cov_qi Cov^-1_ij (f_j - f_av)`

where `f_av` is the algebraic mean of all of the `f_i` associated with the data points and `Cov_qi` and `Cov_ij` are covariance matrices associating the query point with the data points and the data points with each other, respectively. Repeated indices are summed over. `Cov_qi` and `Cov_ij` take the form

`Cov_ij = I_ij * lambda + C(d_i,d_j)`

where `C(d_i,d_j)` is a user-specified function (the "covariogram") describing the covariance relation between `f` at the data points `d_i` and `d_j` and `lambda` is a "hyper parameter" which also needs to be user-specified (or somehow optimized using the data).

The variance `Var_q` (the square of the uncertainty in `f_q`) is

`Var_q= K *[ C(d_q,d_q)+lambda-Cov_qi Cov^-1_ij Cov_qj ]`

`K` is the "Kriging parameter", another hyper parameter that must be user-specified.

Note that the hyper parameter `lambda` is meant to represent the intrinsic variation or uncertainty in the values of `f_i`. At the moment, the code only permits a single value of `lambda` for all data points. It would be very straightforward to make `lambda` an array so that each data point can have its own characteristic uncertainty in `f`. I have never actually considered what effect such heteroskedasticity would have on the interpolation.
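To make the algebra above concrete, here is a minimal pure-Python sketch of the prediction and variance formulas (the production class is C++ using Eigen; the function names, the choice of a squared-exponential covariogram, and the default hyper parameter values here are purely illustrative):

```python
import math

def sq_exp_cov(a, b, ell2):
    """Squared-exponential covariogram C(d_i,d_j) = exp(-0.5*||d_i - d_j||^2 / l^2)."""
    d2 = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-0.5 * d2 / ell2)

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting (small systems only)."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            fac = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= fac * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

def gp_interpolate(data, f, query, ell2=1.0, lam=1e-5, kriging=1.0):
    """Return (f_q, Var_q) following the equations above."""
    n = len(data)
    f_av = sum(f) / n
    # Cov_ij = I_ij * lambda + C(d_i, d_j)
    cov = [[sq_exp_cov(data[i], data[j], ell2) + (lam if i == j else 0.0)
            for j in range(n)] for i in range(n)]
    cov_q = [sq_exp_cov(query, data[i], ell2) for i in range(n)]
    # f_q = f_av + Cov_qi Cov^-1_ij (f_j - f_av)
    w = solve(cov, [fi - f_av for fi in f])
    f_q = f_av + sum(cq * wi for cq, wi in zip(cov_q, w))
    # Var_q = K * [C(d_q, d_q) + lambda - Cov_qi Cov^-1_ij Cov_qj]
    v = solve(cov, cov_q)
    var_q = kriging * (sq_exp_cov(query, query, ell2) + lam
                       - sum(cq * vi for cq, vi in zip(cov_q, v)))
    return f_q, var_q
```

Note that `Var_q` collapses toward zero (up to `lambda`) at the data points themselves and grows with distance from them, as the theory predicts.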

## Storing the Data

`GaussianProcess` will contain a two-dimensional member array `_data` and a one-dimensional member array `_function`. These will contain deep copies of the input data points and function values, respectively. `_data` will be arranged so that the ith row corresponds to the ith data point. The columns of `_data` will correspond to the dimensions of the data point space (i.e. there will be two columns for a traditional image).

While the matrix inversions in the algorithm can be replaced by Eigen's linear algebraic "solve" functions, the multiple levels of matrix multiplication can still slow down the interpolation process considerably. Therefore, `GaussianProcess` includes the ability for the user to request that interpolation only be done using the M nearest neighbors of the query point. In order to rapidly find those nearest neighbors, `GaussianProcess` organizes the data points into a KD tree using the class `KdTree` also included in `GaussianProcess.cc`. The member `_data` of `GaussianProcess` will point to the member `data` of `KdTree` where the actual data will be stored.

There will be two constructors for `GaussianProcess`. One constructor will allow the user to supply 1-dimensional ndarrays of length D corresponding to the minimum and maximum allowed values in each dimension of coordinate space. If this constructor is invoked, the data points will be projected onto a space in which the span of each dimension is the difference between the provided maximums and minimums, i.e. each input coordinate value is transformed to

`x -> (x-x_min)/(x_max-x_min)`.

This will alter the distances between data points as reckoned by `KdTree` and the `GaussianProcess` covariogram. Users can invoke this feature to compensate for cases in which one dimension has a much smaller span than another, even though both dimensions exert equal influence on the function being interpolated. The second constructor does not require the input of minimum and maximum values and leaves the distances between data points as they are input.
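The transform amounts to a per-dimension rescaling, sketched here in Python (the actual constructor is C++ taking ndarrays; the function name is hypothetical):

```python
def normalize(points, mins, maxs):
    """Map each coordinate x to (x - x_min) / (x_max - x_min), dimension by dimension."""
    return [tuple((x - lo) / (hi - lo) for x, lo, hi in zip(p, mins, maxs))
            for p in points]
```

A dimension spanning [100, 300] and one spanning [0, 10] both end up spanning [0, 1], so they contribute equally to the distances used by the covariogram and the nearest neighbor search.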

## Interpolating the Function

There are two main ways that a user can interpolate `f` using `GaussianProcess`.

- The member function `interpolate` will read in a single query point (stored in a 1-dimensional ndarray) and a specified number of nearest neighbors and use that number of nearest neighbors to calculate both `f_q` and `Var_q`.

- The member function `batchInterpolate` will read in a list of many query points and use all of the data points to calculate `f_q` and `Var_q` at all of them. This routine benefits in speed from the fact that you only need to calculate `Cov^-1_ij * (f_j - f_av)` once for all of the query points. Unfortunately, this simplification does not apply to the calculation of `Var_q`, so there is a version of `batchInterpolate` which does not calculate `Var_q`, in the event that `Var_q` is unnecessary and speed is of the essence.

Note: it generally takes `interpolate` as much time to calculate `Var_q` as `f_q`. If we think it would be useful to have a version of `interpolate` that does not return `Var_q`, that can be done.
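The speed advantage of `batchInterpolate` comes from reusing the weight vector `Cov^-1_ij * (f_j - f_av)`. Here is a deliberately tiny sketch with exactly two 1-D data points, so the 2x2 matrix inverse can be written by hand (Python for illustration; the function names are hypothetical):

```python
import math

def sq_exp(a, b, ell2=1.0):
    """Squared-exponential covariogram for 1-D points."""
    return math.exp(-0.5 * (a - b) ** 2 / ell2)

def batch_interpolate(xs, fs, queries, ell2=1.0, lam=1e-5):
    """batchInterpolate idea for two 1-D data points: the weight vector
    w = Cov^-1 (f - f_av) is computed once and reused for every query point."""
    x0, x1 = xs
    f_av = (fs[0] + fs[1]) / 2.0
    # Cov = [[1 + lambda, c], [c, 1 + lambda]] with c = C(d_0, d_1); invert by hand
    a = 1.0 + lam
    c = sq_exp(x0, x1, ell2)
    det = a * a - c * c
    r0, r1 = fs[0] - f_av, fs[1] - f_av
    w0 = (a * r0 - c * r1) / det
    w1 = (-c * r0 + a * r1) / det
    # each query now costs only two covariogram evaluations and a dot product
    return [f_av + sq_exp(q, x0, ell2) * w0 + sq_exp(q, x1, ell2) * w1
            for q in queries]
```

With N data points and Q queries, the one-time solve costs O(N^3) but each query afterwards costs only O(N), which is why the variance calculation (which needs a fresh solve per query) cannot share the same savings.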

I ran a rough test of the algorithm using data supplied to me by Yusra. The data consisted of 189 points on a two-dimensional image and a scalar function distributed across them. The test consisted of interpolating the function on 1,000,000 points not in the original data.

Interpolating the data points one at a time using 10 nearest neighbors (calculating the variance at each) took 23 seconds. Note: in this case, calculating the variance for all 1,000,000 points only added about 1 second. Most of the time was spent finding the nearest neighbor points and iterating over matrix indices.

Interpolating the data points one at a time using 50 nearest neighbors took 190 seconds. Calculating the variance added 11 seconds relative to not calculating the variance.

Interpolating using `batchInterpolate` without calculating the variance took 100 seconds.

Interpolating using `batchInterpolate` with the variance took 184 seconds.

## Adding to the Data

In the event that new information is learned about the function, the member function `addPoint` can add a new data point to those stored in `GaussianProcess` and `KdTree`.

## Specifying the Covariogram

There is a great degree of freedom in specifying the functional form of the covariogram `C(d_i,d_j)`. Therefore, `GaussianProcess` will include a function pointer `_covariogram` which can be redirected to any one of the covariogram functions supplied in the code. `GaussianProcess` will also include an array `_hyperParameters` which will be allocated to accommodate however many hyper parameters are associated with the chosen covariogram. Initially, I will write code to support the decaying exponential covariogram

`C(d_i,d_j)=exp[-0.5 * ||d_i-d_j||^2/l^2]`

with `l^2` a hyper parameter and (see Rasmussen and Williams equation 4.29)

`C(d_i,d_j)=(2/pi) * sin^-1 { 2(1,d_i) S (1,d_j)/sqrt[(1+2*(1,d_i)S(1,d_i))(1+2*(1,d_j)S(1,d_j))]}`

with `S` a diagonal matrix containing two hyper parameters.
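Both covariograms are straightforward to express. A sketch in Python (the real implementations are C++ functions; these names, and the interpretation of `S` as `diag(s0, s1, ..., s1)` acting on the augmented vector `(1, d)`, are illustrative assumptions):

```python
import math

def exp_covariogram(di, dj, ell2):
    """Decaying exponential: C(d_i, d_j) = exp(-0.5 * ||d_i - d_j||^2 / l^2)."""
    d2 = sum((a - b) ** 2 for a, b in zip(di, dj))
    return math.exp(-0.5 * d2 / ell2)

def nn_covariogram(di, dj, s0, s1):
    """Neural-network covariogram (Rasmussen & Williams eq. 4.29), assuming
    S = diag(s0, s1, ..., s1) applied to the augmented vectors (1, d)."""
    def q(a, b):
        # (1, a) S (1, b) with the diagonal S assumed above
        return s0 + s1 * sum(x * y for x, y in zip(a, b))
    num = 2.0 * q(di, dj)
    den = math.sqrt((1.0 + 2.0 * q(di, di)) * (1.0 + 2.0 * q(dj, dj)))
    return (2.0 / math.pi) * math.asin(num / den)
```

Because both functions share the signature "two points in, one covariance out", swapping them behind a single `_covariogram` pointer (plus a hyper parameter array of the appropriate length) is natural.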

The user will be able to toggle between these covariograms using the member function `setCovariogramType`.

These covariogram functions will live in the namespace `lsst::afw::math::detail::GaussianProcess`. I will also use that namespace for a merge sort algorithm necessary for the nearest neighbor search within `KdTree`.

## Setting Hyper Parameters

In many cases, hyper parameters can be set using knowledge of the function `f` gleaned from the data. `GaussianProcess` will facilitate this practice using the member function `selfInterpolate`, which will interpolate the value of `f` at one of the data points using a specified number of nearest neighbors but not the point itself (i.e. if you want to test `GaussianProcess`'s performance on the ith data point, you ask for k nearest neighbors and `selfInterpolate` will find the k+1 nearest neighbors and ignore the 0th, which is the ith data point itself). The user can experiment with different values of the hyper parameters and choose the combination for which `selfInterpolate` returns the most accurate values of `f` at the known points.
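The leave-one-out idea behind `selfInterpolate` can be sketched as follows, here using all remaining points rather than a fixed number of neighbors, with a brute-force search over candidate values of `l^2` (Python for illustration; the names and the squared-error figure of merit are illustrative assumptions):

```python
import math

def sq_exp(a, b, ell2):
    """Squared-exponential covariogram for 1-D points."""
    return math.exp(-0.5 * (a - b) ** 2 / ell2)

def solve(A, b):
    """Gaussian elimination with partial pivoting; adequate for small systems."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            fac = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= fac * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

def self_interpolate(xs, fs, i, ell2, lam=1e-5):
    """Predict f at data point i using every point *except* point i itself."""
    others = [j for j in range(len(xs)) if j != i]
    f_av = sum(fs[j] for j in others) / len(others)
    cov = [[sq_exp(xs[a], xs[b], ell2) + (lam if a == b else 0.0)
            for b in others] for a in others]
    w = solve(cov, [fs[j] - f_av for j in others])
    cov_q = [sq_exp(xs[i], xs[j], ell2) for j in others]
    return f_av + sum(c * wi for c, wi in zip(cov_q, w))

def pick_ell2(xs, fs, candidates):
    """Choose the l^2 whose leave-one-out predictions minimize total squared error."""
    def err(ell2):
        return sum((self_interpolate(xs, fs, i, ell2) - fs[i]) ** 2
                   for i in range(len(xs)))
    return min(candidates, key=err)
```

A length scale that is far too small makes every point look uncorrelated (the prediction collapses to the mean), so leave-one-out error discriminates cleanly against it.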

## API

The class and method definitions for `GaussianProcess` and `KdTree` can be found at

Additionally, this class requires implementation of the linear algebra package `Eigen` and the `mergeSort` method from

## Comments

#### Comment by price on Tue 19 Mar 2013 09:50:43 AM MST

I would prefer not to add yet another kd-tree implementation to the stack (we have two or three already, specialised for different uses), unless this is a general, public implementation that can be used for multiple purposes.

Could you provide your proposed class definition, with the class inheriting from an abstract base class that defines the API?

What is the origin of the code used here? Is it compatible with GPL?

Am I to understand that interpolating a background takes 100 sec for a mere 1k square image?

#### Comment by danielsf on Wed 20 Mar 2013 11:31:52 AM MST

The kd-tree implementation for `GaussianProcess` is general. It does not, currently, have the ability to prune, but that can be added.

The source code for `GaussianProcess` is in ticket 2604. The class definition is presently in `$AFW_DIR/include/lsst/afw/math/GaussianProcess.h`. `GaussianProcess` is totally stand-alone; it does not inherit from any other class.

I'm not sure what GPL is. This code has all been written from scratch.

Regarding the timing: the 100 seconds included time spent optimizing the hyper parameters using `selfInterpolate` with 188 nearest neighbors. Since each call to `selfInterpolate` requires a neighbor search and a matrix inversion, this is an expensive process. If I neglect the hyper parameter optimization, it only took 12 seconds to call `batchInterpolate` on 1,000,000 points without the variance calculation and 95 seconds with it. Hyper parameter optimization can be sped up if the user withholds some data points as a validation set and optimizes the hyper parameters by trying to interpolate on those known points.

#### Comment by price on Wed 20 Mar 2013 11:41:11 AM MST

GPL is the GNU General Public License, under which the LSST code is licensed. We can't incorporate any code taken from sources that aren't compatible with the GPL (Numerical Recipes is a common one that isn't compatible).

12 sec for a 1k square image is still very slow: that's over 3 minutes for a 4k square image. Are there any clear optimisation paths?

#### Comment by danielsf on Wed 20 Mar 2013 12:01:13 PM MST

I did re-implement the mergeSort routine from Numerical Recipes, but my reading of their license is that, as long as you don't copy their actual code verbatim, you are free to use their algorithms. Other than that, everything is original, so it is GPL compatible.

I will look for ways to speed up the algorithm. Is there a speed we need to attain?

#### Comment by price on Wed 20 Mar 2013 12:12:40 PM MST

You should 'never' have to write your own sort function --- use `std::sort` (http://www.cplusplus.com/reference/algorithm/sort/).

I think you need to aim to shave around two orders of magnitude off your time: it shouldn't take much more than a few seconds to interpolate a 4k square image background.

#### Comment by price on Wed 20 Mar 2013 12:19:34 PM MST

In fact, if you're only sorting to get the median (as is common for kd-trees), or similar, then use `std::nth_element` instead: see http://stackoverflow.com/questions/10352442/whats-the-practical-difference-between-stdnth-element-and-stdsort

#### Comment by danielsf on Fri 22 Mar 2013 01:23:45 PM MST

The sorting algorithm only comes into play when initially building the kd-tree for `GaussianProcess`. Presently (for 189 input data points), that process takes less than a second, so changing the sort algorithm won't buy us any speed.

We can reduce the time spent on an image by a factor of two if we replace the decaying exponential covariogram with a polynomial covariogram. I just spent some time playing around with a `1/(d^2+epsilon)` covariogram. It was reasonably accurate. This covariogram has no obvious theoretical justification but, heuristically, it behaves the way we want it to (nearby points are correlated; distant points are not).

After that, the only way left to increase the speed is to reduce the number of points used for each interpolation (i.e. use `interpolate` instead of `batchInterpolate`). Presently, that is no advantage because using `interpolate` means performing a nearest neighbor search and then inverting a covariance matrix for each pixel in the interpolated image. However, if the input data roughly approximates a grid, we could do away with the kd-tree nearest neighbor search and just find neighbors using the interpolated pixel's position on the grid. If we then used only 3-5 neighbor points for each interpolated pixel, we could invert the covariance matrix by hand and hard-code that in. I suspect this would give us a substantial speed up.

Is it reasonable to assume that, in these high-volume use cases, the input data will be a grid (or an approximate grid)?

#### Comment by price on Mon 25 Mar 2013 01:59:13 PM MST

A trivial optimisation would be to have `euclideanDistance` return the square of the distance, since `expCovariogram` (and your proposed polynomial covariogram) promptly squares it anyway.

You've got a lot of `i++` instead of `++i`; dunno how much of a difference that makes with modern optimising compilers, but something to look at. You should also use C++ iterators when practical. `ndarray` seems to support that.

Not sure if there are speed differences between the various types available in `ndarray` (e.g., allowing for the possibility of non-unit strides may add unnecessary overhead); Jim Bosch could say.

I expect that when interpolating on a grid, significant gains could be made by splitting out the individual dimensions and some caching:

`exp(a*(d0^2 + d1^2)) = exp(a*d0^2) * exp(a*d1^2)`

Also, having looked through the code, it really needs a good edit for style. Please read the coding standards and look at other code for examples on style. There's also a lot of code duplication that should be factored out.

#### Comment by price on Tue 26 Mar 2013 07:21:40 AM MST

P.S. Yes, the most common use case will be to construct an image based on a sample (which may or may not be regularly sampled). The API should allow this (e.g., return an NxM image).

#### Comment by jbosch on Tue 26 Mar 2013 09:31:14 AM MST

On grids: there are some use cases in which the points we evaluate will be on a grid (e.g. backgrounds), and some where we won't (e.g. PSFs and Kernels). It'd be good to have an API for both, even if that means having two classes.

#### Comment by jbosch on Tue 26 Mar 2013 09:35:42 AM MST

On KDTree: instead of taking a function pointer in the constructor to define the distance, I'd recommend adding an extra template parameter for a functor class. That could default to a Euclidean distance functor, which would be by far the most common case.

#### Comment by jbosch on Tue 26 Mar 2013 09:39:26 AM MST

I'm torn between liking the fact that you made this work in an arbitrary number of dimensions, and the fact that the interface could be a lot more convenient (and maybe the code could be more efficient?) if it was limited to two - I can't think of any use cases for greater than two dimensions, and it would be very nice to be able to use existing geometry classes in some places.

#### Comment by jbosch on Tue 26 Mar 2013 09:46:28 AM MST

If possible, I'd prefer for the different types of covariograms and hyperparameters to be handled by different classes, ideally a polymorphic hierarchy held by a `GaussianProcess` data member. In particular, I'd like to get rid of the `setHyperParameters` function in its present form, and replace it with an API in which it's more clear how the parameters are to be interpreted and how many there are.

#### Comment by jbosch on Tue 26 Mar 2013 09:55:35 AM MST

For many use cases, we'd really want to interpolate a vector-valued function. While we could do that by having a different `GaussianProcess` instance for each function value, I imagine it'd be a lot more efficient to have a `GaussianProcess` object that can handle those directly. I'm guessing that would be a major design change, though, so I'm not sure how to fit it into the discussion here.

#### Comment by jbosch on Tue 26 Mar 2013 10:16:14 AM MST

A few minor, code-review level notes:

- I'd strongly recommend `Eigen::LDLT` instead of `Eigen::LLT`.
- You should either inherit from `boost::noncopyable` or define explicit copy constructors and assignment operators; the implementations implicitly created by the compiler will be dangerously wrong for these classes.
- Typedefs should be `CamelCase`, starting with a capital letter.
- Function argument lists longer than one line need to be indented.

#### Comment by jbosch on Tue 26 Mar 2013 11:14:39 AM MST

Looking more closely at the algorithmic side of this, I'm beginning to suspect that we'll never want to run the `GaussianProcess` interpolation directly on all of the pixels of an image; it's just intrinsically too slow. I think it's much more likely we'll want to use `GaussianProcess` to interpolate on a coarser grid, and then use something like splines or polynomials to get a value at each pixel position. That doesn't mean we need to build that functionality into this class, of course. And there are other use cases in which we'll only want to interpolate at each source position, not each pixel position, and for that we may want to use `GaussianProcess` directly.

## Design Review on 26-March-2013

Present:

- SAT: Andy Connolly, Robyn Allsman, K-T Lim
- Others: Scott Daniel, Andy Becker, Paul Price, Russell Owen, Jim Bosch, Yusra AlSayyad

### Algorithm and Requirements

Use cases -- almost anything we do spatial modeling on:

- Backgrounds
- Zeropoints
- Parameters for PSFs/kernels
- WCSs
- Over CRs and saturated pixels

Irregularly sampled datasets are key usage

- If inputs are not a grid, code is slower
- If inputs are on a grid, possible to optimize (finding nearest-neighbors more quickly)

API is too low-level for end-users, but can wrap to make more friendly

Do we need more than 2 dimensions?

- If we want to re-use k-d tree, should be based on `Point`s (2-D or 3-D)
  - Can template on container (`Point`/`ndarray`)
- Using full ndarrays for coordinates probably has 2X overhead

Can be used for 1 dimension (code should work)

- But could be made more efficient

Leave writing a `Point`-based wrapper for `GaussianProcess` and `KdTree` for another ticket, which may also refactor them by specializing/optimizing for `Point`s

Vector-valued functions:

- Use `ndarray` for values
- Not too hard to do
- Can implement covariance between result values
- But far away from making use of it, so don't do yet

Implement vector values; make sure scalar values are still easy to use

### Interface

`KdTree` class

Other implementations of k-d trees?

- Serge has one in source association
- Dustin has one in astrometry.net (but we treat that as an external package)
- `meas_mosaic` (would like to use a more generic one)

Goal of this implementation is just to find number of neighbors

- Tests do exist to make sure it finds right neighbors

AJC: If we write a generic one, we should look at fast algorithms for treewalk from N-body simulation groups

`GaussianProcess` class

Objection to `setHyperParameters()`:

- Each covariogram should be a class that can be evaluated
- Parameters should be part of the class
- Could be a functor

Do we need to change covariograms in same class?

- Probably not, so could be templated for efficiency (do not use virtual functions in covariogram class)

Different lambda for each point?

- Just change to `ndarray`

In most use cases, points and values will not be naturally grouped, so having them separate in arguments is OK

Should not pass in sizes of `ndarray`s; already available from the `ndarray` itself

For vector-valued functions, should fill an array rather than return one

`ndarray` template parameters should be `const` when appropriate

`addPoint()` could be useful, but then `removePoint()` should be there, too

- Somewhat dangerous because tree can become unbalanced
- Add warning to documentation, but leave method

Make `getNeighbors()`, `getCovarianceRow()`, `testKdTree()` private, remove, or move to a friend test class

Why `getCovarianceRow()` and not `getCovarianceMatrix()`?

- Not sure

Times are currently public members; move into a separate (private) `struct` returned by `getTimes()`

Many (private) data members; could be function-local variables?

- Class members should only be necessary state information (or caches)
- Otherwise interface has to be changed when only implementation is changing

- Arrays are there primarily for caching the size of array so it doesn't need to be reallocated
- Try creating temporary arrays of known size instead

Methods should be `const` wherever possible

Interpolate methods:

- Things that get modified should be first in call order
- Either switch to vector-valued functions or change variances to be scalars

No tab characters in source code