In this post, we explore deep kernel learning (DKL), a hybrid approach that combines the strengths of deep neural networks (DNNs) with Gaussian processes (GPs). DKL offers a powerful framework for modeling complex data patterns, enhancing the predictive capabilities and interpretability of standard GPs, and we will implement a DKL model using the GPyTorch library in Python. If you are new to Gaussian processes, we recommend reading our previous posts on Gaussian processes and Multi-output Gaussian processes before diving into DKL.
Gaussian processes (GPs)
Gaussian processes are non-parametric models used for regression and classification tasks. A GP is defined by its mean function \(m(\cdot)\) and covariance function (kernel) \(k(\cdot, \cdot)\).
\[
f \sim \mathcal{GP}(m, k)
\]
Remember that, given training data \(X = [x_1, \ldots, x_n]\) and targets \(Y = [y_1 = g(x_1), \ldots, y_n = g(x_n)]\), the predictive distribution for a test point \(x_*\) is:
\[
f_* \mid X, Y, x_* \sim \mathcal{N}\left( m(x_*) + k(x_*, X) K_n^{-1} (Y - m(X)),\ k(x_*, x_*) - k(x_*, X) K_n^{-1} k(X, x_*) \right)
\]
where \(k(x_*, X) = [k(x_*, x_1), \ldots, k(x_*, x_n)]\), \(k(X, x_*) = k(x_*, X)^\top\), and \(K_n\) is the kernel matrix for the training data \(X\) with noise added to the diagonal.
Here we can see how important the kernel is: it plays a vital role in both the mean and the variance of the predictive distribution (which is why the mean function is often simply set to zero). Its choice is therefore crucial for the model's performance, and ad-hoc kernels are designed to capture specific patterns in the data.
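For example, a common default choice is the squared-exponential (RBF) kernel,
\[
k(x, x') = \sigma_f^2 \exp\left( -\frac{\lVert x - x' \rVert^2}{2 \ell^2} \right),
\]
whose hyperparameters, the signal variance \(\sigma_f^2\) and the lengthscale \(\ell\), encode how smooth the modeled function is and over what distance inputs remain correlated; if these assumptions do not match the data, the predictions will suffer no matter how many observations are available.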
Deep kernel learning (DKL)
DKL integrates DNNs with GPs by using a DNN to learn a representation of the data, which is then used to define the GP kernel. This allows the GP to capture complex patterns in the data that a standard kernel might miss.
In DKL, the DNN acts as a feature extractor, transforming the input \(X\) into a feature representation \(Z = \phi(X; w)\), where \(w\) are the network weights. The GP kernel is then defined on these features:
\[
k_{\mathrm{DKL}}(x, x') = k\big(\phi(x; w), \phi(x'; w); \theta\big)
\]
where \(\theta\) are the hyperparameters of the base kernel \(k\).
Here lies the flexibility of DKL: the most suitable kernel is learned ad hoc for the data at hand rather than being fixed a priori. Indeed, the DNN parameters \(w\) and the GP hyperparameters \(\theta\) are jointly optimized to maximize the marginal likelihood of the data.
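Concretely, both sets of parameters enter the same training objective, the GP log marginal likelihood (written here for a zero-mean GP for brevity):
\[
\log p(Y \mid X, w, \theta) = -\frac{1}{2} Y^\top K_n^{-1} Y - \frac{1}{2} \log \lvert K_n \rvert - \frac{n}{2} \log 2\pi
\]
where the entries of \(K_n\) are computed with \(k_{\mathrm{DKL}}\) and therefore depend on both the network weights \(w\) and the kernel hyperparameters \(\theta\), so gradients with respect to both can be obtained by backpropagation.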
Implementation with GPyTorch
Let's implement a DKL model using the GPyTorch library in Python and compare its performance with a standard GP. As a synthetic dataset we will use the McCormick function, defined as:
\[
g(x_1, x_2) = \sin(x_1 + x_2) + (x_1 - x_2)^2 - 1.5\, x_1 + 2.5\, x_2 + 1,
\]
which has been widely used as a benchmark for optimization and regression tasks. To make the problem more challenging, Gaussian noise with mean 0 and standard deviation 1 is added to the training targets. Moreover, only 25 training points, arranged on a 5x5 grid, are used to fit the model. Note that this grid is not fine-grained, as the domain of the function is \(x_1 \in [-1.5, 4]\) and \(x_2 \in [-3, 4]\); the model therefore has to generalize well to make accurate predictions on unseen data.
First of all, we need to import the required libraries.
%pip install plotly

import numpy as np
import torch
import gpytorch
import plotly.graph_objects as go
import matplotlib.pyplot as plt
import pandas as pd
from torch import nn
from plotly.subplots import make_subplots
from sklearn.metrics import mean_squared_error
Now, we can define the McCormick function and generate the synthetic dataset for training and testing.
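A minimal sketch of this step, relying on the imports above, could look as follows; the random seed and the exact tensor handling are illustrative choices rather than the precise code used for the results below.

# McCormick function: g(x1, x2) = sin(x1 + x2) + (x1 - x2)^2 - 1.5*x1 + 2.5*x2 + 1
def mccormick(x1, x2):
    return torch.sin(x1 + x2) + (x1 - x2) ** 2 - 1.5 * x1 + 2.5 * x2 + 1

torch.manual_seed(0)  # assumption: fixed seed for reproducibility

# 25 training points on a coarse 5x5 grid over the domain [-1.5, 4] x [-3, 4]
g1, g2 = torch.meshgrid(torch.linspace(-1.5, 4, 5), torch.linspace(-3, 4, 5), indexing="ij")
train_x = torch.stack([g1.reshape(-1), g2.reshape(-1)], dim=-1)

# Noisy targets: standard Gaussian noise (mean 0, std 1) added to the true values
train_y = mccormick(train_x[:, 0], train_x[:, 1]) + torch.randn(train_x.shape[0])

# Dense 100x100 grid of noise-free test points for evaluation
t1, t2 = torch.meshgrid(torch.linspace(-1.5, 4, 100), torch.linspace(-3, 4, 100), indexing="ij")
test_x = torch.stack([t1.reshape(-1), t2.reshape(-1)], dim=-1)
test_y = mccormick(test_x[:, 0], test_x[:, 1])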
Figure 1: Neural network architecture
Regarding the deep neural network, we define a simple feedforward network with two hidden layers and a softplus activation function. The architecture is defined in the FeatureExtractor class, shown in Figure 1. The remaining part is the same as the standard GP model, which is explained in detail in the Gaussian processes post.
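The following is a minimal sketch of how the FeatureExtractor and the DKL model could be put together in GPyTorch; the hidden-layer width (64), the 2D feature output, the RBF base kernel, and the training settings (Adam, 500 iterations, learning rate 0.01) are assumptions for illustration rather than the exact configuration used here.

# Feedforward feature extractor: two hidden layers with Softplus activations.
# The hidden width (64) and the 2D feature output are assumptions for illustration.
class FeatureExtractor(nn.Sequential):
    def __init__(self, input_dim=2, hidden_dim=64, feature_dim=2):
        super().__init__(
            nn.Linear(input_dim, hidden_dim), nn.Softplus(),
            nn.Linear(hidden_dim, hidden_dim), nn.Softplus(),
            nn.Linear(hidden_dim, feature_dim),
        )


# Exact GP whose kernel operates on the learned features z = phi(x).
class DKLModel(gpytorch.models.ExactGP):
    def __init__(self, train_x, train_y, likelihood):
        super().__init__(train_x, train_y, likelihood)
        self.feature_extractor = FeatureExtractor()
        self.mean_module = gpytorch.means.ConstantMean()
        self.covar_module = gpytorch.kernels.ScaleKernel(
            gpytorch.kernels.RBFKernel(ard_num_dims=2)
        )

    def forward(self, x):
        z = self.feature_extractor(x)  # map inputs to the feature space
        return gpytorch.distributions.MultivariateNormal(
            self.mean_module(z), self.covar_module(z)
        )


likelihood = gpytorch.likelihoods.GaussianLikelihood()
dkl_model = DKLModel(train_x, train_y, likelihood)

# Jointly optimize the DNN weights and the GP hyperparameters by maximizing
# the exact marginal log-likelihood (Adam, 500 iterations: illustrative choices).
dkl_model.train()
likelihood.train()
optimizer = torch.optim.Adam(dkl_model.parameters(), lr=0.01)
mll = gpytorch.mlls.ExactMarginalLogLikelihood(likelihood, dkl_model)
for _ in range(500):
    optimizer.zero_grad()
    loss = -mll(dkl_model(train_x), train_y)
    loss.backward()
    optimizer.step()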
The interactive 3D plot above shows the predictions of the DKL (blue) and GP (green) models on the McCormick function (red). The DKL model captures the complex patterns of the function more accurately than the GP model, providing a better fit to the true surface.
In terms of performance, we can compare the mean squared error (MSE) of the DKL and GP models on the test data (i.e., the McCormick function evaluated on a 100x100 grid).
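As a sketch of that comparison, assuming both models have been trained, and with gp_model and gp_likelihood as hypothetical names for a standard exact GP defined directly on the raw inputs:

# Switch to evaluation mode and predict on the dense 100x100 test grid.
dkl_model.eval()
likelihood.eval()
with torch.no_grad(), gpytorch.settings.fast_pred_var():
    dkl_pred = likelihood(dkl_model(test_x)).mean

# gp_model / gp_likelihood: a standard exact GP with an RBF kernel on the raw
# inputs, defined and trained analogously (hypothetical names).
gp_model.eval()
gp_likelihood.eval()
with torch.no_grad(), gpytorch.settings.fast_pred_var():
    gp_pred = gp_likelihood(gp_model(test_x)).mean

print("DKL MSE:", mean_squared_error(test_y.numpy(), dkl_pred.numpy()))
print("GP MSE:", mean_squared_error(test_y.numpy(), gp_pred.numpy()))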
The results are clear: the DKL model outperforms the GP model, demonstrating the benefits of combining deep neural networks with Gaussian processes for complex regression tasks.
NN’s embedding
Let’s visualize the embedding of the input data into the feature space learned by the DNN.
# Get the feature embedding of the training data
with torch.no_grad():
    feature_embedding = dkl_model.feature_extractor(train_x)

# Create a subplot with 1 row and 2 columns
fig = make_subplots(rows=1, cols=2, subplot_titles=('Original Training Data', 'Feature Embedding'))

# Add the original training data to the first subplot
fig.add_trace(
    go.Scatter(
        x=train_x[:, 0].numpy(),
        y=train_x[:, 1].numpy(),
        mode='markers',
        marker=dict(size=5, color=train_y.numpy(), colorscale='Viridis', opacity=0.8),
        name='Training Data',
        hoverinfo='text',
        text=[f'Index: {i}' for i in range(len(train_x))]),  # Custom hover text
    row=1, col=1)

# Add the feature embedding data to the second subplot
fig.add_trace(
    go.Scatter(
        x=feature_embedding[:, 0].numpy(),
        y=feature_embedding[:, 1].numpy(),
        mode='markers',
        marker=dict(size=5, color=train_y.numpy(), colorscale='Viridis', opacity=0.8),
        name='Feature Embedding',
        hoverinfo='text',
        text=[f'Index: {i}' for i in range(len(feature_embedding))]),  # Custom hover text
    row=1, col=2)

# Update layout for a cohesive look
fig.update_layout(height=350, width=700, showlegend=False)
fig.show()
Interestingly, the DNN has picked up on the near-symmetry of the McCormick function. As a result, it has learned an almost one-dimensional representation of the data, ordered by the value of the function. The GP can then fit the data easily, as the feature space is much better suited to the task.