Thanks to visit codestin.com
Credit goes to www.scribd.com

0% found this document useful (0 votes)
13 views12 pages

2023 Transfer Learning With Kernel Methods

Uploaded by

wokog93129
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
13 views12 pages

2023 Transfer Learning With Kernel Methods

Uploaded by

wokog93129
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 12

Article https://doi.org/10.

1038/s41467-023-41215-8

Transfer Learning with Kernel Methods


Received: 13 December 2022 Adityanarayanan Radhakrishnan1,2,3, Max Ruiz Luyten1,3, Neha Prasad1 &
Caroline Uhler 1,2
Accepted: 28 August 2023

Transfer learning refers to the process of adapting a model trained on a source


Check for updates task to a target task. While kernel methods are conceptually and computa-
tionally simple models that are competitive on a variety of tasks, it has been
unclear how to develop scalable kernel-based transfer learning methods
across general source and target tasks with possibly differing label dimensions.
In this work, we propose a transfer learning framework for kernel methods by
1234567890():,;
1234567890():,;

projecting and translating the source model to the target task. We demon-
strate the effectiveness of our framework in applications to image classifica-
tion and virtual drug screening. For both applications, we identify simple
scaling laws that characterize the performance of transfer-learned kernels as a
function of the number of target examples. We explain this phenomenon in a
simplified linear setting, where we are able to derive the exact scaling laws.

Transfer learning refers to the machine learning problem of utilizing works on transfer learning with kernels focus on applications in which
knowledge from a source task to improve performance on a target the source and target tasks have the same label sets15–20. Examples
task. Recent approaches to transfer learning have achieved tre- include predicting stock returns for a given sector based on returns
mendous empirical success in many applications, including in com- available for other sectors16 or predicting electricity consumption for
puter vision1,2, natural language processing3–5, and the biomedical certain zones of the United States based on the consumption in other
field6,7. Since transfer learning approaches generally rely on complex zones17. These methods are not applicable to general source and target
deep neural networks, it can be difficult to characterize when and why tasks with differing label dimensions, including classical transfer
they work8. Kernel methods9 are conceptually and computationally learning applications such as using a model trained to classify between
simple machine learning models that have been found to be compe- thousands of objects to subsequently classify new objects. There are
titive with neural networks on a variety of tasks, including image also various works on using kernels for multi-task learning
classification10–12 and drug screening12. Their simplicity stems from the problems21–23, which, in the context of transfer learning, assume that
fact that training a kernel method involves performing linear regres- source and target data are available at the time of training the source
sion after transforming the data. There has been renewed interest in model. These methods can be computationally expensive since they
kernels due to a recently established equivalence between wide neural involve computing matrix-valued kernels, where the number of rows/
networks and kernel methods13,14, which has led to the development of columns is equal to the number of labels. As a consequence, for a
modern, neural tangent kernels (NTKs) that are competitive with kernel trained on ImageNet3224 with 1000 possible labels, a matrix-
neural networks. Given their simplicity and effectiveness, kernel valued kernel would involve 106 times more compute than a classical
methods could provide a powerful approach for transfer learning and kernel method. Prior works also develop kernel-based methods for
also help characterize when transfer learning between a source and learning a re-weighting or transformation that captures similarities
target task would be beneficial. across source and target data distributions25–27. Such a transformation
Yet, developing scalable algorithms for transfer learning with is typically learned by solving an optimization problem that involves
kernel methods for general source and target tasks with possibly dif- materializing the full training kernel matrix, which can be computa-
fering label dimensions has been an open problem. In particular, while tionally prohibitive (e.g., for a dataset with a million samples, this
there is a standard transfer learning approach for neural networks that would require more than 3.5 terabytes of memory).
involves replacing and re-training the last layer of a pre-trained net- In this work, we present a general, scalable framework for per-
work, there is no known corresponding operation for kernels. Prior forming transfer learning with kernel methods. Unlike prior work, our

1
Massachusetts Institute of Technology, Cambridge, MA, USA. 2Broad Institute of MIT and Harvard, Cambridge, MA, USA. 3These authors contributed equally:
Adityanarayanan Radhakrishnan, Max Ruiz Luyten. e-mail: [email protected]

Nature Communications | (2023)14:5570 1


Article https://doi.org/10.1038/s41467-023-41215-8

framework enables transfer learning for kernels regardless of whether EigenPro29, thereby allowing our framework to easily scale to datasets
the source and target tasks have the same or differing label sets. Fur- such as ImageNet32 with over one million samples.
thermore, like for transfer learning methodology for neural networks, Projection is effective when the source model predictions contain
our framework allows transferring to a variety of target tasks after information regarding the target labels. We will demonstrate that this
training a kernel method only once on a source task. is the case in image classification tasks in which the predictions of a
The key components of our transfer learning framework are: Train classifier trained to distinguish between a thousand objects in
a kernel method on a source dataset and then apply the following ImageNet3224 provides information regarding the labels of images in
operations to transfer the model to the target task. other datasets, such as street view house numbers (SVHN)30; see
• Projection. We apply the trained source kernel to each sample in Fig. 1b. In particular, we will show across 23 different source and target
the target dataset and then train a secondary model on these task combinations that kernels transferred using our approach achieve
source predictions to solve the target task; see Fig. 1a. up to a 10% increase in accuracy over kernels trained on target tasks
• Translation. When the source and target tasks have the same directly.
label sets, we train a correction term that is added to the source On the other hand, translation is effective when the predictions of
model to adapt it to the target task; see Fig. 1c. the source model can be corrected to match the labels of the target
task via an additive term. We will show that this is the case in virtual
We note that while these two operations are general and can be drug screening in which a model trained to predict the effect of a drug
applied to any predictor, we focus on using them in conjunction with on one cell line can be adjusted to capture the effect on a new cell line;
kernel methods due to their conceptual simplicity, effectiveness, and see Fig. 1d. In particular, we will show that our transfer learning
flexibility, in particular given that they include infinite-width neural approach provides an improvement to prior kernel method
networks13 as a subclass. Moreover, the closed-form solutions pro- predictors12 even when transferring to cell lines and drugs not present
vided by kernel methods also enable a theoretical analysis of transfer in the source task.
learning. Projection and translation are motivated by operations that Interestingly, we observe that for both applications, image clas-
are standardly used for transfer learning using neural networks. sification and virtual drug screening, transfer learned kernel methods
Namely, projection corresponds to adding layers at the end of a neural follow simple scaling laws; i.e., how the number of available target
network trained on a source task and then training the weights in these samples effects the performance on the target task can be accurately
new layers on the target task. And when approximating a neural net- modelled. As a consequence, our work provides a simple method for
work by its linearization around the initial parameters13,28, transfer estimating the impact of collecting more target samples on the per-
learning by tuning the weights of the source model on the target task is formance of the transfer learned kernel predictors. In the simplified
equivalent to performing translation. Our formulation of projection setting of transfer learning with linear kernel methods we are able to
and translation makes these operations compatible with recent pre- mathematically derive the scaling laws, thereby providing a mathe-
conditioned gradient descent kernel regression solvers such as matical basis for the empirical observations. To do so, we obtain exact

Fig. 1 | Our framework for transfer learning with kernel methods for supervised distinguish the images of zeros from ones by using the similarity of zeros to balls
learning tasks. After training a kernel method on a source task, we transfer the and ones to poles. c Translation involves adding a correction term to the source
source model to the target task via a combination of projection and translation model, as is shown for predicting the effect of a drug on a cell line. d Translation is
operations. a Projection involves training a second kernel method on the predic- effective when the predictions of the source model can be additively corrected to
tions of the source model on the target data, as is shown for image classification match labels in the target data; e.g., the predictions of a model trained to predict
between natural images and house numbers. b Projection is effective when the the effect of drugs on one cell line may be additively adjustable to predict the effect
predictions of the source model on target examples provide useful information on new cell lines.
about target labels; e.g., a model trained to classify natural images may be able to

Nature Communications | (2023)14:5570 2


Article https://doi.org/10.1038/s41467-023-41215-8

non-asymptotic formulas for the risk of the projected and translated Throughout this work, we assume that the source and target domains
predictors. Our non-asymptotic analysis is in contrast to a large num- are equal (X s = X t ), but that the data distributions differ (Ps ≠ Pt ).
ber of prior works analyzing multitask learning algorithms31–34 and Our work is concerned with the recovery of ft by transferring a
meta-learning algorithms35,36, which provide generalization bounds model, ^ f s , that is learned by training a kernel machine on the source
establishing statistical consistency of these methods but do not pro- dataset. To enable transfer learning with kernels, we propose the use of
vide an explicit form of the risk, which is required for deriving explicit two methods, projection and translation. We first describe these
scaling laws. Overall, our work demonstrates that transfer learning methods individually and demonstrate their performance on transfer
with kernel methods between general source and target tasks is pos- learning for image classification using kernel methods. For each
sible and demonstrates the simplicity and effectiveness of the pro- method, we empirically establish scaling laws relating the quantities
posed method on a variety of important applications. ns, nt, cs, ct to the performance boost given by transfer learning, and we
also derive explicit scaling laws when ft, fs are linear maps. We then
Results utilize a combination of the two methods to perform transfer learning
In the following, we present our framework for transfer learning with in an application to virtual drug screening.
kernel methods more formally. Since kernel methods are fundamental
to this work, we start with a brief review. Transfer learning via projection
Given training examples X = ½x ð1Þ , . . . , x ðnÞ  2 Rd × n , corresponding Projection involves learning a map from source model predictions to
labels y = ½ yð1Þ , . . . , yðnÞ  2 R1 × n , a standard nonlinear approach to fit- target labels and is thus particularly suited for situations where the
ting the training data is to train a kernel machine9. This approach number of labels in the source task cs is much larger than the number
n
involves first transforming the data, fx ðiÞ gi = 1 , with a feature map, ψ, and of labels in the target task ct.
then performing linear regression. To avoid defining and working with
feature maps explicitly, kernel machines rely on a kernel function, Definition 1. Given a source dataset (Xs, ys) and a target dataset (Xt, yt),
K : Rd × Rd ! R, which corresponds to taking inner products of the the projected predictor, ^
f t , is given by:
transformed data, i.e., K(x(i), x( j)) = 〈ψ(x(i)), ψ(x( j))〉. The trained kernel
machine predictor uses the kernel instead of the feature map and is ^f ðxÞ = ^
f p ð^
f s ðxÞÞ, where ^
f p : = argmin kyt  f ð^
f s ðXt ÞÞk2 ,
t ð2Þ
given by: ff :Y s !Y t g

^f ðxÞ = αKðX , xÞ, where α = argmin ky  wK k2 , where ^ f s is a predictor trained on the source dataset. When there are
n 2 ð1Þ infinitely many possible values for the parameterized function ^ f p , we
w2R1 × n
consider the minimum norm solution.
While Definition 1 is applicable to any machine learning method,
and K n 2 Rn × n with ðK n Þi, j = Kðx ðiÞ , x ð jÞ Þ and KðX , xÞ 2 Rn with we focus on predictors ^ f s and ^
f p parameterized by kernel machines
K(X, x)i = K(x(i), x). Note that for datasets with over 105 samples, com- given their conceptual and computational simplicity. As illustrated in
puting the exact minimizer α is computationally prohibitive, and we Fig. 1a and b, projection is effective when the predictions of the source
instead use fast, approximate iterative solvers such as EigenPro29. For a model already provide useful information for the target task.
more detailed description of kernel methods see SI Note 1. Kernel-based image classifier performance improves with pro-
For the experiments in this work, we utilize a variety of kernel jection. We now demonstrate the effectiveness of projected kernel
functions. In particular,
 we consider
 the classical Laplace kernel given predictors for image classification. In particular, we first train kernels
by Kðx, x~ Þ = exp L k x  x~ k2 , which is a standard benchmark kernel to classify among 1000 objects across 1.28 million images in Ima-
that has been widely used for image classification and speech geNet32 and then transfer these models to 4 different target image
recognition29. In addition, we consider recently discovered kernels that classification datasets: CIFAR1041, Oxford 102 Flowers42, Describable
correspond to infinitely wide neural networks. While there is an Textures Datasets43, and SVHN30. We selected these datasets since they
emerging understanding that increasingly wider neural networks cover a variety of transfer learning settings, i.e. all of the CIFAR10
generalize better37,38, such models are generally computationally dif- classes are in ImageNet32, ImageNet32 contains only 2 flower classes,
ficult to train. Remarkably, recent work identified conditions under and none of DTD and SVHN classes are in ImageNet32. A full descrip-
which neural networks in the limit of infinite width implement kernel tion of the datasets is provided in Methods.
machines; the corresponding kernel is known as the Neural Tangent For all datasets, we compare the performance of 3 kernels (the
Kernel (NTK)13. In the following, we use the NTK corresponding to Laplace kernel, NTK, and CNTK) when trained just on the target task,
training an infinitely wide ReLU fully connected network13 and also the i.e. the baseline predictor, and when transferred via projection from
convolutional NTK (CNTK) corresponding to training an infinitely wide ImageNet32. Training details for all kernels are provided in Methods. In
ReLU convolutional network14. We chose to use the CNTK without Fig. 2a, we showcase the improvement of projected kernel predictors
global average pooling (GAP)14 for our experiments. While the CNTK over baseline predictors across all datasets and kernels. We observe
model with GAP as well as the models considered in39 give higher that projection yields a sizeable increase in accuracy (up to 10%) on the
accuracy on image datasets, they are computationally prohibitive to target tasks, thereby highlighting the effectiveness of this method. It is
compute for our large-scale experiments. For example, a CNTK with remarkable that this performance increase is observed even for
GAP is estimated to take 1200 GPU hours for 50k training samples11. transferring to Oxford 102 Flowers or DTD, datasets that have little to
Unlike the usual supervised learning setting where we train a no overlap with images in ImageNet32.
predictor on a single domain, we will consider the following transfer In SI Fig. S1a, we compare our results with those of a finite-width
learning setting from40, which involves two domains: (1) a source with neural network analog of the (infinite-width) CNTK where all layers of
domain X s and data distribution Ps ; and (2) a target with domain X t the source network are fine-tuned on the target task using the standard
and data distribution Pt . The goal is to learn a model for a target task cross-entropy loss44 and the Adam optimizer45. We observe that the
f t : X t ! Y t by making use of a model trained on a source task performance gap between transfer-learned finite-width neural net-
f s : X s ! Y s . We let cs and ct denote the dimensionality of Y s and Y t works and the projected CNTK is largely influenced by the perfor-
respectively, i.e. for image classification these denote the number of mance gap between these models on ImageNet32. In fact, in SI Fig. S1a,
n n
classes in the source and target. Lastly, we let ðX s , ys Þ 2 X s s × Y s s and we show that finite-width neural networks trained to the same test
nt nt
ðX t , yt Þ 2 X t × Y t denote the source and target dataset, respectively. accuracy on ImageNet32 as the (infinite-width) CNTK yield lower

Nature Communications | (2023)14:5570 3


Article https://doi.org/10.1038/s41467-023-41215-8

Fig. 2 | Analysis of transfer learning with kernels trained on ImageNet32 to 10%. b Test accuracy of the transferred and baseline predictors as a function of the
CIFAR10, Oxford 102 Flowers, DTD, and a subset of SVHN. All curves in (b, c) are number of target examples. These curves, which quantitatively describe the benefit
averaged over 3 random seeds. a Comparison of the transferred kernel predictor of collecting more target examples, follow simple logarithmic trends (R2 > . 95).
test accuracy (green) to the test accuracy of the baseline kernel predictors trained c Performance of the transferred kernel methods decreases when increasing the
directly on the target tasks (red). In all cases, the transferred kernel predictors number of source classes but keeping the total number of source examples fixed.
outperform the baseline predictors and the difference in performance is as high as Corresponding plots for DTD and SVHN are in SI Fig. S2.

performance than the CNTK when transferred to target image classi- Transfer learning via translation
fication tasks. While projection involves composing a map with the source model, the
The computational simplicity of kernel methods allows us to second component of our framework, translation, involves adding a
compute scaling laws for the projected predictors. In Fig. 2b, we map to the source model as follows.
analyze how the performance of projected kernel methods varies as
a function of the number of target examples, nt, for CIFAR10 and Definition 2. Given a source dataset (Xs, ys) and a target dataset (Xt, yt),
Oxford 102 Flowers. The results for DTD and SVHN are presented in the translated predictor, ^
f t , is given by:
SI Fig. S2a and b. For all target datasets, we observe that the accu-
racy of the projected predictors follows a simple logarithmic trend f t ðxÞ = ^
^ f s ðxÞ + ^
f c ðxÞ, where ^
f c = argmin kyt  ^
f s ðXt Þ  f ðXt Þk2 , ð3Þ
given by the curve a log nt + b for constants a, b (R2 values on all ff :X t !Y t g

datasets are above 0.95). By fitting this curve on the accuracy cor-
responding to just the smallest five values of nt, we are able to where ^ f s is a predictor trained on the source dataset. When there are
predict the accuracy of the projected predictors within 2% of the infinitely many possible values for the parameterized function ^ f c , we
reported accuracy for large values of nt (see Methods, SI Fig. S4). consider the minimum norm solution.
The robustness of this fit across many target tasks illustrates the Translated predictors correspond to first utilizing the trained
practicality of the transferred kernel methods for estimating the source model directly on the target task and then applying a correc-
number of target examples needed to achieve a given accuracy. tion, ^f c , which is learned by training a model on the corrected labels,
Additional results on the scaling laws upon varying the number of yt  ^f s ðX t Þ. Like for the projected predictors, translated predictors can
source examples per class are presented in SI Fig. S3 for transferring be implemented using any machine learning model, including kernel
between ImageNet32 and CIFAR10. In general, we observe that the methods. When the predictors ^ f s and ^
f c are parameterized by linear
performance increases as the number of source training examples models, translated predictors correspond to training a target predictor
per class increases, which is expected given the similarity of source with weights initialized by those of the trained source predictor (proof
and target tasks. in SI Note 4). We note that training translated predictors is also a new
Lastly, we analyze the impact of increasing the number of classes form of boosting46 between the source and target dataset, since the
while keeping the total number of source training examples fixed at correction term accounts for the error of the source model on the
40k. Figure 2c shows that having few samples for each class can be target task. Lastly, we note that while the formulation given in Defini-
worse than having a few classes with many samples. This may be tion 2 requires the source and target tasks to have the same label
expected for datasets such as CIFAR10, where the classes overlap with dimension, projection and translation can be naturally combined to
the ImageNet32 classes: having few classes with more examples that overcome this restriction.
overlap with CIFAR10 should be better than having many classes with Kernel-based image classifier performance improves with trans-
fewer examples per class and less overlap with CIFAR10. A similar trend lation. We now demonstrate that the translated predictors are parti-
can be observed for DTD, but interestingly, the trend differs for SVHN, cularly well-suited for correcting kernel methods to handle
indicating that SVHN images can be better classified by projecting distribution shifts in images. Namely, we consider the task of trans-
from a variety of ImageNet32 classes (see SI Fig. S2). ferring a source model trained on CIFAR10 to corrupted CIFAR10

Nature Communications | (2023)14:5570 4


Article https://doi.org/10.1038/s41467-023-41215-8

images in CIFAR10-C47. CIFAR10-C consists of the test images in source predictor already achieves an accuracy of 60.80%, the trans-
CIFAR10, but the images are corrupted by one of 19 different pertur- lated predictors achieve an accuracy of above 60% when trained on
bations, such as adjusting image contrast and introducing natural only 10 target training samples. For the examples of the contrast and
artifacts such as snow or frost. In our experiments, we select the 10k fog corruptions, Fig. 3b also shows that very few target examples allow
images of CIFAR10-C with the highest level of perturbation, and we the translated predictors to outperform the source predictors (e.g., by
reserve 9k images of each perturbation for training and 1k images for up to 5% for only 200 target examples). Overall, our results showcase
testing. In SI Fig. S5, we additionally analyze translating kernels from that translation is effective at adapting kernel methods to distribution
subsets of ImageNet32 to CIFAR10. shifts in image classification.
Again, we compare the performance of the three kernel
methods considered for projection, but along with the accuracy of Transfer learning via projection and translation in virtual drug
the translated predictor and baseline predictor, we also report the screening
accuracy of the source predictor, which is given by using the source We now demonstrate the effectiveness of projection and translation
model directly on the target task. In Fig. 3a and SI Fig. S6, we show for the use of kernel methods for virtual drug screening. A common
that the translated predictors outperform the baseline and source problem in drug screening is that experimentally measuring many
predictors on all 19 perturbations. Interestingly, even for corrup- different drug and cell line combinations is both costly and time-
tions such as contrast and fog where the source predictor is worse consuming. The goal of virtual drug screening approaches is to com-
than the baseline predictor, the translated predictor outperforms putationally identify promising candidates for experimental valida-
all other kernel predictors by up to 11%. In SI Fig. S6, we show that tion. Such approaches involve training models on existing
for these corruptions, the translated kernel predictors also out- experimental data to then impute the effect of drugs on cell lines for
perform the projected kernel predictors trained on CIFAR10. In SI which there was no experimental data.
Fig. S1b, we additionally compare with the performance of a finite- The CMAP dataset48 is a large-scale, publicly available drug screen
width analog of the CNTK by fine-tuning all layers on the target containing measurements of 978 landmark genes for 116,228 combi-
task with cross-entropy loss and the Adam optimizer. We observe nations of 20,336 drugs (molecular compounds) and 70 cell lines. This
that the translated kernel methods outperform the corresponding dataset has been an important resource for drug screening49,50. CMAP
neural networks. Remarkably kernels translated from CIFAR10 can also contains data on genetic perturbations; but in this work, we focus
even outperform fine-tuning a neural network pre-trained on on imputing the effect of chemical perturbations only. Prior work for
ImageNet32 for several perturbations (see SI Fig. S1c). In SI Fig. S7, virtual drug screening demonstrated the effectiveness of low-rank
we additionally demonstrate the effectiveness of our translation tensor completion and nearest neighbor predictors for imputing the
methodology over prior transfer learning methods using multiple effect of unseen drug and cell line combinations in CMAP51. However,
kernel learning (see Methods for further details). these methods crucially rely on the assumption that for each drug
Analogously to our analysis of the projected predictors, we there is at least one measurement for every cell line, which is not the
visualize how the accuracy of the translated predictors is affected by case when considering new chemical compounds. To overcome this
the number of target examples, nt, for a subset of corruptions shown in issue, recent work12 introduced kernel methods for drug screening
Fig. 3b. We observe that the performance of the translated predictors using the NTK to predict gene expression vectors from drug and cell
is heavily influenced by the performance of the source predictor. For line embeddings, which capture the similarity between drugs and
example, as shown in Fig. 3b for the brightness perturbation, where the cell lines.

Fig. 3 | Transferring kernel methods from CIFAR10 to adapt to 19 different the baseline kernel method when the source predictor exhibits a decrease in per-
corruptions in CIFAR10-C. a Test accuracy of baseline kernel method (red), using formance. Additional results are presented in SI Fig. S6. b Performance of the
source predictor given by directly applying the kernel trained on CIFAR10 to transferred and baseline kernel predictors as a function of the number of target
CIFAR10-C (gray), and transferred kernel method (green). The transferred kernel examples. The transferred kernel method can outperform both source and baseline
method outperforms the other models on all 19 corruptions and even improves on predictors even when transferred using as little as 200 target examples.

Nature Communications | (2023)14:5570 5


Article https://doi.org/10.1038/s41467-023-41215-8

In the following, we demonstrate that the NTK predictor can be Figure 4 a and b show that the transferred kernel predictors
transferred to improve gene expression imputation for drug and cell outperform both, the baseline model from12 as well as imputation by
line combinations, even in cases where neither the particular drug nor mean (over each cell line) gene expression across three different
the particular cell line were available when training the source model. metrics (R2, cosine similarity, and Pearson r value) on both tasks (i.e.,
To utilize the framework of12, we use the control gene expression transferring to drugs that were seen in the source task as well as
vector as cell line embedding and the 1024-bit circular fingerprints completely new drugs). All metrics and training details are presented
from52 as drug embedding. All pre-processing of the CMAP gene in Methods. Interestingly, the transferred kernel methods provide a
expression vectors is described in Methods. For the source task, we boost over the baseline kernel methods even when transferring to new
train the NTK to predict gene expression for the 54,444 drug and cell cell lines and new drugs. But as expected, we note that the increase in
line combinations corresponding to the 65 cell lines with the least drug performance is greater when transferring to drug and cell line com-
availability in CMAP. We then impute the gene expression for each of binations for which the drug was available in the source task. Figure 4c
the 5 cell lines (A375, A549, MCF7, PC3, VCAP) with the most drug and d show that the transferred kernels again follow simple logarith-
availability. We chose these data splits in order to have sufficient target mic scaling laws (fitting a logarithmic model to the red and green
samples to analyze model performance as a function of the number of curves yields R2 > 0.9). We note that the transferred NTKs have better-
target samples. In our analysis of the transferred NTK, we always scaling coefficients than the baseline models, thereby implying that
consider transfer to a new cell line, and we stratify by whether a drug in the performance gap between the transferred NTK and the baseline
the target task was already available in the source task. For this NTK grows as more target examples are collected until the perfor-
application we combine projection and translation into one predictor mance of the transferred NTK saturates at its maximum possible value.
as follows. In Fig. 4e and f, we visualize the performance of the transferred NTK in
relation to the top 2 principal components of gene expression for drug
Definition 3. Given a source dataset (Xs, ys) and a target dataset (Xt, yt), and cell line combinations. We generally observe that the performance
the projected and translated predictor, ^f pt , is given by: of the NTK is lower for cell and drug combinations that are further
from the control, i.e., the unperturbed state. Plots for the other 3 cell
h i  h i2 lines are presented in SI Fig. S8. In Methods and SI Fig. S9, we show that
^f ðxÞ = ^f ^f ðxÞ j x ,where ^  
pt s f = argmin yt  f ^ f s ðXt Þ j Xt  ,
f :Y s × X t !Y s
this approach can also be used for other transfer learning tasks related
to virtual drug screening. In particular, we show that the imputed gene
ð4Þ
expression vectors can be transferred to predict the viability of a drug
h i
and cell line combination in the large-scale, publicly available Cancer
where ^f s is a predictor trained on the source dataset and ^ f s ðxÞ j x 2
Dependency Map (DepMap) dataset53.
Y s × X t is the concatenation of ^ f s ðxÞ and x.
Note that if we omit x, Xt in the concatenation above, we get the
projected predictor, and if f is additive in its arguments, i.e., if
Theoretical analysis of projection and translation in the linear
setting
f ð½ ^f ðxÞ j xÞ = ^f ðxÞ + x, we get the translated predictor. Generally, ^
s s f ðxÞ
s In the following, we provide explicit scaling laws for the performance
and x can correspond to different modalities (e.g., class label vectors of projected and translated kernel methods in the linear setting,
and images), but in the case of drug screening, both correspond to thereby providing a mathematical basis for the empirical observations
gene expression vectors of the same dimension. Thus, combining in the previous sections.
projection and translation is natural in this context.

Fig. 4 | Transferring the NTK trained to predict gene expression for given drug number of target examples and exhibits a better scaling coefficient than the
and cell line combinations in CMAP to new drug and cell line combinations. a, b baselines. The results are averaged over 5 cell lines. e, f Visualization of the per-
The transfer learned NTK (green) outperforms imputation by mean over cell line formance of the transferred NTK in relation to the top two principal components
(gray) and previous NTK baseline predictors from12 across R2, cosine similarity, and (denoted PC1 and PC2) of gene expression for target drug and cell line combina-
Pearson r metrics. All results are averaged over the performance on 5 cell lines and tions. The performance of the NTK is generally lower for cell and drug combina-
are stratified by whether or not the target data contains drugs that are present in tions that are further from the control gene expression for a given cell line.
the source data. Error bars indicate standard deviation. c, d The transferred kernel Visualizations for the remaining 3 cell lines are presented in SI Fig. S8.
method performance follows a logarithmic trend (R2 > . 9) as a function of the

Nature Communications | (2023)14:5570 6


Article https://doi.org/10.1038/s41467-023-41215-8

We derive scaling laws for projected predictors in the following Corollary 1 not only formalizes several intuitions regarding
linear setting. We assume that X = Rd , Y s = Rcs , Y t = Rct and that fs and transfer learning, but also theoretically corroborates surprising
ft are linear maps, i.e., f s = ωs 2 Rcs × d and f t = ωt 2 Rct × d . The fol- dependencies on the number of source examples, target examples,
lowing results provide a theoretical foundation for the empirical and source classes that were empirically observed in Fig. 2 for kernels
observations regarding the role of the number of source classes and and in54 for convolutional networks. First, Corollary 1a implies that
the number of source samples for transfer learning shown in Fig. 2 as increasing the number of source examples is always beneficial for
well as in54. In particular, we will derive scaling laws for the risk, or transfer learning when the source and target tasks are related (ε ≈ 0),
expected test error, of the projected predictor as a function of the which matches intuition. Next, Corollary 1b implies that increasing the
number of source examples, ns, target examples, nt, and number of number of source classes while leaving the number of source examples
source classes, cs. We note that the risk of a predictor is a standard fixed can decrease performance (i.e. if 2S − 1 − ST > 0), even for similar
object of study for understanding generalization in statistical learning source and target tasks satisfying ε ≈ 0. This matches the experiments
theory55 and defined as follows. in Fig. 2c, where we observed that increasing the number of source
classes when keeping the number of source examples fixed can be
Definition 4. Let P be a probability density on Rd and let x, x ðiÞ ∼ i:i:d:P detrimental to the performance. This is intuitive for transferring from
for i = 1, 2, …n. Let X = ½x ð1Þ , . . . , x ðnÞ  2 Rd × n and y = ½w* x ð1Þ , . . . w* x ðnÞ  2 ImageNet32 to CIFAR10, since we would be adding classes that are not
Rc × n for w* 2 Rc × d . The risk of a predictor w ^ trained on the samples as useful for predicting objects in CIFAR10. However, note that such
(X, y) is given by behavior is a priori unexpected given generalization bounds for multi-
task learning problems31–33, which show that increasing the number of
RðwÞ ^ 2F :
^ = Ex,X ½kw* x  wxk ð5Þ tasks decreases the overall risk. Our non-asymptotic analysis demon-
strates that such decrease in risk only holds as the number of classes
By understanding how the risk scales with the number of source and the number of examples per class increase. Corollary 1c implies
examples, target examples, and source classes, we can characterize the that when the source and target task are similar and the number of
settings in which transfer learning is beneficial. As is standard in ana- source classes is less than the data dimension, transfer learning with
lyses of the risk of over-parameterized linear regression56–59, we con- the projected predictor is always better than training only on the target
sider the risk of the minimum norm solution given by task. Moreover, if the number of source classes is finite (C = 0), Cor-
ollary 1c implies that the risk of the projected predictor decreases an
w ^ = yX y ,
^ = argmin ky  wX k2F , i:e:, w ð6Þ order of magnitude faster than the baseline predictor. In particular, the
w
risk of the baseline predictor is given by (1 − T)∥ωt∥2, while that of

where X is the Moore-Penrose inverse of X. Theorem 1 establishes a the projected predictor is given by (1−T)2∥ωt∥2. Note also that when the
closed form for the risk of the projected predictor ω ^ s , thereby giving
^ pω number of target samples is small relative to the dimension, Corollary
a closed form for the scaling law for transfer learning in the linear 1c implies that decreasing the number of source classes has minimal
setting; the proof is given in SI Note 2. effect on the risk. Lastly, Corollary 1d implies that when T and C are
small, the risk of the projected predictor is roughly that of a baseline
Theorem 1. Let X = Rd , Y s = Rcs , Y t = Rct , and let ω ^ s = ys X ys and predictor trained on twice the number of samples.
ω ^ s X t Þy . Assuming that Ps and Pt are independent, isotropic
^ p = yt ðω We derive scaling laws for translated predictors in the linear
distributions on Rd , then the risk Rðω ^ pω
^ s Þ is given by setting. Analogously to the case for projection, we analyze the
risk of the translated predictor when ω ^ s is the minimum norm
h  n  i
^ pω
Rðω ^ sÞ = C 1 + C 2 K 1 1  t + 1  C 1  C 2 jjωt jj2F + C 2 K 2 ε, ð7Þ solution to k ys  ωX s k2F and ω ^ c is the minimum norm solution
d to k yt  ω^ s X t  ωX t k2F .

where ε = jjωt ðI d × d  ωys ωs Þjj2F and Theorem 2. Let X = Rd , Y s = Rcs , Y t = Rct , and let ω ^t = ω
^s + ω ^ c where
^ s = ys X ys and ω
ω ^ s X t ÞX yt . Assuming that Ps and Pt are inde-
^ c = ðyt  ω
 
ns cs ðd  ns Þ n dðns + 1Þ  2 pendent, isotropic distributions on Rd , then the risk Rðω ^ t Þ is given by
C1 = , C2 = s ,
dðd  1Þðd + 2Þ dðd  1Þðd + 2Þ " !#
nt ðd  cs Þ n nt ðd  nt Þ kωs  ωt k2F  n kω  ωt k2F
K1 = 1  , K2 = t + : ^tÞ =
Rðω + 1 s 1 s ^ b Þ,
Rðω ð8Þ
ðd  1Þðd + 2Þ d ðd  1Þðd + 2Þ kωt kF
2 d kωt k2F

The ε term in Theorem 1 quantifies the similarity between the where ω ^ b = yt X yt is the baseline predictor.
source and target tasks. For example, if there exists a linear map ωp The proof is given in SI Note 5. Theorem 2 formalizes several
such that ωpωs = ωt, then ε = 0. In the context of classification, this can intuitions regarding when translation is beneficial. In particular, we
occur if the target classes are a strict subset of the source classes. Since first observe that if the source model ωs is recovered exactly (i.e.
transfer learning is typically performed between source and target ns = d), then the risk of the translated predictor is governed by the
tasks that are similar, we expect ε to be small. To gain more insights distance between the oracle source model and target model, i.e.,
into the behavior of transfer learning using the projected predictor, ∥ωs − ωt∥. Hence, the translated predictor generalizes better than the
the following corollary considers the setting where d → ∞ in Theorem 1; baseline predictor if the source and target models are similar. In par-
the proof is given in SI Note 3. ticular, by flattening the matrices ωs and ωt into vectors and assuming
∥ωs∥ = ∥ωt∥, the translated predictor outperforms the baseline pre-
Corollary 1. Let S = nds , T = ndt , C = cds and assume ∥ωt∥F = Θ(1). Under the dictor if the angle between the flattened ωs and ωt is less than π4. On the
setting of Theorem 1, if S, T, C < ∞ as d → ∞, then: other hand, when there are no source samples, the translated predictor
a. Rðω ^ pω^ s Þ is monotonically decreasing for S ∈ [0, 1] if is exactly the baseline predictor and the corresponding risks are
ε < (1 − C)∥ωt∥F. equivalent. In general, we observe that the risk of the translated pre-
b. If 2S − 1 − ST < 0, then Rðω ^ pω
^ s Þ decreases as C increases. dictor is simply a weighted average between the baseline risk and the
c. If S = 1, then Rðω ^ pω^ s Þ = ð1  T + TCÞRðω ^ t Þ + εTð2  TÞ. risk in which the source model is recovered exactly.
d. If S=1 and T, C = Θ(δ), then Rðω^ pω
^ s Þ = ð1  2TÞ k Comparing Theorem 2 to Theorem 1, we note that the projected
2
ωt k2F + 2Tε + Θ ðδ Þ: predictor and the translated predictor generalize based on different

Nature Communications | (2023)14:5570 7


Article https://doi.org/10.1038/s41467-023-41215-8

quantities. In particular, in the case when ns = d, the risk of the trans- improve the computation time for such kernels, which would allow
lated predictor is a constant multiple of the baseline risk while the risk training better convolutional kernels on large-scale image datasets,
of the projected predictor is a multiple of the baseline risk that which could then be transferred using our framework to improve the
decreases with nt. Hence, depending on the distance between ωs and performance on a variety of downstream tasks.
ωt, the translated predictor can outperform the projected predictor or
vice-versa. As a simple example, consider the setting where Using kernel methods to adapt to distribution shifts
ωs = ωt, ns = d, and nt, cs < d; then the translated predictor achieves 0 Our work demonstrates that kernels pre-trained on a source task
risk while the projected predictor achieves non-zero risk. When can adapt to a target task with distribution shift when given even
Y s = X t , we suggest combining the projected and translated pre- just a few target training samples. This opens novel avenues for
dictors, as we did in the case of virtual drug screening. Otherwise, our applying kernel methods to tackle distribution shift in a variety of
results suggest using the translated predictor for transfer learning domains, including healthcare or genomics in which models need to
problems involving distribution shift in the features but no difference be adapted to handle shifts in cell lines, populations, batches, etc. In
in the label sets, and the projected predictor otherwise. the context of virtual drug screening, we showed that our transfer
learning approach could be used to generalize to new cell lines. The
Discussion scaling laws described in this work may provide an interesting
In this work, we developed a framework that enables transfer learning avenue to understand how many samples are required in the target
with kernel methods. In particular, we introduced the projection and domain for more complex domain shifts, such as from a model
translation operations to adjust the predictions of a source model to a organism like mouse to humans, a problem of great interest in the
specific target task: While projection involves applying a map directly pharmacological industry.
to the predictions given by the source model, translation involves
adding a map to the predictions of a source model. We demonstrated Methods
the effectiveness of the transfer learned kernels on image classification Overview of image classification datasets
and virtual drug screening tasks. Namely, we showed that transfer For projection, we used ImageNet32 as the source dataset and
learning increased the performance of kernel-based image classifiers CIFAR10, Oxford 102 Flowers, DTD, and a subset of SVHN as the target
by up to 10% over training such models directly on the target task. datasets. For all target datasets, we used the training and test splits
Interestingly, we found that transfer-learned convolutional kernels given by the PyTorch library62. For ImageNet32, we used the training
performed comparably to transfer learning using the corresponding and test splits provided by the authors24. An overview of the number of
finite-width convolutional networks. In virtual drug screening, we training and test samples used from each of these datasets is out-
demonstrated that the transferred kernel methods provided an lined below.
improvement over prior work12, even in settings where none of the 1. ImageNet32 contains 1, 281, 167 training images across 1000
target drug and cell lines were present in the source task. For both classes and 50k images for validation. All images are of
applications, we analyzed the performance of the transferred kernel size 32 × 32 × 3.
model as a function of the number of target examples and observed 2. CIFAR10 contains 50k training images across 10 classes and 10k
empiricallly that the transferred kernel followed a simple logarithmic images for validation. All images are of size 32 × 32 × 3.
trend, thereby enabling predicting the benefit of collecting more tar- 3. Oxford 102 Flowers contains 1020 training images across 102
get examples on model performance. Lastly, we mathematically classes and 6149 images for validation. Images were resized to
derived the scaling laws in the linear setting, thereby providing a the- 32 × 32 × 3 for the experiments.
oretical foundation for the empirical observations. We end by dis- 4. DTD contains 1880 training images across 47 classes and 1880
cussing various consequences as well as future research directions images for validation. Images were resized to size 32 × 32 × 3 for
motivated by our work. experiments.
5. SVHN contains 73257 training images across 10 classes and 26302
Benefit of pretraining kernel methods on large datasets images for validation. All images are of size 32 × 32 × 3. In Fig. 2, we
A key contribution of our work is enabling kernels trained on large used the same 500 training image subset for all experiments.
datasets to be transferred to a variety of downstream tasks. As is the
case for neural networks, this allows pre-trained kernel models to be Training and architecture details
saved and shared with downstream users to improve their applications Model descriptions.
of interest. A key next step to making these models easier to save and 1. Laplace Kernel: For samples x, x~ , and bandwidth parameter L, the
share is to reduce their reliance on storing the entire training set, such kernel is of the form:
as by using coresets60. We envision that by using such techniques in
conjunction with modern advances in kernel methods, the memory kx  x~ k2
exp  :
and runtime costs could be drastically reduced. L

Reducing kernel evaluation time for state-of-the-art convolu- For our experiments, we used a bandwidth of L = 10 as in63, selected
tional kernels through cross-validation.
In this work, we demonstrated that it is possible to train convolutional 2. NTK: We used the NTK corresponding to an infinite width ReLU
kernel methods on datasets with over 1 million images. In order to train fully connected network with 5 hidden layers. We chose this depth
such models, we resorted to using the CNTK of convolutional net- as it gave superior performance on image classification task
works with a fully connected last layer. While other architectures, such considered in64.
as the CNTK of convolutional networks with a global average pooling 3. CNTK: We used the CNTK corresponding to an infinite width
last layer, have been shown to achieve superior performance on ReLU convolutional network with 6 convolutional layers followed
CIFAR1014, training such kernels on 50k images from CIFAR10 is esti- by a fully connected layer. All convolutional layers used filters of
mated to take 1200 GPU hours61, which is more than three orders of size 3 × 3. The first 5 convolutional layers used a stride size of 2 to
magnitude slower than the kernels used in this work. The main com- downsample the image representations. All convolutional layers
putational bottleneck for using such improved convolution kernels is used zero padding. The CNTK was computed using the Neural
evaluating the kernel function itself. Thus an important problem is to Tangents library61.

Nature Communications | (2023)14:5570 8


Article https://doi.org/10.1038/s41467-023-41215-8

4. CNN: We compare the CNTK to a finite-width CNN of the same of classifying cars and deer in CIFAR10 (source) and transferring to the
architecture that has 16 filters in the first layer, 32 filters in the 19 corruptions in CIFAR10-C (target). The source task contains 10, 000
second layer, 64 filters in the third layer, 128 filters in the fourth training samples and 2000 test samples, while the 19 target tasks each
layer, and 256 filters in the fifth and sixth layers. In all experiments, contain 1000 training samples and 1000 test samples. The multiple
the CNN was trained using Adam with a learning rate of 10−4. Our kernel learning algorithms learn combinations of a Laplace kernel with
choice of learning rate is based on its effectiveness in prior bandwidth  10, a Gaussian kernel of the form
works65,66. K G ðx,zÞ = exp γ k x  zk2 with γ = 0.001, and the linear kernel
KL(x, z) = xTz. We choose the bandwidth for the Laplace kernel from29
Details for projection experiments. For all kernels trained on Ima- and the value of γ for the Gaussian kernel so that the entries of the
geNet32, we used EigenPro29. For all models, we trained until the Gaussian kernel are on the same order of magnitude as the Laplace
training accuracy was greater than 99%, which was at most 6 epochs of kernel for this task. For each kernel learning algorithm, we train a
EigenPro. For transfer learning to CIFAR10, Oxford 102 Flowers, DTD, source model and then compare the following three models on the
and SVHN, we applied a Laplace kernel to the outputs of the trained target task: (1) the baseline model in which we use the kernel learning
source model. For CIFAR10 and DTD, we solved the kernel regression algorithm directly on the target task; (2) the transfer learned kernel
exactly using NumPy67. For DTD and SVHN, we used ridge regulariza- model in which we use the weights from the source task to combine
tion with a coefficient of 10−4 to avoid numerical issues with solving the kernels on the target task; and (3) translating the learned kernel on
exactly. The CNN was trained for at most 500 epochs on ImageNet32, the source task using our translation methodology. As shown in SI
and the transferred model corresponded to the one with highest Fig. S7, the transfer learned kernel outperforms the baseline kernel for
validation accuracy during this time. When transfer learning, we fine- almost all multiple kernel learning algorithms (except FHeuristic), and
tuned all layers of the CNN for up to 200 epochs (again selecting the it is outperformed by our translation methodology in all cases.
model with the highest validation accuracy on the target task).
Remark. We presented a comparison on this simple binary classifica-
Details for translation experiments. For transferring kernels from tion task for the following computational reasons. First, we considered
CIFAR10 to CIFAR-C, we simply solved kernel regression exactly (no binary classification, since prior multiple kernel learning approaches
ridge regularization term). For the corresponding CNNs, we trained implemented in MKLPy scale poorly to multi-class problems. While
the source models on CIFAR10 for 100 epochs and selected the model there is no computational price to be paid for our method for multi-
with the best validation performance. When transferring CNNs to class classification, prior methods build one kernel per class and thus
CIFAR-C, we fine-tuned all layers of the CNN for 200 epochs and require 10 times more compute and memory when using all 10 classes
selected the model with the best validation accuracy. When translating in CIFAR10. Secondly, we compared with only translation and not
kernels from ImageNet32 to CIFAR10 in SI Fig. S5, we used the fol- projection since prior multiple kernel learning methods scale poorly to
lowing aggregated class indices in ImageNet32 to match the classes in the ImageNet32 dataset used in our projection experiments. Namely,
CIFAR10: multiple kernel learning methods require materializing the kernel
1. plane = {372, 230, 231, 232} matrix, which for ImageNet32 would take up more than 3.5 terabytes of
2. car = {265, 266, 267, 268 } memory as compared to 128 gigabytes for our method.
3. bird = {383, 384, 385, 386}
4. cat = {8, 10, 11, 55} Projection scaling laws
5. deer = {12, 9, 57} For the curves showing the performance of the projected predictor as
6. dog = {131, 132, 133, 134} a function of the number of target examples in Fig. 2b and SI Fig. S2a, b,
7. frog = {499, 500, 501, 494} we performed a scaling law analysis. In particular, we used linear
8. horse = {80, 39} regression to fit the coefficients a, b of the function y = alog2 x + b to
9. ship = {243, 246, 247, 235} the points from each of the curves presented in the figures. Each curve
10. truck = {279, 280, 281, 282}. in these figures has 50 evenly spaced points and all accuracies are

Details for virtual drug screening. We used the NTK corresponding to a 1 hidden layer ReLU fully connected network with an offset term. The same model was used in ref. 12. We solved kernel ridge regression when training the source models, baseline models, and transferred models. For the source model, we used ridge regularization with a coefficient of 1000. To select this ridge term, we used a grid search over {1, 10, 100, 1000, 10000} on a random subset of 10k samples from the source data. We used a ridge term of 1000 when transferring the source model to the target data and a term of 100 when training the baseline model. We again tuned the ridge parameter for these models over the same set of values but on a random subset of 1000 examples for one cell line (A549) from the target data. We used 5-fold cross-validation for the target task and reported the metrics computed across all folds.
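As a sketch of this setup, the snippet below builds the NTK of a 1-hidden-layer ReLU network with the neural-tangents library (ref. 61) and runs kernel ridge regression with the grid search over {1, 10, 100, 1000, 10000} described above. The weight/bias standard deviations, the use of bias terms to stand in for the offset, and the validation criterion are assumptions rather than details taken from the text.

```python
# Hedged sketch: NTK kernel ridge regression with a grid-searched ridge term.
import numpy as np
from neural_tangents import stax

# NTK of a 1-hidden-layer ReLU fully connected network; the hidden width is
# irrelevant for the infinite-width kernel, and b_std > 0 plays the role of
# an offset/bias term (an assumption about how the offset is implemented).
_, _, kernel_fn = stax.serial(
    stax.Dense(512, W_std=1.0, b_std=1.0),
    stax.Relu(),
    stax.Dense(1, W_std=1.0, b_std=1.0),
)

def fit_krr(X_train, Y_train, ridge):
    K = np.array(kernel_fn(X_train, X_train, "ntk"))
    return np.linalg.solve(K + ridge * np.eye(len(X_train)), Y_train)

def predict_krr(alpha, X_train, X_test):
    return np.array(kernel_fn(X_test, X_train, "ntk")) @ alpha

def grid_search_ridge(X_tr, Y_tr, X_val, Y_val, grid=(1, 10, 100, 1000, 10000)):
    # choose the ridge coefficient with the lowest validation mean-squared error
    def val_mse(ridge):
        alpha = fit_krr(X_tr, Y_tr, ridge)
        return float(np.mean((predict_krr(alpha, X_tr, X_val) - Y_val) ** 2))
    return min(grid, key=val_mse)
```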

Comparison with multiple kernel learning approaches. In SI Fig. S7, we compare our translation approach to the approach of directly using multiple kernel learning algorithms such as Centered Kernel Alignment (CKA; ref. 68), EasyMKL (ref. 69), FHeuristic (ref. 70), and Proportionally Weighted Multiple Kernels (PWMK; ref. 71) to learn a kernel on the source task and train the target model using the learned kernel. Due to the computational limitations of these prior methods, we only consider a restricted subtask, for two reasons. Firstly, the multiple kernel learning methods implemented in MKLPy scale poorly to multi-class problems. While there is no computational price to be paid for our method for multi-class classification, prior methods build one kernel per class and thus require 10 times more compute and memory when using all 10 classes in CIFAR10. Secondly, we compared with only translation and not projection since prior multiple kernel learning methods scale poorly to the ImageNet32 dataset used in our projection experiments. Namely, multiple kernel learning methods require materializing the kernel matrix, which for ImageNet32 would take up more than 3.5 terabytes of memory as compared to 128 gigabytes for our method.
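To make the memory argument concrete, the size of a dense kernel matrix grows quadratically in the number of source examples. The helper below just computes n² times the bytes per entry; the precision and exact sample counts behind the figures quoted above are not stated in the text, so its output is only indicative.

```python
def kernel_matrix_terabytes(n_samples: int, bytes_per_entry: int = 4) -> float:
    """Size of a dense n x n kernel matrix in terabytes."""
    return n_samples ** 2 * bytes_per_entry / 1e12

# For a source task with on the order of a million examples (as in ImageNet32),
# a dense kernel matrix runs to several terabytes, which is why methods that
# must materialize it do not scale to our projection experiments.
```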

Projection scaling laws
For the curves showing the performance of the projected predictor as a function of the number of target examples in Fig. 2b and SI Fig. S2a, b, we performed a scaling law analysis. In particular, we used linear regression to fit the coefficients a, b of the function y = a log2(x) + b to the points from each of the curves presented in the figures. Each curve in these figures has 50 evenly spaced points and all accuracies are averaged over 3 seeds at each point. The R2 values for each of the fits are presented in SI Fig. S4. Overall, we observe that all values are above 0.944 and are higher than 0.99 for CIFAR10 and SVHN, which have more than 2000 target training samples. Moreover, by fitting the same function on the first 5 points from these curves for CIFAR10, we are able to predict the accuracy on the last point of the curve within 2% of the reported accuracy.
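A minimal sketch of this fit is shown below; the arrays of target-sample counts and accuracies stand in for the values read off each curve.

```python
# Fit y = a*log2(x) + b by least squares and report the R^2 of the fit.
import numpy as np

def fit_log2_scaling_law(num_target_examples, accuracies):
    x = np.log2(np.asarray(num_target_examples, dtype=float))
    y = np.asarray(accuracies, dtype=float)
    a, b = np.polyfit(x, y, deg=1)            # linear regression on log2(x)
    ss_res = np.sum((y - (a * x + b)) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    r_squared = 1.0 - ss_res / ss_tot
    return a, b, r_squared

# Extrapolation check: fit on the first few points of a curve and predict the last.
# a, b, _ = fit_log2_scaling_law(x_points[:5], acc_points[:5])
# predicted_final = a * np.log2(x_points[-1]) + b
```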

Preprocessing for CMAP data
While CMAP contains 978 landmark genes, we removed all genes that were 1 upon log2(x + 1) scaling the data. This eliminates 135 genes and removes batch effects identified in ref. 50 for each cell line. Following the methodology of ref. 50, we also removed all perturbations with dose less than 0 and used only the perturbations that had an associated simplified molecular-input line-entry system (SMILES) string, which resulted in a total of 20,336 perturbations. Following ref. 50, for each of the 116,228 observed drug and cell type combinations we then averaged the gene expression over all the replicates.
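A hedged pandas sketch of these filtering steps is below. The column names (dose, SMILES, cell line) are hypothetical, and the gene filter interprets "were 1 upon log2(x + 1) scaling" as genes whose scaled value equals 1 in every sample, which is our assumption rather than a detail stated in the text.

```python
# Illustrative CMAP preprocessing pipeline; column names are assumptions.
import numpy as np
import pandas as pd

def preprocess_cmap(expr: pd.DataFrame, meta: pd.DataFrame) -> pd.DataFrame:
    """expr: samples x genes matrix of raw values; meta: per-sample annotations."""
    log_expr = np.log2(expr + 1)

    # drop genes whose log2(x + 1) values equal 1 across all samples
    keep_genes = ~(log_expr == 1).all(axis=0)
    log_expr = log_expr.loc[:, keep_genes]

    # keep perturbations with nonnegative dose and an associated SMILES string
    mask = (meta["dose"] >= 0) & meta["smiles"].notna()
    log_expr, meta = log_expr.loc[mask.values], meta.loc[mask.values]

    # average replicates for each observed (drug, cell line) combination
    return log_expr.groupby([meta["smiles"].values, meta["cell_line"].values]).mean()
```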

Metrics for evaluating virtual drug screening
Let $\hat{y} \in \mathbb{R}^{n \times d}$ denote the predicted gene expression vectors and let $y^{*} \in \mathbb{R}^{n \times d}$ denote the ground truth. Let $\bar{y}^{(i)} = \frac{1}{d} \sum_{j=1}^{d} y_j^{(i)}$. Let $\hat{y}_v, y^{*}_v \in \mathbb{R}^{dn}$ denote vectorized versions of $\hat{y}$ and $y^{*}$. We use the same three metrics as those considered in refs. 12,51. All evaluation metrics have a maximum value of 1 and are defined below.
1. Pearson r value:
$$r = \frac{\langle \hat{y}_v, y^{*}_v \rangle}{\lVert \hat{y}_v \rVert_2 \, \lVert y^{*}_v \rVert_2}.$$
2. Mean R2:
$$R^2 = \frac{1}{n} \sum_{i=1}^{n} \left( 1 - \frac{\sum_{j=1}^{d} \bigl(\hat{y}_j^{(i)} - y_j^{*(i)}\bigr)^2}{\sum_{j=1}^{d} \bigl(y_j^{*(i)} - \bar{y}^{(i)}\bigr)^2} \right).$$
3. Mean Cosine Similarity:
$$c = \frac{1}{n} \sum_{i=1}^{n} \frac{\langle \hat{y}^{(i)}, y^{*(i)} \rangle}{\lVert \hat{y}^{(i)} \rVert_2 \, \lVert y^{*(i)} \rVert_2}.$$
We additionally subtract out the mean over cell type before computing cosine similarity to avoid inflated cosine similarity arising from points far from the origin.
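A NumPy sketch of these three metrics follows; the optional per-cell-type mean subtraction assumes a vector of cell-type labels aligned with the rows of the matrices, which is an assumption about how the data are organized rather than something specified above.

```python
# y_hat, y_true: (n, d) arrays of predicted / ground-truth gene expression.
import numpy as np

def pearson_r(y_hat, y_true):
    # uncentered correlation of the vectorized matrices, as in the formula above
    a, b = y_hat.ravel(), y_true.ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def mean_r2(y_hat, y_true):
    ss_res = np.sum((y_hat - y_true) ** 2, axis=1)
    ss_tot = np.sum((y_true - y_true.mean(axis=1, keepdims=True)) ** 2, axis=1)
    return float(np.mean(1.0 - ss_res / ss_tot))

def mean_cosine_similarity(y_hat, y_true, cell_types=None):
    y_hat = np.asarray(y_hat, dtype=float).copy()
    y_true = np.asarray(y_true, dtype=float).copy()
    if cell_types is not None:
        # subtract the per-cell-type mean to avoid inflated similarities
        for ct in np.unique(cell_types):
            idx = np.asarray(cell_types) == ct
            y_hat[idx] -= y_hat[idx].mean(axis=0)
            y_true[idx] -= y_true[idx].mean(axis=0)
    num = np.sum(y_hat * y_true, axis=1)
    den = np.linalg.norm(y_hat, axis=1) * np.linalg.norm(y_true, axis=1)
    return float(np.mean(num / den))
```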

DepMap analysis
To provide another application of our framework in the context of virtual drug screening, we used projection to transfer the kernel methods trained on imputing gene expression vectors in CMAP to predicting the viability of a drug and cell line combination in DepMap (ref. 53). Viability scores in DepMap are real values indicating how lethal a drug is for a given cancer cell line (negative viability indicates cell death). To transfer from CMAP to DepMap, we trained a kernel method to predict the gene expression vectors for 55,462 cell line and drug combinations for the 64 cell lines from CMAP that do not overlap with DepMap. We then used projection to transfer the model to the 6 held-out cell lines present in both CMAP and DepMap, which are PC3, MCF7, A375, A549, HT29, and HEPG2. Analogously to our analysis of CMAP, we stratified the target dataset by drugs that appear in both the source and target tasks (9726 target samples) and drugs that are only found in the target task but not in the source task (2685 target samples). For this application, we found that Mol2Vec (ref. 72) embeddings of drugs outperformed 1024-bit circular fingerprints. We again used a 1-hidden layer ReLU NTK with an offset term for this analysis and solved kernel ridge regression with a ridge coefficient of 100.
SI Fig. S9a shows the performance of the projected predictor as a function of the number of target samples when transferring to a target task with drugs that appear in the source task. All results are averaged over 5 folds of cross-validation and across 5 random seeds for the subset of target samples considered in each fold. It is apparent that performance is greatly improved when there are fewer than 2000 samples, thereby highlighting the benefit of the imputed gene expression vectors in this setting. Interestingly, as in all the previous experiments, we find a clear logarithmic scaling law: fitting the coefficients of the curve y = a log2(x) + b to the 76 points on the graph yields an R2 of 0.994, and fitting the curve to the first 10 points lets us predict the R2 for the last point on the curve within 0.03. SI Fig. S9b shows how the performance on the target task is affected by the number of genes predicted in the source task. Again, performance is averaged over 5-fold cross-validation and across 5 seeds per fold. When transferring to drugs that were available in the source task, performance monotonically increases when predicting more genes. On the other hand, when transferring to drugs that were not available in the source task, performance begins to degrade when increasing the number of predicted genes. This is intuitive, since not all genes would be useful for predicting the effect of an unseen drug and could add noise to the prediction problem upon transfer learning.

Hardware details
All experiments were run using two servers. One server had 128GB of CPU random access memory (RAM) and 2 NVIDIA Titan XP GPUs, each with 12GB of memory. This server was used for the virtual drug screening experiments and for training the CNTK on ImageNet32. The second server had 128GB of CPU RAM and 4 NVIDIA Titan RTX GPUs, each with 24GB of memory. This server was used for all the remaining experiments.

Data availability
All datasets considered in this work are publicly available. The standard image classification datasets considered in this work are available directly through the PyTorch library (ref. 62). CMap data is available through the following website https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE92742, and we used the level 2 data given in the file GSE92742_Broad_LINCS_Level2_GEX_epsilon_n1269922x978.gctx. DepMap data is available through the following website https://depmap.org/repurposing/, and we used the primary screen data.

Code availability
All code is available at https://github.com/uhlerlab/kernel_tf (ref. 73).

References
1. Razavian, A. S., Azizpour, H., Sullivan, J. & Carlsson, S. CNN features off-the-shelf: An astounding baseline for recognition. In IEEE Conference on Computer Vision and Pattern Recognition Workshops (2014).
2. Donahue, J. et al. Decaf: A deep convolutional activation feature for generic visual recognition. In International Conference on Machine Learning (2014).
3. Peters, M. E. et al. Deep contextualized word representations. In Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers) (2018).
4. Raffel, C. et al. Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res. 21, 1–67 (2020).
5. Devlin, J., Chang, M.-W., Lee, K. & Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers) (2019).
6. Esteva, A. et al. Dermatologist-level classification of skin cancer with deep neural networks. Nature 542, 115–118 (2017).
7. De Fauw, J. et al. Clinically applicable deep learning for diagnosis and referral in retinal disease. Nat. Med. 24, 1342–1350 (2018).
8. Raghu, M., Zhang, C., Kleinberg, J. & Bengio, S. Transfusion: Understanding transfer learning for medical imaging. In Wallach, H. et al. (eds.) Advances in Neural Information Processing Systems (2019).
9. Schölkopf, B. & Smola, A. J. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond (MIT Press, 2002).
10. Arora, S. et al. Harnessing the power of infinitely wide deep nets on small-data tasks. In International Conference on Learning Representations (2020).
11. Lee, J. et al. Finite versus infinite neural networks: an empirical study. In Advances in Neural Information Processing Systems (2020).
12. Radhakrishnan, A., Stefanakis, G., Belkin, M. & Uhler, C. Simple, fast, and flexible framework for matrix completion with infinite width neural networks. arXiv:2108.00131 (2021).
13. Jacot, A., Gabriel, F. & Hongler, C. Neural Tangent Kernel: Convergence and generalization in neural networks. In Bengio, S. et al. (eds.) Advances in Neural Information Processing Systems (Curran Associates, Inc., 2018).

14. Arora, S. et al. On exact computation with an infinitely wide neural net. In Wallach, H. et al. (eds.) Advances in Neural Information Processing Systems (Curran Associates, Inc., 2019).
15. Dai, W., Yang, Q., Xue, G.-R. & Yu, Y. Boosting for transfer learning. In ACM International Conference Proceeding Series, vol. 227, 193–200 (2007).
16. Lin, H. & Reimherr, M. On transfer learning in functional linear regression. arXiv:2206.04277 (2022).
17. Obst, D. et al. Transfer learning for linear regression: a statistical test of gain. arXiv:2102.09504 (2021).
18. Blanchard, G., Lee, G. & Scott, C. Generalizing from several related classification tasks to a new unlabeled sample. Adv. Neural Inform. Process. Syst. 24 (2011).
19. Muandet, K., Balduzzi, D. & Schölkopf, B. Domain generalization via invariant feature representation. In International Conference on Machine Learning, 10–18 (PMLR, 2013).
20. Tommasi, T., Orabona, F. & Caputo, B. Safety in numbers: Learning categories from few examples with multi model knowledge transfer. In Computer Vision and Pattern Recognition, 3081–3088 (IEEE, 2010).
21. Micchelli, C. & Pontil, M. Kernels for multi-task learning. Adv. Neural Inform. Process. Syst. 17, 921–928 (2004).
22. Evgeniou, T., Micchelli, C. A., Pontil, M. & Shawe-Taylor, J. Learning multiple tasks with kernel methods. J. Mach. Learn. Res. 6, 615–637 (2005).
23. Evgeniou, T. & Pontil, M. Regularized multi-task learning. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 109–117 (2004).
24. Chrabaszcz, P., Loshchilov, I. & Hutter, F. A downsampled variant of ImageNet as an alternative to the CIFAR datasets. arXiv:1707.08819 (2017).
25. Gretton, A. et al. Covariate shift by kernel mean matching. Dataset Shift Mach. Learn. 3, 5 (2009).
26. Pan, S. J., Tsang, I. W., Kwok, J. T. & Yang, Q. Domain adaptation via transfer component analysis. IEEE Trans. Neural Networks 22, 199–210 (2010).
27. Argyriou, A., Evgeniou, T. & Pontil, M. Convex multi-task feature learning. Mach. Learn. 73, 243–272 (2008).
28. Liu, C., Zhu, L. & Belkin, M. On the linearity of large non-linear models: when and why the tangent kernel is constant. In Neural Information Processing Systems (2020).
29. Ma, S. & Belkin, M. Kernel machines that adapt to GPUs for effective large batch training. In Conference on Machine Learning and Systems (2019).
30. Netzer, Y. et al. Reading digits in natural images with unsupervised feature learning. In Advances in Neural Information Processing Systems (NIPS) (2011).
31. Baxter, J. A model of inductive bias learning. J. Artificial Intell. Res. 12, 149–198 (2000).
32. Ando, R. K., Zhang, T. & Bartlett, P. A framework for learning predictive structures from multiple tasks and unlabeled data. J. Mach. Learn. Res. 6 (2005).
33. Maurer, A., Pontil, M. & Romera-Paredes, B. The benefit of multitask representation learning. J. Mach. Learn. Res. 17, 1–32 (2016).
34. Kuzborskij, I. & Orabona, F. Fast rates by transferring from auxiliary hypotheses. Mach. Learn. 106, 171–195 (2017).
35. Denevi, G., Ciliberto, C., Stamos, D. & Pontil, M. Learning to learn around a common mean. Adv. Neural Inform. Process. Syst. 31 (2018).
36. Khodak, M., Balcan, M.-F. F. & Talwalkar, A. S. Adaptive gradient-based meta-learning methods. Adv. Neural Inform. Process. Syst. 32 (2019).
37. Belkin, M., Hsu, D., Ma, S. & Mandal, S. Reconciling modern machine-learning practice and the classical bias–variance trade-off. Proc. Natl. Acad. Sci. 116, 15849–15854 (2019).
38. Nakkiran, P. et al. Deep double descent: Where bigger models and more data hurt. In International Conference on Learning Representations (2020).
39. Bietti, A. Approximation and learning with deep convolutional models: a kernel perspective. In International Conference on Learning Representations (2022).
40. Zhuang, F. et al. A comprehensive survey on transfer learning. Proc. IEEE 109, 43–76 (2020).
41. Krizhevsky, A. Learning multiple layers of features from tiny images. Master's thesis, University of Toronto (2009).
42. Nilsback, M.-E. & Zisserman, A. Automated flower classification over a large number of classes. In 2008 Sixth Indian Conference on Computer Vision, Graphics & Image Processing, 722–729 (IEEE, 2008).
43. Cimpoi, M. et al. Describing textures in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2014).
44. Goodfellow, I., Bengio, Y. & Courville, A. Deep Learning, vol. 1 (MIT Press, 2016).
45. Kingma, D. P. & Ba, J. Adam: A method for stochastic optimization. In International Conference on Learning Representations (2015).
46. Bishop, C. M. Pattern Recognition and Machine Learning (Information Science and Statistics) (Springer-Verlag, Berlin, Heidelberg, 2006).
47. Hendrycks, D. & Dietterich, T. G. Benchmarking neural network robustness to common corruptions and perturbations. arXiv:1903.12261 (2019).
48. Subramanian, A., Narayan, R. & Corsello, S. M. et al. A next generation connectivity map: L1000 platform and the first 1,000,000 profiles. Cell 171, 1437–1452 (2017).
49. Pushpakom, S. et al. Drug repurposing: progress, challenges and recommendations. Nat. Rev. Drug Discov. 18, 41–58 (2019).
50. Belyaeva, A. et al. Causal network models of SARS-CoV-2 expression and aging to identify candidates for drug repurposing. Nat. Commun. 12 (2021).
51. Hodos, R. et al. Cell-specific prediction and application of drug-induced gene expression profiles. Pacific Symp. Biocomput. 23, 32–43 (2018).
52. Democratizing deep-learning for drug discovery, quantum chemistry, materials science and biology. https://github.com/deepchem/deepchem (2016).
53. Corsello, S. et al. Discovering the anticancer potential of non-oncology drugs by systematic viability profiling. Nat. Cancer 1, 1–14 (2020).
54. Huh, M., Agrawal, P. & Efros, A. A. What makes ImageNet good for transfer learning? arXiv:1608.08614 (2016).
55. Vapnik, V. N. Statistical Learning Theory (Wiley-Interscience, 1998).
56. Engl, H. W., Hanke, M. & Neubauer, A. Regularization of Inverse Problems, vol. 375 (Springer Science & Business Media, 1996).
57. Belkin, M., Hsu, D. & Xu, J. Two models of double descent for weak features. SIAM J. Math. Data Sci. 2, 1167–1180 (2020).
58. Bartlett, P. L., Long, P. M., Lugosi, G. & Tsigler, A. Benign overfitting in linear regression. Proc. Natl. Acad. Sci. 117, 30063–30070 (2020).
59. Hastie, T., Montanari, A., Rosset, S. & Tibshirani, R. J. Surprises in high-dimensional ridgeless least squares interpolation. arXiv:1903.08560 (2019).
60. Zheng, Y. & Phillips, J. M. Coresets for kernel regression. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 645–654 (2017).
61. Novak, R. et al. Neural Tangents: Fast and easy infinite neural networks in Python. In International Conference on Learning Representations (2020).

62. Paszke, A. et al. PyTorch: An imperative style, high-performance deep learning library. In Wallach, H. et al. (eds.) Adv. Neural Inform. Process. Syst. (Curran Associates, Inc., 2019).
63. Belkin, M., Ma, S. & Mandal, S. To understand deep learning we need to understand kernel learning. In International Conference on Machine Learning, 541–549 (PMLR, 2018).
64. Nichani, E., Radhakrishnan, A. & Uhler, C. Increasing depth leads to U-shaped test risk in over-parameterized convolutional networks. In International Conference on Machine Learning Workshop on Overparameterization: Pitfalls and Opportunities (2021).
65. Radhakrishnan, A., Belkin, M. & Uhler, C. Overparameterized neural networks implement associative memory. Proc. Natl. Acad. Sci. 117, 27162–27170 (2020).
66. Howard, J. & Ruder, S. Universal language model fine-tuning for text classification. In Association for Computational Linguistics, 328–339 (Association for Computational Linguistics, 2018).
67. Oliphant, T. E. A Guide to NumPy, vol. 1 (Trelgol Publishing USA, 2006).
68. Cortes, C., Mohri, M. & Rostamizadeh, A. Two-stage learning kernel algorithms. In International Conference on Machine Learning, 239–246 (2010).
69. Aiolli, F. & Donini, M. EasyMKL: a scalable multiple kernel learning algorithm. Neurocomputing 169, 215–224 (2015).
70. Qiu, S. & Lane, T. A framework for multiple kernel support vector regression and its applications to siRNA efficacy prediction. IEEE/ACM Trans. Comput. Biol. Bioinform. 6, 190–199 (2008).
71. Tanabe, H., Ho, T. B., Nguyen, C. H. & Kawasaki, S. Simple but effective methods for combining kernels in computational biology. In 2008 IEEE International Conference on Research, Innovation and Vision for the Future in Computing and Communication Technologies, 71–78 (IEEE, 2008).
72. Jaeger-Honz, S., Fulle, S. & Turk, S. Mol2vec: Unsupervised machine learning approach with chemical intuition. J. Chem. Inform. Model. 58 (2017).
73. Radhakrishnan, A., Ruiz Luyten, M., Prasad, N. & Uhler, C. Transfer Learning with Kernel Methods. https://github.com/uhlerlab/kernel_tf (2023).

Acknowledgements
The authors were partially supported by NCCIH/NIH (1DP2AT012345), NSF (DMS-1651995), ONR (N00014-22-1-2116), the MIT-IBM Watson AI Lab, AstraZeneca, the Eric and Wendy Schmidt Center at the Broad Institute, and a Simons Investigator Award (to C.U.).

Author contributions
A.R., M.R.L., C.U. designed the research; A.R., M.R.L., N.P. developed and implemented the algorithms; all authors performed model and data analysis. A.R., M.R.L., C.U. wrote the paper.

Competing interests
The authors declare no competing interests.

Additional information
Supplementary information The online version contains supplementary material available at https://doi.org/10.1038/s41467-023-41215-8.

Correspondence and requests for materials should be addressed to Caroline Uhler.

Peer review information Nature Communications thanks the anonymous reviewer(s) for their contribution to the peer review of this work. A peer review file is available.

Reprints and permissions information is available at http://www.nature.com/reprints

Publisher's note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

© The Author(s) 2023
