Under review as a conference paper at ICLR 2016

NEURAL NETWORK MATRIX FACTORIZATION


Gintare Karolina Dziugaite
Department of Engineering
University of Cambridge
Cambridge, United Kingdom CB2 1PZ
[email protected]

Daniel M. Roy
Department of Statistical Sciences
University of Toronto
Toronto, Canada M5S 3G3
[email protected]

arXiv:1511.06443v2 [cs.LG] 15 Dec 2015

ABSTRACT

Data often comes in the form of an array or matrix. Matrix factorization techniques
attempt to recover missing or corrupted entries by assuming that the matrix can be
written as the product of two low-rank matrices. In other words, matrix factorization
approximates the entries of the matrix by a simple, fixed function—namely, the
inner product—acting on the latent feature vectors for the corresponding row and
column. Here we consider replacing the inner product by an arbitrary function
that we learn from the data at the same time as we learn the latent feature vectors.
In particular, we replace the inner product by a multi-layer feed-forward neural
network, and learn by alternating between optimizing the network for fixed latent
features, and optimizing the latent features for a fixed network. The resulting
approach—which we call neural network matrix factorization or NNMF, for short—
dominates standard low-rank techniques on a suite of benchmarks but is dominated
by some recent proposals that take advantage of graph features. Given the
vast range of architectures, activation functions, regularizers, and optimization
techniques that could be used within the NNMF framework, it seems likely the true
potential of the approach has yet to be reached.

1 INTRODUCTION

We are interested in modeling arrays of data, which arise in the analysis of networks, graphs, and,
more generally, relational data. For example, in collaborative filtering and recommender system
applications (Goldberg et al., 1992; Koren et al., 2009), we may have N users and M movies, and for
some collection J ⊆ [N ] × [M ] of user–movie pairs (n, m), we have recorded the rating Xn,m that
user n gave movie m. The inferential goal we focus on in this work is to predict the ratings for those
pairs not in J. The data can be modeled as a partial observation of an N × M array X = (Xn,m ).
While the methods we discuss are applicable far beyond the setting of movie rating data or even
two-dimensional arrays, we will rely on the user–movie metaphor throughout.
One of the most popular approaches to modeling relational data using latent features is based on
matrix factorization. Here the idea is to assume that X is well approximated by the product U T V
of two low-rank matrices U ∈ RD×N and V ∈ RD×M , where the rank D is much less than N and
M . Write Un and Vm for the D-dimensional column vectors of U and V , respectively. Informally,
we will think of Un as a latent vector of features describing user n, and of Vm as a latent vector of
features describing movie m. The rating Xn,m is then approximated by the inner product UnT Vm .
Probabilistic matrix factorization (Salakhutdinov and Mnih, 2008), or simply PMF, is based on
matrix factorization, and further assumes the entries of X are independent Gaussians with common
variance and means given by the corresponding entries of U T V . Maximum likelihood inference of
the features U and V leads one to minimize the Frobenius norm of X − U T V , or equivalently, the
root mean squared error (RMSE) between the inner product of the features and the observed relations.
In practice, regularization of the feature vectors often improves the performance of the resulting
predictions, provided the regularization parameter is chosen carefully, e.g., by cross validation.
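To make the PMF objective concrete, here is a minimal Python sketch of the regularized squared-error loss described above; the dictionary representation of the observed entries and the value of the regularization parameter are illustrative assumptions, not details from the paper.

```python
import numpy as np

def pmf_objective(X_obs, U, V, lam=0.05):
    """Regularized squared error for PMF on the observed entries only.

    X_obs : dict mapping an observed pair (n, m) to the rating X[n, m]
    U     : D x N array of user features
    V     : D x M array of movie features
    lam   : regularization strength (hypothetical value)
    """
    sq_err = sum((x - U[:, n] @ V[:, m]) ** 2 for (n, m), x in X_obs.items())
    reg = lam * (np.sum(U ** 2) + np.sum(V ** 2))
    return sq_err + reg
```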
PMF is extremely effective in practice, but is also easy to improve upon when dealing with large
but sparsely observed array data, as is typical in collaborative filtering. One way to improve upon
PMF is to introduce row and column effects that model systematic biases associated with users and

with movies, leading to a model known in the collaborative filtering community as BiasedMF (Koren
et al., 2009). In this approach, the mean of Xn,m is taken to be UnT Vm + µn + τm + β, where
µ = (µ1 , . . . , µN ), τ = (τ1 , . . . , τM ), and β are additional latent variables representing the user,
movie, and global biases, respectively. Note that the row and column effects in BiasedMF can be seen
as a special case of PMF where we fix an entry of U and a distinct entry of V to take the value 1. In
other words, BiasedMF implements a strong inductive bias. Again, regularization improves prediction
performance in practice.
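The BiasedMF mean can be sketched in the same style; the function below simply adds the user, movie, and global bias terms to the PMF inner product (array shapes follow the sketch above).

```python
def biasedmf_predict(U, V, mu, tau, beta, n, m):
    """Mean of X[n, m] under BiasedMF: U_n^T V_m plus user bias mu[n],
    movie bias tau[m], and global bias beta."""
    return U[:, n] @ V[:, m] + mu[n] + tau[m] + beta
```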
In this short paper, we describe a different approach to factorizing X. Write Un ◦ Vm for the element-
wise product, i.e., the D-dimensional vector whose i’th entry is Ui,n Vi,m . Using this notation, PMF
models the mean of Xn,m by f (Un ◦ Vm ), where f is the function f (w1 , w2 , . . . ) = Σj wj . In the
same notation, BiasedMF models the mean of Xn,m by f (Un ◦ Vm , µn , τm , β). Our idea is to learn
f , rather than assume it is fixed. In particular, we take f = fθ to be a feed-forward neural network
with weights θ. Given data, we learn the weights θ at the same time that we learn the latent features.
Note that, for fixed latent feature vectors, we effectively have a supervised learning problem, and
so we can optimize the neural network by gradient descent as is typical. Fixing the neural network,
we can optimize the latent feature vectors also by gradient descent, not unlike recent applications
of neural networks to the problem of transferring artistic styles to ordinary images (Gatys et al.,
2015). As one would expect, regularization is critical. We used ℓ2 -regularization for the latent feature
vectors, and chose the regularization parameter λ by optimizing the error on a validation set. We call
our proposal neural network matrix factorization, or simply NNMF.
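As a quick sanity check of the notation, the following toy lines (reusing the numpy style of the PMF sketch above, with made-up sizes) confirm that summing the element-wise product Un ◦ Vm recovers the ordinary inner product, i.e., PMF corresponds to the fixed choice f(w) = Σj wj.

```python
import numpy as np

U = np.random.randn(5, 3)   # D x N user features (toy sizes)
V = np.random.randn(5, 4)   # D x M movie features
n, m = 1, 2
elementwise = U[:, n] * V[:, m]                      # U_n ∘ V_m
assert np.isclose(np.sum(elementwise), U[:, n] @ V[:, m])
```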

2 MODEL

Let [N ] = {1, 2, . . . , N }. A data array is modeled as a collection of real-valued random variables
Xn,m , for (n, m) ∈ J, where J ⊂ [N ] × [M ] are the indices of the observed entries of the N × M
data array. (The extension to higher-dimensional arrays or arrays whose entries are elements in spaces
other than R is straightforward.)
To each row, n ∈ [N ], we associate a latent feature vector Un ∈ RD and a latent feature matrix
U'n ∈ RD'×K . Similarly, to each column, m ∈ [M ], we associate a latent feature vector Vm ∈ RD
and a latent feature matrix V'm ∈ RD'×K . Write U'n,k for the k’th column of U'n (and similarly for
V'm ), and write (U, V ) for the collection of all latent features.[1]
Let fθ be a feed-forward neural network with weights θ. Viewing θ and the latent features (U, V ) as
unknown parameters, we assume the entries Xn,m are independent random variables with means

        X̂n,m := X̂(Un , Vm , U'n , V'm ) := fθ (Un , Vm , U'n,1 ◦ V'm,1 , . . . , U'n,D' ◦ V'm,D' ).     (1)

In other words, the neural network fθ has 2D + D' real-valued input units and one real-valued output
unit. The first D are user-specific features; the next D are movie-specific features; and the last D'
inputs are the result of inner products between K-dimensional vectors.[2]
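The following is a minimal PyTorch sketch (not the authors' implementation) of the predictor in Eq. (1), specialized to the K = 1 case used in the experiments, so the last D' inputs reduce to element-wise products of D'-dimensional vectors. The default sizes mirror Section 5 but are otherwise assumptions.

```python
import torch
import torch.nn as nn

class NNMF(nn.Module):
    """Sketch of Eq. (1) with K = 1: X_hat[n, m] = f_theta(U_n, V_m, U'_n ∘ V'_m)."""

    def __init__(self, N, M, D=10, D_prime=60, hidden=50, n_hidden_layers=3):
        super().__init__()
        # Latent feature vectors U_n, V_m in R^D and (K = 1) matrices U'_n, V'_m in R^{D'}.
        self.U = nn.Parameter(0.1 * torch.randn(N, D))
        self.V = nn.Parameter(0.1 * torch.randn(M, D))
        self.U_prime = nn.Parameter(0.1 * torch.randn(N, D_prime))
        self.V_prime = nn.Parameter(0.1 * torch.randn(M, D_prime))
        # Feed-forward network f_theta: 2D + D' inputs, one output.
        layers, width = [], 2 * D + D_prime
        for _ in range(n_hidden_layers):
            layers += [nn.Linear(width, hidden), nn.Sigmoid()]
            width = hidden
        layers.append(nn.Linear(width, 1))
        self.f_theta = nn.Sequential(*layers)

    def forward(self, n, m):
        # n, m: 1-D LongTensors of row (user) and column (movie) indices.
        x = torch.cat([self.U[n], self.V[m], self.U_prime[n] * self.V_prime[m]], dim=1)
        return self.f_theta(x).squeeze(-1)
```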

3 LEARNING

To learn the network weights θ and latent features (U, V ), we minimize the objective
        Σ(n,m)∈J (Xn,m − X̂n,m )2 + λ [ Σn ||U'n ||2F + Σn ||Un ||22 + Σm ||V'm ||2F + Σm ||Vm ||22 ],     (2)

where λ is a regularization parameter, || · ||2 denotes the ℓ2 norm, and || · ||F denotes the Frobenius
norm. This objective can be understood as a penalized log likelihood under a Gaussian model for
each entry. It can also be understood as specifying the maximum a posteriori estimate assuming
independent Gaussian priors for every latent feature.
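For the NNMF sketch above, the objective in Eq. (2) can be written as follows; note that only the latent features, not the network weights θ, are penalized.

```python
def nnmf_objective(model, n_idx, m_idx, ratings, lam):
    """Squared error on observed entries plus l2/Frobenius penalties on the features."""
    pred = model(n_idx, m_idx)
    sq_err = ((ratings - pred) ** 2).sum()
    reg = lam * (model.U.pow(2).sum() + model.V.pow(2).sum()
                 + model.U_prime.pow(2).sum() + model.V_prime.pow(2).sum())
    return sq_err + reg
```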
[1] Note that there is nothing forcing the latent feature vectors Un for users and Vm for movies to have the
same dimensionality. We made this choice for simplicity.
[2] In our experiments, we took K = 1 and D' large. Taking D' = 1 results in a model where the input to
the neural network is the prediction of a rank-K matrix factorization and 2D additional features. This simple
modification of matrix factorization led to improvements, but not as dramatic as those we report in Section 5.


During training, we alternated between optimizing the neural network weights, while fixing the latent
features, and optimizing the latent features, while fixing the network weights. Optimization was
carried out by gradient descent on the entire dataset (i.e., we did not use batches). We used RMSProp
to adjust the learning rate. (See Section 5 for details.) We did not evaluate other optimization
algorithms.
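A sketch of this alternating scheme, built on the NNMF class and objective sketched above, is given below. The number of rounds and of gradient steps per block are assumptions; the paper only specifies full-batch gradient descent with RMSProp.

```python
import torch

def train(model, n_idx, m_idx, ratings, lam=0.01, lr=1e-3,
          n_rounds=100, steps_per_block=10):
    theta = list(model.f_theta.parameters())
    features = [model.U, model.V, model.U_prime, model.V_prime]
    opt_theta = torch.optim.RMSprop(theta, lr=lr)     # updates network weights only
    opt_feats = torch.optim.RMSprop(features, lr=lr)  # updates latent features only
    for _ in range(n_rounds):
        for opt in (opt_theta, opt_feats):            # alternate the two blocks
            for _ in range(steps_per_block):
                model.zero_grad()
                loss = nnmf_objective(model, n_idx, m_idx, ratings, lam)
                loss.backward()
                opt.step()
    return model
```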

4 RELATED WORK

NNMF is very similar in spirit to the Random Function Model for modeling arrays/matrices proposed
by Lloyd et al. (2012). Using probabilistic symmetry considerations, they arrive at a model where the
mean of Xn,m is given by g(Un , Vm ), where g : R2D → R is modeled by a Gaussian process. At
a high level, our model replaces the Gaussian process prior with a parametric neural network.
We explored simply taking g to be a feed-forward neural network (acting on a concatenated pair of
vectors in RD ), but found that we achieved better performance if some of the input dimensions first
underwent an element-wise product. (We discuss this further below in relation to NTN models.)
Conceivably, a deep neural network could learn to approximate the element-wise product or even
outperform it, but this was not the case in our experiments, which used gradient-descent techniques to
learn the neural network weights. In experiments, we found that NNMF significantly outperformed
RFM, although the RFM results were those produced by an implementation that was limited to D ≤ 3
latent dimensions due to significant algorithmic issues associated with scaling up inference for the
Gaussian process component. Given some recent advances in GP inference, it would be interesting to
revisit the RFM, though it is not clear to the authors whether the advances are quite enough.
NNMF is related to some methods applied to Knowledge Bases and Knowledge Graphs. (See
(Nickel et al., 2015) for a review of relational machine learning and Knowledge Graphs.) Knowl-
edge bases (KBs) are relational data composed of entity–relation–entity triples. For example, a
geopolitical knowledge base might contain facts such as (Rome, capital-of, Italy) and
(Lithuania, member-of, EU). Knowledge graphs (KGs) are representations of KBs as
graphs whose vertices represent entities and whose (labelled) edges represent relations between
entities. Given this connection, one can see that KBs can be thought of as a collection of (extremely
sparsely observed) arrays, one for each relation, or as a single three-dimensional array. The key
challenge in modeling KBs and KGs is to simultaneously learn many relations using shared rep-
resentations in order to augment the limited data one has on each relation. This is one of the key
differences with the collaborative filtering setting.
A method for KGs similar to NNMF is the Neural Tensor Network (NTN), which combines a tensor
product with a single-layer neural network (Socher et al., 2013). (Methods similar to NTN have been
applied to problems in speech recognition. See, e.g., (Yu et al., 2013).) Other approaches to KGs use
neural networks to produce representations, rather than to map representations to predictions as NTN
and NNMF do. (See, e.g., (Huang et al., 2015) and (Bian et al., 2014).)
NTNs model each element of a two-dimensional array by

        X̂n,m := aT tanh( UnT Q[1:H] Vm + W [Un ; Vm ] + b ),     (3)

where [Un ; Vm ] ∈ R2D denotes the concatenation of Un and Vm ; Un , Vm ∈ RD are feature vectors;
a ∈ RH is a linear layer weight vector; b ∈ RH is a
bias vector; W ∈ RH×2D is a weight matrix; Q[1:H] ∈ RD×D×H is a third order tensor; and the
nonlinearity tanh(·) : RH → RH acts element wise. The tensor product term UnT Q[1:H] Vm denotes
the element in RH whose h’th entry is equal to UnT Qh Vm , where Qh ∈ RD×D is the h’th slice of
Q[1:H] . The model is trained by optimizing the contrastive max-margin objective using L-BFGS with
mini-batches.
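For comparison with the NNMF sketch above, here is a direct transcription of the NTN score in Eq. (3) (the training objective and any output nonlinearity are omitted); tensor shapes follow the definitions in the text.

```python
import torch

def ntn_score(U_n, V_m, a, Q, W, b):
    """Eq. (3): a^T tanh(U_n^T Q^{[1:H]} V_m + W [U_n; V_m] + b).

    U_n, V_m : (D,) feature vectors     a, b : (H,)
    W        : (H, 2D) weight matrix    Q    : (D, D, H) third-order tensor
    """
    bilinear = torch.einsum('i,ijh,j->h', U_n, Q, V_m)  # h'th entry: U_n^T Q_h V_m
    linear = W @ torch.cat([U_n, V_m])
    return a @ torch.tanh(bilinear + linear + b)
```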
Ignoring the particular nonlinearity used, the first layer of the NNMF model can be expressed in the
form of Eq. (3) if we take W = 0 and allow ourselves to fix some entries of the latent features. (NTN
employs no additional layers.) For example, taking K = 1 as in our experiments, define for n ∈ [N ],
m ∈ [M ], i, j ∈ [2D + D'], and h ∈ [H],

        Ūn = [Un ; 1D ; U'n ] ∈ R2D+D' ,   V̄m = [1D ; Vm ; V'm ] ∈ R2D+D' ,   Qhij = W'h,i if i = j, and 0 otherwise,     (4)
where 1D denotes a D-dimensional column vector with all entries equal to 1 and W' ∈ RH×(2D+D') is
the weight matrix defining the first layer of the NNMF network. Then we recover the first layer of
NNMF with the third-order tensor Q[1:H] ∈ R(2D+D')×(2D+D')×H whose h’th slice is Qh .
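The construction in Eq. (4) can be checked numerically: with the diagonal slices Qh built from a (randomly drawn, illustrative) NNMF first-layer weight matrix W', the NTN tensor term reproduces the pre-activation of the NNMF first layer on the input [Un ; Vm ; U'n ◦ V'm ].

```python
import torch

D, D_prime, H = 4, 6, 5                                  # toy sizes
U_n, V_m = torch.randn(D), torch.randn(D)
U_p, V_p = torch.randn(D_prime), torch.randn(D_prime)    # K = 1 feature "matrices"
W1 = torch.randn(H, 2 * D + D_prime)                     # NNMF first-layer weights W'

U_bar = torch.cat([U_n, torch.ones(D), U_p])             # Eq. (4)
V_bar = torch.cat([torch.ones(D), V_m, V_p])
Q = torch.stack([torch.diag(W1[h]) for h in range(H)], dim=-1)

tensor_term = torch.einsum('i,ijh,j->h', U_bar, Q, V_bar)   # h'th entry: U_bar^T Q_h V_bar
first_layer = W1 @ torch.cat([U_n, V_m, U_p * V_p])         # NNMF first-layer pre-activation
assert torch.allclose(tensor_term, first_layer, atol=1e-5)
```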
There have been many scalable techniques proposed to model very large KGs. Nickel et al. (2015)
split existing models into two categories: latent feature models and graph feature models. Latent
variable methods learn unobserved representations of entities and use them to predict relations,
while graph feature methods learn to predict relations directly from extracted features of local graph
structure. Toutanova and Chen (2015) argue through empirical comparisons that these two categories
of models exhibit complementary strengths.
A number of state-of-the-art proposals for collaborative filtering are perhaps best thought of as
incorporating aspects of graph feature models. An example of a method relaxing the low-rank
assumption using graph features is the Local Low Rank Matrix Approximation (Lee et al., 2013),
which assumes that every entry in the matrix is given by a combination of low rank matrices, where
the combination is specific to the entry. LLORMA achieves impressive state-of-the-art performance.
Other approaches also use neural-network architectures but work by trying to predict the ground
truth ratings directly from the observed ratings matrix X. For example, in I-AutoRec (Sedhain et al.,
2015), an autoencoder is learned that takes as input the observed movie ratings vector Xn for user n
and produces as output the corresponding ground-truth ratings. (Missing entries are typically replaced by the value 3.)
AutoRec achieves state-of-the-art performance, slightly besting LLORMA on some benchmarks, but
a careful comparison would likely require a fresh data set and strict controls on how the numerous
parameters for both models are chosen (Blum and Hardt, 2015; Dwork et al., 2015). Another model
in this category is the I-RBM (Salakhutdinov et al., 2007), but its performance is now far from the
state of the art.
Both LLORMA and I-AutoRec can be seen as models combining aspects of both graph feature and
latent feature models. LLORMA identifies similar rows and columns (entities) using graph features,
but models each local low-rank approximation using latent features. I-AutoRec takes as input all
observed ratings (relations) for a user (entity), allowing the network to model the graph features,
which in this case are similarities and distances among movies.
In Section 5, we compare the performance of NNMF and other approaches on benchmarks including
link prediction in graphs, as well as collaborative filtering in movie rating datasets. In our experiments,
NNMF dominated other latent feature methods, as well as the I-RBM model. However, NNMF was
dominated by both LLORMA and I-AutoRec. One possibility is that a different approach to learning
the underlying neural network would deliver results on par with these methods. Another possibility
is that the difference reflects some fundamental limitation of latent feature models, which assume
that the ratings are conditionally independent given the latent feature representations. Local graph
structure may contain information that would aid in predicting ratings. In particular, NNMF does not
learn from the pattern of missing ratings, which can reveal information about a user or movie: e.g., a
user might tend only to give ratings when those ratings are extreme, and movies with low ratings are
less likely to be viewed in the first place. In contrast to NNMF, both LLORMA and AutoRec could,
in principle, be taking advantage of the information latent in the pattern of missing ratings, although
the strength of this effect has not been studied. In LLORMA, the sparsity pattern affects the notion of
locality. In AutoRec, the entire pattern of ratings is fed as input, although the sparsity is obscured
somewhat by missing entries being replaced by 3’s.
Some recent work by Hernández-Lobato et al. (2014) demonstrates that explicitly modeling the
non-random pattern of missing ratings can lead to a slight improvement in performance for latent
feature models, although the gains they demonstrated were not dramatic enough that they would
have closed the gap between NNMF and LLORMA/AutoRec. Indeed, we implemented a neural
architecture similar in spirit to theirs, but were only able to improve the RMSE score by approximately
0.003. A more careful analysis would be necessary to make more definitive conclusions.

5 EXPERIMENTS

We evaluated NNMF on two graph datasets (NIPS and Protein) and two collaborative filtering datasets
(MovieLens 100K and 1M). See Table 1 for more information about the datasets.


              NIPS     Protein   ML-100K   ML-1M
Vertices X    234      230       943       6040
Vertices Y    -        -         1682      3900
Edges         27144    52900     100000    1000209

Table 1: Data sets and their dimensions. The mark “-” highlights that the array is square.

                 NIPS     Protein   ML-100K                            ML-1M
RFM (3)          0.110    0.136     -          PMF (60)                0.883
PMF (3)          0.130    0.139     -          LLORMA-GLOBAL           0.865
PMF (60)         0.062    0.104     0.952      I-RBM                   0.854
BiasedMF (60)    0.065    0.111     0.911      BiasedMF (60)           0.852
NTN (60)         0.048    0.071     0.910      NTN (60)                0.852
NNMF (3HL)       0.040    0.065     0.907      LLORMA-LOCAL            0.833
NNMF (4HL)       -        -         0.903      I-AutoRec               0.831
                                               NNMF (3HL)              0.846
                                               NNMF (4HL)              0.843

Table 2: Results across the four data sets for a variety of techniques. The token (D) specifies that a
rank-D factorization was used. The token (nHL) specifies that n hidden layers were used. Scores
reported for RFM and PMF (3) are taken from (Lloyd et al., 2012). Scores for BiasedMF were
obtained using LibRec (Guo et al., 2015). Scores for LLORMA were taken from (Lee et al., 2013);
scores for AutoRec and I-RBM were taken from (Sedhain et al., 2015).

NNMF, NTN, and PMF model performance was evaluated on 5 randomly subsampled test sets, each
comprising 10% of the data points, and then averaged. The remaining 90% of the data was split into
training and validation sets: For the graph datasets, we used 10% of the training data for validation.
Due to the larger size of collaborative filtering datasets, we used 2% and 0.5% of the training data for
validation on the MovieLens 100K and 1M datasets, respectively. These numbers were chosen to
make the Monte Carlo error of the validation set estimate sufficiently small. (It is likely that results
could be improved by better use of the training and validation data.)
The regularization parameter, λ, and optimal stopping time were chosen by optimizing the error on
the validation set. For every fixed setting of λ, the network and features were learned by optimizing
Eq. (2) as described in Section 3.
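In code, the selection of λ (and, analogously, of the stopping time) might look like the sketch below, reusing the NNMF and train sketches from earlier sections; the candidate grid and the names of the pre-split index/rating tensors are assumptions.

```python
import torch

def validation_rmse(model, val_n, val_m, val_x):
    with torch.no_grad():
        pred = model(val_n, val_m)
    return torch.sqrt(((val_x - pred) ** 2).mean()).item()

best_rmse, best_lam = float('inf'), None
for lam in [0.001, 0.01, 0.1, 1.0]:                    # hypothetical grid
    model = NNMF(N, M)                                 # N, M: numbers of rows and columns
    train(model, train_n, train_m, train_x, lam=lam)   # assumed training tensors
    rmse = validation_rmse(model, val_n, val_m, val_x)
    if rmse < best_rmse:
        best_rmse, best_lam = rmse, lam
```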
For simplicity, and to avoid the pitfalls of choosing parameters that produce good test set performance,
the number and dimensionality of the features, as well as the network architecture, were fixed across
experiments. It is conceivable that cross validating these parameters would have yielded better results.
On the other hand, it would be wise to employ safeguards (Dwork et al., 2015) before embarking on
an adaptive search for better architectures, learning rates, activation functions, etc.
We chose D' = 60 feature dimensions to be preprocessed by an element-wise product, and included
D = 10 additional features for each user and each movie. The feed-forward neural network
was chosen to have 3 hidden layers with 50 sigmoidal units each. The network weights were
sampled uniformly in ±4√6/√(nin + nout ), where nin , nout denote the number of inbound and
outbound connections. The latent features were randomly initialized from a zero-mean Gaussian
distribution with standard deviation 0.1. The features and weights were learned by gradient descent,
and RMSProp was used to adjust the learning rate, which was initialized to 0.001 for NIPS, Protein,
and ML-100K, and to 0.005 for ML-1M.
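Applied to the NNMF sketch from Section 2, the initialization just described might be implemented as follows; the treatment of the biases is an assumption, since the paper does not specify it.

```python
import math
import torch.nn as nn

def init_nnmf(model):
    # Network weights: uniform in +/- 4*sqrt(6)/sqrt(n_in + n_out).
    for layer in model.f_theta:
        if isinstance(layer, nn.Linear):
            bound = 4 * math.sqrt(6) / math.sqrt(layer.in_features + layer.out_features)
            nn.init.uniform_(layer.weight, -bound, bound)
            nn.init.zeros_(layer.bias)   # bias handling not specified in the paper
    # Latent features: zero-mean Gaussian with standard deviation 0.1.
    for feat in (model.U, model.V, model.U_prime, model.V_prime):
        nn.init.normal_(feat, mean=0.0, std=0.1)
```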
To train the PMF model, we chose 60 dimensions after evaluating the performance of PMF with
various choices for the dimensionality and finding that this worked best. On each run, the regular-
ization parameter was chosen from a large range by optimizing the validation error. (We tried many
other settings for PMF, and have reported the best numbers we obtained here to make the comparison
conservative.)


NTN model hyperparameters were chosen to match NNMF ones—we used 60-dimensional latent
features, and 50 units in the hidden layer. This setup yields a third order tensor with 60 × 60 × 50 =
180,000 entries. Compared to the network underlying NNMF, an NTN of approximately the same
size has roughly 20 times more parameters. The model was trained with gradient descent on the
same objective function as for NNMF. We had to use mini-batches for the MovieLens 1M dataset
to avoid memory issues. Just as for other models, we chose the regularization parameter λ by
optimizing the error on the validation set. Note that the original NTN model was trained with
a contrastive max-margin objective function with ℓ2 regularization of all parameters. We applied a
sigmoid nonlinearity to the output layer of the original NTN, to ensure that its outputs fell in [0, 1].
The results appear in Table 2. As mentioned above, NNMF dominates PMF, RFM, and to a lesser
extent NTN. In (Lloyd et al., 2012), the performance of RFM is compared with PMF when both
models use the same number of latent dimensions. The performance of PMF, however, tends to
improve with higher dimensions, assuming proper regularization, and so RFM (3) is seen here
to perform worse than PMF (60). It is possible that recent advances in Gaussian process regression
could in turn improve the performance of RFM.
NNMF outperforms BiasedMF, although the margin narrows as we move to the sparsely-observed
MovieLens datasets. We note that adding bias correction terms to NNMF also improves the perfor-
mance of NNMF, although the improvement is on the order of 0.003, and so may not be robust. It is
also possible that using more of the training data might widen the gap.
NNMF beats the (low-rank) global version of LLORMA, but not the local version that relaxes
the low-rank constraint. NNMF is also bested by AutoRec. It is also not clear if we could have
reliably found much better network weights and features had we made different choices around the
architecture, composition, and training of the neural network. Given that NNMF dominates PMF so
handily on the graph datasets, it might stand to reason that there is a lot of room for improvement on
MovieLens through better engineering of NNMF. It is worth noting that a ‘local’ version of NNMF
could be developed along the same lines as for LLORMA. Given that NNMF dominates PMF,
it might then also stand to reason that a local version of NNMF would dominate LLORMA, because
LLORMA can be understood as a local version of PMF.
To see whether deeper networks performed better on the collaborative filtering datasets, we also
evaluated NNMF on the MovieLens data sets using a 4 hidden layer network. We observed that fewer
units per layer yielded better results. (We compared 50 units per layer when (D, D') = (10, 60) to
20 units per layer when (D, D') = (10, 80).) However, to draw any conclusions, more experiments
would be needed, with care to avoid overfitting. We reported scores for 4 hidden layer networks, with
20 units per hidden layer, and (D, D') = (10, 80) latent feature dimensions. We believe that adding
additional layers would likely improve the results, though we suspect the performance would saturate
quickly (and then drop if we did not very carefully initialize and regularize the network).

6 DISCUSSION

NNMF achieves state-of-the-art results among latent feature models, but is dominated by approaches
that take into account local graph structure. However, it is possible that our experiments have not
identified the limits of the NNMF model. It is difficult to exhaustively explore the range of network
architectures, activation functions, regularization techniques, and cross-validation strategies. Even
if we could explore them all, we would be in danger of overfitting and losing any hope of insight
into the usefulness of NNMF. Indeed, we erred towards not trying to optimize over the model's
many possible configurations. It would be interesting to apply recent advances in adaptive estimation
to control the possibility of overfitting during this phase of designing and evaluating a new model
(Dwork et al., 2015).

ACKNOWLEDGMENTS

The authors would like to thank Zoubin Ghahramani for feedback and helpful discussions.


REFERENCES

J. Bian, B. Gao, and T.-Y. Liu. Knowledge-powered deep learning for word embedding. In Proc.
European Conference on Machine Learning. Springer, September 2014.
A. Blum and M. Hardt. The ladder: A reliable leaderboard for machine learning competitions, 2015.
arXiv:1502.04585. Conference version appeared in ICML, 2015.
C. Dwork, V. Feldman, M. Hardt, T. Pitassi, O. Reingold, and A. Roth. The reusable holdout:
Preserving validity in adaptive data analysis. Science, 349(6248):636–638, 2015. Earlier versions
appeared in NIPS 2015 and STOC 2015.
L. A. Gatys, A. S. Ecker, and M. Bethge. A neural algorithm of artistic style, 2015. arXiv:1508.06576.
D. Goldberg, D. Nichols, B. M. Oki, and D. Terry. Using collaborative filtering to weave an
information tapestry. Communications of the ACM, 35:61–70, 1992.
G. Guo, J. Zhang, Z. Sun, and N. Yorke-Smith. LibRec: A Java library for recommender systems,
2015. In Posters, Demos, Late-breaking Results and Workshop Proceedings of the 23rd Conference
on User Modelling, Adaptation and Personalization (UMAP).
J. M. Hernández-Lobato, N. Houlsby, and Z. Ghahramani. Probabilistic matrix factorization with
non-random missing data. In Proc. of the Int. Conf. on Machine Learning, 2014.
H. Huang, L. Heck, and H. Ji. Leveraging deep neural networks and knowledge graphs for entity
disambiguation, 2015. arXiv:1504.07678v1.
Y. Koren, R. Bell, and C. Volinsky. Matrix factorization techniques for recommender systems.
Computer, 42(8):30–37, Aug. 2009. doi: 10.1109/MC.2009.263.
J. Lee, S. Kim, G. Lebanon, and Y. Singer. Local low-rank matrix approximation. In Proc. of the Int.
Conf. on Machine Learning, 2013.
J. Lloyd, P. Orbanz, Z. Ghahramani, and D. M. Roy. Random function priors for exchangeable arrays
with applications to graphs and relational data. In P. Bartlett, F. Pereira, C. Burges, L. Bottou, and
K. Weinberger, editors, Adv Neural Inform. Proc. Systems 25, pages 1007–1015, 2012.
M. Nickel, K. Murphy, V. Tresp, and E. Gabrilovich. A review of relational machine learning for
knowledge graphs, 2015. arXiv:1503.00759.
R. Salakhutdinov and A. Mnih. Probabilistic matrix factorization. In Neural Information Processing
Systems 21, 2008.
R. Salakhutdinov, A. Mnih, and G. Hinton. Restricted Boltzmann machines for collaborative filtering.
In Proceedings of the 24th International Conference on Machine Learning, ICML ’07, pages
791–798, New York, NY, USA, 2007. ACM. doi: 10.1145/1273496.1273596.
S. Sedhain, A. K. Menon, S. Sanner, and L. Xie. AutoRec: Autoencoders meet collaborative filtering.
In Proceedings of the 24th International Conference on World Wide Web, WWW ’15 Companion,
pages 111–112, Republic and Canton of Geneva, Switzerland, 2015. International World Wide
Web Conferences Steering Committee. doi: 10.1145/2740908.2742726.
R. Socher, D. Chen, C. D. Manning, and A. Ng. Reasoning with neural tensor networks for knowledge
base completion. In C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Weinberger, editors,
Advances in Neural Information Processing Systems 26, pages 926–934, 2013.
K. Toutanova and D. Chen. Observed versus latent features for knowledge base and text inference. In
3rd Workshop on Continuous Vector Space Models and Their Compositionality. Association for
Computational Linguistics, July 2015.
D. Yu, L. Deng, and F. Seide. The deep tensor neural network with applications to large vocabulary
speech recognition. IEEE Transactions on Audio, Speech, and Language Processing, 21(2):
388–396, 2013.
