A Review of Bayesian Methods for Infinite Factorisations

Margarita Grushanina*

September 25, 2023

arXiv:2309.12990v1 [stat.ME] 22 Sep 2023

* Department of Economics, Vienna University of Economics and Business, Welthandelsplatz 1, 1020 Vienna, Austria

Abstract

Determining the number of latent factors has been one of the most challenging problems in factor analysis. Infinite factor models offer a solution to this problem by applying increasing shrinkage to the columns of the factor loading matrix, thus penalising increasing factor dimensionality. The adaptive MCMC algorithms used for inference in such models make it possible to infer the dimension of the latent factor space automatically from the data. This paper presents an overview of Bayesian models for infinite factorisations, with some discussion of the properties of such models as well as their comparative advantages and drawbacks.
Keywords: Factor analysis, adaptive Gibbs sampling, spike-and-slab prior, Indian buffet process, multiplicative gamma process, increasing shrinkage

1 Introduction
Latent factor models represent a popular tool for data analysis in many areas of science, including psychology, marketing, economics, finance, genetic research, pharmacology and medicine. Their history dates back to Spearman (1904), who first suggested common factor analysis as a single factor model in the context of psychology. Thurstone (1931) and Thurstone (1934) extended it to multiple common factors and introduced some important factor analysis concepts, such as communality, uniqueness, and rotation. Anderson and Rubin (1956) in their seminal paper established important theoretical foundations of latent factor analysis. Since then there has been a vast and constantly growing pool of literature covering various theoretical and practical aspects of factor analysis. Some selective reviews include, for example, Barhoumi et al. (2013) and Stock and Watson (2016) for dynamic factor models, Bai and Wang (2016) for large factor models, and Fan et al. (2021) for factor models in application to econometric learning.
Recent years have also seen considerable research in the area of Bayesian latent factor models. Some of the many important contributions in this area include Geweke and Zhou (1996), Aguilar and West (2000), West (2003), Lopes and West (2004), Frühwirth-Schnatter and Lopes (2010), Conti et al. (2014), Ročková and George (2016), Kaufmann and Schumacher (2019) and Frühwirth-Schnatter et al. (2022a).
One of the most challenging tasks in factor analysis concerns the inference of the true number
of latent factors in the model. The most common approach in the literature has long been to use
various criteria to choose a model with the correct number of factors. Thus, Bai and Ng (2002)
use information criteria to compare models with different numbers of factors. Kapetanios (2010) performs model comparison using test statistics, while Polasek (1997) and Lopes and West (2004) rely on marginal likelihood estimation to determine the true number of factors in the model. Carvalho et al. (2008) perform an evolutionary stochastic model search which iteratively increases the model by an additional factor until reaching some pre-specified limit or until the process stops including additional factors. As a different approach, Lopes and West (2004) customise the reversible jump MCMC (RJMCMC) algorithm introduced in Green (1995) for moving between models with different numbers of factors, while Frühwirth-Schnatter and Lopes (2018) suggest a one-sweep algorithm to estimate the true number of factors from an overfitting factor model.
However, such methods are often computationally demanding, especially when the dimensionality of the analysed data set is high. Recently, another approach has been developed which allows the factors' cardinality to be derived from the data by letting the number of factors be potentially infinite. Dimension reduction is then achieved by assigning a nonparametric prior to the factor loadings which penalises an increasing number of columns in the factor loading matrix via increasing shrinkage of the factor loadings on each additional factor towards zero. Thus, in their pioneering work, Bhattacharya and Dunson (2011) introduced the multiplicative gamma process (MGP) prior on the precision of factor loadings, which is defined as a cumulative product of gamma distributions. Knowles and Ghahramani (2011) and Ročková and George (2016) employed the Indian buffet process (IBP) to enforce sparsity on factor loadings and at the same time penalise the increasing dimensionality of latent factors. Legramanti et al. (2020) introduced the cumulative shrinkage process (CUSP) prior, which applies cumulative shrinkage on the increasing number of columns of the factor loading matrix via a sequence of spike-and-slab distributions. Model inference is usually performed via Gibbs sampler steps; however, the models' changing dimensions at different iterations of the sampler require the use of adaptive algorithms, which have some specific properties that need to be taken into account.
This paper provides a review of the methods for infinite factorisations, with a focus on their properties, comparative advantages and drawbacks. The paper proceeds as follows: Section 2 briefly reviews the formulation of a Bayesian factor model and a shrinkage prior on factor loadings. Sections 3–5 provide insight into the three above-mentioned priors for infinite factorisations, namely the MGP, CUSP and IBP priors, and outline their main advantages and drawbacks. Section 6 reviews the concept of generalised infinite factorisation models. Section 7 concludes with a discussion.

2 Bayesian infinite factor model


2.1 Bayesian latent factor model
In traditional Bayesian factor analysis, data on p related variables are assumed to arise from a multivariate normal distribution y_t ~ N_p(0, Ω), where y_t is the t-th of the T observations and Ω is the unknown covariance matrix of the data. A factor model represents each observation y_t as a linear combination of K common factors f_t = (f_1t, ..., f_Kt)^T:

$$y_t = \Lambda f_t + \epsilon_t, \qquad (1)$$

where Λ is an unknown p × K factor loading matrix with factor loadings λ_ih (i = 1, ..., p, h = 1, ..., K), and it is typically assumed that K ≪ p.
Often, the latent factors are assumed to be orthogonal and to follow a normal distribution f_t ~ N_K(0, I_K). Furthermore, it is assumed that the factors f_t and f_s are pairwise independent for t ≠ s.

The idiosyncratic errors ε_t are also assumed normal and pairwise independent:

$$\epsilon_t \sim N_p(0, \Sigma), \qquad \Sigma = \mathrm{diag}(\sigma_1^2, \ldots, \sigma_p^2).$$

These assumptions allow the covariance matrix of the data to be represented in the following way:

$$\Omega = \Lambda\Lambda^T + \Sigma. \qquad (2)$$

There are many different ways to choose a prior for the elements of the factor loading matrix Λ. A typical choice involves a version of a normal prior λ_ih ~ N(d⁰_ih, D⁰_ih), for reasons of conjugacy. The hyperparameter d⁰_ih is often chosen to be equal to zero. This has the additional advantage that, with a suitably chosen hyperprior for D⁰_ih, such a setting can result in a sparse Λ with many zero elements, which is justified in many applications of factor models. To ensure identifiability, it is often assumed that Λ has a full-rank lower triangular structure, which imposes the choice of a truncated normal prior for the diagonal elements of Λ to ensure positivity and a normal prior for the lower diagonal elements (see, e.g., Geweke and Zhou (1996), Lopes and West (2004), Ghosh and Dunson (2009), amongst others).
The idiosyncratic variances σ_i² are usually assigned an inverse gamma prior σ_i² ~ G⁻¹(c_0i, C_0i), mainly for reasons of conditional conjugacy.
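As a concrete illustration, the following sketch simulates a small data set from the model (1)–(2); the dimensions and the dense loading matrix are illustrative assumptions rather than settings used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
T, p, K = 200, 8, 2                              # illustrative sizes with K << p

Lam = rng.normal(0.0, 1.0, (p, K))               # factor loading matrix Lambda
sigma2 = rng.uniform(0.2, 1.0, p)                # idiosyncratic variances sigma_i^2
F = rng.standard_normal((T, K))                  # latent factors f_t ~ N_K(0, I_K)
eps = rng.normal(0.0, np.sqrt(sigma2), (T, p))   # idiosyncratic errors
Y = F @ Lam.T + eps                              # y_t = Lambda f_t + eps_t

# implied covariance Omega = Lambda Lambda^T + Sigma, cf. equation (2)
Omega = Lam @ Lam.T + np.diag(sigma2)
print(np.allclose(np.cov(Y, rowvar=False), Omega, atol=0.5))  # rough agreement
```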

2.2 Standard Gibbs sampler


Inference is usually performed via a Gibbs sampler, sequentially sampling factor loadings, idiosyncratic variances and factors from their respective conditional distributions. These steps are rather generic for a wide range of factor models and choices of parameters. Assuming that the data is explained by K latent factors and that d⁰_ih = 0 in the normal prior for the elements of the factor loading matrix, the Gibbs sampler steps for updating Λ, Σ and F = {f_t : t = 1, ..., T} look as follows:
Step 1. Sample λ_i for i in (1, ..., p) from

$$\lambda_i^T \mid - \sim N_K\big((\Psi_i^{-1} + \sigma_i^{-2}F^T F)^{-1} F^T\sigma_i^{-2} y_i, \ (\Psi_i^{-1} + \sigma_i^{-2}F^T F)^{-1}\big),$$

where Ψ_i = diag(D⁰_i1, ..., D⁰_iK), λ_i is the ith row of the factor loading matrix Λ, F is the T × K matrix of factors, and y_i is the T-vector of observations on variable i.

Step 2. Sample σ_i⁻² for i in (1, ..., p) from

$$\sigma_i^{-2} \mid - \sim \mathcal{G}\Big(c_{0i} + \frac{T}{2}, \ C_{0i} + \frac{1}{2}\sum_{t=1}^{T}(y_{it} - \lambda_i^T f_t)^2\Big).$$

Step 3. Sample f_t for t in (1, ..., T) from

$$f_t \mid - \sim N_K\big((I_K + \Lambda^T\Sigma^{-1}\Lambda)^{-1}\Lambda^T\Sigma^{-1} y_t, \ (I_K + \Lambda^T\Sigma^{-1}\Lambda)^{-1}\big),$$

where Σ = diag(σ_1², ..., σ_p²).

Additional steps can be added to update hyperparameters if hyperpriors are assigned to any of the parameters of the prior distributions for λ_ih and σ_i².
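A minimal Python sketch of one sweep of Steps 1–3 is given below, assuming a fixed number of factors K and the conjugate priors above; all names (gibbs_sweep, Psi for the prior variances D⁰_ih, c0, C0) are illustrative, not from the paper.

```python
import numpy as np

def gibbs_sweep(Y, Lam, F, sigma2, Psi, c0, C0, rng):
    """One Gibbs sweep over (Lambda, Sigma, F) for data Y of shape (T, p)."""
    T, p = Y.shape
    K = Lam.shape[1]
    # Step 1: update each row of the factor loading matrix
    for i in range(p):
        prec = np.diag(1.0 / Psi[i]) + F.T @ F / sigma2[i]
        cov = np.linalg.inv(prec)
        mean = cov @ (F.T @ Y[:, i]) / sigma2[i]
        Lam[i] = rng.multivariate_normal(mean, cov)
    # Step 2: idiosyncratic precisions from their conditional gamma
    resid = Y - F @ Lam.T
    sigma2 = 1.0 / rng.gamma(c0 + T / 2.0,
                             1.0 / (C0 + 0.5 * (resid ** 2).sum(axis=0)))
    # Step 3: factors, conditionally independent across t
    prec_f = np.eye(K) + Lam.T @ (Lam / sigma2[:, None])
    cov_f = np.linalg.inv(prec_f)
    mean_f = (Y / sigma2) @ Lam @ cov_f          # row t is E[f_t | -]
    F = mean_f + rng.multivariate_normal(np.zeros(K), cov_f, size=T)
    return Lam, F, sigma2
```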

2.3 Infinite factorisations and increasing shrinkage of the prior for Λ
In the Gibbs sampler steps described above, we take the number of latent factors K as known. In reality, this is rarely the case, and determining the plausible number of latent factors can be a difficult and time-consuming problem, especially in high-dimensional data sets. In the last decade, there has been a rise in the literature using a different approach towards determining the number of latent factors. This approach assumes that a factor model can in theory include infinitely many factors, i.e. the factor loading matrix Λ can be comprised of infinitely many columns. This means that Λ is seen as a parameter-expanded factor loading matrix with redundant parameters.
More formally, if Θ_Λ denotes the collection of all matrices Λ with p rows and infinitely many columns, then the product ΛΛ^T is a p × p matrix with all entries finite if and only if the following condition holds (this follows from the Cauchy–Schwarz inequality; see the proof in Bhattacharya and Dunson (2011)):

$$\Theta_\Lambda = \Big\{\Lambda = (\lambda_{ih}), \ i = 1,\ldots,p, \ h = 1,\ldots,\infty : \ \max_{1\le i\le p}\sum_{h=1}^{\infty}\lambda_{ih}^2 < \infty\Big\}.$$

The prior on the elements of Λ is defined in such a way that it allows the λ_ih's to decrease in magnitude as the column index h grows, thus penalising increasing factor dimensionality. This approach allows the number of factors to be derived automatically from the data via an adaptive inference algorithm. In the next sections we discuss the most notable methods for infinite factorisations in detail.

3 Multiplicative gamma process prior


3.1 The prior specification
In their seminal paper, Bhattacharya and Dunson (2011) proposed one way to choose a prior on the elements of a factor loading matrix so as to penalise the effect of additional columns: the λ_ih's are given a normal prior centred at zero, while the prior precisions of the λ_ih's for each h are defined as a cumulative product of gamma priors.
The MGP prior can be formalised as follows:

$$\lambda_{ih} \mid \phi_{ih}, \tau_h \sim N(0, \phi_{ih}^{-1}\tau_h^{-1}), \quad \phi_{ih} \sim \mathcal{G}(\nu_1/2, \nu_2/2), \quad \tau_h = \prod_{l=1}^{h}\delta_l, \qquad (3)$$

$$\delta_1 \sim \mathcal{G}(a_1, b_1), \quad \delta_l \sim \mathcal{G}(a_2, b_2), \ l \ge 2,$$
where the δ_l (l = 1, ..., ∞) are independent, τ_h is a global shrinkage parameter for the h-th column, and the φ_ih are local shrinkage parameters for the elements of the h-th column. The condition a_2 > 1 is imposed on the shape parameter of the prior for δ_l to ensure that the τ_h's are stochastically increasing with increasing h. In Bhattacharya and Dunson (2011), b_1 and b_2 are set at 1, while a_1 and a_2 are assigned the hyperprior G(2, 1) and sampled in a Metropolis-within-Gibbs step.
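A minimal sketch of drawing a truncated loading matrix from the MGP prior (3); the truncation level k_star and the hyperparameter values are illustrative assumptions. The column magnitudes tend to decay with the column index h, which is the increasing shrinkage the prior is designed to induce.

```python
import numpy as np

rng = np.random.default_rng(1)
p, k_star = 10, 20
nu1 = nu2 = 3.0
a1, b1, a2, b2 = 2.0, 1.0, 3.0, 1.0              # a2 > 1, as in the original proposal

delta = np.concatenate([rng.gamma(a1, 1.0 / b1, 1),
                        rng.gamma(a2, 1.0 / b2, k_star - 1)])
tau = np.cumprod(delta)                          # global column shrinkage tau_h
phi = rng.gamma(nu1 / 2.0, 2.0 / nu2, (p, k_star))   # local shrinkage phi_ih
Lam = rng.normal(0.0, 1.0 / np.sqrt(phi * tau), (p, k_star))
print(np.abs(Lam).mean(axis=0))                  # column magnitudes shrink as h grows
```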

3.2 Inference and adaptive Gibbs sampler


The inference is done via a Gibbs sampler with a few steps added to the standard ones described in Section 2.2. A distinctive feature of the sampler suggested in Bhattacharya and Dunson (2011) is that it truncates the factor loading matrix Λ to have k* columns, where k* is the number of factors supported by the data at each given iteration of the sampler. The truncation procedure deserves some closer attention.
Although theoretically the number of factors is allowed to be infinitely large, in practice one chooses a suitable level of truncation k*, designed to be large enough not to miss any important factors, but not so large as to induce unnecessary computational effort. The sampler is initiated with a conservative guess K_0, which is chosen to be substantially larger than the supposed actual number of factors. At each iteration of the sampler, the posterior samples of the factor loading matrix Λ contain information about the effective number of factors supported by the data in the following way. Let m^(g) be the number of columns of Λ at iteration g which have all their elements so small that they fall within some pre-specified neighbourhood of zero. These columns are considered redundant, and k*^(g) = k*^(g−1) − m^(g) is defined to be the effective number of factors at iteration g. To keep a balance between dimensionality reduction and exploring the whole space of possible factors, k* is adapted with probability p(g) = exp(α_0 + α_1 g), with the parameters chosen so that adaptation occurs more often at the beginning of the chain and decreases in frequency exponentially fast (the adaptation is designed to satisfy the diminishing adaptation condition in Theorem 5 of Roberts and Rosenthal (2007), which is necessary for convergence). When adaptation occurs, the redundant factors are discarded and the corresponding columns are deleted from the loading matrix (together with all other corresponding parameters). If none of the columns appear redundant at iteration g, a factor is added, with all its parameters sampled from the corresponding prior distributions. Adaptation is made to occur after a suitable burn-in period in order to ensure that the true posterior distribution is being sampled from before truncating the loading matrices.
In the adaptive Gibbs sampler with the MGP prior on the factor loadings, the first three steps are essentially the same as in Section 2.2, with two alterations: the number of factors K is replaced by k*, and in Step 1 D⁰_i1, ..., D⁰_iK are consequently replaced by φ_i1⁻¹τ_1⁻¹, ..., φ_ik*⁻¹τ_k*⁻¹.
The additional steps have the following form:
Step 4. Sample φ_ih for i in (1, ..., p) and h in (1, ..., k*) from

$$\phi_{ih} \mid - \sim \mathcal{G}\Big(\frac{\nu_1 + 1}{2}, \ \frac{\nu_2 + \tau_h\lambda_{ih}^2}{2}\Big).$$

Step 5. Sample δ_1 from

$$\delta_1 \mid - \sim \mathcal{G}\Big(\frac{2a_1 + pk^*}{2}, \ 1 + \frac{1}{2}\sum_{l=1}^{k^*}\tau_l^{(1)}\sum_{i=1}^{p}\phi_{il}\lambda_{il}^2\Big).$$

Sample δ_h for h ≥ 2 from

$$\delta_h \mid - \sim \mathcal{G}\Big(\frac{2a_2 + p(k^* - h + 1)}{2}, \ 1 + \frac{1}{2}\sum_{l=h}^{k^*}\tau_l^{(h)}\sum_{i=1}^{p}\phi_{il}\lambda_{il}^2\Big),$$

where $\tau_l^{(h)} = \prod_{t=1, t\neq h}^{l}\delta_t$ for h in (1, ..., k*).
Step 6. Sample the posterior densities of a_1 | δ_1 and a_2 | δ_2, ..., δ_k* via a random walk Metropolis-Hastings step, with a_1^p ~ N(a_1, s_1²) and a_2^p ~ N(a_2, s_2²) serving as proposal quantities and the acceptance probabilities being

$$\rho_{a_1} = \frac{\Gamma(a_1)}{\Gamma(a_1^p)}\,\frac{a_1^p}{a_1}\,\delta_1^{a_1^p - a_1}\,e^{a_1 - a_1^p},$$

$$\rho_{a_2} = \Big(\frac{\Gamma(a_2)}{\Gamma(a_2^p)}\Big)^{k^*-1}\,\frac{a_2^p}{a_2}\,\Big(\prod_{l=2}^{k^*}\delta_l\Big)^{a_2^p - a_2}\,e^{a_2 - a_2^p}.$$

Step 7. At each iteration, generate a random number u_g from U(0, 1). If u_g ≤ p(g), check whether any columns of the factor loading matrix Λ are within the pre-specified neighbourhood of 0, and if so, discard the redundant columns along with all their corresponding parameters. If the number of such columns is zero, generate an additional factor by sampling its parameters from the prior distributions.
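A minimal sketch of this adaptation step (Step 7), assuming Lam is the current (p, k_star) loading matrix; the threshold eps, the required proportion prop of near-zero loadings per column, and (alpha0, alpha1) are illustrative tuning values in the spirit of those used in Section 3.3, and the hyperparameters of the newly generated factor are assumed fixed.

```python
import numpy as np

def adapt_mgp(Lam, tau, phi, F, g, rng, eps=1e-2, prop=0.8,
              alpha0=-0.5, alpha1=-3e-4):
    """Drop redundant columns, or add one factor from the prior, w.p. p(g)."""
    if rng.uniform() > np.exp(alpha0 + alpha1 * g):
        return Lam, tau, phi, F                  # no adaptation this iteration
    near_zero = (np.abs(Lam) < eps).mean(axis=0) >= prop
    if near_zero.any():                          # discard redundant columns
        keep = ~near_zero
        return Lam[:, keep], tau[keep], phi[:, keep], F[:, keep]
    # otherwise add a factor, sampling its parameters from the priors
    p = Lam.shape[0]
    tau = np.append(tau, tau[-1] * rng.gamma(3.0, 1.0))  # assumed a2 = 3, b2 = 1
    phi_new = rng.gamma(1.5, 2.0 / 3.0, p)               # assumed nu1 = nu2 = 3
    lam_new = rng.normal(0.0, 1.0 / np.sqrt(phi_new * tau[-1]))
    return (np.column_stack([Lam, lam_new]), tau,
            np.column_stack([phi, phi_new]),
            np.column_stack([F, rng.standard_normal(F.shape[0])]))
```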

3.3 Practical applications and properties


The MGP prior was initially developed for high-dimensional data sets with p ≫ T and a sparse covariance matrix structure, such as gene expression data. However, it acquired widespread popularity and has proved useful in various applications, see e.g. Montagna et al. (2012) and Rai et al. (2014), amongst others. An application of particular interest is the infinite mixture of infinite factor analysers (IMIFA) model introduced in Murphy et al. (2020), where the MGP prior was used in the context of a mixture of factor analysers to allow automatic inference on the number of latent factors within each cluster.
However, the MGP model also has some important limitations. Some of these limitations are investigated in Durante (2017), who addressed the dependence of the shrinkage induced by the MGP prior on the values of the hyperparameters a_1 > 0 and a_2 > 0. Bhattacharya and Dunson (2011) state that the τ_h's in (3) are stochastically increasing with increasing h under the restriction a_2 > 1, which means that the induced prior on 1/τ_h increasingly shrinks the underlying quantity towards zero as the column index h increases. Durante (2017) argues that this is not sufficient to guarantee the increasing shrinkage property in the general case. Instead, further conditions are required, such as

$$a_2 > b_2 + 1, \quad a_2 > a_1 \qquad (4)$$

for the increasing penalisation of a high number of factors to hold (in expectation), provided that a_1 > 0 and a_2 > 0 and the values of a_1 are not excessively high. In his simulation study of the performance of the MGP prior for various values of the hyperparameters a_1 and a_2, Durante (2017) investigates the behaviour of the model with T = 100, p = 10, and two different values for the true number of factors, namely K = 2 and K = 6. The results show an improved posterior concentration when the parameters a_1 and a_2 satisfy condition (4), especially for the case K = 2. As the true rank of the model increases, there is evidence that the shrinkage induced by the MGP prior might be too strong.
Another critique of the MGP prior appeared in Legramanti et al. (2020), who pointed out that the hyperparameters a_1 and a_2 control both the rate of shrinkage and the prior for the loadings on active factors. This creates a trade-off between the need to maintain considerably diffuse priors for active components and the endeavour to shrink the redundant ones. In their simulation study, Legramanti et al. (2020) found that the MGP prior significantly overestimates the number of active factors on a medium-sized data set with p < T.

  (p, K)      mode k*   IQR    â1     â2
  (6, 2)       6.00     1.00   1.41   5.89
  (10, 3)      5.75     1.30   1.31   5.12
  (30, 5)      8.34     1.30   2.61   3.27
  (50, 8)     12.30     1.60   2.62   2.49
  (100, 15)   19.80     1.70   2.68   2.10
  (150, 25)    5.00     0.00   4.32   4.96

Table 1: Performance of the adaptive Gibbs sampler based on the MGP prior for various combinations of p and K. The modal estimates of k* and the interquartile range (IQR) are reported. â1 and â2 are the estimates of the values of a_1 and a_2 in (3) inferred via the Metropolis-Hastings step.
In an attempt to evaluate the performance of the MGP prior when the hyperparameters a_1 and a_2 are derived from the data, we simulated data in a similar way as in Bhattacharya and Dunson (2011). More specifically, a synthetic data set was simulated with T = 100 and idiosyncratic variances sampled from G⁻¹(1, 0.25). The number of non-zero elements in each column of Λ was chosen between k + 1 and 2k, with zeros allocated randomly and non-zero elements sampled independently from N(0, 9). We generated y_t from N_p(0, Ω), where Ω = ΛΛ^T + Σ. Further, we chose six (p, K) combinations to test various dimensions of Λ, namely (6, 2), (10, 3), (30, 5), (50, 8), (100, 15) and (150, 25), with a conservative initial upper bound of k_0 = min(p, 5 log(p)), and k_0 = 10 log(p) for the latter case with p > T. For each pair we considered 10 simulation replicates. The simulation was run for 30,000 iterations with a burn-in of 10,000.
We used the following hyperparameter values: ν_1 and ν_2 both equal to 3, and the rate parameters b_1 and b_2 in the gamma priors for δ_1 and δ_l set at 1. For the case when p < T, α_0 and α_1 in the adaptation probability expression were set to −0.5 and −3 × 10⁻⁴, and to −1 and −5 × 10⁻⁴ for the case when p ≥ T. The threshold for monitoring the columns to discard was set at 0.01², with the proportion of elements required to be below the threshold at 80% of p.
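A minimal sketch of this data-generating scheme for one (p, K) pair; the exact allocation of the zero elements is an illustrative reading of the setup in Bhattacharya and Dunson (2011).

```python
import numpy as np

rng = np.random.default_rng(42)
T, p, K = 100, 30, 5
Lam = np.zeros((p, K))
for h in range(K):
    n_nonzero = rng.integers(K + 1, 2 * K + 1)        # between k+1 and 2k non-zeros
    idx = rng.choice(p, size=min(int(n_nonzero), p), replace=False)
    Lam[idx, h] = rng.normal(0.0, 3.0, size=len(idx))  # non-zeros from N(0, 9)
sigma2 = 1.0 / rng.gamma(1.0, 1.0 / 0.25, p)          # sigma_i^2 ~ G^{-1}(1, 0.25)
Omega = Lam @ Lam.T + np.diag(sigma2)
Y = rng.multivariate_normal(np.zeros(p), Omega, size=T)
```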
The simulation results in Table 1 show that the model tends to overestimate the number of active factors in the cases with p ≤ T. In the last case, when the number of variables p exceeds the number of observations T, the number of active factors is severely underestimated compared to the true one. The last two columns in Table 1 show the posterior means of a_1 and a_2. The first efficient shrinkage condition of Durante (2017), a_2 > b_2 + 1, holds for all (p, K) combinations considered. For the first three combinations of p and K, the column shrinkage parameters a_1 and a_2, estimated from the data, are in accordance with the second efficient shrinkage condition of Durante (2017), namely a_2 > a_1. However, the condition a_2 > a_1 seems to cease holding when p gets closer to 50. This result is of some interest, especially in view of the simulation study in Durante (2017), which suggests that the shrinkage induced by the MGP prior (and satisfying the condition a_2 > a_1) might prove too strong when the dimension of the data set increases.
Assigning a hyperprior to influential parameters, as we did in the case of a_1 and a_2, is a good way to reduce the uncertainty and subjectivity of the model. However, the adaptation mechanism of such a sampler involves several hyperparameters, which may need to be adjusted depending on the nature

and dimensionality of the data. For example, we used an additional parameter indicating the proportion of the factor loadings in a column of Λ which need to be within the chosen neighbourhood of zero for the column to be considered redundant. This was first introduced in Murphy et al. (2020), who found the choice of these truncation parameters to be a delicate issue which strongly depends on the type of the data. The threshold defining the neighbourhood of 0, which is used to decide which factor loadings should be discarded, is another such example. Moreover, the parameters of the adaptation probability, α_0 and α_1, also need some tuning. In our simulation study, the speed of adaptation differed between the settings with p < T and p > T when using the same values for α_0 and α_1.
² Setting the threshold for monitoring the redundant columns at a value smaller than 0.01 in the case when p ≥ T led to an improvement of the results. However, tuning the threshold parameters remains highly heuristic and can be tricky when working with real data sets, where the true number of factors is not known.
The importance and difficulty of choosing a suitable truncation criterion in adaptive infinite factor algorithms was addressed in Schiavon and Canale (2020). The authors argue that the choice of truncation criterion, such as the predefined neighbourhood of zero, plays a vital role in the performance of the model. The optimal value of the criterion depends on the scale of the data, while the number of active factors can be severely underestimated if the value of the truncation criterion is too large, and severely overestimated if it is too small. This is especially true for high-dimensional data, as with p getting larger, the probability of having all values of |λ_ih| smaller than the predefined threshold goes to zero exponentially. In the absence of any guidance towards choosing an optimal value of such a threshold, this remains a highly subjective and arbitrary procedure. Schiavon and Canale (2020) suggest another way to define a criterion for truncating the redundant factors, which is robust to the scale of the data and has a well-defined upper bound. The main idea is to truncate Λ in such a way that the truncated model is able to explain at least a fraction Q ∈ (0, 1) of the total variability of the data, where the variability of y is measured by the trace of the covariance matrix Ω:

$$\frac{\mathrm{tr}(\Lambda_{k^*}\Lambda_{k^*}^T) + \mathrm{tr}(\Sigma)}{\mathrm{tr}(\Omega)} \ge Q,$$

where Λ_k* denotes the factor loading matrix obtained by discarding the columns of Λ starting from k* + 1. The authors conduct a simulation study which shows that using the suggested method to select the relevant active factors drastically improves the performance of the MGP model.
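A minimal sketch of this truncation rule, assuming the columns of Lambda are ordered by the increasing shrinkage prior; the function name and the default value of Q are illustrative.

```python
import numpy as np

def truncate_by_variance(Lam, sigma2, Q=0.99):
    """Smallest k_star whose truncated model explains a fraction >= Q of tr(Omega)."""
    total = np.sum(Lam ** 2) + np.sum(sigma2)    # tr(Omega) = tr(Lam Lam^T) + tr(Sigma)
    col_var = np.sum(Lam ** 2, axis=0)           # column h contributes sum_i lambda_ih^2
    ratio = (np.cumsum(col_var) + np.sum(sigma2)) / total
    return int(min(np.searchsorted(ratio, Q) + 1, Lam.shape[1]))
```

Because the criterion is a ratio of traces, it is invariant to a common rescaling of the data, which is the robustness property the authors emphasise.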

4 Cumulative shrinkage process prior


4.1 The prior specification
Legramanti et al. (2020) proposed another type of nonparametric prior on the variances of the elements of Λ, which largely corrects the drawbacks of the MGP prior. The CUSP prior on the factor loadings induces shrinkage via a sequence of spike-and-slab distributions that assign growing mass to the spike as the model complexity grows. The CUSP prior is formalised as follows:

$$\lambda_{ih} \mid \theta_h \sim N(0, \theta_h), \quad i = 1,\ldots,p, \ h = 1,\ldots,\infty,$$

$$\theta_h \mid \pi_h \sim (1 - \pi_h)\,\mathcal{G}^{-1}(a_\theta, b_\theta) + \pi_h\,\delta_{\theta_\infty}, \quad \pi_h = \sum_{l=1}^{h} w_l, \quad w_l = v_l\prod_{m=1}^{l-1}(1 - v_m), \qquad (5)$$

where π_h ∈ (0, 1) and the v_h's are generated independently from B(1, α), following the usual stick-breaking representation introduced in Sethuraman (1994). In equation (5), the inverse gamma distribution for the slab is chosen for reasons of conjugacy; in principle, the expression provides a general prior, where a sufficiently diffuse continuous distribution needs to be chosen for the slab. By integrating out θ_h, each loading λ_ih has the marginal prior

$$\lambda_{ih} \sim (1 - \pi_h)\,t_{2a_\theta}(0, b_\theta/a_\theta) + \pi_h\,N(0, \theta_\infty),$$

where t_{2aθ}(0, b_θ/a_θ) denotes the Student-t distribution with 2a_θ degrees of freedom, location 0 and scale b_θ/a_θ. To facilitate effective shrinkage of the redundant factors, θ_∞ should be set close to 0. The authors recommend a small value θ_∞ > 0, following Ishwaran and Rao (2005), as it induces a continuous shrinkage prior on every factor loading, thus improving mixing and identification of inactive factors. The authors use the fixed value θ_∞ = 0.05; however, it can be replaced by some continuous distribution without affecting the key properties of the prior. This is shown in Kowal and Canale (2022), where a normal mixture of inverse-gamma priors is employed for the spike and slab distributions. The slab parameters a_θ and b_θ should be specified so as to induce a moderately diffuse prior on active loadings.
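A minimal sketch of drawing column variances from the CUSP prior (5) via its stick-breaking representation; the truncation H mirrors the H = p + 1 rule described below, and the hyperparameter values follow the simulations in Section 4.3.

```python
import numpy as np

rng = np.random.default_rng(2)
p = 30
H, alpha = p + 1, 5.0
a_theta, b_theta, theta_inf = 2.0, 2.0, 0.05

v = rng.beta(1.0, alpha, H)
w = v * np.concatenate([[1.0], np.cumprod(1.0 - v[:-1])])  # stick-breaking weights
pi = np.cumsum(w)                                # spike probability pi_h grows with h
spike = rng.uniform(size=H) < pi                 # columns assigned to the spike
theta = np.where(spike, theta_inf, 1.0 / rng.gamma(a_theta, 1.0 / b_theta, H))
# theta[h] is the prior variance of the loadings in column h: late columns are
# increasingly likely to be shrunk to the small spike value theta_inf.
```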

4.2 Inference and adaptive Gibbs sampler


The inference is done via Gibbs sampler steps. Similarly to the MGP model, the first three steps remain essentially the same as in Section 2.2, with the difference that in Step 1 D⁰_i1, ..., D⁰_iK are replaced by θ_1, ..., θ_H, where H is the truncation level. This truncation level is chosen differently than in Bhattacharya and Dunson (2011), and the adaptation process is also different, designed in such a way that it depends less on heuristically chosen parameters.
While the probability of adaptation at iteration g of the sampler is also set to satisfy the diminishing adaptation condition of Roberts and Rosenthal (2007), there is no need to pre-specify an ad-hoc parameter describing some small neighbourhood of 0. The inactive columns of Λ are identified as those which are assigned to the spike and are discarded at iteration g with probability p(g) = e^(α_0 + α_1 g), together with all corresponding parameters. If at iteration g all columns of the factor loading matrix are identified as active, i.e. assigned to the slab, an additional column of Λ is generated from the spike and all the corresponding parameters are sampled from their respective prior distributions. The initial number of columns H at which the CUSP model is truncated is set equal to p + 1, following the consideration that there can be at most p active factors and by construction at least one column is assigned to the spike. The assignment of the columns of Λ to spike or slab at iteration g is done using H^(g) categorical variables z_h ∈ {1, 2, ..., H^(g)} with the discrete prior Pr(z_h = l | w_l) = w_l, where H^(g) is the number of columns in Λ at iteration g.
The additional Gibbs sampler steps look as follows:
Step 4. Sample θ_h in a data augmentation step. Thus, (5) can be obtained by marginalising out independent latent indicators z_h with probabilities p(z_h = l | w_l) = w_l for l = 1, ..., H from the equation

$$\theta_h \mid z_h \sim \{1 - \mathbb{1}(z_h \le h)\}\,\mathcal{G}^{-1}(a_\theta, b_\theta) + \mathbb{1}(z_h \le h)\,\delta_{\theta_\infty}.$$

Sample z_h for h in (1, ..., H) from a categorical distribution with probabilities

$$p(z_h = l \mid -) \propto \begin{cases} w_l\,N_p(\lambda_h; 0, \theta_\infty I_p), & l = 1,\ldots,h, \\ w_l\,t_{2a_\theta}(\lambda_h; 0, (b_\theta/a_\theta)I_p), & l = h+1,\ldots,H. \end{cases}$$
Step 5. Sample v_l for l in (1, ..., H − 1) from

$$v_l \mid - \sim \mathcal{B}\Big(1 + \sum_{h=1}^{H}\mathbb{1}(z_h = l), \ \alpha + \sum_{h=1}^{H}\mathbb{1}(z_h > l)\Big).$$

Set v_H = 1 and update w_1, ..., w_H from $w_l = v_l\prod_{m=1}^{l-1}(1 - v_m)$.
Step 6. For h in (1, ..., H): if z_h ≤ h, set θ_h = θ_∞; otherwise sample θ_h from $\mathcal{G}^{-1}\big(a_\theta + \frac{p}{2}, \ b_\theta + \frac{1}{2}\sum_{i=1}^{p}\lambda_{ih}^2\big)$.
Step 7. After some burn-in period g̃ required for the stabilisation of the chain, the truncation index H^(g) and the number of active factors $H^{*(g)} = \sum_{h=1}^{H^{(g)}}\mathbb{1}(z_h > h)$ are adapted with probability p(g) = exp(α_0 + α_1 g), with α_0 and α_1 chosen according to the criteria described in Section 3.2, as follows:

– if H^{*(g)} < H^{(g−1)} − 1: set H^(g) = H^{*(g)} + 1, drop the inactive columns in Λ^(g) along with the associated parameters in F^(g), θ^(g) and w^(g), and add a final column sampled from the spike to Λ^(g), together with the associated parameters in F^(g), θ^(g) and w^(g) sampled from the corresponding priors;

– otherwise: set H^(g) = H^{(g−1)} + 1 and add a final column sampled from the spike to Λ^(g), together with the associated parameters in F^(g), θ^(g) and w^(g) sampled from the corresponding priors.
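A minimal sketch of this adaptation step, assuming z holds the latent indicators z_h (so column h is active when z_h > h) and theta_inf = 0.05 as above; the tuning values and the simplification of appending only the loading column are illustrative.

```python
import numpy as np

def adapt_cusp(Lam, z, g, rng, theta_inf=0.05, alpha0=-1.0, alpha1=-5e-4):
    """Drop spike columns and append one final column drawn from the spike."""
    H = Lam.shape[1]
    if rng.uniform() > np.exp(alpha0 + alpha1 * g):
        return Lam                               # no adaptation this iteration
    active = z > np.arange(1, H + 1)             # z_h > h marks slab (active) columns
    if active.sum() < H - 1:
        Lam = Lam[:, active]                     # discard the inactive columns
    spike_col = rng.normal(0.0, np.sqrt(theta_inf), Lam.shape[0])
    return np.column_stack([Lam, spike_col])     # H^(g) = H* + 1 or H^(g-1) + 1
```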

4.3 Practical applications and properties


Since its introduction, the CUSP prior has been widely used in both theoretical studies and practical applications. The most notable of these include Kowal and Canale (2022), who employed a further generalised CUSP prior in the context of nonparametric functional bases; Frühwirth-Schnatter (2023), who extended the CUSP prior to the class of generalised cumulative shrinkage priors with arbitrary stick-breaking representations, which might be finite or infinite; and Gu and Dunson (2023), who applied the CUSP prior to infer the number of latent binary variables in the context of a Bayesian Pyramid (a multilayer discrete latent structure model for discrete data).
In contrast to the MGP prior, the CUSP prior on factor loadings provides a clear separation between the parameters which control active factors and those which control the shrinkage of the redundant terms. The shrinkage rate depends on α, in the sense that smaller values of α enforce more rapid shrinkage and therefore a smaller number of factors. The parameters a_θ, b_θ of the inverse gamma prior for the slab control the modelling of active factors (the inverse gamma prior can be replaced by another suitable continuous prior) and can be inferred from the data in the spirit of the parameters a_1 and a_2 in the MGP model.
To evaluate the comparative performance of the model with the CUSP prior on data sets of various dimensionality, we simulated data sets in the same way as in Section 3.3. The stick-breaking parameter α, which represents a prior expectation of the number of active factors in the data set, was set to 5 (as in Legramanti et al. (2020)). We also chose the same parameters of the slab distribution as in Legramanti et al. (2020), namely a_θ = b_θ = 2 and θ_∞ = 0.05. The parameters of the adaptation probability of the sampler, α_0 and α_1, were set to −1 and −5 × 10⁻⁴. The simulations were run for 15,000 iterations, with 5,000 discarded as burn-in, as convergence was achieved faster than in the case of the MGP prior. The simulation results are presented in Table 2 and show that the model was able to recover the correct number of factors in all considered cases.

  (p, K)      mode H*   IQR
  (6, 2)       2.00     0.00
  (10, 3)      3.00     0.00
  (30, 5)      5.00     0.00
  (50, 8)      8.00     0.00
  (100, 15)   15.00     0.00

Table 2: Performance of the adaptive Gibbs sampler based on the CUSP prior for various combinations of p and K. The modal estimates of H* and the interquartile range (IQR) are reported.
The CUSP model offers significant advantages over the MGP model by eliminating the very subjective and influential truncation threshold and by decoupling the generation mechanisms for active and redundant components. This results in much more robust estimation of the number of factors in data sets of various dimensions. In our experience, assigning some continuous distribution to δ_θ∞ and a hyperprior to b_θ can improve the performance, especially on non-standardised data sets. The model provides poor uncertainty quantification, however, with the sampler often being stuck at one (in most cases correct) value of H*. This problem was addressed in Kowal and Canale (2022) by extending the CUSP prior with a parameter expansion scheme which disperses the shrinkage applied to the factors.

5 Indian buffet process prior


5.1 The prior specification
Another, slightly different approach to modelling factor loading matrices involves the Indian buffet process (Griffiths and Ghahramani (2006)), which defines a distribution over infinite binary matrices, to provide sparsity and a framework for inferring the number of latent factors in the data set. This approach was first suggested in Knowles and Ghahramani (2011) and is formally presented below.
First, a binary matrix Z is introduced whose elements indicate whether an observed variable i has a contribution (non-zero loading) from factor h. Then the elements of Λ can be modelled in the following way:

$$\lambda_{ih} \mid z_{ih} \sim z_{ih}\,N(\lambda_{ih}; 0, \beta_h^{-1}) + (1 - z_{ih})\,\delta_0(\lambda_{ih}),$$

where β_h is the precision of the factor loadings in the hth column of Λ and δ_0 is a delta function with a point mass at 0.
Thus, the factor loadings are modelled via a spike-and-slab distribution; however, differently from the CUSP prior, the separation into the spike and the slab is done not with a variance parameter but directly for the factor loadings λ_ih via an auxiliary binary indicator matrix. This allows a potentially infinite number of latent factors, i.e. Z has infinitely many columns, of which only a finite number will have non-zero entries. If π_h is the probability of factor h contributing to any of the p variables, and K is the (for the moment finite) number of latent factors, the IBP with intensity parameter α_IB arises from the Beta-Bernoulli prior

$$z_{ih} \mid \pi_h \sim \mathrm{Bernoulli}(\pi_h), \quad \pi_h \mid \alpha_{IB} \sim \mathcal{B}\Big(\frac{\alpha_{IB}}{K}, 1\Big),$$

by setting K → ∞ and integrating out π_h.
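A minimal sketch of simulating such a binary matrix via the IBP "restaurant" construction: variable i activates each previously used factor h with probability m_h/i and then opens a Poisson(alpha_IB/i) number of new factors; the function name is illustrative.

```python
import numpy as np

def sample_ibp(p, alpha_ib, rng):
    counts, rows = [], []                        # counts m_h of variables using factor h
    for i in range(1, p + 1):
        row = [rng.uniform() < m / i for m in counts]
        k_new = rng.poisson(alpha_ib / i)        # new factors opened by variable i
        counts = [m + int(u) for m, u in zip(counts, row)] + [1] * k_new
        rows.append(row + [True] * k_new)
    Z = np.zeros((p, len(counts)), dtype=int)
    for i, row in enumerate(rows):
        Z[i, :len(row)] = row
    return Z

Z = sample_ibp(p=10, alpha_ib=2.0, rng=np.random.default_rng(3))
print(Z.sum(axis=0))                             # only finitely many non-zero columns
```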

5.2 Inference and adaptive Gibbs sampler


The inference is done via a Gibbs sampler, of which the second and third steps are the same as in Section 2.2. The initial number of factors, which defines the dimensions of Λ and Z, is chosen as some conservative number which clearly overfits any possible number of factors in the data set. Step 1 differs in that not the ith row of the factor loading matrix Λ but each element λ_ih is sampled separately from a univariate normal distribution, if z_ih = 1:
Step 1. Sample λ_ih for which z_ih = 1 from

$$\lambda_{ih} \mid - \sim N\big((\beta_h + \sigma_i^{-2} f_h f_h^T)^{-1}\sigma_i^{-2} f_h y_i^T, \ (\beta_h + \sigma_i^{-2} f_h f_h^T)^{-1}\big),$$

where f_h is the (row) vector of t = 1, ..., T observations of factor h.


The precisions β_h are sampled in the following way:
Step 4. Sample β_h, provided it is given a gamma prior G(a_β, b_β), from

$$\beta_h \mid z_h, \lambda_h \sim \mathcal{G}\Big(a_\beta + \frac{\sum_{i=1}^{p} z_{ih}}{2}, \ b_\beta + \frac{1}{2}\sum_{i:\, z_{ih}=1}\lambda_{ih}^2\Big).$$

The binary indicator z_ih can be sampled using the fact that the ratio p(z_ih = 1 | −)/p(z_ih = 0 | −) can be calculated from the likelihood and prior probabilities, and for every element there are only two possible events, z_ih = 1 or z_ih = 0. This is done in the following way:
Step 5. Sample the binary indicator z_ih using

$$\frac{p(z_{ih} = 1 \mid -)}{p(z_{ih} = 0 \mid -)} = \sqrt{(\beta_h + \sigma_i^{-2} f_h f_h^T)^{-1}\beta_h}\,\exp\Big(\frac{1}{2}(\beta_h + \sigma_i^{-2} f_h f_h^T)^{-1}(\sigma_i^{-2} f_h y_i^T)^2\Big)\,\frac{m_{-i,h}}{T - 1 - m_{-i,h}},$$

where m_{−i,h} is the number of other variables for which factor h is active, not counting variable i.
Although the binary matrix Z has infinitely many columns, only the non-zero ones contribute to the likelihood. However, one needs to take into account the zero columns too, as the number of factors can (and in many cases will) change at subsequent iterations of the sampler. Let us denote by κ_i the number of columns of Z which contain 1 only in row i, so that it carries information about the number of factors which are active only for variable i (in terms of the Indian buffet process, this is the number of new dishes customer i tries). After the sampling step 5, κ_i = 0 for any i by design, so the new factors κ_i are sampled in a separate MH step. Note that this is not a random walk MH step, as the proposal densities are not symmetric.
Step 6. Sample the number of new active factors κ_i in an MH step with the acceptance ratio

$$\rho_{\kappa_i} = (2\pi)^{-\frac{T\kappa_i}{2}}\,|M|^{-\frac{T}{2}}\,\exp\Big(\frac{1}{2}\sum_{t} m_t^T M m_t\Big)\,\frac{\mathrm{Pois}(\kappa_i; \alpha_{IB}/(p-1))}{\mathrm{Pois}(\kappa_i; \alpha_{IB}\nu/(p-1))},$$

where ν > 0 is a tuning parameter aimed at improving mixing, M = σ_i⁻²λ_κi λ_κi^T + I_κi with λ_κi denoting a 1 × κ_i vector of the new elements of the factor loading matrix, and m_t = M⁻¹σ_i⁻²λ_κi(y_it − λ_i^T f_t). Steps 5 and 6 are designed to be in one loop over i = (1, ..., p): for each variable i, first the indicator z_ih is sampled for every h, and then the number of new factors for variable i is sampled in the following step.
Step 7. Assuming the gamma prior G(a_α, b_α), sample the IBP strength parameter α_IB from

$$\alpha_{IB} \mid Z \sim \mathcal{G}\Big(a_\alpha + K_+, \ b_\alpha + \sum_{j=1}^{p}\frac{1}{j}\Big),$$

where K_+ is the number of active factors for which z_ih = 1 for at least one i.
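A minimal sketch of the Bernoulli update in Step 5, turning the printed odds ratio into a draw of z_ih; the argument layout and names are illustrative.

```python
import numpy as np

def sample_z(f_h, y_i, beta_h, sigma2_i, m_neg, T, rng):
    """Draw z_ih from its conditional, given factor h and variable i."""
    prec = beta_h + (f_h @ f_h) / sigma2_i               # beta_h + sigma^-2 f_h f_h^T
    lik = np.sqrt(beta_h / prec) * np.exp(0.5 * (f_h @ y_i / sigma2_i) ** 2 / prec)
    odds = lik * m_neg / (T - 1 - m_neg)                 # prior odds as printed above
    return int(rng.uniform() < odds / (1.0 + odds))
```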

5.3 Practical applications and properties


The IBP prior coupled with a spike-and-slab distribution proved to be a useful approach to modelling sparse factor loadings, and represents an alternative to imposing increasing shrinkage on the columns of the factor loading matrix as a way of inferring the number of active factors. A somewhat related work was introduced earlier by Rai and Daume (2008) in the context of a nonparametric Bayesian factor regression model, where a sparse IBP prior was coupled with a hierarchical prior over factors. The authors did not assume independence of factors as in traditional factor analysis, and instead of a normal prior used Kingman's coalescent prior, which describes an exchangeable distribution over a countable set of factors.
The original model of Knowles and Ghahramani (2011) was further extended in Ročková and George (2016), where the authors couple the IBP prior on the binary indicators with a spike-and-slab LASSO (SSL) prior on the elements of Λ. The SSL prior assigns to both the spike and the slab components a Laplace distribution, designed so that the slab has a common scale parameter and the spike has a factor-specific scale parameter (different for each h). This prior tackles the problem of rotational invariance of Λ by automatically promoting rotations with many zero loadings, thus resulting in many exact zeros in the factor loading matrix and facilitating identification. Differently from Knowles and Ghahramani (2011) and Rai and Daume (2008), who do inference via a Gibbs sampler, Ročková and George (2016) use an expectation-maximization (EM) algorithm, which brings computational advantages for high-dimensional data.
Recently, Frühwirth-Schnatter (2023) suggested an exchangeable shrinkage process (ESP) prior for a finite number of factors K, which is related to the IBP prior when K → ∞. The prior in its general form is formulated as follows:

$$\lambda_{ih} \mid \tau_h \sim (1 - \tau_h)\,\delta_0 + \tau_h\,P_{slab}(\lambda_{ih}), \quad \tau_h \mid K \sim \mathcal{B}(a_K, b_K), \quad h = 1,\ldots,K, \qquad (6)$$

where δ_0 is a Dirac delta, P_slab is an arbitrary continuous slab distribution, and K is the finite number of factors. The slab probabilities τ_h then determine the number of active factors K_+ < K. When in (6) b_K = 1 and a_K = α_IB/K, this prior converges to the IBP prior for K → ∞ (Teh et al. (2007)). The ESP prior has been used in the context of sparse Bayesian factor analysis in Frühwirth-Schnatter et al. (2022a) and in the context of a mixture of factor analysers model in Grushanina and Frühwirth-Schnatter (2023).

6 Generalised infinite factor models


One of the recent developments in the area of infinite factor models is the generalised infinite factorisation model developed in Schiavon et al. (2022), where the authors were motivated by the existing methods' drawbacks, such as the lack of accommodation for grouped variables and other non-exchangeable structures. While the existing increasing shrinkage models focus on priors for Λ which are exchangeable within columns, they lack consideration for possible grouping of the rows of Λ, which can occur in many applications, such as, for example, different genes in genomic data sets.
Here we briefly outline the main idea of the proposed method without going into much detail.
The generalised model is defined in the following way:

$$y_{it} = s_i(z_{it}), \quad z_t = \Lambda f_t + \epsilon_t, \quad \epsilon_t \sim \eta_\epsilon, \qquad (7)$$

where Λ is a p × K factor loading matrix, f_t is a K-dimensional factor with a diagonal covariance matrix Ξ = diag(ξ_11, ..., ξ_KK), ε_t is a p-dimensional error term independent of the factors, η_ε is some arbitrary distribution, and s_i : R → R is a transformation function for i = 1, ..., p. Here, differently from the factor model described in Section 2.1, it is not necessarily assumed that f_t and ε_t are normally distributed.
When, in fact, this is the case and s_i is the identity function, the model (7) takes the form of the Gaussian linear factor model described in Section 2.1. When s_i = F_i⁻¹(Φ(z_it)), with Φ denoting the Gaussian cumulative distribution function, the model (7) becomes a Gaussian copula factor model as described in Murray et al. (2013). Choosing an appropriate s_i and modifying the assumptions regarding the distribution of the parameters in (7) results in other types of factor models. The covariance matrix Ω as in (2) has a more general form in the case of the generalised infinite factorisation model, Ω = ΛΞΛ^T + Σ, where Σ is the covariance matrix of the error term. The suggested prior on the elements of Λ allows infinitely many columns, so that the number of factors K → ∞, and is formulated as follows:

$$\lambda_{ih} \mid \theta_{ih} \sim N(0, \theta_{ih}), \quad \theta_{ih} = \tau_0\gamma_h\phi_{ih}, \quad \tau_0 \sim \eta_{\tau_0}, \quad \gamma_h \sim \eta_{\gamma_h}, \quad \phi_{ih} \sim \eta_{\phi_i}, \qquad (8)$$

where τ_0, γ_h and φ_ih are responsible for global, column-specific and local shrinkage, respectively; they are independent a priori, and the distributions η_τ0, η_γh and η_φi are supported on [0, ∞).
What is essentially different from the previously described models is that, via φ_ih, a non-exchangeable structure is imposed on the rows of Λ through some meta covariates X, which inform the sparsity structure of Λ. Denoting by X_{p×q} a matrix of q meta covariates, η_φi should be chosen so as to satisfy

$$E(\phi_{ih} \mid \beta_h) = g(x_i^T\beta_h), \quad \beta_h = (\beta_{1h},\ldots,\beta_{qh})^T, \quad \beta_{mh} \sim \eta_\beta, \ m = 1,\ldots,q,$$

where g is a smooth one-to-one differentiable link function, x_i = (x_i1, ..., x_iq) denotes the ith row of X, and β_h are coefficients controlling the impact of the meta covariates on the shrinkage of the elements of the hth column of Λ. Taking the example from the ecology application studied in Schiavon et al. (2022), different bird species (variables i) may belong to the same phylogenetic order (meta covariates m), have roughly the same size, follow a similar diet, etc.
In more detail, the priors and hyperpriors on the factor loadings are specified as follows:

$$\tau_0 = 1, \quad \gamma_h = \nu_h\rho_h, \quad \phi_{ih} \mid \beta_h \sim \mathrm{Ber}\{\mathrm{logit}^{-1}(x_i^T\beta_h)\,c_p\},$$

$$\nu_h^{-1} \sim \mathcal{G}(a_\nu, b_\nu), \ a_\nu > 1, \quad \rho_h \sim \mathrm{Ber}(1 - \pi_h), \quad \beta_h \sim N_q(0, \sigma_\beta^2 I_q),$$

where the link function g(x) takes the form logit⁻¹(x) = e^x/(1 + e^x) and c_p ∈ (0, 1) is a possible offset. The distribution of the parameter π_h = p(γ_h = 0) follows a stick-breaking construction

$$\pi_h = \sum_{l=1}^{h} w_l, \quad w_l = v_l\prod_{m=1}^{l-1}(1 - v_m), \quad v_m \sim \mathcal{B}(1, \alpha_{gen}),$$

similar to Legramanti et al. (2020).
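A minimal sketch of drawing the structured scales θ_ih in (8) under these choices; X, c_p and the hyperparameter values are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)
p, q, H = 20, 3, 10
X = rng.standard_normal((p, q))                  # meta covariates informing sparsity
sigma_beta, c_p, a_nu, b_nu, alpha_gen = 1.0, 0.99, 2.0, 1.0, 5.0

v = rng.beta(1.0, alpha_gen, H)
w = v * np.concatenate([[1.0], np.cumprod(1.0 - v[:-1])])
pi = np.cumsum(w)                                # pi_h = p(gamma_h = 0)

beta = rng.normal(0.0, sigma_beta, (q, H))
prob = c_p / (1.0 + np.exp(-(X @ beta)))         # logit^{-1}(x_i' beta_h) * c_p
phi = rng.uniform(size=(p, H)) < prob            # Bernoulli local indicators phi_ih
rho = rng.uniform(size=H) < (1.0 - pi)           # column-level activity rho_h
nu = 1.0 / rng.gamma(a_nu, 1.0 / b_nu, H)        # nu_h^{-1} ~ G(a_nu, b_nu)
theta = phi * (nu * rho)                         # theta_ih = tau0 * gamma_h * phi_ih
```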


The model inference is performed via an adaptive Gibbs sampler which resembles the one developed for the CUSP model. The frequency of adaptation is set in accordance with Theorem 5 of Roberts and Rosenthal (2007), and at the iterations at which adaptation occurs, the redundant columns of the loading matrix are discarded together with all other corresponding parameters, and the number of active factors is adapted accordingly. The redundant columns are identified as those for which ρ_h = 0. If at some iteration there are no redundant columns, then an additional factor and all its corresponding parameters are generated from the priors.
The exact form of the Gibbs sampler steps depends on the prior assumptions for the elements of (7). In the case of the standard isotropic Gaussian and inverse gamma priors for the factors and idiosyncratic variances, steps 2 and 3 of the sampler are identical to the ones described in Section 2.2. For a detailed description of the Gibbs sampler steps, the reader is referred to the Supplementary Material of Schiavon et al. (2022).

7 Discussion and identification issues


Infinite factorisation models offer the enormous advantage of automatic inference on the number of active factors by allowing it to be derived from the data. This is done by assigning a nonparametric prior to the elements of the factor loading matrix which penalises an increasing number of columns. Some of these models at the same time account for element-wise sparsity of the factor loadings, which is justified in many real-life applications, such as genetics, economics, biology, and many others.
One of the weak points of such models is that they often rely on rather subjective truncation parameters, with a lack of clear guidance towards the procedure of choosing such parameters. The MGP prior of Bhattacharya and Dunson (2011) is the most prominent example of this; the simulation studies in Schiavon and Canale (2020) and in Section 3.3 of this paper illustrate the point. This subjectivity was significantly reduced in the CUSP prior of Legramanti et al. (2020). Generalisation of the CUSP prior by setting a hyperprior on the spike parameter, as in Kowal and Canale (2022), significantly improved the performance of the model on data sets of different nature and eliminated the need for data-dependent parameter tuning. In addition, the parameter-expanded version of the CUSP model suggested in Kowal and Canale (2022) resulted in better uncertainty quantification. The class of generalised infinite factorisation models of Schiavon et al. (2022) generalises the idea of infinite factorisations with increasing shrinkage on factor loadings and incorporates it into a wide class of various types of factor models. In addition, it allows the grouping of variables, which provides a useful feature for a wide range of applications. The truncation of the redundant factors is done in a similar way to the CUSP model; however, the complexity of this rather general model makes some subjective choices regarding hyperparameters and functional forms unavoidable.
Another important issue concerns the identification of factor loadings. It is well known that the decomposition of the covariance matrix Ω as in (2) is not unique. First, the correct identification of the idiosyncratic covariance matrix should be ensured, to guarantee that in the following two representations,

$$\Omega = \Lambda\Lambda^T + \Sigma, \qquad \Omega = \Theta\Theta^T + \Sigma_0,$$

Σ = Σ_0 and, hence, the cross-covariance matrix ΛΛ^T = ΘΘ^T is uniquely identified. This problem is known under the name of variance identification. The row deletion property of Anderson and Rubin (1956) presents a sufficient condition for variance identification and states that whenever an arbitrary row is deleted from Λ, two disjoint matrices of rank K should remain. This property imposes an upper bound on the number of factors, K ≤ (p − 1)/2. So, for dense factor models, variance identification can fail if the number of factors is too high. For sparse factor models, additional restrictions on the number of non-zero elements in each column of Λ need to be applied (see, e.g., Frühwirth-Schnatter et al. (2022b)). Although in most cases K ≪ p and the upper bound will be respected, there is no formal guarantee of variance identification for infinite factor models even when the factor loading matrix is dense, and even less so in the case of sparse infinite factor models.
The second problem deals with the correct identification of Λ from ΛΛ^T. It is referred to as the problem of rotational invariance and stems from the fact that for any semi-orthogonal matrix P with PP^T = I, and Θ = ΛP, g_t = P^T f_t, the two models

$$y_t = \Lambda f_t + \epsilon_t \quad \text{and} \quad y_t = \Theta g_t + \epsilon_t$$

are observationally indistinguishable. This problem is often addressed in the literature by imposing restrictions on the elements of Λ, such as, for example, setting the upper diagonal elements equal to zero and requiring the diagonal elements to be positive, so that Λ is a positive lower triangular matrix. This approach was first implemented by Geweke and Zhou (1996) and followed by many others (see, for example, Lopes and West (2004) and Carvalho et al. (2008)). This constraint introduces order dependence upon the variables, which results in posterior distributions whose shapes depend on the ordering of the variables in the data set, and is thus not applicable to infinite factor models. However, these models can still be employed for the tasks of covariance matrix estimation, variable selection and prediction, which do not require identification.
However, while variance identification is rarely addressed in the literature, and not at all in the context of infinite factor models, in recent years some ex-post identification methods aimed at tackling rotational invariance have been proposed which are applicable to infinite factor models. These methods usually involve some kind of orthogonalisation procedure applied at a post-processing step, such as, for example, the orthogonal Procrustes algorithm (Aßmann et al. (2016)) or the Varimax procedure (Poworoznek et al. (2021)).
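As an illustration of such post-processing, the sketch below gives a textbook SVD-based Varimax rotation in numpy; it is a generic implementation of the usual Varimax criterion, not the matching algorithm of Poworoznek et al. (2021).

```python
import numpy as np

def varimax(Lam, tol=1e-8, max_iter=500):
    """Rotate a loading matrix towards the Varimax criterion."""
    p, k = Lam.shape
    R = np.eye(k)
    crit_old = 0.0
    for _ in range(max_iter):
        L = Lam @ R
        # gradient of the Varimax criterion with respect to the rotation
        G = Lam.T @ (L ** 3 - L @ np.diag((L ** 2).sum(axis=0)) / p)
        U, s, Vt = np.linalg.svd(G)
        R = U @ Vt                               # closest orthogonal matrix to G
        crit = s.sum()
        if crit < crit_old * (1.0 + tol):
            break
        crit_old = crit
    return Lam @ R
```

Applied draw by draw to posterior samples of Λ (followed by alignment of column signs and order), this kind of rotation resolves much of the rotational ambiguity ex post.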
There have also been some attempts to embed identification considerations into the estimation procedure. Thus, Ročková and George (2016) offer a solution to the indeterminacy due to rotational invariance via the SSL prior, which automatically promotes rotations with many zero loadings and thus reduces posterior multimodality. Their EM algorithm provides sparse posterior modal estimates with exact zeros in the factor loading matrix. Schiavon et al. (2022) propose an identification scheme which is somewhat similar in spirit. They search for an approximation to the maximum a posteriori estimators of Λ, β = (β_1, β_2, ...) and Σ by integrating out the scale parameters and latent factors from the posterior density function and taking the parameters of interest from the draw which produced the highest marginal posterior density f(Λ, β, Σ | y).

References
Aguilar, O. and M. West (2000). “Bayesian Dynamic Factor Models and Portfolio Allocation”. In:
Journal of Business and Economic Statistics 18(3), pp. 338–357.
Anderson, T.W. and H. Rubin (1956). “Statistical inference in factor analysis”. In: Proceedings of the
Third Berkeley Symposium on Mathematical Statistics and Probability Volume V, pp. 111–150.
Aßmann, C., J. Boysen-Hogrefe, and M. Pape (2016). “Bayesian analysis of static and dynamic factor
models: An ex-post approach towards the rotation problem”. In: Journal of Econometrics 192(1),
pp. 190–206.
Bai, J. and S. Ng (2002). “Determining the number of factors in approximate factor models”. In:
Econometrica 70(1), pp. 191–221.
Bai, Jushan and Peng Wang (2016). “Econometric Analysis of Large Factor Models”. In: Annual
Review of Economics 8, pp. 53–80.
Barhoumi, K., O. Darné, and L. Ferrara (2013). Dynamic Factor Models: A Review of the Literature.
Working papers. Banque de France.
Bhattacharya, A. and D.B. Dunson (2011). “Sparse Bayesian infinite factor models”. In: Biometrika
98(2), pp. 291–306.
Carvalho, C.M. et al. (2008). “High-Dimensional Sparse Factor Modeling: Applications in Gene Expression Genomics”. In: Journal of the American Statistical Association 103(484), pp. 1438–1456.
Conti, J.C. et al. (2014). “Bayesian Exploratory Factor Analysis”. In: Journal of Econometrics 183,
pp. 31–57.
Durante, D. (2017). “A note on the multiplicative gamma process”. In: Statistics & Probability Letters
122, pp. 198–204.
Fan, Jianqing, Kunpeng Li, and Yuan Liao (2021). “Recent Developments in Factor Models and Ap-
plications in Econometric Learning”. In: Annual Review of Financial Economics 13(1), pp. 401–
430.
Frühwirth-Schnatter, S., D. Hosszejni, and H. F. Lopes (2022a). “Sparse finite Bayesian factor analy-
sis when the number of factors is unknown”. In: ArXiv 2301.06459.
Frühwirth-Schnatter, S., D. Hosszejni, and H.F. Lopes (2022b). “When it counts - Econometric iden-
tification of factor models based on GLT structures”. In: ArXiv: 2301.06354.
Frühwirth-Schnatter, S. and H. Lopes (2010). Parsimonious Bayesian Factor Analysis when the Number of Factors is Unknown. Research report. Booth School of Business, University of Chicago.
Frühwirth-Schnatter, S. and H. Lopes (2018). “Sparse Bayesian Factor Analysis when the Number of
Factors is Unknown”. In: ArXiv 1804.04231.
Frühwirth-Schnatter, Sylvia (2023). “Generalized Cumulative Shrinkage Process Priors with Applications to Sparse Bayesian Factor Analysis”. In: Philosophical Transactions of the Royal Society A 381, 20220148. DOI: 10.1098/rsta.2022.0148.
Geweke, J. and G. Zhou (1996). “Measuring the pricing error of the arbitrage pricing theory”. In:
Review of Financial Studies 9(2), pp. 557–587.

Ghosh, Joyee and David B. Dunson (2009). “Default Prior Distributions and Efficient Posterior Com-
putation in Bayesian Factor Analysis”. In: Journal of Computational and Graphical Statistics
18(2), pp. 306–320.
Green, P. (1995). “Reversible jump Markov chain Monte Carlo computation and Bayesian model
determination”. In: Biometrika 82(4), pp. 711–732.
Griffiths, T. and Z. Ghahramani (2006). “Infinite latent feature models and the Indian buffet process”.
In: Advances in Neural Information Processing Systems. Ed. by Y. Weiss, B. Schölkopf, and J.
Platt. Vol. 18. MIT Press.
Grushanina, M. and S. Frühwirth-Schnatter (2023). Dynamic Mixture of Finite Mixtures of Factor
Analysers with Automatic Inference on the Number of Clusters and Factors. arXiv: 2307.07045.
Gu, Y. and D.B. Dunson (2023). “Bayesian Pyramids: identifiable multilayer discrete latent struc-
ture models for discrete data”. In: Journal of the Royal Statistical Society Series B: Statistical
Methodology 85(2), pp. 399–426.
Ishwaran, H. and J.S. Rao (2005). “Spike and slab variable selection: Frequentist and Bayesian strate-
gies”. In: The Annals of Statistics 33(2), pp. 730–773.
Kapetanios, G. (2010). “A testing procedure for determining the number of factors in approximate
factor models with large datasets”. In: Journal of Business and Economic Statistics 3(28), pp. 251–
258.
Kaufmann, S. and C. Schumacher (2019). “Bayesian estimation of sparse dynamic factor models with order-independent and ex-post mode identification”. In: Journal of Econometrics 210(1), pp. 116–134.
Knowles, D. and Z. Ghahramani (2011). “Nonparametric Bayesian sparse factor models with appli-
cation to gene expression modeling”. In: The Annals of Applied Statistics 5(2B), pp. 1534–1552.
Kowal, D.R. and A. Canale (2022). “Semiparametric Functional Factor Models with Bayesian Rank
Selection”. In: ArXiv 2108.02151.
Legramanti, S., D. Durante, and D.B. Dunson (2020). “Bayesian cumulative shrinkage for infinite
factorizations”. In: Biometrika 107(3), pp. 745–752.
Lopes, H.F. and M. West (2004). “Bayesian model assessment in factor analysis”. In: Statistica Sinica
14(1), pp. 41–67.
Montagna, Silvia et al. (2012). “Bayesian Latent Factor Regression for Functional and Longitudinal
Data”. In: Biometrics 68(4), pp. 1064–1073.
Murphy, K., C. Viroli, and I.C. Gormley (2020). “Infinite Mixtures of Infinite Factor Analysers”. In:
Bayesian analysis 15(3), pp. 937–963.
Murray, J.S. et al. (2013). “Bayesian Gaussian Copula Factor Models for Mixed Data”. In: Journal of
the American Statistical Association 108(502), pp. 656–665.
Polasek, W. (1997). “Factor analysis and outliers: a Bayesian approach”. In: Discussion Paper, Uni-
versity of Basel.
Poworoznek, E., F. Ferrari, and D. Dunson (July 2021). “Efficiently resolving rotational ambiguity in
Bayesian matrix sampling with matching”. In: ArXiv: 2107.13783.

Rai, P. and H. Daume (2008). “The Infinite Hierarchical Factor Regression Model”. In: Advances in
Neural Information Processing Systems. Ed. by D. Koller et al. Vol. 21.
Rai, P. et al. (2014). “Scalable Bayesian Low-Rank Decomposition of Incomplete Multiway Tensors”.
In: Proceedings of the 31st International Conference on Machine Learning. Vol. 32. 2, pp. 1800–
1808.
Roberts, G.O. and J.S. Rosenthal (2007). “Coupling and ergodicity of adaptive Markov chain Monte
Carlo algorithms”. In: Journal of Applied Probability 44(2), pp. 458–475.
Ročková, V. and E.I. George (2016). “Fast Bayesian factor analysis via automatic rotation to sparsity”.
In: Journal of the American Statistical Association 111(516), pp. 1608–1622.
Schiavon, L. and A. Canale (2020). “On the truncation criteria in infinite factor models”. In: Stat 9(1),
e298.
Schiavon, L., A. Canale, and D.B. Dunson (2022). “Generalized infinite factorization models”. In:
Biometrika 109(3), pp. 817–835.
Sethuraman, J. (1994). “A constructive definition of Dirichlet priors”. In: Statistica Sinica 4, pp. 639–
650.
Spearman, C. (1904). “‘General Intelligence,’ Objectively Determined and Measured”. In: The American Journal of Psychology 15(2), pp. 201–292.
Stock, J.H. and M.W. Watson (2016). “Chapter 8 - Dynamic Factor Models, Factor-Augmented Vector
Autoregressions, and Structural Vector Autoregressions in Macroeconomics”. In: ed. by John B.
Taylor and Harald Uhlig. Vol. 2. Handbook of Macroeconomics. Elsevier, pp. 415–525.
Teh, Y., D. Görür, and Z. Ghahramani (2007). “Stick-breaking Construction for the Indian Buffet
Process”. In: Proceedings of the Eleventh International Conference on Artificial Intelligence and
Statistics. Ed. by Marina Meila and Xiaotong Shen. Vol. 2. Proceedings of Machine Learning
Research. PMLR: San Juan, Puerto Rico, pp. 556–563.
Thurstone, L.L. (1931). “Multiple factor analysis”. In: Psychological Review 38(5), pp. 406–427.
Thurstone, L.L. (1934). “The Vectors of Mind”. In: The Psychological Review 41, pp. 1–32.
West, M. (2003). “Bayesian Factor Regression Models in the ‘large p, small n’ Paradigm”. In: Bayesian Statistics. Oxford University Press, pp. 723–732.
