Lecture:
Non-Rigid Structure from Motion
-------------------------------------------------------------------------------------------------------------------------------
3D Vision
Universitat Pompeu Fabra
Discussion
Non-Rigid Shapes?
§ Can we obtain non-rigid 3D information from images?
Structure from Motion
(Rigid) Structure from Motion
Given: a monocular video (or a collection of pictures)
We want: to recover simultaneously the 3D shape and the camera
motion
Epipolar geometry can be used
The assumption of rigidity is enough to make the problem well-posed
What about non-rigid motion?
Our world is Non Rigid!
No external markers!
One or Many
Why is this important?
The world is non-rigid! There are many everyday applications in many
different domains:
§ Movie industry, augmented reality
§ Experimental industry
§ Sport industry: sailing
§ Endoscopy
The movie industry
Even more details
[Figure label: 1200 frames per second]
Animal Reconstruction
[Figure label: 1200 frames per second]
…produces better robots?
Epipolar geometry can be used
The assumption of rigidity is enough to make the problem well-posed
Can Epipolar geometry be used?
Considering only one image, we obtain the same 3D constraint as in the
rigid case (the 3D point lies on the viewing ray of its 2D observation)
After acquiring a new image, we obtain a similar constraint, but now
triangulation is not available: the shape is non-rigid, so the point may
have moved between the two images
Non-Rigid
Structure from Motion
Non-Rigid Structure from Motion
Given: a monocular video (or a collection of pictures)
We want: to recover simultaneously the 3D shape of a time-varying
object (4D estimation) and the camera motion
Some Results
Solving the problem
The problem can be solved by:
• Factorization: a closed-form solution can be achieved by using
SVD factorization and enforcing a specific rank (which depends on
the camera model and the type of scene). However, it is hard to
enforce additional constraints accurately
• Non-linear Optimization: the solution is achieved iteratively. The
computational cost can be higher, but additional priors can be
enforced accurately
In terms of processing, the problem can be solved:
• Offline: all the frames are processed at once, after video
capture
• Online: the frames are processed as the data arrive, frame by
frame. This suits more real applications, but can be less accurate
Non-Rigid
Structure from Motion
Problem Statement
A Reminder of Camera Models
§ Perspective camera: All rays converge to the optical center
§ Orthographic camera: All rays are parallel. Z-coordinate is
irrelevant in the projection
Perspective camera Orthographic camera
3D-to-2D: Perspective Model
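The p-th 3D point $\mathbf{X}_p=[X_p, Y_p, Z_p, 1]^T$, in homogeneous coordinates, is
related to its 2D projection $\mathbf{x}_{ip}=[x_{ip}, y_{ip}]^T$ in the i-th image by means
of a matrix $P_i$, up to a scale factor $\lambda_{ip}$ (the projective depth):
$$\lambda_{ip} \begin{bmatrix} x_{ip} \\ y_{ip} \\ 1 \end{bmatrix} = P_i \begin{bmatrix} X_p \\ Y_p \\ Z_p \\ 1 \end{bmatrix}$$
where $P_i$ is a 3x4 matrix that combines the intrinsic calibration $A_i$ (3x3),
the rotation $R_i$ (3x3) and the translation $\mathbf{t}_i$ (3x1):
$$P_i = A_i \, [\, R_i \mid \mathbf{t}_i \,]$$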
3D-to-2D: Orthographic Model
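The p-th 3D point $\mathbf{X}_p=[X_p, Y_p, Z_p]^T$ is related to its 2D projection
$\mathbf{x}_{ip}=[x_{ip}, y_{ip}]^T$ in the i-th image by means of a matrix $R_i$:
$$\mathbf{x}_{ip} = R_i \mathbf{X}_p + \mathbf{t}_i$$
where $R_i$ is a 2x3 matrix (the first two rows of a 3x3 rotation matrix)
and $\mathbf{t}_i$ is a 2x1 translation vector.
In practice, we remove the translations by centering the observations:
with the 3D points centered at the origin, $\mathbf{t}_i$ equals the mean of the
$\mathbf{x}_{ip}$ over the P points of image i. In later computations, $\mathbf{x}_{ip}$ denotes
the centered observation $\mathbf{x}_{ip} - \mathbf{t}_i$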
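As a minimal sketch of the orthographic model and the centering step, assuming NumPy and 3D points centered at the origin (the function names `orthographic_project` and `center_observations` are illustrative, not from the lecture):

```python
import numpy as np

def orthographic_project(R_full, X, t):
    """Project 3xP points with the first two rows of a 3x3 rotation, then translate."""
    R = R_full[:2, :]              # 2x3 orthographic camera matrix R_i
    return R @ X + t[:, None]      # 2xP image points

def center_observations(x):
    """Subtract the per-image mean of the 2D points (the estimate of t_i)."""
    t_hat = x.mean(axis=1)
    return x - t_hat[:, None], t_hat

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 5))
X -= X.mean(axis=1, keepdims=True)   # assumption: 3D points centered at the origin
angle = 0.3
R_full = np.array([[np.cos(angle), -np.sin(angle), 0.0],
                   [np.sin(angle),  np.cos(angle), 0.0],
                   [0.0,            0.0,           1.0]])
t = np.array([4.0, -2.0])

x = orthographic_project(R_full, X, t)
x_centered, t_hat = center_observations(x)
```

Under the centering assumption, `t_hat` recovers `t`, and `x_centered` equals `R_full[:2] @ X`, so the translation no longer appears in the later equations.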
Problem Statement
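Orthographic camera
For the i-th image (centered observations):
$$\underbrace{W_i}_{2 \times P} = \underbrace{R_i}_{2 \times 3} \, \underbrace{X_i}_{3 \times P}$$
In the rigid case, $X_i$ is the same for all the images. In the non-rigid
case, there is a different 3D configuration per image
Full Linear Relation
$$\underbrace{W}_{2I \times P} = \underbrace{\begin{bmatrix} R_1 & & \\ & \ddots & \\ & & R_I \end{bmatrix}}_{2I \times 3I} \underbrace{\begin{bmatrix} X_1 \\ \vdots \\ X_I \end{bmatrix}}_{3I \times P}$$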
Orthographic camera
Measurement Matrix
Considering P non-rigid 3D points observed in I RGB images, we
can collect all observations to obtain a linear system such as:
Perspective camera: $W = P\,X$
Orthographic camera: $W = R\,X$
where W is a 3IxP matrix, P is 3Ix4I, and X is 4IxP for the
perspective case; and W is a 2IxP matrix, R is 2Ix3I, and X is 3IxP
for the orthographic case. P and R are block-diagonal, with one
camera matrix per image
What about the rank of W?
With the same linear systems and no further assumptions, the rank of W
is bounded only by its size:
Perspective camera: rank(W) ≤ min(3I, P)
Orthographic camera: rank(W) ≤ min(2I, P)
A severely ill-posed problem
Orthographic camera: W (2IxP) = R (2Ix3I) X (3IxP)
2IP entries < 6I variables + 3IP variables
This is an explosion of variables
A Toy Comparison
Let us assume a 1 minute video with just 100 tracked points, and
considering only the estimation of the 3D shape
Rigid Case
Input data: 100 points x 60 sec x 30 Hz x 2 = 360,000 measurements
Unknowns: 100 points x 3 = 300 unknowns
Result: well-posed problem
Non-Rigid Case
Input data: 100 points x 60 sec x 30 Hz x 2 = 360,000 measurements
Unknowns: 100 points x 60 sec x 30 Hz x 3 = 540,000 unknowns
Result: ill-posed problem
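The counting in this comparison can be reproduced with a few lines of Python. This is only an illustrative sketch; the helper `count_problem` is a hypothetical name, and it counts shape unknowns only, as in the comparison itself.

```python
def count_problem(points, seconds, fps, rigid):
    """Return (measurements, shape unknowns) for a tracked monocular video."""
    frames = seconds * fps
    measurements = points * frames * 2            # one (x, y) pair per point per frame
    if rigid:
        shape_unknowns = points * 3               # a single 3D shape for the whole video
    else:
        shape_unknowns = points * frames * 3      # one 3D shape per frame
    return measurements, shape_unknowns

rigid = count_problem(100, 60, 30, rigid=True)        # (360000, 300)
non_rigid = count_problem(100, 60, 30, rigid=False)   # (360000, 540000)
```

In the rigid case the measurements outnumber the unknowns by three orders of magnitude; in the non-rigid case there are fewer measurements than unknowns.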
How can I solve the problem?
The art of priors
Including deformation priors is substantially more difficult than
using simple rigidity
Many possibilities were presented
A wide variety of priors in literature:
§ Physical priors. Particle dynamics, elasticity, finite elements,
and many others
§ Probabilistic priors. Low-rank models on shape, trajectory,
shape-trajectory or force domains. Union of subspaces,
Gaussian priors
§ Geometric priors. Isometric, as rigid as possible, bone lengths,
quadratic models
§ Temporal priors. Temporally coherent deformations
§ Piecewise priors
§ Many others
Shape Linear Subspace
(a probabilistic prior)
A Low-Rank Shape Model
Basically, the non-rigid 3D shape can be obtained as a linear
combination of fixed shape vectors. For every combination of
weight coefficients, a different solution can be achieved:
[Figure: your estimation = rotation x (linear combination of some shapes) + translation]
Including the low-rank shape model
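We approximate the 3D shape by a linear combination of K shape
vectors $B_k$ (normally, K << P or I). For every k-th component, a
weight coefficient $l_{ik}$ is needed. As the shape is non-rigid, by
modifying the coefficients for every i-th image, we change the
3D shape as:
$$\underbrace{X}_{3I \times P} = \underbrace{(L \otimes I_3)}_{3I \times 3K} \, \underbrace{B}_{3K \times P}$$
where L is the IxK matrix of weight coefficients $l_{ik}$, $I_3$ is the 3x3
identity and B stacks the K basis shapes $B_k$ (each of size 3xP)
Another type of expression, for the i-th image:
$$X_i = \sum_{k=1}^{K} l_{ik} B_k$$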
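A small NumPy sketch of the low-rank shape model, with random weights and basis shapes standing in for real data (the function name `compose_shapes` and the sizes are assumptions made for illustration). It builds all the per-image shapes at once with a Kronecker product and checks one image against the per-image sum.

```python
import numpy as np

def compose_shapes(L, B):
    """L: IxK weights, B: 3KxP stacked basis shapes. Returns the 3IxP stack of shapes."""
    return np.kron(L, np.eye(3)) @ B      # (3I x 3K) times (3K x P)

rng = np.random.default_rng(1)
I, K, P = 4, 2, 10
L = rng.normal(size=(I, K))               # one row of K weights per image
B = rng.normal(size=(3 * K, P))           # basis shapes B_1, ..., B_K stacked vertically
X = compose_shapes(L, B)

# Per-image form for the third image (index 2): X_i = sum_k l_ik B_k
X_2 = sum(L[2, k] * B[3 * k:3 * k + 3] for k in range(K))
```

Although `X` has 3I rows, all of them are combinations of the 3K rows of `B`, which is what keeps the number of unknowns under control.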
Shape Basis Estimation
In non-rigid structure from motion, we have some alternatives to
estimate the shape basis:
§ The most natural is to learn it on the fly, using only the input data
§ The input data can also be used to estimate a shape basis from
a shape at rest (like a mean shape) by applying:
- Modal analysis based on physical models
- Spectral analysis based on a distance matrix
§ If training data are available, the basis can be learned from them
(PCA, deep learning, etc.). This approach is supervised
Non-Rigid
Structure from motion
by factorization
Including the low-rank shape model
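Thanks to the relation between the 3D shape and the shape basis
(orthographic camera):
$$\underbrace{X}_{3I \times P} = \underbrace{(L \otimes I_3)}_{3I \times 3K} \, \underbrace{B}_{3K \times P}$$
with L the IxK matrix of weight coefficients $l_{ik}$ and $I_3$ the 3x3 identity,
we obtain the projection equation by using the low-rank shape
model as:
$$\underbrace{W}_{2I \times P} = \underbrace{M}_{2I \times 3K} \, \underbrace{B}_{3K \times P}, \qquad M = R\,(L \otimes I_3) = \begin{bmatrix} l_{11} R_1 & \cdots & l_{1K} R_1 \\ \vdots & & \vdots \\ l_{I1} R_I & \cdots & l_{IK} R_I \end{bmatrix}$$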
What about the perspective case?
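A similar analysis can be followed, but now considering
homogeneous coordinates. We can obtain:
$$\underbrace{W}_{3I \times P} = \underbrace{M}_{3I \times (3K+1)} \, \underbrace{B}_{(3K+1) \times P}$$
where each 3x4 camera matrix $P_i$ is split into a 3x3 block, which
multiplies the K basis shapes, and a 3x1 column, which multiplies the
homogeneous row of ones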
Factorization
In both cases, the goal is to infer the motion factor (P or R) and the
3D coordinates X of the observed non-rigid object from 2D point
tracks in a monocular video W:
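Orthographic camera: $\underbrace{W}_{2I \times P} = \underbrace{M}_{2I \times 3K} \, \underbrace{B}_{3K \times P}$
Perspective camera: $\underbrace{W}_{3I \times P} = \underbrace{M}_{3I \times (3K+1)} \, \underbrace{B}_{(3K+1) \times P}$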
The full linear system
W=MB
Two factors: the motion factor M (camera rotations and weight
coefficients) and the shape basis B. The 3D shape follows as the
product of B and the coefficients
More on factorization Orthographic camera
Because M is a 2Ix3K matrix and B is a 3KxP matrix, the rank of W
is at most 3K. If we apply SVD to W, we will have at most 3K non-zero
singular values
However, measurements are normally noisy, and in practice the
rank will not be 3K. We have to impose it
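Applying SVD factorization and keeping the 3K largest singular values,
we have:
$$W = U \Sigma V^T = [U \Sigma^{1/2}][\Sigma^{1/2} V^T] = [U \Sigma^{1/2} Q][Q^{-1} \Sigma^{1/2} V^T]$$
i.e., $M = U \Sigma^{1/2} Q$ and $B = Q^{-1} \Sigma^{1/2} V^T$ (the two factors we look for)
We need to tune the rank K a priori
Many solutions can be achieved by modifying Q: the factorization is
valid for every invertible 3Kx3K matrix Q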
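The rank-3K factorization can be sketched in NumPy as follows. The data are synthetic (a random rank-3K matrix plus small noise), the name `factorize` is hypothetical, and the matrix Q used at the end is an arbitrary invertible matrix chosen only to illustrate the ambiguity, not a metric upgrade.

```python
import numpy as np

def factorize(W, K):
    """Rank-3K truncated SVD of the 2IxP measurement matrix W."""
    r = 3 * K
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    sqrt_s = np.sqrt(s[:r])
    M_hat = U[:, :r] * sqrt_s              # 2I x 3K, equals U Sigma^(1/2)
    B_hat = sqrt_s[:, None] * Vt[:r]       # 3K x P, equals Sigma^(1/2) V^T
    return M_hat, B_hat

rng = np.random.default_rng(2)
I, K, P = 10, 2, 30
W_clean = rng.normal(size=(2 * I, 3 * K)) @ rng.normal(size=(3 * K, P))
W_noisy = W_clean + 1e-3 * rng.normal(size=W_clean.shape)   # noise makes W full rank

M_hat, B_hat = factorize(W_noisy, K)       # the rank 3K is imposed here

# Any invertible Q gives another factorization with the same product
Q = np.eye(3 * K) + 0.1 * rng.normal(size=(3 * K, 3 * K))
M_alt = M_hat @ Q
B_alt = np.linalg.solve(Q, B_hat)          # Q^{-1} B_hat
```

Both `(M_hat, B_hat)` and `(M_alt, B_alt)` reproduce the same rank-3K approximation of W, which is why an additional step is needed to select Q.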
Metric Upgrade
How is Q computed?
By enforcing orthogonality constraints on the camera rotation. A
rotation matrix always has some properties (it is not a random
matrix), since it lies on the SO(3) manifold: its rows are orthonormal
Be careful. Now, matrix M also includes the weight coefficients in
addition to the camera rotations!
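A sketch of how these constraints can be written, under notation assumed here rather than taken from the slides: $\hat{M}_i$ is the 2x3K block of $\hat{M} = U \Sigma^{1/2}$ for image i, and $Q_k$ is the k-th 3Kx3 column block of Q. Since $\hat{M}_i Q_k = l_{ik} R_i$ and the rows of $R_i$ are orthonormal:

$$\hat{M}_i \, Q_k Q_k^T \, \hat{M}_i^T = l_{ik}^2 \, R_i R_i^T = l_{ik}^2 \, I_2$$

The unknown $l_{ik}^2$ drops out by requiring equal diagonal entries and a zero off-diagonal entry, and both conditions are linear in the symmetric matrix $Q_k Q_k^T$. In the rigid case (K = 1, $l_{i1} = 1$) this reduces to $\hat{M}_i \, Q Q^T \, \hat{M}_i^T = I_2$.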
But in many cases, we cannot observe all the points in all the
images: missing tracks
A toy example with missing tracks
Orthographic camera, with I = 4 images and P = 4 points:
$$\underbrace{W}_{8 \times 4} = \underbrace{R}_{8 \times 12} \, \underbrace{X}_{12 \times 4}$$
Some entries of W are missing, because the corresponding points are
not observed in those images
Input data is not full!
Handling missing tracks
Two alternatives are possible:
§ Applying a matrix completion algorithm to infer the missing
entries, and then run factorization over the full measurement
matrix
§ Do not consider the missing entries in the formulation, and apply
non-linear optimization over the observed ones. Once the 3D model
and camera pose are computed, the missing 2D tracks can be inferred too
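As an illustration of the first alternative, the sketch below fills the missing entries with a simple iterative scheme: project onto rank-r matrices with a truncated SVD, then restore the observed entries, and repeat. This is one basic matrix completion strategy chosen for brevity, not necessarily the algorithm used in the lecture; `complete_low_rank` and the synthetic data are assumptions.

```python
import numpy as np

def complete_low_rank(W, mask, rank, n_iters=300):
    """Fill the entries where mask is False using a rank-`rank` approximation."""
    Z = np.where(mask, W, 0.0)                          # start with zeros in the gaps
    for _ in range(n_iters):
        U, s, Vt = np.linalg.svd(Z, full_matrices=False)
        low_rank = (U[:, :rank] * s[:rank]) @ Vt[:rank]
        Z = np.where(mask, W, low_rank)                 # keep the observed entries fixed
    return Z

rng = np.random.default_rng(3)
W_full = rng.normal(size=(20, 3)) @ rng.normal(size=(3, 30))   # synthetic rank-3 tracks
mask = rng.random(W_full.shape) > 0.1                   # True where the track is observed
W_obs = np.where(mask, W_full, np.nan)                  # missing tracks stored as NaN

W_filled = complete_low_rank(W_obs, mask, rank=3)
```

The completed matrix `W_filled` can then be passed to the factorization step as if all the tracks had been observed.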
Non-Rigid
Structure from Motion
by Non-Linear Optimization
Problem Statement
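For an orthographic camera, we have:
$$\mathbf{x}_{ip} = R_i \sum_{k=1}^{K} l_{ik} \, \mathbf{b}_{kp} + \mathbf{t}_i$$
where $\mathbf{b}_{kp}$ is the p-th column of the k-th basis shape $B_k$
The problem (compacting over the points) can be formulated as:
$$\min_{R_i, \, l_{ik}, \, B_k, \, \mathbf{t}_i} \; \sum_{i=1}^{I} \Big\| W_i - R_i \sum_{k=1}^{K} l_{ik} B_k - \mathbf{t}_i \mathbf{1}^T \Big\|_F^2 \quad \text{subject to} \quad R_i R_i^T = I_2$$
and we perform non-linear optimization by minimizing this geometric
error cost function. The translation $\mathbf{t}_i$ is optional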
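A possible NumPy implementation of this geometric error is sketched below. The array layout (R as Ix2x3, L as IxK, B as 3KxP, optional t as Ix2) and the function names are assumptions made for the example; the synthetic cameras are taken from the first two rows of random orthogonal matrices.

```python
import numpy as np

def reprojection_residuals(W, R, L, B, t=None):
    """W: 2IxP, R: Ix2x3, L: IxK, B: 3KxP, t: Ix2 or None. Returns 2IxP residuals."""
    I, K = L.shape
    P = B.shape[1]
    basis = B.reshape(K, 3, P)                         # basis[k] is the 3xP shape B_k
    res = np.empty((2 * I, P))
    for i in range(I):
        X_i = np.tensordot(L[i], basis, axes=1)        # 3xP shape of image i
        pred = R[i] @ X_i                              # 2xP predicted projections
        if t is not None:
            pred = pred + t[i][:, None]
        res[2 * i:2 * i + 2] = W[2 * i:2 * i + 2] - pred
    return res

def cost(W, R, L, B, t=None):
    """Sum of squared reprojection errors over all images and points."""
    return float(np.sum(reprojection_residuals(W, R, L, B, t) ** 2))

rng = np.random.default_rng(4)
I, K, P = 3, 2, 6
R = np.stack([np.linalg.qr(rng.normal(size=(3, 3)))[0][:2] for _ in range(I)])
L = rng.normal(size=(I, K))
B = rng.normal(size=(3 * K, P))
W = np.vstack([R[i] @ np.tensordot(L[i], B.reshape(K, 3, P), axes=1) for i in range(I)])

cost_at_truth = cost(W, R, L, B)
cost_perturbed = cost(W, R, L + 0.1, B)
```

The cost vanishes at the parameters that generated W and grows as soon as the weights are perturbed; a non-linear solver works on the residual vector returned by `reprojection_residuals`.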
Bundle Adjustment
Normally, the Levenberg-Marquardt method is used to minimize the
problem. We need a Jacobian matrix J: the derivative of the residual
function with respect to the unknowns (R, B and the set of weights $l_{ik}$)
Again, there are many variants on how to proceed to reduce the
computational complexity of the problem:
§ Alternate minimization of motion and shape parameters
§ Sparse methods. Computing J is costly, but it can be approximated
efficiently when its binary sparsity pattern is provided
Initialization: the optimization can be initialized assuming a rigid
shape, i.e., using rigid factorization or non-linear optimization for a
rigid shape
Bundle Adjustment
The bundle adjustment method:
§ Minimize the cost function with Levenberg-Marquardt
§ Exploit the sparseness of the Jacobian matrix to decrease
computation and memory requirements
The Levenberg-Marquardt algorithm:
§ Is a mixture of Gauss-Newton and gradient descent
§ Behaves like Gauss-Newton when close to the minimum
(quadratic region)
§ Behaves like gradient descent when the prediction is poor
§ Depends on a parameter θ that controls the mixture of
Gauss-Newton and gradient descent as:
$$(J^T J + \theta I) \, \delta_p = -g$$
where $g = J^T r$ is the gradient of the cost for the residual vector r, and
$\delta_p$ is the update of the parameters p we want to estimate
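The damped update can be sketched on a small curve-fitting problem. This is a generic, dense Levenberg-Marquardt loop written for illustration (the function name, the factor-of-10 schedule for θ and the exponential toy model are all assumptions); a real bundle adjustment would use the reprojection residuals and a sparse solver.

```python
import numpy as np

def levenberg_marquardt(residual, jacobian, p0, theta=1e-2, n_iters=100):
    """Minimize 0.5 * ||residual(p)||^2 with a damped Gauss-Newton iteration."""
    p = np.asarray(p0, dtype=float)
    r = residual(p)
    for _ in range(n_iters):
        J = jacobian(p)
        g = J.T @ r                                    # gradient of the cost
        delta = np.linalg.solve(J.T @ J + theta * np.eye(p.size), -g)
        r_new = residual(p + delta)
        if r_new @ r_new < r @ r:                      # good step: accept, closer to Gauss-Newton
            p, r, theta = p + delta, r_new, theta / 10.0
        else:                                          # poor step: reject, closer to gradient descent
            theta *= 10.0
    return p

# Toy problem: recover (a, b) in y = a * exp(b * x) from noise-free samples
x = np.linspace(0.0, 2.0, 20)
y = 2.0 * np.exp(-1.0 * x)

def residual(p):
    return p[0] * np.exp(p[1] * x) - y

def jacobian(p):
    return np.column_stack([np.exp(p[1] * x), p[0] * x * np.exp(p[1] * x)])

p_hat = levenberg_marquardt(residual, jacobian, p0=[1.0, 0.0])
```

A small θ makes the step close to Gauss-Newton, while a large θ shrinks it toward a short gradient-descent step, which is the mixture described above.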
Exercise
Let us assume a monocular video of 3 images, where 6 points are
observed. Considering the map is non-rigid and the visibility is full,
define the corresponding Jacobian matrix. A low-rank shape model
of rank 2 can be considered
Number of unknowns
Number of equations
J = (sparse pattern!)
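One possible way to approach the exercise is sketched below, under assumptions that the statement leaves open: each camera rotation is parametrized with 3 values (for example, axis-angle), translations are removed by centering, and the unknowns are ordered as rotations, then weights, then basis entries. The helper `jacobian_pattern` is hypothetical.

```python
import numpy as np

def jacobian_pattern(I, P, K):
    """Binary sparsity pattern of J: rows are residuals, columns are unknowns."""
    n_rot, n_w = 3 * I, I * K
    n_unknowns = n_rot + n_w + 3 * K * P
    n_equations = 2 * I * P
    pattern = np.zeros((n_equations, n_unknowns), dtype=bool)
    for i in range(I):
        for p in range(P):
            first = 2 * (i * P + p)
            rows = slice(first, first + 2)                            # x and y residuals of point p in image i
            pattern[rows, 3 * i:3 * i + 3] = True                     # rotation of image i
            pattern[rows, n_rot + K * i:n_rot + K * (i + 1)] = True   # weights of image i
            start = n_rot + n_w + 3 * K * p
            pattern[rows, start:start + 3 * K] = True                 # basis entries of point p
    return pattern

pattern = jacobian_pattern(I=3, P=6, K=2)
n_equations, n_unknowns = pattern.shape
```

With I = 3, P = 6 and K = 2, this parametrization gives 36 equations and 51 unknowns, and every row has 3 + K + 3K = 11 non-zero entries, so most of J is zero.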
Including priors
As in the rigid case, we can apply temporal smoothness priors, but
now, in both camera motion and shape deformation (be careful
when input data are a collection of pictures). To this end, we may
consider the expression:
where Li includes all K weight coefficients in the i-th image
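As a sketch, one common first-order form of such a temporal smoothness prior is given below. It is an assumption for illustration, not necessarily the exact expression used in the lecture, and the weights $\lambda_L$ and $\lambda_R$ are hypothetical balancing parameters:

$$E_{smooth} = \lambda_L \sum_{i=2}^{I} \| L_i - L_{i-1} \|_2^2 + \lambda_R \sum_{i=2}^{I} \| R_i - R_{i-1} \|_F^2$$

The first term penalizes abrupt changes of shape between consecutive images, and the second penalizes abrupt camera motion. Both terms assume temporally ordered frames, which is why care is needed when the input is an unordered collection of pictures.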
How can we obtain a sequential solution?
We solve the optimization in a sequential manner, considering the
information as the data arrive. Future frames are not available. Two
options:
§ Pure sequential (frame by frame)
§ Sliding window (from 3 to 5 consecutive frames)
Initialization is performed by rigid estimation (assuming just the
initial frames). The problem is actually challenging
An Extension
Semantic 3D Reconstruction
3D Reconstruction of Categories
Unsupervised 3D Reconstruction and Grouping of Rigid and Non-Rigid Categories. Antonio Agudo. IEEE Transactions on
Pattern Analysis and Machine Intelligence (TPAMI), 44(1): 519-532, 2022.
Input Data as Training Data
Shape Basis as a MLP
Neural Dense Non-Rigid Structure from Motion with Latent Space Constraints. Vikramjit Sidhu, Edgar Tretschk, Vladislav
Golyanik, Antonio Agudo, and Christian Theobalt. European Conference on Computer Vision, 2020.
Shape Basis as a MLP
Priors and models can be included as terms of the loss function used in
training. For example, the energy of the work cited above combines a data
term with priors
Some Results
Neural Radiance Fields in
the non-rigid context
Dynamic Neural Radiance Fields
4DPV: 4D Pet from Videos by Coarse-to-fine Non-Rigid Radiance Fields. Sergio M. de Paco and Antonio Agudo. Asian
Conference on Computer Vision, 2024.
Coarse-to-fine Shapes from Videos
Demo
Things to remember
3D and 4D information can be obtained from a sequence of images
For rigid objects, the problem is well-posed. For non-rigid ones, it is
inherently ill-posed (additional priors are necessary)
Model-based approaches can handle a wide variety of
deformations. They are normally universal and generic. No
supervision is needed
Data-based approaches require a lot of data to constrain the
solution space. Obtaining *good* data can be hard. They work only
for a particular object or deformation (depending on the training data)
The future must be unsupervised (or self-supervised) and will probably
combine both model- and data-based approaches, performing the
estimation in multiple scenarios with a hand-held camera
Acknowledgments
Thanks to Kris Kitani, Yaser Sheikh, Alessio del Bue, Lourdes
Agapito, Sergio M. de Paco