TLDR: A new research paper introduces XFactor, the first self-supervised model capable of “true” Novel View Synthesis (NVS) by achieving transferability. Unlike previous methods that struggle with applying learned camera movements across different scenes, XFactor disentangles camera pose from scene content using a stereo-monocular model and a novel transferability objective with input/output augmentations. It outperforms prior models like RayZer and RUST and introduces a new metric, True Pose Similarity (TPS), to quantify this crucial ability, demonstrating NVS without relying on complex 3D geometry.
Novel View Synthesis (NVS) is a fascinating area in 3D computer vision that allows us to generate new views of a scene from different camera angles, given a set of existing images. Traditionally, many NVS methods rely on precise 3D information, like camera poses, which are often obtained through complex multi-view geometry techniques. However, a new research paper challenges this reliance, asking if NVS can be formulated as a pure machine learning problem, free from these geometric biases.
The paper, titled “True Self-Supervised Novel View Synthesis is Transferable,” identifies a critical criterion for a model to truly perform NVS: transferability. This means that a camera pose learned from one video sequence should be able to accurately re-render the same camera movement in a completely different 3D scene. The authors, Thomas W. Mitchel, Hyunwoo Ryu, and Vincent Sitzmann, found that existing self-supervised NVS models often fail this test; their predicted poses don’t transfer, leading to different camera trajectories in different scenes.
To address this, they introduce a groundbreaking model called XFactor. XFactor is the first geometry-free, self-supervised model capable of what they term “true NVS.” It achieves this by combining pair-wise pose estimation with a clever augmentation scheme for inputs and outputs. This approach helps the model disentangle camera pose from the specific content of a scene, enabling it to reason geometrically without any explicit 3D inductive biases or concepts like SE(3) camera pose parameterization.
XFactor tackles two principal problems in self-supervised NVS: interpolation and information leakage. Previous multi-view models often learned to interpolate between existing context frames rather than truly understanding camera poses. XFactor prevents this by bootstrapping from a stereo-monocular model, which, by design, must always extrapolate. Furthermore, it introduces a novel transferability objective during training, ensuring that the model learns pure geometric pose descriptions rather than smuggling pixel information into the pose latents. This is achieved by augmenting frame sequences in a way that minimizes pixel overlap while preserving camera motion, such as applying inverse masks.
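To make the inverse-mask idea concrete, here is a minimal sketch of one way such an augmentation could be implemented. This is an illustration of the general principle rather than the paper’s actual code: the function name, patch size, and masking ratio are all assumptions. The key property is that a context/target frame pair shares no visible pixels after augmentation, while the camera motion between the frames is untouched, so a pose latent cannot carry appearance information between them.

```python
import numpy as np

def complementary_mask_pair(ctx_frame, tgt_frame, patch=16, keep=0.5, rng=None):
    """Masks a context/target frame pair with complementary patch masks.

    Hypothetical sketch: patches kept in the context frame are zeroed
    in the target frame and vice versa, minimizing pixel overlap while
    preserving the underlying camera motion between the two frames.
    """
    rng = np.random.default_rng() if rng is None else rng
    h, w = ctx_frame.shape[:2]
    gh, gw = h // patch, w // patch
    grid = rng.random((gh, gw)) < keep                  # True = keep in context
    mask = np.repeat(np.repeat(grid, patch, axis=0), patch, axis=1)[:h, :w]
    ctx_aug = ctx_frame * mask[..., None]               # context keeps these patches
    tgt_aug = tgt_frame * (~mask)[..., None]            # target keeps the inverse
    return ctx_aug, tgt_aug
```

Because the two masks are exact complements, no pixel location is visible in both augmented frames, which is precisely the leakage the transferability objective is designed to rule out.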
The researchers also introduce a new metric called True Pose Similarity (TPS) to quantify transferability, measuring how well novel views adhere to reference poses. Through extensive experiments on large-scale real-world datasets like RE10K, DL3DV, MVImgNet, and CO3Dv2, XFactor significantly outperforms prior pose-free NVS transformers like RayZer and RUST. It demonstrates superior transferability and its latent poses show a high correlation with real-world camera poses.
Remarkably, XFactor achieves these results with unconstrained latent pose variables, proving that explicit SE(3) parameterization of poses is not only unnecessary but can even be detrimental to transferability. The model’s architecture, implemented as multi-view Vision Transformers (ViTs) with RoPE positional embeddings, is designed to handle an arbitrary number of views, extending from its initial stereo-monocular training to a multi-view fine-tuned model.
While XFactor represents a significant leap forward, the authors acknowledge limitations, such as the current restriction of its pose encoder to a stereo model, which limits ultra-wide baseline pose estimation in a single pass, and occasional blurring artifacts in transferred frames. Nevertheless, XFactor paves the way for new formulations of classic 3D vision problems based purely on machine learning principles, without relying on conventional multi-view geometry. You can read the full research paper here.


