TL;DR: Point-RTD is a novel pretraining strategy for transformer models on 3D point clouds. Unlike traditional masked autoencoding, it corrupts point cloud tokens and uses a discriminator-generator architecture for denoising. This approach significantly improves reconstruction accuracy, converges faster, and achieves higher classification accuracy on the ShapeNet, ModelNet10, and ModelNet40 datasets compared to Point-MAE, making 3D point cloud processing more efficient and robust.
Point clouds, which provide a rich three-dimensional description of environments, are crucial in fields like autonomous driving, robotics, and remote sensing. However, their unstructured nature, lacking intrinsic ordering and uniform neighborhood relationships, poses significant challenges for applying transformer-based architectures that have excelled in other data types like text and images.
Traditional approaches for adapting transformers to point clouds often involve patch-based tokenization, where point clouds are segmented into clusters. Many prominent models, such as Point-BERT and Point-MAE, utilize a masked autoencoding pretraining strategy. This involves hiding portions of the data and training the model to predict these missing parts. While effective, this method may not be the optimal strategy for reconstructing complex 3D point cloud data.
Introducing Point-RTD: A Novel Pretraining Strategy
To address these limitations, researchers have introduced Point-RTD (Replaced Token Denoising), a new pretraining strategy designed to enhance token robustness through a corruption-reconstruction framework. Unlike masked autoencoding, Point-RTD corrupts point cloud tokens and employs a discriminator-generator architecture for denoising. This innovative shift allows for more effective learning of structural priors, leading to significant improvements in model performance and efficiency. You can find the full research paper here: Point-RTD: Replaced Token Denoising for Pretraining Transformer Models on Point Clouds.
How Point-RTD Works
Point-RTD begins with patch-based tokenization, similar to existing models, using Farthest Point Sampling (FPS) and k-Nearest Neighbors (kNN) to segment point clouds into patches. These patches are then encoded into token embeddings using a mini-PointNet, capturing local geometric features.
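As an illustration, this tokenization step can be sketched in a few lines of NumPy. The helper names, cloud size, and patch counts below are our own illustrative assumptions, not the paper's code:

```python
import numpy as np

def farthest_point_sample(points, n_centers):
    """Greedy FPS: pick n_centers points that are maximally spread out."""
    n = points.shape[0]
    centers = np.zeros(n_centers, dtype=np.int64)
    dist = np.full(n, np.inf)
    idx = np.random.randint(n)                 # arbitrary starting point
    for i in range(n_centers):
        centers[i] = idx
        d = np.sum((points - points[idx]) ** 2, axis=1)
        dist = np.minimum(dist, d)             # distance to nearest chosen center
        idx = int(np.argmax(dist))             # next center = farthest remaining point
    return centers

def knn_group(points, center_idx, k):
    """Group each FPS center with its k nearest neighbors to form a patch."""
    centers = points[center_idx]                                   # (m, 3)
    d = np.sum((points[None, :, :] - centers[:, None, :]) ** 2, axis=-1)
    nbr_idx = np.argsort(d, axis=1)[:, :k]                         # (m, k)
    return points[nbr_idx]                                         # (m, k, 3)

# Example: split a 1024-point cloud into 64 patches of 32 points each.
cloud = np.random.rand(1024, 3).astype(np.float32)
patches = knn_group(cloud, farthest_point_sample(cloud, 64), 32)
print(patches.shape)  # (64, 32, 3) -> each patch is embedded by a mini-PointNet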
The core innovation lies in its corruption regime. Initially, this involved applying Gaussian noise to a large percentage of tokens. However, Point-RTD extends this by introducing a token replacement strategy. Instead of just adding noise, a subset of tokens is replaced with tokens from other samples within the batch. This can be done either randomly (random mixup) or by selecting tokens from the most similar sample of a different class (nearest-neighbor mixup). Random mixup has been shown to yield better performance, as it introduces greater diversity in corruption patterns.
This replacement-based corruption acts as a strong regularizer, forcing the model to learn class-distinctive representations that remain robust even with semantically mixed inputs. Conceptually, this regime acts as a form of contrastive regularization, implicitly training the model to minimize confusion across class boundaries within the denoising objective.
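A minimal sketch of random-mixup corruption, assuming a batched tensor of token embeddings (the 30% replacement ratio and the function name are illustrative assumptions, not taken from the paper):

```python
import torch

def corrupt_tokens(tokens, replace_ratio=0.3):
    """Random mixup corruption: swap a subset of each sample's tokens
    with tokens drawn from *other* samples in the batch.

    tokens: (B, N, D) batch of token embeddings, B >= 2.
    Returns corrupted tokens and a boolean mask marking replaced positions.
    """
    B, N, D = tokens.shape
    n_replace = int(N * replace_ratio)
    corrupted = tokens.clone()
    mask = torch.zeros(B, N, dtype=torch.bool)
    for b in range(B):
        pos = torch.randperm(N)[:n_replace]           # which tokens to replace
        donors = torch.randint(0, B - 1, (n_replace,))
        donors[donors >= b] += 1                      # any sample except b itself
        donor_pos = torch.randint(0, N, (n_replace,))
        corrupted[b, pos] = tokens[donors, donor_pos]
        mask[b, pos] = True
    return corrupted, mask
```

Nearest-neighbor mixup would replace the uniform `donors` draw with tokens from the most similar sample of a different class; per the paper, the simpler random draw performs better.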
The architecture includes a discriminator and a generator. The discriminator identifies whether each token is corrupted or clean, using a weighted binary cross-entropy loss. The generator then autoregressively cleans the corrupted tokens, guided by the discriminator's feedback, and is trained to minimize the mean squared error between the cleaned and original tokens.
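In outline, the objective pairs a weighted binary cross-entropy term for the discriminator with a mean-squared-error reconstruction term for the generator. The sketch below is a minimal PyTorch rendering under assumptions of our own (the module interfaces, per-token logit shape, and `pos_weight` value are illustrative, not the paper's exact setup):

```python
import torch
import torch.nn.functional as F

def rtd_losses(discriminator, generator, clean, corrupted, mask, pos_weight=2.0):
    """Losses for one Point-RTD pretraining step.

    clean, corrupted: (B, N, D) token embeddings; mask: (B, N) True where replaced.
    """
    # Discriminator: classify each token as corrupted (1) or clean (0),
    # with a positive-class weight to balance the corruption ratio.
    logits = discriminator(corrupted)                            # (B, N)
    d_loss = F.binary_cross_entropy_with_logits(
        logits, mask.float(), pos_weight=torch.tensor(pos_weight))

    # Generator: denoise the corrupted tokens and reconstruct the originals,
    # conditioned on the discriminator's per-token corruption scores.
    denoised = generator(corrupted, logits.sigmoid().detach())   # (B, N, D)
    g_loss = F.mse_loss(denoised, clean)
    return d_loss, g_loss
```

Passing the detached discriminator scores to the generator is one way to realize the feedback loop described above: the generator knows which tokens the discriminator considers suspect.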
Performance and Efficiency Gains
Point-RTD has demonstrated superior performance across several benchmarks compared to the baseline Point-MAE framework:
- On the ShapeNet dataset, Point-RTD significantly reduces reconstruction error, measured by Chamfer Distance (see the sketch after this list), by over 93% compared to Point-MAE, achieving more than 14 times lower Chamfer Distance on the test set. This indicates much higher reconstruction fidelity and better generalization to unseen data.
- The method also converges faster and yields higher classification accuracy on ModelNet10 and ModelNet40 benchmarks. For instance, on ModelNet10, Point-RTD achieved 92.73% accuracy, surpassing Point-MAE’s peak of 89.76%. Notably, Point-RTD reached 87.22% accuracy after just 50 epochs, while Point-MAE only achieved 13.66% in the same period, highlighting its rapid convergence.
- On the more challenging ModelNet40 benchmark, Point-RTD achieved 94.2% accuracy with a 10-vote majority mechanism, matching or outperforming several strong baselines and maintaining strong linear SVM accuracy (93.0%).
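For reference, the Chamfer Distance used as the reconstruction metric can be written compactly. This is the common symmetric squared-distance variant; the paper's exact normalization may differ:

```python
import torch

def chamfer_distance(a, b):
    """Symmetric Chamfer Distance between point sets a: (N, 3) and b: (M, 3):
    the mean squared distance from each point to its nearest neighbor in the
    other set, summed over both directions."""
    d = torch.cdist(a, b) ** 2            # (N, M) pairwise squared distances
    return d.min(dim=1).values.mean() + d.min(dim=0).values.mean()
```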
These findings suggest that Point-RTD’s robustness-centered pretraining produces strong representations with significantly reduced computational effort, challenging the notion that long pretraining schedules are necessary for high downstream accuracy. The explicit discriminator-guided feedback loop and the injection of semantically incorrect tokens force the model to develop sharper inter-class boundaries and more generalizable features, which is particularly beneficial for unstructured 3D data.
Future Implications
The design of Point-RTD is model-agnostic, meaning its corruption and denoising strategy can be broadly applied to any patch-based point cloud transformer. This versatility makes it well-suited for future extensions and adaptations, providing an effective means of regularizing transformer-based models through pretraining and supporting strong performance in various 3D vision pipelines.