
Zero-Shot Depth Perception: How Drones Navigate with Just a Camera and IMU

TLDR: This paper introduces a zero-shot method for autonomous aerial navigation that predicts metric depth using only a monocular RGB camera and an Inertial Measurement Unit (IMU). By rescaling relative depth estimates from a neural network with sparse 3D features from a visual-inertial navigation system, the approach enables collision avoidance on compute-constrained quadrotors in unknown, cluttered environments without prior training data. The best-performing rescaling strategy, monotonic spline fitting, was validated in both simulations and real-world hardware experiments, demonstrating reliable obstacle avoidance at 15 Hz.

Autonomous aerial navigation, particularly for drones, has long relied on sophisticated and often heavy sensors such as LiDAR or stereo cameras to perceive the surroundings and avoid collisions. While effective, these sensors add significant size and mass to a drone, cutting into its already limited flight time. This research paper, titled “Zero-Shot Metric Depth Estimation via Monocular Visual-Inertial Rescaling for Autonomous Aerial Navigation,” presents a lightweight solution to this challenge.

Authored by Steven Yang, Xiaoyu Tian, Kshitij Goel, and Wennie Tabib, the paper introduces a method that lets drones predict metric depth, the actual distance to objects, using only a single monocular RGB camera and an Inertial Measurement Unit (IMU). This minimalist sensor setup mirrors what First Person View (FPV) drone pilots fly with, and it is what enables aggressive maneuvering through cluttered spaces such as tree branches or under bridges.

The core of their approach is a “zero-shot” rescaling strategy. Traditional monocular depth estimation methods typically require data-intensive, domain-specific fine-tuning to produce metrically accurate results, which is impractical for robots operating in unknown environments, such as search and rescue missions where pre-existing data is unavailable. The proposed method sidesteps this need by using the sparse 3D feature map produced by a visual-inertial navigation system (VINS) to rescale the relative depth estimates from a monocular depth estimation (MDE) network.
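
To make that idea concrete, the sketch below shows how metric anchor depths could be obtained by projecting the sparse VINS feature map into the current camera image. The function name, array shapes, and pinhole model are illustrative assumptions for this article, not the paper's actual interfaces.

```python
import numpy as np

def project_sparse_features(points_world, T_world_cam, K):
    """Project sparse VINS 3D feature points into the camera image.

    points_world: (N, 3) metrically accurate feature positions from the VINS map.
    T_world_cam:  (4, 4) camera pose in the world frame (from VINS odometry).
    K:            (3, 3) pinhole camera intrinsics.

    Returns pixel coordinates (u, v) and metric depth z for points in front
    of the camera.
    """
    # Transform world points into the camera frame.
    T_cam_world = np.linalg.inv(T_world_cam)
    pts_h = np.hstack([points_world, np.ones((len(points_world), 1))])
    pts_cam = (T_cam_world @ pts_h.T).T[:, :3]

    # Keep only points comfortably in front of the camera.
    pts_cam = pts_cam[pts_cam[:, 2] > 0.1]

    # Pinhole projection to pixel coordinates.
    uv = (K @ (pts_cam / pts_cam[:, 2:3]).T).T[:, :2]
    return uv, pts_cam[:, 2]
```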

Here is how it works, in simplified terms: the drone’s RGB camera captures images, which are fed into an MDE network (specifically, the DepthAnythingV2 small model, chosen for its balance of accuracy and speed) to predict relative depth, that is, how far objects are from each other but not their absolute distances. Simultaneously, the camera images and IMU data are processed by the VINS to build a sparse map of metrically accurate 3D features. The key step is using these sparse 3D features to rescale the relative depth estimates from the MDE network, converting them into metric depth values.
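
A minimal sketch of that rescaling step follows: it pairs the MDE network's relative-depth values with the projected metric anchors and fits a simple scale-and-shift baseline. It assumes relative depth grows with distance and omits outlier rejection and inverse-depth handling; all names are hypothetical.

```python
import numpy as np

def collect_depth_pairs(relative_depth, uv, metric_depth):
    """Pair relative-depth predictions with sparse metric anchors.

    relative_depth: (H, W) output of the MDE network.
    uv:             (N, 2) pixel locations of projected sparse features.
    metric_depth:   (N,) metric depths of those features from the VINS map.
    """
    h, w = relative_depth.shape
    u = np.clip(np.round(uv[:, 0]).astype(int), 0, w - 1)
    v = np.clip(np.round(uv[:, 1]).astype(int), 0, h - 1)
    return relative_depth[v, u], metric_depth

def fit_scale_shift(rel, met):
    """Least-squares fit of met ≈ a * rel + b, one simple rescaling baseline."""
    A = np.stack([rel, np.ones_like(rel)], axis=1)
    (a, b), *_ = np.linalg.lstsq(A, met, rcond=None)
    return a, b
```

Applying `a * relative_depth + b` to the whole relative-depth image would then yield a dense metric depth map anchored to the sparse VINS features.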

The researchers explored several rescaling techniques, including polynomial, exponential, and various spline fittings. After rigorous comparison in diverse simulation environments (mine, sewer, and drone dome), the monotonic spline fitting approach emerged as the most accurate and consistent. This method was then successfully deployed on a compute-constrained quadrotor in real-world scenarios, achieving onboard metric depth estimates at 15 Hz.
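
One plausible way to realize such a monotonic spline fit is sketched below, using SciPy's shape-preserving PCHIP interpolator over binned medians of the (relative, metric) pairs. This is an illustrative stand-in under those assumptions, not the paper's exact formulation.

```python
import numpy as np
from scipy.interpolate import PchipInterpolator

def fit_monotonic_spline(rel, met, n_bins=20):
    """Fit a monotone mapping from relative to metric depth.

    Bins the sparse (relative, metric) pairs, takes per-bin medians to
    suppress outliers, enforces a non-decreasing trend, and fits a
    shape-preserving PCHIP spline.
    """
    order = np.argsort(rel)
    rel, met = rel[order], met[order]

    # Bin the pairs along the relative-depth axis.
    edges = np.linspace(rel.min(), rel.max(), n_bins + 1)
    xs, ys = [], []
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (rel >= lo) & (rel <= hi)
        if mask.sum() >= 3:
            xs.append(np.median(rel[mask]))
            ys.append(np.median(met[mask]))
    xs, ys = np.array(xs), np.array(ys)

    # Enforce monotonicity before fitting the spline.
    ys = np.maximum.accumulate(ys)
    return PchipInterpolator(xs, ys, extrapolate=True)

# Usage: spline = fit_monotonic_spline(rel, met); metric = spline(relative_depth)
```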

The practical implications are significant. When integrated with a motion-primitives-based planner, the system demonstrated successful collision avoidance. In simulated sewer environments, navigation performance using the estimated depth was comparable to using ground-truth depth, with similar collision rates and goal-completion rates. Hardware experiments in a dusty industrial tunnel further validated the system's robustness, with the depth output even resolving fine details such as a tripod.
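
As a rough illustration of how a planner could consult the rescaled depth image, the sketch below projects sampled points of a candidate motion primitive into the depth image and checks clearance against a collision radius. It is a simplified, assumed formulation, not the planner used in the paper.

```python
import numpy as np

def primitive_is_safe(traj_cam, depth_metric, K, collision_radius=0.3):
    """Check a motion primitive against the estimated metric depth image.

    traj_cam:      (M, 3) sampled trajectory points in the camera frame.
    depth_metric:  (H, W) rescaled metric depth image.
    K:             (3, 3) camera intrinsics.
    """
    h, w = depth_metric.shape
    for p in traj_cam:
        if p[2] <= 0:
            continue  # behind the camera: cannot be checked against this image
        u = int(round(K[0, 0] * p[0] / p[2] + K[0, 2]))
        v = int(round(K[1, 1] * p[1] / p[2] + K[1, 2]))
        if not (0 <= u < w and 0 <= v < h):
            continue  # outside the field of view
        # Unsafe if the scene surface along this ray is not at least a
        # collision radius beyond the trajectory point.
        if depth_metric[v, u] < p[2] + collision_radius:
            return False
    return True
```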

While highly effective, the authors acknowledge certain limitations. Depth estimation can degrade in very open scenes that lack distinct features (such as large expanses of sky) or when prominent foreground objects abruptly disappear from view. Planning close to surfaces can also be risky because of potential “chattering” in the depth predictions, suggesting the need for a larger collision radius or a more robust planning strategy. Finally, accuracy depends heavily on the quality of the sparse feature points.


In conclusion, this research offers a compelling solution for autonomous aerial navigation, enabling drones to perceive and navigate complex, unknown environments with minimal, lightweight sensors. By combining monocular vision with inertial data and intelligent rescaling, it paves the way for more agile, longer-flying, and safer autonomous aerial systems. You can find the full research paper here.

Meera Iyer (https://blogs.edgentiq.com)
Meera Iyer is an AI news editor who blends journalistic rigor with storytelling elegance. Formerly a content strategist at a leading tech firm, Meera now tracks the pulse of India's Generative AI scene, from policy updates to academic breakthroughs. She is particularly focused on bringing nuanced, balanced perspectives to the fast-evolving world of AI-powered tools and media. You can reach her at: [email protected]
