26-085 Vision: Physical and Geometric Foundation Models for Autonomous Navigation

  • PhD, 36 months
  • Full time
  • Experience: no preference
  • Maîtrise, IEP, IUP, Bac+4
  • Exploration

Mission

This project builds upon the Nuance PhD thesis, co-funded in 2022 by CNES and the Occitanie Region. Its theme is of strategic and multidisciplinary interest across several CNES divisions.

Context

Planetary or underground environments pose major challenges for autonomous robotics: absence of GNSS, low visibility [6], complex topologies, and embedded computing constraints. Navigation requires robust localization (visual/inertial SLAM), accurate state estimation, and planning adapted to unstructured environments. To this end, approaches combining conventional SLAM and deep learning exist, such as feature extraction in difficult conditions [9].
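
To make this combination concrete, the minimal Python/OpenCV sketch below recovers the relative camera pose between two frames from matched features. ORB stands in here for a learned extractor of the kind used in approaches like [9], and the camera intrinsics are illustrative values, not parameters from this project.

import cv2
import numpy as np

K = np.array([[718.856, 0.0, 607.193],   # example pinhole intrinsics
              [0.0, 718.856, 185.216],
              [0.0, 0.0, 1.0]])

def relative_pose(img_prev, img_curr):
    # ORB is a placeholder; a deep extractor would replace it to make
    # matching robust in low-visibility conditions.
    extractor = cv2.ORB_create(2000)
    kp1, des1 = extractor.detectAndCompute(img_prev, None)
    kp2, des2 = extractor.detectAndCompute(img_curr, None)
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = sorted(matcher.match(des1, des2), key=lambda m: m.distance)
    pts1 = np.float32([kp1[m.queryIdx].pt for m in matches])
    pts2 = np.float32([kp2[m.trainIdx].pt for m in matches])
    # Robust two-view geometry: essential matrix + cheirality check.
    E, inliers = cv2.findEssentialMat(pts1, pts2, K, cv2.RANSAC, 0.999, 1.0)
    _, R, t, _ = cv2.recoverPose(E, pts1, pts2, K, mask=inliers)
    return R, t  # rotation and unit-scale translation between the frames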

Recent approaches in computer vision demonstrate the potential of large-scale trained Vision Foundation Models (VFMs) [7]. However, these models still struggle to reason about the physics and geometry of scenes. New work, such as the Visual Geometry Grounded Transformer (VGGT) [8], aims to explicitly integrate 3D geometric constraints, while physics-grounded approaches seek to inject the fundamental laws of dynamics into learning. At the same time, 3D reconstruction methods (NeRF, 3D Gaussian Splatting) [5], [3] enable dense and differentiable modeling of environments, paving the way for more robust and predictive navigation, and for evaluating VFMs using novel view synthesis (NVS) as a proxy task, without requiring annotated 3D data [2], [4]. These realistic rendering methods are also very useful for reducing the gap between simulation and reality by providing photorealistic visual data [1].
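
As an illustration of NVS as a proxy task, the hedged sketch below scores a reconstruction by its rendering fidelity (PSNR) on held-out viewpoints. Here render_fn and heldout_views are hypothetical placeholders, not an API from [2] or [4].

import numpy as np

def psnr(rendered: np.ndarray, reference: np.ndarray) -> float:
    """Peak signal-to-noise ratio between two images with values in [0, 1]."""
    mse = np.mean((rendered.astype(np.float64) - reference.astype(np.float64)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(1.0 / mse)

def evaluate_nvs(render_fn, heldout_views):
    """render_fn(pose) -> image; heldout_views: list of (pose, image) pairs.
    Rendering fidelity on unseen viewpoints stands in for 3D ground truth."""
    scores = [psnr(render_fn(pose), img) for pose, img in heldout_views]
    return sum(scores) / len(scores)

Richer proxies (e.g., depth or normal consistency across views) follow the same pattern: render from the reconstruction, compare against held-out observations.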

Thesis objectives

The objective of this thesis is to develop a hybrid navigation pipeline combining the following components (a structural sketch follows the list):

— Visual/inertial SLAM (previously developed) enriched by deep learning modules.

— Vision Foundation Models anchored in physics and geometry, capable of reasoning about dynamics (forces, collisions) and the 3D structure of the scene (e.g., VGGT).

— Advanced 3D reconstruction techniques (NeRF, 3D Gaussian Splatting) to generate dense, semantic maps that can be used in planning.

— Multi-sensor (cameras, lidars, neuromorphic cameras) and multi-environment (realistic simulators and real robotic platforms) validation.
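
As a purely illustrative sketch of how these four components could be composed, the skeleton below wires them into a single navigation step. Every class and method name is hypothetical and only indicates data flow, not the actual design to be developed.

from dataclasses import dataclass
import numpy as np

@dataclass
class NavState:
    pose: np.ndarray        # 4x4 SE(3) body-to-world transform
    velocity: np.ndarray    # 3-vector in the world frame

class HybridPipeline:
    def __init__(self, slam, vfm, mapper, planner):
        self.slam = slam        # visual/inertial SLAM front end
        self.vfm = vfm          # physics/geometry-grounded foundation model
        self.mapper = mapper    # NeRF/3DGS dense map builder
        self.planner = planner  # planner consuming the dense map

    def step(self, images, imu) -> np.ndarray:
        state: NavState = self.slam.track(images, imu)          # localization
        geometry = self.vfm.infer(images)                       # depth, dynamics priors
        self.mapper.update(state.pose, images, geometry)        # densify the map
        return self.planner.plan(state, self.mapper.current())  # next trajectory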

Methodological approach

Year 1: Analysis and evaluation of existing VFMs (including VGGT) for navigation. Creation of a simulated benchmark based on NeRF/3DGS to test their physical and geometric capabilities.
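
As a hedged illustration of what such a benchmark could measure: a NeRF/3DGS simulator can render ground-truth depth alongside each RGB frame, so a VFM's geometric predictions can be scored with a scale-invariant depth error (monocular depth is recovered only up to scale). In the sketch below, vfm_depth and sim are hypothetical stand-ins, not components specified in the proposal.

import numpy as np

def scale_invariant_log_error(pred: np.ndarray, gt: np.ndarray) -> float:
    """Eigen-style scale-invariant log RMSE over valid depth pixels."""
    valid = (gt > 0) & (pred > 0)
    d = np.log(pred[valid]) - np.log(gt[valid])
    return float(np.sqrt(np.mean(d ** 2) - np.mean(d) ** 2))

def run_benchmark(vfm_depth, sim, poses):
    """vfm_depth(rgb) -> depth map; sim.render(pose) -> (rgb, depth)."""
    errors = []
    for pose in poses:
        rgb, gt_depth = sim.render(pose)   # photorealistic frame + true depth
        errors.append(scale_invariant_log_error(vfm_depth(rgb), gt_depth))
    return float(np.mean(errors))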

Year 2: Development of hybrid modules (SLAM + VFM) with explicit learning of physical and geometric constraints. Integration into a real-time navigation pipeline. 

Year 3: Extension to complex environments and validation on real robotic platforms (rover, drone). Study of multi-sensor and possibly multi-agent scenarios.

Expected outcomes

— New hybrid SLAM methods integrating depth vision, physics, and geometry.

— Continuous 3D representations usable for navigation and robust planning.

— Publications in leading vision and robotics conferences (CVPR, ICRA, RSS, ICCV, ECCV, NeurIPS).

— Demonstrators applicable to space exploration and CNES scenarios.

[1] Arunkumar Byravan, et al.: NeRF2Real: Sim2real transfer of vision-guided bipedal motion skills using neural radiance fields. arXiv preprint arXiv:2303.04732, 2023.

[2] Yue Chen, et al.: Feat2GS: Probing visual foundation models with Gaussian splatting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 6348–6361, June 2025.

[3] Bernhard Kerbl, et al.: 3D Gaussian splatting for real-time radiance field rendering. ACM Transactions on Graphics, 42(4):1–14, 2023 (SIGGRAPH 2023).

[4] Xiaohan Lei, et al.: Gaussian splatting for visual navigation. arXiv preprint, 2023.

[5] Ben Mildenhall, et al.: NeRF: Representing scenes as neural radiance fields for view synthesis. In Computer Vision – ECCV 2020, volume 12346 of Lecture Notes in Computer Science, pages 405–421. Springer, 2020.

[6] Tong Qin, et al.: VINS-Mono: A robust and versatile monocular visual-inertial state estimator. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), pages 1–7, 2018.

[7] Jiaming Wang, et al.: A generalizable navigation system for in-the-wild environments. arXiv preprint, 2025. Accessed: 2025-05-29.

[8] Jianyuan Wang, et al.: VGGT: Visual geometry grounded transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2025.

[9] Xiao Zhang, et al.: SL-SLAM: A robust visual-inertial SLAM based on deep feature extraction and matching. arXiv preprint arXiv:2303.00079, 2023.

=================

For more information about the topic and the co-funding partner (found by the lab!), contact the thesis director (Directeur de thèse): damien.vivet@isae-supaero.fr

Then prepare a résumé, a recent transcript, and a reference letter from your M2 supervisor or engineering school director, and you will be ready to apply online before March 13th, 2026, midnight Paris time!

Profile

Master's degree in Computer Vision, Machine Learning, or AI

Laboratory

TESA

Message from the PhD team

More details on the CNES website: https://cnes.fr/fr/theses-post-doctorats