Abstract
We live in a dynamic physical world, surrounded by all kinds of 3D objects. Designing perception systems that can see the world in 3D from only 2D observations is not only key to many AR and robotics applications, but also a cornerstone of general visual understanding. Prevalent learning-based methods often treat images simply as compositions of 2D patterns, ignoring the fact that they arise from a 3D world. The major obstacle is the lack of large-scale 3D annotations for training, which are prohibitively expensive to collect. Natural intelligence, on the other hand, develops a comprehensive 3D understanding of the world primarily by observing 2D projections, without relying on extensive 3D supervision. This raises the question: "can machines learn to perceive the 3D world without explicit 3D supervision?" In this talk, I will present some of our recent efforts toward answering this question, and show that physically grounded, disentangled 3D object representations can be learned simply from raw photos and videos on the Internet, through an inverse rendering framework.