Neural Scene Chronology

CVPR 2023

Haotong Lin¹˒², Qianqian Wang², Ruojin Cai², Sida Peng¹, Hadar Averbuch-Elor³, Xiaowei Zhou¹, Noah Snavely²

¹Zhejiang University    ²Cornell University    ³Tel Aviv University


TL;DR: Our method can reconstruct a time-varying 3D model from Internet photos,
and render photo-realistic images with independent control of viewpoint, time and illumination.

The following video shows the reconstruction of 5Pointz, a large-scale graffiti landmark in New York City.

In this work, we aim to reconstruct a time-varying 3D model from Internet photos of large-scale landmarks, capable of producing photo-realistic renderings with independent control of viewpoint, illumination, and time. The core challenges are twofold. First, different types of temporal changes, such as changes in illumination and changes to the underlying scene itself (for example, one graffiti artwork being replaced with another), are entangled in the imagery. Second, scene-level temporal changes are often discrete and sporadic over time, rather than continuous. To tackle these problems, we propose a new scene representation equipped with a novel temporal step function encoding method that can model discrete scene-level content changes as piecewise constant functions over time. Specifically, we represent the scene as a space-time radiance field with a per-image illumination embedding, where temporally varying scene changes are encoded using a set of learned step functions. To facilitate our task of chronology reconstruction from Internet imagery, we also collect a new dataset of four scenes that exhibit various changes over time. We demonstrate that our method achieves state-of-the-art view synthesis results on this dataset, while providing independent control of viewpoint, time, and illumination.
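The piecewise constant behaviour described above can be illustrated with a small sketch. It assumes a sigmoid-smoothed step with hand-picked transition times and a fixed steepness; in the method these quantities are learned, and the exact functional form used by the authors may differ. The name `step_encoding` and its parameters are hypothetical and chosen only for illustration.

```python
import numpy as np

def step_encoding(t, transition_times, steepness=200.0):
    """Encode timestamps with a bank of smoothed step functions.

    Hypothetical illustration: each output channel is close to 0 before
    its transition time and close to 1 after it, so the encoding stays
    nearly constant between transitions.
    """
    # Shape (..., 1) so that timestamps broadcast against the transitions.
    t = np.asarray(t, dtype=np.float64)[..., None]
    u = np.asarray(transition_times, dtype=np.float64)
    z = steepness * (t - u)
    # Logistic sigmoid written with tanh for numerical stability.
    return 0.5 * (1.0 + np.tanh(0.5 * z))

# Timestamps normalized to [0, 1] (an assumption of this sketch).
times = np.array([0.10, 0.20, 0.50, 0.90])
enc = step_encoding(times, transition_times=[0.25, 0.75])
print(np.round(enc, 3))
```

Timestamps that fall between the same pair of transitions (here 0.10 and 0.20) receive almost the same code, so a network conditioned on this encoding has little capacity to vary scene content from one photo to the next, while still being able to switch content abruptly at a transition.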

Controlling the illumination using a reference image

More results on Times Square and the Metropolitan Museum of Art

Ablation on the time step function encoding

Ours (Step Function Encoding) vs. Without Encoding.

Step function encoding models abrupt scene-level content changes without
overfitting to per-image illumination noise. Without the encoding, the model
underfits these abrupt changes, leading to fade-in/fade-out artifacts.

Ours (Step Function Encoding) vs. Positional Encoding.

Positional encoding overfits to per-image illumination noise, leading to flickering artifacts.
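One way to see why a high-frequency encoding of time can latch onto per-image variation is to look at the standard sinusoidal positional encoding popularized by NeRF. The sketch below is illustrative only: the number of frequency bands and the normalization of timestamps to [0, 1] are assumptions, not the configuration used in this ablation, and `positional_encoding` is a hypothetical helper.

```python
import numpy as np

def positional_encoding(t, num_freqs=10):
    """Sinusoidal encoding: sin and cos of t at frequencies 2^k * pi."""
    t = np.asarray(t, dtype=np.float64)[..., None]
    freqs = (2.0 ** np.arange(num_freqs)) * np.pi
    angles = t * freqs
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)

# Two photos taken at almost the same (normalized) time.
a = positional_encoding(0.500)
b = positional_encoding(0.501)

# Largest per-channel difference between the two codes.
gap = np.abs(a - b).max()
print(gap)
```

Even though the two timestamps are nearly identical, their highest-frequency channels differ substantially, which gives the network room to explain per-image appearance differences as changes over time.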


  title={Neural Scene Chronology},
  author={Lin, Haotong and Wang, Qianqian and Cai, Ruojin and Peng, Sida and Averbuch-Elor, Hadar and Zhou, Xiaowei and Snavely, Noah},