Salient object detection (SOD) aims to identify the standout elements in a scene, with recent advances primarily focused on integrating either depth data (RGB-D) or temporal data from videos to handle complex scenes. However, the joint use of these two types of crucial information remains largely underexplored due to data constraints. We address this gap by introducing the DViSal dataset, fueling further research in the emerging field of RGB-D video salient object detection (DVSOD). Our dataset features 237 diverse RGB-D videos with comprehensive annotations, including object- and instance-level masks as well as bounding boxes and scribbles. These resources enable a broad scope of potential research directions. We also conduct benchmarking experiments with various SOD models, affirming the efficacy of multimodal video input for salient object detection. Lastly, we highlight some intriguing findings and promising future research avenues. To foster growth in this field, our dataset and benchmark results are publicly accessible.
In the following video demo, we present intuitive examples that illustrate the effectiveness of incorporating different input modalities across various scenarios. The results obtained from RGB-D videos are clearly superior to those using RGB video or static RGB-D input alone. This is attributable to the combined advantages of complementary multimodal RGB-D information and temporal context.
Our dataset and code are now available to the research community. If you have questions, feel free to email us at wji3@ualberta.ca. If you are interested in our work, you are welcome to cite our paper using the BibTeX entry provided below.
@InProceedings{li2023dvsod,
  title     = {DVSOD: RGB-D Video Salient Object Detection},
  author    = {Li, Jingjing and Ji, Wei and Wang, Size and Li, Wenbo and Cheng, Li},
  booktitle = {Advances in Neural Information Processing Systems},
  year      = {2023},
  month     = {December}
}