Achieving high-quality shadow removal with strong generalizability is challenging in scenes with complex global illumination. Due to the limited diversity of shadow removal datasets, current methods are prone to overfitting the training data, which often degrades their performance on unseen cases. To address this, we leverage the rich visual priors of a pre-trained Stable Diffusion (SD) model and propose a two-stage fine-tuning pipeline that adapts the SD model for stable and efficient shadow removal. In the first stage, we fix the VAE and fine-tune the denoiser in latent space, which yields substantial shadow removal but may lose some high-frequency details. To resolve this, we introduce a second stage, the detail injection stage, which selectively extracts features from the VAE encoder to modulate the decoder, injecting fine details into the final results. Experimental results show that our method outperforms state-of-the-art shadow removal techniques. Cross-dataset evaluation further demonstrates that our method generalizes effectively to unseen data, enhancing the practical applicability of shadow removal.
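The first stage can be summarized in a few lines of training code. Below is a minimal PyTorch sketch, assuming a frozen VAE that exposes an `encode` method, a trainable denoiser, channel-wise concatenation as the conditioning scheme, and latent-space regression as the objective; these interface and loss choices are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn as nn

def stage_one_step(vae, denoiser, optimizer, shadow_img, gt_img):
    """One latent-space fine-tuning step: the VAE (E, D) stays frozen and
    only the denoiser is updated. `vae.encode` returning a latent tensor
    and concatenating latents as conditioning are assumptions."""
    with torch.no_grad():                         # stage one never updates E or D
        z_shadow = vae.encode(shadow_img)         # latent of the shadow input
        z_gt = vae.encode(gt_img)                 # latent of the shadow-free target
    noise = torch.randn_like(z_gt)                # diffusion-style noise input
    z_pred = denoiser(torch.cat([z_shadow, noise], dim=1))
    loss = nn.functional.mse_loss(z_pred, z_gt)   # regress the shadow-free latent
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```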
Our proposed network. We build a two-stage shadow removal network on Stable Diffusion (SD). (1) In the first stage, shown in the bottom half, we fine-tune SD's pre-trained UNet within the latent space defined by SD's pre-trained VAE (E and D); we find that this pre-trained latent space can effectively represent shadow-free images. (2) In the second stage, shown in the top half, we modulate the VAE decoder D by selectively adding features from the VAE encoder E through a Detail Injection Model (DIM), which consists of multiple RRDB layers that inject shadow-free texture details into the decoder features. Together, the two stages produce high-quality shadow-free images that preserve fine details.
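To make the detail injection concrete, here is a minimal PyTorch sketch of a DIM-style module. It assumes ESRGAN-style residual-in-residual dense blocks for the RRDB layers and adds the resulting residual onto a decoder feature; the class names, channel widths, and block counts are illustrative, not the authors' implementation.

```python
import torch
import torch.nn as nn

class DenseBlock(nn.Module):
    """Simplified dense block: each conv sees all previous feature maps."""
    def __init__(self, ch, growth=32):
        super().__init__()
        self.convs = nn.ModuleList(
            nn.Conv2d(ch + i * growth, growth, 3, padding=1) for i in range(4)
        )
        self.fuse = nn.Conv2d(ch + 4 * growth, ch, 3, padding=1)
        self.act = nn.LeakyReLU(0.2, inplace=True)

    def forward(self, x):
        feats = [x]
        for conv in self.convs:
            feats.append(self.act(conv(torch.cat(feats, dim=1))))
        return x + 0.2 * self.fuse(torch.cat(feats, dim=1))

class SimpleRRDB(nn.Module):
    """Residual-in-residual dense block, a common reading of 'RRDB'."""
    def __init__(self, ch):
        super().__init__()
        self.blocks = nn.Sequential(DenseBlock(ch), DenseBlock(ch), DenseBlock(ch))

    def forward(self, x):
        return x + 0.2 * self.blocks(x)

class DetailInjectionModule(nn.Module):
    """Maps a VAE-encoder feature to a residual added onto a decoder feature,
    injecting high-frequency texture while the decoder itself stays frozen."""
    def __init__(self, enc_ch, dec_ch, n_rrdb=2):
        super().__init__()
        self.proj = nn.Conv2d(enc_ch, dec_ch, 1)          # match channel widths
        self.rrdbs = nn.Sequential(*[SimpleRRDB(dec_ch) for _ in range(n_rrdb)])
        self.out = nn.Conv2d(dec_ch, dec_ch, 3, padding=1)

    def forward(self, enc_feat, dec_feat):
        detail = self.out(self.rrdbs(self.proj(enc_feat)))
        return dec_feat + detail                           # modulate decoder feature

# Usage during decoding (shapes hypothetical): the modulated feature replaces
# the decoder's intermediate activation at the matching resolution.
# dim = DetailInjectionModule(enc_ch=128, dec_ch=256)
# dec_feat = dim(enc_feat, dec_feat)
```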
Quantitative comparisons on the ISTD+, SRD, INS, and WSRD+ datasets. The best three results are highlighted as 1st, 2nd, and 3rd.
To evaluate the generalizability of our approach, we conduct a cross-dataset evaluation. Specifically, we train our model on the training set of one dataset and test it on the testing set of a different dataset. Typically, training and testing data within the same dataset share similar shadow patterns, backgrounds, or lighting conditions. This cross-dataset evaluation presents a more challenging test for shadow removal methods and provides a more stringent measure of generalizability. We use A→B to denote training on dataset A and testing on dataset B.
Cross-dataset evaluation. ISTD+→SRD means training on the ISTD+ dataset and testing on the SRD dataset. A minimal sketch of this protocol follows below.
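For concreteness, the A→B protocol amounts to fitting on one dataset's training split and scoring on another dataset's test split. The sketch below assumes hypothetical `load_split` and `train` helpers and a model that maps shadow images to shadow-free predictions; per-image RMSE stands in for whatever metrics the tables report.

```python
import torch

def evaluate_cross_dataset(model, test_loader, device="cuda"):
    """Average per-image RMSE between predictions and shadow-free ground truth."""
    model.eval().to(device)
    total_rmse, n = 0.0, 0
    with torch.no_grad():
        for shadow_img, gt_img in test_loader:            # e.g. SRD test split
            pred = model(shadow_img.to(device))
            mse = torch.mean((pred - gt_img.to(device)) ** 2, dim=(1, 2, 3))
            total_rmse += torch.sqrt(mse).sum().item()
            n += shadow_img.shape[0]
    return total_rmse / n

# Usage with hypothetical helpers: train on ISTD+, test on SRD ("ISTD+ -> SRD").
# model = train(ShadowRemovalModel(), load_split("ISTD+", "train"))
# rmse = evaluate_cross_dataset(model, load_split("SRD", "test"))
```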
@InProceedings{xu_2025_CVPR,
title={Detail-Preserving Latent Diffusion for Stable Shadow Removal},
author={Xu, Jiamin and Zheng, Yuxin and Li, Zelong and Wang, Chi and Gu, Renshu and Xu, Weiwei and Xu, Gang},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
year={2025}
}