Enhanced U-Net Architectures for Accurate Room Impulse Response Generation via Differential-Phase Learning
Ignacio Martin-Salinas, Gema Piñero, Jose A. Belloch, and Adrian Amor-Martin
EURASIP Journal on Audio, Speech, and Music Processing, Nov 2025
Generating accurate room impulse responses (RIRs) remains challenging, particularly regarding phase estimation. Building upon previous work utilizing encoder-decoder deep learning architectures, this paper investigates advanced techniques to improve phase prediction accuracy. We propose and evaluate several enhanced U-Net models, including variants with a variational autoencoder (VAE) bottleneck and differing input conditioning methods for spatial and room parameters (embedding layers vs. normalized dense layers). A key focus is the comparison between predicting direct phase and differential phase. Furthermore, we analyze the impact of using mean absolute error (MAE) versus mean squared error (MSE) for the magnitude component of the loss function. The study also explores the efficacy of applying the Griffin-Lim algorithm as a post-processing step to refine the phase estimated by the networks. Performance is evaluated on a real RIR dataset, comparing the different model architectures, information vector encoding strategies, phase targets (direct vs. differential), loss functions, and the contribution of phase recovery algorithms to overall RIR fidelity. Results provide insights into effective strategies for enhancing phase generation in data-driven RIR synthesis.
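To make the direct-versus-differential phase comparison and the MAE-versus-MSE magnitude loss concrete, the following is a minimal NumPy sketch rather than the authors' implementation. It assumes an STFT representation of the RIR and a differential phase defined as the wrapped first difference along the frequency axis; the abstract specifies neither, and differencing along time would be an equally plausible reading. The helper names (`stft`, `wrap`, `differential_phase`, `integrate_phase`, `magnitude_loss`) and the synthetic decaying-noise RIR are hypothetical, chosen only for illustration.

```python
import numpy as np

def stft(x, n_fft=256, hop=64):
    """Plain NumPy STFT with a Hann window; returns a (frames, bins) complex array."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop : i * hop + n_fft] * window for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)

def wrap(phase):
    """Wrap angles to the interval [-pi, pi)."""
    return (phase + np.pi) % (2 * np.pi) - np.pi

def differential_phase(phase):
    """Wrapped first difference along frequency (assumed axis); bin 0 is kept as-is."""
    diff = wrap(np.diff(phase, axis=1))
    return np.concatenate([phase[:, :1], diff], axis=1)

def integrate_phase(dphase):
    """Recover the direct phase (modulo 2*pi) by cumulative sum along frequency."""
    return wrap(np.cumsum(dphase, axis=1))

def magnitude_loss(pred, target, kind="mae"):
    """Mean absolute error or mean squared error between magnitude spectrograms."""
    err = pred - target
    return np.mean(np.abs(err)) if kind == "mae" else np.mean(err ** 2)

# Synthetic stand-in for an RIR: exponentially decaying noise (not real data).
rng = np.random.default_rng(0)
rir = rng.standard_normal(4096) * np.exp(-np.arange(4096) / 800.0)

spec = stft(rir)
magnitude = np.abs(spec)
phase = np.angle(spec)                 # "direct" phase target
dphase = differential_phase(phase)     # "differential" phase target
recovered = integrate_phase(dphase)    # what a decoder would integrate back
```

Under these assumptions the differential target loses no information: integrating it reproduces the direct phase up to multiples of 2π, so the two targets differ only in how they present the phase to the network. The two `magnitude_loss` options differ in how strongly large magnitude errors are penalized, with MSE weighting them quadratically.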