Abstract: In sound source localization that uses both visual and aural information, it remains unclear how much each modality contributes to the result, i.e., do we need both image and sound for sound source localization? To address this question, we develop an unsupervised learning system that solves sound source localization by decomposing the task into two steps: (i) "potential sound source localization", a step that localizes possible sound sources using only visual information, and (ii) "object selection", a step that identifies which objects are actually sounding using aural information. Our overall system achieves state-of-the-art performance in sound source localization, and more importantly, we find that despite the constraint on available information, the results of (i) achieve similar performance. From this observation and further experiments, we show that visual information is dominant in "sound" source localization when evaluated with the currently adopted benchmark dataset. Moreover, we show that the majority of sound-producing objects within the samples in this dataset can be inherently identified using only visual information, and thus that the dataset is inadequate to evaluate a system's capability to leverage aural information. As an alternative, we present an evaluation protocol that requires both visual and aural information to be leveraged, and verify this property through several experiments.

