Click2Hear: Interactive Spatial Audio Separation and Localization using Visual Cues

Avinash Subramaniam*, Debottam Dutta*, Chaitanya Amballa*

Dept. of Electrical and Computer Engineering

University of Illinois at Urbana-Champaign

* = Equal contribution

Abstract:

We present Click2Hear, an interactive system that uses visual cues to generate localized binaural audio from a video with mono audio. The system achieves this by performing audio separation and sound source localization. Mono audio on its own carries no spatial information, so sources cannot be localized or separated without external information such as speaker identity or visual cues. When no such prior information is available, most existing methods rely on the visual information present in the video frames. These methods are inspired by human perception: people also use what they see to help localize sounds. Visual information therefore helps both to separate the audio sources and to indicate their spatial location. Existing audio-visual methods either address these tasks separately or require sequential processing to do both. We first conduct several experiments to investigate how well a popular existing method for localizing and separating mono sound mixtures generalizes to more realistic scenarios. We then propose a unified framework that performs both tasks, together with the additional task of binauralization, in one shot. Our experiments show promising results on this combined task.

Proposed Architecture:

Neural Net Architecture

Demo:

SOP on MUSIC

Original

Click on the video to play/pause the corresponding audio. You can change the playback time by clicking on the waveform.

Pixelwise Audio

Click on any pixel to play the corresponding audio. You can change the playback time by clicking on the waveform.

SiSLoc (on FAIR-Play)

Original

Click on the video to play/pause the corresponding audio. You can change the playback time by clicking on the waveform.

Pixelwise Audio

Click on any pixel to play the corresponding audio. You can change the playback time by clicking on the waveform.

Mono2Binaural

Mono Audio

Click on the video to play/pause the corresponding audio. You can change the playback time by clicking on the waveform.

Predicted Binaural Audio (Earphones are recommended)

Click on the video to play/pause the corresponding audio. You can change the playback time by clicking on the waveform.

SiBSLoc

Original

Click on the video to play/pause the corresponding audio. You can change the playback time by clicking on the waveform.

Pixelwise Separated + Binaural

Click on any pixel to play the corresponding audio. You can change the playback time by clicking on the waveform.
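For readers curious how a click could turn into separated binaural audio, the following is a minimal sketch and not the Click2Hear implementation. It assumes (hypothetically) that a model has already produced one spectrogram mask per cell of a coarse spatial grid, and that binaural channels are recovered from the mono signal and a predicted left-minus-right difference signal, in the spirit of the mid/difference formulation used in 2.5D Visual Sound. The function names, grid size, and array shapes are illustrative assumptions.

```python
import numpy as np


def click_to_cell(x, y, frame_w, frame_h, grid_w, grid_h):
    """Map a click in frame pixel coordinates to a (row, col) grid cell."""
    col = min(int(x * grid_w / frame_w), grid_w - 1)
    row = min(int(y * grid_h / frame_h), grid_h - 1)
    return row, col


def separate_at_cell(mix_spec, masks, row, col):
    """Apply the mask of one grid cell to the mixture spectrogram.

    mix_spec: array of shape (freq, time)
    masks:    array of shape (grid_h, grid_w, freq, time), assumed to be
              produced by some separation model (hypothetical here)
    """
    return masks[row, col] * mix_spec


def to_binaural(mono, diff):
    """Recover left/right channels, assuming mono = L + R and diff = L - R."""
    left = (mono + diff) / 2.0
    right = (mono - diff) / 2.0
    return left, right
```

For example, with a 224x224 frame and a 14x14 grid (both illustrative numbers), a click at x=112, y=50 maps to grid cell (row 3, column 7); the mask stored for that cell would then select the audio to play.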

References:

  1. The Sound of Pixels
    Hang Zhao, Chuang Gan, Andrew Rouditchenko, Carl Vondrick, Josh McDermott, Antonio Torralba
  2. 2.5D Visual Sound
    Ruohan Gao, Kristen Grauman

© 2023 Avinash Subramaniam, Debottam Dutta, Chaitanya Amballa. University of Illinois at Urbana-Champaign. All rights reserved.