STARSS23: An Audio-Visual Dataset of Spatial Recordings of Real Scenes with Spatiotemporal Annotations of Sound Events
Authors
- Kazuki Shimada
- Archontis Politis*
- Parthasaarathy Sudarsanam*
- Daniel Krause*
- Kengo Uchida
- Sharath Adavanne*
- Aapo Hakala*
- Yuichiro Koyama*
- Naoya Takahashi
- Shusuke Takahashi*
- Tuomas Virtanen*
- Yuki Mitsufuji
* External authors
Venue
- NeurIPS 2023
Date
- 2023
Abstract
While the direction of arrival (DOA) of sound events is generally estimated from multichannel audio data recorded with a microphone array, sound events usually derive from visually perceptible source objects, e.g., sounds of footsteps come from the feet of a walker. This paper proposes an audio-visual sound event localization and detection (SELD) task, which uses multichannel audio and video information to estimate the temporal activation and DOA of target sound events. Audio-visual SELD systems can detect and localize sound events using signals from a microphone array together with audio-visual correspondence. We also introduce an audio-visual dataset, Sony-TAu Realistic Spatial Soundscapes 2023 (STARSS23), which consists of multichannel audio data recorded with a microphone array, video data, and spatiotemporal annotations of sound events. Sound scenes in STARSS23 are recorded under instructions that guide participants to ensure adequate activity and occurrences of target sound events. STARSS23 also provides human-annotated temporal activation labels and human-confirmed DOA labels, which are based on the tracking results of a motion capture system. Our benchmark results show that the audio-visual SELD system achieves lower localization error than the audio-only system. The data is available at https://zenodo.org/record/7880637.
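The SELD task output described above couples per-class temporal activity with a DOA; one common encoding in recent SELD work is the activity-coupled Cartesian DOA (ACCDOA) vector, whose norm indicates whether a class is active and whose direction gives the DOA. The sketch below shows one way an audio-visual SELD model of this kind could be wired up in PyTorch. It is a minimal illustrative sketch, not the paper's baseline system: the module structure, input shapes (4-channel mel spectrograms plus video frames assumed to be aligned one-to-one with audio frames), and all names and dimensions are hypothetical.

```python
# Minimal sketch of an audio-visual SELD model with an ACCDOA-style output:
# one 3-D Cartesian vector per class and time frame, where the vector norm
# encodes activity and the direction encodes DOA. Illustrative only; this is
# not the STARSS23 baseline, and all dimensions here are assumptions.
import torch
import torch.nn as nn


class AudioVisualSELD(nn.Module):
    def __init__(self, n_classes: int = 13, n_audio_ch: int = 4):
        super().__init__()
        # Audio branch: CNN over multichannel mel spectrograms (B, C, T, F).
        self.audio_cnn = nn.Sequential(
            nn.Conv2d(n_audio_ch, 64, 3, padding=1), nn.ReLU(),
            nn.AvgPool2d((1, 4)),                     # downsample frequency only
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((None, 1)),          # pool away the freq axis
        )
        # Video branch: per-frame embedding of grayscale frames (hypothetical;
        # assumes one video frame per audio frame for simplicity).
        self.video_cnn = nn.Sequential(
            nn.Conv2d(1, 16, 5, stride=4), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 64),
        )
        # Temporal model over the fused audio-visual feature sequence.
        self.gru = nn.GRU(128, 128, batch_first=True, bidirectional=True)
        # ACCDOA head: 3 Cartesian coordinates per target class.
        self.head = nn.Linear(256, 3 * n_classes)

    def forward(self, audio, video):
        # audio: (B, C, T, F) mel spectrograms; video: (B, T, 1, H, W) frames.
        B, _, T, _ = audio.shape
        a = self.audio_cnn(audio).squeeze(-1).transpose(1, 2)    # (B, T, 64)
        v = self.video_cnn(video.flatten(0, 1)).view(B, T, -1)   # (B, T, 64)
        x, _ = self.gru(torch.cat([a, v], dim=-1))               # (B, T, 256)
        return self.head(x).view(B, T, -1, 3)                    # (B, T, K, 3)
```

With these assumed shapes, `AudioVisualSELD()(torch.randn(2, 4, 50, 64), torch.randn(2, 50, 1, 64, 64))` returns a (2, 50, 13, 3) tensor: thresholding each vector's norm yields the detection decision, and normalizing the vector yields the estimated DOA, matching the task formulation in the abstract.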