A Sector-Based, Frequency Domain Approach to Detection and Localization of Multiple Speakers

G. Lathoud et al.

This page describes the 16 kHz multichannel recordings used for the experiments in the paper of the same title, submitted to ICASSP 2005.
Pointers to DIVX videos and WAV sound files are included.
For questions, please use the contact address at the bottom of this page.
Note: optimized C code can be accessed here.

Online Access to Audio and Video Recordings

Sequences were taken from the AV16.3 corpus [2]:
    ICASSP article    AV16.3 corpus
    Seq. #1           synthmultisource-setup1
    Seq. #2           synthmultisource-setup2
    Seq. #3           synthmultisource-setup3
    Seq. #4           seq01-1p-0000
    Seq. #5           seq37-3p-0001


Seq. #1, #2 and #3: loudspeaker recordings (both emitted and recorded signals) can be accessed here (WAV files).
Seq. #4: this single-human-speaker sequence can be accessed here, including a DIVX file and a WAV file.
Seq. #5: this three-human-speaker sequence can be accessed here, including image snapshots 1, 2, 3 and a WAV file.
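As an aside, here is a minimal Python sketch of how one of the multichannel WAV files could be loaded; the filename is a hypothetical placeholder, and scipy is an assumed dependency:

    from scipy.io import wavfile

    # Hypothetical filename; substitute the file actually downloaded
    # from the links above.
    rate, data = wavfile.read("seq01-1p-0000_array1.wav")
    print(rate)        # expected: 16000 (Hz)
    print(data.shape)  # (n_samples, n_channels) for a multichannel file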

Important Note on Ground-Truth Annotation

Description of the Data

Real recordings were made in an instrumented meeting room [1] with a horizontal, circular, 8-microphone array (10 cm radius) placed on a table. The five recordings listed above are part of AV16.3, a larger corpus available online [2]. Each source was annotated in terms of both spatial location and speech/silence segmentation. Given the geometry of the microphone array, source location is defined in terms of azimuth. Time frames are 32 ms long, with a 16 ms overlap between consecutive frames.
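To make this setup concrete, here is a minimal Python sketch (numpy is an assumption, and the variable and function names are illustrative only) of the array geometry and of the 32 ms / 16 ms framing at 16 kHz:

    import numpy as np

    FS = 16000                # sampling rate (Hz)
    FRAME = int(0.032 * FS)   # 32 ms frame -> 512 samples
    SHIFT = int(0.016 * FS)   # 16 ms overlap -> 256-sample frame shift

    # 8 microphones evenly spaced on a horizontal circle of 10 cm radius,
    # centered on the array center (coordinates in meters).
    RADIUS = 0.1
    angles = 2.0 * np.pi * np.arange(8) / 8
    mic_xy = RADIUS * np.column_stack((np.cos(angles), np.sin(angles)))

    def frames(signal):
        """Split a 1-D signal into overlapping 512-sample analysis frames."""
        n = 1 + max(0, (len(signal) - FRAME) // SHIFT)
        return np.stack([signal[i * SHIFT : i * SHIFT + FRAME]
                         for i in range(n)])

    def azimuth(x, y):
        """Azimuth (radians) of a source at (x, y), seen from the array
        center; the zero-azimuth reference direction is an arbitrary
        choice in this sketch."""
        return np.arctan2(y, x)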

References

[1] D. Moore, "The IDIAP Smart Meeting Room," IDIAP-COM 02-07, IDIAP, 2002.
[2] G. Lathoud, J.-M. Odobez and D. Gatica-Perez, "AV16.3: An Audio-Visual Corpus for Speaker Localization and Tracking," in Proceedings of the MLMI'04 Workshop, 2005.

Last updated on 2008-09-03 by Guillaume Lathoud - glathoud at yahoo dot fr