[ README about the AV 16.3 corpus ]

CONTENTS:

( 1 ) General description.
( 2 ) Geometry.
( 3 ) File-by-file description.
( 4 ) Log of modifications.

NOTE: A publication describing how the corpus was defined and
recorded, as well as the available 2D and 3D speaker location
ground-truth and examples of use, can be found in:

    http://glat.info/ma/av16.3/AV163.pdf

==================================================
( 1 ) GENERAL DESCRIPTION
==================================================

This is a corpus of audio-visual recordings made in an indoor
environment with 16 microphones and 3 cameras - hence the name
"AV16.3". Lapel microphones were also used when possible. Signals
coming from all sensors were recorded in a fully synchronous manner.
In most (but not all) sequences, people are wearing a colored ball
marker on the top of their head, in order to facilitate 2D annotation
and/or 3D reconstruction.

For a detailed description of each recording, see CONTENTS_DETAILED
(annotated recordings) and CONTENTS_DETAILED_ALL (all recordings) in
this directory. These files describe the type of behavior of the
actors in detail, and give the timecode information needed to
synchronize the various audio/video streams.

IMPORTANT: note that audio always starts at timecode 00:00:10.00

A short description of each directory is given below, in part ( 3 ).

Many different types of behaviours were recorded, including visual
occlusions, speech overlaps, sharp motions, etc. The purpose is
multidisciplinary: the data may be of interest for audio-only
localization and tracking, video-only, or both. The main motivation
for these recordings is the systematic assessment of research
algorithms, using e.g. the true 3D mouth location provided by
calibrated cameras and 2D measurements on the various images. For
more information, refer to the IDIAP Research Report RR 04-28
(http://www.idiap.ch).
A selection of potentially interesting parts of the corpus, besides
the data:

- MATLAB script to access useful information such as array geometry,
  file names, etc.

    http://glat.info/ma/av16.3/seq_av163.m

- Video annotation interfaces (head, ball marker, mouth):

    http://glat.info/ma/av16.3/HAI      ( head box annotation )
    http://glat.info/ma/av16.3/BAI      ( ball marker box annotation )
    http://glat.info/ma/av16.3/MAI      ( mouth location annotation )
    http://glat.info/ma/av16.3/FORMATS  ( description of file formats )

- Examples of use:

    http://glat.info/ma/av16.3/EXAMPLES/AUDIO/README              ( single audio source )
    http://glat.info/ma/av16.3/EXAMPLES/VIDEO/README              ( multiple objects )
    http://glat.info/ma/av16.3/EXAMPLES/3D-RECONSTRUCTION/README  ( 3D mouth location )

Guillaume Lathoud    - glathoud@yahoo.fr
Jean-Marc Odobez     - odobez@idiap.ch
Daniel Gatica-Perez  - gatica@idiap.ch

==================================================
( 2 ) GEOMETRY
==================================================

The room was 8.2 m x 3.6 m x 2.4 m, with a long table in the middle.
For more details consult: ./com02-07.pdf

All recordings named "seq*" use two 8-microphone uniform circular
arrays (0.1 m radius) placed 0.04 m above the table. The centers of
the two arrays are separated by 0.8 m. The origin of the 3D reference
frame used everywhere in this corpus is the middle point between the
2 arrays. The Z axis is pointing upward.

* microphone array plane : z = 0
* table surface          : z = -0.04
* room floor             : z = -0.84
* other points (table, walls, panels) : see gt.mat in ./CAL_session08/

(x,y,z) microphone coordinates are given below the graph.

Top view (not to scale; see ./com02-07.pdf for exact positions):

                       window

      c1
                         m7
                     m8      m6
                   m1          m5
                     m2      m4
                         m3

                       Y ^
                         |
                         |
                         O-->X                right wall

                         m15
                     m16     m14
                   m9          m13
                     m10     m12
                         m11

      c2                                 c3

Microphone arrays: 2x8 microphones (m1 to m16 in the graph).
With MATLAB or Octave:

    i = 1:16;                                 % microphone index (m1 to m16)
    angle = pi * (-1 + 2 * mod(i-1,8) / 8);   % angular position on each circle
    % One row per microphone, columns are x, y, z in meters.
    % Array 1 (m1-m8) is centered at y = +0.4, array 2 (m9-m16) at y = -0.4.
    xyz = [ 0.1 * cos(angle); 0.1 * sin(angle) + 0.4 - 0.8 * (i>8); 0 * angle ].'
xyz should look like this:

    -0.10000    0.40000   -0.00000
    -0.07071    0.32929   -0.00000
     0.00000    0.30000   -0.00000
     0.07071    0.32929   -0.00000
     0.10000    0.40000    0.00000
     0.07071    0.47071    0.00000
     0.00000    0.50000    0.00000
    -0.07071    0.47071    0.00000
    -0.10000   -0.40000   -0.00000
    -0.07071   -0.47071   -0.00000
     0.00000   -0.50000   -0.00000
     0.07071   -0.47071   -0.00000
     0.10000   -0.40000    0.00000
     0.07071   -0.32929    0.00000
     0.00000   -0.30000    0.00000
    -0.07071   -0.32929    0.00000

For more information, see also ./seq_av163.m

==================================================
( 3 ) FILE-BY-FILE DESCRIPTION
==================================================

( 3.1 ) Files:

AV163.pdf
    -> publication motivating and describing the corpus, including
       the camera calibration process, the available 2D and 3D
       ground-truth, and examples of use.

CONTENTS_DETAILED
    -> details about each ANNOTATED recording, including a
       description of the actors' behaviour, timecodes where each
       video starts, etc.

CONTENTS_DETAILED_ALL
    -> details about each recording, including a description of the
       actors' behaviour, timecodes where each video starts, etc.

FORMATS
    -> describes the format of the annotation files, i.e. files with
       extensions ".headgt", ".mouthgt", ".ballgt", ".3dmouthgt",
       ".3dballgt".

README
    -> this file.

readradfile.m
    -> MATLAB function to read radial distortion parameters from a
       file.

seq_av163.m
    -> useful MATLAB function to get audio and video file pointers,
       camera calibration parameters, etc. for a given sequence.

static_gt_2_angle_gt.m
    -> MATLAB function to extract static speaker annotation
       ( 3D mouth location + speech/silence segmentation ).
       It is only applicable to "seq01-1p-0000" and "seq37-3p-0001".
       See EXAMPLES/AUDIO for an example of use.

( 3.2 ) Data directories:

For example, the name "seq37-3p-0001" contains three parts:

- "seq37" is the unique identifier of this recording: sequence #37.

- "3p" means that overall 3 persons were recorded - but not
  necessarily all visible simultaneously.
- "0001" is a set of four binary flags giving a quick overview of
  the contents of this recording. From left to right:

    bit 1: 0 means "very constrained", 1 means "mostly unconstrained"
           (general behavior: although most recordings follow some
           sort of scenario, some include very strong constraints
           such as the speaker facing the microphone arrays at all
           times)

    bit 2: 0 means "static", 1 means "dynamic"
           (static = sporadic motion (e.g. mostly seated),
           dynamic = continuous motion)

    bit 3: 0 means "minor occlusion(s)", 1 means "at least one major
           occlusion"
           (for at least one array or camera: whenever somebody
           passes in front of or behind somebody else)

    bit 4: 0 means "little overlap", 1 means "significant overlap"
           (audio: indicates whether there is a significant
           proportion of overlap between speakers and/or noise
           sources)

Except for seq37, all data directories mentioned here contain two
subdirectories:

    seq??-?p-????/16kHz:      the audio waveforms.
    seq??-?p-????/annotation: the annotation files.

seq01-1p-0000/
    -> Single static speaker.
    -> Continuous 2D and 3D mouth location annotation complete.
    -> Sparse 2D head annotation complete.
    -> Precise speech/silence segmentation complete.
    -> Example of use available in the EXAMPLES/AUDIO directory.

seq11-1p-0100/
    -> Single moving speaker sequence.
    -> Sparse 2D mouth location annotation complete.
    -> Sparse 2D head annotation complete.

seq15-1p-0100/
    -> Single moving speaker sequence (no ball marker).
    -> Sparse 2D mouth location annotation complete.
    -> Sparse 2D head annotation complete.

seq18-2p-0101/
    -> Two moving speakers, getting very close to each other.
    -> Sparse 2D mouth location annotation complete.
    -> Sparse 2D head annotation complete.

seq24-2p-0111/
    -> Two moving, walking speakers.
    -> Sparse 2D mouth location annotation complete.
    -> Sparse 2D head annotation complete.

seq37-3p-0001/
    -> Three static speakers.
    -> Continuous 2D and 3D mouth location annotation complete.
    -> Rough speech/silence segmentation complete.

seq40-3p-0111/
    -> Two static speakers + a third moving speaker.
    -> Sparse 2D mouth location annotation complete.
    -> Sparse 2D head annotation complete.

seq45-3p-1111/
    -> Three moving speakers with many occlusion cases.
    -> Sparse 2D mouth location annotation complete.
    -> Sparse 2D head annotation complete.

( 3.3 ) Other data directories:

These include the data listed in ( 3.2 ) plus all other,
non-annotated sequences. The only difference between sessions is a
minor horizontal shift in the image plane. If you need to do 3D
reconstruction, go to the corresponding "CAL_sessionXX" directory or
use the "seq_av163.m" MATLAB function to obtain complete camera
calibration information.

    session08/
    session09/
    session10/
    session11/
    session12/

This directory contains an additional group of three audio-only
recordings made with loudspeakers at various locations (3D locations
and speech/silence segmentation known by construction). It includes a
README file and the original WAV files played by the loudspeakers:

    synthmultisource/

( 3.4 ) Remaining directories:

BAI/
    -> Ball Annotation Interface (includes a tracker).

HAI/
    -> Head Annotation Interface.

MAI/
    -> Mouth Annotation Interface.

CAL_session08/
    -> Camera calibration parameters for sequences in session08.
       Text files and MATLAB files.

CAL_session09/
    -> Minor image plane shift parameters (deltax,deltay) relative
       to session08. Text files and MATLAB files.

CAL_session10/
    -> Minor image plane shift parameters (deltax,deltay) relative
       to session08. Text files and MATLAB files.

CAL_session11/
    -> Minor image plane shift parameters (deltax,deltay) relative
       to session08. Text files and MATLAB files.

CAL_session12/
    -> Minor image plane shift parameters (deltax,deltay) relative
       to session08. Text files and MATLAB files.

EXAMPLES/AUDIO/
    -> Single audio source localization + comparison with
       ground-truth. Contains one README file.
       Contains all the MATLAB code necessary to run the example.

EXAMPLES/VIDEO/
    -> Multi-object tracking example.
       Contains one README file, and tracking results.

EXAMPLES/3D-RECONSTRUCTION/
    -> How to produce continuous 3D mouth location annotation.
       Contains one README file, an example, and all necessary
       MATLAB code.

==================================================
( 4 ) LOG OF MODIFICATIONS
==================================================

October 9th, 2006: Added the 2D & 3D mouth annotation for seq02 and
seq03, in both ASCII and MATLAB 6.5.1 format. Note that each MATLAB
file also contains a rough speech/silence segmentation.

    session09/seq02-1p-0000/annotation/*.mouthgt
    session09/seq02-1p-0000/seq02-1p-0000_gt.mat
    session08/seq03-1p-0000/annotation/*.mouthgt
    session08/seq03-1p-0000/seq03-1p-0000_gt.mat

----------

February 21st, 2006: Added the annotation of ball marker and mouth,
including interpolated 3D measures, for seq15, seq40 and seq45.

    session10/seq40-3p-0111/annotation/*gt
    session10/seq45-3p-0111/annotation/*gt
    session11/seq15-1p-0100/annotation/*gt

----------

December 16th, 2004: Head annotation (bounding box) complete, and
"*.headgt" files uploaded for the following sequences:

    seq01-1p-0000
    seq11-1p-0100
    seq15-1p-0100
    seq18-2p-0101
    seq24-2p-0111
    seq40-3p-0111
    seq45-3p-1111

Thanks to all involved partners from the AMI project (TNO, Sheffield,
BRNO and IDIAP).

----------

August 30th, 2004: Entire corpus uploaded. Audio and video examples
uploaded. 2D mouth annotation files uploaded. 3D reconstruction
example uploaded.

----------

August 8th, 2004: File names were changed. Namely:

    "seq1_jitendra"          became "seq01-1p-0000"
    "seq3_dig"               became "seq37-3p-0001"
    "seq6_iain2"             became "seq11-1p-0100"
    "seq9_daniel_guillaume"  became "seq24-2p-0111"
    "seq10_fabien_viktoria"  became "seq18-2p-0101"
    "seq11_d_jm_g"           became "seq40-3p-0111"
    "seq12_d_jm_g"           became "seq45-3p-1111"
    "seq19_guillaume2"       became "seq15-1p-0100"

The new names all follow the same coding scheme. For example, the
name "seq37-3p-0001" contains three parts:

- "seq37" is the unique identifier of this recording: sequence #37.

- "3p" means that overall 3 persons were recorded - but not
  necessarily all visible simultaneously.

- "0001" is a set of four binary flags giving a quick overview of
  the contents of this recording. From left to right:

    bit 1: 0 means "very constrained", 1 means "mostly unconstrained"
           (general behavior: although most recordings follow some
           sort of scenario, some include very strong constraints
           such as the speaker facing the microphone arrays at all
           times)

    bit 2: 0 means "static", 1 means "dynamic"
           (static = sporadic motion (e.g. mostly seated),
           dynamic = continuous motion)

    bit 3: 0 means "minor occlusion(s)", 1 means "at least one major
           occlusion"
           (for at least one array or camera: whenever somebody
           passes in front of or behind somebody else)

    bit 4: 0 means "little overlap", 1 means "significant overlap"
           (audio: indicates whether there is a significant
           proportion of overlap between speakers and/or noise
           sources)
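
The coding scheme described above can be decoded mechanically. The
MATLAB/Octave lines below are a minimal sketch and are not part of
the corpus distribution: the variable names (name, tok, flags, info)
and the field names of "info" are illustrative assumptions, and the
pattern assumes the "seq??-?p-????" form used in this README.

```matlab
% Hypothetical helper script (not shipped with the corpus):
% decode a sequence name such as 'seq37-3p-0001'.
name = 'seq37-3p-0001';

% Split the name into sequence number, number of persons and flags.
tok = regexp(name, '^seq(\d+)-(\d)p-([01]{4})$', 'tokens', 'once');
if isempty(tok)
    error('Unexpected sequence name: %s', name);
end

% Compare the four flag characters with '1': 1x4 logical, left to right.
flags = (tok{3} == '1');

info = struct();
info.id            = str2double(tok{1});  % sequence number, e.g. 37
info.n_persons     = str2double(tok{2});  % persons recorded overall
info.unconstrained = flags(1);  % bit 1: 1 = mostly unconstrained
info.dynamic       = flags(2);  % bit 2: 1 = dynamic (continuous motion)
info.major_occl    = flags(3);  % bit 3: 1 = at least one major occlusion
info.overlap       = flags(4);  % bit 4: 1 = significant overlap

disp(info)
```

For "seq37-3p-0001" only the last flag is set: a constrained, static
recording with minor occlusions and significant overlap. Setting name
to 'seq45-3p-1111' sets all four flags.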