seq01-1p-0000, which contains the spatial location and the speech time segmentation:
static_gt( 1 ).p2d is a 9 by 16 matrix of true 2D speaker locations, in homogeneous image coordinates, one (x, y, 1) triplet per camera:
[ x1; y1; 1; x2; y2; 1; x3; y3; 1 ]
static_gt( 1 ).p3d is a 3 by 16 matrix of true 3D speaker locations, reconstructed from
static_gt( 1 ).p2d; see http://glat.info/ma/av16.3/EXAMPLES/3D-RECONSTRUCTION/index.html
static_gt( 1 ).sp_seg is a 2 by 169 matrix of speech segments:
static_gt( 1 ).pos_ind is a 1 by 169 matrix of integers. Each integer is a column index into .p3d: it tells where the speaker was for a given speech segment, because the columns of pos_ind match the columns of sp_seg.
static_gt( 1 ).speaker_id is a 1 by 169 matrix of integers, telling who spoke for a given segment in sp_seg (here there is only one speaker in the whole sequence).
static_gt( 1 ).array( 1 ).Pmat and
static_gt( 1 ).array( 2 ).Pmat are each a 4x4 homogeneous 3D transform matrix (rotation + translation) defining the 3D reference frame of each microphone array; see http://glat.info/ma/av16.3/EXAMPLES/3D-RECONSTRUCTION/index.html for a concrete example.
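To make the role of such a Pmat concrete, here is a small sketch (in Python rather than the corpus's MATLAB, and with a made-up translation-only Pmat) of how a 4x4 homogeneous transform maps a 3D point from one reference frame into another:

```python
def apply_pmat(Pmat, p_local):
    """Apply a 4x4 homogeneous 3D transform (rotation + translation)
    to a 3D point, returning the transformed 3D coordinates."""
    x, y, z = p_local
    v = (x, y, z, 1.0)                     # homogeneous coordinates
    return [sum(Pmat[r][c] * v[c] for c in range(4)) for r in range(3)]

# Hypothetical example: identity rotation, translation by (1, 2, 3)
Pmat = [[1, 0, 0, 1],
        [0, 1, 0, 2],
        [0, 0, 1, 3],
        [0, 0, 0, 1]]
p_global = apply_pmat(Pmat, (0.0, 0.0, 0.0))
```

In homogeneous coordinates the rotation occupies the top-left 3x3 block and the translation the last column, so a single matrix product handles both.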
Subject: Query regarding the UCA hardware setup
I am writing to ask for suggestions and information regarding the hardware setup you used in your thesis. I saw a picture of it on your personal webpage. What kind of microphones are you using, and what preamps and sound card did you use for recording?
We have a 16-channel UCA in our lab, built along the lines of the NIST Mark III microphone array, and we use an RME 800 sound card for audio capture. Recently we realized that the microphones are not of the best quality for Blind Source Separation, and that the frequency response is mismatched between microphones, so we are trying to find a better replacement.
I do not remember which types of microphones we used - please simply write about this topic to Olivier Masson (firstname.lastname@example.org) on my behalf.
More specifically about the NIST Mark III array: other people have already had coherence problems similar to yours. Dr Luca Brayda made modifications to the Mark III to correct the problem (see the two references at the end of this document). He was at Eurecom at that time; maybe you can try to write to him there.
In your paper "Spatio-temporal Analysis of Spontaneous Speech with Microphone Arrays", Chapter 2, Section 2.2 "Discrete-Time Processing of Quasi-Stationary Signals", Note 3 tells me: "All signals x(t), y(t) etc. are assumed to be limited to the frequency band [0, fs/2]".
My question is: have all the *.wav files I downloaded from your website already been limited to the frequency band [0, fs/2], or do I have to limit the signals to the frequency band [0, fs/2] in my MATLAB program?
The wav signals are already band-limited, within [0 fs/2]. You do not need to modify them, and you can use them as they are.
PS: Another way to understand this is that you can reconstruct the whole, *continuous* signal curve from the *discrete* samples in the .wav, assuming that the signal is band-limited (no frequency above half the sampling frequency: fs/2). This theoretical point becomes particularly practical when doing upsampling (e.g. in time domain GCC-PHAT).
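To illustrate the reconstruction point, here is a small Python sketch (not part of the original MATLAB code) of band-limited interpolation: each discrete sample contributes one sinc term, and the sum evaluates the continuous curve at any time instant, per the sampling theorem.

```python
import math

def sinc_reconstruct(samples, t):
    """Band-limited interpolation: evaluate sum_n x[n] * sinc(t - n)
    at any (possibly non-integer) time t, in sample units."""
    total = 0.0
    for n, x in enumerate(samples):
        u = t - n
        total += x if u == 0 else x * math.sin(math.pi * u) / (math.pi * u)
    return total

# A slow sinusoid sampled at integer times...
samples = [math.sin(2 * math.pi * 0.05 * n) for n in range(64)]
# ...evaluated halfway between two samples (truncating the theoretically
# infinite sinc sum to a finite window introduces a small error)
mid = sinc_reconstruct(samples, 31.5)
```

At integer times the sum reproduces the stored samples exactly; between samples it gives the band-limited curve, which is precisely what time-domain upsampling computes on a regular fine grid.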
I am now studying your implementation of SPR-PHAT and have several questions.
1. Why can the following code convert the grid into an index?
% We also convert the time-delays of the grid into a more usable format
% -> 1-dimensional integer index in up_gccphat(:)
up_grid_index = up_rowzero - round( grid_td * p.upfactor );
up_grid_index = up_grid_index + repmat( up_nrows * (0:npairs-1).', 1, size( up_grid_index, 2 ) );
Our target is a matrix of upsampled time-domain GCC-PHAT values ("upsampled" meaning a time resolution p.upfactor times finer than one sample).
up_rowzero is the row index of the zero time delay.
grid_td contains time delay values.
repmat( ... ) converts the row index values to 1-dimensional linear index values, so that we can write something like
up_gccphat( up_grid_index(...) ).
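In Python terms the trick can be sketched as follows (0-based pair numbering; up_rowzero, upfactor and up_nrows are stand-ins for the MATLAB variables above): each time delay first becomes a row index, and the repmat-style offset up_nrows * pair then turns each (row, pair) position into an index into the column-major flattened matrix.

```python
def grid_to_linear_index(grid_td, up_rowzero, upfactor, up_nrows):
    """grid_td[pair][k] = time delay (in samples) of grid point k for one
    microphone pair. Returns, per pair, indices into the column-major
    flattened up_gccphat matrix (up_nrows rows, one column per pair)."""
    idx = []
    for pair, delays in enumerate(grid_td):
        row = [up_rowzero - round(td * upfactor)   # delay -> row index
               + up_nrows * pair                   # column offset (repmat step)
               for td in delays]
        idx.append(row)
    return idx

# Two pairs, two grid points each:
idx = grid_to_linear_index([[0.0, 1.0], [0.0, -1.0]],
                           up_rowzero=10, upfactor=4, up_nrows=32)
```

This is the same idea as MATLAB's linear indexing (index = row + nrows * (col - 1)): one integer lookup per grid point, instead of a (row, column) pair.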
2. Why is upsampling necessary?
Time delays
tau = distance difference / speed of sound * sampling frequency
appear at non-integer values (e.g. 1.234 samples). So you need an accuracy better than what the inverse FFT provides you (integer values only: tau = 1 or tau = 2, for example).
Just play with any two signals, computing the frequency-domain GCC-PHAT, then the inverse FFT, and then play with upsample, and you will see the difference.
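Here is such an experiment as a self-contained Python/NumPy toy (not the original MATLAB code): two white-noise signals offset by a fractional delay of 3.25 samples, with GCC-PHAT computed once at integer-sample resolution and once on an 8x upsampled correlation.

```python
import numpy as np

def gcc_phat(x, y, upfactor=1):
    """Circular GCC-PHAT between two equal-length real signals.
    upfactor > 1 upsamples the time-domain correlation by zero-padding
    the whitened cross-spectrum before the inverse FFT."""
    n = len(x)
    C = np.conj(np.fft.rfft(x)) * np.fft.rfft(y)
    C = C / (np.abs(C) + 1e-12)            # PHAT weighting: keep phase only
    m = n * upfactor
    r = np.fft.irfft(C, m)                 # finer lag grid when upfactor > 1
    lag = int(np.argmax(r))
    if lag > m // 2:                       # map circular index to signed lag
        lag -= m
    return lag / upfactor                  # delay estimate, in samples

# Two white-noise signals, the second delayed by 3.25 samples (fractional
# delay applied as a linear phase in the frequency domain, i.e. a
# circular delay -- good enough for this illustration).
n = 1024
x = np.random.default_rng(0).standard_normal(n)
X = np.fft.rfft(x)
true_delay = 3.25
y = np.fft.irfft(X * np.exp(-2j * np.pi * np.arange(len(X)) * true_delay / n), n)

est_coarse = gcc_phat(x, y, upfactor=1)    # integer-sample resolution
est_fine = gcc_phat(x, y, upfactor=8)      # 1/8-sample resolution
```

The coarse estimate can only land on an integer lag, while the upsampled correlation resolves the remaining 0.25-sample fraction.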
3. In:
halfportion = 1 + ceil( max( maxdelay ) ) + ceil( p.filterorder / p.upfactor );
up_rowzero = halforder + 1 + ( rowzero - rowstart ) * p.upfactor;
Could you please tell me why the left-hand side is equal to the right-hand side?
halfportion determines what portion of the upsampled FFT we need: when we know that two microphones are 20 cm apart, we don't need to consider delays bigger than what the tau equation above tells us. halfportion in turn should determine rowstart, if I remember correctly.
halforder compensates the delay introduced by the low-pass filter used right after upsampling ( fir1(...), if I remember correctly ).
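As a quick sanity check of the 20 cm remark, the tau equation bounds the delays one ever needs to consider. A tiny Python helper (name hypothetical, assuming c = 343 m/s and fs = 16 kHz):

```python
def max_delay_samples(mic_distance_m, fs, c=343.0):
    """Largest possible time delay of arrival, in samples, for two
    microphones mic_distance_m apart:
    tau = distance difference / speed of sound * sampling frequency,
    where the distance difference is at most the microphone spacing."""
    return mic_distance_m / c * fs

tau_max = max_delay_samples(0.2, 16000.0)   # about 9.33 samples
```

So for a 20 cm pair at 16 kHz, any correlation lag beyond roughly +/- 10 samples is physically impossible and can be discarded.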
As you can see in these two lines:
addpath ../..
o = seq_av163( 'seq01-1p-0000', '../..' );
...you need to install other things, including:
../../seq_av163.m (another MATLAB file)
...which you'll find through AV16.3's file index.
You also need to install the test data:
../../session08/seq01-1p-0000/ (a directory)
which you'll find in AV16.3's session08/seq01-1p-0000/ directory.
If you encounter any specific trouble running this baseline example, let me know.
For your information, I wrote these programs under Matlab 6.5.1 (R13).
Now, as an attempt to help you, I offer several solutions:
PAR.BLOCK_SIZE_SEC = 10; % in seconds
Similarly, you can also modify PAR.BLOCK_SIZE_SEC in the following two files:
PAR.BLOCK_SIZE_SEC = 2; % in seconds (any value that you want)
Yes, because the core assumptions are *not* specific to speech. We assume, in the time domain:
This is not specific to speech at all. For more details, see Section 5 in  (esp. Section 5.1 and Fig. 5).
You are by the way absolutely free to use other PDFs. I tried other, more complex PDFs than the ones in the paper, but that led to overfitting issues and suboptimal results.
Practically, it seems to be safer to stick to PDFs with very few parameters. In particular, the mathematical structure of the two PDFs should reflect assumptions A1. and A2. (e.g. "the signal of interest has *bigger* amplitudes than the noise signal" -> *shifted* Erlang pdf for the target speech signal).
In case your noise signal does not match Figs. 5a and 5b in  well, you may need some pre-processing - any sort of whitening, e.g. channel estimation and de-convolution, which is the idea of CHN in .
Now let us assume a noise signal that matches Figs. 5a and 5b in  well, and a signal of interest that has at least long tails (= bigger amplitudes than noise), as in Fig. 5c in . An important assumption in A1. is "slowly-varying". That's why the code processes the signal in blocks:
% Size of the block on which we apply CHN / RSE-USS / GMN-USS
par_default.block_size_sec = 1.0; % 1.0 second
The block should be one or two orders of magnitude larger than a typical variation of the target signal (10 to 20 ms in the case of speech). The noise signal is assumed to have a constant amplitude within one block (e.g. 1 second), which also means that we allow noise amplitude variations from one block to the next (1 second in the speech case).
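A minimal Python sketch of this block-wise processing (function name hypothetical; the actual code is MATLAB): cut the signal into fixed-duration blocks, then re-estimate the model parameters, e.g. the noise amplitude, once per block.

```python
def split_into_blocks(signal, fs, block_size_sec=1.0):
    """Cut a 1-D signal into consecutive blocks of block_size_sec
    seconds; the last block may be shorter. Parameters assumed constant
    within a block (e.g. noise amplitude) are re-estimated per block."""
    blen = int(round(block_size_sec * fs))
    return [signal[i:i + blen] for i in range(0, len(signal), blen)]

# 35000 samples at 16 kHz, 1-second blocks -> 16000 + 16000 + 3000
blocks = split_into_blocks([0.0] * 35000, fs=16000, block_size_sec=1.0)
```

With 1-second blocks and speech, each block spans 50 to 100 typical signal variations, which is what "slowly-varying" requires.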
P(A) is the probability to observe the target (speech) signal at any given (time, frequency) point of the spectrum.
P(I) is the probability *not* to observe the target speech signal at any given (time, frequency) point of the spectrum.
Hence P(I) + P(A) = 1.
Within the Expectation-Maximization (EM) context, you can find an example in  (P(Zi = 1) and P(Zi = 2)).
We are using the EM algorithm to adjust all parameters together - including the priors - so as to maximize the likelihood of the observed data (again, see  for an example). That is what we call "fitting the model (Rayleigh + Shifted Erlang) to the data", and is implemented in two steps in , both fully automatic:
Step 1. Automatic initialization using a rough, but reliable initial estimate. In :
% Init priors
raylsherl.p0 = max( 0.1, min( 0.9, numel( lsr_ind ) / nx ) );
--> automatically yields a reasonable value for P(I), guaranteed to be between 0.1 and 0.9.
Step 2. EM iterations, for example the update of the priors:
% Update priors
tmp = my_logsum_fast( log_w );
raylsherl.p0 = min( 1, exp( tmp( 1 ) - my_logsum_fast( tmp ) ) );
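For illustration, here is a small Python version of this prior update, under two assumptions: my_logsum_fast computes a log-sum-exp, and log_w holds one row of log-weights per component, component 0 being "I".

```python
import math

def logsumexp(vals):
    """Numerically stable log(sum(exp(vals)))."""
    m = max(vals)
    return m + math.log(sum(math.exp(v - m) for v in vals))

def update_prior_p0(log_w):
    """EM update of the prior P(I): the total (log-domain) weight of
    component 0, normalized by the total weight of all components."""
    per_component = [logsumexp(row) for row in log_w]  # 'tmp' above
    return min(1.0, math.exp(per_component[0] - logsumexp(per_component)))

# Two components, two observations; per-column weights sum to 1:
p0 = update_prior_p0([[math.log(0.3), math.log(0.7)],
                      [math.log(0.7), math.log(0.3)]])
```

Working in the log domain avoids numerical underflow when many very small per-sample weights are summed, which is why the MATLAB code uses my_logsum_fast rather than summing raw probabilities.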
Step 1. is particularly crucial, because "an EM algorithm may converge to a local maximum of the observed data likelihood function, depending on starting values." (, Section "Properties"). Incompetent researchers choose to ignore this mathematical fact.
In other words, the EM algorithm is not guaranteed to converge to an absolute optimum choice of parameters (= priors and pdf parameters), which is why you should pay a lot of attention to Step 1, and have a "rough, but reliable initial estimate" of the priors. "Reliable" means that the initial estimate of the priors should guarantee a decent result of the EM convergence, in various situations (= various recordings, various contexts, various background noises).
In practice, initialization may be easier if you keep your PDFs simple (few parameters).
In the worst case, if for some reason you do not know (yet) what a good initialization should be, you can always use p0 = 0.5, which would mean: P(I) = 0.5 and P(A) = 1 - P(I) = 0.5
On the other hand, let me advise *against* manual tuning of the initial prior values, because it would mean that you are overfitting your data: you may obtain very good performance on one particular signal and very bad performance on another, i.e. a meaningless result.
 L. Brayda, C. Bertotti, L. Cristoforetti, M. Omologo, P. Svaizer, "Modifications on NIST MarkIII array to improve coherence properties among input signals", AES, 118th Audio Engineering Society Convention, Barcelona, Spain, May 2005.
 L. Brayda, C. Bertotti, L. Cristoforetti, M. Omologo, P. Svaizer, "On calibration and coherence signal analysis of the CHIL microphone network at IRST", Joint Workshop on Hands-Free Speech Communication and Microphone Arrays, Piscataway, NJ, March 2005.
Produced on 2011-09-27 by qa.scm - by Guillaume Lathoud (glathoud _at_ yahoo _dot_ fr)