seq01-1p-0000
The static ground truth (static_gt) contains both spatial location and speech time segmentation:
static_gt( 1 ).p2d
is a 9-by-16 matrix of true 2D speaker locations. Each column stacks three 2D points in homogeneous coordinates, one per camera:
[ x1; y1; 1; x2; y2; 1; x3; y3; 1 ]
static_gt( 1 ).p3d
is a 3-by-16 matrix of true 3D speaker locations, reconstructed from static_gt( 1 ).p2d; see http://glat.info/ma/av16.3/EXAMPLES/3D-RECONSTRUCTION/index.html
static_gt( 1 ).sp_seg
is a 2-by-169 matrix of speech segments, one column per segment.
static_gt( 1 ).pos_ind
is a 1-by-169 matrix of integers. Each integer is a column index into .p3d: it tells where the speaker was during a given speech segment, since the columns of sp_seg match the columns of pos_ind.
static_gt( 1 ).speaker_id
is a 1-by-169 matrix of integers, telling who spoke during each segment of sp_seg (here there is only one speaker in the whole sequence).
static_gt( 1 ).array( 1 ).Pmat
and static_gt( 1 ).array( 2 ).Pmat
are the 4-by-4 homogeneous 3D transform matrices (rotation + translation) defining the 3D reference frame of each microphone array; see http://glat.info/ma/av16.3/EXAMPLES/3D-RECONSTRUCTION/index.html for a concrete example.
Subject: Query regarding the UCA hardware setup
I am writing to ask you for some suggestions and information regarding the hardware setup you used in your thesis. I saw a picture of it on your personal webpage. What kind of microphones are you using, and what preamps and sound card were you using for recording?
Actually, we have a 16-channel UCA in our lab; we built it along the lines of the NIST Mark III microphone array. We are using an RME 800 sound card for audio capture. Recently we realized that the microphones are not of the best quality for Blind Source Separation and that the frequency response is mismatched between the microphones, so we are trying to find some better replacement.
I do not remember which types of microphones we used - please simply write about this topic to Olivier Masson (olivier.masson@idiap.ch) on my behalf.
More specifically about the NIST Mark III array: other people have already had coherence problems similar to yours. Dr Luca Brayda made modifications to the Mark III to correct the problem [1] [2]. He was at Eurecom at that time; maybe you can try to write to him there.
Good luck,
Dr Lathoud
In your paper "Spatio-Temporal Analysis of Spontaneous Speech with Microphone Arrays", Chapter 2, Section 2.2 "Discrete Time Processing of Quasi-Stationary Signals", Note 3 tells me: "All signals x(t), y(t) etc. are assumed to be limited to the frequency band [0, fs/2]".
My question is: have all the *.wav files I downloaded from your website already been limited to the frequency band [0, fs/2], or do I have to limit the signals to the frequency band [0, fs/2] in my MATLAB program?
The wav signals are already band-limited to [0, fs/2]. You do not need to modify them; you can use them as they are.
Best regards,
Dr. Lathoud
PS: Another way to understand this is that you can reconstruct the whole, *continuous* signal curve from the *discrete* samples in the .wav, assuming that the signal is band-limited (no frequency above half the sampling frequency: fs/2). This theoretical point becomes particularly practical when doing upsampling (e.g. in time domain GCC-PHAT).
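For instance, you can check in MATLAB that upsampling such a band-limited .wav requires no extra filtering on your side (the file name is hypothetical; wavread was current under MATLAB 6.5, recent versions use audioread):
% Sketch: upsample a band-limited .wav without any pre-filtering.
[ x, fs ] = wavread( 'seq01-1p-0000_array1_mic1.wav' ); % hypothetical name
up = 4;                    % upsampling factor
xu = resample( x, up, 1 ); % polyphase upsampling + low-pass filtering
% xu now approximates the continuous curve, sampled at up*fs.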
I am now studying your implementation of SRP-PHAT and have several questions.
1. Why can the following code convert the grid into indices?
% We also convert the time-delays of the grid into a more usable format
% -> 1-dimensional integer index into up_gccphat(:)
up_grid_index = up_rowzero - round( grid_td * p.upfactor );
up_grid_index = up_grid_index + repmat( up_nrows * (0:npairs-1).', 1, size( up_grid_index, 2 ) );
Our target is a matrix of upsampled, time-domain GCC-PHAT values. Upsampled means: interpolated to a delay resolution finer than one sample, by the factor p.upfactor.
About indexing: up_rowzero is the row index of the zero time delay, and grid_td contains time-delay values. The repmat( ... ) converts the row index values to 1-dimensional index values - so that we can write something like up_gccphat( up_grid_index(...) ). A toy computation follows.
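% Toy illustration, all values made up: a delay of -1.25 samples,
% for the third microphone pair (0-based pair index 2).
up_nrows = 100; up_rowzero = 50; upfactor = 20;
row = up_rowzero - round( -1.25 * upfactor ); % = 50 + 25 = 75
ind = row + up_nrows * 2;                     % = 275: linear index into up_gccphat(:)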
2. Why is upsampling necessary?
GCC-PHAT peaks
tau = distance difference / speed of sound * sampling frequency
appear at non-integer delay values (e.g. 1.234 samples). So you need an accuracy better than what the inverse FFT provides you with (tau = 1 or 2 samples).
Just play with any two signals: compute the frequency-domain GCC-PHAT, then the inverse FFT, then play with upsampling, and you will see the difference, as in the sketch below.
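Here is a minimal sketch of such an experiment - not the original implementation, just two synthetic signals, the frequency-domain GCC-PHAT, the inverse FFT, and resample-based upsampling:
% Minimal sketch (not the original implementation).
x1 = randn( 1024, 1 );
x2 = [ zeros( 3, 1 ) ; x1( 1:end-3 ) ]; % x1 delayed by 3 samples
X1 = fft( x1 );
X2 = fft( x2 );
G  = conj( X1 ) .* X2;          % cross-spectrum
G  = G ./ max( abs( G ), eps ); % PHAT weighting
g  = real( ifft( G ) );         % time-domain GCC-PHAT
up = 20;                        % upsampling factor
gu = resample( g, up, 1 );      % finer delay resolution
[ dummy, imax ] = max( gu );
tau = ( imax - 1 ) / up % here ~3 samples; with a true fractional delay
                        % (e.g. 3.25 samples), only the upsampled curve
                        % localizes the peak accurately.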
3. In:
halfportion = 1 + ceil( max( maxdelay ) ) + ceil( p.filterorder / p.upfactor );
up_rowzero = halforder + 1 + ( rowzero - rowstart ) * p.upfactor;
Could you please tell me why the left-hand side is equal to the right-hand side?
halfportion: it determines what portion of the upsampled FFT we need (when we know that two microphones are 20 cm apart, we don't need to consider delays bigger than what the tau equation above tells us). halfportion in turn should determine rowstart, if I remember correctly.
up_rowzero & halforder: halforder compensates the delay introduced by the low-pass filter used right after upsampling (fir1(...), if I remember correctly). A numeric illustration follows.
As you can see in these two lines:
addpath ../..
o = seq_av163( 'seq01-1p-0000', '../..' );
...you need to install other things, including:
../../seq_av163.m (another MATLAB file)
...which you'll find through AV16.3's file index.
You also need to install the test data:
../../session08/seq01-1p-0000/ (a directory)
which you'll find in AV16.3's session08/seq01-1p-0000/ directory.
If you encounter any specific trouble running this baseline example, let me know.
For your information I did these programs under Matlab 6.5.1 (R13).
Now, as an attempt to help you, I offer several solutions.
You can check how much memory is available to MATLAB with:
feature memstats
You can reduce the size of the processing blocks, replacing:
PAR.BLOCK_SIZE_SEC = 10; % in seconds
with:
PAR.BLOCK_SIZE_SEC = 2; % in seconds (any value that you want)
Similarly, you can also modify PAR.BLOCK_SIZE_SEC in the following two files:
Yes, because the core assumptions are *not* specific to speech. We assume in time domain:
This is not specific to speech at all. For more details, see Section 5 in [3] (esp. Section 5.1 and Fig. 5).
You are by the way absolutely free to use other PDFs. I tried other, more complex PDFs than the ones in the paper, but that led to overfitting issues and suboptimal results.
Practically, it seems safer to stick to PDFs with very few parameters. In particular, the mathematical structure of the two PDFs should reflect assumptions A1. and A2. (e.g. "the signal of interest has *bigger* amplitudes than the noise signal" -> *shifted* Erlang pdf for the target speech signal). A sketch follows.
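For reference, here is one way such low-parameter PDFs could be written down in MATLAB; the exact parameterization below is a guess for illustration, not copied from [3]:
% Sketch only; this parameterization is assumed, not taken from [3].
% Rayleigh pdf, a natural low-parameter model for noise magnitudes:
rayl_pdf = @( x, sigma ) ( x ./ sigma^2 ) .* exp( -x.^2 ./ ( 2 * sigma^2 ) );
% Shifted Erlang pdf for the target signal: the shift x0 >= 0 encodes
% "the signal of interest has *bigger* amplitudes than the noise".
sherl_pdf = @( x, k, lambda, x0 ) ( x >= x0 ) .* lambda.^k ...
    .* ( x - x0 ).^( k - 1 ) .* exp( -lambda .* ( x - x0 ) ) ...
    ./ factorial( k - 1 );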
In case your noise signal does not match Figs. 5a and 5b in [3] well, you may need some pre-processing - any sort of whitening, e.g. channel estimation and de-convolution, which is the idea of CHN in [3].
Now let us assume a noise signal that matches Figs. 5a and 5b in [3] well, and a signal of interest that has at least long tails (= bigger amplitudes than the noise), as in Fig. 5c in [3]. An important assumption is "slowly-varying" in A1. That is why the code processes the signal in blocks [4]:
% Size of the block on which we apply CHN / RSE-USS / GMN-USS
par_default.block_size_sec = 1.0; % 1.0 second
The block should be one or two orders of magnitude larger than a typical variation of the target signal (10 to 20 ms in the case of speech). The noise signal is assumed to have a constant amplitude within one block (e.g. 1 second), which also means that we allow the noise amplitude to vary from one block to the next (every 1 second in the speech case). A sketch of this block structure follows.
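% Sketch of the block structure; the variable names x, fs and par are
% assumed here - the real implementation is uss_filter.m [4].
block_size = round( par.block_size_sec * fs ); % e.g. 1.0 s -> 16000 samples
nblocks = floor( length( x ) / block_size );
for b = 1:nblocks
  ind = ( b - 1 ) * block_size + ( 1:block_size );
  xb  = x( ind );
  % ...fit the model (Rayleigh + shifted Erlang) on this block, then
  % subtract the noise, assuming a constant noise amplitude within it...
end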
P(A) is the probability to observe the target (speech) signal at any given (time, frequency) point of the spectrum.
P(I) is the probability *not* to observe the target speech signal at any given (time, frequency) point of the spectrum.
Hence P(I) + P(A) = 1.
Within the Expectation-Maximization (EM) context, you can find an example in [5] (P(Zi = 1) and P(Zi = 2)).
We are using the EM algorithm to adjust all parameters together - including the priors - so as to maximize the likelihood of the observed data (again, see [5] for an example). That is what we call "fitting the model (Rayleigh + Shifted Erlang) to the data", and it is implemented in two steps in [6], both fully automatic:
Step 1. Automatic initialization using a rough, but reliable initial estimate. In [6]:
% Init priors
raylsherl.p0 = max( 0.1, min( 0.9, numel( lsr_ind ) / nx ) );
--> automatically yields a reasonable value for P(I), guaranteed to lie between 0.1 and 0.9.
Step 2. At each iteration of the EM algorithm, in the M step, automatic adjustment of the priors, as explained in [5]. In our case the code [6] is:
% Update priors
tmp = my_logsum_fast( log_w );
raylsherl.p0 = min( 1, exp( tmp( 1 ) - my_logsum_fast( tmp ) ) );
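my_logsum_fast itself is not shown here; presumably (an assumption on my part) it computes a numerically stable log( sum( exp( . ) ) ), along the lines of:
% Assumption: a numerically stable log-sum-exp, operating along the
% same dimension as MATLAB's sum().
function y = logsum_sketch( x )
m = max( x );
y = m + log( sum( exp( x - repmat( m, size( x ) ./ size( m ) ) ) ) );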
Step 1. is particularly crucial, because "an EM algorithm may converge to a local maximum of the observed data likelihood function, depending on starting values." ([5], Section "Properties"). Incompetent researchers choose to ignore this mathematical fact.
In other words, the EM algorithm is not guaranteed to converge to the globally optimal choice of parameters (= priors and pdf parameters), which is why you should pay a lot of attention to Step 1, and have a "rough, but reliable initial estimate" of the priors. "Reliable" means that the initial estimate of the priors should guarantee a decent result of the EM convergence, and this in various situations (= various recordings, various contexts, various background noises).
In practice, initialization may be easier if you keep your PDFs simple (few parameters).
In the worst case, if for some reason you do not know (yet) what a good initialization should be, you can always use p0 = 0.5, which would mean: P(I) = 0.5 and P(A) = 1 - P(I) = 0.5
On the other hand, let me advise *against* manual tuning of the initial prior values, because it would mean that you are overfitting your data: you may obtain very good performance on a particular signal, and very bad performance on another = meaningless result.
Yes you are. If you only use USS, please cite the ASRU paper [7]. If you use CHN as well, please cite the RR 06-09 [3].
[3] Channel Normalization for Unsupervised Spectral Subtraction
[4] USS implementation: uss_filter.m
[5] Wikipedia - Expectation-Maximization algorithm
Produced on 2011-09-27 by qa.scm - by Guillaume Lathoud (glathoud _at_ yahoo _dot_ fr)