seq01-1p-0000, which contains the spatial location and the speech time segmentation:
static_gt( 1 ).p2d is a 9 by 16 matrix of true 2D speaker locations, in homogeneous image coordinates, one (x, y, 1) triplet per camera:
[ x1; y1; 1; x2; y2; 1; x3; y3; 1 ]
static_gt( 1 ).p3d is a 3 by 16 matrix of true 3D speaker locations, reconstructed from
static_gt( 1 ).p2d; see http://glat.info/ma/av16.3/EXAMPLES/3D-RECONSTRUCTION/index.html
static_gt( 1 ).sp_seg is a 2 by 169 matrix of speech segments:
static_gt( 1 ).pos_ind is a 1 by 169 matrix of integers. Each integer is a column index into .p3d: it tells where the speaker was for a given speech segment, because the columns of pos_ind match the columns of sp_seg.
static_gt( 1 ).speaker_id is a 1 by 169 matrix of integers, telling who spoke for a given segment in sp_seg (here there is only one speaker in the whole sequence).
static_gt( 1 ).array( 1 ).Pmat and
static_gt( 1 ).array( 2 ).Pmat are each a 4x4 homogeneous 3D transform matrix (rotation + translation) defining the 3D reference frame of each microphone array; see http://glat.info/ma/av16.3/EXAMPLES/3D-RECONSTRUCTION/index.html for a concrete example.
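To make the role of such a Pmat concrete, here is a small sketch (in Python rather than the corpus's MATLAB, and with a made-up translation-only Pmat) of how a 4x4 homogeneous transform maps a 3D point from one reference frame into another:

```python
def apply_pmat(Pmat, p_local):
    """Apply a 4x4 homogeneous 3D transform (rotation + translation)
    to a 3D point, returning the transformed 3D coordinates."""
    x, y, z = p_local
    v = (x, y, z, 1.0)                     # homogeneous coordinates
    return [sum(Pmat[r][c] * v[c] for c in range(4)) for r in range(3)]

# Hypothetical example: identity rotation, translation by (1, 2, 3)
Pmat = [[1, 0, 0, 1],
        [0, 1, 0, 2],
        [0, 0, 1, 3],
        [0, 0, 0, 1]]
p_global = apply_pmat(Pmat, (0.0, 0.0, 0.0))
```

In homogeneous coordinates the rotation occupies the top-left 3x3 block and the translation the last column, so a single matrix product handles both.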
Subject: Query regarding the UCA hardware setup
I am writing to ask for suggestions and information regarding the hardware setup you used in your thesis. I saw a picture of it on your personal webpage. What kind of microphones are you using, and what preamps and sound card did you use for recording?
We have a 16-channel UCA in our lab, built along the lines of the NIST Mark III microphone array, and we use an RME 800 sound card for audio capture. Recently we realized that the microphones are not of the best quality for Blind Source Separation, and that the frequency response is mismatched between microphones, so we are trying to find a better replacement.
I do not remember which types of microphones we used - please simply write about this topic to Olivier Masson (firstname.lastname@example.org) on my behalf.
More specifically about the NIST Mark III array: other people have already had coherence problems similar to yours. Dr Luca Brayda made modifications to the Mark III to correct the problem (see the two references at the end of this document). He was at Eurecom at that time; maybe you can try to write to him there.
In your paper "Spatio-temporal Analysis of Spontaneous Speech with Microphone Arrays", Chapter 2, Section 2.2 "Discrete-Time Processing of Quasi-Stationary Signals", Note 3 tells me: "All signals x(t), y(t) etc. are assumed to be limited to the frequency band [0, fs/2]".
My question is: have all the *.wav files I downloaded from your website already been limited to the frequency band [0, fs/2], or do I have to limit the signals to the frequency band [0, fs/2] in my MATLAB program?
The wav signals are already band-limited, within [0 fs/2]. You do not need to modify them, and you can use them as they are.
PS: Another way to understand this is that you can reconstruct the whole, *continuous* signal curve from the *discrete* samples in the .wav, assuming that the signal is band-limited (no frequency above half the sampling frequency: fs/2). This theoretical point becomes particularly practical when doing upsampling (e.g. in time domain GCC-PHAT).
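To illustrate the reconstruction point, here is a small Python sketch (not part of the original MATLAB code) of band-limited interpolation: each discrete sample contributes one sinc term, and the sum evaluates the continuous curve at any time instant, per the sampling theorem.

```python
import math

def sinc_reconstruct(samples, t):
    """Band-limited interpolation: evaluate sum_n x[n] * sinc(t - n)
    at any (possibly non-integer) time t, in sample units."""
    total = 0.0
    for n, x in enumerate(samples):
        u = t - n
        total += x if u == 0 else x * math.sin(math.pi * u) / (math.pi * u)
    return total

# A slow sinusoid sampled at integer times...
samples = [math.sin(2 * math.pi * 0.05 * n) for n in range(64)]
# ...evaluated halfway between two samples (truncating the theoretically
# infinite sinc sum to a finite window introduces a small error)
mid = sinc_reconstruct(samples, 31.5)
```

At integer times the sum reproduces the stored samples exactly; between samples it gives the band-limited curve, which is precisely what time-domain upsampling computes on a regular fine grid.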
I am now studying your implementation of SPR-PHAT and have several questions.
1. Why can the following code convert the grid into an index?
% We also convert the time-delays of the grid into a more usable format
% -> 1-dimensional integer index in up_gccphat(:)
up_grid_index = up_rowzero - round( grid_td * p.upfactor );
up_grid_index = up_grid_index + repmat( up_nrows * (0:npairs-1).', 1, size( up_grid_index, 2 ) );
Our target is a matrix of upsampled time-domain GCC-PHAT values ("upsampled" meaning a time resolution p.upfactor times finer than one sample).
up_rowzero is the row index of the zero time delay.
grid_td contains time delay values.
repmat( ... ) converts the row index values to 1-dimensional linear index values, so that we can write something like
up_gccphat( up_grid_index(...) ).
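In Python terms the trick can be sketched as follows (0-based pair numbering; up_rowzero, upfactor and up_nrows are stand-ins for the MATLAB variables above): each time delay first becomes a row index, and the repmat-style offset up_nrows * pair then turns each (row, pair) position into an index into the column-major flattened matrix.

```python
def grid_to_linear_index(grid_td, up_rowzero, upfactor, up_nrows):
    """grid_td[pair][k] = time delay (in samples) of grid point k for one
    microphone pair. Returns, per pair, indices into the column-major
    flattened up_gccphat matrix (up_nrows rows, one column per pair)."""
    idx = []
    for pair, delays in enumerate(grid_td):
        row = [up_rowzero - round(td * upfactor)   # delay -> row index
               + up_nrows * pair                   # column offset (repmat step)
               for td in delays]
        idx.append(row)
    return idx

# Two pairs, two grid points each:
idx = grid_to_linear_index([[0.0, 1.0], [0.0, -1.0]],
                           up_rowzero=10, upfactor=4, up_nrows=32)
```

This is the same idea as MATLAB's linear indexing (index = row + nrows * (col - 1)): one integer lookup per grid point, instead of a (row, column) pair.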
2. Why is upsampling necessary?
Time delays
tau = distance difference / speed of sound * sampling frequency
appear at non-integer values (e.g. 1.234 samples). So you need an accuracy better than what the inverse FFT provides you (integer values only: tau = 1 or tau = 2, for example).
Just play with any two signals, computing the frequency-domain GCC-PHAT, then the inverse FFT, and then play with upsample, and you will see the difference.
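Here is such an experiment as a self-contained Python/NumPy toy (not the original MATLAB code): two white-noise signals offset by a fractional delay of 3.25 samples, with GCC-PHAT computed once at integer-sample resolution and once on an 8x upsampled correlation.

```python
import numpy as np

def gcc_phat(x, y, upfactor=1):
    """Circular GCC-PHAT between two equal-length real signals.
    upfactor > 1 upsamples the time-domain correlation by zero-padding
    the whitened cross-spectrum before the inverse FFT."""
    n = len(x)
    C = np.conj(np.fft.rfft(x)) * np.fft.rfft(y)
    C = C / (np.abs(C) + 1e-12)            # PHAT weighting: keep phase only
    m = n * upfactor
    r = np.fft.irfft(C, m)                 # finer lag grid when upfactor > 1
    lag = int(np.argmax(r))
    if lag > m // 2:                       # map circular index to signed lag
        lag -= m
    return lag / upfactor                  # delay estimate, in samples

# Two white-noise signals, the second delayed by 3.25 samples (fractional
# delay applied as a linear phase in the frequency domain, i.e. a
# circular delay -- good enough for this illustration).
n = 1024
x = np.random.default_rng(0).standard_normal(n)
X = np.fft.rfft(x)
true_delay = 3.25
y = np.fft.irfft(X * np.exp(-2j * np.pi * np.arange(len(X)) * true_delay / n), n)

est_coarse = gcc_phat(x, y, upfactor=1)    # integer-sample resolution
est_fine = gcc_phat(x, y, upfactor=8)      # 1/8-sample resolution
```

The coarse estimate can only land on an integer lag, while the upsampled correlation resolves the remaining 0.25-sample fraction.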
3. In:
halfportion = 1 + ceil( max( maxdelay ) ) + ceil( p.filterorder / p.upfactor );
up_rowzero = halforder + 1 + ( rowzero - rowstart ) * p.upfactor;
Could you please tell me why the left-hand side is equal to the right-hand side?
halfportion determines what portion of the upsampled FFT we need: when we know that two microphones are 20 cm apart, we don't need to consider delays bigger than what the tau equation above tells us. halfportion in turn should determine rowstart, if I remember correctly.
halforder compensates the delay introduced by the low-pass filter used right after upsampling ( fir1(...), if I remember correctly ).
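As a quick sanity check of the 20 cm remark, the tau equation bounds the delays one ever needs to consider. A tiny Python helper (name hypothetical, assuming c = 343 m/s and fs = 16 kHz):

```python
def max_delay_samples(mic_distance_m, fs, c=343.0):
    """Largest possible time delay of arrival, in samples, for two
    microphones mic_distance_m apart:
    tau = distance difference / speed of sound * sampling frequency,
    where the distance difference is at most the microphone spacing."""
    return mic_distance_m / c * fs

tau_max = max_delay_samples(0.2, 16000.0)   # about 9.33 samples
```

So for a 20 cm pair at 16 kHz, any correlation lag beyond roughly +/- 10 samples is physically impossible and can be discarded.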
As you can see in these two lines:
addpath ../..
o = seq_av163( 'seq01-1p-0000', '../..' );
...you need to install other things, including:
../../seq_av163.m (another MATLAB file)
...which you'll find through AV16.3's file index.
You also need to install the test data:
../../session08/seq01-1p-0000/ (a directory)
which you'll find in AV16.3's session08/seq01-1p-0000/ directory.
If you encounter any specific trouble running this baseline example, let me know.
For your information, I wrote these programs under Matlab 6.5.1 (R13).
Now, as an attempt to help you, I offer several solutions:
PAR.BLOCK_SIZE_SEC = 10; % in seconds
Similarly, you can also modify PAR.BLOCK_SIZE_SEC in the following two files:
PAR.BLOCK_SIZE_SEC = 2; % in seconds (any value that you want)
Yes, because the core assumptions are *not* specific to speech. We assume, in the time domain:
This is not specific to speech at all. For more details, see Section 5 in  (esp. Section 5.1 and Fig. 5).
You are by the way absolutely free to use other PDFs. I tried other, more complex PDFs than the ones in the paper, but that led to overfitting issues and suboptimal results.
Practically, it seems to be safer to stick to PDFs with very few parameters. In particular, the mathematical structure of the two PDFs should reflect assumptions A1. and A2. (e.g. "the signal of interest has *bigger* amplitudes than the noise signal" -> *shifted* Erlang pdf for the target speech signal).
In case your noise signal does not match Figs. 5a and 5b in  well, you may need some pre-processing - any sort of whitening, e.g. channel estimation and de-convolution, which is the idea of CHN in .
Now let us assume a noise signal that matches Figs. 5a and 5b in  well, and a signal of interest that has at least long tails (= bigger amplitudes than noise), as in Fig. 5c in . An important assumption in A1. is "slowly-varying". That's why the code processes the signal in blocks:
% Size of the block on which we apply CHN / RSE-USS / GMN-USS
par_default.block_size_sec = 1.0; % 1.0 second
The block should be one or two orders of magnitude larger than a typical variation of the target signal (10 to 20 ms in the case of speech). The noise signal is assumed to have a constant amplitude within one block (e.g. 1 second), which also means that we allow noise amplitude variations from one block to the next (1 second in the speech case).
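A minimal Python sketch of this block-wise processing (function name hypothetical; the actual code is MATLAB): cut the signal into fixed-duration blocks, then re-estimate the model parameters, e.g. the noise amplitude, once per block.

```python
def split_into_blocks(signal, fs, block_size_sec=1.0):
    """Cut a 1-D signal into consecutive blocks of block_size_sec
    seconds; the last block may be shorter. Parameters assumed constant
    within a block (e.g. noise amplitude) are re-estimated per block."""
    blen = int(round(block_size_sec * fs))
    return [signal[i:i + blen] for i in range(0, len(signal), blen)]

# 35000 samples at 16 kHz, 1-second blocks -> 16000 + 16000 + 3000
blocks = split_into_blocks([0.0] * 35000, fs=16000, block_size_sec=1.0)
```

With 1-second blocks and speech, each block spans 50 to 100 typical signal variations, which is what "slowly-varying" requires.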
P(A) is the probability to observe the target (speech) signal at any given (time, frequency) point of the spectrum.
P(I) is the probability *not* to observe the target speech signal at any given (time, frequency) point of the spectrum.
Hence P(I) + P(A) = 1.
Within the Expectation-Maximization (EM) context, you can find an example in  (P(Zi = 1) and P(Zi = 2)).
We are using the EM algorithm to adjust all parameters together - including the priors - so as to maximize the likelihood of the observed data (again, see  for an example). That is what we call "fitting the model (Rayleigh + Shifted Erlang) to the data", and is implemented in two steps in , both fully automatic:
Step 1. Automatic initialization using a rough, but reliable initial estimate. In :
% Init priors
raylsherl.p0 = max( 0.1, min( 0.9, numel( lsr_ind ) / nx ) );
--> automatically yields a reasonable value for P(I), guaranteed to be between 0.1 and 0.9.
Step 2. EM iterations, for example the update of the priors:
% Update priors
tmp = my_logsum_fast( log_w );
raylsherl.p0 = min( 1, exp( tmp( 1 ) - my_logsum_fast( tmp ) ) );
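For illustration, here is a small Python version of this prior update, under two assumptions: my_logsum_fast computes a log-sum-exp, and log_w holds one row of log-weights per component, component 0 being "I".

```python
import math

def logsumexp(vals):
    """Numerically stable log(sum(exp(vals)))."""
    m = max(vals)
    return m + math.log(sum(math.exp(v - m) for v in vals))

def update_prior_p0(log_w):
    """EM update of the prior P(I): the total (log-domain) weight of
    component 0, normalized by the total weight of all components."""
    per_component = [logsumexp(row) for row in log_w]  # 'tmp' above
    return min(1.0, math.exp(per_component[0] - logsumexp(per_component)))

# Two components, two observations; per-column weights sum to 1:
p0 = update_prior_p0([[math.log(0.3), math.log(0.7)],
                      [math.log(0.7), math.log(0.3)]])
```

Working in the log domain avoids numerical underflow when many very small per-sample weights are summed, which is why the MATLAB code uses my_logsum_fast rather than summing raw probabilities.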
Step 1. is particularly crucial, because "an EM algorithm may converge to a local maximum of the observed data likelihood function, depending on starting values." (, Section "Properties"). Incompetent researchers choose to ignore this mathematical fact.
In other words, the EM algorithm is not guaranteed to converge to an absolute optimum choice of parameters (= priors and pdf parameters), which is why you should pay a lot of attention to Step 1, and have a "rough, but reliable initial estimate" of the priors. "Reliable" means that the initial estimate of the priors should guarantee a decent result of the EM convergence, in various situations (= various recordings, various contexts, various background noises).
In practice, initialization may be easier if you keep your PDFs simple (few parameters).
In the worst case, if for some reason you do not know (yet) what a good initialization should be, you can always use p0 = 0.5, which would mean: P(I) = 0.5 and P(A) = 1 - P(I) = 0.5
On the other hand, let me advise *against* manual tuning of the initial prior values, because it would mean that you are overfitting your data: you may obtain very good performance on one particular signal and very bad performance on another, i.e. a meaningless result.
 L. Brayda, C. Bertotti, L. Cristoforetti, M. Omologo, P. Svaizer, "Modifications on NIST MarkIII array to improve coherence properties among input signals", AES, 118th Audio Engineering Society Convention, Barcelona, Spain, May 2005.
 L. Brayda, C. Bertotti, L. Cristoforetti, M. Omologo, P. Svaizer, "On calibration and coherence signal analysis of the CHIL microphone network at IRST", Joint Workshop on Hands-Free Speech Communication and Microphone Arrays, Piscataway, NJ, March 2005.
Produced on 2011-09-27 by qa.scm - by Guillaume Lathoud (glathoud _at_ yahoo _dot_ fr)