Voice recognition from live stream from microphone?

Hi,

We are trying to record 1-second samples from the microphone, with a 0.5-second overlap between samples, and feed them to the KWS example. We have issues with the recognition: no matter what we say into the microphone, we get the probability value 0.083333. We have tried to increase the volume by multiplying the values. Have you ever tried something like this? Does the PCM mono data coming from the microphone need to be manipulated? I also tried saving my voice as hex values, converting it to wav and using that in the example. This seems to work if I increase the volume with, for example, Audacity.

Br,

Teppo

Hi Teppo,

That logic of taking 1 second of audio samples and then “sliding” by 0.5 seconds so you overlap by 0.5 seconds is what is currently done in the KWS sample and works as expected. If you use one of the longer clips in the resources folder you will see this in action and get multiple different keywords spotted in a sample.

Regarding the data, the preprocessing for the sample just expects PCM mono int16 data sampled at 16000Hz.
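Since multiplying the raw values came up: with int16 data a plain multiply can overflow and wrap around, which distorts the signal badly. Below is a minimal sketch of a saturating gain plus a quick peak-level check. Both helper names (ApplyGainSaturating, PeakAbs) are made up for illustration and are not part of the KWS sample; the sketch assumes C++17 for std::clamp.

```cpp
#include <algorithm>
#include <cstdint>
#include <cstdlib>
#include <vector>

// Hypothetical helper (not from the KWS sample): scale int16 PCM samples by
// `gain`, clamping to the int16 range so loud samples saturate instead of
// wrapping around.
void ApplyGainSaturating(std::vector<int16_t>& pcm, float gain)
{
    for (auto& sample : pcm) {
        const float scaled = static_cast<float>(sample) * gain;
        sample = static_cast<int16_t>(std::clamp(scaled, -32768.0f, 32767.0f));
    }
}

// Hypothetical helper: largest absolute sample value, as a rough indication
// of the input level reaching the preprocessing stage.
int32_t PeakAbs(const std::vector<int16_t>& pcm)
{
    int32_t peak = 0;
    for (const int16_t sample : pcm) {
        peak = std::max(peak, std::abs(static_cast<int32_t>(sample)));
    }
    return peak;
}
```

Comparing PeakAbs for a microphone buffer against one of the clips from the resources folder would show whether the live audio is simply much quieter than what the sample is normally fed.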

If your saved voice sample works once it has been converted to wav, then it is likely that something is going wrong when the data is sent from the microphone to the preprocessing stage.

One other thing to be aware of when taking audio from the mic: you might want to reduce how far you stride through the raw data. Instead of 0.5 seconds, you might want to run an inference every 640 audio samples and calculate a running average over the results to increase accuracy.
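As a rough sketch of the running average idea, the class below keeps the last few per-class score vectors and returns their mean. ScoreAverager and its depth parameter are assumptions made for illustration, not something the KWS sample provides; the score vector would come from whatever your inference step returns.

```cpp
#include <cstddef>
#include <deque>
#include <vector>

// Hypothetical smoother (not from the KWS sample): keeps the last `depth`
// per-class score vectors and returns their element-wise mean.
class ScoreAverager {
public:
    ScoreAverager(std::size_t numClasses, std::size_t depth)
        : m_sum(numClasses, 0.0f), m_depth(depth == 0 ? 1 : depth) {}

    // Add one inference result (expects at least numClasses entries) and
    // return the average over the results currently held.
    std::vector<float> Update(const std::vector<float>& scores)
    {
        m_history.push_back(scores);
        for (std::size_t i = 0; i < m_sum.size(); ++i) {
            m_sum[i] += scores[i];
        }
        // Drop the oldest result once more than `depth` are stored.
        if (m_history.size() > m_depth) {
            const std::vector<float>& oldest = m_history.front();
            for (std::size_t i = 0; i < m_sum.size(); ++i) {
                m_sum[i] -= oldest[i];
            }
            m_history.pop_front();
        }
        std::vector<float> average(m_sum.size());
        for (std::size_t i = 0; i < m_sum.size(); ++i) {
            average[i] = m_sum[i] / static_cast<float>(m_history.size());
        }
        return average;
    }

private:
    std::deque<std::vector<float>> m_history;
    std::vector<float> m_sum;
    std::size_t m_depth;
};
```

You would call Update once per inference and only report a keyword when its averaged score crosses your threshold, so a single noisy window does not trigger a detection on its own.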

Hope this helps,
Richard

Hi Richard, thanks for the response!

We got it working when we take 2-second mono samples at 16 kHz, but it only recognizes the words from the first 2-second sample. We have tried using just 2-second samples with no overlap between them. We also tried to initialize arm::app::Model and create everything again, but the issue is the same: after a device reset only the first 2-second sample gets recognized. Do we need to reset some memory areas, model/tensors, variables or such?

-Teppo

Hi @Burton2000,

An additional piece of info is that we use ML in exactly the same way as with multiple wav files (which we tested to work OK). The only difference is that with the MIC we use the same buffer for the data each round (we record the audio from the mic each round just before creating the audioDataSlider), whereas with wav files the system seems to have its own buffer for each sample.

*Kimmo

It’s sounding like an issue with getting the data correctly into preprocessing when using the MIC.

From what I am understanding, you tested with a 2 second wav file and everything is okay. When you move to using the MIC you record 2 seconds of audio, store this in a buffer and create an audioDataSlider using it. Anything in that first 2 seconds is recognized correctly (this would be 4 inferences total if none of the other settings have been changed). You then take the next 2 seconds, overwrite the old buffer, but now don’t get any detections in this next 2 seconds of audio, is that correct?

The audioDataSlider is meant for grabbing data from a baked-in array of audio data so there might be issues if trying to re-use a buffer for live audio. You would need to call Reset on the slider and point to the start of the buffer again. However, if you are creating a new audioDataSlider each time you grab enough new data from the MIC it shouldn’t be an issue I would think.
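To make the reset idea concrete, here is a simplified stand-in for a sliding window over a re-used buffer. SimpleSlidingWindow is a hypothetical class written for this post; it is not the kit's audio::SlidingWindow and its methods should not be taken as that class's API.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical, simplified sliding window over a caller-owned buffer.
// Illustration only; not the audio::SlidingWindow from the KWS sample.
class SimpleSlidingWindow {
public:
    SimpleSlidingWindow(const int16_t* data, std::size_t size,
                        std::size_t window, std::size_t stride)
        : m_data(data), m_size(size), m_window(window), m_stride(stride) {}

    // True while a full window still fits at the current offset.
    bool HasNext() const
    {
        return m_stride > 0 && m_offset + m_window <= m_size;
    }

    // Return a pointer to the current window and advance by one stride.
    const int16_t* Next()
    {
        const int16_t* window = m_data + m_offset;
        m_offset += m_stride;
        return window;
    }

    // Point back at the start of the buffer, e.g. after the buffer has been
    // overwritten with freshly recorded audio.
    void Reset() { m_offset = 0; }

private:
    const int16_t* m_data;
    std::size_t m_size;
    std::size_t m_window;
    std::size_t m_stride;
    std::size_t m_offset = 0;
};
```

The point of the sketch is that the only state tying the slider to old audio is its offset, so either resetting it or constructing a fresh one after each recording should give the same behaviour.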

There is a caching mechanism in the preprocessing that might be causing issues. You can disable it by setting preProcess.m_audioWindowIndex = 0 just before DoPreProcess is called. It will be a bit slower to run, but you can see if that is the issue. I don’t believe there are any other variables etc. that need to be changed or reset.

One other thing: are you sure that the audio data being written into the buffer from the mic is the correct recorded audio with the keywords in it? When we did the sample that works on live audio (ML-examples/tflu-kws-cortex-m/kws_cortex_m in the ARM-software/ML-examples repository on GitHub), that was one of the issues we faced.

Hi @Burton2000

You then take the next 2 seconds, overwrite the old buffer, but now don’t get any detections in this next 2 seconds of audio, is that correct?

Yes, this is the situation. We have tested both creating a new audioDataSlider and resetting it. The weird thing is that if we run the example with the 4 audio clips, it works. If we give our own voice from the microphone, only the first 2-second sample is recognized. Also, if the first 2-second sample is recognized, then all the following samples are recognized with the same keyword and probability. This suggests that something is cached.
We also tested disabling the cache and using a different buffer.
We have also verified that the samples coming from the microphone are valid: we printed them as hex, converted them to PCM and listened to the PCM samples. Any ideas?

-Teppo

Hi @Burton2000,

Some additional info on top of Teppo’s comment. Here is the pseudo code and our modifications to kws example:

/* KWS inference handler. */
bool ClassifyAudioHandler(ApplicationContext& ctx, uint32_t clipIndex, bool runAll)
{
    ...
    /* Loop to process audio clips. */
    do {
        hal_lcd_clear(COLOR_BLACK);

        /* Here we record audio into the audio_sample variable
         * (int16_t audio_sample[64000] IFM_BUF_ATTRIBUTE) and convert it to mono
         * (our I2S driver records stereo only at the moment), so after the
         * conversion only the first 32000 bytes from the beginning are valid. */

        uint32_t currentIndex = 0; // auto currentIndex = ctx.Get<uint32_t>("clipIndex");

        /* Creating a sliding window through the whole audio clip. */
        auto audioDataSlider = audio::SlidingWindow<const int16_t>(
            audio_sample, // get_audio_array(currentIndex),
            32000,        // get_audio_array_size(currentIndex),
            preProcess.m_audioDataWindowSize, preProcess.m_audioDataStride);
        ...
    } while (1); // (runAll && ctx.Get<uint32_t>("clipIndex") != initialClipIdx);
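
For reference, the stereo-to-mono step described in the comment above could look roughly like the sketch below. It assumes the I2S driver delivers interleaved frames (left, right, left, right, ...), which we have not shown here; StereoToMonoInPlace is a name made up for this illustration, not our actual conversion routine.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper: downmix interleaved stereo (L0, R0, L1, R1, ...) to
// mono in place by averaging each left/right pair. `numFrames` is the number
// of stereo frames in `buf`; the return value is the number of valid mono
// samples now stored at the start of `buf`.
std::size_t StereoToMonoInPlace(int16_t* buf, std::size_t numFrames)
{
    for (std::size_t i = 0; i < numFrames; ++i) {
        // Widen to 32 bits so the sum of two int16 values cannot overflow.
        const int32_t left = buf[2 * i];
        const int32_t right = buf[2 * i + 1];
        // Writing index i is safe in place: reads are always at 2*i or later.
        buf[i] = static_cast<int16_t>((left + right) / 2);
    }
    return numFrames;
}
```

With a 64000-element int16_t buffer this would be called with 32000 frames and leave 32000 mono samples (64000 bytes) at the start of the buffer. The counts here are in samples, not bytes, which matters when passing the size on to the sliding window.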

*Kimmo

Hi Richard, do you have access to any Alif dev kits? Just wondering if you have the possibility to reproduce the issue locally on Alif HW.

Regards,

Neil.