Continuous speech to text on the server with Cognitive Services

July 9, 2018
cognitive-services speech

Navigating Microsoft’s current offering of speech to text (S2T) services can get quite confusing. There are several services which seemingly do the same thing, and twice as many SDKs. Fortunately, the Cognitive Services team introduced the new Speech service, currently in Preview, which covers the traditional Bing Speech API, Custom Speech and Speech Translator under one umbrella.

This article demonstrates how to solve the challenge of continuous speech to text transcription on the server side.

Imagine that someone talks into a microphone for an hour and, instead of sending the audio stream directly to the Speech service, we first pass it through our API and then continuously process the results and send them further (to a translator, to a projector, anywhere…). Something like this:

Schema overview - audio goes to API first

In this article you will learn how to:

  1. Prepare the backend application.
  2. Send audio to the Speech service.
  3. Receive the speech to text response.
  4. Create and use a custom acoustic model.

Prerequisites

Speech service is part of Microsoft Cognitive Services. If you don’t have an Azure subscription, you can register for a free Cognitive Services key.

Get API Key for Speech services

This tutorial uses Visual Studio 2017 with ASP.NET and Azure workloads.

Visual Studio 2017 workloads

The Speech client library for .NET is currently only available for Windows and .NET Framework.

Prepare the backend application

Whether you choose ASP.NET or ASP.NET Core, it’s important to set up the project as a .NET Framework application, because the Speech SDK does not yet work on netcoreapp.

[UPDATE September 2018] Speech SDK is now available for .NET Core as well!

In the project I recently worked on, we used WebSockets (SignalR) to stream byte arrays from a client application. The client takes care of resampling the source audio into the correct format and chunking it into a series of byte arrays as it comes from the microphone. Each chunk is one SignalR message.

byte[] buffer = new byte[2048]; // using 2kB byte arrays
int bytesRead;
// cs is the source audio stream (already resampled into the required format, see below)
while ((bytesRead = cs.Read(buffer, 0, buffer.Length)) > 0)
{
    Console.WriteLine("Sending chunk");
    // connection is declared somewhere else like this:
    // HubConnection connection = new HubConnectionBuilder().WithUrl...
    await connection.InvokeAsync("ReceiveAudio", buffer); // sending the ReceiveAudio message which API is able to process
}

The following is the required audio format (from the documentation):

File Format: RIFF (WAV)
Sampling Rate: 8000 Hz or 16000 Hz
Channels: 1 (mono)
Sample Format: PCM, 16-bit integers
File Duration: 0.1 seconds < duration < 60 seconds
Silence Collar: > 0.1 seconds
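
The article leaves the client-side conversion out of scope. As a minimal sketch (not from the article), assuming the client uses the NAudio package on Windows, resampling an arbitrary audio file into the required 16 kHz / 16-bit / mono PCM format could look like this; any other resampler that produces the same format works equally well.

using NAudio.Wave;

public static class AudioConverter
{
    // Converts an arbitrary audio file into 16 kHz, 16-bit, mono PCM WAV,
    // matching the format table above. NAudio is an assumption here,
    // not something the article prescribes.
    public static void ToSpeechFormat(string inputPath, string outputPath)
    {
        var targetFormat = new WaveFormat(16000, 16, 1); // sample rate, bits, channels

        using (var reader = new AudioFileReader(inputPath))
        using (var resampler = new MediaFoundationResampler(reader, targetFormat))
        {
            WaveFileWriter.CreateWaveFile(outputPath, resampler);
        }
    }
}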

Prepare SpeechRecognizer

The most important class is the SpeechRecognizer from Microsoft.CognitiveServices.Speech NuGet package.

Install-Package Microsoft.CognitiveServices.Speech

Most of my code lived in the SignalR Hub. I’ve made my SpeechRecognizer and VoiceAudioStream static fields of the SignalR hub, because there will be only one client streaming at a time. (Otherwise I would investigate stream pooling etc.)

public class VoiceHub : Hub
{
    private static IConfiguration _config;
    private static SpeechRecognizer _speechClient;
    private static VoiceAudioStream _voiceAudioStream;

    public VoiceHub(IConfiguration configuration)
    {
        if (_config == null)
            _config = configuration;

        if (_speechClient == null)
        {
            var format = new AudioInputStreamFormat()
            {
                BitsPerSample = 16,
                BlockAlign = 2,
                AvgBytesPerSec = 32000,
                Channels = 1,
                FormatTag = 1,
                SamplesPerSec = 16000
            };

            _voiceAudioStream = new VoiceAudioStream(format); // custom AudioInputStream

            var factory = SpeechFactory.FromSubscription(
                _config.GetValue<string>("SpeechToTextApiKey"), 
                _config.GetValue<string>("SpeechToTextApiRegion")
            );
            _speechClient = factory.CreateSpeechRecognizerWithStream(_voiceAudioStream, "en-us");
            _speechClient.RecognitionErrorRaised += _speechClient_RecognitionErrorRaised;
            _speechClient.IntermediateResultReceived += _speechClient_IntermediateResultReceived;
            _speechClient.FinalResultReceived += _speechClient_FinalResultReceived;
        }
    }
    
    ...
}

There’s a SpeechFactory available which creates speech recognizers of various types - I was interested in SpeechRecognizerWithStream. The first parameter is a stream object of a custom type that inherits from AudioInputStream - more about that later.

You will need your Speech service subscription key and the region where your instance is hosted (both can be found in the Azure Portal; the free trial is hosted in the westus region).

Region and subscription keys

Use the lowercase, no-spaces name of your region (North Europe becomes northeurope, West US becomes westus, etc.).
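
The configuration source isn’t shown in the article. Assuming the values come from appsettings.json, the two keys read in the hub constructor above might be stored like this (the values are placeholders):

{
  "SpeechToTextApiKey": "<your Speech service subscription key>",
  "SpeechToTextApiRegion": "northeurope"
}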

Send audio to the Speech service

Here comes the tricky part. SpeechRecognizer uses a special type of stream: AudioInputStream. You cannot instantiate it (it’s an abstract class) and there’s no Write() method to override.

There’s a sample on GitHub which shows how to use AudioInputStream coming from a static file. Unfortunately, that’s not usable for continuous streaming.

A simple MemoryStream combined with BinaryStreamReader is not usable either, because it doesn’t block when there’s no more data to read. AudioInputStream expects Read() to return 0 only when the stream ends and no more data is expected.

The solution is to create a custom audio stream class that derives from AudioInputStream and wraps a blocking variant of MemoryStream.

public class VoiceAudioStream : AudioInputStream
{
    AudioInputStreamFormat _format;
    EchoStream _dataStream;

    public VoiceAudioStream(AudioInputStreamFormat format)
    {
        // Making the job slightly easier by requiring audio format in the constructor.
        // Cognitive Speech services expect:
        //  - PCM WAV
        //  - 16k samples/s
        //  - 32k bytes/s
        //  - 2 block align
        //  - 16 bits per sample
        //  - mono
        _format = format;
        _dataStream = new EchoStream();
    }

    public override AudioInputStreamFormat GetFormat()
    {
        return _format;
    }

    public override int Read(byte[] dataBuffer)
    {
        return _dataStream.Read(dataBuffer, 0, dataBuffer.Length);
    }

    public void Write(byte[] buffer, int offset, int count)
    {
        _dataStream.Write(buffer, offset, count);
    }

    public override void Close()
    {
        _dataStream.Dispose();
        base.Close();
    }
}

Making the MemoryStream blocking (that is, waiting until there’s new data to read instead of returning 0) was already solved here. So, borrowing the implementation:

using System;
using System.Collections.Concurrent;
using System.IO;
using System.Linq;
using System.Threading;

public class EchoStream : MemoryStream
{
    private readonly ManualResetEvent _DataReady = new ManualResetEvent(false);
    private readonly ConcurrentQueue<byte[]> _Buffers = new ConcurrentQueue<byte[]>();

    public bool DataAvailable { get { return !_Buffers.IsEmpty; } }

    public override void Write(byte[] buffer, int offset, int count)
    {
        _Buffers.Enqueue(buffer.Take(count).ToArray()); // add new data to buffer
        _DataReady.Set(); // allow waiting reader to proceed
    }

    public override int Read(byte[] buffer, int offset, int count)
    {
        _DataReady.WaitOne(); // block until there's something new to read

        byte[] lBuffer;

        if (!_Buffers.TryDequeue(out lBuffer)) // try to read
        {
            _DataReady.Reset();
            return -1;
        }

        if (!DataAvailable)
            _DataReady.Reset();

        // Note: this simplified reader ignores 'offset' and assumes the caller's buffer
        // is at least as large as each written chunk.
        Array.Copy(lBuffer, buffer, lBuffer.Length);
        return lBuffer.Length;
    }
}
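
As a quick illustration of the blocking behaviour (not part of the article), the following self-contained snippet starts a reader task that waits inside Read() until a writer thread supplies a chunk:

using System;
using System.Text;
using System.Threading;
using System.Threading.Tasks;

public static class EchoStreamDemo
{
    public static void Main()
    {
        var stream = new EchoStream();

        // The reader blocks inside Read() until the writer enqueues a chunk.
        var reader = Task.Run(() =>
        {
            var buffer = new byte[2048];
            int read = stream.Read(buffer, 0, buffer.Length);
            Console.WriteLine("Read " + read + " bytes: " + Encoding.UTF8.GetString(buffer, 0, read));
        });

        Thread.Sleep(500); // by now the reader is waiting on the internal event

        byte[] chunk = Encoding.UTF8.GetBytes("hello"); // stands in for an audio chunk
        stream.Write(chunk, 0, chunk.Length);           // unblocks the reader

        reader.Wait();
    }
}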

With these two implementations in place, I was ready to stream continuously.

public async Task AudioStart()
{
    // client is first expected to send the "AudioStart" message
    await _speechClient.StartContinuousRecognitionAsync();
}

public void ReceiveAudio(byte[] audioChunk)
{
    // client can then stream byte arrays as "ReceiveAudio" messages
    _voiceAudioStream.Write(audioChunk, 0, audioChunk.Length);
}

Each chunk of audio received by the SignalR method ReceiveAudio is written to the VoiceAudioStream, which is read by the SpeechRecognizer and sent to the Speech service.

Receive speech to text response

In my case I processed both the intermediate results and final results of transcription. Showing the changing transcription to the user improved their experience, because they were able to see that “something” was happening.

There’s nothing too special about the implementation. It’s just two event handlers:

private async void _speechClient_IntermediateResultReceived(object sender, SpeechRecognitionResultEventArgs e)
{
    if (e.Result.Text.Length > 10)
        await EnqueueTranscriptAsync(e.Result.Text, false); // do anything with the result here

    Debug.WriteLine("Intermediate result: " + e.Result.Text);
}

private async void _speechClient_FinalResultReceived(object sender, SpeechRecognitionResultEventArgs e)
{
    if (e.Result.RecognitionStatus == RecognitionStatus.Recognized)
    {
        await EnqueueTranscriptAsync(e.Result.Text, true); // do anything with the result here
        _correlationId = Guid.Empty;
    }

    Debug.WriteLine("Final result: " + e.Result.Text);
}
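
EnqueueTranscriptAsync itself is not shown in the article. A hypothetical sketch, assuming the transcripts are pushed back to connected clients over the same SignalR hub (the IHubContext field and the "IncomingTranscript" message name are my assumptions, not the article’s), could look like this:

// Hypothetical - not from the article. Assumes ASP.NET Core SignalR
// (Microsoft.AspNetCore.SignalR namespace), an IHubContext<VoiceHub> obtained
// from DI and kept in a static field, and clients subscribed to "IncomingTranscript".
private static IHubContext<VoiceHub> _hubContext;

private async Task EnqueueTranscriptAsync(string text, bool isFinal)
{
    // Push the partial or final transcript to every connected client.
    await _hubContext.Clients.All.SendAsync("IncomingTranscript", text, isFinal);
}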

When there’s no more speech to process, the client sends AudioEnd and SignalR hub stops recognition:

public async Task AudioEnd()
{
    await _speechClient.StopContinuousRecognitionAsync();
}
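
Putting the client side together, a sketch of the whole sequence (connect, AudioStart, stream chunks, AudioEnd) could look like the following. It assumes the Microsoft.AspNetCore.SignalR.Client package; the hub URL, the audio source stream and the "IncomingTranscript" message are placeholders, not something the article defines.

using System;
using System.IO;
using System.Threading.Tasks;
using Microsoft.AspNetCore.SignalR.Client;

public static class StreamingClient
{
    public static async Task StreamAsync(Stream audioSource)
    {
        var connection = new HubConnectionBuilder()
            .WithUrl("https://localhost:5001/voice") // assumed hub URL
            .Build();

        // Optional: listen for transcripts pushed back by the server.
        connection.On<string, bool>("IncomingTranscript",
            (text, isFinal) => Console.WriteLine((isFinal ? "FINAL: " : "... ") + text));

        await connection.StartAsync();
        await connection.InvokeAsync("AudioStart");    // server starts continuous recognition

        var buffer = new byte[2048];
        int bytesRead;
        while ((bytesRead = audioSource.Read(buffer, 0, buffer.Length)) > 0)
        {
            await connection.InvokeAsync("ReceiveAudio", buffer); // one chunk per message
        }

        await connection.InvokeAsync("AudioEnd");      // server stops recognition
        await connection.StopAsync();
    }
}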

Create custom acoustic model

The out-of-the-box speech model is great for native English speakers who talk in a noise-free environment, don’t make many mistakes and don’t use special terminology.

My case was a conference transcription with experts in their field who were not native English speakers. Thankfully, the Speech service offers customization in the form of a custom acoustic model and a custom language model.

Creating the customized model is straightforward (just follow the documentation), with one caveat!

If the Speech service was created through the Azure Portal in a region other than West US, it’s necessary to use the CRIS portal at the URL specific to that region. In my case that was https://production-neu.cris.ai (North Europe).

Overview of the necessary steps:

  1. Prepare voice samples (WAV files) in the right format (PCM, 16-bit, mono, 16 kHz etc.) - one or two sentences per sample.

  2. ZIP those files - all in root, no subfolders.

  3. Prepare transcripts of those samples (TXT file, \t delimited) - watch out if your text editor converts TABs to spaces.

filename.wav(tab)Transcript

  4. Sign in to https://production-neu.cris.ai (if your Speech service is in North Europe).

  5. Go to Subscriptions and link your Speech service key to the CRIS portal (it’s the same key that is used in the app).

  6. Go to Adaptation Data.

Custom Speech - Adaptation Data menu

  7. Import a new acoustic dataset.

Custom Speech - import acoustic dataset

  8. Fill in the form and click Import. It’s good to give the data a meaningful name and description to keep track of the various data sets you may use.

Custom Speech - import acoustic data form

  9. Wait for it to process.

  10. Go to Acoustic Models and click Create New.

  11. Fill in the form again and select the acoustic dataset added earlier.

  12. Wait for the processing to finish (it will take a few minutes, depending on the size of the dataset).

Use custom acoustic model

The best thing about the new Speech service is that there’s no code change needed in order to use the custom acoustic model. Just create a new endpoint and use the same key.

  1. Go to Endpoints and click Create New.
  2. Pick the right Subscription and Acoustic Model.
  3. You will get a list of endpoints, your key and other information used to call the model.

Next steps

Prepare your client application, connect it to a microphone and do something smart with the transcripts ;)

Also, check out the Speech Translation service which does live translations of spoken word (in several languages).
