Continuous speech to text on the server with Cognitive Services
cognitive-services speech

January 4, 2019

Navigating Microsoft's current offering of speech to text (S2T) services can get quite confusing. There are several services which seemingly do the same thing, and twice as many SDKs. Fortunately, the Cognitive Services team introduced the new Speech service, which covers the traditional Bing Speech API, Custom Speech and Speech Translator under one umbrella.

[This post was massively updated in January 2019 to reflect significant improvements in the service which happened by the end of 2018.]

Imagine that someone talks to a microphone for an hour and, instead of sending the audio stream directly to the Speech service, we first pass it through our API and then continuously process the results and send them further (to a translator, to a projector, anywhere...). Something like this:

Schema overview - audio goes to API first

The same approach can be used for live captioning on the web. This is my sample prototype application:

Web application prototype

Let's see how to solve the challenge of continuous speech to text transcription on the server side.

In this article you will learn how to:

  1. prepare an ASP.NET Core backend that accepts audio over SignalR,
  2. stream that audio to the Speech service for continuous recognition,
  3. send intermediate and final transcripts back to the client,
  4. train and use a custom acoustic model.

Prerequisites

The Speech service is part of Microsoft Cognitive Services. If you don't have an Azure subscription, you can register for a free Cognitive Services key.

Get API Key for Speech services

This tutorial uses Visual Studio 2017 with ASP.NET and Azure workloads.

Visual Studio 2017 workloads

Prepare the backend application

The Speech SDK is now available for .NET Core so it doesn't matter if you choose ASP.NET or ASP.NET Core.

In the project I have worked on, we used WebSockets (SignalR) to stream byte arrays from a client application. As the source audio comes from a microphone, the client handles resampling into the correct format and then chunks it into a series of byte arrays. Each chunk is one SignalR message.

byte[] buffer = new byte[2048]; // using 2kB byte arrays
int bytesRead;
// cs is the source audio stream (e.g. a FileStream over a WAV file).
while ((bytesRead = cs.Read(buffer, 0, buffer.Length)) > 0)
{
    Console.WriteLine("Sending chunk");
    // connection is declared somewhere else like this:
    // HubConnection connection = new HubConnectionBuilder().WithUrl...
    // Send only the bytes actually read - the last chunk can be shorter than the buffer.
    var chunk = new byte[bytesRead];
    Array.Copy(buffer, chunk, bytesRead);
    await connection.InvokeAsync("ReceiveAudio", chunk); // sending the ReceiveAudio message which the API is able to process
}

Following is the required audio format (from the documentation):

Property        Value
File Format     RIFF (WAV)
Sampling Rate   8000 Hz or 16000 Hz
Channels        1 (mono)
Sample Format   PCM, 16-bit integers
File Duration   0.1 seconds < duration < 60 seconds
Silence Collar  > 0.1 seconds

This GitHub comment specifies the format in more detail.

If you want to test with a file, you can use ffmpeg to convert it to the right format:

ffmpeg -i "<source>.mp3" -acodec pcm_s16le -ac 1 -ar 16000 "<output>.wav"

Server-side recognition

You can find full source code on GitHub.

The most important class is the SpeechRecognizer from Microsoft.CognitiveServices.Speech NuGet package.

Install-Package Microsoft.CognitiveServices.Speech

Most of my code lives in the SignalR Hub. To support multiple connected clients (each with its own audio source and recognition), I use a static Dictionary of connections.

Connection.cs

public class Connection
{
    public string SessionId;
    public SpeechRecognizer SpeechClient;
    public VoiceAudioStream AudioStream;
}

Each client's SpeechRecognizer and VoiceAudioStream are kept in its Connection object. The hub's fields are static because SignalR creates a new hub instance for every call, so the dictionary of connections (along with the configuration and hub context) has to be shared across those instances.

public class VoiceHub : Hub
{
    private static IConfiguration _config;
    private static IHubContext<VoiceHub> _hubContext;
    private static Dictionary<string, Connection> _connections;

    public VoiceHub(IConfiguration configuration, IHubContext<VoiceHub> ctx)
    {
        if (_config == null)
            _config = configuration;

        if (_connections == null)
            _connections = new Dictionary<string, Connection>();

        if (_hubContext == null)
            _hubContext = ctx;
    }

    ...
}

Every connection is initialized with the first WebSocket message (AudioStart) and added to the dictionary of connections.

public async Task AudioStart(byte[] args)
{
    Debug.WriteLine($"Connection {Context.ConnectionId} starting audio.");

    // Client configuration information comes in the message as a byte array.
    var str = System.Text.Encoding.ASCII.GetString(args);
    var keys = JObject.Parse(str);

    // We will continuously write incoming audio to this stream.
    var audioStream = new VoiceAudioStream();

    // Initialize with the format required by the Speech service
    var audioFormat = AudioStreamFormat.GetWaveFormatPCM(16000, 16, 1);
    // Configure speech SDK to work with the audio stream in right format.
    // Alternatively this can be a direct microphone input.
    var audioConfig = AudioConfig.FromStreamInput(audioStream, audioFormat);
    // Get credentials and region from client's message.
    var speechConfig = SpeechConfig.FromSubscription(keys["speechKey"].Value<string>(), keys["speechRegion"].Value<string>());
    // Endpoint needs to be set separately.
    speechConfig.EndpointId = keys["speechEndpoint"].Value<string>();

    // With all the info at hand, finally create the SpeechRecognizer instance.
    var speechClient = new SpeechRecognizer(speechConfig, audioConfig);

    // Main events to handle.
    // - Recognized = recognition of a block of text finished
    // - Recognizing = intermediate results
    // - Canceled = something went wrong, such as more requests than the allowed concurrency
    speechClient.Recognized += _speechClient_Recognized;
    speechClient.Recognizing += _speechClient_Recognizing;
    speechClient.Canceled += _speechClient_Canceled;

    // With the recognizer created, we have access to the speech session ID.
    // It will be used later to notify the right client.
    string sessionId = speechClient.Properties.GetProperty(PropertyId.Speech_SessionId);

    var conn = new Connection()
    {
        SessionId = sessionId,
        AudioStream = audioStream,
        SpeechClient = speechClient,
    };

    // Connections are indexed by the SignalR connection ID.
    _connections.Add(Context.ConnectionId, conn);

    // Finally start recognition on this particular client.
    await speechClient.StartContinuousRecognitionAsync();

    Debug.WriteLine("Audio start message.");
}

Comments are inline. The key is to properly initialize the SpeechRecognizer to work with the VoiceAudioStream (a custom class, which we will get to in the next section).

This code expects the client to provide information about the service. You will need your Speech service subscription key and the region where your instance is hosted (both can be found in the Azure Portal; the free trial is hosted in the westus region).

Region and subscription keys

Use the lowercase, no-spaces name of your region (North Europe becomes northeurope, West US becomes westus, etc.).
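For reference, the client serializes these three values into a small JSON object and sends it as bytes in the AudioStart message; the keys match exactly what the hub reads above. The values below are placeholders, and the full client code is shown later in this article.

// Payload of the AudioStart message; keys match what the hub reads above.
const startMessage = JSON.stringify({
    speechKey: "<your Speech service subscription key>",
    speechRegion: "northeurope",                         // lowercase, no spaces
    speechEndpoint: "<endpoint ID of your custom model>" // from the CRIS portal, see below
});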

Send audio to the Speech service

Here comes the tricky part. SpeechRecognizer uses a special type of stream: PullAudioInputStreamCallback. This stream pulls data from a buffer every time the service is ready to process more. If there's no data available, it waits. If no more data is expected, the stream should return 0. You cannot instantiate this class directly (it's abstract) and there's no Write() method to override.

There's a sample on GitHub which shows how to use PullAudioInputStreamCallback with a static file. That's not usable for continuous streaming.

A simple MemoryStream combined with a BinaryReader is not usable either, because it doesn't block when there's no more data to read. PullAudioInputStreamCallback expects Read() to return 0 only when the stream ends and no more data is expected.

The solution is to create a custom audio stream class that derives from PullAudioInputStreamCallback, adds a Write() method and uses a blocking variant of MemoryStream underneath.

VoiceAudioStream.cs

public class VoiceAudioStream : PullAudioInputStreamCallback
{
    private readonly EchoStream _dataStream = new EchoStream();
    private ManualResetEvent _waitForEmptyDataStream = null;

    public override int Read(byte[] dataBuffer, uint size)
    {
        if (_waitForEmptyDataStream != null && !_dataStream.DataAvailable)
        {
            _waitForEmptyDataStream.Set();
            return 0;
        }

        return _dataStream.Read(dataBuffer, 0, dataBuffer.Length);
    }

    public void Write(byte[] buffer, int offset, int count)
    {
        _dataStream.Write(buffer, offset, count);
    }

    public override void Close()
    {
        if (_dataStream.DataAvailable)
        {
            // Wait until the reader drains the remaining data.
            _waitForEmptyDataStream = new ManualResetEvent(false);
            _waitForEmptyDataStream.WaitOne();
        }

        // The event is only created when there was data left to drain.
        _waitForEmptyDataStream?.Close();
        _dataStream.Dispose();
        base.Close();
    }
}

Making the MemoryStream blocking (that is, waiting until there's new data to read instead of returning 0) was already solved here, so I borrowed the implementation:

EchoStream.cs

public class EchoStream : MemoryStream
{
    private readonly ManualResetEvent _DataReady = new ManualResetEvent(false);
    private readonly ConcurrentQueue<byte[]> _Buffers = new ConcurrentQueue<byte[]>();

    public bool DataAvailable { get { return !_Buffers.IsEmpty; } }

    public override void Write(byte[] buffer, int offset, int count)
    {
        _Buffers.Enqueue(buffer.Take(count).ToArray()); // add new data to buffer
        _DataReady.Set(); // allow waiting reader to proceed
    }

    public override int Read(byte[] buffer, int offset, int count)
    {
        _DataReady.WaitOne(); // block until there's something new to read

        byte[] lBuffer;

        if (!_Buffers.TryDequeue(out lBuffer)) // try to read
        {
            _DataReady.Reset();
            return -1;
        }

        if (!DataAvailable)
            _DataReady.Reset();

        Array.Copy(lBuffer, buffer, lBuffer.Length);
        return lBuffer.Length;
    }
}

With these two implementations in place, streaming can begin (with the AudioStart message described above). The client streams audio as byte arrays in ReceiveAudio SignalR messages.

public void ReceiveAudio(byte[] audioChunk)
{
    // Audio chunk received. Write it to the appropriate audio stream.
    // This helps separate different streaming clients.
    _connections[Context.ConnectionId].AudioStream.Write(audioChunk, 0, audioChunk.Length);
}

Each chunk of audio received by the SignalR method ReceiveAudio is written to the VoiceAudioStream, which the SpeechRecognizer reads and sends to the Speech service.

Receive speech to text response

In my case I processed both the intermediate results and final results of transcription. Showing the changing transcription to the user improved their experience, because they were able to see that "something" was happening.

There's nothing too special about the implementation. It's just three event handlers:

private void _speechClient_Canceled(object sender, SpeechRecognitionCanceledEventArgs e)
{
    // Would be nice to actually notify the client..... ;)
    Debug.WriteLine("Recognition was cancelled.");
}

// Intermediate results - may differ from the final result, because it's getting more precise with each chunk of processed audio.
private async void _speechClient_Recognizing(object sender, SpeechRecognitionEventArgs e)
{
    Debug.WriteLine($"{e.SessionId} > Intermediate result: {e.Result.Text}");
    await SendTranscript(e.Result.Text, e.SessionId);
}

// Final result for given speech segment.
private async void _speechClient_Recognized(object sender, SpeechRecognitionEventArgs e)
{
    Debug.WriteLine($"{e.SessionId} > Final result: {e.Result.Text}");
    await SendTranscript(e.Result.Text, e.SessionId);
}

At this point it's not surprising that the SendTranscript() method produces another SignalR message, this time for the client:

public async Task SendTranscript(string text, string sessionId)
{
    var connection = _connections.Where(c => c.Value.SessionId == sessionId).FirstOrDefault();
    await _hubContext.Clients.Client(connection.Key).SendAsync("IncomingTranscript", text);
}

SignalR hub stops recognition when the client disconnects.

public async override Task OnDisconnectedAsync(Exception exception)
{
    var connection = _connections[Context.ConnectionId];
    await connection.SpeechClient.StopContinuousRecognitionAsync();
    connection.SpeechClient.Dispose();
    connection.AudioStream.Dispose();
    _connections.Remove(Context.ConnectionId);

    await base.OnDisconnectedAsync(exception);
}

Custom acoustic model

The out-of-the-box speech model is great for native English speakers who talk in a noise-free environment, don't make many mistakes and don't use special terminology.

My case was a conference transcription with experts in their field who were not native English speakers. Thankfully, the Speech service offers customization in the form of a custom acoustic model and a custom language model.

Create custom model

Creating the customized model is straightforward (just follow the documentation), with one caveat!

If the Speech service is created through the Azure Portal in a region different from West US, it's necessary to use the CRIS portal at the URL specific to this region. In my case that was https://northeurope.cris.ai (North Europe).

Overview of the necessary steps:

  1. Prepare voice samples (WAV files) in the right format (PCM, 16-bit, mono, 16 kHz etc.) - one or two sentences per sample.

  2. ZIP those files - all in root, no subfolders.

  3. Prepare transcripts of those samples (a TXT file, tab-delimited) - watch out if your text editor converts TABs to spaces.

    filename.wav(tab)Transcript

  4. Sign in to https://northeurope.cris.ai (if your Speech service is in North Europe).

  5. Go to Subscriptions and link your Speech service key to the CRIS portal (it's the same key that is used in the app).

  6. Go to Adaptation Data.

    Custom Speech - Adaptation Data menu

  7. Import new acoustic dataset.

    Custom Speech - import acoustic dataset

  8. Fill the form and click Import. It's good to give the data a meaningful name and description to keep track of various data sets you may use.

    Custom Speech - import acoustic data form

  9. Wait for it to process.

  10. Go to Acoustic Models and click Create New.

  11. Fill the form again and select the acoustic dataset added earlier.

  12. Wait for the processing to finish (it will take a few minutes, depending on the size of the dataset).

Alternatively, you can use the unofficial Speech CLI.

Use custom acoustic model

The best thing about the new Speech service is that there's no code change needed in order to use the custom acoustic model. Just create a new endpoint and use the same key.

  1. Go to Endpoints and click Create New.
  2. Pick the right Subscription and Acoustic Model.
  3. You will get a list of endpoints, your key and other information used to call the model.
  4. Grab the endpoint ID, key and region and use them in your client application.

Endpoint ID, key and region on the CRIS portal

Client application

Because the Speech service communication is handled by the server, the client application can be implemented in any way, as long as it supports SignalR. I used a .NET Core console app for testing during development and a Vue.js web frontend for the final demo app.

Web frontend prototype with video player and settings

The client has to take care of three things:

  1. capture input audio,
  2. convert it to the right format,
  3. send it to the API.

My web frontend uses @aspnet/signalr and @aspnet/signalr-protocol-msgpack. Most of the heavy lifting is done in the Transcriptor class:

Transcriptor.ts

constructor(voiceHubUrl: string, videoElement: HTMLVideoElement) {
    this._voiceHub = new signalR.HubConnectionBuilder()
        .withUrl(voiceHubUrl)
        .withHubProtocol(new signalRmsgpack.MessagePackHubProtocol())
        .configureLogging(signalR.LogLevel.Information)
        .build();

    // Handle incoming transcript as a SignalR message.
    this._voiceHub.on("IncomingTranscript", (message) => {
        console.log("Got message: " + message);
        this.transcriptReadyHandler(message);
    });
    
    this._videoElement = videoElement;
}

The first message when starting the stream must be AudioStart, immediately followed by the WAV/RIFF header as the first ReceiveAudio message.

async startTranscript(speechKey: string, speechRegion: string, speechEndpoint: string) {
    if (this._voiceHub.state != signalR.HubConnectionState.Connected)
        await this._voiceHub.start();
    
    var startMessage = JSON.stringify({speechKey: speechKey, speechRegion: speechRegion, speechEndpoint: speechEndpoint});

    // Opened connection is set to work with byte arrays, so everything has to be converted to a series of bytes. Even JSON objects.
    let buf: ArrayBuffer = new ArrayBuffer(startMessage.length);
    let bufView: Uint8Array = new Uint8Array(buf);

    for (var i = 0; i < startMessage.length; ++i) {
        var code = startMessage.charCodeAt(i);
        bufView[i] = code;
    }
    
    // First message, initialize recognizer on the server.
    this._voiceHub.send("AudioStart", bufView);
    // Immediately followed by the WAV header.
    this._voiceHub.send("ReceiveAudio", new Uint8Array(this.createStreamRiffHeader()));
    // And then actual data.
    this.startStreaming();
}
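The createStreamRiffHeader() call above builds the 44-byte WAV/RIFF header. Its actual implementation isn't shown here; the following is only a minimal sketch of what such a helper could look like for 16 kHz, 16-bit, mono PCM, with the RIFF size fields left at 0 because the length of a live stream isn't known up front.

// Hypothetical sketch of a WAV/RIFF header builder for a live 16 kHz, 16-bit, mono PCM stream.
private createStreamRiffHeader(): ArrayBuffer {
    const sampleRate = 16000;
    const channels = 1;
    const bitsPerSample = 16;
    const blockAlign = channels * bitsPerSample / 8;

    const header = new ArrayBuffer(44);
    const view = new DataView(header);
    const writeText = (offset: number, text: string) => {
        for (let i = 0; i < text.length; i++)
            view.setUint8(offset + i, text.charCodeAt(i));
    };

    writeText(0, "RIFF");
    view.setUint32(4, 0, true);                        // overall size: unknown for a live stream
    writeText(8, "WAVE");
    writeText(12, "fmt ");
    view.setUint32(16, 16, true);                      // fmt chunk size
    view.setUint16(20, 1, true);                       // audio format: PCM
    view.setUint16(22, channels, true);
    view.setUint32(24, sampleRate, true);
    view.setUint32(28, sampleRate * blockAlign, true); // byte rate
    view.setUint16(32, blockAlign, true);
    view.setUint16(34, bitsPerSample, true);
    writeText(36, "data");
    view.setUint32(40, 0, true);                       // data size: unknown for a live stream
    return header;
}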

In this case I'm capturing the audio source from an HTML5 <video> element.

private startStreaming() {
    if (this.context == null)            
        this.context = new AudioContext();
    else
        this.context.resume();
    
    if (this.mediaSource == null)
        this.mediaSource = this.context.createMediaElementSource(this._videoElement);
    
    this.filter = this.context.createBiquadFilter();
    this.filter.type = "lowpass";
    this.filter.frequency.setValueAtTime(8000, this.context.currentTime);

    // This will be a step in the audio processing pipeline. It will send the data to API.
    this.jsScriptNode = this.context.createScriptProcessor(4096, 1, 1);
    this.jsScriptNode.onaudioprocess = 
                (processingEvent: any) => 
                    this.processAudio(processingEvent); // passing as arrow to preserve context of 'this'

    // First a lowpass filter.
    if (this.mediaSource.mediaSource == null)
        this.mediaSource.connect(this.filter);
    // Then send to the service.
    this.filter.connect(this.jsScriptNode);
    // Finally send to the default output.
    this.jsScriptNode.connect(this.context.destination);        
}

Audio passes through a processing pipeline, and one of the steps sends the audio data to our API:

private processAudio(audioProcessingEvent: any) {
    var inputBuffer = audioProcessingEvent.inputBuffer;
    // The output buffer contains the samples that will be modified and played
    var outputBuffer = audioProcessingEvent.outputBuffer;

    var isampleRate = inputBuffer.sampleRate;
    var osampleRate = 16000;
    var inputData = inputBuffer.getChannelData(0);
    var outputData = outputBuffer.getChannelData(0);
    var output = this.downsampleArray(isampleRate, osampleRate, inputData);

    for (var i = 0; i < outputBuffer.length; i++) {
        outputData[i] = inputData[i];
    }

    // SignalR message.
    this._voiceHub.send("ReceiveAudio", new Uint8Array(output.buffer)).catch(this.handleError);
}
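The downsampleArray() helper isn't shown above either. A minimal sketch (my assumption, not the repository's actual implementation) could decimate the Float32 samples down to 16 kHz and convert them to the 16-bit PCM integers the Speech service expects; the lowpass filter applied earlier in the pipeline keeps simple decimation from aliasing too badly.

// Hypothetical sketch: decimate from the AudioContext sample rate to 16 kHz
// and convert Float32 samples in [-1, 1] to 16-bit signed PCM.
private downsampleArray(inputRate: number, outputRate: number, input: Float32Array): Int16Array {
    const ratio = inputRate / outputRate;
    const length = Math.floor(input.length / ratio);
    const output = new Int16Array(length);

    for (let i = 0; i < length; i++) {
        // Nearest-sample decimation; a production resampler would interpolate.
        const sample = Math.max(-1, Math.min(1, input[Math.floor(i * ratio)]));
        output[i] = sample < 0 ? sample * 0x8000 : sample * 0x7FFF;
    }

    return output;
}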

Next steps

Prepare your client application, connect it to a microphone and do something smart with the transcripts ;)
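If your source is a microphone instead of a <video> element, the standard Web Audio API can provide an equivalent source node. This is a minimal sketch (my assumption, not part of the sample repository); the rest of the pipeline (lowpass filter, ScriptProcessor, SignalR) stays the same.

// Hypothetical sketch: capture the microphone and create a source node for the existing pipeline.
async function createMicrophoneSource(context: AudioContext): Promise<MediaStreamAudioSourceNode> {
    // Ask the browser for microphone access (requires HTTPS and user consent).
    const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
    // This node can be connected to the lowpass filter in place of the <video> element source.
    return context.createMediaStreamSource(stream);
}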

Also, check out the Speech Translation service, which does live translation of the spoken word (in several languages).

Full source code is available on GitHub.

Found something inaccurate or plain wrong? Was this content helpful to you? Let me know!

šŸ“§ codez@deedx.cz