← All posts

Improving Speech To Text With AI

There are a few techniques for improving AI speech-to-text transcription without relying on throwing more hardware at the problem (although that always helps).

One of the features of NeoSCAN is to send transmissions to a speech-to-text service in order to keep a transcript of the day's radio traffic. This is accomplished by running Whisper AI on a local machine with a simple API wrapper that accepts the audio file from NeoSCAN and returns the transcribed text.

Two benefits of using Whisper AI are that you don't have to incur an ongoing charge for an AI subscription, and Whisper has several models of different size that can be used based on the hardware available.

When looking at a speech-to-text solution, there are two major factors to consider.

  • Accuracy of the model being used. How well does the model transcribe radio-quality audio into text?
  • Context. Radio transmissions in a given area will include the names of towns, streets, businesses, and other landmarks. Also, first responders have their own jargon that adds context and nuance to the actual words.

For accuracy, we have some control over the situation. We can run a bigger model, or use an external service. The limiting factor is often cost over capability, and the models get better every day.

Context is a different story. Even the best models are trained on generic data sets. A strict audio-to-text transcription doesn't know what town you live in, what radio transmissions you are monitoring, and what shorthand your local responders use. Some context can be fed into Whisper when starting the transcription. For example, telling it "this is a first responder radio transmission" improves things somewhat.

A technique to gain additional accuracy is to run the transcribed text through a text-based LLM like ChatGPT or Claude and have it correct the text using contextual clues like geography and domain. For example, "this is a police transmission from Anytown, USA" or "the Anytown fire department has four vehicles named Car 1, Truck 3, Engine 2, and Medic 1".

Here is an example using real data from NeoSCAN:

Actual Text (human transcribed):

328, 328. Can you respond to the area of number 9, Wayside Road? Caller says she lives in the neighborhood. She believes that's the house she's speaking of, but she states there's multiple cars parked in the driveway that's starting to block the sidewalk. She'd like to have someone speak to them.

Whisper (running locally, Medium model):

I'll give you 328. 328. Can you respond to the area of number 9, Wayside Road? Carla says she lives in the neighborhood. She believes that's the house she's speaking of, but she says multiple cars parked in the driveway that's starting to block the sidewalk. She doesn't want to speak to them.

Whisper (running locally, Large model):

328. 328. Can you respond to the area of number 9 Wayside Road? Carla states she lives in the neighborhood. She believes that's the house she's speaking of, but she states there's multiple cars parked in the driveway that's starting to block the sidewalk. She'd like to have someone speak to them.

The Large Whisper does a better job of getting the exact wording correct. It also doesn't hallucinate words at the beginning of the transmission. However, both models think Carla is calling the cops.

The next example takes the text from the Whisper Large model and feeds it through Anthropic AI using the Haiku model with a prompt that includes details about the neighboring towns, specific police codes used locally, and similar context.

Whisper (running locally, Large model, corrected by Claude Haiku):

328. 328. Can you respond to the area of number 9, Wayside Road? Caller says she lives in the neighborhood. She believes that's the house she's speaking of, but she states that multiple cars parked in the driveway that's starting to block the sidewalk. She like to have someone speak to them.

This is very accurate. Some of the improvements are likely caused by Whisper doing a better job on the original audio file.

Running a local model like Whisper isn't always an option, so I have done some tests using AssemblyAI. AssemblyAI is an affordable cloud-based service that works very similarly to Whisper AI. (As an example of the affordability - I was able to convert 100 hours of audio to text using the free tier.)

Here's a transcription of the same audio file using just AssemblyAI:

Dispatch, 10-4, 10-7, 10-8, 10-9, Wastside Road. Carla states she lives in the neighborhood. She believes that's the house she's speaking of, but she states there's multiple cars parked in the driveway that's starting to block the sidewalk. She'd like to have someone speak to them.

Not great. But when combined with a second pass through Claude Haiku, we get much better results:

328, 328. Can you respond to the area of number 9, Wayside Road? Caller says she lives in the neighborhood. She believes that's the house she's speaking of, but she states there's multiple cars parked in the driveway that's starting to block the sidewalk. She'd like to have someone speak to them.

Again, some of the improvement must have come from getting a better "run" through AssemblyAI before Haiku saw the transcribed text. However, the two-model method does consistently lead to better results.

How do I know? I've been working on the real solution - creating a custom Whisper model that has the geographical and domain context built in. To create this model I captured 100 hours of real scanner traffic from the channels I monitor. I used all the techniques above to get as good a transcription as I could get before doing the tedious task of reviewing all 100 hours of audio and providing a human-verified transcription.

I have generated about 10 hours of verified text so far. Having the pre-transcribed text has been a time saver. I wrote a small utility program that lets me pick the most accurate transcription as a starting point, so I only need to make small edits for most transmissions. The plan is to transcribe the first 25 hours, then build the custom model and do some testing to make sure the effort is worth the payoff.

I'll have more details on how to train a custom Whisper AI model in a future post.