Improving Speech To Text With AI
Thursday, June 4th 2026
There are a few techniques to improving AI speech to text translation without relying on throwing more hardware at the problem (although that always helps).
One of the features of NeoSCAN is to send transmissions to a speech-to-text service in order to keep a transcript of the day's radio traffic. This is accomplished by running Whisper AI on a local machine with a simple API wrapper that accepts the audio file from NeoSCAN and returns the translated text.
Two benefits of using Whisper AI is that you don't have incur an ongoing charge for an AI subscription, and Whisper has several models of different size that can be used based on the hardware available.
When looking at a speech-to-text solution, there are two major factors to consider.
- Accuracy of the model being used. How good is the model at taking radio-quality audio and accurately transcibing the audio into text.
- Context. Radio tranmissions in a given area will include the names of towns, streets, businesses and other landmarks. Also, first responders have their own jargon that adds context and nuance to the actual words.
For accuracy, we have some control over the situation. We can run a bigger model, or use an external service. The limiting factor is often cost over capability, and the models get better every day.
Context is a different story. Even the best models are training on generic data sets. A strict audio to text translation doesn't know what town you live it, what radio transmissions you are monitoring, and what shorthand your local responders use. Some context can be fed into Whisper when starting the translation. For example, "this is a first responder radio transmission" that improves things somewhat.
A technique to gain additional accuracy is to run the translated text through a text-based LLM like ChatGPT or Claude and have it correct the text using contextual clues like geography and domain. For example, "this is a police transmission from Anytown, USA" or "the Anytown fire department has four vehicles named Car 1, Truck 3, Engine 2 and Medic 1".
Here is an example using real data from NeoSCAN:
Actual Text (human translated):
328, 328. Can you respond to the area of number 9, Wayside Road? Caller says she lives in the neighborhood. She believes that's the house she's speaking of, but she states there's multiple cars parked in the driveway that's starting to block the sidewalk. She'd like to have someone speak to them.
Whisper (running locally, Medium model):
I'll give you 328. 328. Can you respond to the area of number 9, Wayside Road? Carla says she lives in the neighborhood. She believes that's the house she's speaking of, but she says multiple cars parked in the driveway that's starting to block the sidewalk. She doesn't want to speak to them.
Whisper (running locally, Large model):
328. 328. Can you respond to the area of number 9 Wayside Road? Carla states she lives in the neighborhood. She believes that's the house she's speaking of, but she states there's multiple cars parked in the driveway that's starting to block the sidewalk. She'd like to have someone speak to them.
The Large Whisper does a better job of getting the exact wording correct. It also doesn't hallucinate words at the beginning of the transmission. However, both models think Carla is calling the cops.
The next example takes the text from the Whisper Large model and feeds it through Anthropic AI using the Haiku model with a prompt that includes details about the neighboring towns, specific police codes used locally, and similar context.
Whisper (running locally, Large model, corrected by Claude Haiku):
328. 328. Can you respond to the area of number 9, Wayside Road? Caller says she lives in the neighborhood. She believes that's the house she's speaking of, but she states that multiple cars parked in the driveway that's starting to block the sidewalk. She like to have someone speak to them.
This is very accurate. Some of the improvements are likely caused by Whisper doing a better job on the original audio file.
Running a local model like Whisper isn't always an option, so I have done some tests using Assembly AI. Assembly AI is an affordable cloud-based service that works very similarly to Whisper AI. (As an example of the affordability - I was able to convert 100 hours of audio to text using the free tier.)
Here's a translation of the same audio file using just Assembly AI:
Dispatch, 10-4, 10-7, 10-8, 10-9, Wastside Road. Carla states she lives in the neighborhood. She believes that's the house she's speaking of, but she states there's multiple cars parked in the driveway that's starting to block the sidewalk. She'd like to have someone speak to them.
Not great. But when combined with a second pass through Claude Haiku, we get much better results:
328, 328. Can you respond to the area of number 9, Wayside Road? Caller says she lives in the neighborhood. She believes that's the house she's speaking of, but she states there's multiple cars parked in the driveway that's starting to block the sidewalk. She'd like to have someone speak to them.
Again, some of the improvement must have come from getting a better "run" through Assembly AI before Haiku saw the translated text. However, the two-model method does consistently lead to better results.
How do I know? I've been on working on the real solution - creating a custom Whisper model that has the geographical and domain context built in. To create this model I captured 100 hours of real scanner traffic from the channels I monitor. I used all the techniques above to get as good a translation as I could get before doing the tedious task of reviewing all 100 hours of audio and providing a human-verified translation.
I have generated about 10 hours of verified text so far. Having the pre-translated text has been a time saver. I wrote a small utilty program that lets me pick the most accurate translation as a starting point, so I only need to make small edits for most transmissions. The plan is to translate the first 25 hours, then build the custom model and do some testing to make sure the effort is worth the payoff.
I'll have more details on how to train a custom Whisper AI model in a future post.