Story Time
My designer friend started vibe coding a WhisprFlow/Granola alternative and shared the DMG with me to test. I have had a bad experience with WhisprFlow in the past as I don't speak very clearly; for example, "testing" gets transcribed as "kes king". I still decided to give it a try since it also shipped Nvidia's Parakeet v2 model, which I hadn't tried yet, and to my disappointment it was not that great either.
This made me wonder: every app is moving towards voice integration (I recently saw the same in Claude Code), and it's obviously faster to speak than to type, so I thought of figuring out a way to get this working for me. The answer: record a bunch of audio, fine-tune Whisper, and voila, it would be amazing. The reality: not so straightforward.
Unsloth - The Fine Tuning King?
I have seen a lot of talks by Daniel from Unsloth, and in the videos it always looks quite easy: add data, tune parameters, run on Colab, and boom, you have a fine-tuned model.
Googling a bit, I found this detailed guide [https://unsloth.ai/docs/basics/text-to-speech-tts-fine-tuning] and thought of giving it a try. The title says TTS, but it has a Whisper STT guide as well. Unsloth, please fix; I skipped over it twice just because of the name.
The Dataset
I went back and forth with Claude and ChatGPT to find a good, balanced dataset that could capture the problems in my speech patterns, and found the Harvard Sentences[https://www.cs.columbia.edu/~hgs/audio/harvard.html]. They consist of 72 lists of 10 sentences each, so 720 sentences in total, where each sentence takes ~2.8 seconds to read, i.e. ~33 minutes of audio overall. The LLMs were of the opinion that this is too little for fine-tuning, but my theory was that if humans can understand me 90-95% of the time, then ~30 minutes should be a decent start.
Curation
Opened Claude Code and asked it to build me a recording tool. I gave it the Harvard sentences as input and it made me a beautiful web UI which shows one sentence at a time and lets me record it. It also post-processed the files into WAV at 16 kHz sampling, as required by Whisper.
The whole process took roughly 2 hours from building the tool to recording all 720 lines. Once done, I pushed the dataset to Hugging Face and started following the guide.
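The 16 kHz conversion step can be sketched like this; a naive linear-interpolation resampler, purely for illustration (a real pipeline would use librosa or torchaudio for proper anti-aliased resampling):

```python
# Naive resampler to Whisper's expected 16 kHz rate. Illustrative only;
# use librosa.resample or torchaudio for production-quality resampling.
import numpy as np

def resample_to_16k(samples: np.ndarray, src_rate: int, dst_rate: int = 16000) -> np.ndarray:
    duration = len(samples) / src_rate
    n_out = int(duration * dst_rate)
    x_old = np.linspace(0.0, duration, num=len(samples), endpoint=False)
    x_new = np.linspace(0.0, duration, num=n_out, endpoint=False)
    return np.interp(x_new, x_old, samples)  # linear interpolation between samples
```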
Model to fine tune
Initially I had two options, Whisper and Parakeet, but the guide was for Whisper so I discarded Parakeet almost instantly. Whisper comes in multiple sizes: tiny, base, small, medium, large.
I went with large to start with.
Technique
The initial idea was quite simple: pass the training data through the model and it will learn the patterns. But the guide mentioned PEFT LoRA, which I had no clue about, so I did some more research and was amazed by the idea of LoRA. You don't have to fine-tune the whole 1.6GB model; you just attach an adapter and train the adapter. Fast, efficient, works on Colab. I was sold.
Whisper-v3-large -> Training Data -> LoRA -> Adapter -> Fine-Tuned Model -> CoreML Model
Technical Details
All ASR (Automatic Speech Recognition) models are measured by something called WER (Word Error Rate); v3-turbo has a WER of ~9% on English. I ran my Harvard sentences dataset through the base model and it scored 27% WER, a 3x degradation on my voice samples. Now you can see why all STT tools are unusable for me.
Moving ahead, a few terms to understand:
- Learning Rate - How fast the model's weights are updated. A higher rate learns faster but is more prone to instability; a lower rate learns more slowly but more reliably.
- Rank - The LoRA rank is the rank of the low-rank matrices added to the transformer layers. A higher rank means more trainable parameters, more memory, and more capacity; a lower rank means the opposite.
- Num of Epochs - The number of passes the model makes through the training data. More epochs mean more learning but a higher risk of overfitting; fewer epochs, the opposite.
- Training Loss - Measures how well the model fits the training data. Lower is better.
- Validation Loss - Measures how well the model performs on held-out test data. Lower is better.
Gist - reduce training loss, validation loss, and WER with each epoch.
Training
1st run - max_steps=60, lr=1e-4, 1 epoch
WER dropped instantly from 27% to 15% with just 60 steps and then plateaued. The run took about 5 minutes and I could already see a big improvement, which got me really interested.
2nd run - num_train_epochs=10, lr=5e-5
WER dropped to 9%, but validation loss kept climbing, which means the model was overfitting.
I played around with Claude on these numbers for some time and then hit this run:
num_train_epochs=5, lr=1e-5
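For reference, that configuration sketched as `Seq2SeqTrainingArguments`; the batch size, output directory, and eval cadence here are my assumptions for illustration, not the exact config:

```python
# Illustrative training arguments for the final run. Only epochs and
# learning rate come from the run above; the rest are assumptions.
from transformers import Seq2SeqTrainingArguments

args = Seq2SeqTrainingArguments(
    output_dir="./whisper-finetune",
    num_train_epochs=5,
    learning_rate=1e-5,
    per_device_train_batch_size=8,
    eval_strategy="steps",
    eval_steps=50,               # evaluate every 50 steps, matching the table
    logging_steps=50,
    predict_with_generate=True,  # actually decode during eval so WER can be computed
)
```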
| Step | Training Loss | Validation Loss | WER (%) |
|---|---|---|---|
| 50 | 2.8517 | 2.3652 | 27.4744 |
| 100 | 2.3737 | 2.1784 | 24.0614 |
| 150 | 2.1636 | 1.9048 | 20.4778 |
| 200 | 1.4450 | 1.4194 | 20.1365 |
| 250 | 0.9385 | 1.0925 | 16.8941 |
| 300 | 0.9135 | 0.9767 | 15.1877 |
| 350 | 0.8882 | 0.9181 | 14.3344 |
| 400 | 0.7462 | 0.8889 | 13.8225 |
| 450 | 0.7540 | 0.8767 | 13.8225 |
| 500 | 0.6900 | 0.8647 | 13.1399 |
| 550 | 0.8393 | 0.8448 | 12.9692 |
| 600 | 0.4855 | 0.8388 | 12.1160 |
| 650 | 0.4973 | 0.8514 | 10.7508 |
| 700 | 0.2550 | 0.8629 | 10.7508 |
| 750 | 0.2673 | 0.8616 | 10.4095 |
| 800 | 0.2513 | 0.8600 | 10.7508 |
Good convergence, stable loss reduction. I was quite happy here and decided to export this model to CoreML for real-life testing.
Note - Apple Silicon chips have a dedicated Neural Engine for ML tasks, and it needs models in the CoreML format to run on it.
Command -
```
whisperkit-generate-model \
  --model-version "vivekkairi/whisper-large-v3-turbo-finetuned" \
  --output-dir ./whisperkit_v3_output
```
The conversion took ~70 minutes on my M1 Pro Mac. I loaded the model into my friend's app and started to see a lot of improvement in output quality, which was really nice, but it was slow. Each inference took 3-5 seconds. The problem was the model size: it's a 1.5B-parameter model, quite big for an old MacBook to run.
A little research later, I landed on the Turbo model, which is derived from the large model and is much smaller (~800M parameters) without much performance degradation.
Training Again
Same setup, just changed the model name from whisper-v3-large to whisper-v3-turbo, and the WER started off at 53%. Quite surprising, but I let it run for a while.
The run ended with WER struggling to drop below 50%; fine-tuning was degrading performance quite a lot. I did a lot of back and forth on the config, but nothing helped much here.
Claude's conclusion - Turbo is a distilled model: it was compressed to be fast at standard English by removing the internal complexity that makes fine-tuning flexible. For standard accent adaptation it works fine, but speech sound disorders require deep acoustic-to-phoneme remapping that needs the full model's capacity.
Whisper Turbo Architecture
Whisper has two components, an encoder and a decoder. The encoder converts audio into tokens and the decoder converts tokens into text. That is a simplified explanation of how Whisper works, but if you look at v3-turbo, it not only has fewer layers overall; the reduction is not symmetric across encoder and decoder.
I was fine-tuning both encoder and decoder weights, but did I need to? A quick chat with Claude suggested the decoder can technically stay the same, as it just converts tokens to text; the main work happens in the encoder converting audio to tokens. And because the decoder is so small, just 4 layers, any attempt to fine-tune it was pushing the model to the 53% WER we saw earlier.
More Struggle
Unsloth started showing multiple errors:
- Problems with max_seq_length
- Zero trainable parameters
Not sure why, but it struggled a lot with the turbo model, so I decided to jump to raw transformers + PEFT. Unsloth helps with memory and training speed, but turbo is an 800M model and easily fits on a Colab T4 instance, so that shouldn't be a problem.
I made the changes in raw transformers, configured it to freeze the decoder weights and train only the encoder, and here's the run:
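The decoder freeze itself is just a couple of loops over parameters. A sketch with a toy Whisper config to stay runnable offline; the same loops apply to the real checkpoint loaded via `from_pretrained`:

```python
# Freeze the decoder (and the tied output projection) so only the
# encoder adapts to my voice. Toy config for illustration only.
from transformers import WhisperConfig, WhisperForConditionalGeneration

model = WhisperForConditionalGeneration(
    WhisperConfig(d_model=64, encoder_layers=4, decoder_layers=2,
                  encoder_attention_heads=4, decoder_attention_heads=4)
)
for param in model.model.decoder.parameters():
    param.requires_grad = False          # decoder stays exactly as shipped
for param in model.proj_out.parameters():
    param.requires_grad = False          # output head is tied to decoder embeddings

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"training {trainable}/{total} parameters")
```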
| Step | Training Loss | Validation Loss | WER (%) |
|---|---|---|---|
| 100 | 2.7737 | 2.9873 | 26.11 |
| 200 | 0.8632 | 0.9783 | 11.43 |
| 300 | 0.3165 | 0.4686 | 8.70 |
| 400 | 0.1252 | 0.2542 | 7.85 |
| 500 | 0.1243 | 0.2194 | 6.83 |
| 600 | 0.1175 | 0.2066 | 7.00 |
| 700 | 0.0320 | 0.1954 | 6.83 |
| 800 | 0.0437 | 0.1939 | 6.66 |
| 900 | 0.0198 | 0.1945 | 6.48 |
| 1000 | 0.0487 | 0.1961 | 6.83 |
| 1100 | 0.2725 | 0.1959 | 6.48 |
| 1200 | 0.0220 | 0.1943 | 6.66 |
The WER steadily dropped from 26% to 6.5%.
Insights
With just 720 sentences, the drop was quite impressive. Now I need more focused examples on the particular phonetics I struggle with to drop the score even further; maybe we will see a part 2 of this.
Overall it turned out simpler than I thought, and quite fast.