Transcribing speech into text has never been more accessible, thanks to models like OpenAI’s Whisper. In this post, we’ll walk through how to use the MBS Xojo Plugins in Xojo to transcribe audio files using Whisper.
We’ll go step by step, from loading the required libraries to getting the final transcription.
Step 1: Load the Whisper Dynamic Library
Before using Whisper with our WhisperMBS module, we need to load the appropriate dynamic library (.dylib) that matches your Whisper version.
// 1. Load the dylib for the Whisper version you have.
Var appFile As FolderItem = App.ExecutableFile
Var macOSFolder As FolderItem = appFile.Parent
Var appContents As FolderItem = macOSFolder.Parent
Var Frameworks As FolderItem = appContents.Child("Frameworks")
Var LibFile As FolderItem = Frameworks.Child("libwhisper.1.7.4.dylib")
Ensure the library exists before continuing:
If Not LibFile.Exists Then
  Log LibFile.Name + " file missing?"
  Quit
ElseIf WhisperMBS.LoadLibrary(LibFile) Then
  'Log "Okay"
Else
  Log "Failed to load library: " + WhisperMBS.LoadErrorMessage
  Quit
End If
Step 2: Load the libsndfile Library
We also need a library to read audio files like WAV or FLAC. The libsndfile.dylib is ideal for this and can be downloaded from our Libs page.
// 2. Load sndfile library
// https://www.monkeybreadsoftware.de/xojo/download/plugin/Libs/
LibFile = Frameworks.Child("libsndfile.dylib")
If SoundFileMBS.LoadLibrary(LibFile) Then
  'Log "Okay"
Else
  Log "Failed to load library: " + SoundFileMBS.LoadErrorMessage
  Quit
End If
Step 3: Load the Audio File
You can now open your audio file using SoundFileMBS. Make sure to adjust the file path accordingly.
// 3. Load audio file.
// Please change path!
Var f As FolderItem = GetFolderItem("ep306_16kHz_16bit.wav")
// SoundFileMBS is in our Tools plugin
Var s As SoundFileMBS = SoundFileMBS.Open(f)
If s = Nil Then
  Log "Failed to open sound."
  Quit
End If
Read the audio frames into memory:
Var info As SoundFileInfoMBS = s.Info
System.DebugLog Str(info.Frames)+" frames, "+Str(info.SampleRate)+" Hz."
// 4 bytes per Single sample; this buffer size assumes a mono file
Var samples As New MemoryBlock(info.Frames * 4)
Var SamplesCount As Integer = s.ReadSingleFrames(samples, info.Frames)
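The buffer above holds one Single per frame, which fits a mono file. Whisper works on a single channel of float samples, so a stereo recording would have to be mixed down first. Here is a minimal sketch of that step, assuming the samples are interleaved (left, right, left, right, ...) as libsndfile delivers them. DownmixToMono is a hypothetical helper and not part of the plugin, and the channel count is assumed to come from the Channels property of SoundFileInfoMBS.

```xojo
// Hypothetical helper: average interleaved Single samples down to one channel.
// interleaved: frames * channels samples, 4 bytes each
Function DownmixToMono(interleaved As MemoryBlock, frames As Integer, channels As Integer) As MemoryBlock
  If channels <= 1 Then
    Return interleaved // already mono, nothing to do
  End If
  
  Var mono As New MemoryBlock(frames * 4)
  For frame As Integer = 0 To frames - 1
    Var sum As Double = 0.0
    For ch As Integer = 0 To channels - 1
      // byte offset of this channel's sample within the frame
      sum = sum + interleaved.SingleValue((frame * channels + ch) * 4)
    Next
    mono.SingleValue(frame * 4) = sum / channels
  Next
  Return mono
End Function
```

For a multi-channel file, the read buffer would then need to be allocated with info.Frames * info.Channels * 4 bytes before the helper is called.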
Step 4: Resample Audio (If Needed)
Whisper expects audio sampled at 16 kHz. If your file uses a different rate, you'll need to resample it. For this we picked the speexdsp library, which we installed via Homebrew. It could also be bundled with the application, like libsndfile above.
// 4. If needed, convert the audio to 16000 Hz, as that is what Whisper expects
If info.SampleRate <> 16000 Then
  // we need 16 kHz
  Const LibPath = "/opt/homebrew/Cellar/speexdsp/1.2.1/lib/libspeexdsp.1.dylib"
  
  // SpeexResamplerState *speex_resampler_init(spx_uint32_t nb_channels, spx_uint32_t in_rate, spx_uint32_t out_rate, int quality, int *err);
  Soft Declare Function speex_resampler_init Lib LibPath (Channels As UInt32, InRate As UInt32, OutRate As UInt32, Quality As Int32, ByRef error As Int32) As Ptr
  
  Var inputRate As Integer = info.SampleRate
  Var outputRate As Integer = 16000
  Var InputLength As UInt32 = SamplesCount
  Var OutputLength As UInt32 = InputLength * outputRate / inputRate
  Var output As New MemoryBlock(OutputLength * 4)
  
  Const SPEEX_RESAMPLER_QUALITY_BEST = 10
  Var error As Int32
  Var resampler As Ptr = speex_resampler_init(1, inputRate, outputRate, SPEEX_RESAMPLER_QUALITY_BEST, error)
  If error <> 0 Then
    Log "Speex resampler init failed"
    Break
    Return
  End If
  
  // int speex_resampler_process_float(SpeexResamplerState *st, spx_uint32_t channel_index, const float *in, spx_uint32_t *in_len, float *out, spx_uint32_t *out_len);
  Soft Declare Function speex_resampler_process_float Lib LibPath (resampler As Ptr, ChannelIndex As UInt32, Input As Ptr, ByRef InLen As UInt32, Output As Ptr, ByRef OutLen As UInt32) As Int32
  
  Var SamplesPtr As Ptr = samples
  Var outputPtr As Ptr = output
  Call speex_resampler_process_float(resampler, 0, SamplesPtr, InputLength, outputPtr, OutputLength)
  
  // void speex_resampler_destroy(SpeexResamplerState *st);
  Soft Declare Sub speex_resampler_destroy Lib LibPath (resampler As Ptr)
  speex_resampler_destroy(resampler)
  
  // now use resampled data
  samples = output
  SamplesCount = OutputLength
End If
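The code above discards the result of speex_resampler_process_float. The function returns 0 on success, and it writes the number of samples it actually consumed and produced back into the two ByRef length parameters. A minimal sketch of a more defensive call follows. It is a fragment meant to stand in for the Call line inside the If block, so it assumes the declares and variables from that block are in scope. requestedInput is a name introduced here for illustration.

```xojo
// Assumption: resampler, SamplesPtr, outputPtr, InputLength and OutputLength
// are the variables from the resampling block, and the declare is in scope.
Var requestedInput As UInt32 = InputLength

Var rc As Int32 = speex_resampler_process_float(resampler, 0, SamplesPtr, InputLength, outputPtr, OutputLength)

If rc <> 0 Then
  // 0 is RESAMPLER_ERR_SUCCESS in speexdsp
  System.DebugLog "Speex resampling failed with code " + rc.ToString
ElseIf InputLength < requestedInput Then
  // InputLength now holds the number of samples actually consumed
  System.DebugLog "Only " + InputLength.ToString + " of " + requestedInput.ToString + " input samples were consumed"
End If

// OutputLength now holds the number of samples actually written,
// which can be smaller than the size requested for the buffer.
```

Because OutputLength is updated by the call, assigning it to SamplesCount afterwards passes Whisper only the samples that were really produced.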
Step 5: Run Whisper and Transcribe
With audio data loaded and resampled (if needed), we can now use Whisper to transcribe the content.
// 5. Use Whisper to convert audio to text
// now convert
System.DebugLog WhisperMBS.LangMaxID.ToString+" languages"
Var Resources As FolderItem = appContents.Child("Resources")
Var ModelFile As FolderItem = Resources.Child("ggml-base.en.bin") // you may need to change this to point to your file
Set up context and parameters:
Var cparams As New WhisperContextParamsMBS
Var wparams As New WhisperFullParamsMBS(WhisperFullParamsMBS.SamplingStrategyGreedy)
wparams.TdrzEnable = True
Var context As New WhisperContextMBS(ModelFile, cparams)
Var samplesPtr As Ptr = samples
Var e As Integer = context.full(wparams, samplesPtr, SamplesCount)
Check for errors and extract the segments:
If e <> 0 Then
  Log "Failed to process audio. Error: " + e.ToString
  Quit
Else
  Log "Audio processed. Result code: " + e.ToString
End If
Var segments As Integer = context.FullSegments
Var lines() As String
Var objects() As Variant // keeps the token objects collected in the loop
lines.Add "Text: "
Loop over segments and collect texts
We loop over the segments. For each segment we query:
- the text of the segment
- the token objects for the segment
- the token data objects for the segment, which carry more details
- the token texts for the segment
- whether the speaker changed
- each token and its token data individually, by looping over the tokens
At the end, we collect the segment texts and display them.
For SegmentIndex As Integer = 0 To segments - 1
  Var tokenCount As Integer = context.FullTokens(SegmentIndex)
  Var tokens() As WhisperTokenMBS = context.FullGetTokens(SegmentIndex)
  Var tokenDatas() As WhisperTokenDataMBS = context.FullGetTokenDatas(SegmentIndex)
  Var tokenTexts() As String = context.FullGetTokenTexts(SegmentIndex)
  Var speakerHasTurned As Boolean = context.FullGetSegmentSpeakerTurnNext(SegmentIndex)
  System.DebugLog "speakerHasTurned: " + speakerHasTurned.ToString
  
  Var Text As String = context.FullSegmentText(SegmentIndex)
  
  For TokenIndex As Integer = 0 To tokenCount - 1
    Var token As WhisperTokenMBS = context.FullGetToken(SegmentIndex, TokenIndex)
    Var tokenData As WhisperTokenDataMBS = context.FullGetTokenData(SegmentIndex, TokenIndex)
    objects.Add token
    objects.Add tokenData
  Next
  
  System.DebugLog Text
  lines.Add Text
Next
MessageBox "Finished: "+Join(lines, EndOfLine)
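The loop above only logs the speaker-turn flag. One possible use for it is to break the transcript into paragraphs whenever the speaker changes. The sketch below assumes you collect the segment texts and the flags into two parallel arrays inside the loop. FormatTranscript and its parameter names are hypothetical and not part of the plugin.

```xojo
// Hypothetical helper: join segment texts and insert a blank line
// after every segment that is followed by a change of speaker.
Function FormatTranscript(texts() As String, speakerTurns() As Boolean) As String
  Var parts() As String
  For i As Integer = 0 To texts.LastIndex
    parts.Add texts(i).Trim
    If i <= speakerTurns.LastIndex Then
      If speakerTurns(i) Then
        parts.Add "" // the empty entry becomes a blank line in the output
      End If
    End If
  Next
  Return String.FromArray(parts, EndOfLine)
End Function
```

Inside the segment loop you would add Text to the first array and speakerHasTurned to the second, then pass both to the helper once the loop has finished. Note that whisper.cpp only reports speaker turns for models trained for tinydiarize, so with other models the flag may simply stay False.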
Please try it and see whether you can make use of the Whisper library to transcribe audio within Xojo.