Using llama.cpp in Xojo
Running a local LLM directly from Xojo is now easier than ever thanks to the MBS Xojo Plugins and their integration with llama.cpp. With just a few lines of code, you can load a model, create a context, and generate text—all on-device, with optional GPU acceleration.
In this article, we’ll walk through the basics of setting up llama.cpp with Xojo and the MBS Plugin, and then examine a complete example that loads a model and asks it simple questions.
What Is llama.cpp?
llama.cpp is a high-performance C/C++ implementation for running LLaMA-family language models locally, optimized for CPUs and GPUs (Metal, CUDA, etc.). It is lightweight, fast, and ideal for on-device inference with small to medium-sized models.
The MBS Xojo Plugins provide a direct bridge between Xojo and llama.cpp, exposing model loading, context creation, sampling, and inference capabilities through the LlamaMBS, LlamaModelMBS, LlamaContextMBS, and related classes.
Requirements
To follow along, you will need:
- Xojo 2006r4 or newer
- Latest MBS Xojo Tools Plugin with llama.cpp support
- A compiled llama.cpp library:
  - libllama.dylib on macOS
  - libllama.dll on Windows
  - libllama.so on Linux
- A GGUF model file (.gguf format)
For many platforms, you can find prebuilt downloads on the llama.cpp releases page.
Installation with Homebrew
On macOS you can install Homebrew from its website. Then you can use brew to install the llama.cpp package:
brew install llama.cpp
This places libllama.dylib inside your Homebrew Cellar, e.g. at this path:
/opt/homebrew/Cellar/llama.cpp/6710/lib/libllama.dylib
If you have a newer version, the versioned path will differ, but luckily you can use the stable symlink in Homebrew's lib folder instead:
/opt/homebrew/lib/libllama.dylib
Step 1 — Loading the llama.cpp Library
Before interacting with any model, you must load the llama.cpp dynamic library:
On macOS, please pass the full path to the dylib. On Linux you may pass just the file name if the package manager installed it in a standard location; otherwise pass the full path. On Windows you pass the name of the DLL. You may want to use the SetDllDirectoryMBS function to set the folder containing the DLL, so Windows can find all the related DLL files.
If the path is wrong or dependencies are missing, you'll get a detailed error message from LoadErrorMessage. On Windows you may see error 193 if the architecture of the DLL doesn't match the application, or error 126 if the path to the DLL is invalid or some dependency is not found.
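A minimal sketch of this step might look like the following. LoadErrorMessage is described above; the LoadLibrary shared method name is an assumption based on common MBS plugin conventions, so please check the plugin reference for the exact call:

```xojo
' Load the llama.cpp dynamic library before using any other class.
' LoadLibrary is an assumed method name for illustration.
Dim libPath As String = "/opt/homebrew/lib/libllama.dylib" ' macOS example
If Not LlamaMBS.LoadLibrary(libPath) Then
  ' LoadErrorMessage reports why loading failed (bad path, missing dependency, ...)
  MessageBox "Could not load llama.cpp: " + LlamaMBS.LoadErrorMessage
  Return
End If
```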
Step 2 — Initialize the Backend
llama.cpp supports multiple compute backends: CPU, Metal (macOS/iOS), CUDA (Nvidia GPUs), ROCm/HIP (AMD GPUs), or Vulkan.
The MBS plugin can load all available ones:
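For illustration, the call could be as short as the line below. The method name LoadAllBackends is an assumption; consult the LlamaMBS documentation for the actual name in your plugin version:

```xojo
' Register all compute backends compiled into the library
' (CPU, Metal, CUDA, ROCm/HIP, Vulkan).
' LoadAllBackends is an assumed method name for illustration.
LlamaMBS.LoadAllBackends
```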
This ensures GPU-accelerated layers are enabled if available.
Step 3 — Load the Model
You specify the path to your .gguf model file and configure parameters such as the number of GPU layers:
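A sketch of this step follows. The LlamaModelParametersMBS class and its n_gpu_layers property are assumptions modeled on llama.cpp's own llama_model_params structure, and the constructor signature is likewise assumed:

```xojo
' Hypothetical parameters class mirroring llama.cpp's llama_model_params.
Dim mparams As New LlamaModelParametersMBS
mparams.n_gpu_layers = 35 ' set to 0 for CPU-only inference

' Load the GGUF model file (constructor signature assumed).
Dim model As New LlamaModelMBS("/path/to/model.gguf", mparams)
```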
If your GPU supports it, offloading 20–100 layers can dramatically speed up inference. Otherwise, just set it to 0 for CPU-only execution.
Step 4 — Create the Context
The context manages the state of a conversation and the token buffer.
Each context is independent, so you can have multiple simultaneous sessions with the same model.
You may set properties of the LlamaContextParametersMBS class before constructing the context, to configure its parameters. For example, n_ctx defines the context size.
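Creating a context might look like this. LlamaContextParametersMBS and n_ctx are named above; the constructor signature binding the context to the model is an assumption:

```xojo
' Configure the context before creating it; n_ctx sets the
' maximum token window for the conversation.
Dim cparams As New LlamaContextParametersMBS
cparams.n_ctx = 4096

' Create an independent session on the loaded model (signature assumed).
Dim context As New LlamaContextMBS(model, cparams)
```

Because each context is independent, you could create several of these objects against the same LlamaModelMBS instance to run parallel conversations.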
Step 5 — Set Up a Sampler
In llama.cpp, samplers determine how tokens are selected.
For simple deterministic output, we can use a Greedy sampler:
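For example, a greedy sampler could be created like this; the Greedy factory method name is an assumption based on the sampler names llama.cpp itself uses:

```xojo
' Greedy sampling always picks the highest-probability token,
' so the same prompt yields the same output every time.
' The Greedy factory method name is assumed for illustration.
Dim sampler As LlamaSamplerMBS = LlamaSamplerMBS.Greedy
```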
You could also add temperature sampling, top-p sampling, or multiple samplers chained together.
Step 6 — Ask the Model a Question
Once everything is initialized, generating text is as simple as:
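A sketch of the call, assuming a context and sampler from the previous steps. Whether Ask takes the sampler as a parameter or it is configured beforehand may differ per plugin version; this signature is an assumption:

```xojo
' Ask feeds the prompt through the model and returns the completion.
' The exact signature of Ask is assumed here.
Dim answer As String = context.Ask("What is the capital of France?", sampler)
MessageBox answer
```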
Each call feeds your prompt into the model, runs inference, and returns the generated completion as a string. The output depends on the settings applied above and on what the model was trained on.
You may also use the LlamaSamplerMBS class to implement the work of the Ask method yourself. We include that as an alternative in the example project.
Complete sample code
Here is the complete, ready-to-run example:
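Since the original listing did not survive in this copy, here is a hedged reconstruction of what the complete example might look like, combining the steps above. Beyond the class names, the Ask method, LoadErrorMessage, and n_ctx mentioned in the text, every signature here is an assumption; the actual ready-to-run project ships with the plugin download:

```xojo
' Complete sketch: load library, init backends, load model,
' create context, sample greedily, and ask a question.
' Method and constructor signatures are assumptions for illustration.

' 1. Load the llama.cpp dynamic library.
If Not LlamaMBS.LoadLibrary("/opt/homebrew/lib/libllama.dylib") Then
  MessageBox "Could not load llama.cpp: " + LlamaMBS.LoadErrorMessage
  Return
End If

' 2. Initialize all available compute backends.
LlamaMBS.LoadAllBackends

' 3. Load the model, offloading layers to the GPU if possible.
Dim mparams As New LlamaModelParametersMBS
mparams.n_gpu_layers = 35 ' 0 for CPU-only
Dim model As New LlamaModelMBS("/path/to/model.gguf", mparams)

' 4. Create a context with a 4096-token window.
Dim cparams As New LlamaContextParametersMBS
cparams.n_ctx = 4096
Dim context As New LlamaContextMBS(model, cparams)

' 5. Use a greedy sampler for deterministic output.
Dim sampler As LlamaSamplerMBS = LlamaSamplerMBS.Greedy

' 6. Ask a question and show the answer.
Dim answer As String = context.Ask("What is the capital of France?", sampler)
MessageBox answer
```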
Conclusion
With only a handful of API calls, the MBS Xojo Plugins let you load llama.cpp models, run inference, and build fully local AI features directly into your Xojo applications. Whether you're building chatbots, reasoning tools, or creative assistants, this integration gives you full control and zero cloud dependency.
Please give it a try and let us know how well it works for you.
Example projects: Llama.zip