# How to Run LLMs Locally Without Ollama
mullama is a drop-in Ollama replacement with native bindings for Python, Node.js, Go, PHP, Rust, and C/C++. Install, embed in-process, and run an OpenAI-compatible server — no daemon required.
## The Short Answer
mullama is a drop-in Ollama replacement that ships native language bindings for Python, Node.js, Go, PHP, Rust, and C/C++. Same CLI, same Modelfile format, same model registry — but you can embed inference directly in your application without running a daemon or making HTTP calls.
If you have used Ollama before, you already know how to use mullama. If you have not, you will be productive in about five minutes.
## Why Look Beyond Ollama
Ollama is excellent for what it does: pull a model, run it, expose an API. But it has architectural constraints that surface quickly in production:
- No native bindings. Every interaction goes through HTTP. That means serialization overhead, latency, and a mandatory background daemon even when your app is the only consumer.
- No in-process embedding. You cannot link a model into your application. Every inference call is an IPC round-trip to the Ollama server.
- No Anthropic API compatibility. If your codebase targets the Anthropic SDK, you need an adapter layer or a different tool.
- Limited GPU backend coverage. Ollama ships CUDA, Metal, and (on Linux) ROCm builds. If you need Vulkan, OpenCL, or SYCL, options are limited.
mullama addresses all of these. It wraps llama.cpp with a clean multi-language binding layer and exposes both OpenAI and Anthropic-compatible API endpoints. You get the same convenience as Ollama with the flexibility to embed inference anywhere.
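To make the per-call overhead concrete, here is a small, illustrative Python sketch (not mullama code) that measures only the JSON serialization and deserialization an HTTP API forces onto every request, work that an in-process call skips entirely:

```python
import json
import time

# An OpenAI-style chat payload: the kind of object every HTTP round-trip
# must encode on the way out and parse on the way back.
payload = {
    "model": "llama3.2:1b",
    "messages": [{"role": "user", "content": "What is a monad?" * 50}],
    "stream": False,
}

start = time.perf_counter()
for _ in range(10_000):
    wire = json.dumps(payload)   # client encodes the request
    decoded = json.loads(wire)   # server decodes it again
elapsed = time.perf_counter() - start

assert decoded == payload        # lossless, but not free
print(f"10,000 encode/decode cycles: {elapsed:.3f}s")
```

Serialization is only part of the cost; a real round-trip adds socket I/O and scheduler latency on top.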
## Installation
mullama is available through the standard package manager for every supported language, plus a universal curl installer.
### Language-specific installs

```sh
# Python
pip install mullama

# Node.js
npm install mullama

# Rust
cargo add mullama

# PHP
composer require skelf-research/mullama

# Go
go get github.com/skelf-research/mullama-go
```
### Universal install (gets you the CLI + server)

```sh
curl -fsSL https://cognisoc.com/mullama/install.sh | sh
```

Verify the installation:

```sh
mullama --version
```
### GPU backend selection

mullama auto-detects your GPU at build time. To force a specific backend:

```sh
# CUDA (NVIDIA)
MULLAMA_GPU=cuda pip install mullama

# ROCm (AMD)
MULLAMA_GPU=rocm pip install mullama

# Vulkan (cross-platform)
MULLAMA_GPU=vulkan pip install mullama

# Metal (macOS) — auto-detected, no flag needed
pip install mullama
```
All seven backends — CUDA, Metal, ROCm, OpenCL, Vulkan, SYCL, and RPC — are supported. RPC enables distributed inference across multiple machines.
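In install scripts, you can pick the `MULLAMA_GPU` value programmatically by probing for vendor tooling. A minimal sketch; the probe commands are standard vendor utilities, and returning `None` (install without the flag, letting auto-detection decide) is our own convention:

```python
import platform
import shutil

def choose_backend(which=shutil.which, system=platform.system):
    """Pick a MULLAMA_GPU value by probing for vendor tooling.

    `which` and `system` are injectable so the logic is testable.
    Returns None when no GPU tooling is found, meaning: omit the
    flag and let the build auto-detect.
    """
    if system() == "Darwin":
        return "metal"           # auto-detected on macOS; shown for completeness
    if which("nvidia-smi"):
        return "cuda"
    if which("rocminfo"):
        return "rocm"
    if which("vulkaninfo"):
        return "vulkan"
    return None

print(choose_backend())
```

A wrapper script could then run `MULLAMA_GPU=$backend pip install mullama` only when a backend was detected.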
## CLI Quick Start
If you have used Ollama, these commands are identical:
```sh
# Pull a model from the registry
mullama pull llama3.2:1b

# Run interactively
mullama run llama3.2:1b

# Run with a prompt (non-interactive)
mullama run llama3.2:1b "Explain the CAP theorem in three sentences."

# List downloaded models
mullama list

# Show model details
mullama show llama3.2:1b

# Remove a model
mullama rm llama3.2:1b
```
### Load a GGUF file directly

```sh
mullama run ./mistral-7b-instruct-v0.3.Q4_K_M.gguf "Summarize this."
```
### Use a Modelfile

```
FROM llama3.2:1b
PARAMETER temperature 0.3
PARAMETER num_ctx 8192
SYSTEM "You are a concise technical writer."
```

```sh
mullama create my-writer -f Modelfile
mullama run my-writer "Write a commit message for a refactor of the auth module."
```
Modelfile syntax is compatible with Ollama. Existing Modelfiles work without changes.
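If you build tooling around these files, the directives used above are easy to inspect programmatically. A minimal sketch of a Modelfile reader; this is a hypothetical helper, not part of mullama, and it handles only the `FROM` / `PARAMETER` / `SYSTEM` subset shown here:

```python
def parse_modelfile(text):
    """Parse the FROM / PARAMETER / SYSTEM subset of Modelfile syntax."""
    spec = {"from": None, "parameters": {}, "system": None}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and comments
        directive, _, rest = line.partition(" ")
        directive = directive.upper()
        if directive == "FROM":
            spec["from"] = rest.strip()
        elif directive == "PARAMETER":
            key, _, value = rest.strip().partition(" ")
            spec["parameters"][key] = value.strip()
        elif directive == "SYSTEM":
            spec["system"] = rest.strip().strip('"')
    return spec

spec = parse_modelfile(
    'FROM llama3.2:1b\n'
    'PARAMETER temperature 0.3\n'
    'PARAMETER num_ctx 8192\n'
    'SYSTEM "You are a concise technical writer."'
)
print(spec)
```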
## The Key Differentiator: Using mullama as a Library
This is where mullama diverges from Ollama entirely. Instead of running a separate server and making HTTP calls, you load the model in-process and call it like any other function. No daemon, no serialization, no network overhead.
### Python

```python
from mullama import Model, Context

# Load with GPU offloading
model = Model.load("llama3.2-1b.gguf", n_gpu_layers=32)
ctx = Context(model, n_ctx=4096)

# Single-shot generation
response = ctx.generate("What is a monad?")
print(response)

# Streaming
for token in ctx.stream("Explain quicksort step by step."):
    print(token, end="", flush=True)
```
### Rust

```rust
use mullama::{Model, Context, ContextParams};

fn main() -> Result<(), mullama::Error> {
    let model = Model::load("llama3.2-1b.gguf")?;
    let params = ContextParams { n_ctx: 4096, ..Default::default() };
    let mut ctx = Context::new(&model, params)?;

    // Single-shot
    let response = ctx.generate("What is a monad?", 512)?;
    println!("{response}");

    // Streaming with a callback
    ctx.stream("Explain quicksort.", 512, |token| {
        print!("{token}");
        true // return false to stop early
    })?;

    Ok(())
}
```
### Node.js

```js
import { Model, Context } from "mullama";

const model = await Model.load("llama3.2-1b.gguf", { gpuLayers: 32 });
const ctx = new Context(model, { contextSize: 4096 });

// Single-shot
const response = await ctx.generate("What is a monad?");
console.log(response);

// Streaming
for await (const token of ctx.stream("Explain quicksort.")) {
  process.stdout.write(token);
}
```

Note the ES module `import`: top-level `await` is only available in ES modules, not in CommonJS files that use `require`.
### PHP

```php
<?php

use Mullama\Model;
use Mullama\Context;

$model = Model::load('llama3.2-1b.gguf', ['gpu_layers' => 32]);
$ctx = new Context($model, ['n_ctx' => 4096]);

// Single-shot
$response = $ctx->generate('What is a monad?');
echo $response;

// Streaming
foreach ($ctx->stream('Explain quicksort.') as $token) {
    echo $token;
    flush();
}
```
### Go

```go
package main

import (
	"fmt"
	"log"

	mullama "github.com/skelf-research/mullama-go"
)

func main() {
	model, err := mullama.LoadModel("llama3.2-1b.gguf", mullama.WithGPULayers(32))
	if err != nil {
		log.Fatal(err)
	}
	ctx, err := model.NewContext(mullama.ContextConfig{ContextSize: 4096})
	if err != nil {
		log.Fatal(err)
	}

	// Single-shot
	response, err := ctx.Generate("What is a monad?")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(response)

	// Streaming
	ctx.Stream("Explain quicksort.", func(token string) bool {
		fmt.Print(token)
		return true
	})
}
```
### C

```c
#include <stdio.h>
#include <mullama.h>

int main(void) {
    mullama_model *model = mullama_load_model("llama3.2-1b.gguf", 32);
    mullama_ctx *ctx = mullama_create_context(model, 4096);

    char *response = mullama_generate(ctx, "What is a monad?", 512);
    printf("%s\n", response);

    mullama_free(response);
    mullama_free_context(ctx);
    mullama_free_model(model);
    return 0;
}
```
Every binding follows the same pattern: load a model, create a context, generate. The mental model is consistent across all six languages.
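That shared shape makes it easy to hide the binding behind a thin interface of your own, so application code can be tested without loading a model. A Python sketch of the idea; only the load → context → generate flow comes from the examples above, and the stub backend and helper names here are illustrative:

```python
from typing import Iterator, Protocol

class InferenceBackend(Protocol):
    """The generate/stream surface shared by every mullama binding."""
    def generate(self, prompt: str) -> str: ...
    def stream(self, prompt: str) -> Iterator[str]: ...

class EchoBackend:
    """Stub for tests; swap in a real mullama Context in production."""
    def generate(self, prompt: str) -> str:
        return f"echo: {prompt}"

    def stream(self, prompt: str) -> Iterator[str]:
        yield from f"echo: {prompt}".split()

def summarize(backend: InferenceBackend, text: str) -> str:
    # Application code depends only on the interface, not the binding.
    return backend.generate(f"Summarize: {text}")

print(summarize(EchoBackend(), "CAP theorem"))
```

In production you would wrap the real `Context` from whichever binding you use; the structural typing of `Protocol` means no inheritance is required.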
## Running an OpenAI-Compatible Server
When you do need an HTTP API — for serving multiple clients or integrating with tools that expect OpenAI endpoints — mullama has you covered:
```sh
# Start the server
mullama serve --model llama3.2:1b --port 8080
```
Then hit it with any OpenAI SDK or curl:
```sh
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.2:1b",
    "messages": [{"role": "user", "content": "Hello"}],
    "stream": true
  }'
```
Or use the official OpenAI Python SDK:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="unused")
response = client.chat.completions.create(
    model="llama3.2:1b",
    messages=[{"role": "user", "content": "Hello"}],
)
print(response.choices[0].message.content)
```
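With `"stream": true`, the response arrives as server-sent events. Assuming the chunks follow the OpenAI wire format (each `data:` line carries a JSON chunk with `choices[0].delta.content`, terminated by `data: [DONE]`), a minimal parser looks like:

```python
import json

def collect_stream(lines):
    """Assemble the assistant message from OpenAI-style SSE lines."""
    parts = []
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank keep-alives and SSE comments
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break
        chunk = json.loads(data)
        delta = chunk["choices"][0]["delta"]
        if "content" in delta:
            parts.append(delta["content"])
    return "".join(parts)

sample = [
    'data: {"choices": [{"delta": {"role": "assistant"}}]}',
    'data: {"choices": [{"delta": {"content": "Hel"}}]}',
    'data: {"choices": [{"delta": {"content": "lo"}}]}',
    "data: [DONE]",
]
print(collect_stream(sample))  # Hello
```

In practice the OpenAI SDK handles this for you when you pass `stream=True`; the sketch is useful when consuming the raw HTTP stream yourself.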
### Anthropic API compatibility
mullama also exposes an Anthropic-compatible endpoint. Point the Anthropic SDK at your local server:
```python
from anthropic import Anthropic

client = Anthropic(base_url="http://localhost:8080", api_key="unused")
message = client.messages.create(
    model="llama3.2:1b",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Hello"}],
)
print(message.content[0].text)
```
This is useful when your production code targets Claude but you want to develop and test locally with an open model.
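A convenient way to exploit this is to select the endpoint from the environment, so the same code talks to local mullama in development and to the real Anthropic API in production. A sketch; the `LLM_BASE_URL` and `LLM_API_KEY` variable names are our own convention, not a mullama or Anthropic feature:

```python
import os

def client_config(env=os.environ):
    """Return (base_url, api_key) for constructing an Anthropic client.

    Set LLM_BASE_URL=http://localhost:8080 to target a local mullama
    server; leave it unset to use the production Anthropic endpoint.
    """
    base_url = env.get("LLM_BASE_URL")
    if base_url:
        return base_url, env.get("LLM_API_KEY", "unused")
    return None, env["ANTHROPIC_API_KEY"]  # None -> SDK default endpoint

base_url, api_key = client_config({"LLM_BASE_URL": "http://localhost:8080"})
print(base_url, api_key)
```

When `base_url` is `None`, construct the client with only `api_key` so the SDK falls back to its default endpoint.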
## Built-in UIs
mullama ships with a web UI and a terminal UI:
```sh
# Web UI at http://localhost:8080
mullama serve --model llama3.2:1b --ui

# Terminal UI (no server needed)
mullama tui
```
## Comparison Table
| Feature | mullama | Ollama | llama-cpp-python | llamafile |
|---|---|---|---|---|
| CLI model management | Yes | Yes | No | No |
| Native language bindings | 6 (Python, Node.js, Go, PHP, Rust, C/C++) | None (HTTP only) | Python only | None |
| In-process embedding | Yes | No | Yes | No |
| OpenAI-compatible API | Yes | Yes | Yes | Yes |
| Anthropic-compatible API | Yes | No | No | No |
| GPU backends | 7 (CUDA, Metal, ROCm, OpenCL, Vulkan, SYCL, RPC) | 3 (CUDA, Metal, ROCm) | Depends on build | 2 (CUDA, Metal) |
| Modelfile support | Yes | Yes | No | No |
| Built-in Web UI | Yes | No | No | Yes |
| Built-in TUI | Yes | No | No | No |
| Model registry | Yes | Yes | No | No |
| Multimodal (vision + audio) | Yes | Yes (vision) | Partial | Yes (vision) |
| Single-binary distribution | Yes | Yes | No (pip) | Yes |
| Streaming | Yes | Yes | Yes | Yes |
## When to Use Each Tool
Use mullama when:
- You need to embed inference directly in your application without running a daemon.
- Your stack spans multiple languages and you want a consistent API across all of them.
- You need Anthropic API compatibility for local development against production Claude code.
- You are on AMD, Vulkan, or OpenCL hardware.
- You want Ollama CLI compatibility plus library-level access.
Use Ollama when:
- You want the simplest possible local LLM setup and only need HTTP access.
- Your workflow is entirely CLI-based and you do not need to embed models in code.
Use llama-cpp-python when:
- You are building a Python-only project and want direct llama.cpp bindings without the CLI/server layer.
- You need fine-grained control over llama.cpp parameters that higher-level tools abstract away.
Use llamafile when:
- You want a single file that contains both the runtime and the model.
- Distribution simplicity matters more than language binding support.
## Getting Started
```sh
# Install mullama
pip install mullama

# Pull and run a model in under 30 seconds
mullama pull llama3.2:1b
mullama run llama3.2:1b "Explain the difference between concurrency and parallelism."
```
The full documentation, binding-specific guides, and GPU backend setup instructions are on the mullama GitHub repository.