
How to Run LLMs Locally Without Ollama

mullama is a drop-in Ollama replacement with native bindings for Python, Node.js, Go, PHP, Rust, and C/C++. Install, embed in-process, and run an OpenAI-compatible server — no daemon required.


The Short Answer

mullama is a drop-in Ollama replacement that ships native language bindings for Python, Node.js, Go, PHP, Rust, and C/C++. Same CLI, same Modelfile format, same model registry — but you can embed inference directly in your application without running a daemon or making HTTP calls.

If you have used Ollama before, you already know how to use mullama. If you have not, you will be productive in about five minutes.

Why Look Beyond Ollama

Ollama is excellent for what it does: pull a model, run it, expose an API. But it has architectural constraints that surface quickly in production:

  • No native bindings. Every interaction goes through HTTP. That means serialization overhead, latency, and a mandatory background daemon even when your app is the only consumer.
  • No in-process embedding. You cannot link a model into your application. Every inference call is an IPC round-trip to the Ollama server.
  • No Anthropic API compatibility. If your codebase targets the Anthropic SDK, you need an adapter layer or a different tool.
  • Limited GPU backend coverage. Ollama supports CUDA and Metal. If you are on AMD (ROCm), Vulkan, OpenCL, or SYCL hardware, options are limited.

mullama addresses all of these. It wraps llama.cpp with a clean multi-language binding layer and exposes both OpenAI and Anthropic-compatible API endpoints. You get the same convenience as Ollama with the flexibility to embed inference anywhere.

Installation

mullama is available through the standard package manager for every supported language, plus a universal curl installer.

Language-specific installs

# Python
pip install mullama

# Node.js
npm install mullama

# Rust
cargo add mullama

# PHP
composer require skelf-research/mullama

# Go
go get github.com/skelf-research/mullama-go

Universal install (gets you the CLI + server)

curl -fsSL https://cognisoc.com/mullama/install.sh | sh

Verify the installation:

mullama --version

GPU backend selection

mullama auto-detects your GPU at build time. To force a specific backend:

# CUDA (NVIDIA)
MULLAMA_GPU=cuda pip install mullama

# ROCm (AMD)
MULLAMA_GPU=rocm pip install mullama

# Vulkan (cross-platform)
MULLAMA_GPU=vulkan pip install mullama

# Metal (macOS) — auto-detected, no flag needed
pip install mullama

All seven backends — CUDA, Metal, ROCm, OpenCL, Vulkan, SYCL, and RPC — are supported. RPC enables distributed inference across multiple machines.
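mullama picks a backend for you at build time, but if you script installs across a mixed fleet you may want to choose the flag yourself. As a rough illustration (this heuristic is my own sketch, not mullama's detection logic; the tool checks are assumptions):

```python
import platform
import shutil

def pick_gpu_backend() -> str:
    """Heuristically choose a MULLAMA_GPU value for the current machine.

    The mapping below is illustrative, not mullama's own detection.
    """
    if platform.system() == "Darwin":
        return "metal"              # Apple GPUs use Metal
    if shutil.which("nvidia-smi"):
        return "cuda"               # NVIDIA driver tooling present
    if shutil.which("rocminfo"):
        return "rocm"               # AMD ROCm stack present
    return "vulkan"                 # portable fallback

print(f"MULLAMA_GPU={pick_gpu_backend()} pip install mullama")
```

A real setup script would likely also probe for OpenCL or SYCL runtimes before falling back to Vulkan.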

CLI Quick Start

If you have used Ollama, these commands are identical:

# Pull a model from the registry
mullama pull llama3.2:1b

# Run interactively
mullama run llama3.2:1b

# Run with a prompt (non-interactive)
mullama run llama3.2:1b "Explain the CAP theorem in three sentences."

# List downloaded models
mullama list

# Show model details
mullama show llama3.2:1b

# Remove a model
mullama rm llama3.2:1b

Load a GGUF file directly

mullama run ./mistral-7b-instruct-v0.3.Q4_K_M.gguf "Summarize this."

Use a Modelfile

# Modelfile
FROM llama3.2:1b
PARAMETER temperature 0.3
PARAMETER num_ctx 8192
SYSTEM "You are a concise technical writer."

# Build and run the custom model
mullama create my-writer -f Modelfile
mullama run my-writer "Write a commit message for a refactor of the auth module."

Modelfile syntax is compatible with Ollama. Existing Modelfiles work without changes.
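The format is deliberately simple: one directive per line. To illustrate its structure (this is a sketch of the format, not mullama's actual parser), a minimal reader for the subset used above:

```python
def parse_modelfile(text: str) -> dict:
    """Parse a minimal subset of the Modelfile format (illustrative only)."""
    spec = {"from": None, "parameters": {}, "system": None}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and comments
        keyword, _, rest = line.partition(" ")
        if keyword == "FROM":
            spec["from"] = rest.strip()
        elif keyword == "PARAMETER":
            name, _, value = rest.strip().partition(" ")
            spec["parameters"][name] = value.strip()
        elif keyword == "SYSTEM":
            spec["system"] = rest.strip().strip('"')
    return spec

spec = parse_modelfile("""\
FROM llama3.2:1b
PARAMETER temperature 0.3
PARAMETER num_ctx 8192
SYSTEM "You are a concise technical writer."
""")
print(spec)
```

The real format also supports directives such as TEMPLATE and multi-line SYSTEM blocks, which this sketch ignores.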

The Key Differentiator: Using mullama as a Library

This is where mullama diverges from Ollama entirely. Instead of running a separate server and making HTTP calls, you load the model in-process and call it like any other function. No daemon, no serialization, no network overhead.

Python

from mullama import Model, Context

# Load with GPU offloading
model = Model.load("llama3.2-1b.gguf", n_gpu_layers=32)
ctx = Context(model, n_ctx=4096)

# Single-shot generation
response = ctx.generate("What is a monad?")
print(response)

# Streaming
for token in ctx.stream("Explain quicksort step by step."):
    print(token, end="", flush=True)

Rust

use mullama::{Model, Context, ContextParams};

fn main() -> Result<(), mullama::Error> {
    let model = Model::load("llama3.2-1b.gguf")?;
    let params = ContextParams { n_ctx: 4096, ..Default::default() };
    let mut ctx = Context::new(&model, params)?;

    // Single-shot
    let response = ctx.generate("What is a monad?", 512)?;
    println!("{response}");

    // Streaming with a callback
    ctx.stream("Explain quicksort.", 512, |token| {
        print!("{token}");
        true // return false to stop early
    })?;

    Ok(())
}

Node.js

import { Model, Context } from "mullama";

const model = await Model.load("llama3.2-1b.gguf", { gpuLayers: 32 });
const ctx = new Context(model, { contextSize: 4096 });

// Single-shot
const response = await ctx.generate("What is a monad?");
console.log(response);

// Streaming
for await (const token of ctx.stream("Explain quicksort.")) {
  process.stdout.write(token);
}

PHP

<?php

use Mullama\Model;
use Mullama\Context;

$model = Model::load('llama3.2-1b.gguf', ['gpu_layers' => 32]);
$ctx = new Context($model, ['n_ctx' => 4096]);

// Single-shot
$response = $ctx->generate('What is a monad?');
echo $response;

// Streaming
foreach ($ctx->stream('Explain quicksort.') as $token) {
    echo $token;
    flush();
}

Go

package main

import (
    "fmt"
    mullama "github.com/skelf-research/mullama-go"
)

func main() {
    model, _ := mullama.LoadModel("llama3.2-1b.gguf", mullama.WithGPULayers(32))
    ctx, _ := model.NewContext(mullama.ContextConfig{ContextSize: 4096})

    // Single-shot
    response, _ := ctx.Generate("What is a monad?")
    fmt.Println(response)

    // Streaming
    ctx.Stream("Explain quicksort.", func(token string) bool {
        fmt.Print(token)
        return true
    })
}

C

#include <stdio.h>
#include <mullama.h>

int main() {
    mullama_model *model = mullama_load_model("llama3.2-1b.gguf", 32);
    mullama_ctx *ctx = mullama_create_context(model, 4096);

    char *response = mullama_generate(ctx, "What is a monad?", 512);
    printf("%s\n", response);

    mullama_free(response);
    mullama_free_context(ctx);
    mullama_free_model(model);
    return 0;
}

Every binding follows the same pattern: load a model, create a context, generate. The mental model is consistent across all six languages.

Running an OpenAI-Compatible Server

When you do need an HTTP API — for serving multiple clients or integrating with tools that expect OpenAI endpoints — mullama has you covered:

# Start the server
mullama serve --model llama3.2:1b --port 8080

Then hit it with any OpenAI SDK or curl:

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.2:1b",
    "messages": [{"role": "user", "content": "Hello"}],
    "stream": true
  }'
Or use the official OpenAI Python SDK:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="unused")
response = client.chat.completions.create(
    model="llama3.2:1b",
    messages=[{"role": "user", "content": "Hello"}],
)
print(response.choices[0].message.content)
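With "stream": true, the response body arrives as server-sent events in the OpenAI chunk format: `data: {json}` lines terminated by `data: [DONE]`, each chunk carrying a content delta. If you consume the stream without an SDK, decoding it is a few lines (a minimal sketch of the standard format, fed here with example chunk lines rather than a live connection):

```python
import json

def extract_stream_text(sse_lines):
    """Concatenate the content deltas from OpenAI-style SSE chunk lines."""
    parts = []
    for line in sse_lines:
        line = line.strip()
        if not line.startswith("data: "):
            continue  # ignore blank keep-alive lines
        payload = line[len("data: "):]
        if payload == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(payload)
        delta = chunk["choices"][0]["delta"]
        parts.append(delta.get("content", ""))
    return "".join(parts)

# Example chunk lines as the server would send them
lines = [
    'data: {"choices": [{"delta": {"role": "assistant"}}]}',
    'data: {"choices": [{"delta": {"content": "Hel"}}]}',
    'data: {"choices": [{"delta": {"content": "lo"}}]}',
    "data: [DONE]",
]
print(extract_stream_text(lines))  # Hello
```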

Anthropic API compatibility

mullama also exposes an Anthropic-compatible endpoint. Point the Anthropic SDK at your local server:

from anthropic import Anthropic

client = Anthropic(base_url="http://localhost:8080", api_key="unused")
message = client.messages.create(
    model="llama3.2:1b",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Hello"}],
)
print(message.content[0].text)

This is useful when your production code targets Claude but you want to develop and test locally with an open model.

Built-in UIs

mullama ships with a web UI and a terminal UI:

# Web UI at http://localhost:8080
mullama serve --model llama3.2:1b --ui

# Terminal UI (no server needed)
mullama tui

Comparison Table

| Feature | mullama | Ollama | llama-cpp-python | llamafile |
|---|---|---|---|---|
| CLI model management | Yes | Yes | No | No |
| Native language bindings | 6 (Python, Node.js, Go, PHP, Rust, C/C++) | None (HTTP only) | Python only | None |
| In-process embedding | Yes | No | Yes | No |
| OpenAI-compatible API | Yes | Yes | Yes | Yes |
| Anthropic-compatible API | Yes | No | No | No |
| GPU backends | 7 (CUDA, Metal, ROCm, OpenCL, Vulkan, SYCL, RPC) | 2 (CUDA, Metal) | Depends on build | 2 (CUDA, Metal) |
| Modelfile support | Yes | Yes | No | No |
| Built-in Web UI | Yes | No | No | Yes |
| Built-in TUI | Yes | No | No | No |
| Model registry | Yes | Yes | No | No |
| Multimodal (vision + audio) | Yes | Yes (vision) | Partial | Yes (vision) |
| Single-binary distribution | Yes | Yes | No (pip) | Yes |
| Streaming | Yes | Yes | Yes | Yes |

When to Use Each Tool

Use mullama when:

  • You need to embed inference directly in your application without running a daemon.
  • Your stack spans multiple languages and you want a consistent API across all of them.
  • You need Anthropic API compatibility for local development against production Claude code.
  • You are on AMD, Vulkan, or OpenCL hardware.
  • You want Ollama CLI compatibility plus library-level access.

Use Ollama when:

  • You want the simplest possible local LLM setup and only need HTTP access.
  • Your workflow is entirely CLI-based and you do not need to embed models in code.

Use llama-cpp-python when:

  • You are building a Python-only project and want direct llama.cpp bindings without the CLI/server layer.
  • You need fine-grained control over llama.cpp parameters that higher-level tools abstract away.

Use llamafile when:

  • You want a single file that contains both the runtime and the model.
  • Distribution simplicity matters more than language binding support.

Getting Started

# Install mullama
pip install mullama

# Pull and run a model in under 30 seconds
mullama pull llama3.2:1b
mullama run llama3.2:1b "Explain the difference between concurrency and parallelism."

The full documentation, binding-specific guides, and GPU backend setup instructions are on the mullama GitHub repository.