
mullama

Run any LLM locally. Use it from any language. Deploy anywhere. A drop-in Ollama replacement with native bindings for Python, Node.js, Go, PHP, Rust, and C/C++.

Key Features

🔗

6 Native Language Bindings

Python, Node.js, Go, PHP, Rust, C/C++: call models directly, no HTTP overhead.

🔄

Drop-in Ollama Replacement

Same CLI commands, same Modelfile format, same model registry. Your scripts work unchanged.

🌐

OpenAI + Anthropic API

Use your existing SDKs and tools without changes. Compatible API endpoints out of the box.

📦

Embed in Any App

Run inference in-process: no separate daemon or HTTP server required.

🎮

7 GPU Backends

CUDA, Metal, ROCm, OpenCL, Vulkan, SYCL, and RPC for distributed inference.

๐Ÿ‘๏ธ

Multimodal

Text, image, and real-time audio with voice activity detection.

Quick Start

# Install
pip install mullama          # Python
npm install mullama          # Node.js
cargo add mullama            # Rust
composer require skelf-research/mullama  # PHP

# Run a model
mullama run llama3.2:1b "What is the capital of France?"

# Start an OpenAI-compatible server
mullama serve --model llama3.2:1b

Two Ways to Use mullama

A common point of confusion is the difference between running a server and embedding a model. mullama supports both, and the choice matters.

Server Mode (HTTP API)

Run mullama serve and call it from any language via HTTP. This is similar to how Ollama works. The model runs as a separate process, and your app talks to it over OpenAI-compatible endpoints.

# Start the server
mullama serve --model llama3.2:1b

# Call it from anywhere: curl, Python, Node.js, any HTTP client
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "llama3.2:1b", "messages": [{"role": "user", "content": "Hello"}]}'

Best for: Shared team servers, microservice architectures, multi-tenant setups, or when you want to swap models without redeploying your app.
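
Because the endpoints follow the OpenAI chat completions shape, any HTTP client works. A minimal Python sketch using only the standard library; the base URL and port come from the serve example above, and the response shape is assumed to match OpenAI's:

```python
import json
import urllib.request

def build_chat_request(model: str, prompt: str) -> dict:
    # Mirrors the JSON body in the curl example above.
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }

def chat(base_url: str, model: str, prompt: str) -> str:
    # POST to the OpenAI-compatible endpoint; assumes a server started
    # with `mullama serve` is listening at base_url.
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(build_chat_request(model, prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

# Usage (with a running server):
# chat("http://localhost:8080", "llama3.2:1b", "Hello")
```

Existing OpenAI SDKs should work the same way by pointing their base URL at the local server.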

Embedded Mode (In-Process)

Load the model inside your application via native bindings. No HTTP, no daemon, no separate process. The model is a library call, as fast as it gets.

Best for: Mobile apps, CLI tools, edge devices, latency-critical pipelines, offline scenarios, privacy-sensitive workloads, or any situation where you can't (or don't want to) run a server.

When should I embed vs run a server? If your application is the only consumer of the model, embed it. If multiple applications or users need the same model, run a server. If you're on a mobile device or embedded system, embed; there's no server to run.

Python

from mullama import Model, Context

model = Model.load('llama3.2-1b.gguf', n_gpu_layers=32)
ctx = Context(model, n_ctx=4096)
response = ctx.generate('Hello, AI!')
print(response)

Rust

use mullama::{Model, Context, ContextParams};

let model = Model::load("llama3.2-1b.gguf")?;
let mut ctx = Context::new(&model, ContextParams::default())?;
let response = ctx.generate("Hello, AI!", 256)?;
println!("{}", response);

Node.js

const { Model, Context } = require('mullama');

const model = await Model.load('llama3.2-1b.gguf', { gpuLayers: 32 });
const ctx = new Context(model, { contextSize: 4096 });
const response = await ctx.generate('Hello, AI!');
console.log(response);

Go

import mullama "github.com/skelf-research/mullama-go"

model, _ := mullama.LoadModel("llama3.2-1b.gguf", mullama.WithGPULayers(32))
ctx, _ := model.NewContext(mullama.ContextConfig{ContextSize: 4096})
response, _ := ctx.Generate("Hello, AI!")
fmt.Println(response)

PHP

use Mullama\Model;
use Mullama\Context;

$model = Model::load('llama3.2-1b.gguf', ['gpu_layers' => 32]);
$ctx = new Context($model, ['n_ctx' => 4096]);
$response = $ctx->generate('Hello, AI!');
echo $response;

Ollama Compatibility

Feature                            mullama        Ollama
CLI commands (run, pull, serve)    Same syntax    -
Modelfile format                   Compatible     -
GGUF models                        Yes            Yes
OpenAI API                         Yes            Yes
Anthropic API                      Yes            No
Native language bindings           6 languages    HTTP only
Embed in app (no daemon)           Yes            No
Built-in Web UI                    Yes            No

What You Can Build

  • Chatbots and assistants: streaming responses, multi-turn context, custom system prompts
  • RAG pipelines: embeddings, ColBERT-style semantic search, grammar-constrained generation
  • Voice assistants: real-time audio capture with VAD, speech-to-text, streaming LLM responses
  • API servers: production-ready OpenAI-compatible endpoints with streaming SSE
  • Edge deployments: embed a model directly in your app with no network dependency
  • Batch processing: parallel inference across documents with work-stealing scheduling
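
The batch-processing idea can be sketched in a few lines of Python. This uses a plain thread pool (simpler than mullama's work-stealing scheduler) and injects the generation function, so the fan-out logic stands alone; with mullama, `generate_fn` could wrap a per-worker Context from the Python example above (one context per thread, since sharing a single context across threads is not assumed to be safe). A stub stands in for a real model here:

```python
from concurrent.futures import ThreadPoolExecutor

def batch_generate(documents, generate_fn, workers=4):
    """Fan prompts out across a pool of workers and collect results
    in input order. generate_fn is any callable prompt -> str."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(generate_fn, documents))

# Stand-in for a real model call such as ctx.generate(prompt):
results = batch_generate(
    ["doc one", "doc two", "doc three"],
    generate_fn=lambda prompt: f"summary of: {prompt}",
)
```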