# mullama
Run any LLM locally. Use it from any language. Deploy anywhere. A drop-in Ollama replacement with native bindings for Python, Node.js, Go, PHP, Rust, and C/C++.
## Key Features

- **6 Native Language Bindings**: Python, Node.js, Go, PHP, Rust, and C/C++ call models directly, with no HTTP overhead.
- **Drop-in Ollama Replacement**: same CLI commands, same Modelfile format, same model registry. Your scripts work unchanged.
- **OpenAI + Anthropic APIs**: compatible API endpoints out of the box, so your existing SDKs and tools work without changes.
- **Embed in Any App**: run inference in-process, with no separate daemon or HTTP server required.
- **7 GPU Backends**: CUDA, Metal, ROCm, OpenCL, Vulkan, SYCL, and RPC for distributed inference.
- **Multimodal**: text, image, and real-time audio with voice activity detection.
## Quick Start

```sh
# Install
pip install mullama                        # Python
npm install mullama                        # Node.js
cargo add mullama                          # Rust
composer require skelf-research/mullama    # PHP

# Run a model
mullama run llama3.2:1b "What is the capital of France?"

# Start an OpenAI-compatible server
mullama serve --model llama3.2:1b
```
## Two Ways to Use mullama
The difference between running a server and embedding a model trips up many developers. mullama supports both, and the choice matters.
### Server Mode (HTTP API)
Run `mullama serve` and call it from any language over HTTP. This is how Ollama works: the model runs as a separate process, and your app talks to it over OpenAI-compatible endpoints.
```sh
# Start the server
mullama serve --model llama3.2:1b

# Call it from anywhere: curl, Python, Node.js, any HTTP client
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "llama3.2:1b", "messages": [{"role": "user", "content": "Hello"}]}'
```
**Best for:** shared team servers, microservice architectures, multi-tenant setups, or when you want to swap models without redeploying your app.
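Because the endpoints are OpenAI-compatible, any HTTP client works; no mullama-specific SDK is needed in server mode. A minimal Python sketch using only the standard library (the port and model name match the `mullama serve` example above; the `chat` helper name is illustrative, not part of mullama):

```python
import json
import urllib.request

# Assumes a server started with: mullama serve --model llama3.2:1b
BASE_URL = "http://localhost:8080/v1"

def chat(prompt: str, model: str = "llama3.2:1b") -> str:
    """Send one chat turn to the OpenAI-compatible endpoint and return the reply."""
    payload = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }).encode("utf-8")
    req = urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

# chat("What is the capital of France?")  # requires a running server
```

The official OpenAI and Anthropic SDKs should also work by pointing their base URL at the server.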
### Embedded Mode (In-Process)
Load the model inside your application via native bindings. No HTTP, no daemon, no separate process: the model is a library call, as fast as it gets.
**Best for:** mobile apps, CLI tools, edge devices, latency-critical pipelines, offline scenarios, privacy-sensitive workloads, or any situation where you can't (or don't want to) run a server.
> **When should I embed vs. run a server?** If your application is the only consumer of the model, embed it. If multiple applications or users need the same model, run a server. If you're on a mobile device or embedded system, embed: there's no server to run.
### Python

```python
from mullama import Model, Context

model = Model.load('llama3.2-1b.gguf', n_gpu_layers=32)
ctx = Context(model, n_ctx=4096)
response = ctx.generate('Hello, AI!')
print(response)
```
### Rust

```rust
use mullama::{Model, Context, ContextParams};

let model = Model::load("llama3.2-1b.gguf")?;
let mut ctx = Context::new(&model, ContextParams::default())?;
let response = ctx.generate("Hello, AI!", 256)?;
println!("{}", response);
```
### Node.js

```javascript
// ESM import: top-level await is not available in CommonJS modules
import { Model, Context } from 'mullama';

const model = await Model.load('llama3.2-1b.gguf', { gpuLayers: 32 });
const ctx = new Context(model, { contextSize: 4096 });
const response = await ctx.generate('Hello, AI!');
console.log(response);
```
### Go

```go
import (
	"fmt"

	mullama "github.com/skelf-research/mullama-go"
)

// Errors elided for brevity
model, _ := mullama.LoadModel("llama3.2-1b.gguf", mullama.WithGPULayers(32))
ctx, _ := model.NewContext(mullama.ContextConfig{ContextSize: 4096})
response, _ := ctx.Generate("Hello, AI!")
fmt.Println(response)
```
### PHP

```php
use Mullama\Model;
use Mullama\Context;

$model = Model::load('llama3.2-1b.gguf', ['gpu_layers' => 32]);
$ctx = new Context($model, ['n_ctx' => 4096]);
$response = $ctx->generate('Hello, AI!');
echo $response;
```
## Ollama Compatibility
| Feature | mullama | Ollama |
|---|---|---|
| CLI commands (`run`, `pull`, `serve`) | Same syntax | ✓ |
| Modelfile format | Compatible | ✓ |
| GGUF models | Yes | Yes |
| OpenAI API | Yes | Yes |
| Anthropic API | Yes | No |
| Native language bindings | 6 languages | HTTP only |
| Embed in app (no daemon) | Yes | No |
| Built-in Web UI | Yes | No |
## What You Can Build
- **Chatbots and assistants**: streaming responses, multi-turn context, custom system prompts
- **RAG pipelines**: embeddings, ColBERT-style semantic search, grammar-constrained generation
- **Voice assistants**: real-time audio capture with VAD, speech-to-text, streaming LLM responses
- **API servers**: production-ready OpenAI-compatible endpoints with streaming SSE
- **Edge deployments**: embed a model directly in your app with no network dependency
- **Batch processing**: parallel inference across documents with work-stealing scheduling
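The streaming SSE responses mentioned above follow the OpenAI chunk format: each `data:` line carries a JSON chunk with a `delta`, terminated by `data: [DONE]`. A hedged sketch of the client-side parsing, demonstrated here on synthetic chunks rather than a live server:

```python
import json

def iter_sse_deltas(lines):
    """Yield content deltas from OpenAI-style SSE lines ('data: {...}')."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip comments and blank keep-alive lines
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(data)
        delta = chunk["choices"][0].get("delta", {})
        if "content" in delta:
            yield delta["content"]

# Synthetic chunks in the shape a streaming chat completion emits
sample = [
    'data: {"choices":[{"delta":{"content":"Par"}}]}',
    'data: {"choices":[{"delta":{"content":"is"}}]}',
    "data: [DONE]",
]
print("".join(iter_sse_deltas(sample)))  # prints "Paris"
```

In a real client you would iterate over the response body of a `POST /v1/chat/completions` request sent with `"stream": true`.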