# mullama
Run any LLM locally. Use it from any language. Deploy anywhere. A drop-in Ollama replacement with native bindings for Python, Node.js, Go, PHP, Rust, and C/C++.
## Key Features

- **6 Native Language Bindings**: Python, Node.js, Go, PHP, Rust, and C/C++ call models directly, with no HTTP overhead.
- **Drop-in Ollama Replacement**: same CLI commands, same Modelfile format, same model registry. Your scripts work unchanged.
- **OpenAI + Anthropic APIs**: compatible API endpoints out of the box, so your existing SDKs and tools work without changes.
- **Embed in Any App**: run inference in-process, with no separate daemon or HTTP server required.
- **7 GPU Backends**: CUDA, Metal, ROCm, OpenCL, Vulkan, SYCL, and RPC for distributed inference.
- **Multimodal**: text, image, and real-time audio with voice activity detection.
## Quick Start

```sh
# Install
pip install mullama                        # Python
npm install mullama                        # Node.js
cargo add mullama                          # Rust
composer require skelf-research/mullama    # PHP

# Run a model
mullama run llama3.2:1b "What is the capital of France?"

# Start an OpenAI-compatible server
mullama serve --model llama3.2:1b
```
## Two Ways to Use mullama
The difference between running a server and embedding a model trips up many developers. mullama supports both, and the choice matters.
### Server Mode (HTTP API)
Run `mullama serve` and call it from any language over HTTP. This is how Ollama works: the model runs as a separate process, and your app talks to it over OpenAI-compatible endpoints.
```sh
# Start the server
mullama serve --model llama3.2:1b

# Call it from anywhere: curl, Python, Node.js, any HTTP client
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "llama3.2:1b", "messages": [{"role": "user", "content": "Hello"}]}'
```
**Best for:** shared team servers, microservice architectures, multi-tenant setups, or when you want to swap models without redeploying your app.
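Because the endpoints are OpenAI-compatible, any HTTP client works; no mullama-specific SDK is needed in server mode. A minimal Python sketch using only the standard library (the port and model name match the `mullama serve` example above; the `chat` helper name is illustrative, not part of mullama):

```python
import json
import urllib.request

# Assumes a server started with: mullama serve --model llama3.2:1b
BASE_URL = "http://localhost:8080/v1"

def chat(prompt: str, model: str = "llama3.2:1b") -> str:
    """Send one chat turn to the OpenAI-compatible endpoint and return the reply."""
    payload = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }).encode("utf-8")
    req = urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

# chat("What is the capital of France?")  # requires a running server
```

The official OpenAI and Anthropic SDKs should also work by pointing their base URL at the server.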
### Embedded Mode (In-Process)
Load the model inside your application via native bindings. No HTTP, no daemon, no separate process: the model is a library call, as fast as it gets.
**Best for:** mobile apps, CLI tools, edge devices, latency-critical pipelines, offline scenarios, privacy-sensitive workloads, or any situation where you can't (or don't want to) run a server.
> **When should I embed vs. run a server?** If your application is the only consumer of the model, embed it. If multiple applications or users need the same model, run a server. If you're on a mobile device or embedded system, embed: there's no server to run.
### Python

```python
from mullama import Model, Context

model = Model.load('llama3.2-1b.gguf', n_gpu_layers=32)
ctx = Context(model, n_ctx=4096)
response = ctx.generate('Hello, AI!')
print(response)
```
### Rust

```rust
use mullama::{Model, Context, ContextParams};

let model = Model::load("llama3.2-1b.gguf")?;
let mut ctx = Context::new(&model, ContextParams::default())?;
let response = ctx.generate("Hello, AI!", 256)?;
println!("{}", response);
```
### Node.js

```javascript
// ESM import: top-level await is not available in CommonJS modules
import { Model, Context } from 'mullama';

const model = await Model.load('llama3.2-1b.gguf', { gpuLayers: 32 });
const ctx = new Context(model, { contextSize: 4096 });
const response = await ctx.generate('Hello, AI!');
console.log(response);
```
### Go

```go
import (
	"fmt"

	mullama "github.com/skelf-research/mullama-go"
)

// Errors elided for brevity
model, _ := mullama.LoadModel("llama3.2-1b.gguf", mullama.WithGPULayers(32))
ctx, _ := model.NewContext(mullama.ContextConfig{ContextSize: 4096})
response, _ := ctx.Generate("Hello, AI!")
fmt.Println(response)
```
### PHP

```php
use Mullama\Model;
use Mullama\Context;

$model = Model::load('llama3.2-1b.gguf', ['gpu_layers' => 32]);
$ctx = new Context($model, ['n_ctx' => 4096]);
$response = $ctx->generate('Hello, AI!');
echo $response;
```
## Ollama Compatibility
| Feature | mullama | Ollama |
|---|---|---|
| CLI commands (`run`, `pull`, `serve`) | Same syntax | ✓ |
| Modelfile format | Compatible | ✓ |
| GGUF models | Yes | Yes |
| OpenAI API | Yes | Yes |
| Anthropic API | Yes | No |
| Native language bindings | 6 languages | HTTP only |
| Embed in app (no daemon) | Yes | No |
| Built-in Web UI | Yes | No |
## What You Can Build
- **Chatbots and assistants**: streaming responses, multi-turn context, custom system prompts
- **RAG pipelines**: embeddings, ColBERT-style semantic search, grammar-constrained generation
- **Voice assistants**: real-time audio capture with VAD, speech-to-text, streaming LLM responses
- **API servers**: production-ready OpenAI-compatible endpoints with streaming SSE
- **Edge deployments**: embed a model directly in your app with no network dependency
- **Batch processing**: parallel inference across documents with work-stealing scheduling
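The streaming SSE responses mentioned above follow the OpenAI chunk format: each `data:` line carries a JSON chunk with a `delta`, terminated by `data: [DONE]`. A hedged sketch of the client-side parsing, demonstrated here on synthetic chunks rather than a live server:

```python
import json

def iter_sse_deltas(lines):
    """Yield content deltas from OpenAI-style SSE lines ('data: {...}')."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip comments and blank keep-alive lines
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(data)
        delta = chunk["choices"][0].get("delta", {})
        if "content" in delta:
            yield delta["content"]

# Synthetic chunks in the shape a streaming chat completion emits
sample = [
    'data: {"choices":[{"delta":{"content":"Par"}}]}',
    'data: {"choices":[{"delta":{"content":"is"}}]}',
    "data: [DONE]",
]
print("".join(iter_sse_deltas(sample)))  # prints "Paris"
```

In a real client you would iterate over the response body of a `POST /v1/chat/completions` request sent with `"stream": true`.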