Embedding LLMs in Your Application: A Guide for Every Language
Most LLM tools force you through HTTP. Here's how to embed models directly in Python, Rust, Dart, Go, PHP, Node.js, C, and Zig — no server, no overhead, no separate process.
The Confusion: Server vs Embedded
If you’ve used Ollama, llama-cpp-python, or any cloud API, you’re used to this pattern: run a server, then call it over HTTP. You POST to localhost:11434, parse JSON, handle connection errors, and manage a separate process lifecycle.
This works. But it’s not the only way — and for many applications, it’s the wrong way.
Embedded inference means loading the model inside your application, as a library. You call a function, you get tokens back. No HTTP. No sockets. No serialization. No separate process to babysit.
| | Server Mode | Embedded Mode |
|---|---|---|
| How it works | Model runs in a separate process, you call it over HTTP | Model loads in your process, you call it as a function |
| Latency | ~1-5ms overhead per request (TCP + JSON) | Zero overhead — direct function call |
| Setup | Start server, configure ports, manage process | Add dependency, load model file |
| Sharing | Multiple clients can share one model | One application, one model |
| Lifecycle | Separate process management | Dies with your app |
| Offline | Needs localhost networking | Works with no network stack at all |
| Best for | Shared servers, microservices, multi-user | Mobile, CLI, edge, privacy, latency-critical |
Most developers default to server mode because that’s what the docs show. But if your app is the sole consumer of the model, embedded mode is simpler, faster, and more reliable.
Python: The Most Common Case
Python developers usually reach for llama-cpp-python or call Ollama over HTTP. With mullama, you can embed directly:
from mullama import Model, Context
# Load the model — this is the only slow step (~2-5 seconds)
model = Model.load('llama3.2-1b.gguf', n_gpu_layers=32)
ctx = Context(model, n_ctx=4096)
# Generate — direct function call, no HTTP
response = ctx.generate('Explain embeddings in one paragraph')
print(response)
Streaming
for token in ctx.stream('Write a haiku about Rust:'):
print(token, end='', flush=True)
Chat with History
messages = [
{'role': 'system', 'content': 'You are a helpful coding assistant.'},
{'role': 'user', 'content': 'How do I reverse a list in Python?'},
]
response = ctx.chat(messages)
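Multi-turn chat is plain list management: append the assistant's reply to the history before the next user turn, then call chat again. A minimal sketch of the loop, with a hypothetical chat_stub standing in for ctx.chat so the pattern runs on its own:

```python
def chat_stub(messages):
    # Hypothetical stand-in for ctx.chat(messages); echoes the last user turn.
    return "You said: " + messages[-1]["content"]

messages = [{"role": "system", "content": "You are a helpful assistant."}]

for user_input in ["Hello", "Reverse [1, 2, 3]"]:
    messages.append({"role": "user", "content": user_input})
    reply = chat_stub(messages)
    # Append the reply so the model sees the full conversation next turn
    messages.append({"role": "assistant", "content": reply})

print(len(messages))  # 1 system + 2 user + 2 assistant = 5
```

Keep in mind the context window (n_ctx) bounds how much history fits; long conversations eventually need truncation or summarization.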
When to Use Server Mode in Python
If you’re running a FastAPI or Django backend and want multiple endpoints to share one model, run the server instead:
mullama serve --model llama3.2:1b --port 8080
Then use the OpenAI SDK (you probably already have it installed):
from openai import OpenAI
client = OpenAI(base_url='http://localhost:8080/v1', api_key='unused')
response = client.chat.completions.create(
model='llama3.2:1b',
messages=[{'role': 'user', 'content': 'Hello'}],
)
Rust: Zero-Cost Inference
Rust is ideal for embedded inference — no GC, no runtime, deterministic performance.
With mullama (High-Level)
use mullama::{Model, Context, ContextParams};
fn main() -> anyhow::Result<()> {
let model = Model::load("llama3.2-1b.gguf")?;
let mut ctx = Context::new(&model, ContextParams {
n_ctx: 4096,
n_gpu_layers: 32,
..Default::default()
})?;
let response = ctx.generate("What is GGUF?", 256)?;
println!("{}", response);
Ok(())
}
With unillm (Full Runtime Control)
If you need control over the inference runtime — custom scheduling, batch processing, KV cache tuning — use unillm directly:
use unillm::{Model, ModelInputs, ModelOutputs};

// unillm gives you the Model trait — forward(), generate(),
// and access to the 47 architecture implementations
fn main() -> anyhow::Result<()> {
    let model = Model::load("llama3.2-1b.gguf")?;
    let output = model.generate("Hello", Default::default())?;
    println!("{}", output);
    Ok(())
}
unillm is what mullama uses under the hood. Use mullama for application-level embedding; use unillm when you’re building infrastructure.
Dart / Flutter: Mobile-First
On mobile, there is no server. The model runs on the device or it doesn’t run at all. llamafu provides FFI bindings to llama.cpp — no HTTP layer involved.
import 'package:llamafu/llamafu.dart';
// Initialize — loads model into device memory
final llm = await Llamafu.init(
modelPath: '/data/models/llama3.2-1b-q4_k_m.gguf',
threads: 4, // Match your device's core count
contextSize: 2048, // Keep small on mobile (RAM!)
);
// Generate text — runs on-device, works offline
final result = await llm.complete(
prompt: 'Summarize this document:',
maxTokens: 256,
temperature: 0.7,
);
print(result);
// Always clean up to free device memory
llm.close();
Streaming in Flutter UI
StreamBuilder<String>(
stream: llm.streamComplete(
prompt: 'Explain quantum computing:',
maxTokens: 300,
),
builder: (context, snapshot) {
if (snapshot.hasData) {
return Text(snapshot.data!);
}
return CircularProgressIndicator();
},
)
Vision on Mobile
final result = await llm.multimodalComplete(
prompt: 'What is in this photo?',
mediaInputs: [
MediaInput(type: MediaType.image, data: imagePath),
],
maxTokens: 200,
);
Model Selection for Mobile
| Quantization | Size (1B params) | RAM Usage | Quality | Speed |
|---|---|---|---|---|
| Q4_K_M | ~700MB | ~1.2GB | Good | Fast |
| Q5_K_M | ~850MB | ~1.5GB | Better | Moderate |
| Q8_0 | ~1.1GB | ~1.8GB | Best | Slower |
Recommendation: Use Q4_K_M for most mobile apps. It’s the best tradeoff of quality, speed, and memory.
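The file sizes in the table follow from simple bits-per-weight arithmetic. A quick sanity check (the bits-per-weight figures below are approximations; actual GGUF sizes vary by architecture and metadata):

```python
def approx_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Rough model file size: parameters times bits per weight, in GB."""
    return n_params * bits_per_weight / 8 / 1e9

# Approximate effective bits per weight for common GGUF quantizations
for name, bits in [("Q4_K_M", 4.8), ("Q5_K_M", 5.7), ("Q8_0", 8.5)]:
    print(f"{name}: ~{approx_size_gb(1.0e9, bits):.2f} GB for a 1B model")
```

The RAM figures in the table exceed the file sizes because the KV cache and activation buffers are allocated on top of the weights, which is also why a smaller contextSize saves memory on mobile.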
PHP: Yes, Really
PHP is one of the most underserved languages for LLM tooling. Most PHP developers resort to calling cloud APIs or shelling out to a Python script. mullama gives PHP native FFI bindings:
use Mullama\Model;
use Mullama\Context;
$model = Model::load('llama3.2-1b.gguf', ['gpu_layers' => 32]);
$ctx = new Context($model, ['n_ctx' => 4096]);
// Direct inference — no HTTP, no external process
$response = $ctx->generate('Write a SQL query to find duplicate emails');
echo $response;
In a Laravel Controller
class AIController extends Controller
{
private Context $ctx;
public function __construct()
{
$model = Model::load(storage_path('models/llama3.2-1b.gguf'));
$this->ctx = new Context($model, ['n_ctx' => 2048]);
}
public function generate(Request $request)
{
$response = $this->ctx->generate($request->input('prompt'));
return response()->json(['text' => $response]);
}
}
When to Use Server Mode in PHP
If your PHP workers are short-lived (FPM), loading a model per-request is wasteful. Run a mullama server and call it:
// Use the OpenAI-compatible endpoint
$client = OpenAI::factory()
->withBaseUri('http://localhost:8080/v1')
->make();
$response = $client->chat()->create([
'model' => 'llama3.2:1b',
'messages' => [['role' => 'user', 'content' => 'Hello']],
]);
Rule of thumb: If your PHP process lives long enough to amortize model loading (~3-5 seconds), embed. If it’s a short-lived FPM request, use server mode.
Go: Embed for CLIs, Serve for Services
package main

import (
	"fmt"
	"log"

	mullama "github.com/skelf-research/mullama-go"
)
func main() {
model, err := mullama.LoadModel("llama3.2-1b.gguf",
mullama.WithGPULayers(32),
)
if err != nil {
log.Fatal(err)
}
defer model.Close()
ctx, _ := model.NewContext(mullama.ContextConfig{
ContextSize: 4096,
})
result, _ := ctx.Generate("What is a goroutine?")
fmt.Println(result)
}
Go’s fast startup makes it excellent for CLI tools with embedded models. For long-running services, consider whether to embed (single-tenant) or run a server (multi-tenant).
Node.js: Embed in Electron and Edge
import { Model, Context } from 'mullama'; // ESM import, so top-level await works below
// Load once at startup
const model = await Model.load('llama3.2-1b.gguf', { gpuLayers: 32 });
const ctx = new Context(model, { contextSize: 4096 });
// Generate — async, non-blocking
const response = await ctx.generate('Explain WebSockets');
console.log(response);
Streaming
const stream = ctx.stream('Write a poem about JavaScript:');
for await (const token of stream) {
process.stdout.write(token);
}
Electron apps: Embedded mode is ideal — ship the model with your app, no server needed, works offline.
C / C++: Maximum Control
For embedded systems, game engines, or IoT devices:
#include <stdio.h>
#include <mullama.h>
mullama_model *model = mullama_load_model("llama3.2-1b.gguf", NULL);
mullama_context *ctx = mullama_new_context(model, NULL);
char output[4096];
mullama_generate(ctx, "Hello, embedded world!", output, sizeof(output));
printf("%s\n", output);
mullama_free_context(ctx);
mullama_free_model(model);
For the most extreme case — running on bare metal with no OS at all — see cllm, our unikernel that boots directly into an LLM inference server.
Zig: Learn the Internals
zigllm isn’t just an inference tool — it’s an educational implementation that teaches you how every layer works:
const std = @import("std");
const zigllm = @import("zigllm");
pub fn main() !void {
var gpa = std.heap.GeneralPurposeAllocator(.{}){};
const allocator = gpa.allocator();
const model = try zigllm.Model.load(allocator, "model.gguf");
defer model.deinit();
const output = try model.generate("Hello from Zig!", .{
.max_tokens = 256,
.temperature = 0.7,
});
std.debug.print("{s}\n", .{output});
}
zigllm is ideal for researchers and engineers who want to understand how inference works at the SIMD and tensor level, not just call it as a black box.
Decision Matrix: Which Mode, Which Tool?
| Scenario | Mode | Tool |
|---|---|---|
| Python data pipeline | Embedded | mullama (Python) |
| FastAPI serving multiple models | Server | mullama serve |
| Flutter mobile app | Embedded | llamafu |
| PHP WordPress plugin | Embedded (long-lived) or Server (FPM) | mullama |
| Rust CLI tool | Embedded | mullama (Rust) |
| Rust inference server | Embedded | unillm |
| Go microservice (single model) | Embedded | mullama (Go) |
| Electron desktop app | Embedded | mullama (Node.js) |
| Shared team GPU server | Server | mullama serve |
| IoT / embedded system | Embedded | mullama (C) |
| Bare-metal appliance | Embedded | cllm |
| Learning ML internals | Embedded | zigllm |
| Any language, quick prototype | Server | mullama serve + any HTTP client |
Getting Started
- Pick your language from the table above
- Choose embedded or server mode based on your use case
- Download a model — start with llama3.2:1b in GGUF format (Q4_K_M quantization)
- Run the code — every example above is copy-paste ready
See individual project pages for full API docs: mullama, llamafu, unillm, cllm, zigllm.