
Embedding LLMs in Your Application: A Guide for Every Language

Most LLM tools force you through HTTP. Here's how to embed models directly in Python, Rust, Dart, Go, PHP, Node.js, C, and Zig — no server, no overhead, no separate process.

Tags: embedded inference, mullama, llamafu, polyglot, local LLM, native bindings

The Confusion: Server vs Embedded

If you’ve used Ollama, llama-cpp-python, or any cloud API, you’re used to this pattern: run a server, then call it over HTTP. You POST to localhost:11434, parse JSON, handle connection errors, and manage a separate process lifecycle.

This works. But it’s not the only way — and for many applications, it’s the wrong way.

Embedded inference means loading the model inside your application, as a library. You call a function, you get tokens back. No HTTP. No sockets. No serialization. No separate process to babysit.

| | Server Mode | Embedded Mode |
|---|---|---|
| How it works | Model runs in a separate process; you call it over HTTP | Model loads in your process; you call it as a function |
| Latency | ~1-5 ms overhead per request (TCP + JSON) | None: a direct function call |
| Setup | Start server, configure ports, manage process | Add dependency, load model file |
| Sharing | Multiple clients can share one model | One application, one model |
| Lifecycle | Separate process management | Dies with your app |
| Offline | Needs localhost networking | Works with no network stack at all |
| Best for | Shared servers, microservices, multi-user | Mobile, CLI, edge, privacy, latency-critical |

Most developers default to server mode because that’s what the docs show. But if your app is the sole consumer of the model, embedded mode is simpler, faster, and more reliable.
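The latency claim is easy to verify for yourself. The sketch below is plain Python with no mullama required: a trivial uppercase function stands in for real inference, and we time the same call made directly versus through a localhost HTTP round-trip with JSON, which is what a server-mode client pays on every request. The absolute numbers will vary by machine; the gap will not.

```python
import http.server
import json
import threading
import time
import urllib.request

# A trivial "model": the same function is exposed two ways below.
def generate(prompt: str) -> str:
    return prompt.upper()

class Handler(http.server.BaseHTTPRequestHandler):
    def do_POST(self):
        body = json.loads(self.rfile.read(int(self.headers["Content-Length"])))
        out = json.dumps({"text": generate(body["prompt"])}).encode()
        self.send_response(200)
        self.send_header("Content-Length", str(len(out)))
        self.end_headers()
        self.wfile.write(out)

    def log_message(self, *args):  # silence per-request logging
        pass

# "Server mode": the model behind a localhost HTTP endpoint
server = http.server.HTTPServer(("127.0.0.1", 0), Handler)
threading.Thread(target=server.serve_forever, daemon=True).start()
url = f"http://127.0.0.1:{server.server_port}"

def call_http(prompt: str) -> str:
    req = urllib.request.Request(
        url, json.dumps({"prompt": prompt}).encode(),
        {"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["text"]

N = 200
t0 = time.perf_counter()
for _ in range(N):
    call_http("hello")
http_us = (time.perf_counter() - t0) / N * 1e6

# "Embedded mode": the same function called directly
t0 = time.perf_counter()
for _ in range(N):
    generate("hello")
direct_us = (time.perf_counter() - t0) / N * 1e6

print(f"HTTP round-trip: {http_us:.0f} us/call, direct call: {direct_us:.2f} us/call")
server.shutdown()
```

With a real model the inference itself dwarfs both numbers per token, but the per-request tax (connection setup, serialization, context switches) is what the table's latency row is measuring.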

Python: The Most Common Case

Python developers usually reach for llama-cpp-python or call Ollama over HTTP. With mullama, you can embed directly:

from mullama import Model, Context

# Load the model — this is the only slow step (~2-5 seconds)
model = Model.load('llama3.2-1b.gguf', n_gpu_layers=32)
ctx = Context(model, n_ctx=4096)

# Generate — direct function call, no HTTP
response = ctx.generate('Explain embeddings in one paragraph')
print(response)

Streaming

for token in ctx.stream('Write a haiku about Rust:'):
    print(token, end='', flush=True)

Chat with History

messages = [
    {'role': 'system', 'content': 'You are a helpful coding assistant.'},
    {'role': 'user', 'content': 'How do I reverse a list in Python?'},
]
response = ctx.chat(messages)

When to Use Server Mode in Python

If you’re running a FastAPI or Django backend and want multiple endpoints to share one model, run the server instead:

mullama serve --model llama3.2:1b --port 8080

Then use the OpenAI SDK (you probably already have it installed):

from openai import OpenAI

client = OpenAI(base_url='http://localhost:8080/v1', api_key='unused')
response = client.chat.completions.create(
    model='llama3.2:1b',
    messages=[{'role': 'user', 'content': 'Hello'}],
)

Rust: Zero-Cost Inference

Rust is ideal for embedded inference — no GC, no runtime, deterministic performance.

With mullama (High-Level)

use mullama::{Model, Context, ContextParams};

fn main() -> anyhow::Result<()> {
    let model = Model::load("llama3.2-1b.gguf")?;
    let mut ctx = Context::new(&model, ContextParams {
        n_ctx: 4096,
        n_gpu_layers: 32,
        ..Default::default()
    })?;

    let response = ctx.generate("What is GGUF?", 256)?;
    println!("{}", response);
    Ok(())
}

With unillm (Full Runtime Control)

If you need control over the inference runtime — custom scheduling, batch processing, KV cache tuning — use unillm directly:

use unillm::Model;

fn main() -> anyhow::Result<()> {
    // unillm exposes the Model trait — forward(), generate(),
    // and access to the 47 architecture implementations
    let model = Model::load("llama3.2-1b.gguf")?;
    let output = model.generate("Hello", Default::default())?;
    println!("{}", output);
    Ok(())
}

unillm is what mullama uses under the hood. Use mullama for application-level embedding; use unillm when you’re building infrastructure.

Dart / Flutter: Mobile-First

On mobile, there is no server. The model runs on the device or it doesn’t run at all. llamafu provides FFI bindings to llama.cpp — no HTTP layer involved.

import 'package:llamafu/llamafu.dart';

// Initialize — loads model into device memory
final llm = await Llamafu.init(
  modelPath: '/data/models/llama3.2-1b-q4_k_m.gguf',
  threads: 4,         // Match your device's core count
  contextSize: 2048,  // Keep small on mobile (RAM!)
);

// Generate text — runs on-device, works offline
final result = await llm.complete(
  prompt: 'Summarize this document:',
  maxTokens: 256,
  temperature: 0.7,
);
print(result);

// Always clean up to free device memory
llm.close();

Streaming in Flutter UI

StreamBuilder<String>(
  stream: llm.streamComplete(
    prompt: 'Explain quantum computing:',
    maxTokens: 300,
  ),
  builder: (context, snapshot) {
    if (snapshot.hasData) {
      return Text(snapshot.data!);
    }
    return const CircularProgressIndicator();
  },
)

Vision on Mobile

final result = await llm.multimodalComplete(
  prompt: 'What is in this photo?',
  mediaInputs: [
    MediaInput(type: MediaType.image, data: imagePath),
  ],
  maxTokens: 200,
);

Model Selection for Mobile

| Quantization | Size (1B params) | RAM Usage | Quality | Speed |
|---|---|---|---|---|
| Q4_K_M | ~700 MB | ~1.2 GB | Good | Fast |
| Q5_K_M | ~850 MB | ~1.5 GB | Better | Moderate |
| Q8_0 | ~1.1 GB | ~1.8 GB | Best | Slower |

Recommendation: Use Q4_K_M for most mobile apps. It’s the best tradeoff of quality, speed, and memory.
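If you want to estimate sizes for other models, the arithmetic behind the table is straightforward. The sketch below is a hypothetical helper, not part of llamafu: the bits-per-weight figures are approximate llama.cpp values, the KV-cache formula is the standard 2 x layers x context x KV heads x head dim x bytes, and the Llama 3.2 1B shape (16 layers, 8 KV heads, head dim 64) is assumed from the published architecture. Real GGUF files run somewhat larger because some tensors stay at higher precision, and runtime RAM adds compute buffers on top.

```python
def gguf_size_bytes(n_params: float, bits_per_weight: float) -> float:
    """Rough on-disk size: parameters * bits / 8."""
    return n_params * bits_per_weight / 8

def kv_cache_bytes(n_layers: int, n_ctx: int, n_kv_heads: int,
                   head_dim: int, bytes_per_elem: int = 2) -> int:
    """KV cache: 2 tensors (K and V) per layer, one vector per
    position per KV head, stored as f16 by default (2 bytes)."""
    return 2 * n_layers * n_ctx * n_kv_heads * head_dim * bytes_per_elem

# Approximate bits-per-weight for common GGUF quantizations (assumed values)
BPW = {"Q4_K_M": 4.8, "Q5_K_M": 5.7, "Q8_0": 8.5}

for quant, bpw in BPW.items():
    size_mb = gguf_size_bytes(1e9, bpw) / 1e6  # a 1B-parameter model
    print(f"{quant}: ~{size_mb:.0f} MB on disk")

# KV cache for a 2048-token mobile context (assumed Llama 3.2 1B shape)
print(f"KV cache at n_ctx=2048: ~{kv_cache_bytes(16, 2048, 8, 64) / 1e6:.0f} MB")
```

The KV-cache term is why the mobile example above keeps contextSize at 2048: it grows linearly with context length, on top of the fixed weight cost.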

PHP: Yes, Really

PHP is one of the most underserved languages for LLM tooling. Most PHP developers resort to calling cloud APIs or shelling out to a Python script. mullama gives PHP native FFI bindings:

use Mullama\Model;
use Mullama\Context;

$model = Model::load('llama3.2-1b.gguf', ['gpu_layers' => 32]);
$ctx = new Context($model, ['n_ctx' => 4096]);

// Direct inference — no HTTP, no external process
$response = $ctx->generate('Write a SQL query to find duplicate emails');
echo $response;

In a Laravel Controller

class AIController extends Controller
{
    private Context $ctx;

    public function __construct()
    {
        $model = Model::load(storage_path('models/llama3.2-1b.gguf'));
        $this->ctx = new Context($model, ['n_ctx' => 2048]);
    }

    public function generate(Request $request)
    {
        $response = $this->ctx->generate($request->input('prompt'));
        return response()->json(['text' => $response]);
    }
}

When to Use Server Mode in PHP

If your PHP workers are short-lived (FPM), loading a model per-request is wasteful. Run a mullama server and call it:

// Use the OpenAI-compatible endpoint
$client = OpenAI::factory()
    ->withBaseUri('http://localhost:8080/v1')
    ->make();

$response = $client->chat()->create([
    'model' => 'llama3.2:1b',
    'messages' => [['role' => 'user', 'content' => 'Hello']],
]);

Rule of thumb: If your PHP process lives long enough to amortize model loading (~3-5 seconds), embed. If it’s a short-lived FPM request, use server mode.
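That rule of thumb reduces to one division. The helper below is purely illustrative, not a mullama API; the 4-second load and 3 ms HTTP overhead are this article's own ballpark figures, passed in as defaults.

```python
def choose_mode(requests_per_process: int,
                model_load_s: float = 4.0,
                http_overhead_s: float = 0.003) -> str:
    """Compare amortized per-request cost of loading the model in-process
    against paying an HTTP round-trip to a shared server every request."""
    embedded_per_req = model_load_s / requests_per_process
    server_per_req = http_overhead_s
    return "embedded" if embedded_per_req < server_per_req else "server"

print(choose_mode(1))          # short-lived FPM request -> "server"
print(choose_mode(1_000_000))  # long-lived queue worker -> "embedded"
```

A plain FPM request serves one prompt and dies, so it can never amortize the load; a long-lived CLI, daemon, or queue consumer amortizes it to nothing.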

Go: Embed for CLIs, Serve for Services

package main

import (
    "fmt"
    "log"

    mullama "github.com/skelf-research/mullama-go"
)

func main() {
    model, err := mullama.LoadModel("llama3.2-1b.gguf",
        mullama.WithGPULayers(32),
    )
    if err != nil {
        log.Fatal(err)
    }
    defer model.Close()

    ctx, err := model.NewContext(mullama.ContextConfig{
        ContextSize: 4096,
    })
    if err != nil {
        log.Fatal(err)
    }

    result, err := ctx.Generate("What is a goroutine?")
    if err != nil {
        log.Fatal(err)
    }
    fmt.Println(result)
}

Go’s fast startup makes it excellent for CLI tools with embedded models. For long-running services, consider whether to embed (single-tenant) or run a server (multi-tenant).

Node.js: Embed in Electron and Edge

const { Model, Context } = require('mullama');

async function main() {
  // Load once at startup
  const model = await Model.load('llama3.2-1b.gguf', { gpuLayers: 32 });
  const ctx = new Context(model, { contextSize: 4096 });

  // Generate — async, non-blocking (top-level await needs ESM, so wrap in a function)
  const response = await ctx.generate('Explain WebSockets');
  console.log(response);
}

main();

Streaming

const stream = ctx.stream('Write a poem about JavaScript:');
for await (const token of stream) {
  process.stdout.write(token);
}

Electron apps: Embedded mode is ideal — ship the model with your app, no server needed, works offline.

C / C++: Maximum Control

For embedded systems, game engines, or IoT devices:

#include <stdio.h>
#include <mullama.h>

int main(void) {
    mullama_model *model = mullama_load_model("llama3.2-1b.gguf", NULL);
    mullama_context *ctx = mullama_new_context(model, NULL);

    char output[4096];
    mullama_generate(ctx, "Hello, embedded world!", output, sizeof(output));
    printf("%s\n", output);

    mullama_free_context(ctx);
    mullama_free_model(model);
    return 0;
}

For the most extreme case — running on bare metal with no OS at all — see cllm, our unikernel that boots directly into an LLM inference server.

Zig: Learn the Internals

zigllm isn’t just an inference tool — it’s an educational implementation that teaches you how every layer works:

const std = @import("std");
const zigllm = @import("zigllm");

pub fn main() !void {
    var gpa = std.heap.GeneralPurposeAllocator(.{}){};
    defer _ = gpa.deinit();
    const allocator = gpa.allocator();

    const model = try zigllm.Model.load(allocator, "model.gguf");
    defer model.deinit();

    const output = try model.generate("Hello from Zig!", .{
        .max_tokens = 256,
        .temperature = 0.7,
    });
    std.debug.print("{s}\n", .{output});
}

zigllm is ideal for researchers and engineers who want to understand how inference works at the SIMD and tensor level, not just call it as a black box.

Decision Matrix: Which Mode, Which Tool?

| Scenario | Mode | Tool |
|---|---|---|
| Python data pipeline | Embedded | mullama (Python) |
| FastAPI serving multiple models | Server | mullama serve |
| Flutter mobile app | Embedded | llamafu |
| PHP WordPress plugin | Embedded (long-lived) or Server (FPM) | mullama |
| Rust CLI tool | Embedded | mullama (Rust) |
| Rust inference server | Embedded | unillm |
| Go microservice (single model) | Embedded | mullama (Go) |
| Electron desktop app | Embedded | mullama (Node.js) |
| Shared team GPU server | Server | mullama serve |
| IoT / embedded system | Embedded | mullama (C) |
| Bare-metal appliance | Embedded | cllm |
| Learning ML internals | Embedded | zigllm |
| Any language, quick prototype | Server | mullama serve + any HTTP client |

Getting Started

  1. Pick your language from the table above
  2. Choose embedded or server mode based on your use case
  3. Download a model — start with llama3.2:1b in GGUF format (Q4_K_M quantization)
  4. Run the code — every example above is copy-paste ready
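Before step 4, it's worth sanity-checking the file from step 3, since a truncated download fails with confusing loader errors. Per the GGUF specification, every file starts with the 4-byte ASCII magic b'GGUF' followed by a little-endian uint32 format version. The checker below is a hypothetical helper, not part of any of these projects; the demo writes a synthetic file rather than downloading a real model.

```python
import struct

def check_gguf(path: str) -> int:
    """Verify the GGUF magic bytes and return the format version."""
    with open(path, "rb") as f:
        header = f.read(8)
    if header[:4] != b"GGUF":
        raise ValueError(f"{path} is not a GGUF file (magic={header[:4]!r})")
    return struct.unpack("<I", header[4:8])[0]

# Demo: a synthetic 8-byte header standing in for a real download
with open("demo.gguf", "wb") as f:
    f.write(b"GGUF" + struct.pack("<I", 3))  # version 3 is current
print("GGUF format version:", check_gguf("demo.gguf"))
```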

See individual project pages for full API docs: mullama, llamafu, unillm, cllm, zigllm.