Embedding LLMs in Your Application: A Guide for Every Language
Most LLM tools force you through HTTP. Here's how to embed models directly in Python, Rust, Dart, Go, PHP, Node.js, C, and Zig — no server, no overhead, no separate process.
The Confusion: Server vs Embedded
If you’ve used Ollama, llama-cpp-python, or any cloud API, you’re used to this pattern: run a server, then call it over HTTP. You POST to localhost:11434, parse JSON, handle connection errors, and manage a separate process lifecycle.
This works. But it’s not the only way — and for many applications, it’s the wrong way.
Embedded inference means loading the model inside your application, as a library. You call a function, you get tokens back. No HTTP. No sockets. No serialization. No separate process to babysit.
| | Server Mode | Embedded Mode |
|---|---|---|
| How it works | Model runs in a separate process, you call it over HTTP | Model loads in your process, you call it as a function |
| Latency | ~1-5ms overhead per request (TCP + JSON) | Zero overhead — direct function call |
| Setup | Start server, configure ports, manage process | Add dependency, load model file |
| Sharing | Multiple clients can share one model | One application, one model |
| Lifecycle | Separate process management | Dies with your app |
| Offline | Needs localhost networking | Works with no network stack at all |
| Best for | Shared servers, microservices, multi-user | Mobile, CLI, edge, privacy, latency-critical |
Most developers default to server mode because that’s what the docs show. But if your app is the sole consumer of the model, embedded mode is simpler, faster, and more reliable.
Python: The Most Common Case
Python developers usually reach for llama-cpp-python or call Ollama over HTTP. With mullama, you can embed directly:
from mullama import Model, Context
# Load the model — this is the only slow step (~2-5 seconds)
model = Model.load('llama3.2-1b.gguf', n_gpu_layers=32)
ctx = Context(model, n_ctx=4096)
# Generate — direct function call, no HTTP
response = ctx.generate('Explain embeddings in one paragraph')
print(response)
Streaming
for token in ctx.stream('Write a haiku about Rust:'):
print(token, end='', flush=True)
Chat with History
messages = [
{'role': 'system', 'content': 'You are a helpful coding assistant.'},
{'role': 'user', 'content': 'How do I reverse a list in Python?'},
]
response = ctx.chat(messages)
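Multi-turn chat is plain list management: append the assistant's reply to the history before the next user turn, then call chat again. A minimal sketch of the loop, with a hypothetical chat_stub standing in for ctx.chat so the pattern runs on its own:

```python
def chat_stub(messages):
    # Hypothetical stand-in for ctx.chat(messages); echoes the last user turn.
    return "You said: " + messages[-1]["content"]

messages = [{"role": "system", "content": "You are a helpful assistant."}]

for user_input in ["Hello", "Reverse [1, 2, 3]"]:
    messages.append({"role": "user", "content": user_input})
    reply = chat_stub(messages)
    # Append the reply so the model sees the full conversation next turn
    messages.append({"role": "assistant", "content": reply})

print(len(messages))  # 1 system + 2 user + 2 assistant = 5
```

Keep in mind the context window (n_ctx) bounds how much history fits; long conversations eventually need truncation or summarization.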
When to Use Server Mode in Python
If you’re running a FastAPI or Django backend and want multiple endpoints to share one model, run the server instead:
mullama serve --model llama3.2:1b --port 8080
Then use the OpenAI SDK (you probably already have it installed):
from openai import OpenAI
client = OpenAI(base_url='http://localhost:8080/v1', api_key='unused')
response = client.chat.completions.create(
model='llama3.2:1b',
messages=[{'role': 'user', 'content': 'Hello'}],
)
Rust: Zero-Cost Inference
Rust is ideal for embedded inference — no GC, no runtime, deterministic performance.
With mullama (High-Level)
use mullama::{Model, Context, ContextParams};
fn main() -> anyhow::Result<()> {
let model = Model::load("llama3.2-1b.gguf")?;
let mut ctx = Context::new(&model, ContextParams {
n_ctx: 4096,
n_gpu_layers: 32,
..Default::default()
})?;
let response = ctx.generate("What is GGUF?", 256)?;
println!("{}", response);
Ok(())
}
With unillm (Full Runtime Control)
If you need control over the inference runtime — custom scheduling, batch processing, KV cache tuning — use unillm directly:
use unillm::{Model, ModelInputs, ModelOutputs};

// unillm gives you the Model trait — forward(), generate(),
// and access to the 47 architecture implementations
fn main() -> anyhow::Result<()> {
    let model = Model::load("llama3.2-1b.gguf")?;
    let output = model.generate("Hello", Default::default())?;
    println!("{}", output);
    Ok(())
}
unillm is what mullama uses under the hood. Use mullama for application-level embedding; use unillm when you’re building infrastructure.
Dart / Flutter: Mobile-First
On mobile, there is no server. The model runs on the device or it doesn’t run at all. llamafu provides FFI bindings to llama.cpp — no HTTP layer involved.
import 'package:llamafu/llamafu.dart';
// Initialize — loads model into device memory
final llm = await Llamafu.init(
modelPath: '/data/models/llama3.2-1b-q4_k_m.gguf',
threads: 4, // Match your device's core count
contextSize: 2048, // Keep small on mobile (RAM!)
);
// Generate text — runs on-device, works offline
final result = await llm.complete(
prompt: 'Summarize this document:',
maxTokens: 256,
temperature: 0.7,
);
print(result);
// Always clean up to free device memory
llm.close();
Streaming in Flutter UI
StreamBuilder<String>(
stream: llm.streamComplete(
prompt: 'Explain quantum computing:',
maxTokens: 300,
),
builder: (context, snapshot) {
if (snapshot.hasData) {
return Text(snapshot.data!);
}
return CircularProgressIndicator();
},
)
Vision on Mobile
final result = await llm.multimodalComplete(
prompt: 'What is in this photo?',
mediaInputs: [
MediaInput(type: MediaType.image, data: imagePath),
],
maxTokens: 200,
);
Model Selection for Mobile
| Quantization | Size (1B params) | RAM Usage | Quality | Speed |
|---|---|---|---|---|
| Q4_K_M | ~700MB | ~1.2GB | Good | Fast |
| Q5_K_M | ~850MB | ~1.5GB | Better | Moderate |
| Q8_0 | ~1.1GB | ~1.8GB | Best | Slower |
Recommendation: Use Q4_K_M for most mobile apps. It’s the best tradeoff of quality, speed, and memory.
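The file sizes in the table follow from simple bits-per-weight arithmetic. A quick sanity check (the bits-per-weight figures below are approximations; actual GGUF sizes vary by architecture and metadata):

```python
def approx_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Rough model file size: parameters times bits per weight, in GB."""
    return n_params * bits_per_weight / 8 / 1e9

# Approximate effective bits per weight for common GGUF quantizations
for name, bits in [("Q4_K_M", 4.8), ("Q5_K_M", 5.7), ("Q8_0", 8.5)]:
    print(f"{name}: ~{approx_size_gb(1.0e9, bits):.2f} GB for a 1B model")
```

The RAM figures in the table exceed the file sizes because the KV cache and activation buffers are allocated on top of the weights, which is also why a smaller contextSize saves memory on mobile.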
PHP: Yes, Really
PHP is one of the most underserved languages for LLM tooling. Most PHP developers resort to calling cloud APIs or shelling out to a Python script. mullama gives PHP native FFI bindings:
use Mullama\Model;
use Mullama\Context;
$model = Model::load('llama3.2-1b.gguf', ['gpu_layers' => 32]);
$ctx = new Context($model, ['n_ctx' => 4096]);
// Direct inference — no HTTP, no external process
$response = $ctx->generate('Write a SQL query to find duplicate emails');
echo $response;
In a Laravel Controller
class AIController extends Controller
{
private Context $ctx;
public function __construct()
{
$model = Model::load(storage_path('models/llama3.2-1b.gguf'));
$this->ctx = new Context($model, ['n_ctx' => 2048]);
}
public function generate(Request $request)
{
$response = $this->ctx->generate($request->input('prompt'));
return response()->json(['text' => $response]);
}
}
When to Use Server Mode in PHP
If your PHP workers are short-lived (FPM), loading a model per-request is wasteful. Run a mullama server and call it:
// Use the OpenAI-compatible endpoint
$client = OpenAI::factory()
->withBaseUri('http://localhost:8080/v1')
->make();
$response = $client->chat()->create([
'model' => 'llama3.2:1b',
'messages' => [['role' => 'user', 'content' => 'Hello']],
]);
Rule of thumb: If your PHP process lives long enough to amortize model loading (~3-5 seconds), embed. If it’s a short-lived FPM request, use server mode.
Go: Embed for CLIs, Serve for Services
package main

import (
	"fmt"
	"log"

	mullama "github.com/skelf-research/mullama-go"
)
func main() {
model, err := mullama.LoadModel("llama3.2-1b.gguf",
mullama.WithGPULayers(32),
)
if err != nil {
log.Fatal(err)
}
defer model.Close()
ctx, _ := model.NewContext(mullama.ContextConfig{
ContextSize: 4096,
})
result, _ := ctx.Generate("What is a goroutine?")
fmt.Println(result)
}
Go’s fast startup makes it excellent for CLI tools with embedded models. For long-running services, consider whether to embed (single-tenant) or run a server (multi-tenant).
Node.js: Embed in Electron and Edge
import { Model, Context } from 'mullama'; // ESM import, so top-level await works below
// Load once at startup
const model = await Model.load('llama3.2-1b.gguf', { gpuLayers: 32 });
const ctx = new Context(model, { contextSize: 4096 });
// Generate — async, non-blocking
const response = await ctx.generate('Explain WebSockets');
console.log(response);
Streaming
const stream = ctx.stream('Write a poem about JavaScript:');
for await (const token of stream) {
process.stdout.write(token);
}
Electron apps: Embedded mode is ideal — ship the model with your app, no server needed, works offline.
C / C++: Maximum Control
For embedded systems, game engines, or IoT devices:
#include <stdio.h>
#include <mullama.h>
mullama_model *model = mullama_load_model("llama3.2-1b.gguf", NULL);
mullama_context *ctx = mullama_new_context(model, NULL);
char output[4096];
mullama_generate(ctx, "Hello, embedded world!", output, sizeof(output));
printf("%s\n", output);
mullama_free_context(ctx);
mullama_free_model(model);
For the most extreme case — running on bare metal with no OS at all — see cllm, our unikernel that boots directly into an LLM inference server.
Zig: Learn the Internals
zigllm isn’t just an inference tool — it’s an educational implementation that teaches you how every layer works:
const std = @import("std");
const zigllm = @import("zigllm");
pub fn main() !void {
var gpa = std.heap.GeneralPurposeAllocator(.{}){};
const allocator = gpa.allocator();
const model = try zigllm.Model.load(allocator, "model.gguf");
defer model.deinit();
const output = try model.generate("Hello from Zig!", .{
.max_tokens = 256,
.temperature = 0.7,
});
std.debug.print("{s}\n", .{output});
}
zigllm is ideal for researchers and engineers who want to understand how inference works at the SIMD and tensor level, not just call it as a black box.
Decision Matrix: Which Mode, Which Tool?
| Scenario | Mode | Tool |
|---|---|---|
| Python data pipeline | Embedded | mullama (Python) |
| FastAPI serving multiple models | Server | mullama serve |
| Flutter mobile app | Embedded | llamafu |
| PHP WordPress plugin | Embedded (long-lived) or Server (FPM) | mullama |
| Rust CLI tool | Embedded | mullama (Rust) |
| Rust inference server | Embedded | unillm |
| Go microservice (single model) | Embedded | mullama (Go) |
| Electron desktop app | Embedded | mullama (Node.js) |
| Shared team GPU server | Server | mullama serve |
| IoT / embedded system | Embedded | mullama (C) |
| Bare-metal appliance | Embedded | cllm |
| Learning ML internals | Embedded | zigllm |
| Any language, quick prototype | Server | mullama serve + any HTTP client |
Getting Started
- Pick your language from the table above
- Choose embedded or server mode based on your use case
- Download a model — start with llama3.2:1b in GGUF format (Q4_K_M quantization)
- Run the code — every example above is copy-paste ready
See individual project pages for full API docs: mullama, llamafu, unillm, cllm, zigllm.