Embed LLMs in Any Language
Most LLM tools force you to run a separate server and talk to it over HTTP. With cognisoc, you can embed the model directly in your application — same process, zero network overhead. Or run a server when you need one. Your choice.
Embedded vs Server Mode
Embedded (In-Process)
The model loads inside your application. You call it like any library function — model.generate("Hello"). No HTTP, no sockets, no serialization, no separate process to manage.
Pros:
- Zero latency overhead
- No process management
- Works offline / no network stack
- Simpler deployment (one binary)

Cons:
- Model tied to one process
- Memory used by your app
Provided by: mullama bindings, llamafu, unillm, zigllm
Server (HTTP API)
Run mullama serve and call it from any language via OpenAI-compatible endpoints. The model runs in a separate process and serves multiple clients concurrently.
Pros:
- Share one model across services
- Works with any HTTP client
- Swap models without redeploying
- OpenAI SDK compatible

Cons:
- ~1-5ms per-request overhead
- Extra process to manage
Provided by: mullama serve (OpenAI + Anthropic API compatible)
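Because the endpoints follow the OpenAI wire format, you don't even need an SDK: any HTTP client can talk to the server. A minimal sketch using only the Python standard library, assuming a mullama server listening on localhost:8080 (the address and the `chat_request` helper are ours, not part of any cognisoc API):

```python
import json
import urllib.request

BASE_URL = "http://localhost:8080/v1"  # assumption: default mullama serve address

def chat_request(model, user_message):
    """Build an OpenAI-style chat-completions request (not yet sent)."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
    }).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": "Bearer unused",  # a local server ignores the key
        },
    )

req = chat_request("llama3.2:1b", "Hello")
print(req.full_url)  # http://localhost:8080/v1/chat/completions
# To actually send it:
#   with urllib.request.urlopen(req) as resp:
#       reply = json.load(resp)["choices"][0]["message"]["content"]
```

The same request body works from curl, PHP, Go, or any other client shown below.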
Rule of thumb: If your application is the only consumer of the model, embed it. If multiple applications or users need the same model, run a server. If you're on mobile or embedded hardware, embed — there's no server to run.
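The rule of thumb is simple enough to state as code. A toy sketch (the function name and signature are illustrative only, not part of any cognisoc API):

```python
def pick_mode(consumers: int, on_device: bool = False) -> str:
    """Encode the rule of thumb: embed for a single consumer or for
    mobile/embedded hardware; run a server when the model is shared."""
    if on_device:
        return "embedded"  # on-device there is no server to run
    return "embedded" if consumers <= 1 else "server"

print(pick_mode(1))                  # embedded
print(pick_mode(3))                  # server
print(pick_mode(5, on_device=True))  # embedded
```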
Language Guide
Python
Data science, RAG pipelines, backend services, Jupyter notebooks, batch processing
```shell
pip install mullama
```

```python
from mullama import Model, Context

model = Model.load('llama3.2-1b.gguf', n_gpu_layers=32)
ctx = Context(model, n_ctx=4096)

# Direct function call — no HTTP, no server
response = ctx.generate('Explain embeddings in one paragraph')
print(response)

# Streaming
for token in ctx.stream('Write a haiku about Rust:'):
    print(token, end='', flush=True)
```

```python
from openai import OpenAI

# Talk to a running mullama server
client = OpenAI(base_url='http://localhost:8080/v1', api_key='unused')
response = client.chat.completions.create(
    model='llama3.2:1b',
    messages=[{'role': 'user', 'content': 'Hello'}],
)
```

Most Python ML workflows should use embedded mode. Use server mode only when sharing a model across multiple services.
Rust
Systems programming, high-throughput servers, CLI tools, inference infrastructure
```shell
cargo add mullama
```

```rust
use mullama::{Model, Context, ContextParams};

let model = Model::load("llama3.2-1b.gguf")?;
let mut ctx = Context::new(&model, ContextParams {
    n_ctx: 4096,
    n_gpu_layers: 32,
    ..Default::default()
})?;
let response = ctx.generate("What is GGUF?", 256)?;
println!("{}", response);
```

```rust
// For full runtime control, use unillm directly
use unillm::{Model, ModelInputs};

// unillm powers mullama under the hood — use it
// when you need custom scheduling, KV cache tuning,
// or access to 47 architecture implementations
let model = Model::load("llama3.2-1b.gguf")?;
let output = model.generate("Hello", Default::default())?;
```

Use mullama for application-level embedding. Use unillm when building inference infrastructure or when you need runtime control.
Dart / Flutter
Mobile apps (iOS/Android), offline-first experiences, on-device privacy, edge AI
```shell
flutter pub add llamafu
```

```dart
import 'package:llamafu/llamafu.dart';

final llm = await Llamafu.init(
  modelPath: '/data/models/llama3.2-1b-q4_k_m.gguf',
  threads: 4,         // Match device core count
  contextSize: 2048,  // Keep small on mobile (RAM)
);

// On-device inference — works offline, no server
final result = await llm.complete(
  prompt: 'Summarize this document:',
  maxTokens: 256,
  temperature: 0.7,
);

// Vision / multimodal
final vision = await llm.multimodalComplete(
  prompt: 'What is in this photo?',
  mediaInputs: [MediaInput(type: MediaType.image, data: imgPath)],
);

llm.close(); // Free device memory
```
On mobile there is no server — the model runs on the device or it doesn't run. Use Q4_K_M quantization for best quality/speed tradeoff.
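The reason Q4_K_M fits on phones comes down to bits per weight. A back-of-the-envelope sketch (the ~4.85 bits/weight figure is an approximation for Q4_K_M including block scales and metadata, not an exact spec; the helper is ours):

```python
def q4_k_m_weight_size_gb(n_params: float) -> float:
    """Rough weight-memory estimate for a Q4_K_M-quantized model.

    Assumption: Q4_K_M averages ~4.85 effective bits per weight
    once per-block scales and metadata are counted.
    """
    bits_per_weight = 4.85
    return n_params * bits_per_weight / 8 / 1e9

# A 1B-parameter model's weights fit comfortably in phone RAM...
print(f"1B @ Q4_K_M ≈ {q4_k_m_weight_size_gb(1e9):.2f} GB")
# ...while FP16 (16 bits/weight) needs roughly 3x more
print(f"1B @ FP16   ≈ {1e9 * 16 / 8 / 1e9:.2f} GB")
```

Remember to budget extra RAM for the KV cache on top of the weights, which is why the Dart example above keeps contextSize small.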
PHP
WordPress plugins, Laravel apps, web backends, content generation
```shell
composer require skelf-research/mullama
```

```php
use Mullama\Model;
use Mullama\Context;

$model = Model::load('llama3.2-1b.gguf', ['gpu_layers' => 32]);
$ctx = new Context($model, ['n_ctx' => 4096]);

// Direct inference — no HTTP, no external process
$response = $ctx->generate('Write a SQL query to find duplicates');
echo $response;
```

```php
// For short-lived FPM workers, use server mode instead
// (model loading is ~3-5s, too slow per-request)
$client = OpenAI::factory()
    ->withBaseUri('http://localhost:8080/v1')
    ->make();
$response = $client->chat()->create([
    'model' => 'llama3.2:1b',
    'messages' => [['role' => 'user', 'content' => 'Hello']],
]);
```

PHP is one of the most underserved languages for LLM tooling. If your process lives long enough to amortize model loading (~3-5s), embed. For short-lived FPM requests, use server mode.
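The amortization break-even is simple arithmetic. A sketch (the ~3-5s load time and ~1-5ms server overhead come from this page; the function itself is illustrative, not part of mullama):

```python
def amortized_overhead_ms(load_time_s: float, requests_served: int) -> float:
    """Per-request share of the one-time model-load cost, in milliseconds."""
    return load_time_s * 1000 / requests_served

# A long-lived worker spreads a 4s load thin:
print(amortized_overhead_ms(4.0, 10_000))  # 0.4 ms/request, beats server mode
# A worker recycled after 100 requests does not:
print(amortized_overhead_ms(4.0, 100))     # 40 ms/request, use server mode
```

The same math applies to any short-lived runtime, not just PHP-FPM.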
Go
Microservices, CLI tools, API gateways, DevOps tooling
```shell
go get github.com/skelf-research/mullama-go
```

```go
import (
    "fmt"
    "log"

    mullama "github.com/skelf-research/mullama-go"
)

func main() {
    model, err := mullama.LoadModel("llama3.2-1b.gguf",
        mullama.WithGPULayers(32),
    )
    if err != nil {
        log.Fatal(err)
    }
    defer model.Close()

    ctx, _ := model.NewContext(mullama.ContextConfig{
        ContextSize: 4096,
    })
    result, _ := ctx.Generate("What is a goroutine?")
    fmt.Println(result)
}
```

```go
import (
    "context"

    openai "github.com/sashabaranov/go-openai"
)

config := openai.DefaultConfig("unused") // key is ignored by a local server
config.BaseURL = "http://localhost:8080/v1"
client := openai.NewClientWithConfig(config)

resp, _ := client.CreateChatCompletion(context.Background(),
    openai.ChatCompletionRequest{
        Model: "llama3.2:1b",
        Messages: []openai.ChatCompletionMessage{
            {Role: openai.ChatMessageRoleUser, Content: "Hello"},
        },
    },
)
```

Go's fast startup makes it excellent for CLI tools with embedded models. For long-running services, embed if single-tenant; run a server if multi-tenant.
Node.js
Full-stack apps, Electron desktop apps, serverless edge functions, real-time chat
```shell
npm install mullama
```

```javascript
const { Model, Context } = require('mullama');

// Load once at startup
// (top-level await shown for brevity — use ESM or wrap in an async function)
const model = await Model.load('llama3.2-1b.gguf', {
  gpuLayers: 32,
});
const ctx = new Context(model, { contextSize: 4096 });

// Async, non-blocking
const response = await ctx.generate('Explain WebSockets');
console.log(response);

// Streaming
const stream = ctx.stream('Write a poem about JavaScript:');
for await (const token of stream) {
  process.stdout.write(token);
}
```

```javascript
const OpenAI = require('openai');

const client = new OpenAI({
  baseURL: 'http://localhost:8080/v1',
  apiKey: 'unused',
});
const response = await client.chat.completions.create({
  model: 'llama3.2:1b',
  messages: [{ role: 'user', content: 'Hello' }],
});
```

Embedded mode is ideal for Electron apps — ship the model with your app, no server needed, works offline.
C / C++
Embedded systems, game engines, native apps, IoT devices, bare-metal appliances
```shell
# Link against libmullama
```

```c
#include <mullama.h>

mullama_model *model = mullama_load_model("llama3.2-1b.gguf", NULL);
mullama_context *ctx = mullama_new_context(model, NULL);

char output[4096];
mullama_generate(ctx, "Hello, embedded world!", output, sizeof(output));
printf("%s\n", output);

mullama_free_context(ctx);
mullama_free_model(model);

// For the most extreme case — running on bare metal
// with no OS at all — see cllm
```

For maximum control with zero OS overhead, cllm boots directly into an LLM inference server on bare metal.
Zig
Learning ML internals, SIMD research, systems programming, custom inference engines
```shell
git clone https://github.com/cognisoc/zigllm.git
```

```zig
const std = @import("std");
const zigllm = @import("zigllm");

pub fn main() !void {
    var gpa = std.heap.GeneralPurposeAllocator(.{}){};
    const allocator = gpa.allocator();

    const model = try zigllm.Model.load(allocator, "model.gguf");
    defer model.deinit();

    const output = try model.generate("Hello from Zig!", .{
        .max_tokens = 256,
        .temperature = 0.7,
    });
    std.debug.print("{s}\n", .{output});
}
```

zigllm is educational — it teaches you how every layer of inference works, from tensors to text generation. 285+ tests serve as executable documentation.
Decision Matrix
Not sure which mode or tool to use? Find your scenario.
| Scenario | Mode | Tool |
|---|---|---|
| Python data pipeline | Embedded | mullama |
| FastAPI serving multiple models | Server | mullama serve |
| Flutter mobile app | Embedded | llamafu |
| PHP WordPress plugin | Either | mullama |
| Rust CLI tool | Embedded | mullama |
| Rust inference server | Embedded | unillm |
| Go microservice | Embedded | mullama |
| Electron desktop app | Embedded | mullama (Node) |
| Shared team GPU server | Server | mullama serve |
| IoT / embedded system | Embedded | mullama (C) |
| Bare-metal appliance | Embedded | cllm |
| Learning ML internals | Embedded | zigllm |
| Any language, quick prototype | Server | mullama serve |
Open Hardware for LLM Inference
We're not stopping at software. Cognisoc is exploring open hardware reference designs purpose-built for local LLM inference — open schematics, open firmware, designed to run cognisoc software from boot.
Inference Accelerator Boards
Single-board designs with NPUs and RISC-V cores. Run cllm directly on bare metal — no OS, no overhead. Designed for edge deployment where every watt and millisecond counts.
FPGA Accelerator Capes
Reconfigurable hardware for custom quantization formats, novel attention mechanisms, and research workloads. Flash new inference kernels without respinning silicon.
GPU Cluster Blueprints
Rack-mount configurations with optimized networking for distributed inference using unillm's RPC backend. Open BOM, open thermal design, open orchestration.
Why Open Hardware?
The software is ready. We have the runtime (unillm), the server (mullama), the mobile stack (llamafu), and the unikernel (cllm). The missing piece is hardware designed to run this stack natively — not general-purpose servers with inference bolted on.
Vertical integration matters. When you control both the software and the hardware reference design, you can optimize in ways that generic platforms can't: custom memory layouts for KV caches, tuned PCIe topologies for multi-GPU inference, firmware-level model loading.
Open means auditable. For enterprise and government deployments, proprietary hardware is a black box. Open schematics and open firmware mean you can verify what's running — down to the gate level.
Accessible by design. Reference designs lower the barrier for hardware manufacturers worldwide. Any fabricator can produce inference boards that work with the cognisoc stack out of the box — no licensing, no vendor lock-in.
If you're building hardware for AI inference, working on RISC-V or FPGA platforms, or interested in co-developing open reference designs — let's talk.
Ready to embed?
Pick your language, choose embedded or server mode, and start running LLMs locally in minutes.