Embed LLMs in Any Language
Most LLM tools force you to run a separate server and talk to it over HTTP. With cognisoc, you can embed the model directly in your application — same process, zero network overhead. Or run a server when you need one. Your choice.
Embedded vs Server Mode
Embedded (In-Process)
The model loads inside your application. You call it like any library function — model.generate("Hello"). No HTTP, no sockets, no serialization, no separate process to manage.
Pros:
- Zero latency overhead
- No process management
- Works offline / no network stack
- Simpler deployment (one binary)

Cons:
- Model tied to one process
- Memory used by your app
Provided by: mullama bindings, llamafu, unillm, zigllm
Server (HTTP API)
Run mullama serve and call it from any language via OpenAI-compatible endpoints. The model runs in a separate process and serves multiple clients concurrently.
Pros:
- Share one model across services
- Works with any HTTP client
- Swap models without redeploying
- OpenAI SDK compatible

Cons:
- ~1-5ms per-request overhead
- Extra process to manage
Provided by: mullama serve (OpenAI + Anthropic API compatible)
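Because the endpoints follow the OpenAI wire format, you don't even need an SDK: any HTTP client can talk to the server. A minimal sketch using only the Python standard library, assuming a mullama server listening on localhost:8080 (the address and the `chat_request` helper are ours, not part of any cognisoc API):

```python
import json
import urllib.request

BASE_URL = "http://localhost:8080/v1"  # assumption: default mullama serve address

def chat_request(model, user_message):
    """Build an OpenAI-style chat-completions request (not yet sent)."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
    }).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": "Bearer unused",  # a local server ignores the key
        },
    )

req = chat_request("llama3.2:1b", "Hello")
print(req.full_url)  # http://localhost:8080/v1/chat/completions
# To actually send it:
#   with urllib.request.urlopen(req) as resp:
#       reply = json.load(resp)["choices"][0]["message"]["content"]
```

The same request body works from curl, PHP, Go, or any other client shown below.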
Rule of thumb: If your application is the only consumer of the model, embed it. If multiple applications or users need the same model, run a server. If you're on mobile or embedded hardware, embed — there's no server to run.
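The rule of thumb is simple enough to state as code. A toy sketch (the function name and signature are illustrative only, not part of any cognisoc API):

```python
def pick_mode(consumers: int, on_device: bool = False) -> str:
    """Encode the rule of thumb: embed for a single consumer or for
    mobile/embedded hardware; run a server when the model is shared."""
    if on_device:
        return "embedded"  # on-device there is no server to run
    return "embedded" if consumers <= 1 else "server"

print(pick_mode(1))                  # embedded
print(pick_mode(3))                  # server
print(pick_mode(5, on_device=True))  # embedded
```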
Language Guide
Python
Data science, RAG pipelines, backend services, Jupyter notebooks, batch processing
```shell
pip install mullama
```

```python
from mullama import Model, Context

model = Model.load('llama3.2-1b.gguf', n_gpu_layers=32)
ctx = Context(model, n_ctx=4096)

# Direct function call — no HTTP, no server
response = ctx.generate('Explain embeddings in one paragraph')
print(response)

# Streaming
for token in ctx.stream('Write a haiku about Rust:'):
    print(token, end='', flush=True)
```

```python
from openai import OpenAI

# Talk to a running mullama server
client = OpenAI(base_url='http://localhost:8080/v1', api_key='unused')
response = client.chat.completions.create(
    model='llama3.2:1b',
    messages=[{'role': 'user', 'content': 'Hello'}],
)
```

Most Python ML workflows should use embedded mode. Use server mode only when sharing a model across multiple services.
Rust
Systems programming, high-throughput servers, CLI tools, inference infrastructure
```shell
cargo add mullama
```

```rust
use mullama::{Model, Context, ContextParams};

let model = Model::load("llama3.2-1b.gguf")?;
let mut ctx = Context::new(&model, ContextParams {
    n_ctx: 4096,
    n_gpu_layers: 32,
    ..Default::default()
})?;
let response = ctx.generate("What is GGUF?", 256)?;
println!("{}", response);
```

```rust
// For full runtime control, use unillm directly
use unillm::{Model, ModelInputs};

// unillm powers mullama under the hood — use it
// when you need custom scheduling, KV cache tuning,
// or access to 47 architecture implementations
let model = Model::load("llama3.2-1b.gguf")?;
let output = model.generate("Hello", Default::default())?;
```

Use mullama for application-level embedding. Use unillm when building inference infrastructure or when you need runtime control.
Dart / Flutter
Mobile apps (iOS/Android), offline-first experiences, on-device privacy, edge AI
```shell
flutter pub add llamafu
```

```dart
import 'package:llamafu/llamafu.dart';

final llm = await Llamafu.init(
  modelPath: '/data/models/llama3.2-1b-q4_k_m.gguf',
  threads: 4,         // Match device core count
  contextSize: 2048,  // Keep small on mobile (RAM)
);

// On-device inference — works offline, no server
final result = await llm.complete(
  prompt: 'Summarize this document:',
  maxTokens: 256,
  temperature: 0.7,
);

// Vision / multimodal
final vision = await llm.multimodalComplete(
  prompt: 'What is in this photo?',
  mediaInputs: [MediaInput(type: MediaType.image, data: imgPath)],
);

llm.close(); // Free device memory
```
On mobile there is no server — the model runs on the device or it doesn't run. Use Q4_K_M quantization for best quality/speed tradeoff.
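The reason Q4_K_M fits on phones comes down to bits per weight. A back-of-the-envelope sketch (the ~4.85 bits/weight figure is an approximation for Q4_K_M including block scales and metadata, not an exact spec; the helper is ours):

```python
def q4_k_m_weight_size_gb(n_params: float) -> float:
    """Rough weight-memory estimate for a Q4_K_M-quantized model.

    Assumption: Q4_K_M averages ~4.85 effective bits per weight
    once per-block scales and metadata are counted.
    """
    bits_per_weight = 4.85
    return n_params * bits_per_weight / 8 / 1e9

# A 1B-parameter model's weights fit comfortably in phone RAM...
print(f"1B @ Q4_K_M ≈ {q4_k_m_weight_size_gb(1e9):.2f} GB")
# ...while FP16 (16 bits/weight) needs roughly 3x more
print(f"1B @ FP16   ≈ {1e9 * 16 / 8 / 1e9:.2f} GB")
```

Remember to budget extra RAM for the KV cache on top of the weights, which is why the Dart example above keeps contextSize small.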
PHP
WordPress plugins, Laravel apps, web backends, content generation
```shell
composer require skelf-research/mullama
```

```php
use Mullama\Model;
use Mullama\Context;

$model = Model::load('llama3.2-1b.gguf', ['gpu_layers' => 32]);
$ctx = new Context($model, ['n_ctx' => 4096]);

// Direct inference — no HTTP, no external process
$response = $ctx->generate('Write a SQL query to find duplicates');
echo $response;
```

```php
// For short-lived FPM workers, use server mode instead
// (model loading is ~3-5s, too slow per-request)
$client = OpenAI::factory()
    ->withBaseUri('http://localhost:8080/v1')
    ->make();
$response = $client->chat()->create([
    'model' => 'llama3.2:1b',
    'messages' => [['role' => 'user', 'content' => 'Hello']],
]);
```

PHP is one of the most underserved languages for LLM tooling. If your process lives long enough to amortize model loading (~3-5s), embed. For short-lived FPM requests, use server mode.
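The amortization break-even is simple arithmetic. A sketch (the ~3-5s load time and ~1-5ms server overhead come from this page; the function itself is illustrative, not part of mullama):

```python
def amortized_overhead_ms(load_time_s: float, requests_served: int) -> float:
    """Per-request share of the one-time model-load cost, in milliseconds."""
    return load_time_s * 1000 / requests_served

# A long-lived worker spreads a 4s load thin:
print(amortized_overhead_ms(4.0, 10_000))  # 0.4 ms/request, beats server mode
# A worker recycled after 100 requests does not:
print(amortized_overhead_ms(4.0, 100))     # 40 ms/request, use server mode
```

The same math applies to any short-lived runtime, not just PHP-FPM.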
Go
Microservices, CLI tools, API gateways, DevOps tooling
```shell
go get github.com/skelf-research/mullama-go
```

```go
import (
    "fmt"
    "log"

    mullama "github.com/skelf-research/mullama-go"
)

func main() {
    model, err := mullama.LoadModel("llama3.2-1b.gguf",
        mullama.WithGPULayers(32),
    )
    if err != nil {
        log.Fatal(err)
    }
    defer model.Close()

    ctx, _ := model.NewContext(mullama.ContextConfig{
        ContextSize: 4096,
    })
    result, _ := ctx.Generate("What is a goroutine?")
    fmt.Println(result)
}
```

```go
import (
    "context"

    openai "github.com/sashabaranov/go-openai"
)

config := openai.DefaultConfig("unused") // key is ignored by a local server
config.BaseURL = "http://localhost:8080/v1"
client := openai.NewClientWithConfig(config)

resp, _ := client.CreateChatCompletion(context.Background(),
    openai.ChatCompletionRequest{
        Model: "llama3.2:1b",
        Messages: []openai.ChatCompletionMessage{
            {Role: openai.ChatMessageRoleUser, Content: "Hello"},
        },
    },
)
```

Go's fast startup makes it excellent for CLI tools with embedded models. For long-running services, embed if single-tenant; run a server if multi-tenant.
Node.js
Full-stack apps, Electron desktop apps, serverless edge functions, real-time chat
```shell
npm install mullama
```

```javascript
const { Model, Context } = require('mullama');

// Load once at startup
// (top-level await shown for brevity — use ESM or wrap in an async function)
const model = await Model.load('llama3.2-1b.gguf', {
  gpuLayers: 32,
});
const ctx = new Context(model, { contextSize: 4096 });

// Async, non-blocking
const response = await ctx.generate('Explain WebSockets');
console.log(response);

// Streaming
const stream = ctx.stream('Write a poem about JavaScript:');
for await (const token of stream) {
  process.stdout.write(token);
}
```

```javascript
const OpenAI = require('openai');

const client = new OpenAI({
  baseURL: 'http://localhost:8080/v1',
  apiKey: 'unused',
});
const response = await client.chat.completions.create({
  model: 'llama3.2:1b',
  messages: [{ role: 'user', content: 'Hello' }],
});
```

Embedded mode is ideal for Electron apps — ship the model with your app, no server needed, works offline.
C / C++
Embedded systems, game engines, native apps, IoT devices, bare-metal appliances
```shell
# Link against libmullama
```

```c
#include <mullama.h>

mullama_model *model = mullama_load_model("llama3.2-1b.gguf", NULL);
mullama_context *ctx = mullama_new_context(model, NULL);

char output[4096];
mullama_generate(ctx, "Hello, embedded world!", output, sizeof(output));
printf("%s\n", output);

mullama_free_context(ctx);
mullama_free_model(model);

// For the most extreme case — running on bare metal
// with no OS at all — see cllm
```

For maximum control with zero OS overhead, cllm boots directly into an LLM inference server on bare metal.
Zig
Learning ML internals, SIMD research, systems programming, custom inference engines
```shell
git clone https://github.com/cognisoc/zigllm.git
```

```zig
const std = @import("std");
const zigllm = @import("zigllm");

pub fn main() !void {
    var gpa = std.heap.GeneralPurposeAllocator(.{}){};
    const allocator = gpa.allocator();

    const model = try zigllm.Model.load(allocator, "model.gguf");
    defer model.deinit();

    const output = try model.generate("Hello from Zig!", .{
        .max_tokens = 256,
        .temperature = 0.7,
    });
    std.debug.print("{s}\n", .{output});
}
```

zigllm is educational — it teaches you how every layer of inference works, from tensors to text generation. 285+ tests serve as executable documentation.
Decision Matrix
Not sure which mode or tool to use? Find your scenario.
| Scenario | Mode | Tool |
|---|---|---|
| Python data pipeline | Embedded | mullama |
| FastAPI serving multiple models | Server | mullama serve |
| Flutter mobile app | Embedded | llamafu |
| PHP WordPress plugin | Either | mullama |
| Rust CLI tool | Embedded | mullama |
| Rust inference server | Embedded | unillm |
| Go microservice | Embedded | mullama |
| Electron desktop app | Embedded | mullama (Node) |
| Shared team GPU server | Server | mullama serve |
| IoT / embedded system | Embedded | mullama (C) |
| Bare-metal appliance | Embedded | cllm |
| Learning ML internals | Embedded | zigllm |
| Any language, quick prototype | Server | mullama serve |
Open Hardware for LLM Inference
We're not stopping at software. Cognisoc is exploring open hardware reference designs purpose-built for local LLM inference — open schematics, open firmware, designed to run cognisoc software from boot.
Inference Accelerator Boards
Single-board designs with NPUs and RISC-V cores. Run cllm directly on bare metal — no OS, no overhead. Designed for edge deployment where every watt and millisecond counts.
FPGA Accelerator Capes
Reconfigurable hardware for custom quantization formats, novel attention mechanisms, and research workloads. Flash new inference kernels without respinning silicon.
GPU Cluster Blueprints
Rack-mount configurations with optimized networking for distributed inference using unillm's RPC backend. Open BOM, open thermal design, open orchestration.
Why Open Hardware?
The software is ready. We have the runtime (unillm), the server (mullama), the mobile stack (llamafu), and the unikernel (cllm). The missing piece is hardware designed to run this stack natively — not general-purpose servers with inference bolted on.
Vertical integration matters. When you control both the software and the hardware reference design, you can optimize in ways that generic platforms can't: custom memory layouts for KV caches, tuned PCIe topologies for multi-GPU inference, firmware-level model loading.
Open means auditable. For enterprise and government deployments, proprietary hardware is a black box. Open schematics and open firmware mean you can verify what's running — down to the gate level.
Accessible by design. Reference designs lower the barrier for hardware manufacturers worldwide. Any fabricator can produce inference boards that work with the cognisoc stack out of the box — no licensing, no vendor lock-in.
If you're building hardware for AI inference, working on RISC-V or FPGA platforms, or interested in co-developing open reference designs — let's talk.
Ready to embed?
Pick your language, choose embedded or server mode, and start running LLMs locally in minutes.