The LLM Inference Stack: From Silicon to API
LLM inference is a full-stack problem. Most companies solve one layer. Cognisoc is building all five — from bare-metal unikernel to on-device mobile inference.
The Thesis
Large language model inference is not a model problem. It is a systems problem — one that spans silicon, operating systems, runtimes, servers, and client devices. Yet the industry treats it as a single-layer challenge. Cloud providers optimize the server. Chip startups optimize the silicon. App developers wrap an HTTP client around someone else’s API and call it “AI-powered.”
The result is a fragmented stack where no single entity controls the path from hardware interrupt to generated token. Every layer boundary introduces latency, abstraction tax, and lost optimization opportunity.
Cognisoc is building the full inference stack. Five layers, five projects, one coherent architecture. This article maps each layer, explains why vertical integration matters, and makes the case that the future of AI inference is local, private, and fast.
The Five Layers
The LLM inference stack can be decomposed into five distinct layers. Each has its own constraints, its own engineering culture, and — until now — its own isolated set of solutions.
| Layer | Concern | Cognisoc Project | Language |
|---|---|---|---|
| 1. Hardware | Bare-metal execution, zero OS overhead | cllm | C |
| 2. Runtime | Model loading, tensor ops, weight formats | unillm | Rust |
| 3. Server | API compatibility, language bindings, deployment | mullama | Python/Rust |
| 4. Edge/Mobile | On-device inference, privacy, offline | llamafu | Dart (Flutter) |
| 5. Education | Community building, transparency, reproducibility | zigllm | Zig |
Each layer feeds the next. Optimizations at the bottom propagate upward. A faster tensor kernel in Layer 2 means faster completions in Layer 3 and lower latency on Layer 4. This is the compounding advantage of vertical integration.
Layer 1: Hardware — cllm
Most inference servers run on Linux. That means a kernel, a scheduler, a network stack, a filesystem, device drivers for hardware the server will never touch, and thousands of syscall entry points. All of this sits between the model and the metal.
cllm removes the operating system entirely. It is a unikernel: a single-address-space binary that boots via Multiboot, initializes an e1000 network interface through direct PCI enumeration, and enters a packet-processing loop that serves HTTP requests for inference. There is no scheduler because there is only one application. There is no filesystem because weights are mapped directly into memory. There is no kernel/user boundary because the kernel is the application.
+---------------------------------------------------------+
| QEMU / Bare Metal (x86, Multiboot) |
+---------------------------------------------------------+
| boot.S Multiboot entry, stack, serial init |
| kernel.c VGA terminal, serial I/O |
| memory.c Heap allocator (malloc/free) |
| network.c PCI enumeration + e1000 NIC driver |
| http.c HTTP server, request routing |
| api_v1.c llama.cpp-compatible REST API |
| llm.c Model loading and inference |
+---------------------------------------------------------+
The implications for inference are significant. Context switches cost 1-5 microseconds each on Linux. A single token generation can trigger hundreds of them. At scale — thousands of concurrent requests, millions of tokens per minute — those microseconds compound into milliseconds of wasted compute per request. cllm eliminates this entire category of overhead.
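The arithmetic behind that claim can be made concrete. The numbers below are illustrative assumptions drawn from the ranges in the paragraph above, not measurements:

```python
# Back-of-envelope estimate of scheduler overhead per request.
# All figures are illustrative assumptions, not measurements.
switch_cost_us = 3         # midpoint of the 1-5 microsecond range
switches_per_token = 200   # "hundreds" of context switches per token
tokens_per_request = 500   # an assumed completion length

overhead_per_token_us = switch_cost_us * switches_per_token
overhead_per_request_ms = overhead_per_token_us * tokens_per_request / 1000
print(f"{overhead_per_request_ms:.0f} ms lost to context switches per request")
```

Under these assumptions, each request spends hundreds of milliseconds purely on scheduling. That is compute the unikernel never spends, because there is nothing to switch to.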
Status: Boots on x86 hardware and QEMU, serves HTTP, exposes llama.cpp-compatible API endpoints. GPU passthrough and streaming generation are on the roadmap.
Layer 2: Runtime — unillm
A runtime must answer two questions: which models can it run, and how efficiently can it run them?
unillm answers the first question with 47 model architectures — LLaMA, Qwen, Gemma, Phi, DeepSeek, Mistral, GPT-2, Whisper, BERT, StarCoder, Mamba, and dozens more. It answers the second with a three-layer design that separates concerns cleanly:
- TensorCore — Device-agnostic tensor operations (CPU, CUDA, Metal)
- ModelCore — A universal Model trait with forward() and generate(), configured via the model_config! macro
- WeightLoaderCore — Format-agnostic weight loading for SafeTensors, GGUF, and PyTorch files
This separation is critical. When a new weight format emerges, only WeightLoaderCore changes. When a new hardware backend ships, only TensorCore changes. When a new architecture drops on Hugging Face, a developer adds a new ModelCore implementation without touching the rest of the stack.
model_config!(LlamaConfig {
vocab_size: usize = 32000,
hidden_size: usize = 4096,
num_hidden_layers: usize = 32,
});
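The weight-format isolation described above can be sketched in a few lines. This is an illustrative Python analogue of the WeightLoaderCore idea, not unillm's actual API; the function name and format table are assumptions:

```python
from pathlib import Path

# Illustrative analogue of format-agnostic loading: the decision of which
# format a checkpoint uses lives in exactly one place, so supporting a new
# format means touching only this table. Names here are hypothetical.
FORMATS = {
    ".safetensors": "safetensors",
    ".gguf": "gguf",
    ".pt": "pytorch",
    ".bin": "pytorch",
}

def detect_format(path: str) -> str:
    suffix = Path(path).suffix.lower()
    if suffix not in FORMATS:
        raise ValueError(f"unsupported weight format: {suffix}")
    return FORMATS[suffix]
```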
Production features — hybrid KV caching (RadixAttention + PagedAttention), continuous batching, request scheduling — are built into the runtime, not bolted on as afterthoughts. This is a Rust codebase with compile-time guarantees, not a Python wrapper around C++ with hopes and prayers.
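The paged half of that hybrid cache is easy to illustrate. The sketch below shows the core PagedAttention idea in schematic Python (class and method names are hypothetical, not unillm's API): the KV cache lives in fixed-size physical blocks, and each sequence keeps a block table mapping its logical positions to those blocks.

```python
# Schematic PagedAttention bookkeeping: fixed-size physical blocks plus a
# per-sequence block table. Tensor storage itself is omitted; this only
# shows the allocation logic. Illustrative, not unillm's implementation.
BLOCK_SIZE = 16

class PagedKVCache:
    def __init__(self, num_blocks: int = 1024):
        self.free_blocks = list(range(num_blocks))  # pool of physical block ids
        self.block_tables = {}                      # seq_id -> [block ids]
        self.lengths = {}                           # seq_id -> tokens cached

    def append_token(self, seq_id: int) -> int:
        """Reserve cache space for one new token; return its physical block."""
        table = self.block_tables.setdefault(seq_id, [])
        length = self.lengths.get(seq_id, 0)
        if length % BLOCK_SIZE == 0:   # current block full (or first token)
            table.append(self.free_blocks.pop())
        self.lengths[seq_id] = length + 1
        return table[-1]
```

Because blocks are allocated on demand rather than reserved up front for a maximum context length, memory fragmentation drops and far more sequences fit in the same VRAM.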
47 architectures. Three weight formats. Compile-time type safety. One runtime.
Layer 3: Server — mullama
The runtime handles inference. The server handles everything else: API compatibility, language bindings, model management, and deployment.
mullama is a drop-in Ollama replacement — same CLI commands, same Modelfile format, same model registry — but with a fundamentally different architecture. Where Ollama exposes models through HTTP only, mullama provides native bindings for six languages:
| Language | Install | In-Process |
|---|---|---|
| Python | pip install mullama | Yes |
| Node.js | npm install mullama | Yes |
| Rust | cargo add mullama | Yes |
| Go | go get github.com/... | Yes |
| PHP | composer require ... | Yes |
| C/C++ | Link directly | Yes |
“In-process” means no separate daemon, no HTTP serialization overhead, no network round-trip to localhost. A Python application imports mullama and calls ctx.generate() — the model runs in the same process space. For latency-sensitive applications (autocomplete, real-time agents, game NPCs), this is the difference between usable and unusable.
mullama also exposes OpenAI-compatible and Anthropic-compatible API endpoints. Existing codebases that target GPT-4 or Claude can point at a local mullama server and run locally with zero code changes. Seven GPU backends (CUDA, Metal, ROCm, OpenCL, Vulkan, SYCL, RPC) ensure the server runs on virtually any hardware.
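The "zero code changes" claim follows from wire-format compatibility. The sketch below builds a standard OpenAI-style chat-completions request; the base URL, port, and model name are assumptions for illustration, and no network call is made here:

```python
import json

# The request body is the standard OpenAI chat-completions shape, so
# existing client code stays untouched; only the base URL is repointed
# at the local server. URL, port, and model name below are assumptions.
base_url = "http://localhost:11434/v1"
endpoint = f"{base_url}/chat/completions"
payload = {
    "model": "llama3",
    "messages": [{"role": "user", "content": "Summarize this document."}],
    "stream": False,
}
request_body = json.dumps(payload)
```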
Layer 4: Edge/Mobile — llamafu
Cloud inference has three structural problems that no amount of optimization can fix: latency (the speed of light is not negotiable), privacy (data leaves the device), and cost (per-token billing scales linearly with usage).
llamafu is a Flutter FFI plugin that runs inference entirely on-device. No API keys, no network calls, no data exfiltration. It supports Android (API 21+) and iOS (12.0+) with native code paths optimized for mobile hardware.
| Factor | Cloud API | llamafu (On-Device) |
|---|---|---|
| Privacy | Data sent to servers | Data never leaves device |
| Latency | 200-2000ms round trip | Sub-100ms |
| Cost | Per-token billing | One-time compute |
| Offline | Requires internet | Works anywhere |
| Control | Vendor-dependent | You own the stack |
The feature set goes beyond basic text completion. llamafu supports vision models (LLaVA, Qwen2-VL), tool calling with JSON schema validation, LoRA hot-swapping at runtime, and real-time token streaming — all running on the user’s phone, offline.
This is not a demo. This is the production-grade mobile inference layer that makes “AI on every device” a deployable reality.
Layer 5: Education — zigllm
An inference stack is only as strong as the community that builds on it. Proprietary, opaque systems create vendor lock-in. Open, understandable systems create ecosystems.
zigllm is an LLM implementation designed to teach. It supports 18 model families (LLaMA, Mistral, GPT-2, Falcon, Mamba, BERT, Gemma, StarCoder, and more) across a progressive six-layer architecture:
6. Inference Text generation, sampling, KV caching
5. Models 18 architectures, GGUF loading, tokenization
4. Transformers Multi-head attention, feed-forward networks
3. Neural Primitives SwiGLU, GELU, RMSNorm, RoPE
2. Linear Algebra SIMD matrix ops, 18+ quantization formats
1. Foundation Tensors, memory management, memory mapping
285+ tests serve as executable documentation. Each test demonstrates a concept and validates the math. A developer who works through zigllm from Layer 1 to Layer 6 emerges with a deep understanding of how LLM inference actually works — not just how to call an API.
Zig was chosen deliberately: comptime generics, first-class SIMD, manual memory control, no garbage collector. It forces the developer to understand every allocation, every memory layout decision, every vectorization opportunity. This is education through engineering constraints.
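As an example of what Layer 3 teaches, here is RMSNorm written out in plain Python (zigllm implements it in Zig with SIMD; the learned per-channel weights are omitted here for brevity):

```python
import math

# RMSNorm, one of the Layer-3 primitives: divide each element by the
# root-mean-square of the vector, with a small epsilon for stability.
def rms_norm(x: list[float], eps: float = 1e-6) -> list[float]:
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]
```

The output has unit root-mean-square, which is what keeps activations in a stable range from layer to layer.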
Why Vertical Integration Matters
The dominant model in AI infrastructure today is horizontal specialization. NVIDIA builds chips. vLLM builds runtimes. Ollama builds servers. App developers call APIs. Each company optimizes its layer and hopes the layers above and below cooperate.
This model breaks down in three specific ways:
1. Lost optimization opportunities. When the runtime knows the hardware topology and the server knows the request pattern, end-to-end optimizations become possible: batching strategies tuned to memory hierarchy, KV cache policies informed by API-level context windows, weight layouts optimized for the specific silicon. Cross-layer optimization requires cross-layer ownership.
2. Inconsistent experience. A model that runs well in vLLM may behave differently in llama.cpp, produce different outputs through Ollama versus a direct API, and fail entirely on mobile. Vertical integration guarantees that a model behaves identically from bare metal to phone.
3. Fragile dependencies. Every layer boundary is a potential breaking point. A runtime update breaks the server. A weight format change breaks the loader. A new model architecture requires coordinated changes across three different open-source projects maintained by three different teams. A single organization controlling the full stack can ship coordinated updates.
The Market Shift
The inference market is undergoing a structural transition. From 2022 to 2024, cloud API was the default. Developers called OpenAI, paid per token, and accepted the latency, privacy, and vendor-lock tradeoffs.
That default is changing. Three forces are driving the shift:
Regulation. GDPR, the EU AI Act, and sector-specific rules in healthcare and finance increasingly require that sensitive data not leave organizational boundaries. Cloud inference of private data is becoming a compliance liability.
Economics. At scale, per-token API pricing is brutal. A company processing millions of documents per day can reduce inference costs by 10-50x by running models locally on owned hardware. The capital expenditure pays for itself within months.
Capability parity. Open-weight models (LLaMA 3, Mistral, Qwen, DeepSeek) now match or exceed proprietary models on most benchmarks. The quality gap that justified cloud API lock-in has narrowed to the point of irrelevance for the majority of production use cases.
The market is moving from “cloud-only” to “hybrid and local.” The winners will be the companies that make local inference as easy, reliable, and performant as calling an API. That requires solving the full stack.
Cognisoc vs. The Field
Most companies in the inference space solve one layer well and ignore the rest.
| Capability | Cognisoc | Ollama | vLLM | llama.cpp | NVIDIA TensorRT |
|---|---|---|---|---|---|
| Bare-metal / unikernel | cllm | — | — | — | — |
| Multi-arch runtime | unillm (47) | — | Yes | Yes | Yes |
| Local server + bindings | mullama (6 langs) | HTTP only | HTTP only | C API | HTTP only |
| Mobile / on-device | llamafu | — | — | Partial | — |
| Educational implementation | zigllm | — | — | — | — |
| Weight format abstraction | SafeTensors + GGUF + PyTorch | GGUF | Multiple | GGUF | Proprietary |
| API compatibility | OpenAI + Anthropic | OpenAI | OpenAI | — | — |
No other organization covers all five layers. This is not a coincidence — it is a strategy. Each project reinforces the others. unillm’s runtime powers mullama’s server. mullama’s model management feeds llamafu’s on-device deployment. zigllm’s educational materials grow the contributor pipeline for all four production projects. cllm’s bare-metal research pushes the performance ceiling that every other layer benefits from.
The Cognisoc Thesis
Every device should be able to run AI locally. Not as a novelty. Not as a demo. As the default mode of operation.
This requires solving inference at every layer of the stack — from the hardware interrupt that receives a network packet, through the tensor operations that transform embeddings, to the Flutter widget that streams tokens to a user’s screen. Half-measures produce half-products.
The companies that will define the next decade of AI infrastructure are not the ones building better chatbots. They are the ones building better systems — systems where the hardware, runtime, server, and client work as a single coherent machine.
That is what Cognisoc is building. Five layers. Five projects. One stack.