# unillm
A modular LLM inference runtime written in Rust. Supports 47 model architectures through three composable abstractions: TensorCore, ModelCore, and WeightLoaderCore.
## Key Features
- 🏗️ **47 Model Architectures:** LLaMA, Qwen, Gemma, Phi, DeepSeek, Mistral, GPT-2, Whisper, BERT, and more.
- 📐 **Three-Layer Design:** TensorCore, ModelCore, WeightLoaderCore — composable abstractions for clean extensibility.
- 📦 **Multi-Format Weights:** Load SafeTensors, GGUF, and PyTorch files through a unified interface.
- 🧠 **Hybrid KV Cache:** RadixAttention + PagedAttention for efficient memory management during inference.
- ⚡ **Continuous Batching:** Request scheduling with continuous batching for production throughput.
- 🔒 **Type-Safe Rust:** Full compile-time guarantees with the `model_config!` macro system.
## Quick Start

```sh
git clone https://github.com/cognisoc/unillm.git
cd unillm

# Generate text (downloads TinyLlama on first run)
cargo run --bin unillm -p unillm-runtime -- generate --prompt "Explain gravity"

# Use a different model
cargo run --bin unillm -p unillm-runtime -- generate --model llama2:7b --prompt "Hello"

# List cached models
cargo run --bin unillm -p unillm-runtime -- models
```
## Architecture
UniLLM is organized into three composable layers:
- **TensorCore** — Device-agnostic tensor operations (CPU, CUDA, Metal). All ops go through `ops_fn::operation()`.
- **ModelCore** — Universal `Model` trait with `forward()` and `generate()`. Configuration via the `model_config!` macro.
- **WeightLoaderCore** — Format-agnostic weight loading for SafeTensors, GGUF, and PyTorch files.
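The layering can be sketched with minimal traits. This is an illustrative sketch only — the trait and type names below (`TensorOps`, `WeightLoader`, `TinyModel`, `CpuOps`, `ConstLoader`) are stand-ins, not the actual unillm API:

```rust
/// TensorCore layer (sketch): device-agnostic ops behind one dispatch point.
trait TensorOps {
    fn matmul(&self, a: &[f32], b: &[f32]) -> Vec<f32>;
}

struct CpuOps;
impl TensorOps for CpuOps {
    fn matmul(&self, a: &[f32], b: &[f32]) -> Vec<f32> {
        // 1x1 "matmul" placeholder to keep the sketch tiny.
        vec![a[0] * b[0]]
    }
}

/// WeightLoaderCore layer (sketch): format-agnostic weight loading.
trait WeightLoader {
    fn load(&self, name: &str) -> Vec<f32>;
}

struct ConstLoader;
impl WeightLoader for ConstLoader {
    fn load(&self, _name: &str) -> Vec<f32> {
        vec![2.0]
    }
}

/// ModelCore layer (sketch): a model composes tensor ops and loaded weights.
struct TinyModel<T: TensorOps> {
    ops: T,
    weight: Vec<f32>,
}

impl<T: TensorOps> TinyModel<T> {
    fn forward(&self, input: &[f32]) -> Vec<f32> {
        self.ops.matmul(input, &self.weight)
    }
}

fn main() {
    // Wire the three layers together: load weights, build a model on CPU ops.
    let weight = ConstLoader.load("w");
    let model = TinyModel { ops: CpuOps, weight };
    let out = model.forward(&[3.0]);
    println!("{out:?}"); // [6.0]
}
```

The point of the split is that each layer can vary independently: swapping `CpuOps` for a CUDA backend or `ConstLoader` for a GGUF reader leaves the model code untouched.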
## Adding a Model
```rust
model_config!(MyModelConfig {
    vocab_size: usize = 32000,
    hidden_size: usize = 4096,
    num_hidden_layers: usize = 32,
});

impl Model for MyModel {
    type Config = MyModelConfig;

    fn forward(&self, inputs: &ModelInputs) -> Result<ModelOutputs> {
        // model-specific forward pass
        todo!()
    }
}
```
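To make the config side concrete, here is a hand-rolled sketch of the idea behind a config-generating macro — fields with declared types and defaults expanding to a plain struct. This is an assumption for illustration; the real `model_config!` in unillm may generate more (validation, serde derives, etc.):

```rust
// Illustrative macro in the spirit of model_config!: each `field: type = default`
// entry becomes a public struct field, and the defaults feed a Default impl.
macro_rules! model_config {
    ($name:ident { $($field:ident : $ty:ty = $default:expr),* $(,)? }) => {
        #[derive(Debug, Clone)]
        pub struct $name {
            $(pub $field: $ty),*
        }

        impl Default for $name {
            fn default() -> Self {
                Self { $($field: $default),* }
            }
        }
    };
}

model_config!(MyModelConfig {
    vocab_size: usize = 32000,
    hidden_size: usize = 4096,
    num_hidden_layers: usize = 32,
});

fn main() {
    // Defaults come from the macro invocation; override fields as needed.
    let cfg = MyModelConfig { hidden_size: 2048, ..Default::default() };
    println!("{cfg:?}");
}
```

Because the config is an ordinary struct, it gets the usual compile-time guarantees: a typo in a field name or a wrong type is a build error, not a runtime failure.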
## Supported Models
| Category | Models |
|---|---|
| Core LLMs | LLaMA, Qwen, Gemma, Phi, DeepSeek, Mistral, Mixtral |
| GPT Family | GPT-2, GPT-J, GPT-NeoX, OPT, BLOOM, MPT |
| Code | StarCoder, CodeLlama |
| MoE | DeepSeek-MoE, DBRX, Grok, Arctic, Jamba |
| RWKV | RWKV-4, RWKV-6, RecurrentGemma |
| Vision-Language | Qwen2-VL, Phi-3-Vision, InternVL, CogVLM, LLaVA, CLIP |
| Audio / Speech | Wav2Vec2, HuBERT, MusicGen, Encodec, Whisper |
| Encoder | BERT, T5 |
| Specialized | Mamba, MiniCPM, OLMo, Granite |
## Project Structure
```
crates/
  runtime/     Core inference runtime (tensor ops, model trait, weight loading)
  inference/   High-level inference engine and batching
  kv/          Hybrid KV cache (RadixAttention + PagedAttention)
  scheduler/   Request scheduling with continuous batching
```