
Run LLMs on Flutter and Dart: Complete Guide to On-Device AI

How to run large language models locally on iOS and Android using llamafu, a Flutter FFI plugin built on llama.cpp. Covers text generation, chat, vision, tool calling, and performance tuning.

Tags: Flutter, Dart, LLM, On-Device AI, Mobile, llama.cpp

The Short Answer

llamafu is a Flutter FFI plugin that runs GGUF-format LLMs directly on iOS and Android. No server, no API key, no network calls. It wraps llama.cpp through Dart FFI and supports text generation, chat completions, streaming, embeddings, vision/multimodal, tool calling, LoRA adapters, and grammar-constrained output.

flutter pub add llamafu

That is the entire install. The rest of this post covers everything you need to go from that command to a working on-device AI feature.

Why On-Device Inference

Before writing any code, it is worth understanding why you would run a model on the phone instead of calling an API.

Privacy. User data never leaves the device. There is no server to breach, no logs to subpoena, no third-party DPA to negotiate. For health, finance, or enterprise apps, this can be the difference between shipping and not shipping.

Latency. A cloud API round trip is 200-2000ms before the first token. On-device inference on a modern phone starts generating in under 100ms. For autocomplete, inline suggestions, or real-time translation, that gap matters.

Cost. Cloud LLM APIs bill per token. On-device inference costs zero at runtime. If your app generates thousands of completions per user per day, the math is straightforward.

Offline capability. The model runs without a network connection. This matters on planes, in basements, in rural areas, and in any country where connectivity is not guaranteed.

Control. No rate limits, no deprecation notices, no vendor lock-in. You own the entire inference stack.

Platform Requirements

| Platform | Minimum Version |
| --- | --- |
| Flutter | 3.10.0+ |
| Dart SDK | 3.1.0+ |
| Android | API 21+ (Android 5.0), NDK 21+ |
| iOS | 12.0+, Xcode 14+ |

No special Gradle or Podfile configuration is required. The plugin handles native library compilation through Flutter’s standard FFI build system.

Choosing a Model and Quantization

llamafu loads models in the GGUF format (the standard for llama.cpp). You can find thousands of GGUF models on Hugging Face.

For mobile, quantization is critical. A full FP16 7B model is ~14 GB --- too large for most phones. Quantized versions trade a small amount of quality for massive size and speed improvements.

| Quantization | Size (7B model) | Quality | Speed | Recommended For |
| --- | --- | --- | --- | --- |
| Q2_K | ~2.7 GB | Low | Fastest | Older devices, quick prototyping |
| Q4_K_M | ~4.1 GB | Good | Fast | General mobile use (recommended) |
| Q5_K_M | ~4.8 GB | Better | Medium | Devices with 6+ GB RAM |
| Q8_0 | ~7.0 GB | Best | Slower | High-end devices, quality-critical |

Q4_K_M is the sweet spot for mobile. It fits comfortably in RAM on most modern phones while retaining strong output quality. Start there unless you have a specific reason not to.

Models to try first:

  • Qwen2.5-3B-Instruct-GGUF (Q4_K_M ~2 GB) --- fast, good quality for its size
  • Llama-3.2-3B-Instruct-GGUF (Q4_K_M ~2 GB) --- strong instruction following
  • Phi-3.5-mini-instruct-GGUF (Q4_K_M ~2.2 GB) --- good reasoning for a small model

Initialization

import 'package:llamafu/llamafu.dart';

final llamafu = await Llamafu.init(
  modelPath: '/path/to/qwen2.5-3b-instruct-q4_k_m.gguf',
  threads: 4,
  contextSize: 2048,
);

modelPath is the absolute path to the GGUF file on the device filesystem. In practice, you will either bundle the model as an asset and copy it to the app’s documents directory on first launch, or download it at runtime.
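A sketch of the copy-on-first-launch approach, assuming the standard path_provider package and a model registered under assets/models/ in pubspec.yaml (the asset location and the ensureModelOnDisk helper name are illustrative, not part of llamafu):

```dart
import 'dart:io';

import 'package:flutter/services.dart' show rootBundle;
import 'package:path_provider/path_provider.dart';

/// Copies a bundled GGUF asset to the documents directory on first launch
/// and returns an absolute path suitable for Llamafu.init's modelPath.
Future<String> ensureModelOnDisk(String fileName) async {
  final dir = await getApplicationDocumentsDirectory();
  final file = File('${dir.path}/$fileName');
  if (!await file.exists()) {
    final data = await rootBundle.load('assets/models/$fileName');
    await file.writeAsBytes(
      data.buffer.asUint8List(data.offsetInBytes, data.lengthInBytes),
      flush: true,
    );
  }
  return file.path;
}
```

For multi-gigabyte models, a runtime download is usually the better choice --- bundling inflates the store package, and rootBundle.load reads the whole asset into memory at once.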

threads controls how many CPU threads are used for inference. A good default is the number of performance cores on the device. On most modern phones, 4 is a safe choice. Setting this too high (e.g., 8 on a 4-core device) will actually slow things down due to context switching.

contextSize is the maximum number of tokens in the context window. Larger values use more RAM. For mobile, 2048 is a practical default. Increase to 4096 if you need longer conversations and have the memory budget.

Always call llamafu.close() when you are done to free native memory.

Text Generation

The simplest operation: give it a prompt, get text back.

final result = await llamafu.complete(
  prompt: 'Write a Dart function that reverses a linked list:',
  maxTokens: 512,
  temperature: 0.7,
  topP: 0.9,
);

print(result);

temperature controls randomness. Lower values (0.1-0.3) produce more deterministic output; higher values (0.7-1.0) produce more creative output. For code generation, 0.2-0.4 tends to work well. For creative writing, 0.7-0.9.

topP (nucleus sampling) clips the probability distribution. 0.9 means the model considers tokens in the top 90% of probability mass. Combined with temperature, these two parameters give you fine-grained control over output style.

Streaming

For chat interfaces, you want tokens to appear as they are generated rather than waiting for the entire response.

final stream = llamafu.completeStream(
  prompt: 'Explain how Flutter renders widgets:',
  maxTokens: 256,
  temperature: 0.7,
);

await for (final token in stream) {
  stdout.write(token); // prints token by token
}

completeStream returns a Stream<String>; each event is one or more tokens. Do not accumulate tokens inside a StreamBuilder's builder callback: the builder can re-run for the same snapshot on unrelated rebuilds and duplicate text. Instead, fold the token stream into a stream of running totals and render the latest value directly:

Stream<String> accumulate(Stream<String> tokens) async* {
  final buffer = StringBuffer();
  await for (final token in tokens) {
    buffer.write(token);
    yield buffer.toString(); // full text so far
  }
}

// Create the stream once (e.g., as a field or in initState), not in build(),
// so rebuilds do not restart generation.
late final Stream<String> _stream = accumulate(_llamafu.completeStream(
  prompt: prompt,
  maxTokens: 512,
  temperature: 0.7,
));

StreamBuilder<String>(
  stream: _stream,
  builder: (context, snapshot) => Text(snapshot.data ?? ''),
)

This gives you the “typewriter” effect users expect from chat interfaces, with tokens appearing in real time as the model generates them.

Chat Completions with Conversation History

For multi-turn conversations, use chatComplete with a list of messages. llamafu handles the chat template formatting for you based on the model’s metadata.

final messages = <ChatMessage>[
  ChatMessage.system('You are a helpful Dart programming assistant.'),
  ChatMessage.user('What is the difference between final and const in Dart?'),
];

final response = await llamafu.chatComplete(
  messages: messages,
  maxTokens: 512,
  temperature: 0.7,
);

print(response.content);

// Continue the conversation
messages.add(ChatMessage.assistant(response.content));
messages.add(ChatMessage.user('Show me an example where const matters for performance.'));

final followUp = await llamafu.chatComplete(
  messages: messages,
  maxTokens: 512,
  temperature: 0.7,
);

print(followUp.content);

Each call to chatComplete sends the full message history. The model does not retain state between calls --- you manage the conversation context yourself. This is the same pattern used by cloud APIs, so it should feel familiar.

Watch your context window. Each message consumes tokens. If the conversation grows past your contextSize, you will need to truncate older messages or summarize them.
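One simple truncation policy is to drop the oldest turns while keeping the system prompt and the latest message. A rough sketch over plain strings (estimateTokens and truncateHistory are illustrative helpers, not llamafu APIs, and the 4-characters-per-token heuristic is an approximation --- exact counts require the model's tokenizer):

```dart
/// Rough heuristic: ~4 characters per token for English text.
int estimateTokens(String text) => (text.length / 4).ceil();

/// Drops the oldest non-system turns until the history fits within
/// [contextSize] minus [replyBudget] tokens reserved for the response.
List<String> truncateHistory(
  List<String> turns, {
  required int contextSize,
  int replyBudget = 512,
}) {
  final kept = List<String>.from(turns);
  int total() => kept.fold(0, (sum, t) => sum + estimateTokens(t));
  // Index 0 is the system prompt; always keep it and the latest turn.
  while (kept.length > 2 && total() > contextSize - replyBudget) {
    kept.removeAt(1);
  }
  return kept;
}
```

Summarizing the dropped turns into one condensed message is a common refinement when older context still matters.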

Vision and Multimodal

llamafu supports vision models like LLaVA and Qwen2-VL. These models can process images alongside text prompts.

You need two files: the main model GGUF and a multimodal projector (mmproj) GGUF.

final llamafu = await Llamafu.init(
  modelPath: '/path/to/llava-v1.6-mistral-7b-q4_k_m.gguf',
  mmprojPath: '/path/to/llava-v1.6-mistral-7b-mmproj-f16.gguf',
  threads: 4,
  contextSize: 4096, // vision models benefit from larger context
);

final result = await llamafu.multimodalComplete(
  prompt: 'What objects are in this image? List them.',
  mediaInputs: [
    MediaInput(type: MediaType.image, data: '/path/to/photo.jpg'),
  ],
  maxTokens: 256,
  temperature: 0.3,
);

print(result);

The data field in MediaInput takes a file path to a JPEG or PNG image. The image is preprocessed by the mmproj model and encoded into the context alongside the text prompt.

Practical uses: document scanning, receipt parsing, accessibility descriptions, visual question answering, and any feature where users point a camera at something and expect an answer.

Note that vision models are larger and slower than text-only models. Budget for higher RAM usage and longer generation times.

Tool Calling / Function Calling

llamafu supports structured tool calling, allowing the model to request function invocations with validated JSON arguments.

import 'dart:convert'; // for jsonEncode

final tools = <Tool>[
  Tool(
    name: 'get_weather',
    description: 'Get current weather for a city',
    parameters: {
      'type': 'object',
      'properties': {
        'location': {
          'type': 'string',
          'description': 'City name',
        },
        'unit': {
          'type': 'string',
          'enum': ['celsius', 'fahrenheit'],
        },
      },
      'required': ['location'],
    },
  ),
];

final messages = <ChatMessage>[
  ChatMessage.system('You have access to tools. Use them when needed.'),
  ChatMessage.user('What is the weather in Tokyo and New York?'),
];

final result = await llamafu.chatComplete(
  messages: messages,
  tools: tools,
);

// Handle tool calls
if (result.toolCalls != null) {
  for (final call in result.toolCalls!) {
    print('Function: ${call.name}');
    print('Arguments: ${call.arguments}');

    // Execute the function locally, then feed the result back
    final weatherData = await fetchWeather(
      call.arguments['location'],
      call.arguments['unit'] ?? 'celsius',
    );

    // Continue the conversation with the tool result
    messages.add(ChatMessage.tool(
      toolCallId: call.id,
      content: jsonEncode(weatherData),
    ));
  }

  // Get the final response incorporating tool results
  final finalResponse = await llamafu.chatComplete(
    messages: messages,
    tools: tools,
  );
  print(finalResponse.content);
}

The model outputs structured JSON that maps to your tool definitions. You execute the function locally, feed the result back as a tool message, and let the model synthesize a natural-language response. This pattern lets on-device models interact with device APIs, sensors, databases, and any other local resource.

Tool calling works best with instruction-tuned models that have been trained on function-calling datasets. Qwen2.5-Instruct and Llama-3.2-Instruct both handle this well.

LoRA Adapters

You can load LoRA adapters at runtime to specialize a base model without carrying multiple full-size models.

final llamafu = await Llamafu.init(
  modelPath: '/path/to/base-model.gguf',
  loraPath: '/path/to/medical-lora.gguf',
  loraScale: 0.8, // blend factor, 0.0 = base only, 1.0 = full adapter
  threads: 4,
  contextSize: 2048,
);

This is useful when you want one base model with multiple domain-specific fine-tunes (medical, legal, customer support). Ship the base model once, download small LoRA files (~50-200 MB) as needed.

Performance Tips

Thread count. Match performance cores, not total cores. On a Snapdragon 8 Gen 3, use 4 (the performance cluster), not 8. On Apple A17, use 2 performance cores. Profile on real devices.

Context size. Every doubling of context size roughly doubles memory usage for KV cache. Start with 2048. Only increase if your use case requires it.
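That rule of thumb follows from the KV-cache size: roughly 2 (keys and values) × layers × context tokens × KV heads × head dimension × bytes per element. A sketch using illustrative Llama-3.2-3B-class dimensions (28 layers, 8 KV heads, head dimension 128 --- check your model's metadata for the real values):

```dart
/// Estimated KV-cache size in bytes, assuming an F16 cache by default.
int kvCacheBytes({
  required int layers,
  required int contextSize,
  required int kvHeads,
  required int headDim,
  int bytesPerElement = 2, // F16
}) =>
    2 * layers * contextSize * kvHeads * headDim * bytesPerElement;

// With the illustrative dimensions above:
//   2048-token context: 2 * 28 * 2048 * 8 * 128 * 2 bytes = 224 MiB
//   4096-token context: 448 MiB --- doubling context doubles the cache
```

This is on top of the model weights themselves, which is why a 4096-token window can be the difference between fitting and not fitting on a 6 GB device.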

Quantization. Q4_K_M is the best general choice. If you are seeing quality issues in a specific domain, try Q5_K_M before jumping to Q8_0. The size difference between Q4 and Q5 is much smaller than the jump to Q8.

Model size. On current mobile hardware (2025-2026), 1B-3B parameter models offer the best balance of quality and speed. 7B models work but are noticeably slower. Anything above 7B is impractical for most phones.

Memory management. Call llamafu.close() when inference is not needed. The model occupies significant RAM even when idle. In a Flutter app, consider loading and unloading based on lifecycle events.
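A sketch of lifecycle-driven unloading with WidgetsBindingObserver (the reload-on-resume policy and the model path are illustrative design choices, not llamafu requirements; eager reloading trades startup latency for responsiveness):

```dart
import 'package:flutter/widgets.dart';
import 'package:llamafu/llamafu.dart';

class ChatScreen extends StatefulWidget {
  const ChatScreen({super.key});

  @override
  State<ChatScreen> createState() => _ChatScreenState();
}

class _ChatScreenState extends State<ChatScreen> with WidgetsBindingObserver {
  Llamafu? _llamafu;

  @override
  void initState() {
    super.initState();
    WidgetsBinding.instance.addObserver(this);
    _load();
  }

  Future<void> _load() async {
    _llamafu ??= await Llamafu.init(
      modelPath: '/data/models/qwen2.5-3b-instruct-q4_k_m.gguf',
      threads: 4,
      contextSize: 2048,
    );
  }

  @override
  void didChangeAppLifecycleState(AppLifecycleState state) {
    if (state == AppLifecycleState.paused) {
      // Free native model memory while the app is backgrounded.
      _llamafu?.close();
      _llamafu = null;
    } else if (state == AppLifecycleState.resumed) {
      _load();
    }
  }

  @override
  void dispose() {
    WidgetsBinding.instance.removeObserver(this);
    _llamafu?.close();
    super.dispose();
  }

  @override
  Widget build(BuildContext context) => const SizedBox.shrink(); // UI omitted
}
```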

Batch processing. If you need to run multiple unrelated completions, run them sequentially rather than in parallel. A single llamafu instance uses all allocated threads; running two instances simultaneously will cause thread contention.
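The sequential pattern is just an awaited loop over one instance (completeAll is an illustrative helper; complete is the API shown earlier):

```dart
import 'package:llamafu/llamafu.dart';

/// Runs prompts one at a time so each completion gets the full thread
/// allocation; parallel calls or a second instance would contend for cores.
Future<List<String>> completeAll(Llamafu llamafu, List<String> prompts) async {
  final results = <String>[];
  for (final prompt in prompts) {
    results.add(await llamafu.complete(
      prompt: prompt,
      maxTokens: 256,
      temperature: 0.7,
    ));
  }
  return results;
}
```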

Comparison: Mobile LLM Solutions

| Feature | llamafu | MediaPipe LLM | ONNX Runtime Mobile | executorch |
| --- | --- | --- | --- | --- |
| Framework | Flutter/Dart | Android (Kotlin/Java) | Cross-platform (C++) | PyTorch Mobile |
| Model Format | GGUF | TFLite | ONNX | PTE |
| Quantization Options | Q2-Q8, F16 | 4-bit, 8-bit | Various | Various |
| Streaming | Yes | Yes | Manual | Manual |
| Vision/Multimodal | Yes (LLaVA, Qwen2-VL) | Limited | Model-dependent | Model-dependent |
| Tool Calling | Built-in | No | No | No |
| LoRA Support | Yes (hot-swap) | No | No | Limited |
| Chat Templates | Auto-detected | Manual | Manual | Manual |
| Grammar Constraints | Yes | No | No | No |
| iOS + Android | Yes | Android only | Yes | Yes |

llamafu’s main advantage for Flutter developers is obvious: it is a first-class Dart package. No platform channels, no method channel serialization overhead, no separate native codebases. FFI calls go directly from Dart to the llama.cpp C library.

Putting It Together

Here is a minimal but complete example: a function that initializes a model, runs a streaming chat, and cleans up.

import 'package:llamafu/llamafu.dart';

Future<void> runChat() async {
  final llamafu = await Llamafu.init(
    modelPath: '/data/models/qwen2.5-3b-instruct-q4_k_m.gguf',
    threads: 4,
    contextSize: 2048,
  );

  final messages = <ChatMessage>[
    ChatMessage.system('You are a concise technical assistant.'),
    ChatMessage.user('How does Dart implement isolates under the hood?'),
  ];

  final stream = llamafu.chatCompleteStream(
    messages: messages,
    maxTokens: 512,
    temperature: 0.4,
  );

  final buffer = StringBuffer();
  await for (final token in stream) {
    stdout.write(token);
    buffer.write(token);
  }

  messages.add(ChatMessage.assistant(buffer.toString()));
  // messages list now has full history for follow-up turns

  llamafu.close();
}

The full API reference and additional examples are on GitHub. File issues there if you hit device-specific problems --- the matrix of Android OEMs and SoCs means edge cases are inevitable.