Sharing observations from running LocalAI alongside the Nextcloud Assistant + Context Agent stack on consumer hardware (RTX 3090, 24GB VRAM, shared with an Immich photo library).
Setup: Nextcloud AIO with context_agent, context_chat, and integration_openai pointed at a self-hosted LocalAI instance. Approximately 86 tools enabled across the integrations (a significant context-budget consideration on its own — more on that below).
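Before wiring things up, a quick sanity check that LocalAI’s OpenAI-compatible endpoint is reachable saves some head-scratching on the Nextcloud side; hostname, port, and model name here are placeholders for whatever your instance uses:

```bash
# List the models LocalAI currently exposes (OpenAI-compatible API)
curl http://localai:8080/v1/models

# One-shot chat completion against the model you plan to point integration_openai at
curl http://localai:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "qwen3-30b-a3b-instruct", "messages": [{"role": "user", "content": "ping"}]}'
```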
What’s working well for me:

Qwen3-30B-A3B-Instruct-2507 (Q4_K_M, 48K context). MoE architecture means 3B active parameters per token, so inference speed is comparable to a 3B dense model despite 30B total weights. Tool-calling format (ChatML with <tool_call> JSON blocks) parses cleanly through LocalAI’s existing tool-call parser without custom regex. Comfortable VRAM headroom alongside Immich. This has been my daily driver and handles all of context_agent’s task types including the tool-heavy ones.
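For reference, the ChatML-style tool call this model emits looks roughly like the following; the tool name and arguments are made up for illustration, the point is that the payload is plain JSON inside <tool_call> tags, which is what LocalAI’s parser keys off:

```
<tool_call>
{"name": "calendar_list_events", "arguments": {"from": "2025-06-01", "to": "2025-06-07"}}
</tool_call>
```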
Gemma 3 27B QAT (Q4_0). Better conversational tone for drafting and free-prompt tasks than Qwen3-30B. No native tool-calling support that LocalAI’s parser recognizes, so not viable as a Context Agent backend. Useful as a routed alternative for content tasks (Free Prompt, Summarize, Headline, Reformulation) where tool execution isn’t needed.
Hermes-3-Llama-3.2-3B. Fast slot for short-response tasks where latency matters more than depth. CPU-runnable for the llm2 sidecar.
What I tried and would caution others about (at least for now):

Gemma 4 26B-A4B (Mudler’s APEX quant). Loaded after a llama.cpp backend update added gemma4 architecture support. Inference works fine, but the model emits tool calls in its native format (<|tool_call>call:NAME{key:value}<tool_call|> with <|"|> string delimiters), which LocalAI’s existing tool-call parsers don’t recognize. End result: raw model tokens leak through to the user instead of executed tool calls. The format is documented in the model’s chat template, so a custom regex parser is theoretically possible, but I haven’t found existing LocalAI support for it.
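For anyone who wants to experiment anyway: my understanding from the LocalAI docs is that custom formats can be extracted with a regex in the model config’s function: section, but the key name (response_regex) and the name/arguments capture groups here are from memory and untested with this model, so treat this as a sketch, not a recipe:

```yaml
# Untested sketch: assumes LocalAI supports function.response_regex with
# named capture groups; verify the key names against the current
# function-calling docs before relying on it.
function:
  response_regex:
    - '<\|tool_call>call:(?P<name>[A-Za-z0-9_]+)\s*(?P<arguments>\{.*?\})<tool_call\|>'
# Even with a match, the captured arguments use Gemma 4's key:value syntax
# with <|"|> string delimiters rather than JSON, so they would still need
# rewriting before the Context Agent could execute the call.
```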
For now I’m routing tool tasks to Qwen and content tasks to Gemma 4 instead.

Reasoning-distilled models (Qwen3.5-27B-Claude-Distilled, Qwen3.6-35B-distilled). The trace overhead made these unsuitable for routine Assistant tasks even when the underlying answer quality was high. They’re better reserved for explicit “deep think” usage, not as defaults.
Mistral-Small-3.2-24B-Instruct-2506. Works well technically, but the prose style felt clipped and cold for general-purpose family use. Subjective preference, not a capability gap.

A few practical notes that might help others:

LocalAI’s context_size YAML key only takes effect at the top level of the model config; placed under parameters: it’s silently ignored. Cost me a few hours.
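For anyone hitting the same wall, here’s a minimal sketch of a model config with context_size where it actually takes effect; the file name and sampling values are placeholders:

```yaml
name: qwen3-30b-a3b-instruct
context_size: 49152            # top level: honored (48K)
parameters:
  model: Qwen3-30B-A3B-Instruct-2507-Q4_K_M.gguf
  temperature: 0.7
  # context_size: 49152        # under parameters: silently ignored
```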
The Context Agent ships ~11K tokens of tool definitions per request when all integrations are enabled (Cookbook, Forms, OpenStreetMap, Weather, YouTube, LibreSign, Tables, Analytics, etc.). Pruning unused tool integrations from Nextcloud’s AI settings significantly reduces prompt overhead and noise. Worth doing regardless of model choice.

For tool-calling workloads specifically, sticking with models whose tool-call format matches an existing LocalAI parser (Mistral’s [TOOL_CALLS], ChatML’s <tool_call>, Hermes’ bracketed syntax) saves a lot of integration pain. Qwen3-30B’s ChatML format is the cleanest fit I’ve found.
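For contrast with the ChatML snippet above, the Mistral-style output that the [TOOL_CALLS] parser expects looks roughly like this (same made-up call):

```
[TOOL_CALLS][{"name": "calendar_list_events", "arguments": {"from": "2025-06-01", "to": "2025-06-07"}}]
```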