Model Triage
Pawan is model-agnostic — it works with any OpenAI-compatible API. We've triaged models on NVIDIA NIM and local inference for real-world coding agent tasks: tool calling, multi-step reasoning, code generation, and self-healing.
Triage Results (updated 2026-04-12)
Results are drawn from 1000+ cumulative tool calls, 16 data structure builds, and 1200+ passing automated tests in the workspace. 12 NIM models were benchmarked via latency measurement (nimakai) and real-world dogfooding (pawan task).
Tier 1 — Production Ready
| Model | Provider | Latency | Task Time | Notes |
|---|---|---|---|---|
| Qwen3.5 122B A10B | NIM | 383ms | 13.6s | Primary model. Fastest task completion, solid tool calling, thinking mode support. S tier (66% SWE). |
| MiniMax M2.5 | NIM | 374ms | 73.8s | Cloud fallback. Highest SWE-bench score (80.2%). Best analysis quality but slower. |
| Step 3.5 Flash | NIM | 379ms | — | S+ tier (74.4% SWE). Low latency, but produced empty responses in dogfooding; needs investigation. |
Tier 2 — Usable with Guardrails
| Model | Provider | Latency | Notes |
|---|---|---|---|
| Kimi K2 Thinking | NIM | 470ms | Strong reasoning model, but thinking-mode overhead slows agentic tasks. |
| Kimi K2 Instruct 0905 | NIM | 458ms | No thinking overhead; decent tool calling. |
| Mistral Large 3 675B | NIM | 685ms | Capable but slow for agent loops. |
| GLM-4.7 | NIM | 1614ms | Strong benchmarks but too slow for real-time agent use. |
Tier 3 — Avoid for Agents
| Model | Provider | Issue |
|---|---|---|
| Mistral Small 4 119B | NIM | 400 error: "Unexpected role 'user' after role 'tool'" — Eruka context injection breaks Mistral's strict message ordering. |
| Gemma 4 31B IT | NIM | Thinking mode stalls pawan (15+ min with no tool calls); 9 tok/s is too slow for agentic tasks. |
| GLM-5 | NIM | 8313ms latency. Unstable. |
| DeepSeek V3.2 | NIM | Timeout in latency benchmark. |
| Kimi K2.5 | NIM | Timeout in latency benchmark. |
| Qwen3.5 397B A17B | NIM | 404 / timeout — not reliably available. |
Guardrails That Make It Work
Pawan's state machine includes guardrails that boost success rates across all models:
| Guardrail | What It Does |
|---|---|
| Empty response nudge | If model returns empty content + no tool calls, sends a nudge prompt to retry |
| Repeat detection | Detects when model repeats the same response 3x and forces a different approach |
| Chatty no-op detection | If model returns verbose planning text but no tool calls, nudges it to use tools |
| Think-token stripping | Strips `<think>...</think>` from content AND tool call arguments (StepFun/Qwen compat) |
| UTF-8 safe truncation | Truncates at char boundaries, not byte boundaries, so multi-byte chars never cause panics (see the sketch after this table) |
| Resilient LLM retry | Exponential backoff (2s, 4s, 8s) with auto-prune on context overflow |
| Tool timeout | 30s per tool (bash uses config timeout), returns error with hint instead of hanging |
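As a concrete example, here is a minimal Rust sketch of the char-boundary idea behind the UTF-8 safe truncation guardrail (illustrative only; this is not pawan's actual implementation):

```rust
/// Truncate `s` to at most `max_bytes` bytes without splitting a
/// multi-byte character. Slicing a `&str` mid-character panics in Rust,
/// which is exactly what this guardrail avoids.
fn truncate_utf8(s: &str, max_bytes: usize) -> &str {
    if s.len() <= max_bytes {
        return s;
    }
    // Walk back from the byte limit until we land on a char boundary.
    let mut end = max_bytes;
    while !s.is_char_boundary(end) {
        end -= 1;
    }
    &s[..end]
}

fn main() {
    let s = "héllo wörld";
    // Byte 2 falls inside the two-byte 'é': truncate to 1 byte instead of panicking.
    assert_eq!(truncate_utf8(s, 2), "h");
    assert_eq!(truncate_utf8(s, 8), "héllo w");
}
```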
Hybrid Local + Cloud Routing
Pawan supports hybrid routing — use a local model first (via MLX, Ollama, or llama.cpp/lancor), fall back to NIM cloud:
```toml
# pawan.toml
provider = "mlx"
model = "mlx-community/Qwen3.5-9B-4bit"
temperature = 0.6
max_tokens = 4096
max_tool_iterations = 20

[cloud]
provider = "nvidia"
model = "minimaxai/minimax-m2.5"
```
The local model runs at $0/token. If it's down (OOM, Mac asleep), pawan seamlessly falls back to NIM cloud. Zero manual intervention.
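In code, the routing reduces to "try local, and on any error try cloud." A minimal Rust sketch of that shape (the `Provider` struct and `complete` stub are illustrative stand-ins, not pawan's actual internals):

```rust
/// Illustrative provider handle; not pawan's actual config types.
struct Provider {
    base_url: String,
    model: String,
}

/// Stub for an OpenAI-compatible /chat/completions call. A real
/// implementation would issue an HTTP request with a timeout; here we
/// pretend the endpoint is down so the example exercises the fallback.
fn complete(p: &Provider, _prompt: &str) -> Result<String, String> {
    Err(format!("{} ({}) unreachable", p.base_url, p.model))
}

/// Local first ($0/token); any error (OOM, Mac asleep, connection
/// refused) falls through to the cloud provider.
fn route(local: &Provider, cloud: &Provider, prompt: &str) -> Result<String, String> {
    complete(local, prompt).or_else(|_| complete(cloud, prompt))
}

fn main() {
    let local = Provider {
        base_url: "http://localhost:8080/v1".into(),
        model: "mlx-community/Qwen3.5-9B-4bit".into(),
    };
    let cloud = Provider {
        base_url: "https://integrate.api.nvidia.com/v1".into(),
        model: "minimaxai/minimax-m2.5".into(),
    };
    match route(&local, &cloud, "read src/lib.rs and identify the top 3 issues") {
        Ok(reply) => println!("{reply}"),
        Err(e) => eprintln!("both providers failed: {e}"),
    }
}
```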
MLX on Mac Mini M4 — Honest Assessment
We switched from llama.cpp GGUF to mlx_lm.server (Apple Silicon native) for local inference. Here's what we actually observed:
MLX vs llama.cpp
| | mlx_lm.server | llama.cpp |
|---|---|---|
| Hardware | Apple Silicon (Metal GPU) | Cross-platform (CPU + GPU) |
| Format | MLX native (safetensors) | GGUF |
| Speed on M4 16GB | ~40 tok/s (4-bit Qwen3.5-9B) | ~20 tok/s |
| Memory | Unified memory — efficient | Separate GPU/CPU split |
| API | OpenAI-compatible, localhost:8080 | OpenAI-compatible, localhost:8080 |
Setup: `uv tool install mlx-lm`, then `mlx_lm.server --model mlx-community/Qwen3.5-9B-4bit`. Persisted via a launchd plist on the Mac, exposed to the VPS via SSH tunnel (sketched below).
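The tunnel is a plain reverse forward from the Mac to the VPS (user and host names here are placeholders):

```bash
# Run on the Mac: expose local mlx_lm.server (localhost:8080) on the
# VPS's localhost:8080. Wrap in autossh or a launchd job to keep it alive.
ssh -N -R 8080:localhost:8080 user@vps.example.com
```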
What MLX handles well
- Targeted edits: "add a test for this function", "fix this compiler error"
- Simple implementations: bloom filter, fenwick tree, trie — wrote correct code on first try
- Tool calls: individual tool call format is reliable (~85% success rate)
Where MLX falls short
- Timeout on complex tasks: At 5–10s/inference call, a 300s budget is exhausted during the exploration phase (reading files, checking dirs) before any code gets written. Leftist heap and suffix array both timed out.
- Algorithmic bugs: When it does write complex code under time pressure, correctness suffers. The suffix array binary search had inverted comparison directions; the treap's split-based remove was conceptually wrong. Both compiled but failed tests.
- Task-level completion: ~60% for "implement this data structure from scratch". Individual tool calls are fine; it's sustained multi-step reasoning that breaks down.
Mitigation
Front-load prompts with all context the model would otherwise explore: exact file path, existing structure, exact function signature. Skip exploration entirely. This gets task completion up to ~80% for moderate-complexity tasks.
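For example, a front-loaded prompt might look like this (file, struct, and field names are hypothetical):

```bash
# Everything the model would otherwise discover by exploring is stated
# up front: exact path, existing structure, exact signature.
pawan task "In src/structures/bloom.rs, add pub fn clear(&mut self) to the \
existing BloomFilter struct (fields: bits: Vec<u64>, k: usize). Reset all \
bits to zero. Do not read any other files."
```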
Dogfood Stats
Updated 2026-04-29:
- 12 NIM models benchmarked via nimakai latency + real-world pawan task dogfooding
- Fastest task completion: Qwen3.5 122B (13.6s for healing module review)
- Highest SWE-bench: MiniMax M2.5 (80.2%)
- 16 data structures in grind workspace
- 988+ tests passing with 74.58% line / 74.80% region / 77.34% function coverage (cargo-llvm-cov baseline); 98 new tests added in v0.5.6 (29 compaction, 16 eruka, 53 TUI types)
- 34 tools in 3 tiers (Core/Standard/Extended) with auto-install via mise
- Multi-model thinking support: Qwen (enable_thinking), Gemma (enable_thinking), GLM (enable_thinking + clear_thinking), Mistral Small 4 (reasoning_effort), DeepSeek (thinking); see the request sketch after this list
- Token budget tracking: thinking vs. action token split visible in TUI and CLI
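These flags ride along in the request body. As one illustration, the Qwen-family convention on vLLM-style OpenAI-compatible servers is chat_template_kwargs; whether a given NIM endpoint honors it, and the exact key per model family, should be checked against the provider's docs:

```bash
# Illustrative request shape only: key names differ per model family
# (enable_thinking here follows the Qwen/vLLM convention).
curl -s https://integrate.api.nvidia.com/v1/chat/completions \
  -H "Authorization: Bearer $NVIDIA_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen/qwen3.5-122b-a10b",
    "messages": [{"role": "user", "content": "Plan the fix, then apply it."}],
    "chat_template_kwargs": {"enable_thinking": true}
  }'
```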
Running Your Own Triage
```bash
# Latency benchmark via nimakai
nimakai --once -m "qwen/qwen3.5-122b-a10b,minimaxai/minimax-m2.5,stepfun-ai/step-3.5-flash"

# Test a model with a real coding task
pawan task "read src/lib.rs and identify the top 3 issues"

# Override model for comparison
PAWAN_MODEL=minimaxai/minimax-m2.5 pawan task "read src/lib.rs and identify the top 3 issues"
```