# Model Triage

Pawan is model-agnostic — it works with any OpenAI-compatible API. We've triaged models on NVIDIA NIM and local inference for real-world coding agent tasks: tool calling, multi-step reasoning, code generation, and self-healing.

## Triage Results (updated 2026-04-12)

Results cover 1000+ cumulative tool calls, 16 data-structure builds, and 1200+ passing automated tests across the workspace. Twelve NIM models were benchmarked on latency (`nimakai`) and real-world dogfooding (`pawan task`).

### Tier 1 — Production Ready

| Model | Provider | Latency | Task Time | Notes |
|---|---|---|---|---|
| Qwen3.5 122B A10B | NIM | 383ms | 13.6s | Primary model. Fastest task completion, solid tool calling, thinking mode support. S tier (66% SWE). |
| MiniMax M2.5 | NIM | 374ms | 73.8s | Cloud fallback. Highest SWE-bench score (80.2%). Best analysis quality but slower. |
| Step 3.5 Flash | NIM | 379ms | n/a | S+ tier (74.4% SWE). Fast latency but produced empty responses in dogfooding — needs investigation. |

### Tier 2 — Usable with Guardrails

| Model | Provider | Latency | Notes |
|---|---|---|---|
| Kimi K2 Thinking | NIM | 470ms | Strong reasoning model, but thinking-mode overhead slows agentic tasks. |
| Kimi K2 Instruct 0905 | NIM | 458ms | No thinking overhead, decent tool calling. |
| Mistral Large 3 675B | NIM | 685ms | Capable but slow for agent loops. |
| GLM-4.7 | NIM | 1614ms | Strong benchmarks but too slow for real-time agent use. |

### Tier 3 — Avoid for Agents

| Model | Provider | Issue |
|---|---|---|
| Mistral Small 4 119B | NIM | 400 error: "Unexpected role 'user' after role 'tool'" — Eruka context injection breaks Mistral's strict message ordering (see the sketch after this table). |
| Gemma 4 31B IT | NIM | Thinking mode stalls pawan (15+ min with no tool calls). 9 TPS is too slow for agentic tasks. |
| GLM-5 | NIM | 8313ms latency. Unstable. |
| DeepSeek V3.2 | NIM | Timeout in latency benchmark. |
| Kimi K2.5 | NIM | Timeout in latency benchmark. |
| Qwen3.5 397B A17B | NIM | 404 / timeout — not reliably available. |
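
The Mistral failure is an ordering constraint, not a capability gap: Mistral's API rejects a `user` message that directly follows a `tool` message. Here is a minimal sketch of one possible workaround — not what pawan ships today, and all names are hypothetical — that folds injected context into the preceding tool result so the role sequence stays legal:

```rust
struct Msg {
    role: &'static str, // "system" | "user" | "assistant" | "tool"
    content: String,
}

/// Hypothetical pre-flight pass: when a "user" message directly follows
/// a "tool" message, merge its content into that tool result so strict
/// role ordering is preserved.
fn normalize_for_mistral(messages: Vec<Msg>) -> Vec<Msg> {
    let mut out: Vec<Msg> = Vec::with_capacity(messages.len());
    for msg in messages {
        match out.last_mut() {
            Some(prev) if prev.role == "tool" && msg.role == "user" => {
                // Fold the injected context into the tool result instead
                // of emitting an illegal user-after-tool turn.
                prev.content.push_str("\n\n[injected context]\n");
                prev.content.push_str(&msg.content);
            }
            _ => out.push(msg),
        }
    }
    out
}
```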

## Guardrails That Make It Work

Pawan's state machine includes guardrails that boost success rates across all models:

| Guardrail | What It Does |
|---|---|
| Empty response nudge | If the model returns empty content and no tool calls, sends a nudge prompt to retry. |
| Repeat detection | Detects when the model repeats the same response 3× and forces a different approach. |
| Chatty no-op detection | If the model returns verbose planning text but no tool calls, nudges it to use tools. |
| Think-token stripping | Strips `<think>...</think>` from content AND tool-call arguments (StepFun/Qwen compat). |
| UTF-8 safe truncation | Truncates at char boundaries, not byte boundaries — no panics on multi-byte chars (sketched below). |
| Resilient LLM retry | Exponential backoff (2s, 4s, 8s) with auto-prune on context overflow (sketched below). |
| Tool timeout | 30s per tool (bash uses the configured timeout); returns an error with a hint instead of hanging. |
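
Two of these are easy to show concretely. First, UTF-8 safe truncation — a minimal sketch with a hypothetical function name (pawan's internals may differ). Rust panics if you slice a string at a non-char boundary, so truncation has to back up to the nearest boundary:

```rust
/// Truncate `s` to at most `max_bytes`, backing up to the nearest char
/// boundary so a multi-byte UTF-8 sequence is never split (slicing at a
/// non-boundary byte index would panic).
fn truncate_utf8(s: &str, max_bytes: usize) -> &str {
    if s.len() <= max_bytes {
        return s;
    }
    let mut end = max_bytes;
    while !s.is_char_boundary(end) {
        end -= 1; // is_char_boundary(0) is always true, so this terminates
    }
    &s[..end]
}
```

And the resilient retry, sketched against the documented 2s/4s/8s schedule (again a hypothetical name; the real guardrail also auto-prunes context on overflow):

```rust
use std::{thread, time::Duration};

/// Retry `op` with exponential backoff: sleep 2s, 4s, 8s between
/// attempts, then make one final attempt and propagate its error.
fn retry_with_backoff<T, E>(mut op: impl FnMut() -> Result<T, E>) -> Result<T, E> {
    for delay in [2u64, 4, 8] {
        match op() {
            Ok(v) => return Ok(v),
            Err(_) => thread::sleep(Duration::from_secs(delay)),
        }
    }
    op()
}
```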

## Hybrid Local + Cloud Routing

Pawan supports hybrid routing — try a local model first (via MLX, Ollama, or llama.cpp/lancor) and fall back to NIM cloud:

```toml
# pawan.toml
provider = "mlx"
model = "mlx-community/Qwen3.5-9B-4bit"
temperature = 0.6
max_tokens = 4096
max_tool_iterations = 20

[cloud]
provider = "nvidia"
model = "minimaxai/minimax-m2.5"
```

The local model runs at $0/token. If it's down (OOM, Mac asleep), pawan seamlessly falls back to NIM cloud. Zero manual intervention.
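
Under the hood, the routing is just try-local-then-cloud. A minimal sketch of that shape (hypothetical names; pawan's actual provider layer also handles streaming, tool calls, and retries):

```rust
type LlmResult = Result<String, Box<dyn std::error::Error>>;

/// Try the free local provider first; on any error (connection refused,
/// OOM-killed server, dropped SSH tunnel) retry the same prompt on cloud.
fn complete_with_fallback(
    local: impl Fn(&str) -> LlmResult,
    cloud: impl Fn(&str) -> LlmResult,
    prompt: &str,
) -> LlmResult {
    match local(prompt) {
        Ok(text) => Ok(text), // $0/token path succeeded
        Err(err) => {
            eprintln!("local provider failed ({err}); falling back to cloud");
            cloud(prompt)
        }
    }
}
```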

## MLX on Mac Mini M4 — Honest Assessment

We switched from llama.cpp (GGUF) to `mlx_lm.server` (Apple Silicon native) for local inference. Here's what we actually observed:

### MLX vs llama.cpp

| | mlx_lm.server | llama.cpp |
|---|---|---|
| Hardware | Apple Silicon (Metal GPU) | Cross-platform (CPU + GPU) |
| Format | MLX native (safetensors) | GGUF |
| Speed on M4 16GB | ~40 tok/s (4-bit Qwen3.5-9B) | ~20 tok/s |
| Memory | Unified memory — efficient | Separate GPU/CPU split |
| API | OpenAI-compatible, localhost:8080 | OpenAI-compatible, localhost:8080 |

Setup: `uv tool install mlx-lm`, then `mlx_lm.server --model mlx-community/Qwen3.5-9B-4bit`. Persisted via a launchd plist on the Mac and exposed to the VPS via an SSH tunnel.

### What MLX handles well

### Where MLX falls short

### Mitigation

Front-load prompts with all the context the model would otherwise have to explore: the exact file path, the existing structure, the exact function signature. For example, instead of "add an eviction method to the cache", spell out which file, which struct, and the exact signature you want. Skipping exploration entirely gets task completion up to ~80% for moderate-complexity tasks.

## Dogfood Stats

Updated 2026-04-29:

## Running Your Own Triage

```bash
# Latency benchmark via nimakai
nimakai --once -m "qwen/qwen3.5-122b-a10b,minimaxai/minimax-m2.5,stepfun-ai/step-3.5-flash"

# Test a model with a real coding task
pawan task "read src/lib.rs and identify the top 3 issues"

# Override model for comparison
PAWAN_MODEL=minimaxai/minimax-m2.5 pawan task "read src/lib.rs and identify the top 3 issues"
```