# Model Triage

Pawan is model-agnostic — it works with any OpenAI-compatible API. We've triaged models on NVIDIA NIM and local inference for real-world coding agent tasks: tool calling, multi-step reasoning, code generation, and self-healing.

## Triage Results (updated 2026-03-19)

Results are drawn from 1,000+ cumulative tool calls, 16 data-structure builds, and 372 workspace tests. Eleven NIM models were benchmarked on the B-Tree ringer test (2026-03-21).

### Tier 1 — Production Ready

| Model | Provider | Tool Call | Coding | Notes |
|---|---|---|---|---|
| Mistral Small 4 119B | NIM | Good | Best | First 100% autonomous score (interval tree 6/6). Self-corrects via semantic reasoning. Primary model. |
| StepFun Flash | NIM | Best (98.9%) | Good | Best for multi-step orchestration. Cloud fallback. |
| MiniMax M2.5 | NIM | Good | Good (4/5 B-Tree) | Tied with Mistral on B-Tree. Needs fewer tool calls. |
| Qwen3.5-9B-OptiQ-4bit (local) | MLX | ~85% | Execution only | 17–18 tok/s, $0. `enable_thinking: false` required. Can't generate complex algorithms. |

### Tier 2 — Usable with Guardrails

| Model | Provider | Notes |
|---|---|---|
| Qwen3.5-122B | NIM | Strong code, but can't fix borrow-checker errors in time (6 errors on B-Tree). |
| GPT-OSS 120B | NIM | Fixes one thing, breaks another. Destructive fix loops. |
| Nemotron-Super-49B | NIM | Implementations too simplistic (1/5 B-Tree). |

### Tier 3 — Avoid for Agents

| Model | Provider | Issue |
|---|---|---|
| DeepSeek V3.2 | NIM | Infinite loop on tool calling; never terminates. |
| Mistral-Nemotron | NIM | Describes actions but never makes tool calls. |
| Nemotron-3-Nano-30B | NIM | All thinking tokens, zero tool calls. |
| Nemotron-Ultra-253B | NIM | Hit max iterations; never completed. |
| Nemotron-3-Super-120B | NIM | Missing `Default` trait; compile errors. |
| Nemotron-3-Nano-4B | MLX | Broken chat template; garbled output. |
| Nemotron-Cascade-8B | MLX | Can't disable thinking; burns all tokens. |

## Guardrails That Make It Work

Pawan's state machine includes guardrails that boost success rates across all models:

| Guardrail | What It Does |
|---|---|
| Empty response nudge | If the model returns empty content and no tool calls, sends a nudge prompt to retry |
| Repeat detection | Detects when the model repeats the same response 3x and forces a different approach |
| Chatty no-op detection | If the model returns verbose planning text but no tool calls, nudges it to use tools |
| Think-token stripping | Strips `<think>…</think>` from content AND tool call arguments (StepFun/Qwen compat) |
| UTF-8 safe truncation | Truncates at char boundaries, not byte boundaries — no panics on multi-byte chars |
| Resilient LLM retry | Exponential backoff (2s, 4s, 8s) with auto-prune on context overflow |
| Tool timeout | 30s per tool (bash uses the configured timeout); returns an error with a hint instead of hanging |
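The UTF-8-safe truncation guardrail is straightforward to illustrate. The sketch below is a minimal stand-in, not Pawan's actual code (the function name `truncate_utf8_safe` is ours): it backs up from the byte limit until it lands on a character boundary, which is exactly what prevents panics on multi-byte characters.

```rust
/// Truncate `s` to at most `max_bytes` bytes without splitting a multi-byte
/// character. Hypothetical sketch; Pawan's real implementation may differ.
fn truncate_utf8_safe(s: &str, max_bytes: usize) -> &str {
    if s.len() <= max_bytes {
        return s;
    }
    // Walk backwards from the byte limit until we hit a char boundary.
    let mut end = max_bytes;
    while !s.is_char_boundary(end) {
        end -= 1;
    }
    &s[..end]
}

fn main() {
    let s = "héllo"; // 'é' occupies bytes 1..3 in UTF-8
    // A naive `&s[..2]` would panic: byte index 2 falls inside 'é'.
    assert_eq!(truncate_utf8_safe(s, 2), "h");
    assert_eq!(truncate_utf8_safe(s, 3), "hé");
    assert_eq!(truncate_utf8_safe(s, 10), "héllo");
    println!("ok");
}
```

Slicing a `&str` at a non-boundary byte index panics in Rust, so any truncation of model output (which routinely contains non-ASCII) needs this boundary walk.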

## Hybrid Local + Cloud Routing

Pawan supports hybrid routing — run a local model first (via MLX, Ollama, or llama.cpp) and fall back to NIM cloud when it's unavailable:

```toml
# pawan.toml
provider = "mlx"
model = "mlx-community/Qwen3.5-9B-4bit"
temperature = 0.6
max_tokens = 4096
max_tool_iterations = 20

[cloud]
provider = "nvidia"
model = "step-ai/step-2-flash"
```

The local model runs at $0/token. If it's down (OOM, Mac asleep), Pawan falls back to NIM cloud automatically — zero manual intervention.
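The fallback decision can be sketched as a cheap health probe: try to open a TCP connection to the local endpoint, and route to the cloud config when nothing answers. This is our own illustration (the `Endpoint` struct and `pick_endpoint` function are hypothetical names, not Pawan's API):

```rust
use std::net::TcpStream;
use std::time::Duration;

/// Hypothetical endpoint description; field names are ours, not Pawan's.
#[derive(Debug, Clone)]
struct Endpoint {
    provider: &'static str,
    model: &'static str,
    addr: &'static str, // host:port used for the health probe
}

/// Return true if something is listening at `addr` within `timeout`.
fn is_up(addr: &str, timeout: Duration) -> bool {
    addr.parse()
        .ok()
        .and_then(|sock| TcpStream::connect_timeout(&sock, timeout).ok())
        .is_some()
}

/// Prefer the local endpoint; fall back to cloud when the probe fails.
fn pick_endpoint(local: Endpoint, cloud: Endpoint) -> Endpoint {
    if is_up(local.addr, Duration::from_millis(500)) {
        local
    } else {
        cloud
    }
}

fn main() {
    let local = Endpoint {
        provider: "mlx",
        model: "mlx-community/Qwen3.5-9B-4bit",
        addr: "127.0.0.1:8080",
    };
    let cloud = Endpoint {
        provider: "nvidia",
        model: "step-ai/step-2-flash",
        addr: "127.0.0.1:1", // placeholder; the cloud side isn't probed here
    };
    let chosen = pick_endpoint(local, cloud);
    println!("routing to {} / {}", chosen.provider, chosen.model);
}
```

A real router would also want to distinguish "port open" from "server healthy" (e.g. a `/v1/models` request), but a connect probe with a short timeout already covers the OOM/asleep cases cheaply.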

## MLX on Mac Mini M4 — Honest Assessment

We switched from llama.cpp GGUF to mlx_lm.server (Apple Silicon native) for local inference. Here's what we actually observed:

### MLX vs llama.cpp

| | `mlx_lm.server` | llama.cpp |
|---|---|---|
| Hardware | Apple Silicon (Metal GPU) | Cross-platform (CPU + GPU) |
| Format | MLX native (safetensors) | GGUF |
| Speed on M4 16GB | ~40 tok/s (4-bit Qwen3.5-9B) | ~20 tok/s |
| Memory | Unified memory — efficient | Separate GPU/CPU split |
| API | OpenAI-compatible, localhost:8080 | OpenAI-compatible, localhost:8080 |

Setup: `uv tool install mlx-lm`, then `mlx_lm.server --model mlx-community/Qwen3.5-9B-4bit`. Persisted via a launchd plist on the Mac, exposed to the VPS via SSH tunnel.

### What MLX handles well

### Where MLX falls short

### Mitigation

Front-load prompts with all context the model would otherwise explore: exact file path, existing structure, exact function signature. Skip exploration entirely. This gets task completion up to ~80% for moderate-complexity tasks.
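The front-loading step can be done mechanically. The sketch below shows the shape of such a prompt; the `front_load` helper is our own illustration and not part of Pawan:

```rust
/// Hypothetical helper: bundle everything a small local model would otherwise
/// burn tool calls discovering — path, structure, signature — into one prompt.
fn front_load(file: &str, structure: &str, signature: &str, task: &str) -> String {
    format!(
        "File: {file}\n\
         Existing structure:\n{structure}\n\
         Implement exactly this signature:\n{signature}\n\
         Task: {task}\n\
         Do not explore the workspace; all needed context is above."
    )
}

fn main() {
    let prompt = front_load(
        "src/tree.rs",
        "pub struct Tree { root: Option<Box<Node>> }",
        "pub fn insert(&mut self, key: u32) -> bool",
        "insert a key, returning false on duplicates",
    );
    println!("{prompt}");
}
```

The last line of the prompt matters most for small models: an explicit "do not explore" instruction is what converts saved exploration into saved tool-call iterations.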

## Dogfood Stats

Updated 2026-03-21:

## Running Your Own Triage

```bash
# Test a model with a simple coding task
pawan run "create a Rust function that checks if a string is a palindrome" \
  --timeout 60 --verbose

# Compare local vs cloud
PAWAN_PROVIDER=mlx pawan run "implement a binary search tree" --output json
PAWAN_PROVIDER=nvidia PAWAN_MODEL=step-ai/step-2-flash pawan run "implement a binary search tree" --output json
```