Model Triage

Pawan is model-agnostic — it works with any OpenAI-compatible API. We've triaged models on NVIDIA NIM and local inference for real-world coding agent tasks: tool calling, multi-step reasoning, code generation, and self-healing.

Triage Results (updated 2026-03-19)

Tested across 1000+ cumulative tool calls, 16 data structure builds, and 372 tests across the workspace. 11 NIM models benchmarked on the B-Tree ringer test (2026-03-21).

Tier 1 — Production Ready

Model	Provider	Tool Call	Coding	Notes
Mistral Small 4 119B	NIM	Good	Best	First 100% autonomous score (interval tree 6/6). Self-corrects via semantic reasoning. Primary model.
StepFun Flash	NIM	Best (98.9%)	Good	Best for multi-step orchestration. Cloud fallback.
MiniMax M2.5	NIM	Good	Good (4/5 B-Tree)	Tied with Mistral on B-Tree. Fewer tool calls needed.
Qwen3.5-9B-OptiQ-4bit (local)	MLX	~85%	Execution only	17-18 tok/s, $0. `enable_thinking: false` required. Can't generate complex algos.

Tier 2 — Usable with Guardrails

Model	Provider	Notes
Qwen3.5-122B	NIM	Strong code but can't fix borrow checker in time (6 errors on B-Tree).
GPT-OSS 120B	NIM	Fixes things but breaks other things. Destructive fix loops.
Nemotron-Super-49B	NIM	Too simple implementations (1/5 B-Tree).

Tier 3 — Avoid for Agents

Model	Provider	Issue
DeepSeek V3.2	NIM	Infinite loop on tool calling. Never terminates.
Mistral-Nemotron	NIM	Describes actions but never makes tool calls.
Nemotron-3-Nano-30B	NIM	All thinking tokens, zero tool calls.
Nemotron-Ultra-253B	NIM	Hit max iterations, never completed.
Nemotron-3-Super-120B	NIM	Missing Default trait, compile errors.
Nemotron-3-Nano-4B	MLX	Broken chat template, garbled output.
Nemotron-Cascade-8B	MLX	Can't disable thinking. Burns all tokens.

Guardrails That Make It Work

Pawan's state machine includes guardrails that boost success rates across all models:

Guardrail	What It Does
Empty response nudge	If model returns empty content + no tool calls, sends a nudge prompt to retry
Repeat detection	Detects when model repeats the same response 3x and forces a different approach
Chatty no-op detection	If model returns verbose planning text but no tool calls, nudges it to use tools
Think-token stripping	Strips `<think>...</think>` from content AND tool call arguments (StepFun/Qwen compat)
UTF-8 safe truncation	Truncates at char boundaries, not byte boundaries — no panics on multi-byte chars
Resilient LLM retry	Exponential backoff (2s, 4s, 8s) with auto-prune on context overflow
Tool timeout	30s per tool (bash uses config timeout), returns error with hint instead of hanging

Hybrid Local + Cloud Routing

Pawan supports hybrid routing — use a local model first (via MLX, Ollama, or llama.cpp), fall back to NIM cloud:

# pawan.toml
provider = "mlx"
model = "mlx-community/Qwen3.5-9B-4bit"
temperature = 0.6
max_tokens = 4096
max_tool_iterations = 20

[cloud]
provider = "nvidia"
model = "step-ai/step-2-flash"

The local model runs at $0/token. If it's down (OOM, Mac asleep), pawan seamlessly falls back to NIM cloud. Zero manual intervention.

MLX on Mac Mini M4 — Honest Assessment

We switched from llama.cpp GGUF to mlx_lm.server (Apple Silicon native) for local inference. Here's what we actually observed:

MLX vs llama.cpp

	mlx_lm.server	llama.cpp
Hardware	Apple Silicon (Metal GPU)	Cross-platform (CPU + GPU)
Format	MLX native (safetensors)	GGUF
Speed on M4 16GB	~40 tok/s (4-bit Qwen3.5-9B)	~20 tok/s
Memory	Unified memory — efficient	Separate GPU/CPU split
API	OpenAI-compatible, localhost:8080	OpenAI-compatible, localhost:8080

Setup: uv tool install mlx-lm, then mlx_lm.server --model mlx-community/Qwen3.5-9B-4bit. Persisted via launchd plist on Mac, exposed to VPS via SSH tunnel.

What MLX handles well

Targeted edits: "add a test for this function", "fix this compiler error"
Simple implementations: bloom filter, fenwick tree, trie — wrote correct code on first try
Tool calls: individual tool call format is reliable (~85% success rate)

Where MLX falls short

Timeout on complex tasks: At 5–10s/inference call, a 300s budget is exhausted during the exploration phase (reading files, checking dirs) before any code gets written. Leftist heap and suffix array both timed out.
Algorithmic bugs: When it does write complex code under time pressure, correctness suffers. The suffix array binary search had inverted comparison directions; the treap's split-based remove was conceptually wrong. Both compiled but failed tests.
Task-level completion: ~60% for "implement this data structure from scratch". Individual tool calls are fine; it's sustained multi-step reasoning that breaks down.

Mitigation

Front-load prompts with all context the model would otherwise explore: exact file path, existing structure, exact function signature. Skip exploration entirely. This gets task completion up to ~80% for moderate-complexity tasks.

Dogfood Stats

Updated 2026-03-21:

1000+ tool calls across 11 models
11 NIM models benchmarked on B-Tree ringer test
First 100% autonomous score: Mistral Small 4 wrote interval tree (6/6 tests)
Best coding accuracy: Mistral Small 4 (4/5 B-Tree, 6/6 interval tree, self-refactored 9 callsites)
Best tool calling: StepFun Flash (98.9% success rate)
16 data structures in grind workspace: bloom filter, fenwick, skip list, trie, segment tree, DSU, treap, suffix array, leftist heap, radix tree, pairing heap, splay tree, rope, AVL tree, LRU cache, interval tree
119 grind tests + 207 pawan-core tests + 46 TUI tests = 372 total
29 tools in 3 tiers (Core/Standard/Extended) with auto-install via mise
Pawan dogfoods itself: wrote 14 tests for its own git.rs, native.rs, bash.rs, agent.rs
Token budget tracking: thinking vs action token split visible in TUI and CLI

Running Your Own Triage

# Test a model with a simple coding task
pawan run "create a Rust function that checks if a string is a palindrome" \
  --timeout 60 --verbose

# Compare local vs cloud
PAWAN_PROVIDER=mlx pawan run "implement a binary search tree" --output json
PAWAN_PROVIDER=nvidia PAWAN_MODEL=step-ai/step-2-flash pawan run "implement a binary search tree" --output json