Every Output Evaluated & Scored
The evaluator sub-agent validates every result against the original request — scoring completeness, quality, and relevance before the user sees anything.
The Evaluation Pipeline
After all sub-agents complete their tasks, the evaluator reviews every output in Phase 4 of the orchestration loop.
Collect Sub-Agent Results
All task outputs from Phase 3 (Action) are gathered along with the original user request and the execution plan. This provides full context for evaluation.
Evaluator Analysis
The evaluator sub-agent (Claude Sonnet, temperature 0.1) assesses each task result independently against three criteria: completeness, quality, and relevance.
Score & Classify
Each task receives a 0.0–1.0 score with detailed issue descriptions. An overall pass/fail determination is made based on aggregate quality.
Persist to PostgreSQL
Evaluation scores are written to orchestrator_plans (overall_pass, summary) and plan_tasks (per-task pass, score, issues) for analytics and trend tracking.
Retry or Proceed
If evaluation fails and retries remain, specific failing tasks are re-executed with evaluator feedback. Otherwise, results proceed to Phase 5 (Finalization).
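The five steps above can be sketched in code. This is a minimal illustration, not the platform's implementation: the `TaskEvaluation` schema and the `evaluate_task` callable (standing in for the Claude Sonnet evaluator call) are assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class TaskEvaluation:
    """Per-task evaluator result (hypothetical schema)."""
    task_id: str
    score: float              # 0.0-1.0 aggregate quality score
    passed: bool
    issues: list = field(default_factory=list)

def run_evaluation_phase(results, original_request, plan, evaluate_task):
    """Phase 4 sketch: score every Phase 3 output, then gate the pipeline.

    `results` maps task_id -> sub-agent output; `evaluate_task` stands in
    for the evaluator sub-agent call with full context.
    """
    evaluations = []
    for task_id, output in results.items():
        ev = evaluate_task(task_id, output, original_request, plan)
        evaluations.append(ev)
    # Overall pass/fail is the aggregate gate before Phase 5.
    overall_pass = all(ev.passed for ev in evaluations)
    return overall_pass, evaluations
```

A failing `overall_pass` would route back into the retry step; a passing one proceeds to finalization.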
Three Dimensions of Quality
Every sub-agent output is measured against three independent criteria — ensuring comprehensive quality assessment.
Completeness
Did the sub-agent fully address its assigned task? Are all requested elements present? Missing sections, incomplete analysis, or partial outputs are flagged.
Quality
Is the output well-structured and accurate? Are insights meaningful? Does the analysis follow sound methodology? Superficial or incorrect outputs are caught.
Relevance
Does the output directly address the user's original need? Off-topic analysis, tangential information, or misunderstood requirements are identified.
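One way to combine the three criteria into the single 0.0–1.0 task score is a simple mean with a pass threshold. The equal weighting and the 0.7 threshold below are assumptions for illustration, not the platform's actual formula.

```python
CRITERIA = ("completeness", "quality", "relevance")

def aggregate_score(criterion_scores: dict, threshold: float = 0.7):
    """Fold per-criterion scores into one task score plus a pass flag.

    Assumes equal weighting; the real aggregation may differ.
    """
    missing = [c for c in CRITERIA if c not in criterion_scores]
    if missing:
        raise ValueError(f"missing criteria: {missing}")
    score = sum(criterion_scores[c] for c in CRITERIA) / len(CRITERIA)
    return score, score >= threshold
```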
Machine-Readable Evaluation Output
The evaluator returns structured JSON — enabling automated quality gates, trend analysis, and data-driven optimization of your agent pipeline.
Each task receives an independent assessment with a numerical score (0.0–1.0) and specific issue descriptions. The overall pass/fail gates the entire pipeline.
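A consumer of that structured output might look like the sketch below. The JSON field names mirror the persisted columns described later (overall_pass, summary, per-task pass/score/issues), but the exact response schema is an assumption.

```python
import json

# Illustrative evaluator response; the schema is an assumption based on
# the persisted fields, not the platform's documented output format.
raw = """
{
  "overall_pass": false,
  "summary": "Pricing analysis incomplete; one task requires retry.",
  "tasks": [
    {"task_id": "t1", "pass": true,  "score": 0.92, "issues": []},
    {"task_id": "t2", "pass": false, "score": 0.55,
     "issues": ["competitive analysis needs pricing data"]}
  ]
}
"""
evaluation = json.loads(raw)

# The overall pass/fail gates the pipeline; failing task IDs feed the
# retry step with their issue descriptions attached.
failing = [t["task_id"] for t in evaluation["tasks"] if not t["pass"]]
```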
Automatic Retry with Feedback
When evaluation fails, the orchestrator doesn't just retry blindly. It injects the evaluator's specific issue descriptions into the retry prompt — giving the sub-agent targeted guidance on what to fix.
Retries are configurable per deployment (max_retries in agent_config.json) and tracked in the execution log for cost analysis.
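The feedback injection described above could be sketched as a prompt builder. The template wording here is illustrative; only the mechanism (embedding the evaluator's issue list into the retry prompt) comes from the text.

```python
def build_retry_prompt(original_prompt: str, issues: list) -> str:
    """Compose a retry prompt that embeds the evaluator's issue list.

    The surrounding wording is a hypothetical template, not the
    platform's actual prompt.
    """
    feedback = "\n".join(f"- {issue}" for issue in issues)
    return (
        f"{original_prompt}\n\n"
        "A previous attempt failed evaluation for these reasons:\n"
        f"{feedback}\n\n"
        "Revise your output to address each issue specifically."
    )
```

The orchestrator would call this for each failing task until `max_retries` is exhausted, logging each attempt for cost analysis.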
Full Evaluation History
Every evaluation is persisted to PostgreSQL — enabling quality trend analysis, plan grading, and continuous improvement of your agent pipeline.
| Table | Field | Description |
|---|---|---|
| orchestrator_plans | overall_pass | Boolean — did the entire plan pass evaluation? |
| orchestrator_plans | summary | Evaluator's overall assessment text |
| plan_tasks | pass | Boolean — did this individual task pass? |
| plan_tasks | score | Float 0.0–1.0 quality score for this task |
| plan_tasks | issues | Evaluator's description of specific issues found |
| subagent_executions | prompt_tokens | Token count for evaluation LLM input |
| subagent_executions | completion_tokens | Token count for evaluation LLM output |
| subagent_executions | duration | Evaluation execution time in seconds |
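With scores persisted per task, trend analysis reduces to grouping and averaging. The sketch below works on `(plan_id, score)` pairs shaped like `plan_tasks` rows; the grouping key and row shape are assumptions for illustration.

```python
from collections import defaultdict
from statistics import mean

def score_trend(plan_task_rows):
    """Average task score per plan from plan_tasks-shaped rows.

    `plan_task_rows` is an iterable of (plan_id, score) pairs, a
    hypothetical projection of the persisted table.
    """
    by_plan = defaultdict(list)
    for plan_id, score in plan_task_rows:
        by_plan[plan_id].append(score)
    return {plan_id: round(mean(scores), 3)
            for plan_id, scores in by_plan.items()}
```

In production this aggregation would typically run as a SQL query against PostgreSQL rather than in application code.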
Purpose-Built for Accuracy
The evaluator is configured for maximum precision — low temperature (0.1) for consistent scoring, read-only workspace access, structured output format, and high priority in the execution queue.
- Temperature 0.1 — near-deterministic scoring
- Read-only workspace — no side effects
- Clean context window — no bleed from other tasks
- High priority — evaluated before finalization
- Structured JSON output schema enforced
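Expressed as configuration, those properties might look like the dictionary below. The key names are illustrative and not the real `agent_config.json` schema; only the values come from the text above.

```python
# Hypothetical evaluator sub-agent configuration mirroring the listed
# properties; key names are assumptions, not the actual schema.
EVALUATOR_CONFIG = {
    "model": "claude-sonnet",         # evaluator model per the docs
    "temperature": 0.1,               # low temperature for consistent scoring
    "workspace_access": "read_only",  # no side effects
    "fresh_context": True,            # clean context window per evaluation
    "priority": "high",               # runs before Phase 5 finalization
    "output_format": "json",          # structured output schema enforced
}
```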
Quality You Can Measure & Trust
Every output evaluated, every score persisted, every failure retried with targeted feedback. Quality assurance built into the architecture.