NvAgent — Evaluation Process
Quality Assurance Pipeline

Every Output Evaluated & Scored

The evaluator sub-agent validates every result against the original request — scoring completeness, quality, and relevance before the user sees anything.

How It Works

The Evaluation Pipeline

After all sub-agents complete their tasks, the evaluator reviews every output in Phase 4 of the orchestration loop.

📥

Collect Sub-Agent Results

All task outputs from Phase 3 (Action) are gathered along with the original user request and the execution plan. This provides full context for evaluation.

🔬

Evaluator Analysis

The evaluator sub-agent (Claude Sonnet, temperature 0.1) assesses each task result independently against three criteria: completeness, quality, and relevance.

📊

Score & Classify

Each task receives a 0.0–1.0 score with detailed issue descriptions. An overall pass/fail determination is made based on aggregate quality.

💾

Persist to PostgreSQL

Evaluation scores are written to orchestrator_plans (overall_pass, summary) and plan_tasks (per-task pass, score, issues) for analytics and trend tracking.

🔄

Retry or Proceed

If evaluation fails and retries remain, specific failing tasks are re-executed with evaluator feedback. Otherwise, results proceed to Phase 5 (Finalization).
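The five phases above can be sketched as a simple evaluate-retry loop. This is an illustrative sketch, not the actual NvAgent implementation: the function names `run_phase4`, `evaluate`, and `run_task` are assumptions, and the evaluator is assumed to return a dict shaped like the JSON shown later in this page.

```python
def run_phase4(tasks, evaluate, run_task, max_retries=2):
    """Evaluate all task results; re-run failing tasks with feedback.

    `evaluate` and `run_task` are hypothetical callables standing in
    for the evaluator sub-agent and sub-agent execution, respectively.
    """
    results = {t["id"]: t["output"] for t in tasks}
    for attempt in range(max_retries + 1):
        report = evaluate(results)  # assumed to return the evaluator JSON as a dict
        if report["overall_pass"] or attempt == max_retries:
            # Either aggregate quality passed, or retries are exhausted:
            # proceed to Phase 5 (Finalization) with what we have.
            return results, report
        # Re-execute only the failing tasks, injecting evaluator feedback
        for ev in report["evaluations"]:
            if not ev["pass"]:
                results[ev["task_id"]] = run_task(ev["task_id"],
                                                  feedback=ev["issues"])
    return results, report
```

Note that only failing tasks are re-executed; passing outputs from Phase 3 are kept as-is, which bounds retry cost.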

Evaluation Criteria

Three Dimensions of Quality

Every sub-agent output is measured against three independent criteria — ensuring comprehensive quality assessment.

📋

Completeness

Did the sub-agent fully address its assigned task? Are all requested elements present? Missing sections, incomplete analysis, or partial outputs are flagged.

Quality

Is the output well-structured and accurate? Are insights meaningful? Does the analysis follow sound methodology? Superficial or incorrect outputs are caught.

🎯

Relevance

Does the output directly address the user's original need? Off-topic analysis, tangential information, or misunderstood requirements are identified.
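The page does not specify how the three criterion scores combine into the single 0.0–1.0 task score. One plausible sketch, with equal weighting and a 0.7 pass threshold, both of which are assumptions rather than documented NvAgent behavior:

```python
def task_score(completeness, quality, relevance, threshold=0.7):
    """Combine three 0.0-1.0 criterion scores into one task score.

    Equal weighting and the 0.7 threshold are illustrative assumptions,
    not documented NvAgent behavior.
    """
    score = round((completeness + quality + relevance) / 3, 2)
    return score, score >= threshold
```

A real deployment might weight relevance higher, since an off-topic output is useless regardless of its internal quality.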

Structured Scoring

Machine-Readable Evaluation Output

The evaluator returns structured JSON — enabling automated quality gates, trend analysis, and data-driven optimization of your agent pipeline.

Each task receives an independent assessment with a numerical score (0.0–1.0) and specific issue descriptions. The overall pass/fail gates the entire pipeline.

{
  "overall_pass": false,
  "evaluations": [
    {
      "task_id": "t1",
      "pass": true,
      "score": 0.92,
      "issues": ""
    },
    {
      "task_id": "t2",
      "pass": false,
      "score": 0.45,
      "issues": "Missing competitor pricing analysis"
    }
  ],
  "summary": "Research complete but competitive analysis needs pricing data"
}
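Because the evaluator returns structured JSON, downstream tooling can act on it directly. A minimal sketch that extracts the failing tasks and their feedback, using only the field names shown in the example above:

```python
import json

def failing_tasks(evaluator_json: str) -> dict:
    """Return {task_id: issues} for every task that failed evaluation."""
    report = json.loads(evaluator_json)
    return {ev["task_id"]: ev["issues"]
            for ev in report["evaluations"]
            if not ev["pass"]}
```

The returned mapping is exactly what a retry step needs: which tasks to re-run, and what feedback to inject into each retry prompt.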
1. Execute tasks — sub-agents produce results
2. Evaluate — score each task 0.0–1.0
   t2 fails — "Missing pricing analysis" → retry with feedback
   Re-execute t2 — evaluator issues injected
3. Re-evaluate — check improved output
   All pass — proceed to finalization
Self-Healing

Automatic Retry with Feedback

When evaluation fails, the orchestrator doesn't just retry blindly. It injects the evaluator's specific issue descriptions into the retry prompt — giving the sub-agent targeted guidance on what to fix.

Retries are configurable per deployment (max_retries in agent_config.json) and tracked in the execution log for cost analysis.
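The feedback injection described above can be sketched as a prompt template. The wording below is illustrative; the actual NvAgent retry prompt is not documented on this page.

```python
def build_retry_prompt(original_task: str, issues: str) -> str:
    """Inject evaluator feedback into a retry prompt.

    Hypothetical template; the real NvAgent prompt wording may differ.
    """
    return (
        f"{original_task}\n\n"
        "Your previous attempt failed evaluation for the following reasons:\n"
        f"- {issues}\n"
        "Address these issues specifically in your revised output."
    )
```

Keeping the original task text verbatim and appending the issues, rather than paraphrasing the task, avoids drift between attempts.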

Data Persistence

Full Evaluation History

Every evaluation is persisted to PostgreSQL — enabling quality trend analysis, plan grading, and continuous improvement of your agent pipeline.

Table                  Field              Description
orchestrator_plans     overall_pass       Boolean — did the entire plan pass evaluation?
                       summary            Evaluator's overall assessment text
plan_tasks             pass               Boolean — did this individual task pass?
                       score              Float 0.0–1.0 quality score for this task
                       issues             Evaluator's description of specific issues found
subagent_executions    prompt_tokens      Token count for evaluation LLM input
                       completion_tokens  Token count for evaluation LLM output
                       duration           Evaluation execution time in seconds
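The per-plan and per-task writes described above can be sketched as follows, using sqlite3 as a stand-in for PostgreSQL. Table and column names come from the table above; the `id`, `plan_id`, and `task_id` key columns are assumptions about the schema.

```python
import sqlite3

def persist_evaluation(conn, plan_id, report):
    """Write overall and per-task evaluation results (assumed schema)."""
    conn.execute(
        "UPDATE orchestrator_plans SET overall_pass = ?, summary = ? "
        "WHERE id = ?",
        (report["overall_pass"], report["summary"], plan_id))
    for ev in report["evaluations"]:
        conn.execute(
            "UPDATE plan_tasks SET pass = ?, score = ?, issues = ? "
            "WHERE plan_id = ? AND task_id = ?",
            (ev["pass"], ev["score"], ev["issues"], plan_id, ev["task_id"]))
    conn.commit()
```

In production this would run against PostgreSQL (e.g. via psycopg) inside the same transaction as the phase transition, so a crash cannot leave a plan marked passed without its per-task scores.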
Evaluator Configuration

Purpose-Built for Accuracy

The evaluator is configured for maximum precision — low temperature (0.1) for consistent scoring, read-only workspace access, structured output format, and high priority in the execution queue.

  • Temperature 0.1 — deterministic scoring
  • Read-only workspace — no side effects
  • Clean context window — no bleed from other tasks
  • High priority — evaluated before finalization
  • Structured JSON output schema enforced
# subagents/evaluator.yaml
name: "evaluator"
model: "sonnet"
temperature: 0.1
max_tokens: 4096
max_iterations: 3
timeout_seconds: 60
workspace_mode: "readonly"
context_window: "clean"
priority: "high"
output_format: "structured"
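After loading `subagents/evaluator.yaml` (e.g. with `yaml.safe_load`), the resulting dict can be sanity-checked before use. A sketch under the assumption that the config arrives as a plain dict; the required-field set and checks mirror the file and bullet points above, not a documented NvAgent validator:

```python
REQUIRED = {"name", "model", "temperature", "workspace_mode", "output_format"}

def validate_evaluator_config(cfg: dict) -> list:
    """Return a list of problems; an empty list means the config looks sane."""
    problems = [f"missing field: {k}" for k in sorted(REQUIRED - cfg.keys())]
    if cfg.get("temperature", 1.0) > 0.3:
        problems.append("temperature should be low (e.g. 0.1) for consistent scoring")
    if cfg.get("workspace_mode") != "readonly":
        problems.append("evaluator should not have write access to the workspace")
    return problems
```

Failing fast on a miswired evaluator config is cheap insurance: a high-temperature or write-enabled evaluator silently undermines the quality gate.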

Quality You Can Measure & Trust

Every output evaluated, every score persisted, every failure retried with targeted feedback. Quality assurance built into the architecture.


© 2026 NotoVision. All rights reserved. — NvAssistant Platform
