Every Output Evaluated & Scored
The evaluator sub-agent validates every result against the original request — scoring completeness, quality, and relevance before the user sees anything.
The Evaluation Pipeline
After all sub-agents complete their tasks, the evaluator reviews every output in Phase 4 of the orchestration loop.
Collect Sub-Agent Results
All task outputs from Phase 3 (Action) are gathered along with the original user request and the execution plan. This provides full context for evaluation.
Evaluator Analysis
The evaluator sub-agent (Claude Sonnet, temperature 0.1) assesses each task result independently against three criteria: completeness, quality, and relevance.
Score & Classify
Each task receives a 0.0–1.0 score with detailed issue descriptions. An overall pass/fail determination is made based on aggregate quality.
Persist to PostgreSQL
Evaluation scores are written to orchestrator_plans (overall_pass, summary) and plan_tasks (per-task pass, score, issues) for analytics and trend tracking.
Retry or Proceed
If evaluation fails and retries remain, specific failing tasks are re-executed with evaluator feedback. Otherwise, results proceed to Phase 5 (Finalization).
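The five steps above can be sketched in code. This is a minimal illustration, not the platform's implementation: the `TaskEvaluation` schema and the `evaluate_task` callable (standing in for the Claude Sonnet evaluator call) are assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class TaskEvaluation:
    """Per-task evaluator result (hypothetical schema)."""
    task_id: str
    score: float              # 0.0-1.0 aggregate quality score
    passed: bool
    issues: list = field(default_factory=list)

def run_evaluation_phase(results, original_request, plan, evaluate_task):
    """Phase 4 sketch: score every Phase 3 output, then gate the pipeline.

    `results` maps task_id -> sub-agent output; `evaluate_task` stands in
    for the evaluator sub-agent call with full context.
    """
    evaluations = []
    for task_id, output in results.items():
        ev = evaluate_task(task_id, output, original_request, plan)
        evaluations.append(ev)
    # Overall pass/fail is the aggregate gate before Phase 5.
    overall_pass = all(ev.passed for ev in evaluations)
    return overall_pass, evaluations
```

A failing `overall_pass` would route back into the retry step; a passing one proceeds to finalization.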
Three Dimensions of Quality
Every sub-agent output is measured against three independent criteria — ensuring comprehensive quality assessment.
Completeness
Did the sub-agent fully address its assigned task? Are all requested elements present? Missing sections, incomplete analysis, or partial outputs are flagged.
Quality
Is the output well-structured and accurate? Are insights meaningful? Does the analysis follow sound methodology? Superficial or incorrect outputs are caught.
Relevance
Does the output directly address the user's original need? Off-topic analysis, tangential information, or misunderstood requirements are identified.
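One way to combine the three criteria into the single 0.0–1.0 task score is a simple mean with a pass threshold. The equal weighting and the 0.7 threshold below are assumptions for illustration, not the platform's actual formula.

```python
CRITERIA = ("completeness", "quality", "relevance")

def aggregate_score(criterion_scores: dict, threshold: float = 0.7):
    """Fold per-criterion scores into one task score plus a pass flag.

    Assumes equal weighting; the real aggregation may differ.
    """
    missing = [c for c in CRITERIA if c not in criterion_scores]
    if missing:
        raise ValueError(f"missing criteria: {missing}")
    score = sum(criterion_scores[c] for c in CRITERIA) / len(CRITERIA)
    return score, score >= threshold
```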
Machine-Readable Evaluation Output
The evaluator returns structured JSON — enabling automated quality gates, trend analysis, and data-driven optimization of your agent pipeline.
Each task receives an independent assessment with a numerical score (0.0–1.0) and specific issue descriptions. The overall pass/fail gates the entire pipeline.
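A consumer of that structured output might look like the sketch below. The JSON field names mirror the persisted columns described later (overall_pass, summary, per-task pass/score/issues), but the exact response schema is an assumption.

```python
import json

# Illustrative evaluator response; the schema is an assumption based on
# the persisted fields, not the platform's documented output format.
raw = """
{
  "overall_pass": false,
  "summary": "Pricing analysis incomplete; one task requires retry.",
  "tasks": [
    {"task_id": "t1", "pass": true,  "score": 0.92, "issues": []},
    {"task_id": "t2", "pass": false, "score": 0.55,
     "issues": ["competitive analysis needs pricing data"]}
  ]
}
"""
evaluation = json.loads(raw)

# The overall pass/fail gates the pipeline; failing task IDs feed the
# retry step with their issue descriptions attached.
failing = [t["task_id"] for t in evaluation["tasks"] if not t["pass"]]
```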
Automatic Retry with Feedback
When evaluation fails, the orchestrator doesn't just retry blindly. It injects the evaluator's specific issue descriptions into the retry prompt — giving the sub-agent targeted guidance on what to fix.
Retries are configurable per deployment (max_retries in agent_config.json) and tracked in the execution log for cost analysis.
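The feedback injection described above could be sketched as a prompt builder. The template wording here is illustrative; only the mechanism (embedding the evaluator's issue list into the retry prompt) comes from the text.

```python
def build_retry_prompt(original_prompt: str, issues: list) -> str:
    """Compose a retry prompt that embeds the evaluator's issue list.

    The surrounding wording is a hypothetical template, not the
    platform's actual prompt.
    """
    feedback = "\n".join(f"- {issue}" for issue in issues)
    return (
        f"{original_prompt}\n\n"
        "A previous attempt failed evaluation for these reasons:\n"
        f"{feedback}\n\n"
        "Revise your output to address each issue specifically."
    )
```

The orchestrator would call this for each failing task until `max_retries` is exhausted, logging each attempt for cost analysis.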
Full Evaluation History
Every evaluation is persisted to PostgreSQL — enabling quality trend analysis, plan grading, and continuous improvement of your agent pipeline.
| Table | Field | Description |
|---|---|---|
| orchestrator_plans | overall_pass | Boolean — did the entire plan pass evaluation? |
| orchestrator_plans | summary | Evaluator's overall assessment text |
| plan_tasks | pass | Boolean — did this individual task pass? |
| plan_tasks | score | Float 0.0–1.0 quality score for this task |
| plan_tasks | issues | Evaluator's description of specific issues found |
| subagent_executions | prompt_tokens | Token count for evaluation LLM input |
| subagent_executions | completion_tokens | Token count for evaluation LLM output |
| subagent_executions | duration | Evaluation execution time in seconds |
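With scores persisted per task, trend analysis reduces to grouping and averaging. The sketch below works on `(plan_id, score)` pairs shaped like `plan_tasks` rows; the grouping key and row shape are assumptions for illustration.

```python
from collections import defaultdict
from statistics import mean

def score_trend(plan_task_rows):
    """Average task score per plan from plan_tasks-shaped rows.

    `plan_task_rows` is an iterable of (plan_id, score) pairs, a
    hypothetical projection of the persisted table.
    """
    by_plan = defaultdict(list)
    for plan_id, score in plan_task_rows:
        by_plan[plan_id].append(score)
    return {plan_id: round(mean(scores), 3)
            for plan_id, scores in by_plan.items()}
```

In production this aggregation would typically run as a SQL query against PostgreSQL rather than in application code.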
Purpose-Built for Accuracy
The evaluator is configured for maximum precision — low temperature (0.1) for consistent scoring, read-only workspace access, structured output format, and high priority in the execution queue.
- Temperature 0.1 — near-deterministic scoring
- Read-only workspace — no side effects
- Clean context window — no bleed from other tasks
- High priority — evaluated before finalization
- Structured JSON output schema enforced
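Expressed as configuration, those properties might look like the dictionary below. The key names are illustrative and not the real `agent_config.json` schema; only the values come from the text above.

```python
# Hypothetical evaluator sub-agent configuration mirroring the listed
# properties; key names are assumptions, not the actual schema.
EVALUATOR_CONFIG = {
    "model": "claude-sonnet",         # evaluator model per the docs
    "temperature": 0.1,               # low temperature for consistent scoring
    "workspace_access": "read_only",  # no side effects
    "fresh_context": True,            # clean context window per evaluation
    "priority": "high",               # runs before Phase 5 finalization
    "output_format": "json",          # structured output schema enforced
}
```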
Quality You Can Measure & Trust
Every output evaluated, every score persisted, every failure retried with targeted feedback. Quality assurance built into the architecture.