2026.4
Building a Production-Grade Evaluation Stack for Agentic Systems
Evaluation for agents should not be a vibe check.
It should be a typed, replayable, auditable execution system.
Most LLM evaluation frameworks were originally designed for single-turn answers: ask a question, get a response, compare the response against an expected answer, compute a score.
That is not enough for agentic systems.
A real agent does not simply "answer". It observes context, retrieves evidence, calls tools, queries databases, edits files, writes memory, validates outputs, refines drafts, and sometimes mutates the external world. It may submit a form, create a ticket, send an email, generate a report, update a document, or place an order.
Therefore, an agent evaluation system must answer two different classes of questions:
1. Outcome correctness:
Did the agent actually complete the task?
Did the external world change correctly?
Is the final output semantically acceptable?
2. Execution economics:
How many tokens did the agent burn?
Which runtime states created the cost?
How much did tools, APIs, retrieval, validation, and refinement cost?
Where is the optimization surface?
This report proposes a two-layer evaluation architecture:
Layer 1: Outcome Evaluation
Determines whether the task was truly completed.
Layer 2: Trace Evaluation
Builds a token-level, state-level, cost-level execution ledger.
The key idea is simple:
Outcome Evaluation tells us whether the agent works.
Trace Evaluation tells us how expensive it was to make it work.
1. System Overview
We model an agent run as a combination of final-state verification and runtime trace accounting.
flowchart TD
U[User Task] --> A[Target Agent Runtime]
A --> O[Final Answer]
A --> S[External State / Artifacts]
A --> T[Execution Trace]
O --> OE[Outcome Evaluation Layer]
S --> SE[State Change Evaluation]
T --> TE[Trace Evaluation Layer]
OE --> R1[Hard Success / Soft Score]
SE --> R2[State Match / Side Effect Detection]
TE --> R3[Token Ledger / Cost Ledger / Latency Ledger]
R1 --> E[Evaluation Warehouse]
R2 --> E
R3 --> E
E --> D[Dashboards]
E --> M[Failure Attribution]
E --> C[Cost Optimization]
E --> X[Future Self-Optimizing Agents]
The evaluation stack has three major outputs:
| Output | Purpose |
|---|---|
| eval_task_result | Determines whether the task passes the outcome gate. |
| eval_state_result | Verifies whether files, artifacts, or external systems reached the expected state. |
| trace_run / trace_step | Records the full execution ledger: tokens, cost, latency, tools, context flow. |
This design is intentionally strict:
- The agent cannot self-certify success.
- The final answer is not the only thing evaluated.
- Tool results are not ignored.
- External mutations must be verified.
- Token cost must be attributed to specific runtime states.
- Failure reasons must be structured, not described as vague "bad output".
Part I — Outcome Evaluation Layer
2. What the Outcome Layer Measures
The Outcome Evaluation Layer answers three questions:
1. Was the task completed?
2. Did the external environment or target artifact change correctly?
3. Does the final output meet the required semantic quality?
This layer does not care how many steps the agent took, how many tools it called, or how many tokens it consumed. It only cares about the final result.
In other words:
Outcome Evaluation = final-state correctness
Trace Evaluation = runtime economics
3. Eval Case as a Contract
Every evaluation case should be treated as a contract.
The contract defines:
- what the user asked for;
- what output modules are required;
- what fields must be present;
- what errors must never appear;
- whether external execution is required;
- what proof is required for external execution.
A typical evaluation case can be expressed as YAML:
task_id: search_agent_001
task_name: Search for Phase III clinical evidence for a drug-indication pair
input:
user_instruction: >
Search and summarize whether drug X has Phase III clinical evidence
for indication Y.
success_criteria:
required_outputs:
- final_answer
- evidence_list
- citations
must_include:
- drug_name
- indication
- trial_phase
- trial_status
- primary_endpoint
- conclusion
must_not_include:
- fabricated_reference
- unsupported_claim
# Only required for executable tasks:
# ticket creation, booking, ordering, scheduling, form submission,
# email sending, workflow submission, payment, deletion, etc.
execution_result:
required: true
must_include:
- execution_status
- confirmation_id
- execution_target
- execution_parameters
must_not_include:
- action_not_executed
- execution_result_not_found
- wrong_execution_target
- wrong_execution_parameters
- duplicate_execution
- unauthorized_payment
- unconfirmed_high_risk_action
This structure is critical.
Do not define success like this:
The answer should be good.
Define it like this:
The output must contain specific fields.
The evidence must be checkable.
The citations must exist.
The external action must have a confirmation ID.
The execution target and parameters must match the user intent.
High-risk actions must be explicitly confirmed.
The target agent does not have to output JSON. The YAML or JSON schema is used to define the evaluation contract, not to force a rigid output format on the agent.
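Because the contract is data, the basic checks can be mechanical. Below is a minimal Python sketch, assuming the YAML above has been parsed into a dict; the check_contract helper, the substring matching of required fields against the answer text, and the toy output are illustrative assumptions, not a prescribed evaluator API.
def check_contract(contract: dict, output: dict) -> list[str]:
    """Return violated contract clauses; an empty list means the contract is satisfied."""
    criteria = contract["success_criteria"]
    violations = []

    # Required output modules must be present and non-empty.
    for module in criteria.get("required_outputs", []):
        if not output.get(module):
            violations.append(f"MISSING_REQUIRED_OUTPUT:{module}")

    # Required fields must appear either as output keys or inside the final answer text.
    answer_text = str(output.get("final_answer", ""))
    for field_name in criteria.get("must_include", []):
        if field_name not in output and field_name not in answer_text:
            violations.append(f"MISSING_REQUIRED_FIELD:{field_name}")

    # Forbidden markers (e.g. flags raised by other validators) must not appear.
    for marker in criteria.get("must_not_include", []):
        if marker in output or marker in answer_text:
            violations.append(f"FORBIDDEN_CONTENT:{marker}")

    return violations

contract = {
    "success_criteria": {
        "required_outputs": ["final_answer", "evidence_list", "citations"],
        "must_include": ["drug_name", "indication", "trial_phase"],
        "must_not_include": ["fabricated_reference"],
    }
}
output = {"final_answer": "drug_name X has Phase III evidence for indication Y ...",
          "evidence_list": ["NCT0000000"], "citations": []}
print(check_contract(contract, output))
# ['MISSING_REQUIRED_OUTPUT:citations', 'MISSING_REQUIRED_FIELD:trial_phase']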
4. Hard Success vs. Soft Score
A single binary score is too coarse for agent evaluation.
We split task success into two levels:
Hard Success:
Does the task pass the minimum correctness gate?
Soft Score:
If the hard gate passes, how good is the result?
4.1 Hard Success
Hard Success is the minimum pass/fail gate.
A task fails immediately if any critical condition is violated.
| Hard Gate | Failure Condition |
|---|---|
| Final answer exists | No final answer was generated. |
| Required outputs exist | Required modules such as evidence_list or citations are missing. |
| Required fields exist | Core fields such as drug, indication, trial status, endpoint, or conclusion are missing. |
| Evidence exists | No evidence is provided for evidence-critical domains. |
| Citations are real | The agent cites nonexistent papers, URLs, trial IDs, or database records. |
| Claims are supported | The answer contains claims not supported by evidence. |
| No unauthorized action | The agent performs an action outside its permission boundary. |
| External execution is real | The claimed booking, order, ticket, submission, or email does not exist in the external system. |
| Execution target is correct | The agent acts on the wrong person, account, product, patient, file, or recipient. |
| Execution parameters are correct | Time, amount, quantity, address, fields, or target parameters are wrong. |
| High-risk action is confirmed | Payment, deletion, submission, or sending occurs without required confirmation. |
| No duplicate execution | The agent places duplicate orders, creates duplicate tickets, or sends duplicate emails. |
The output is binary:
Hard Success = true / false
If the hard gate fails, the task is failed regardless of how polished the final text looks.
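In code, the hard gate is a conjunction over the critical conditions above. A minimal sketch, in which the check names and example values are illustrative rather than a fixed schema:
def hard_success(checks: dict[str, bool]) -> bool:
    """Hard gate: every critical condition must hold; any single violation fails the task."""
    return all(checks.values())

checks = {
    "final_answer_exists": True,
    "required_outputs_present": True,
    "citations_verified": False,        # a fabricated reference was detected
    "execution_record_found": True,
    "execution_target_correct": True,
    "no_duplicate_execution": True,
}
print(hard_success(checks))  # False: the task fails, however polished the final text looks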
4.2 Soft Score
Soft Score measures quality after the hard gate passes.
A default scoring formula can be:
Task Success Score =
0.30 × Completeness
+ 0.20 × Evidence Validity
+ 0.20 × Evidence Consistency
+ 0.20 × Methodology Compliance
+ 0.10 × Readability
Each sub-score is normalized to 0–10.
At the benchmark level:
Task Success Rate =
Number of tasks with Hard Success = true / Total number of tasks
Average Outcome Score =
Average Soft Score over tasks that passed the Hard Gate
This separation makes the evaluation interpretable:
Hard Success tells us whether the agent can complete the task.
Soft Score tells us how well the agent completed it.
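The default formula and the two benchmark-level aggregates translate directly into code. The sketch below mirrors the weights above; the shape of the per-task result dict is an assumption.
# Weights mirror the default Soft Score formula; each sub-score is on a 0-10 scale.
SOFT_WEIGHTS = {
    "completeness": 0.30,
    "evidence_validity": 0.20,
    "evidence_consistency": 0.20,
    "methodology": 0.20,
    "readability": 0.10,
}

def soft_score(sub_scores: dict[str, float]) -> float:
    return sum(w * sub_scores[name] for name, w in SOFT_WEIGHTS.items())

def benchmark_metrics(results: list[dict]) -> dict:
    """results: one dict per task, with 'hard_success' and (if it passed) 'soft_score'."""
    passed = [r for r in results if r["hard_success"]]
    return {
        "task_success_rate": len(passed) / len(results),
        "average_outcome_score":
            sum(r["soft_score"] for r in passed) / len(passed) if passed else None,
    }

print(soft_score({"completeness": 8, "evidence_validity": 7, "evidence_consistency": 9,
                  "methodology": 8, "readability": 9}))  # ≈ 8.1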
5. Validator Mesh
After the target agent finishes execution, the final result is passed into a set of validators.
Each validator has a narrow responsibility.
flowchart LR
R[Agent Result] --> F[Required Field Validator]
R --> E[Evidence Validator]
R --> C[Claim-Fact Validator]
R --> P[Rule / SOP Validator]
R --> L[LLM Judge]
R --> X[Execution Validator]
F --> G[Hard Gate + Soft Score]
E --> G
C --> G
P --> G
L --> G
X --> G
| Validator | Responsibility | Metric |
|---|---|---|
| Required Field Validator | Compares output against the Eval Case contract and checks whether required fields are present. | Completeness |
| Evidence Validator | Verifies that citations exist and can be mapped to accessible source material. | Evidence Validity |
| Claim-Fact Validator | Checks whether the key claims are supported by the cited evidence. | Evidence Consistency |
| Rule / SOP Validator | Checks whether the agent followed system prompts, Skill instructions, workflow rules, and domain SOPs. | Methodology Compliance |
| LLM Judge | Evaluates open-ended semantic quality, domain style, logical clarity, and whether the answer matches user intent. | Readability |
| Execution Validator | Verifies whether external actions actually happened and whether execution records match the user instruction. | Execution Success |
The Evidence Validator should have access to the same evidence sources as the target agent. Otherwise, it cannot reliably verify whether a citation is real.
The Rule / SOP Validator should have access to the target agent's system prompt, Skill package, tool specification, and workflow policy, because it needs to check whether the agent followed its own operating contract.
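One way to wire the mesh is to give every validator the same narrow interface and fan the agent result out to all of them. The following sketch is illustrative: the ValidatorResult fields anticipate the eval_validator_result table defined later, while the RequiredFieldValidator logic and the run_mesh runner are assumptions, not a fixed framework.
from dataclasses import dataclass, field
from typing import Protocol

@dataclass
class ValidatorResult:
    validator_name: str
    passed: bool
    score: float | None = None
    confidence: float | None = None
    failure_reason_codes: list[str] = field(default_factory=list)

class Validator(Protocol):
    name: str
    def validate(self, result: dict, contract: dict) -> ValidatorResult: ...

class RequiredFieldValidator:
    """Compares the output against the Eval Case contract (feeds Completeness)."""
    name = "required_field_validator"

    def validate(self, result: dict, contract: dict) -> ValidatorResult:
        required = contract["success_criteria"]["must_include"]
        missing = [f for f in required if f not in result]
        return ValidatorResult(
            validator_name=self.name,
            passed=not missing,
            score=10 * (1 - len(missing) / len(required)),
            failure_reason_codes=["MISSING_REQUIRED_FIELD"] if missing else [],
        )

def run_mesh(validators: list[Validator], result: dict, contract: dict) -> list[ValidatorResult]:
    # Each validator keeps a narrow responsibility; all results feed the hard gate and soft score.
    return [v.validate(result, contract) for v in validators]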
6. Structured Failure Taxonomy
Failure reasons must be structured.
A production evaluation system should not record failure as:
bad answer
It should record:
primary_failure_reason_code = FABRICATED_REFERENCE
failure_reason_codes = [
"FABRICATED_REFERENCE",
"CLAIM_EVIDENCE_MISMATCH",
"LOW_EVIDENCE_CONSISTENCY_SCORE"
]
A recommended failure code taxonomy:
MISSING_FINAL_ANSWER
MISSING_REQUIRED_OUTPUT
MISSING_REQUIRED_FIELD
OUTPUT_FORMAT_INVALID
EMPTY_OR_INVALID_OUTPUT
MISSING_EVIDENCE
MISSING_CITATION
CITATION_NOT_FOUND
FABRICATED_REFERENCE
EVIDENCE_SOURCE_INACCESSIBLE
UNSUPPORTED_CLAIM
CLAIM_EVIDENCE_MISMATCH
CONTRADICTED_BY_EVIDENCE
WRONG_FACT
INCOMPLETE_ANSWER
LOW_COMPLETENESS_SCORE
LOW_EVIDENCE_VALIDITY_SCORE
LOW_EVIDENCE_CONSISTENCY_SCORE
LOW_METHODOLOGY_SCORE
LOW_READABILITY_SCORE
SOP_NOT_FOLLOWED
SYSTEM_PROMPT_VIOLATION
SKILL_INSTRUCTION_VIOLATION
UNAUTHORIZED_ACTION
STATE_MISMATCH
STATE_CHANGE_FAILED
PARTIAL_STATE_CHANGE
TOOL_FAILURE
TOOL_TIMEOUT
EXECUTION_TIMEOUT
EVALUATOR_FAILURE
LOW_CONFIDENCE_EVALUATION
UNKNOWN_FAILURE
ACTION_NOT_EXECUTED
EXECUTION_RESULT_NOT_FOUND
WRONG_EXECUTION_TARGET
WRONG_EXECUTION_PARAMETERS
DUPLICATE_EXECUTION
UNCONFIRMED_HIGH_RISK_ACTION
UNAUTHORIZED_PAYMENT
6.1 Failure Code Semantics
| Code | Meaning |
|---|---|
| MISSING_FINAL_ANSWER | The agent did not produce a final conclusion. |
| MISSING_REQUIRED_OUTPUT | Required output modules are missing, such as evidence_list or citations. |
| MISSING_REQUIRED_FIELD | A required field is missing, such as drug name, indication, trial status, or primary endpoint. |
| OUTPUT_FORMAT_INVALID | The output format violates the required structure. |
| EMPTY_OR_INVALID_OUTPUT | The output is empty, malformed, or clearly unrelated to the task. |
| MISSING_EVIDENCE | No evidence is provided in an evidence-critical task. |
| MISSING_CITATION | Evidence is discussed but not cited. |
| CITATION_NOT_FOUND | The cited item cannot be found in the configured source set. |
| FABRICATED_REFERENCE | The agent cites a nonexistent paper, URL, trial ID, database record, or document. |
| EVIDENCE_SOURCE_INACCESSIBLE | The evaluator cannot access the evidence source required for verification. |
| UNSUPPORTED_CLAIM | A claim is made without evidence support. |
| CLAIM_EVIDENCE_MISMATCH | The cited evidence does not support the claim. |
| CONTRADICTED_BY_EVIDENCE | The output contradicts the cited evidence. |
| WRONG_FACT | A key factual element is wrong. |
| INCOMPLETE_ANSWER | The answer does not cover the required parts of the task. |
| LOW_COMPLETENESS_SCORE | Completeness falls below the configured threshold. |
| LOW_EVIDENCE_VALIDITY_SCORE | Citation quality or evidence verifiability is too low. |
| LOW_EVIDENCE_CONSISTENCY_SCORE | Claim-evidence alignment is too weak. |
| LOW_METHODOLOGY_SCORE | The agent did not follow the expected method, Skill, or SOP. |
| LOW_READABILITY_SCORE | The output is hard to read, poorly structured, or domain-inappropriate. |
| SOP_NOT_FOLLOWED | The agent violated a required operating procedure. |
| SYSTEM_PROMPT_VIOLATION | The agent violated core system-level constraints. |
| SKILL_INSTRUCTION_VIOLATION | The agent did not follow the Skill package instructions. |
| UNAUTHORIZED_ACTION | The agent used forbidden tools, accessed unauthorized data, or performed a forbidden operation. |
| STATE_MISMATCH | The agent claimed a state change that did not occur. |
| STATE_CHANGE_FAILED | A required create/update/delete/submit operation failed. |
| PARTIAL_STATE_CHANGE | Only part of the required state mutation was completed. |
| TOOL_FAILURE | A retrieval tool, API, MCP tool, database, or script failed. |
| TOOL_TIMEOUT | A tool call timed out. |
| EXECUTION_TIMEOUT | The overall agent task timed out. |
| EVALUATOR_FAILURE | A validator or LLM judge failed. |
| LOW_CONFIDENCE_EVALUATION | The evaluator produced an unstable or low-confidence judgment. |
| UNKNOWN_FAILURE | The failure cannot be classified. |
| ACTION_NOT_EXECUTED | The agent claimed completion, but no external execution record exists. |
| EXECUTION_RESULT_NOT_FOUND | The order, appointment, ticket, submission, or email record cannot be found. |
| WRONG_EXECUTION_TARGET | The agent acted on the wrong person, item, account, patient, file, or recipient. |
| WRONG_EXECUTION_PARAMETERS | Time, amount, quantity, address, field values, or other execution parameters are wrong. |
| DUPLICATE_EXECUTION | The agent executed the same action multiple times. |
| UNCONFIRMED_HIGH_RISK_ACTION | A high-risk action was executed without required user confirmation. |
| UNAUTHORIZED_PAYMENT | A payment or charge was made without explicit authorization. |
With this taxonomy, failure analysis becomes queryable:
Did the task work? hard_success
How good was the task? outcome_score
Why did it fail? primary_failure_reason_code
Where did it fail? eval_validator_result
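For example, a minimal aggregation over task-level results (plain dicts here; a warehouse query against the eval_task_result table defined in the next section would do the same) surfaces the dominant failure modes at a glance.
from collections import Counter

def failure_distribution(task_results: list[dict]) -> Counter:
    """Count primary failure reasons across tasks that failed the hard gate."""
    return Counter(
        r["primary_failure_reason_code"]
        for r in task_results
        if not r["hard_success"]
    )

print(failure_distribution([
    {"hard_success": False, "primary_failure_reason_code": "FABRICATED_REFERENCE"},
    {"hard_success": False, "primary_failure_reason_code": "FABRICATED_REFERENCE"},
    {"hard_success": True,  "primary_failure_reason_code": None},
    {"hard_success": False, "primary_failure_reason_code": "WRONG_EXECUTION_TARGET"},
]))
# Counter({'FABRICATED_REFERENCE': 2, 'WRONG_EXECUTION_TARGET': 1})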
7. Outcome Result Table
A minimal task-level outcome table:
CREATE TABLE eval_task_result (
eval_id TEXT PRIMARY KEY,
task_id TEXT NOT NULL,
trace_id TEXT,
agent_id TEXT,
skill_id TEXT,
hard_success BOOLEAN NOT NULL,
outcome_score NUMERIC,
completeness_score NUMERIC,
evidence_validity_score NUMERIC,
evidence_consistency_score NUMERIC,
methodology_score NUMERIC,
readability_score NUMERIC,
execution_success_score NUMERIC,
primary_failure_reason_code TEXT,
failure_reason_codes JSONB,
evaluator_version TEXT,
eval_contract_version TEXT,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
A validator-level result table:
CREATE TABLE eval_validator_result (
eval_id TEXT NOT NULL,
validator_name TEXT NOT NULL,
validator_type TEXT NOT NULL,
passed BOOLEAN NOT NULL,
score NUMERIC,
confidence NUMERIC,
failure_reason_codes JSONB,
evidence_refs JSONB,
diagnostic_message TEXT,
latency_ms INTEGER,
evaluator_model TEXT,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
Part II — External State Change Evaluation
8. Why State Change Evaluation Matters
An agent is not just a text generator.
It often creates or mutates external state:
- generate a file;
- modify a document;
- update a database;
- create a ticket;
- submit a form;
- send an email;
- book an appointment;
- place an order;
- trigger an MCP workflow;
- write memory.
Therefore, outcome evaluation must include state verification.
A final message like:
Done.
does not prove that anything was done.
The evaluator must inspect the environment.
9. State Diff Model
We represent external state evaluation as a diff between pre-run and post-run snapshots.
flowchart LR
B[Before Snapshot] --> D[State Diff Engine]
A[After Snapshot] --> D
I[User Intent / Eval Contract] --> D
D --> H[Hard State Gate]
D --> Q[State Quality Score]
D --> S[Side Effect Detector]
State snapshots can include:
| State Type | Example |
|---|---|
| File system | Created, modified, deleted files |
| Document artifacts | DOCX, PDF, PPTX, Markdown, CSV, JSON |
| Database state | Inserted, updated, deleted rows |
| Business systems | Orders, appointments, tickets, submissions |
| Communication systems | Sent emails, messages, notifications |
| Workflow systems | MCP actions, pipeline events, approvals |
| Memory stores | Long-term memory or task memory writes |
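A minimal diff engine needs only the two snapshots and the set of objects the contract expects to change; anything else that moved is a side effect. The sketch below assumes snapshots are flat maps from object identifier to a content hash or value, which simplifies real file-system or database snapshots.
def state_diff(before: dict[str, str], after: dict[str, str],
               expected_changes: set[str]) -> dict:
    """Diff pre-run and post-run snapshots (object id -> content hash or value)."""
    created = set(after) - set(before)
    deleted = set(before) - set(after)
    modified = {k for k in before.keys() & after.keys() if before[k] != after[k]}
    touched = created | deleted | modified
    return {
        "created": created,
        "deleted": deleted,
        "modified": modified,
        "expected_but_missing": expected_changes - touched,   # feeds the Hard State Gate
        "side_effects": touched - expected_changes,           # feeds the Side Effect Detector
    }

before = {"/input/protocol.pdf": "a1f3", "/db/orders/42": "v1"}
after = {"/input/protocol.pdf": "a1f3", "/db/orders/42": "v1",
         "/output/report.docx": "9c2e"}
print(state_diff(before, after, expected_changes={"/output/report.docx"}))
# The requested report was created and nothing else was touched: no side effects.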
10. Hard State Gate
Hard State Gate checks whether the expected state exists and whether unexpected side effects occurred.
Minimum checks:
1. The artifact or external record exists.
2. The artifact or external record is readable.
3. The artifact or external record is non-empty.
4. The target state matches the user intent.
5. No unauthorized or unexpected side effects occurred.
Examples of side effects:
The agent generated the requested report but deleted the input file.
The agent created the expected database record but inserted duplicate rows.
The agent sent the correct email but also sent it to an unintended recipient.
The agent updated the right document but overwrote unrelated sections.
If the user did not request the mutation, it is a side effect.
If the mutation was required by the task, it must be verified against the contract.
11. State Quality Score
After passing the hard state gate, the generated or modified state can be scored.
State Change Score =
0.40 × Target State Match
+ 0.30 × Content Usability
+ 0.20 × Format Quality
+ 0.10 × Boundary Control
| Metric | Meaning | Evaluation Focus |
|---|---|---|
| Target State Match | Whether the final artifact or external record reached the expected target state. | Correctness of state mutation |
| Content Usability | Whether the content is useful, readable, logically organized, and aligned with user intent. | Practical usability |
| Format Quality | Whether formatting, tables, structure, Markdown, JSON, DOCX, PPTX, or report layout meet expectations. | Presentation quality |
| Boundary Control | Whether the agent avoided irrelevant expansion, unsupported additions, unrelated edits, or over-promising. | Scope control |
12. State Result Table
CREATE TABLE eval_state_result (
eval_id TEXT NOT NULL,
trace_id TEXT NOT NULL,
state_object_id TEXT NOT NULL,
state_object_type TEXT,
state_object_path TEXT,
state_action TEXT, -- create / modify / delete / submit / send / write
exists_after_run BOOLEAN,
readable_after_run BOOLEAN,
non_empty_after_run BOOLEAN,
expected_state_match BOOLEAN,
side_effect_detected BOOLEAN,
target_state_match_score NUMERIC,
content_usability_score NUMERIC,
format_quality_score NUMERIC,
boundary_control_score NUMERIC,
state_change_score NUMERIC,
failure_reason_codes JSONB,
diff_summary JSONB,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
Part III — Trace Evaluation Layer
13. Goal of Trace Evaluation
The Trace Evaluation Layer does not decide whether the task passes or fails.
It builds an execution ledger.
It answers:
1. How many tokens did the agent consume in total?
2. Which runtime states produced those tokens?
3. How many were uncached input tokens, cached input tokens, output tokens, and reasoning tokens?
4. How much did each token type cost?
5. How much did tools, retrieval, APIs, databases, scripts, file parsing, and file writing cost?
6. Which states dominated the total cost?
7. Which context sources caused input-token amplification?
8. Which tool results entered the next LLM context?
9. Which steps are the best targets for optimization?
The output is not a score.
It is a bill of execution.
14. Agent Runtime as a State Machine
A real agent loop is not linear.
It is not:
retrieve -> read -> answer
It is closer to:
OBSERVE
-> THINK / DECIDE
-> ACTION
-> OBSERVE
-> THINK / DECIDE
-> ACTION
-> ...
-> FINALIZE
We normalize the runtime into a sequence of typed states.
stateDiagram-v2
[*] --> OBSERVE
OBSERVE --> THINK
THINK --> RETRIEVE
THINK --> MCP_CALL
THINK --> API_CALL
THINK --> DB_QUERY
THINK --> SCRIPT_EXEC
THINK --> FILE_READ
THINK --> FILE_WRITE
THINK --> MEMORY_READ
THINK --> VALIDATE
THINK --> FINALIZE
RETRIEVE --> OBSERVE
MCP_CALL --> OBSERVE
API_CALL --> OBSERVE
DB_QUERY --> OBSERVE
SCRIPT_EXEC --> OBSERVE
FILE_READ --> OBSERVE
FILE_WRITE --> OBSERVE
MEMORY_READ --> OBSERVE
VALIDATE --> REFINE
REFINE --> OBSERVE
FINALIZE --> [*]
Each state becomes a trace_step.
A full run becomes a trace_run.
Example:
trace_run
├── step_1: OBSERVE
├── step_2: THINK
├── step_3: RETRIEVE
├── step_4: OBSERVE
├── step_5: THINK
├── step_6: DB_QUERY
├── step_7: OBSERVE
├── step_8: THINK
├── step_9: VALIDATE
├── step_10: REFINE
└── step_11: FINALIZE
15. Runtime State Types
Recommended state taxonomy:
| State | Description |
|---|---|
| OBSERVE | Read context, environment state, tool results, or intermediate artifacts. |
| THINK | Plan, reason, decide next action, or generate tool call arguments. |
| RETRIEVE | Search, retrieval, knowledge-base query, or web search. |
| MCP_CALL | MCP server tool call. |
| API_CALL | External API call. |
| DB_QUERY | SQL, graph query, vector query, or hybrid database access. |
| SCRIPT_EXEC | Code execution, data processing, sandbox execution. |
| FILE_READ | File parsing or reading. |
| FILE_WRITE | File creation, update, append, or export. |
| MEMORY_READ | Read long-term memory, task memory, or historical preference. |
| MEMORY_WRITE | Write memory or compressed experience. |
| VALIDATE | Check, review, fact-verify, format-verify, or compliance-review. |
| REFINE | Revise, rewrite, repair, or optimize based on validation. |
| FINALIZE | Produce final answer or submit final artifact. |
16. Common Trace Step Schema
Every state should record a shared envelope:
{
"trace_id": "task_20260428_001",
"step_id": 3,
"parent_step_id": 2,
"state_type": "RETRIEVE",
"agent_id": "search_agent",
"skill_id": "clinical_evidence_search",
"model_name": "model_x",
"start_time": "2026-04-28T10:00:00Z",
"end_time": "2026-04-28T10:00:04Z",
"latency_ms": 4200,
"status": "success"
}
| Field | Meaning |
|---|---|
| trace_id | Unique ID for one agent run. |
| step_id | Step ID inside the run. |
| parent_step_id | Parent step or previous step. |
| state_type | Runtime state type. |
| agent_id | Agent that executed this step. |
| skill_id | Skill or workflow package used by the agent. |
| model_name | Model used in this step, if applicable. |
| start_time | Step start timestamp. |
| end_time | Step end timestamp. |
| latency_ms | Step latency. |
| status | success, error, timeout, cancelled, etc. |
17. Token Accounting Schema
If a state invokes a model, record the token ledger:
{
"input_tokens_total": 12800,
"input_tokens_uncached": 4200,
"input_tokens_cached": 8600,
"output_tokens": 320,
"reasoning_tokens": 0,
"total_tokens": 13120
}
| Field | Meaning |
|---|---|
| input_tokens_total | Total input tokens for this model call. |
| input_tokens_uncached | Input tokens not served from cache. |
| input_tokens_cached | Input tokens served from cache. |
| output_tokens | Visible model output tokens. |
| reasoning_tokens | Reasoning tokens if exposed and billable. |
| total_tokens | Total token usage for this step. |
A key accounting rule:
Tool responses are not model output tokens.
But once tool responses enter the next model context, they become input tokens.
Therefore, trace evaluation must track not only model tokens, but also the flow of tool outputs into later model contexts.
18. Context Provenance Breakdown
To understand why input tokens explode, split input tokens by source:
{
"input_token_breakdown": {
"system_prompt_tokens": 1200,
"skill_instruction_tokens": 2600,
"user_instruction_tokens": 300,
"history_tokens": 1800,
"memory_tokens": 600,
"tool_result_tokens": 4200,
"retrieved_context_tokens": 1800,
"artifact_context_tokens": 300,
"other_context_tokens": 0
}
}
| Field | Meaning |
|---|---|
| system_prompt_tokens | Tokens from the system prompt. |
| skill_instruction_tokens | Tokens from Skill files, SOPs, and tool specs. |
| user_instruction_tokens | Tokens from the user's original task. |
| history_tokens | Tokens from previous conversation or previous trace context. |
| memory_tokens | Tokens loaded from memory. |
| tool_result_tokens | Tokens from previous tool outputs entering the context. |
| retrieved_context_tokens | Tokens from retrieved documents or search results. |
| artifact_context_tokens | Tokens from files, tables, intermediate artifacts, or generated outputs. |
| other_context_tokens | Other context sources. |
This makes cost attribution concrete:
The model is not expensive.
The context is expensive.
The user query is not large.
The retrieved evidence is large.
The final answer is not costly.
The repeated validation-refinement loop is costly.
The task is not inherently expensive.
The Skill instruction and history context are too heavy.
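A small helper that converts the breakdown into per-source shares makes these statements measurable. The figures reuse the example breakdown above (12,800 input tokens in one step); the provenance_shares helper is an illustrative assumption.
def provenance_shares(breakdown: dict[str, int]) -> dict[str, float]:
    """Share of each context source in this step's total input tokens."""
    total = sum(breakdown.values())
    return {source: round(tokens / total, 3) for source, tokens in breakdown.items()}

shares = provenance_shares({
    "system_prompt_tokens": 1200, "skill_instruction_tokens": 2600,
    "user_instruction_tokens": 300, "history_tokens": 1800,
    "memory_tokens": 600, "tool_result_tokens": 4200,
    "retrieved_context_tokens": 1800, "artifact_context_tokens": 300,
    "other_context_tokens": 0,
})
# tool_result_tokens ≈ 0.328 and skill_instruction_tokens ≈ 0.203 dominate this step,
# while user_instruction_tokens is only ≈ 0.023 of the input.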
19. State-Specific Schemas
19.1 OBSERVE
OBSERVE reads current context, environment state, tool results, or intermediate artifacts.
{
"state_type": "OBSERVE",
"observed_sources": [
"user_input",
"previous_tool_result",
"memory",
"file_state"
],
"observed_tokens": 5200,
"tokens_sent_to_next_think": 4800
}
| Field | Meaning |
|---|---|
| observed_sources | Sources observed in this step. |
| observed_tokens | Raw tokens observed. |
| tokens_sent_to_next_think | Tokens preserved for the next thinking step. |
19.2 THINK
THINK plans, decides, or generates tool-call arguments.
{
"state_type": "THINK",
"decision_type": "call_tool",
"next_action": "RETRIEVE",
"input_tokens_total": 6800,
"input_tokens_uncached": 2200,
"input_tokens_cached": 4600,
"output_tokens": 420,
"tool_call_instruction_tokens": 90
}
| Field | Meaning |
|---|---|
| decision_type | answer, call_tool, validate, refine, etc. |
| next_action | Next planned runtime state. |
| tool_call_instruction_tokens | Tokens used to generate tool-call instructions. |
| output_tokens | Visible planning, decision, or tool argument tokens. |
If hidden reasoning is not exposed by the model or platform, record only visible planning, decision text, and tool-call arguments.
19.3 RETRIEVE
RETRIEVE covers search, knowledge-base retrieval, vector search, hybrid retrieval, or web search.
{
"state_type": "RETRIEVE",
"retriever_name": "knowledge_base_search",
"query_count": 3,
"query_tokens": 180,
"top_k": 20,
"returned_chunks": 60,
"raw_result_tokens": 52000,
"deduplicated_tokens": 36000,
"selected_tokens": 8000,
"tokens_sent_to_next_llm": 5000,
"tool_latency_ms": 2800,
"tool_cost": 0.02
}
| Field | Meaning |
|---|---|
| retriever_name | Name of the retriever. |
| query_count | Number of generated queries. |
| query_tokens | Tokens in retrieval queries. |
| top_k | Number of results returned per query. |
| returned_chunks | Raw returned chunks. |
| raw_result_tokens | Raw retrieved token volume. |
| deduplicated_tokens | Token count after deduplication. |
| selected_tokens | Token count after reranking or filtering. |
| tokens_sent_to_next_llm | Tokens that enter the next model context. |
| tool_latency_ms | Retrieval latency. |
| tool_cost | Retrieval cost. |
This is one of the most important states for search-heavy agents, because retrieval result inflation often dominates total token cost.
19.4 MCP_CALL
MCP_CALL records calls to external MCP servers.
{
"state_type": "MCP_CALL",
"mcp_server": "clinical_trial_mcp",
"tool_name": "search_trial_registry",
"request_tokens": 260,
"response_tokens_raw": 12000,
"response_tokens_selected": 3000,
"tokens_sent_to_next_llm": 2400,
"tool_latency_ms": 3500,
"tool_cost": 0.05
}
19.5 API_CALL
API_CALL records external API invocations.
{
"state_type": "API_CALL",
"api_name": "drug_database_api",
"endpoint": "/v1/drug/trials",
"request_tokens": 180,
"response_tokens_raw": 9000,
"response_tokens_selected": 2500,
"tokens_sent_to_next_llm": 1800,
"api_latency_ms": 1600,
"api_cost": 0.03,
"http_status": 200
}
19.6 DB_QUERY
DB_QUERY records SQL, graph, vector, or hybrid queries.
{
"state_type": "DB_QUERY",
"database_name": "clinical_kb",
"query_type": "sql",
"query_tokens": 220,
"rows_returned": 128,
"raw_result_tokens": 16000,
"selected_rows": 12,
"selected_tokens": 2600,
"tokens_sent_to_next_llm": 2000,
"db_latency_ms": 900,
"db_cost": 0.01
}
19.7 SCRIPT_EXEC
SCRIPT_EXEC records code execution, sandbox runtime, data transformation, and computational workloads.
{
"state_type": "SCRIPT_EXEC",
"runtime": "python",
"script_input_tokens": 900,
"code_tokens": 1200,
"stdout_tokens": 1800,
"stderr_tokens": 0,
"generated_file_count": 2,
"tokens_sent_to_next_llm": 1200,
"execution_time_ms": 5200,
"cpu_seconds": 4.8,
"gpu_seconds": 0,
"compute_cost": 0.04
}
19.8 FILE_READ
FILE_READ records file parsing and content extraction.
{
"state_type": "FILE_READ",
"file_path": "/input/protocol.pdf",
"file_type": "pdf",
"file_size_bytes": 1839200,
"raw_extracted_tokens": 48000,
"selected_tokens": 6000,
"tokens_sent_to_next_llm": 5000,
"parse_latency_ms": 2300,
"parse_cost": 0.02
}
19.9 FILE_WRITE
FILE_WRITE records file generation or modification.
{
"state_type": "FILE_WRITE",
"file_path": "/output/report.docx",
"file_type": "docx",
"write_mode": "create",
"content_tokens_written": 8600,
"formatting_tokens": 1200,
"generated_file_size_bytes": 236000,
"write_latency_ms": 1800,
"write_cost": 0.01
}
19.10 MEMORY_READ
MEMORY_READ records retrieval from long-term memory, task memory, or preference memory.
{
"state_type": "MEMORY_READ",
"memory_type": "long_term_memory",
"query_tokens": 80,
"raw_memory_tokens": 12000,
"selected_memory_tokens": 1800,
"tokens_sent_to_next_llm": 1500,
"memory_latency_ms": 500
}
19.11 MEMORY_WRITE
MEMORY_WRITE records memory persistence or experience compression.
{
"state_type": "MEMORY_WRITE",
"memory_type": "task_memory",
"raw_content_tokens": 4200,
"summary_tokens": 600,
"tokens_written": 600,
"write_latency_ms": 400
}
19.12 VALIDATE
VALIDATE records review, checking, fact verification, formatting verification, or compliance validation.
{
"state_type": "VALIDATE",
"validator_name": "evidence_validator",
"validation_input_tokens": 9000,
"input_tokens_uncached": 3000,
"input_tokens_cached": 6000,
"validation_output_tokens": 1200,
"issues_found": 3,
"issues_fixed_later": 2,
"validation_latency_ms": 3600,
"validation_cost": 0.08
}
19.13 REFINE
REFINE records revision, rewriting, repair, or optimization based on validation feedback.
{
"state_type": "REFINE",
"refine_reason": "fix_missing_citation",
"input_tokens_total": 7600,
"input_tokens_uncached": 2600,
"input_tokens_cached": 5000,
"output_tokens": 1800,
"modified_content_tokens": 1200,
"refine_latency_ms": 4200,
"refine_cost": 0.10
}
19.14 FINALIZE
FINALIZE records final answer generation or final artifact submission.
{
"state_type": "FINALIZE",
"final_output_type": "answer_with_citations",
"input_tokens_total": 11000,
"input_tokens_uncached": 4000,
"input_tokens_cached": 7000,
"output_tokens": 3600,
"final_answer_tokens": 3200,
"citation_tokens": 400,
"finalize_latency_ms": 5000,
"finalize_cost": 0.18
}
Part IV — Cost Model
20. LLM Cost Formula
Each model call should be billed by token type:
1. Uncached input tokens
2. Cached input tokens
3. Output tokens
4. Reasoning tokens, if separately exposed and billed
The cost of a model call:
LLM Cost =
(input_tokens_uncached / 1,000,000) × P_input
+ (input_tokens_cached / 1,000,000) × P_cached_input
+ (output_tokens / 1,000,000) × P_output
+ (reasoning_tokens / 1,000,000) × P_reasoning
If reasoning tokens are not exposed or not separately billed, set:
reasoning_tokens = 0
P_reasoning = 0
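The formula translates directly into a helper. The price fields mirror the price snapshot structure used later in this report and the usage fields mirror the token accounting schema above; the llm_cost function itself is a sketch, not a billing API.
def llm_cost(usage: dict, price: dict) -> float:
    """Cost of one model call, billed per token type at the snapshotted per-million prices."""
    per_m = 1_000_000
    return (
        usage["input_tokens_uncached"] / per_m * price["price_input_per_million"]
        + usage["input_tokens_cached"] / per_m * price["price_cached_input_per_million"]
        + usage["output_tokens"] / per_m * price["price_output_per_million"]
        + usage.get("reasoning_tokens", 0) / per_m * price.get("price_reasoning_per_million", 0)
    )

price = {"price_input_per_million": 10, "price_cached_input_per_million": 2.5,
         "price_output_per_million": 30, "price_reasoning_per_million": 30}
usage = {"input_tokens_uncached": 4200, "input_tokens_cached": 8600,
         "output_tokens": 320, "reasoning_tokens": 0}
print(llm_cost(usage, price))  # ≈ 0.0731 in the snapshot currency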
21. State Cost Formula
A state may include model cost, tool cost, API cost, database cost, compute cost, parsing cost, and writing cost.
State Cost =
LLM Cost
+ Tool Cost
+ API Cost
+ DB Cost
+ Compute Cost
+ Parse Cost
+ Write Cost
22. Task Cost Formula
Total Task Cost =
Σ State Cost
23. Price Snapshot
Price data must be snapshotted at execution time.
{
"model_name": "model_x",
"price_input_per_million": 10,
"price_cached_input_per_million": 2.5,
"price_output_per_million": 30,
"price_reasoning_per_million": 30,
"currency": "RMB",
"price_version": "2026-04-28"
}
Without a price snapshot, old traces become economically unreplayable after model providers change prices.
Part V — Trace Output
24. Cost Profile Example
Trace Evaluation produces a cost profile, not a pass/fail verdict.
{
"trace_id": "task_20260428_001",
"total_tokens": 186000,
"total_input_tokens": 142000,
"total_uncached_input_tokens": 58000,
"total_cached_input_tokens": 84000,
"total_output_tokens": 44000,
"total_reasoning_tokens": 0,
"total_cost": 3.82,
"currency": "RMB",
"total_latency_ms": 126000,
"cost_by_state": {
"THINK": 0.42,
"RETRIEVE": 1.28,
"DB_QUERY": 0.36,
"VALIDATE": 0.74,
"REFINE": 0.61,
"FINALIZE": 0.41
},
"token_by_state": {
"THINK": 22000,
"RETRIEVE": 64000,
"DB_QUERY": 18000,
"VALIDATE": 38000,
"REFINE": 26000,
"FINALIZE": 18000
},
"main_cost_sources": [
"RETRIEVE",
"VALIDATE",
"REFINE"
],
"conclusion": "The run cost 3.82 RMB. The dominant cost sources were retrieved context entering the LLM context, validation, and refinement. Cached input tokens represented 59.1% of total input tokens, indicating that fixed Skill instructions and historical context benefited from caching."
}
Part VI — Data Model
25. trace_run: Run-Level Ledger
CREATE TABLE trace_run (
trace_id TEXT PRIMARY KEY,
task_id TEXT,
agent_id TEXT,
skill_id TEXT,
start_time TIMESTAMP,
end_time TIMESTAMP,
total_latency_ms INTEGER,
total_input_tokens INTEGER,
total_uncached_input_tokens INTEGER,
total_cached_input_tokens INTEGER,
total_output_tokens INTEGER,
total_reasoning_tokens INTEGER,
total_tokens INTEGER,
total_llm_cost NUMERIC,
total_tool_cost NUMERIC,
total_compute_cost NUMERIC,
total_cost NUMERIC,
currency TEXT,
main_cost_state TEXT,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
26. trace_step: State-Level Ledger
CREATE TABLE trace_step (
trace_id TEXT NOT NULL,
step_id INTEGER NOT NULL,
parent_step_id INTEGER,
state_type TEXT NOT NULL,
agent_id TEXT,
skill_id TEXT,
model_name TEXT,
start_time TIMESTAMP,
end_time TIMESTAMP,
latency_ms INTEGER,
input_tokens_total INTEGER,
input_tokens_uncached INTEGER,
input_tokens_cached INTEGER,
output_tokens INTEGER,
reasoning_tokens INTEGER,
total_tokens INTEGER,
llm_cost NUMERIC,
tool_cost NUMERIC,
compute_cost NUMERIC,
state_cost NUMERIC,
status TEXT,
error_type TEXT,
PRIMARY KEY (trace_id, step_id)
);
27. trace_context_breakdown: Input Provenance
CREATE TABLE trace_context_breakdown (
trace_id TEXT NOT NULL,
step_id INTEGER NOT NULL,
system_prompt_tokens INTEGER,
skill_instruction_tokens INTEGER,
user_instruction_tokens INTEGER,
history_tokens INTEGER,
memory_tokens INTEGER,
tool_result_tokens INTEGER,
retrieved_context_tokens INTEGER,
artifact_context_tokens INTEGER,
other_context_tokens INTEGER,
PRIMARY KEY (trace_id, step_id)
);
28. trace_tool_event: Tool Event Ledger
CREATE TABLE trace_tool_event (
trace_id TEXT NOT NULL,
step_id INTEGER NOT NULL,
tool_type TEXT,
tool_name TEXT,
request_tokens INTEGER,
response_tokens_raw INTEGER,
response_tokens_selected INTEGER,
tokens_sent_to_next_llm INTEGER,
tool_latency_ms INTEGER,
tool_cost NUMERIC,
status TEXT,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
29. trace_price_snapshot: Price Versioning
CREATE TABLE trace_price_snapshot (
trace_id TEXT NOT NULL,
model_name TEXT NOT NULL,
price_input_per_million NUMERIC,
price_cached_input_per_million NUMERIC,
price_output_per_million NUMERIC,
price_reasoning_per_million NUMERIC,
currency TEXT,
price_version TEXT,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
Part VII — Derived Metrics
30. Total Token Consumption
Total Tokens =
Σ input_tokens_total
+ Σ output_tokens
+ Σ reasoning_tokens
31. Total Cost
Total Cost =
Σ State Cost
32. Cache Hit Ratio
Cache Hit Ratio =
total_cached_input_tokens / total_input_tokens
33. Cache Savings
Cache Saving =
(total_cached_input_tokens / 1,000,000)
× (P_input - P_cached_input)
34. State Cost Share
State Cost Share =
State Cost / Total Task Cost
Example:
RETRIEVE Cost Share =
RETRIEVE Cost / Total Task Cost
35. Tool Context Ratio
Tool Context Ratio =
tokens_sent_to_next_llm / response_tokens_raw
This tells us how much raw tool output entered the next LLM context.
A high ratio may indicate insufficient compression, filtering, deduplication, or reranking.
36. Input Amplification Ratio
Input Amplification Ratio =
total_input_tokens / user_instruction_tokens
This metric shows how much the agent expands a small user instruction into runtime context.
Example:
User instruction: 300 tokens
Total input tokens: 142,000 tokens
Input Amplification Ratio = 473.3×
This is one of the most important metrics for agent cost engineering.
37. Validation Repair Rate
Validation Repair Rate =
issues_fixed_later / issues_found
This measures whether validation is actually useful or merely generating noise.
38. Refinement Efficiency
Refinement Efficiency =
modified_content_tokens / refine_output_tokens
If refinement output is large but few tokens are actually modified, the system may be over-generating during repair.
39. Retrieval Compression Ratio
Retrieval Compression Ratio =
tokens_sent_to_next_llm / raw_result_tokens
A lower ratio usually means better retrieval filtering, assuming answer quality is preserved.
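All of these ratios can be computed from the ledgers defined in Part VI. For brevity the sketch below pulls the needed fields into one flat dict (in practice they live in trace_run, trace_price_snapshot, and trace_context_breakdown); the numbers reuse the cost profile example from Part V.
def derived_metrics(run: dict) -> dict:
    """Derived economics for one run; field names follow the ledgers defined above."""
    return {
        "cache_hit_ratio": run["total_cached_input_tokens"] / run["total_input_tokens"],
        "cache_saving": run["total_cached_input_tokens"] / 1_000_000
                        * (run["price_input_per_million"] - run["price_cached_input_per_million"]),
        "input_amplification_ratio": run["total_input_tokens"] / run["user_instruction_tokens"],
        "state_cost_share": {state: round(cost / run["total_cost"], 3)
                             for state, cost in run["cost_by_state"].items()},
    }

run = {
    "total_input_tokens": 142_000, "total_cached_input_tokens": 84_000,
    "user_instruction_tokens": 300, "total_cost": 3.82,
    "price_input_per_million": 10, "price_cached_input_per_million": 2.5,
    "cost_by_state": {"RETRIEVE": 1.28, "VALIDATE": 0.74, "REFINE": 0.61},
}
print(derived_metrics(run))
# cache_hit_ratio ≈ 0.592, cache_saving = 0.63, input_amplification_ratio ≈ 473.3,
# and RETRIEVE alone carries ≈ 0.335 of the total cost.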
Part VIII — Implementation Blueprint
40. Event Stream First
A production implementation should instrument the agent runtime as an event stream.
Recommended event categories:
run.started
step.started
model.called
tool.called
context.compiled
artifact.created
artifact.modified
state.changed
validator.called
step.completed
run.completed
Every event should include:
{
"trace_id": "task_20260428_001",
"step_id": 7,
"event_type": "model.called",
"timestamp": "2026-04-28T10:00:07Z",
"payload": {}
}
The event stream is then normalized into relational tables or analytical storage.
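A minimal normalizer folds the event stream into per-step rows that can then be loaded into trace_step. The sketch below assumes the event envelope shown above; exactly which payload fields each event type carries (state_type, token usage, tool cost, status) is an assumption about the instrumentation, not a fixed contract.
from collections import defaultdict

def normalize_events(events: list[dict]) -> dict[tuple, dict]:
    """Fold a runtime event stream into per-step rows keyed by (trace_id, step_id)."""
    steps: dict[tuple, dict] = defaultdict(dict)
    for ev in events:
        row = steps[(ev["trace_id"], ev.get("step_id"))]
        if ev["event_type"] == "step.started":
            row["start_time"] = ev["timestamp"]
            row["state_type"] = ev["payload"].get("state_type")
        elif ev["event_type"] == "model.called":
            row.update(ev["payload"])          # token usage, model name, llm cost
        elif ev["event_type"] == "tool.called":
            row["tool_cost"] = row.get("tool_cost", 0) + ev["payload"].get("tool_cost", 0)
        elif ev["event_type"] == "step.completed":
            row["end_time"] = ev["timestamp"]
            row["status"] = ev["payload"].get("status", "success")
    return steps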
41. Runtime Instrumentation Points
Instrumentation should happen at the following boundaries:
| Boundary | What to Capture |
|---|---|
| Context compiler | Input token provenance, cacheability, selected context segments |
| Model gateway | Model name, token usage, latency, cost, price snapshot |
| Tool router | Tool name, request size, raw response size, selected response size |
| Retrieval system | Query count, top-k, returned chunks, raw tokens, selected tokens |
| File subsystem | File reads, writes, parses, exports, visual/render checks |
| Database gateway | Query type, rows returned, selected rows, latency, cost |
| Validator runner | Validator input, output, score, confidence, issues found |
| State diff engine | Pre/post state snapshots, expected mutation, side effects |
42. Evaluation Pipeline
flowchart TD
A[Agent Run] --> B[Runtime Event Stream]
B --> C[Trace Normalizer]
C --> D[Token & Cost Ledger]
A --> E[Final Output]
A --> F[State Snapshot After Run]
E --> G[Validator Mesh]
F --> H[State Diff Engine]
G --> I[Outcome Result]
H --> J[State Result]
D --> K[Trace Result]
I --> W[Evaluation Warehouse]
J --> W
K --> W
W --> L[Failure Dashboard]
W --> M[Cost Dashboard]
W --> N[Optimization Engine]
Part IX — Optimization Surfaces
Trace Evaluation reveals where optimization should happen.
Typical findings:
| Cost Pattern | Likely Cause | Optimization Direction |
|---|---|---|
| High skill_instruction_tokens | Skill package too verbose | Skill compression, modular Skill loading |
| High retrieved_context_tokens | Retrieval returns too much context | Better reranking, deduplication, budgeted context selection |
| High tool_result_tokens | Tool outputs are copied into context too aggressively | Tool result summarization, schema extraction |
| High history_tokens | Conversation history is not compressed | Rolling summary, hierarchical memory |
| High VALIDATE cost | Too many validation rounds | Validator routing, selective validation |
| High REFINE cost | Large rewrites for small fixes | Patch-based refinement |
| Low cache hit ratio | Context is unstable | Stable prompt prefixing, Skill cache segmentation |
| High input amplification ratio | Agent loop expands too aggressively | State pruning, tool result filtering, context budgets |
The point is not simply to reduce tokens.
The point is to reduce useless tokens without reducing task success.
Part X — Future: Self-Optimizing Agents
Once the trace ledger is reliable, it can become input to a self-optimizing agent.
The optimizer can read historical traces and propose improvements such as:
- compress long Skill instructions;
- split large Skills into cacheable modules;
- reduce top-k retrieval dynamically;
- tune context selection budgets;
- summarize tool responses before reinserting them into context;
- remove repeated validation steps;
- convert full rewrites into patch-based edits;
- improve cache hit ratio through stable prompt layouts;
- detect high-cost state patterns and rewrite workflow policy.
This creates a closed loop:
flowchart LR
A[Agent Run] --> B[Trace Ledger]
B --> C[Cost Attribution]
C --> D[Optimization Agent]
D --> E[Skill / Workflow / Context Policy Patch]
E --> A
The current phase should not optimize prematurely.
The first milestone is:
Capture the trace completely.
Compute the cost correctly.
Attribute the cost to states and context sources.
Make every failure and every yuan explainable.
43. Engineering Principles
A production-grade agent evaluation system should follow these principles:
Principle 1 — No Self-Certified Success
The agent cannot declare itself successful.
Success must be verified by external validators, evidence checks, state checks, and execution records.
Principle 2 — Outcome and Cost Are Separate
A task can be correct and expensive.
A task can be cheap and wrong.
Both dimensions must be measured independently.
Principle 3 — External State Is First-Class
Files, databases, tickets, appointments, orders, emails, and workflow submissions are not side details.
They are part of the task result.
Principle 4 — Token Flow Must Be Attributed
Total token count is not enough.
Tokens must be attributed to source:
system prompt
Skill instruction
user instruction
history
memory
tool result
retrieved context
artifact context
Principle 5 — Failure Must Be Queryable
Failure messages should be structured into failure codes, validator outputs, and diagnostic metadata.
If failure cannot be aggregated, it cannot be improved.
Principle 6 — Price Must Be Snapshotted
Token prices change.
Every trace must store the price version used at execution time.
Principle 7 — Evaluation Data Should Feed Optimization
Evaluation is not only for reporting.
It should become the data foundation for workflow compression, Skill optimization, retrieval tuning, and future self-evolving agents.
Conclusion
Agent evaluation should be built like observability infrastructure, not like a subjective scoring prompt.
The proposed framework separates the problem into two layers:
Outcome Evaluation:
Did the agent complete the task correctly?
Trace Evaluation:
What did it cost to complete the task?
Outcome Evaluation uses contracts, hard gates, soft scores, validators, failure codes, and state-diff verification.
Trace Evaluation turns the agent runtime into an auditable execution ledger with per-state token accounting, context provenance, tool cost, model cost, latency, and optimization metrics.
The final product is not just a score.
It is an engineering substrate for building agents that are:
correct,
auditable,
cost-aware,
optimizable,
and eventually self-improving.
In production agent systems, evaluation is not the last step.
It is the control plane.