2026.4

Building a Production-Grade Evaluation Stack for Agentic Systems

Evaluation for agents should not be a vibe check.
It should be a typed, replayable, auditable execution system.

Most LLM evaluation frameworks were originally designed for single-turn answers: ask a question, get a response, compare the response against an expected answer, compute a score.

That is not enough for agentic systems.

A real agent does not simply "answer". It observes context, retrieves evidence, calls tools, queries databases, edits files, writes memory, validates outputs, refines drafts, and sometimes mutates the external world. It may submit a form, create a ticket, send an email, generate a report, update a document, or place an order.

Therefore, an agent evaluation system must answer two different classes of questions:

1. Outcome correctness:
   Did the agent actually complete the task?
   Did the external world change correctly?
   Is the final output semantically acceptable?

2. Execution economics:
   How many tokens did the agent burn?
   Which runtime states created the cost?
   How much did tools, APIs, retrieval, validation, and refinement cost?
   Where is the optimization surface?

This report proposes a two-layer evaluation architecture:

Layer 1: Outcome Evaluation
         Determines whether the task was truly completed.

Layer 2: Trace Evaluation
         Builds a token-level, state-level, cost-level execution ledger.

The key idea is simple:

Outcome Evaluation tells us whether the agent works.
Trace Evaluation tells us how expensive it was to make it work.


1. System Overview

We model an agent run as a combination of final-state verification and runtime trace accounting.

flowchart TD
    U[User Task] --> A[Target Agent Runtime]

    A --> O[Final Answer]
    A --> S[External State / Artifacts]
    A --> T[Execution Trace]

    O --> OE[Outcome Evaluation Layer]
    S --> SE[State Change Evaluation]
    T --> TE[Trace Evaluation Layer]

    OE --> R1[Hard Success / Soft Score]
    SE --> R2[State Match / Side Effect Detection]
    TE --> R3[Token Ledger / Cost Ledger / Latency Ledger]

    R1 --> E[Evaluation Warehouse]
    R2 --> E
    R3 --> E

    E --> D[Dashboards]
    E --> M[Failure Attribution]
    E --> C[Cost Optimization]
    E --> X[Future Self-Optimizing Agents]

The evaluation stack has three major outputs:

| Output | Purpose |
| --- | --- |
| eval_task_result | Determines whether the task passes the outcome gate. |
| eval_state_result | Verifies whether files, artifacts, or external systems reached the expected state. |
| trace_run / trace_step | Records the full execution ledger: tokens, cost, latency, tools, context flow. |

This design is intentionally strict:

  • The agent cannot self-certify success.
  • The final answer is not the only thing evaluated.
  • Tool results are not ignored.
  • External mutations must be verified.
  • Token cost must be attributed to specific runtime states.
  • Failure reasons must be structured, not described as vague "bad output".

Part I — Outcome Evaluation Layer

2. What the Outcome Layer Measures

The Outcome Evaluation Layer answers three questions:

1. Was the task completed?
2. Did the external environment or target artifact change correctly?
3. Does the final output meet the required semantic quality?

This layer does not care how many steps the agent took, how many tools it called, or how many tokens it consumed. It only cares about the final result.

In other words:

Outcome Evaluation = final-state correctness
Trace Evaluation   = runtime economics

3. Eval Case as a Contract

Every evaluation case should be treated as a contract.

The contract defines:

  • what the user asked for;
  • what output modules are required;
  • what fields must be present;
  • what errors must never appear;
  • whether external execution is required;
  • what proof is required for external execution.

A typical evaluation case can be expressed as YAML:

task_id: search_agent_001
task_name: Search for Phase III clinical evidence for a drug-indication pair

input:
  user_instruction: >
    Search and summarize whether drug X has Phase III clinical evidence
    for indication Y.

success_criteria:
  required_outputs:
    - final_answer
    - evidence_list
    - citations

  must_include:
    - drug_name
    - indication
    - trial_phase
    - trial_status
    - primary_endpoint
    - conclusion

  must_not_include:
    - fabricated_reference
    - unsupported_claim

  # Only required for executable tasks:
  # ticket creation, booking, ordering, scheduling, form submission,
  # email sending, workflow submission, payment, deletion, etc.
  execution_result:
    required: true

    must_include:
      - execution_status
      - confirmation_id
      - execution_target
      - execution_parameters

    must_not_include:
      - action_not_executed
      - execution_result_not_found
      - wrong_execution_target
      - wrong_execution_parameters
      - duplicate_execution
      - unauthorized_payment
      - unconfirmed_high_risk_action

This structure is critical.

Do not define success like this:

The answer should be good.

Define it like this:

The output must contain specific fields.
The evidence must be checkable.
The citations must exist.
The external action must have a confirmation ID.
The execution target and parameters must match the user intent.
High-risk actions must be explicitly confirmed.

The target agent does not have to output JSON. The YAML or JSON schema is used to define the evaluation contract, not to force a rigid output format on the agent.
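As an illustration, a contract in this shape can be checked mechanically. The sketch below is a hypothetical Python checker: it assumes the agent's output has already been normalized into a flat dictionary with fields and validator-raised flags, which is an implementation choice, not part of the contract itself.

from typing import Any


def check_contract(result: dict[str, Any], contract: dict[str, Any]) -> list[str]:
    """Check a normalized agent result against an eval-case contract.

    result is assumed to look like:
      {"final_answer": "...", "evidence_list": [...], "citations": [...],
       "fields": {"drug_name": "...", ...}, "flags": ["fabricated_reference"]}
    where flags are conditions raised by upstream validators.
    """
    criteria = contract["success_criteria"]
    failures: list[str] = []

    # Required output modules (final_answer, evidence_list, citations, ...).
    for output in criteria.get("required_outputs", []):
        if not result.get(output):
            failures.append("MISSING_REQUIRED_OUTPUT")

    # Required fields from the must_include list.
    for field in criteria.get("must_include", []):
        if field not in result.get("fields", {}):
            failures.append("MISSING_REQUIRED_FIELD")

    # Forbidden conditions, mapped to taxonomy codes,
    # e.g. "fabricated_reference" -> "FABRICATED_REFERENCE".
    for condition in criteria.get("must_not_include", []):
        if condition in result.get("flags", []):
            failures.append(condition.upper())

    return sorted(set(failures))

A passing run returns an empty list; anything else feeds directly into the hard gate and the failure taxonomy described later.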


4. Hard Success vs. Soft Score

A single binary score is too coarse for agent evaluation.

We split task success into two levels:

Hard Success:
  Does the task pass the minimum correctness gate?

Soft Score:
  If the hard gate passes, how good is the result?

4.1 Hard Success

Hard Success is the minimum pass/fail gate.

A task fails immediately if any critical condition is violated.

| Hard Gate | Failure Condition |
| --- | --- |
| Final answer exists | No final answer was generated. |
| Required outputs exist | Required modules such as evidence_list or citations are missing. |
| Required fields exist | Core fields such as drug, indication, trial status, endpoint, or conclusion are missing. |
| Evidence exists | No evidence is provided for evidence-critical domains. |
| Citations are real | The agent cites nonexistent papers, URLs, trial IDs, or database records. |
| Claims are supported | The answer contains claims not supported by evidence. |
| No unauthorized action | The agent performs an action outside its permission boundary. |
| External execution is real | The claimed booking, order, ticket, submission, or email does not exist in the external system. |
| Execution target is correct | The agent acts on the wrong person, account, product, patient, file, or recipient. |
| Execution parameters are correct | Time, amount, quantity, address, fields, or target parameters are wrong. |
| High-risk action is confirmed | Payment, deletion, submission, or sending occurs without required confirmation. |
| No duplicate execution | The agent places duplicate orders, creates duplicate tickets, or sends duplicate emails. |

The output is binary:

Hard Success = true / false

If the hard gate fails, the task is failed regardless of how polished the final text looks.

4.2 Soft Score

Soft Score measures quality after the hard gate passes.

A default scoring formula can be:

Task Success Score =
0.30 × Completeness
+ 0.20 × Evidence Validity
+ 0.20 × Evidence Consistency
+ 0.20 × Methodology Compliance
+ 0.10 × Readability

Each sub-score is normalized to 0–10.

At the benchmark level:

Task Success Rate =
(Number of tasks with Hard Success = true) / (Total number of tasks)

Average Outcome Score =
Average Soft Score over tasks that passed the Hard Gate

This separation makes the evaluation interpretable:

Hard Success tells us whether the agent can complete the task.
Soft Score tells us how well the agent completed it.
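
A minimal sketch of how the two levels combine, assuming each validator has already produced its sub-scores on the 0-10 scale; the weights mirror the default formula above and would normally be configurable per benchmark.

SOFT_SCORE_WEIGHTS = {
    "completeness": 0.30,
    "evidence_validity": 0.20,
    "evidence_consistency": 0.20,
    "methodology": 0.20,
    "readability": 0.10,
}


def soft_score(sub_scores: dict) -> float:
    """Weighted Soft Score over 0-10 sub-scores."""
    return sum(w * sub_scores[name] for name, w in SOFT_SCORE_WEIGHTS.items())


def evaluate_task(hard_gate_failures: list, sub_scores: dict) -> dict:
    """Combine the Hard Gate and the Soft Score into one outcome record."""
    hard_success = len(hard_gate_failures) == 0
    return {
        "hard_success": hard_success,
        # The Soft Score is only defined for tasks that pass the Hard Gate.
        "outcome_score": soft_score(sub_scores) if hard_success else None,
        "failure_reason_codes": hard_gate_failures,
    }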

5. Validator Mesh

After the target agent finishes execution, the final result is passed into a set of validators.

Each validator has a narrow responsibility.

flowchart LR
    R[Agent Result] --> F[Required Field Validator]
    R --> E[Evidence Validator]
    R --> C[Claim-Fact Validator]
    R --> P[Rule / SOP Validator]
    R --> L[LLM Judge]
    R --> X[Execution Validator]

    F --> G[Hard Gate + Soft Score]
    E --> G
    C --> G
    P --> G
    L --> G
    X --> G

| Validator | Responsibility | Metric |
| --- | --- | --- |
| Required Field Validator | Compares output against the Eval Case contract and checks whether required fields are present. | Completeness |
| Evidence Validator | Verifies that citations exist and can be mapped to accessible source material. | Evidence Validity |
| Claim-Fact Validator | Checks whether the key claims are supported by the cited evidence. | Evidence Consistency |
| Rule / SOP Validator | Checks whether the agent followed system prompts, Skill instructions, workflow rules, and domain SOPs. | Methodology Compliance |
| LLM Judge | Evaluates open-ended semantic quality, domain style, logical clarity, and whether the answer matches user intent. | Readability |
| Execution Validator | Verifies whether external actions actually happened and whether execution records match the user instruction. | Execution Success |

The Evidence Validator should have access to the same evidence sources as the target agent. Otherwise, it cannot reliably verify whether a citation is real.

The Rule / SOP Validator should have access to the target agent's system prompt, Skill package, tool specification, and workflow policy, because it needs to check whether the agent followed its own operating contract.
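
The mesh itself can stay thin: each validator implements the same narrow interface and returns a structured result, and aggregation into the hard gate and soft scores happens afterwards. The following is a sketch of one possible shape, not a prescribed API; the names mirror the table above.

from dataclasses import dataclass, field
from typing import Protocol


@dataclass
class ValidatorResult:
    validator_name: str
    passed: bool
    score: float                                  # normalized to 0-10
    failure_reason_codes: list = field(default_factory=list)


class Validator(Protocol):
    name: str

    def validate(self, result: dict, contract: dict) -> ValidatorResult: ...


def run_validator_mesh(result: dict, contract: dict, validators: list) -> list:
    """Run every validator independently; merging happens in the scorer."""
    return [v.validate(result, contract) for v in validators]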


6. Structured Failure Taxonomy

Failure reasons must be structured.

A production evaluation system should not record failure as:

bad answer

It should record:

primary_failure_reason_code = FABRICATED_REFERENCE
failure_reason_codes = [
  "FABRICATED_REFERENCE",
  "CLAIM_EVIDENCE_MISMATCH",
  "LOW_EVIDENCE_CONSISTENCY_SCORE"
]

A recommended failure code taxonomy:

MISSING_FINAL_ANSWER
MISSING_REQUIRED_OUTPUT
MISSING_REQUIRED_FIELD
OUTPUT_FORMAT_INVALID
EMPTY_OR_INVALID_OUTPUT

MISSING_EVIDENCE
MISSING_CITATION
CITATION_NOT_FOUND
FABRICATED_REFERENCE
EVIDENCE_SOURCE_INACCESSIBLE

UNSUPPORTED_CLAIM
CLAIM_EVIDENCE_MISMATCH
CONTRADICTED_BY_EVIDENCE
WRONG_FACT

INCOMPLETE_ANSWER
LOW_COMPLETENESS_SCORE
LOW_EVIDENCE_VALIDITY_SCORE
LOW_EVIDENCE_CONSISTENCY_SCORE
LOW_METHODOLOGY_SCORE
LOW_READABILITY_SCORE

SOP_NOT_FOLLOWED
SYSTEM_PROMPT_VIOLATION
SKILL_INSTRUCTION_VIOLATION

UNAUTHORIZED_ACTION
STATE_MISMATCH
STATE_CHANGE_FAILED
PARTIAL_STATE_CHANGE

TOOL_FAILURE
TOOL_TIMEOUT
EXECUTION_TIMEOUT
EVALUATOR_FAILURE
LOW_CONFIDENCE_EVALUATION
UNKNOWN_FAILURE

ACTION_NOT_EXECUTED
EXECUTION_RESULT_NOT_FOUND
WRONG_EXECUTION_TARGET
WRONG_EXECUTION_PARAMETERS
DUPLICATE_EXECUTION
UNCONFIRMED_HIGH_RISK_ACTION
UNAUTHORIZED_PAYMENT

6.1 Failure Code Semantics

| Code | Meaning |
| --- | --- |
| MISSING_FINAL_ANSWER | The agent did not produce a final conclusion. |
| MISSING_REQUIRED_OUTPUT | Required output modules are missing, such as evidence_list or citations. |
| MISSING_REQUIRED_FIELD | A required field is missing, such as drug name, indication, trial status, or primary endpoint. |
| OUTPUT_FORMAT_INVALID | The output format violates the required structure. |
| EMPTY_OR_INVALID_OUTPUT | The output is empty, malformed, or clearly unrelated to the task. |
| MISSING_EVIDENCE | No evidence is provided in an evidence-critical task. |
| MISSING_CITATION | Evidence is discussed but not cited. |
| CITATION_NOT_FOUND | The cited item cannot be found in the configured source set. |
| FABRICATED_REFERENCE | The agent cites a nonexistent paper, URL, trial ID, database record, or document. |
| EVIDENCE_SOURCE_INACCESSIBLE | The evaluator cannot access the evidence source required for verification. |
| UNSUPPORTED_CLAIM | A claim is made without evidence support. |
| CLAIM_EVIDENCE_MISMATCH | The cited evidence does not support the claim. |
| CONTRADICTED_BY_EVIDENCE | The output contradicts the cited evidence. |
| WRONG_FACT | A key factual element is wrong. |
| INCOMPLETE_ANSWER | The answer does not cover the required parts of the task. |
| LOW_COMPLETENESS_SCORE | Completeness falls below the configured threshold. |
| LOW_EVIDENCE_VALIDITY_SCORE | Citation quality or evidence verifiability is too low. |
| LOW_EVIDENCE_CONSISTENCY_SCORE | Claim-evidence alignment is too weak. |
| LOW_METHODOLOGY_SCORE | The agent did not follow the expected method, Skill, or SOP. |
| LOW_READABILITY_SCORE | The output is hard to read, poorly structured, or domain-inappropriate. |
| SOP_NOT_FOLLOWED | The agent violated a required operating procedure. |
| SYSTEM_PROMPT_VIOLATION | The agent violated core system-level constraints. |
| SKILL_INSTRUCTION_VIOLATION | The agent did not follow the Skill package instructions. |
| UNAUTHORIZED_ACTION | The agent used forbidden tools, accessed unauthorized data, or performed a forbidden operation. |
| STATE_MISMATCH | The agent claimed a state change that did not occur. |
| STATE_CHANGE_FAILED | A required create/update/delete/submit operation failed. |
| PARTIAL_STATE_CHANGE | Only part of the required state mutation was completed. |
| TOOL_FAILURE | A retrieval tool, API, MCP tool, database, or script failed. |
| TOOL_TIMEOUT | A tool call timed out. |
| EXECUTION_TIMEOUT | The overall agent task timed out. |
| EVALUATOR_FAILURE | A validator or LLM judge failed. |
| LOW_CONFIDENCE_EVALUATION | The evaluator produced an unstable or low-confidence judgment. |
| UNKNOWN_FAILURE | The failure cannot be classified. |
| ACTION_NOT_EXECUTED | The agent claimed completion, but no external execution record exists. |
| EXECUTION_RESULT_NOT_FOUND | The order, appointment, ticket, submission, or email record cannot be found. |
| WRONG_EXECUTION_TARGET | The agent acted on the wrong person, item, account, patient, file, or recipient. |
| WRONG_EXECUTION_PARAMETERS | Time, amount, quantity, address, field values, or other execution parameters are wrong. |
| DUPLICATE_EXECUTION | The agent executed the same action multiple times. |
| UNCONFIRMED_HIGH_RISK_ACTION | A high-risk action was executed without required user confirmation. |
| UNAUTHORIZED_PAYMENT | A payment or charge was made without explicit authorization. |

With this taxonomy, failure analysis becomes queryable:

Did the task work?             hard_success
How good was the task?         outcome_score
Why did it fail?               primary_failure_reason_code
Where did it fail?             eval_validator_result
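
With structured codes, benchmark-level failure analysis reduces to counting. A minimal sketch, assuming eval_task_result rows have been loaded as Python dictionaries:

from collections import Counter


def failure_breakdown(task_results: list) -> Counter:
    """Count primary failure reasons across runs that failed the hard gate."""
    return Counter(
        r["primary_failure_reason_code"]
        for r in task_results
        if not r["hard_success"] and r.get("primary_failure_reason_code")
    )

# failure_breakdown(rows).most_common(5) surfaces the dominant failure modes.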

7. Outcome Result Table

A minimal task-level outcome table:

CREATE TABLE eval_task_result (
  eval_id                         TEXT PRIMARY KEY,
  task_id                         TEXT NOT NULL,
  trace_id                        TEXT,
  agent_id                        TEXT,
  skill_id                        TEXT,

  hard_success                    BOOLEAN NOT NULL,
  outcome_score                   NUMERIC,
  completeness_score              NUMERIC,
  evidence_validity_score         NUMERIC,
  evidence_consistency_score      NUMERIC,
  methodology_score               NUMERIC,
  readability_score               NUMERIC,
  execution_success_score         NUMERIC,

  primary_failure_reason_code     TEXT,
  failure_reason_codes            JSONB,

  evaluator_version               TEXT,
  eval_contract_version           TEXT,
  created_at                      TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

A validator-level result table:

CREATE TABLE eval_validator_result (
  eval_id                         TEXT NOT NULL,
  validator_name                  TEXT NOT NULL,
  validator_type                  TEXT NOT NULL,

  passed                          BOOLEAN NOT NULL,
  score                           NUMERIC,
  confidence                      NUMERIC,

  failure_reason_codes            JSONB,
  evidence_refs                   JSONB,
  diagnostic_message              TEXT,

  latency_ms                      INTEGER,
  evaluator_model                 TEXT,
  created_at                      TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

Part II — External State Change Evaluation

8. Why State Change Evaluation Matters

An agent is not just a text generator.

It often creates or mutates external state:

- generate a file;
- modify a document;
- update a database;
- create a ticket;
- submit a form;
- send an email;
- book an appointment;
- place an order;
- trigger an MCP workflow;
- write memory.

Therefore, outcome evaluation must include state verification.

A final message like:

Done.

does not prove that anything was done.

The evaluator must inspect the environment.


9. State Diff Model

We represent external state evaluation as a diff between pre-run and post-run snapshots.

flowchart LR
    B[Before Snapshot] --> D[State Diff Engine]
    A[After Snapshot] --> D
    I[User Intent / Eval Contract] --> D

    D --> H[Hard State Gate]
    D --> Q[State Quality Score]
    D --> S[Side Effect Detector]

State snapshots can include:

| State Type | Example |
| --- | --- |
| File system | Created, modified, deleted files |
| Document artifacts | DOCX, PDF, PPTX, Markdown, CSV, JSON |
| Database state | Inserted, updated, deleted rows |
| Business systems | Orders, appointments, tickets, submissions |
| Communication systems | Sent emails, messages, notifications |
| Workflow systems | MCP actions, pipeline events, approvals |
| Memory stores | Long-term memory or task memory writes |

10. Hard State Gate

The Hard State Gate checks whether the expected state exists and whether unexpected side effects occurred.

Minimum checks:

1. The artifact or external record exists.
2. The artifact or external record is readable.
3. The artifact or external record is non-empty.
4. The target state matches the user intent.
5. No unauthorized or unexpected side effects occurred.

Examples of side effects:

The agent generated the requested report but deleted the input file.
The agent created the expected database record but inserted duplicate rows.
The agent sent the correct email but also sent it to an unintended recipient.
The agent updated the right document but overwrote unrelated sections.

If the user did not request the mutation, it is a side effect.

If the mutation was required by the task, it must be verified against the contract.
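
For file-system state, the diff step is small. The sketch below assumes snapshots are plain path-to-hash mappings taken before and after the run; database and business-system snapshots follow the same pattern with record IDs instead of paths.

def diff_snapshots(before: dict, after: dict, expected_paths: set) -> dict:
    """Compare pre-run and post-run snapshots (path -> content hash).

    Anything created, modified, or deleted outside expected_paths
    is reported as a side effect.
    """
    created = set(after) - set(before)
    deleted = set(before) - set(after)
    modified = {p for p in set(before) & set(after) if before[p] != after[p]}

    touched = created | deleted | modified
    return {
        "expected_state_match": expected_paths <= (created | modified),
        "side_effect_detected": bool(touched - expected_paths),
        "diff_summary": {
            "created": sorted(created),
            "modified": sorted(modified),
            "deleted": sorted(deleted),
        },
    }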


11. State Quality Score

After passing the hard state gate, the generated or modified state can be scored.

State Change Score =
0.40 × Target State Match
+ 0.30 × Content Usability
+ 0.20 × Format Quality
+ 0.10 × Boundary Control

| Metric | Meaning | Evaluation Focus |
| --- | --- | --- |
| Target State Match | Whether the final artifact or external record reached the expected target state. | Correctness of state mutation |
| Content Usability | Whether the content is useful, readable, logically organized, and aligned with user intent. | Practical usability |
| Format Quality | Whether formatting, tables, structure, Markdown, JSON, DOCX, PPTX, or report layout meet expectations. | Presentation quality |
| Boundary Control | Whether the agent avoided irrelevant expansion, unsupported additions, unrelated edits, or over-promising. | Scope control |

12. State Result Table

CREATE TABLE eval_state_result (
  eval_id                         TEXT NOT NULL,
  trace_id                        TEXT NOT NULL,
  state_object_id                 TEXT NOT NULL,

  state_object_type               TEXT,
  state_object_path               TEXT,
  state_action                    TEXT, -- create / modify / delete / submit / send / write

  exists_after_run                BOOLEAN,
  readable_after_run              BOOLEAN,
  non_empty_after_run             BOOLEAN,
  expected_state_match            BOOLEAN,
  side_effect_detected            BOOLEAN,

  target_state_match_score        NUMERIC,
  content_usability_score         NUMERIC,
  format_quality_score            NUMERIC,
  boundary_control_score          NUMERIC,
  state_change_score              NUMERIC,

  failure_reason_codes            JSONB,
  diff_summary                    JSONB,

  created_at                      TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

Part III — Trace Evaluation Layer

13. Goal of Trace Evaluation

The Trace Evaluation Layer does not decide whether the task passes or fails.

It builds an execution ledger.

It answers:

1. How many tokens did the agent consume in total?
2. Which runtime states produced those tokens?
3. How many were uncached input tokens, cached input tokens, output tokens, and reasoning tokens?
4. How much did each token type cost?
5. How much did tools, retrieval, APIs, databases, scripts, file parsing, and file writing cost?
6. Which states dominated the total cost?
7. Which context sources caused input-token amplification?
8. Which tool results entered the next LLM context?
9. Which steps are the best targets for optimization?

The output is not a score.

It is a bill of execution.


14. Agent Runtime as a State Machine

A real agent loop is not linear.

It is not:

retrieve -> read -> answer

It is closer to:

OBSERVE
-> THINK / DECIDE
-> ACTION
-> OBSERVE
-> THINK / DECIDE
-> ACTION
-> ...
-> FINALIZE

We normalize the runtime into a sequence of typed states.

stateDiagram-v2
    [*] --> OBSERVE
    OBSERVE --> THINK
    THINK --> RETRIEVE
    THINK --> MCP_CALL
    THINK --> API_CALL
    THINK --> DB_QUERY
    THINK --> SCRIPT_EXEC
    THINK --> FILE_READ
    THINK --> FILE_WRITE
    THINK --> MEMORY_READ
    THINK --> VALIDATE
    THINK --> FINALIZE

    RETRIEVE --> OBSERVE
    MCP_CALL --> OBSERVE
    API_CALL --> OBSERVE
    DB_QUERY --> OBSERVE
    SCRIPT_EXEC --> OBSERVE
    FILE_READ --> OBSERVE
    FILE_WRITE --> OBSERVE
    MEMORY_READ --> OBSERVE
    VALIDATE --> REFINE
    REFINE --> OBSERVE
    FINALIZE --> [*]

Each state becomes a trace_step.

A full run becomes a trace_run.

Example:

trace_run
  ├── step_1: OBSERVE
  ├── step_2: THINK
  ├── step_3: RETRIEVE
  ├── step_4: OBSERVE
  ├── step_5: THINK
  ├── step_6: DB_QUERY
  ├── step_7: OBSERVE
  ├── step_8: THINK
  ├── step_9: VALIDATE
  ├── step_10: REFINE
  ├── step_11: FINALIZE

15. Runtime State Types

Recommended state taxonomy:

OBSERVE       Read context, environment state, tool results, or intermediate artifacts.
THINK         Plan, reason, decide next action, or generate tool call arguments.
RETRIEVE      Search, retrieval, knowledge-base query, or web search.
MCP_CALL      MCP server tool call.
API_CALL      External API call.
DB_QUERY      SQL, graph query, vector query, or hybrid database access.
SCRIPT_EXEC   Code execution, data processing, sandbox execution.
FILE_READ     File parsing or reading.
FILE_WRITE    File creation, update, append, or export.
MEMORY_READ   Read long-term memory, task memory, or historical preference.
MEMORY_WRITE  Write memory or compressed experience.
VALIDATE      Check, review, fact-verify, format-verify, or compliance-review.
REFINE        Revise, rewrite, repair, or optimize based on validation.
FINALIZE      Produce final answer or submit final artifact.

16. Common Trace Step Schema

Every state should record a shared envelope:

{
  "trace_id": "task_20260428_001",
  "step_id": 3,
  "parent_step_id": 2,
  "state_type": "RETRIEVE",
  "agent_id": "search_agent",
  "skill_id": "clinical_evidence_search",
  "model_name": "model_x",
  "start_time": "2026-04-28T10:00:00Z",
  "end_time": "2026-04-28T10:00:04Z",
  "latency_ms": 4200,
  "status": "success"
}

| Field | Meaning |
| --- | --- |
| trace_id | Unique ID for one agent run. |
| step_id | Step ID inside the run. |
| parent_step_id | Parent step or previous step. |
| state_type | Runtime state type. |
| agent_id | Agent that executed this step. |
| skill_id | Skill or workflow package used by the agent. |
| model_name | Model used in this step, if applicable. |
| start_time | Step start timestamp. |
| end_time | Step end timestamp. |
| latency_ms | Step latency. |
| status | success, error, timeout, cancelled, etc. |
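
In code, the shared envelope can be carried by one small record type, with state-specific fields kept in an open payload. This is a sketch of one possible representation, not a required schema.

from dataclasses import dataclass, field
from datetime import datetime
from typing import Any, Optional


@dataclass
class TraceStep:
    """Shared envelope for one runtime state inside a trace_run."""
    trace_id: str
    step_id: int
    state_type: str                               # OBSERVE, THINK, RETRIEVE, ...
    start_time: datetime
    end_time: datetime
    status: str = "success"
    parent_step_id: Optional[int] = None
    agent_id: Optional[str] = None
    skill_id: Optional[str] = None
    model_name: Optional[str] = None
    payload: dict[str, Any] = field(default_factory=dict)  # state-specific fields

    @property
    def latency_ms(self) -> int:
        return int((self.end_time - self.start_time).total_seconds() * 1000)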

17. Token Accounting Schema

If a state invokes a model, record the token ledger:

{
  "input_tokens_total": 12800,
  "input_tokens_uncached": 4200,
  "input_tokens_cached": 8600,
  "output_tokens": 320,
  "reasoning_tokens": 0,
  "total_tokens": 13120
}

| Field | Meaning |
| --- | --- |
| input_tokens_total | Total input tokens for this model call. |
| input_tokens_uncached | Input tokens not served from cache. |
| input_tokens_cached | Input tokens served from cache. |
| output_tokens | Visible model output tokens. |
| reasoning_tokens | Reasoning tokens if exposed and billable. |
| total_tokens | Total token usage for this step. |

A key accounting rule:

Tool responses are not model output tokens.
But once tool responses enter the next model context, they become input tokens.

Therefore, trace evaluation must track not only model tokens, but also the flow of tool outputs into later model contexts.


18. Context Provenance Breakdown

To understand why input tokens explode, split input tokens by source:

{
  "input_token_breakdown": {
    "system_prompt_tokens": 1200,
    "skill_instruction_tokens": 2600,
    "user_instruction_tokens": 300,
    "history_tokens": 1800,
    "memory_tokens": 600,
    "tool_result_tokens": 4200,
    "retrieved_context_tokens": 1800,
    "artifact_context_tokens": 300,
    "other_context_tokens": 0
  }
}

| Field | Meaning |
| --- | --- |
| system_prompt_tokens | Tokens from system prompt. |
| skill_instruction_tokens | Tokens from Skill files, SOPs, and tool specs. |
| user_instruction_tokens | Tokens from the user's original task. |
| history_tokens | Tokens from previous conversation or previous trace context. |
| memory_tokens | Tokens loaded from memory. |
| tool_result_tokens | Tokens from previous tool outputs entering the context. |
| retrieved_context_tokens | Tokens from retrieved documents or search results. |
| artifact_context_tokens | Tokens from files, tables, intermediate artifacts, or generated outputs. |
| other_context_tokens | Other context sources. |

This makes cost attribution concrete:

The model is not expensive.
The context is expensive.

The user query is not large.
The retrieved evidence is large.

The final answer is not costly.
The repeated validation-refinement loop is costly.

The task is not inherently expensive.
The Skill instruction and history context are too heavy.
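
Attribution of this kind is a one-line aggregation once the breakdown exists. A minimal sketch that ranks context sources by their share of a step's input tokens:

def context_shares(breakdown: dict) -> list:
    """Rank context sources by their share of this step's input tokens."""
    total = sum(breakdown.values())
    if total == 0:
        return []
    return sorted(
        ((source, count / total) for source, count in breakdown.items()),
        key=lambda item: item[1],
        reverse=True,
    )

# For the example breakdown above, tool_result_tokens dominates
# at 4200 / 12800, roughly one third of the step's input.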

19. State-Specific Schemas

19.1 OBSERVE

OBSERVE reads current context, environment state, tool results, or intermediate artifacts.

{
  "state_type": "OBSERVE",
  "observed_sources": [
    "user_input",
    "previous_tool_result",
    "memory",
    "file_state"
  ],
  "observed_tokens": 5200,
  "tokens_sent_to_next_think": 4800
}

| Field | Meaning |
| --- | --- |
| observed_sources | Sources observed in this step. |
| observed_tokens | Raw tokens observed. |
| tokens_sent_to_next_think | Tokens preserved for the next thinking step. |

19.2 THINK

THINK plans, decides, or generates tool-call arguments.

{
  "state_type": "THINK",
  "decision_type": "call_tool",
  "next_action": "RETRIEVE",
  "input_tokens_total": 6800,
  "input_tokens_uncached": 2200,
  "input_tokens_cached": 4600,
  "output_tokens": 420,
  "tool_call_instruction_tokens": 90
}

| Field | Meaning |
| --- | --- |
| decision_type | answer, call_tool, validate, refine, etc. |
| next_action | Next planned runtime state. |
| tool_call_instruction_tokens | Tokens used to generate tool-call instructions. |
| output_tokens | Visible planning, decision, or tool argument tokens. |

If hidden reasoning is not exposed by the model or platform, record only visible planning, decision text, and tool-call arguments.

19.3 RETRIEVE

RETRIEVE covers search, knowledge-base retrieval, vector search, hybrid retrieval, or web search.

{
  "state_type": "RETRIEVE",
  "retriever_name": "knowledge_base_search",
  "query_count": 3,
  "query_tokens": 180,
  "top_k": 20,
  "returned_chunks": 60,
  "raw_result_tokens": 52000,
  "deduplicated_tokens": 36000,
  "selected_tokens": 8000,
  "tokens_sent_to_next_llm": 5000,
  "tool_latency_ms": 2800,
  "tool_cost": 0.02
}

| Field | Meaning |
| --- | --- |
| retriever_name | Name of the retriever. |
| query_count | Number of generated queries. |
| query_tokens | Tokens in retrieval queries. |
| top_k | Number of results returned per query. |
| returned_chunks | Raw returned chunks. |
| raw_result_tokens | Raw retrieved token volume. |
| deduplicated_tokens | Token count after deduplication. |
| selected_tokens | Token count after reranking or filtering. |
| tokens_sent_to_next_llm | Tokens that enter the next model context. |
| tool_latency_ms | Retrieval latency. |
| tool_cost | Retrieval cost. |

This is one of the most important states for search-heavy agents, because retrieval result inflation often dominates total token cost.
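
One concrete control is a hard token budget on what retrieval may push into the next model context. The sketch below is a naive greedy selector, assuming chunks arrive already ranked by relevance and carry a token count; a production pipeline would add deduplication and reranking before this step.

def select_within_budget(ranked_chunks: list, token_budget: int) -> list:
    """Keep the highest-ranked chunks until the context budget is spent.

    Each chunk is assumed to look like {"id": ..., "tokens": int, "text": ...}.
    """
    selected, used = [], 0
    for chunk in ranked_chunks:
        if used + chunk["tokens"] > token_budget:
            continue  # skip chunks that would overflow the budget
        selected.append(chunk)
        used += chunk["tokens"]
    return selected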

19.4 MCP_CALL

MCP_CALL records calls to external MCP servers.

{
  "state_type": "MCP_CALL",
  "mcp_server": "clinical_trial_mcp",
  "tool_name": "search_trial_registry",
  "request_tokens": 260,
  "response_tokens_raw": 12000,
  "response_tokens_selected": 3000,
  "tokens_sent_to_next_llm": 2400,
  "tool_latency_ms": 3500,
  "tool_cost": 0.05
}

19.5 API_CALL

API_CALL records external API invocations.

{
  "state_type": "API_CALL",
  "api_name": "drug_database_api",
  "endpoint": "/v1/drug/trials",
  "request_tokens": 180,
  "response_tokens_raw": 9000,
  "response_tokens_selected": 2500,
  "tokens_sent_to_next_llm": 1800,
  "api_latency_ms": 1600,
  "api_cost": 0.03,
  "http_status": 200
}

19.6 DB_QUERY

DB_QUERY records SQL, graph, vector, or hybrid queries.

{
  "state_type": "DB_QUERY",
  "database_name": "clinical_kb",
  "query_type": "sql",
  "query_tokens": 220,
  "rows_returned": 128,
  "raw_result_tokens": 16000,
  "selected_rows": 12,
  "selected_tokens": 2600,
  "tokens_sent_to_next_llm": 2000,
  "db_latency_ms": 900,
  "db_cost": 0.01
}

19.7 SCRIPT_EXEC

SCRIPT_EXEC records code execution, sandbox runtime, data transformation, and computational workloads.

{
  "state_type": "SCRIPT_EXEC",
  "runtime": "python",
  "script_input_tokens": 900,
  "code_tokens": 1200,
  "stdout_tokens": 1800,
  "stderr_tokens": 0,
  "generated_file_count": 2,
  "tokens_sent_to_next_llm": 1200,
  "execution_time_ms": 5200,
  "cpu_seconds": 4.8,
  "gpu_seconds": 0,
  "compute_cost": 0.04
}

19.8 FILE_READ

FILE_READ records file parsing and content extraction.

{
  "state_type": "FILE_READ",
  "file_path": "/input/protocol.pdf",
  "file_type": "pdf",
  "file_size_bytes": 1839200,
  "raw_extracted_tokens": 48000,
  "selected_tokens": 6000,
  "tokens_sent_to_next_llm": 5000,
  "parse_latency_ms": 2300,
  "parse_cost": 0.02
}

19.9 FILE_WRITE

FILE_WRITE records file generation or modification.

{
  "state_type": "FILE_WRITE",
  "file_path": "/output/report.docx",
  "file_type": "docx",
  "write_mode": "create",
  "content_tokens_written": 8600,
  "formatting_tokens": 1200,
  "generated_file_size_bytes": 236000,
  "write_latency_ms": 1800,
  "write_cost": 0.01
}

19.10 MEMORY_READ

MEMORY_READ records retrieval from long-term memory, task memory, or preference memory.

{
  "state_type": "MEMORY_READ",
  "memory_type": "long_term_memory",
  "query_tokens": 80,
  "raw_memory_tokens": 12000,
  "selected_memory_tokens": 1800,
  "tokens_sent_to_next_llm": 1500,
  "memory_latency_ms": 500
}

19.11 MEMORY_WRITE

MEMORY_WRITE records memory persistence or experience compression.

{
  "state_type": "MEMORY_WRITE",
  "memory_type": "task_memory",
  "raw_content_tokens": 4200,
  "summary_tokens": 600,
  "tokens_written": 600,
  "write_latency_ms": 400
}

19.12 VALIDATE

VALIDATE records review, checking, fact verification, formatting verification, or compliance validation.

{
  "state_type": "VALIDATE",
  "validator_name": "evidence_validator",
  "validation_input_tokens": 9000,
  "input_tokens_uncached": 3000,
  "input_tokens_cached": 6000,
  "validation_output_tokens": 1200,
  "issues_found": 3,
  "issues_fixed_later": 2,
  "validation_latency_ms": 3600,
  "validation_cost": 0.08
}

19.13 REFINE

REFINE records revision, rewriting, repair, or optimization based on validation feedback.

{
  "state_type": "REFINE",
  "refine_reason": "fix_missing_citation",
  "input_tokens_total": 7600,
  "input_tokens_uncached": 2600,
  "input_tokens_cached": 5000,
  "output_tokens": 1800,
  "modified_content_tokens": 1200,
  "refine_latency_ms": 4200,
  "refine_cost": 0.10
}

19.14 FINALIZE

FINALIZE records final answer generation or final artifact submission.

{
  "state_type": "FINALIZE",
  "final_output_type": "answer_with_citations",
  "input_tokens_total": 11000,
  "input_tokens_uncached": 4000,
  "input_tokens_cached": 7000,
  "output_tokens": 3600,
  "final_answer_tokens": 3200,
  "citation_tokens": 400,
  "finalize_latency_ms": 5000,
  "finalize_cost": 0.18
}

Part IV — Cost Model

20. LLM Cost Formula

Each model call should be billed by token type:

1. Uncached input tokens
2. Cached input tokens
3. Output tokens
4. Reasoning tokens, if separately exposed and billed

The cost of a model call:

LLM Cost =
(input_tokens_uncached / 1,000,000) × P_input
+ (input_tokens_cached / 1,000,000) × P_cached_input
+ (output_tokens / 1,000,000) × P_output
+ (reasoning_tokens / 1,000,000) × P_reasoning

If reasoning tokens are not exposed or not separately billed, set:

reasoning_tokens = 0
P_reasoning = 0
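
The formula translates directly into code. A minimal sketch, assuming usage fields follow the token accounting schema in section 17 and prices come from the per-trace price snapshot described in section 23:

def llm_call_cost(usage: dict, prices: dict) -> float:
    """Cost of one model call, billed per million tokens by token type."""
    per_million = 1_000_000
    return (
        usage["input_tokens_uncached"] / per_million * prices["price_input_per_million"]
        + usage["input_tokens_cached"] / per_million * prices["price_cached_input_per_million"]
        + usage["output_tokens"] / per_million * prices["price_output_per_million"]
        + usage.get("reasoning_tokens", 0) / per_million * prices.get("price_reasoning_per_million", 0)
    )

Applied to the token ledger example in section 17 with the price snapshot in section 23, the call costs about 0.073 RMB, most of it from the 4,200 uncached input tokens.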

21. State Cost Formula

A state may include model cost, tool cost, API cost, database cost, compute cost, parsing cost, and writing cost.

State Cost =
LLM Cost
+ Tool Cost
+ API Cost
+ DB Cost
+ Compute Cost
+ Parse Cost
+ Write Cost

22. Task Cost Formula

Total Task Cost =
Σ State Cost

23. Price Snapshot

Price data must be snapshotted at execution time.

{
  "model_name": "model_x",
  "price_input_per_million": 10,
  "price_cached_input_per_million": 2.5,
  "price_output_per_million": 30,
  "price_reasoning_per_million": 30,
  "currency": "RMB",
  "price_version": "2026-04-28"
}

Without a price snapshot, old traces become economically unreplayable after model providers change prices.


Part V — Trace Output

24. Cost Profile Example

Trace Evaluation produces a cost profile, not a pass/fail verdict.

{
  "trace_id": "task_20260428_001",
  "total_tokens": 186000,
  "total_input_tokens": 142000,
  "total_uncached_input_tokens": 58000,
  "total_cached_input_tokens": 84000,
  "total_output_tokens": 44000,
  "total_reasoning_tokens": 0,
  "total_cost": 3.82,
  "currency": "RMB",
  "total_latency_ms": 126000,

  "cost_by_state": {
    "THINK": 0.42,
    "RETRIEVE": 1.28,
    "DB_QUERY": 0.36,
    "VALIDATE": 0.74,
    "REFINE": 0.61,
    "FINALIZE": 0.41
  },

  "token_by_state": {
    "THINK": 22000,
    "RETRIEVE": 64000,
    "DB_QUERY": 18000,
    "VALIDATE": 38000,
    "REFINE": 26000,
    "FINALIZE": 18000
  },

  "main_cost_sources": [
    "RETRIEVE",
    "VALIDATE",
    "REFINE"
  ],

  "conclusion": "The run cost 3.82 RMB. The dominant cost sources were retrieved context entering the LLM context, validation, and refinement. Cached input tokens represented 59.1% of total input tokens, indicating that fixed Skill instructions and historical context benefited from caching."
}

Part VI — Data Model

25. trace_run: Run-Level Ledger

CREATE TABLE trace_run (
  trace_id                         TEXT PRIMARY KEY,
  task_id                          TEXT,
  agent_id                         TEXT,
  skill_id                         TEXT,

  start_time                       TIMESTAMP,
  end_time                         TIMESTAMP,
  total_latency_ms                 INTEGER,

  total_input_tokens               INTEGER,
  total_uncached_input_tokens      INTEGER,
  total_cached_input_tokens        INTEGER,
  total_output_tokens              INTEGER,
  total_reasoning_tokens           INTEGER,
  total_tokens                     INTEGER,

  total_llm_cost                   NUMERIC,
  total_tool_cost                  NUMERIC,
  total_compute_cost               NUMERIC,
  total_cost                       NUMERIC,
  currency                         TEXT,

  main_cost_state                  TEXT,
  created_at                       TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

26. trace_step: State-Level Ledger

CREATE TABLE trace_step (
  trace_id                         TEXT NOT NULL,
  step_id                          INTEGER NOT NULL,
  parent_step_id                   INTEGER,
  state_type                       TEXT NOT NULL,

  agent_id                         TEXT,
  skill_id                         TEXT,
  model_name                       TEXT,

  start_time                       TIMESTAMP,
  end_time                         TIMESTAMP,
  latency_ms                       INTEGER,

  input_tokens_total               INTEGER,
  input_tokens_uncached            INTEGER,
  input_tokens_cached              INTEGER,
  output_tokens                    INTEGER,
  reasoning_tokens                 INTEGER,
  total_tokens                     INTEGER,

  llm_cost                         NUMERIC,
  tool_cost                        NUMERIC,
  compute_cost                     NUMERIC,
  state_cost                       NUMERIC,

  status                           TEXT,
  error_type                       TEXT,

  PRIMARY KEY (trace_id, step_id)
);

27. trace_context_breakdown: Input Provenance

CREATE TABLE trace_context_breakdown (
  trace_id                         TEXT NOT NULL,
  step_id                          INTEGER NOT NULL,

  system_prompt_tokens             INTEGER,
  skill_instruction_tokens         INTEGER,
  user_instruction_tokens          INTEGER,
  history_tokens                   INTEGER,
  memory_tokens                    INTEGER,
  tool_result_tokens               INTEGER,
  retrieved_context_tokens         INTEGER,
  artifact_context_tokens          INTEGER,
  other_context_tokens             INTEGER,

  PRIMARY KEY (trace_id, step_id)
);

28. trace_tool_event: Tool Event Ledger

CREATE TABLE trace_tool_event (
  trace_id                         TEXT NOT NULL,
  step_id                          INTEGER NOT NULL,

  tool_type                        TEXT,
  tool_name                        TEXT,

  request_tokens                   INTEGER,
  response_tokens_raw              INTEGER,
  response_tokens_selected         INTEGER,
  tokens_sent_to_next_llm          INTEGER,

  tool_latency_ms                  INTEGER,
  tool_cost                        NUMERIC,
  status                           TEXT,

  created_at                       TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

29. trace_price_snapshot: Price Versioning

CREATE TABLE trace_price_snapshot (
  trace_id                         TEXT NOT NULL,
  model_name                       TEXT NOT NULL,

  price_input_per_million          NUMERIC,
  price_cached_input_per_million   NUMERIC,
  price_output_per_million         NUMERIC,
  price_reasoning_per_million      NUMERIC,

  currency                         TEXT,
  price_version                    TEXT,
  created_at                       TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

Part VII — Derived Metrics

30. Total Token Consumption

Total Tokens =
Σ input_tokens_total
+ Σ output_tokens
+ Σ reasoning_tokens

31. Total Cost

Total Cost =
Σ State Cost

32. Cache Hit Ratio

Cache Hit Ratio =
total_cached_input_tokens / total_input_tokens

33. Cache Savings

Cache Saving =
(total_cached_input_tokens / 1,000,000)
× (P_input - P_cached_input)

34. State Cost Share

State Cost Share =
State Cost / Total Task Cost

Example:

RETRIEVE Cost Share =
RETRIEVE Cost / Total Task Cost

35. Tool Context Ratio

Tool Context Ratio =
tokens_sent_to_next_llm / response_tokens_raw

This tells us how much raw tool output entered the next LLM context.

A high ratio may indicate insufficient compression, filtering, deduplication, or reranking.

36. Input Amplification Ratio

Input Amplification Ratio =
total_input_tokens / user_instruction_tokens

This metric shows how much the agent expands a small user instruction into runtime context.

Example:

User instruction: 300 tokens
Total input tokens: 142,000 tokens

Input Amplification Ratio = 473.3×

This is one of the most important metrics for agent cost engineering.

37. Validation Repair Rate

Validation Repair Rate =
issues_fixed_later / issues_found

This measures whether validation is actually useful or merely generating noise.

38. Refinement Efficiency

Refinement Efficiency =
modified_content_tokens / refine_output_tokens

If refinement output is large but few tokens are actually modified, the system may be over-generating during repair.

39. Retrieval Compression Ratio

Retrieval Compression Ratio =
tokens_sent_to_next_llm / raw_result_tokens

A lower ratio usually means better retrieval filtering, assuming answer quality is preserved.
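
Most of these ratios fall out of simple aggregations over trace_run and trace_tool_event rows. A minimal sketch, assuming the rows are available as dictionaries:

def derived_metrics(run: dict, tool_events: list, user_instruction_tokens: int) -> dict:
    """Compute the headline trace ratios for one run."""
    raw_tool_tokens = sum(e["response_tokens_raw"] for e in tool_events)
    sent_tool_tokens = sum(e["tokens_sent_to_next_llm"] for e in tool_events)
    return {
        "cache_hit_ratio": run["total_cached_input_tokens"] / run["total_input_tokens"],
        "input_amplification_ratio": run["total_input_tokens"] / user_instruction_tokens,
        # Guard against runs that called no tools at all.
        "tool_context_ratio": sent_tool_tokens / raw_tool_tokens if raw_tool_tokens else 0.0,
    }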


Part VIII — Implementation Blueprint

40. Event Stream First

A production implementation should instrument the agent runtime as an event stream.

Recommended event categories:

run.started
step.started
model.called
tool.called
context.compiled
artifact.created
artifact.modified
state.changed
validator.called
step.completed
run.completed

Every event should include:

{
  "trace_id": "task_20260428_001",
  "step_id": 7,
  "event_type": "model.called",
  "timestamp": "2026-04-28T10:00:07Z",
  "payload": {}
}

The event stream is then normalized into relational tables or analytical storage.
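
Normalization is mostly a fold over the stream: group events by trace and step, collapse each group into one trace_step row, then roll step rows up into trace_run. A minimal sketch of the grouping and token-summing stages, assuming events are dictionaries shaped like the envelope above:

from collections import defaultdict


def group_events_by_step(events: list) -> dict:
    """Bucket raw runtime events by (trace_id, step_id) for normalization."""
    steps = defaultdict(list)
    for event in events:
        steps[(event["trace_id"], event["step_id"])].append(event)
    return steps


def step_token_totals(step_events: list) -> dict:
    """Sum token usage over every model.called event observed in one step."""
    totals = {"input_tokens_total": 0, "output_tokens": 0}
    for event in step_events:
        if event["event_type"] != "model.called":
            continue
        usage = event.get("payload", {})
        totals["input_tokens_total"] += usage.get("input_tokens_total", 0)
        totals["output_tokens"] += usage.get("output_tokens", 0)
    return totals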

41. Runtime Instrumentation Points

Instrumentation should happen at the following boundaries:

| Boundary | What to Capture |
| --- | --- |
| Context compiler | Input token provenance, cacheability, selected context segments |
| Model gateway | Model name, token usage, latency, cost, price snapshot |
| Tool router | Tool name, request size, raw response size, selected response size |
| Retrieval system | Query count, top-k, returned chunks, raw tokens, selected tokens |
| File subsystem | File reads, writes, parses, exports, visual/render checks |
| Database gateway | Query type, rows returned, selected rows, latency, cost |
| Validator runner | Validator input, output, score, confidence, issues found |
| State diff engine | Pre/post state snapshots, expected mutation, side effects |

42. Evaluation Pipeline

flowchart TD
    A[Agent Run] --> B[Runtime Event Stream]
    B --> C[Trace Normalizer]
    C --> D[Token & Cost Ledger]

    A --> E[Final Output]
    A --> F[State Snapshot After Run]

    E --> G[Validator Mesh]
    F --> H[State Diff Engine]

    G --> I[Outcome Result]
    H --> J[State Result]
    D --> K[Trace Result]

    I --> W[Evaluation Warehouse]
    J --> W
    K --> W

    W --> L[Failure Dashboard]
    W --> M[Cost Dashboard]
    W --> N[Optimization Engine]

Part IX — Optimization Surfaces

Trace Evaluation reveals where optimization should happen.

Typical findings:

| Cost Pattern | Likely Cause | Optimization Direction |
| --- | --- | --- |
| High skill_instruction_tokens | Skill package too verbose | Skill compression, modular Skill loading |
| High retrieved_context_tokens | Retrieval returns too much context | Better reranking, deduplication, budgeted context selection |
| High tool_result_tokens | Tool outputs are copied into context too aggressively | Tool result summarization, schema extraction |
| High history_tokens | Conversation history is not compressed | Rolling summary, hierarchical memory |
| High VALIDATE cost | Too many validation rounds | Validator routing, selective validation |
| High REFINE cost | Large rewrites for small fixes | Patch-based refinement |
| Low cache hit ratio | Context is unstable | Stable prompt prefixing, Skill cache segmentation |
| High input amplification ratio | Agent loop expands too aggressively | State pruning, tool result filtering, context budgets |

The point is not simply to reduce tokens.

The point is to reduce useless tokens without reducing task success.


Part X — Future: Self-Optimizing Agents

Once the trace ledger is reliable, it can become input to a self-optimizing agent.

The optimizer can read historical traces and propose improvements such as:

- compress long Skill instructions;
- split large Skills into cacheable modules;
- reduce top-k retrieval dynamically;
- tune context selection budgets;
- summarize tool responses before reinserting them into context;
- remove repeated validation steps;
- convert full rewrites into patch-based edits;
- improve cache hit ratio through stable prompt layouts;
- detect high-cost state patterns and rewrite workflow policy.

This creates a closed loop:

flowchart LR
    A[Agent Run] --> B[Trace Ledger]
    B --> C[Cost Attribution]
    C --> D[Optimization Agent]
    D --> E[Skill / Workflow / Context Policy Patch]
    E --> A

The current phase should not optimize prematurely.

The first milestone is:

Capture the trace completely.
Compute the cost correctly.
Attribute the cost to states and context sources.
Make every failure and every yuan explainable.

43. Engineering Principles

A production-grade agent evaluation system should follow these principles:

Principle 1 — No Self-Certified Success

The agent cannot declare itself successful.

Success must be verified by external validators, evidence checks, state checks, and execution records.

Principle 2 — Outcome and Cost Are Separate

A task can be correct and expensive.

A task can be cheap and wrong.

Both dimensions must be measured independently.

Principle 3 — External State Is First-Class

Files, databases, tickets, appointments, orders, emails, and workflow submissions are not side details.

They are part of the task result.

Principle 4 — Token Flow Must Be Attributed

Total token count is not enough.

Tokens must be attributed to source:

system prompt
Skill instruction
user instruction
history
memory
tool result
retrieved context
artifact context

Principle 5 — Failure Must Be Queryable

Failure messages should be structured into failure codes, validator outputs, and diagnostic metadata.

If failure cannot be aggregated, it cannot be improved.

Principle 6 — Price Must Be Snapshotted

Token prices change.

Every trace must store the price version used at execution time.

Principle 7 — Evaluation Data Should Feed Optimization

Evaluation is not only for reporting.

It should become the data foundation for workflow compression, Skill optimization, retrieval tuning, and future self-evolving agents.


Conclusion

Agent evaluation should be built like observability infrastructure, not like a subjective scoring prompt.

The proposed framework separates the problem into two layers:

Outcome Evaluation:
  Did the agent complete the task correctly?

Trace Evaluation:
  What did it cost to complete the task?

Outcome Evaluation uses contracts, hard gates, soft scores, validators, failure codes, and state-diff verification.

Trace Evaluation turns the agent runtime into an auditable execution ledger with per-state token accounting, context provenance, tool cost, model cost, latency, and optimization metrics.

The final product is not just a score.

It is an engineering substrate for building agents that are:

correct,
auditable,
cost-aware,
optimizable,
and eventually self-improving.

In production agent systems, evaluation is not the last step.

It is the control plane.