2026.4

Building a Production-Grade Evaluation Stack for Agentic Systems

Evaluation for agents should not be a vibe check.
It should be a typed, replayable, auditable execution system.

Most LLM evaluation frameworks were originally designed for single-turn answers: ask a question, get a response, compare the response against an expected answer, compute a score.

That is not enough for agentic systems.

A real agent does not simply "answer". It observes context, retrieves evidence, calls tools, queries databases, edits files, writes memory, validates outputs, refines drafts, and sometimes mutates the external world. It may submit a form, create a ticket, send an email, generate a report, update a document, or place an order.

Therefore, an agent evaluation system must answer two different classes of questions:

1. Outcome correctness:
   Did the agent actually complete the task?
   Did the external world change correctly?
   Is the final output semantically acceptable?

2. Execution economics:
   How many tokens did the agent burn?
   Which runtime states created the cost?
   How much did tools, APIs, retrieval, validation, and refinement cost?
   Where is the optimization surface?

This report proposes a two-layer evaluation architecture:

Layer 1: Outcome Evaluation
         Determines whether the task was truly completed.

Layer 2: Trace Evaluation
         Builds a token-level, state-level, cost-level execution ledger.

The key idea is simple:

Outcome Evaluation tells us whether the agent works.
Trace Evaluation tells us how expensive it was to make it work.


1. System Overview

We model an agent run as a combination of final-state verification and runtime trace accounting.

flowchart TD
    U[User Task] --> A[Target Agent Runtime]

    A --> O[Final Answer]
    A --> S[External State / Artifacts]
    A --> T[Execution Trace]

    O --> OE[Outcome Evaluation Layer]
    S --> SE[State Change Evaluation]
    T --> TE[Trace Evaluation Layer]

    OE --> R1[Hard Success / Soft Score]
    SE --> R2[State Match / Side Effect Detection]
    TE --> R3[Token Ledger / Cost Ledger / Latency Ledger]

    R1 --> E[Evaluation Warehouse]
    R2 --> E
    R3 --> E

    E --> D[Dashboards]
    E --> M[Failure Attribution]
    E --> C[Cost Optimization]
    E --> X[Future Self-Optimizing Agents]

The evaluation stack has three major outputs:

| Output | Purpose |
| --- | --- |
| eval_task_result | Determines whether the task passes the outcome gate. |
| eval_state_result | Verifies whether files, artifacts, or external systems reached the expected state. |
| trace_run / trace_step | Records the full execution ledger: tokens, cost, latency, tools, context flow. |

This design is intentionally strict:

  • The agent cannot self-certify success.
  • The final answer is not the only thing evaluated.
  • Tool results are not ignored.
  • External mutations must be verified.
  • Token cost must be attributed to specific runtime states.
  • Failure reasons must be structured, not described as vague "bad output".

Part I — Outcome Evaluation Layer

2. What the Outcome Layer Measures

The Outcome Evaluation Layer answers three questions:

1. Was the task completed?
2. Did the external environment or target artifact change correctly?
3. Does the final output meet the required semantic quality?

This layer does not care how many steps the agent took, how many tools it called, or how many tokens it consumed. It only cares about the final result.

In other words:

Outcome Evaluation = final-state correctness
Trace Evaluation   = runtime economics

3. Eval Case as a Contract

Every evaluation case should be treated as a contract.

The contract defines:

  • what the user asked for;
  • what output modules are required;
  • what fields must be present;
  • what errors must never appear;
  • whether external execution is required;
  • what proof is required for external execution.

A typical evaluation case can be expressed as YAML:

task_id: search_agent_001
task_name: Search for Phase III clinical evidence for a drug-indication pair

input:
  user_instruction: >
    Search and summarize whether drug X has Phase III clinical evidence
    for indication Y.

success_criteria:
  required_outputs:
    - final_answer
    - evidence_list
    - citations

  must_include:
    - drug_name
    - indication
    - trial_phase
    - trial_status
    - primary_endpoint
    - conclusion

  must_not_include:
    - fabricated_reference
    - unsupported_claim

  # Only required for executable tasks:
  # ticket creation, booking, ordering, scheduling, form submission,
  # email sending, workflow submission, payment, deletion, etc.
  execution_result:
    required: true

    must_include:
      - execution_status
      - confirmation_id
      - execution_target
      - execution_parameters

    must_not_include:
      - action_not_executed
      - execution_result_not_found
      - wrong_execution_target
      - wrong_execution_parameters
      - duplicate_execution
      - unauthorized_payment
      - unconfirmed_high_risk_action

This structure is critical.

Do not define success like this:

The answer should be good.

Define it like this:

The output must contain specific fields.
The evidence must be checkable.
The citations must exist.
The external action must have a confirmation ID.
The execution target and parameters must match the user intent.
High-risk actions must be explicitly confirmed.

The target agent does not have to output JSON. The YAML or JSON schema is used to define the evaluation contract, not to force a rigid output format on the agent.
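As an illustration, a contract in this shape can be checked mechanically. The sketch below is a hypothetical Python checker: it assumes the agent's output has already been normalized into a flat dictionary with fields and validator-raised flags, which is an implementation choice, not part of the contract itself.

from typing import Any


def check_contract(result: dict[str, Any], contract: dict[str, Any]) -> list[str]:
    """Check a normalized agent result against an eval-case contract.

    result is assumed to look like:
      {"final_answer": "...", "evidence_list": [...], "citations": [...],
       "fields": {"drug_name": "...", ...}, "flags": ["fabricated_reference"]}
    where flags are conditions raised by upstream validators.
    """
    criteria = contract["success_criteria"]
    failures: list[str] = []

    # Required output modules (final_answer, evidence_list, citations, ...).
    for output in criteria.get("required_outputs", []):
        if not result.get(output):
            failures.append("MISSING_REQUIRED_OUTPUT")

    # Required fields from the must_include list.
    for field in criteria.get("must_include", []):
        if field not in result.get("fields", {}):
            failures.append("MISSING_REQUIRED_FIELD")

    # Forbidden conditions, mapped to taxonomy codes,
    # e.g. "fabricated_reference" -> "FABRICATED_REFERENCE".
    for condition in criteria.get("must_not_include", []):
        if condition in result.get("flags", []):
            failures.append(condition.upper())

    return sorted(set(failures))

A passing run returns an empty list; anything else feeds directly into the hard gate and the failure taxonomy described later.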


4. Hard Success vs. Soft Score

A single binary score is too coarse for agent evaluation.

We split task success into two levels:

Hard Success:
  Does the task pass the minimum correctness gate?

Soft Score:
  If the hard gate passes, how good is the result?

4.1 Hard Success

Hard Success is the minimum pass/fail gate.

A task fails immediately if any critical condition is violated.

| Hard Gate | Failure Condition |
| --- | --- |
| Final answer exists | No final answer was generated. |
| Required outputs exist | Required modules such as evidence_list or citations are missing. |
| Required fields exist | Core fields such as drug, indication, trial status, endpoint, or conclusion are missing. |
| Evidence exists | No evidence is provided for evidence-critical domains. |
| Citations are real | The agent cites nonexistent papers, URLs, trial IDs, or database records. |
| Claims are supported | The answer contains claims not supported by evidence. |
| No unauthorized action | The agent performs an action outside its permission boundary. |
| External execution is real | The claimed booking, order, ticket, submission, or email does not exist in the external system. |
| Execution target is correct | The agent acts on the wrong person, account, product, patient, file, or recipient. |
| Execution parameters are correct | Time, amount, quantity, address, fields, or target parameters are wrong. |
| High-risk action is confirmed | Payment, deletion, submission, or sending occurs without required confirmation. |
| No duplicate execution | The agent places duplicate orders, creates duplicate tickets, or sends duplicate emails. |

The output is binary:

Hard Success = true / false

If the hard gate fails, the task is failed regardless of how polished the final text looks.

4.2 Soft Score

Soft Score measures quality after the hard gate passes.

A default scoring formula can be:

Task Success Score =
0.30 × Completeness
+ 0.20 × Evidence Validity
+ 0.20 × Evidence Consistency
+ 0.20 × Methodology Compliance
+ 0.10 × Readability

Each sub-score is normalized to 0–10.

At the benchmark level:

Task Success Rate =
(Number of tasks with Hard Success = true) / (Total number of tasks)

Average Outcome Score =
Average Soft Score over tasks that passed the Hard Gate

This separation makes the evaluation interpretable:

Hard Success tells us whether the agent can complete the task.
Soft Score tells us how well the agent completed it.
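
A minimal sketch of how the two levels combine, assuming each validator has already produced its sub-scores on the 0-10 scale; the weights mirror the default formula above and would normally be configurable per benchmark.

SOFT_SCORE_WEIGHTS = {
    "completeness": 0.30,
    "evidence_validity": 0.20,
    "evidence_consistency": 0.20,
    "methodology": 0.20,
    "readability": 0.10,
}


def soft_score(sub_scores: dict) -> float:
    """Weighted Soft Score over 0-10 sub-scores."""
    return sum(w * sub_scores[name] for name, w in SOFT_SCORE_WEIGHTS.items())


def evaluate_task(hard_gate_failures: list, sub_scores: dict) -> dict:
    """Combine the Hard Gate and the Soft Score into one outcome record."""
    hard_success = len(hard_gate_failures) == 0
    return {
        "hard_success": hard_success,
        # The Soft Score is only defined for tasks that pass the Hard Gate.
        "outcome_score": soft_score(sub_scores) if hard_success else None,
        "failure_reason_codes": hard_gate_failures,
    }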

5. Validator Mesh

After the target agent finishes execution, the final result is passed into a set of validators.

Each validator has a narrow responsibility.

flowchart LR
    R[Agent Result] --> F[Required Field Validator]
    R --> E[Evidence Validator]
    R --> C[Claim-Fact Validator]
    R --> P[Rule / SOP Validator]
    R --> L[LLM Judge]
    R --> X[Execution Validator]

    F --> G[Hard Gate + Soft Score]
    E --> G
    C --> G
    P --> G
    L --> G
    X --> G

| Validator | Responsibility | Metric |
| --- | --- | --- |
| Required Field Validator | Compares output against the Eval Case contract and checks whether required fields are present. | Completeness |
| Evidence Validator | Verifies that citations exist and can be mapped to accessible source material. | Evidence Validity |
| Claim-Fact Validator | Checks whether the key claims are supported by the cited evidence. | Evidence Consistency |
| Rule / SOP Validator | Checks whether the agent followed system prompts, Skill instructions, workflow rules, and domain SOPs. | Methodology Compliance |
| LLM Judge | Evaluates open-ended semantic quality, domain style, logical clarity, and whether the answer matches user intent. | Readability |
| Execution Validator | Verifies whether external actions actually happened and whether execution records match the user instruction. | Execution Success |

The Evidence Validator should have access to the same evidence sources as the target agent. Otherwise, it cannot reliably verify whether a citation is real.

The Rule / SOP Validator should have access to the target agent's system prompt, Skill package, tool specification, and workflow policy, because it needs to check whether the agent followed its own operating contract.
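
The mesh itself can stay thin: each validator implements the same narrow interface and returns a structured result, and aggregation into the hard gate and soft scores happens afterwards. The following is a sketch of one possible shape, not a prescribed API; the names mirror the table above.

from dataclasses import dataclass, field
from typing import Protocol


@dataclass
class ValidatorResult:
    validator_name: str
    passed: bool
    score: float                                  # normalized to 0-10
    failure_reason_codes: list = field(default_factory=list)


class Validator(Protocol):
    name: str

    def validate(self, result: dict, contract: dict) -> ValidatorResult: ...


def run_validator_mesh(result: dict, contract: dict, validators: list) -> list:
    """Run every validator independently; merging happens in the scorer."""
    return [v.validate(result, contract) for v in validators]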


6. Structured Failure Taxonomy

Failure reasons must be structured.

A production evaluation system should not record failure as:

bad answer

It should record:

primary_failure_reason_code = FABRICATED_REFERENCE
failure_reason_codes = [
  "FABRICATED_REFERENCE",
  "CLAIM_EVIDENCE_MISMATCH",
  "LOW_EVIDENCE_CONSISTENCY_SCORE"
]

A recommended failure code taxonomy:

MISSING_FINAL_ANSWER
MISSING_REQUIRED_OUTPUT
MISSING_REQUIRED_FIELD
OUTPUT_FORMAT_INVALID
EMPTY_OR_INVALID_OUTPUT

MISSING_EVIDENCE
MISSING_CITATION
CITATION_NOT_FOUND
FABRICATED_REFERENCE
EVIDENCE_SOURCE_INACCESSIBLE

UNSUPPORTED_CLAIM
CLAIM_EVIDENCE_MISMATCH
CONTRADICTED_BY_EVIDENCE
WRONG_FACT

INCOMPLETE_ANSWER
LOW_COMPLETENESS_SCORE
LOW_EVIDENCE_VALIDITY_SCORE
LOW_EVIDENCE_CONSISTENCY_SCORE
LOW_METHODOLOGY_SCORE
LOW_READABILITY_SCORE

SOP_NOT_FOLLOWED
SYSTEM_PROMPT_VIOLATION
SKILL_INSTRUCTION_VIOLATION

UNAUTHORIZED_ACTION
STATE_MISMATCH
STATE_CHANGE_FAILED
PARTIAL_STATE_CHANGE

TOOL_FAILURE
TOOL_TIMEOUT
EXECUTION_TIMEOUT
EVALUATOR_FAILURE
LOW_CONFIDENCE_EVALUATION
UNKNOWN_FAILURE

ACTION_NOT_EXECUTED
EXECUTION_RESULT_NOT_FOUND
WRONG_EXECUTION_TARGET
WRONG_EXECUTION_PARAMETERS
DUPLICATE_EXECUTION
UNCONFIRMED_HIGH_RISK_ACTION
UNAUTHORIZED_PAYMENT

6.1 Failure Code Semantics

| Code | Meaning |
| --- | --- |
| MISSING_FINAL_ANSWER | The agent did not produce a final conclusion. |
| MISSING_REQUIRED_OUTPUT | Required output modules are missing, such as evidence_list or citations. |
| MISSING_REQUIRED_FIELD | A required field is missing, such as drug name, indication, trial status, or primary endpoint. |
| OUTPUT_FORMAT_INVALID | The output format violates the required structure. |
| EMPTY_OR_INVALID_OUTPUT | The output is empty, malformed, or clearly unrelated to the task. |
| MISSING_EVIDENCE | No evidence is provided in an evidence-critical task. |
| MISSING_CITATION | Evidence is discussed but not cited. |
| CITATION_NOT_FOUND | The cited item cannot be found in the configured source set. |
| FABRICATED_REFERENCE | The agent cites a nonexistent paper, URL, trial ID, database record, or document. |
| EVIDENCE_SOURCE_INACCESSIBLE | The evaluator cannot access the evidence source required for verification. |
| UNSUPPORTED_CLAIM | A claim is made without evidence support. |
| CLAIM_EVIDENCE_MISMATCH | The cited evidence does not support the claim. |
| CONTRADICTED_BY_EVIDENCE | The output contradicts the cited evidence. |
| WRONG_FACT | A key factual element is wrong. |
| INCOMPLETE_ANSWER | The answer does not cover the required parts of the task. |
| LOW_COMPLETENESS_SCORE | Completeness falls below the configured threshold. |
| LOW_EVIDENCE_VALIDITY_SCORE | Citation quality or evidence verifiability is too low. |
| LOW_EVIDENCE_CONSISTENCY_SCORE | Claim-evidence alignment is too weak. |
| LOW_METHODOLOGY_SCORE | The agent did not follow the expected method, Skill, or SOP. |
| LOW_READABILITY_SCORE | The output is hard to read, poorly structured, or domain-inappropriate. |
| SOP_NOT_FOLLOWED | The agent violated a required operating procedure. |
| SYSTEM_PROMPT_VIOLATION | The agent violated core system-level constraints. |
| SKILL_INSTRUCTION_VIOLATION | The agent did not follow the Skill package instructions. |
| UNAUTHORIZED_ACTION | The agent used forbidden tools, accessed unauthorized data, or performed a forbidden operation. |
| STATE_MISMATCH | The agent claimed a state change that did not occur. |
| STATE_CHANGE_FAILED | A required create/update/delete/submit operation failed. |
| PARTIAL_STATE_CHANGE | Only part of the required state mutation was completed. |
| TOOL_FAILURE | A retrieval tool, API, MCP tool, database, or script failed. |
| TOOL_TIMEOUT | A tool call timed out. |
| EXECUTION_TIMEOUT | The overall agent task timed out. |
| EVALUATOR_FAILURE | A validator or LLM judge failed. |
| LOW_CONFIDENCE_EVALUATION | The evaluator produced an unstable or low-confidence judgment. |
| UNKNOWN_FAILURE | The failure cannot be classified. |
| ACTION_NOT_EXECUTED | The agent claimed completion, but no external execution record exists. |
| EXECUTION_RESULT_NOT_FOUND | The order, appointment, ticket, submission, or email record cannot be found. |
| WRONG_EXECUTION_TARGET | The agent acted on the wrong person, item, account, patient, file, or recipient. |
| WRONG_EXECUTION_PARAMETERS | Time, amount, quantity, address, field values, or other execution parameters are wrong. |
| DUPLICATE_EXECUTION | The agent executed the same action multiple times. |
| UNCONFIRMED_HIGH_RISK_ACTION | A high-risk action was executed without required user confirmation. |
| UNAUTHORIZED_PAYMENT | A payment or charge was made without explicit authorization. |

With this taxonomy, failure analysis becomes queryable:

Did the task work?             hard_success
How good was the task?         outcome_score
Why did it fail?               primary_failure_reason_code
Where did it fail?             eval_validator_result
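
With structured codes, benchmark-level failure analysis reduces to counting. A minimal sketch, assuming eval_task_result rows have been loaded as Python dictionaries:

from collections import Counter


def failure_breakdown(task_results: list) -> Counter:
    """Count primary failure reasons across runs that failed the hard gate."""
    return Counter(
        r["primary_failure_reason_code"]
        for r in task_results
        if not r["hard_success"] and r.get("primary_failure_reason_code")
    )

# failure_breakdown(rows).most_common(5) surfaces the dominant failure modes.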

7. Outcome Result Table

A minimal task-level outcome table:

CREATE TABLE eval_task_result (
  eval_id                         TEXT PRIMARY KEY,
  task_id                         TEXT NOT NULL,
  trace_id                        TEXT,
  agent_id                        TEXT,
  skill_id                        TEXT,

  hard_success                    BOOLEAN NOT NULL,
  outcome_score                   NUMERIC,
  completeness_score              NUMERIC,
  evidence_validity_score         NUMERIC,
  evidence_consistency_score      NUMERIC,
  methodology_score               NUMERIC,
  readability_score               NUMERIC,
  execution_success_score         NUMERIC,

  primary_failure_reason_code     TEXT,
  failure_reason_codes            JSONB,

  evaluator_version               TEXT,
  eval_contract_version           TEXT,
  created_at                      TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

A validator-level result table:

CREATE TABLE eval_validator_result (
  eval_id                         TEXT NOT NULL,
  validator_name                  TEXT NOT NULL,
  validator_type                  TEXT NOT NULL,

  passed                          BOOLEAN NOT NULL,
  score                           NUMERIC,
  confidence                      NUMERIC,

  failure_reason_codes            JSONB,
  evidence_refs                   JSONB,
  diagnostic_message              TEXT,

  latency_ms                      INTEGER,
  evaluator_model                 TEXT,
  created_at                      TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

Part II — External State Change Evaluation

8. Why State Change Evaluation Matters

An agent is not just a text generator.

It often creates or mutates external state:

- generate a file;
- modify a document;
- update a database;
- create a ticket;
- submit a form;
- send an email;
- book an appointment;
- place an order;
- trigger an MCP workflow;
- write memory.

Therefore, outcome evaluation must include state verification.

A final message like:

Done.

does not prove that anything was done.

The evaluator must inspect the environment.


9. State Diff Model

We represent external state evaluation as a diff between pre-run and post-run snapshots.

flowchart LR
    B[Before Snapshot] --> D[State Diff Engine]
    A[After Snapshot] --> D
    I[User Intent / Eval Contract] --> D

    D --> H[Hard State Gate]
    D --> Q[State Quality Score]
    D --> S[Side Effect Detector]

State snapshots can include:

| State Type | Example |
| --- | --- |
| File system | Created, modified, deleted files |
| Document artifacts | DOCX, PDF, PPTX, Markdown, CSV, JSON |
| Database state | Inserted, updated, deleted rows |
| Business systems | Orders, appointments, tickets, submissions |
| Communication systems | Sent emails, messages, notifications |
| Workflow systems | MCP actions, pipeline events, approvals |
| Memory stores | Long-term memory or task memory writes |

10. Hard State Gate

The Hard State Gate checks whether the expected state exists and whether unexpected side effects occurred.

Minimum checks:

1. The artifact or external record exists.
2. The artifact or external record is readable.
3. The artifact or external record is non-empty.
4. The target state matches the user intent.
5. No unauthorized or unexpected side effects occurred.

Examples of side effects:

The agent generated the requested report but deleted the input file.
The agent created the expected database record but inserted duplicate rows.
The agent sent the correct email but also sent it to an unintended recipient.
The agent updated the right document but overwrote unrelated sections.

If the user did not request the mutation, it is a side effect.

If the mutation was required by the task, it must be verified against the contract.
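
For file-system state, the diff step is small. The sketch below assumes snapshots are plain path-to-hash mappings taken before and after the run; database and business-system snapshots follow the same pattern with record IDs instead of paths.

def diff_snapshots(before: dict, after: dict, expected_paths: set) -> dict:
    """Compare pre-run and post-run snapshots (path -> content hash).

    Anything created, modified, or deleted outside expected_paths
    is reported as a side effect.
    """
    created = set(after) - set(before)
    deleted = set(before) - set(after)
    modified = {p for p in set(before) & set(after) if before[p] != after[p]}

    touched = created | deleted | modified
    return {
        "expected_state_match": expected_paths <= (created | modified),
        "side_effect_detected": bool(touched - expected_paths),
        "diff_summary": {
            "created": sorted(created),
            "modified": sorted(modified),
            "deleted": sorted(deleted),
        },
    }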


11. State Quality Score

After passing the hard state gate, the generated or modified state can be scored.

State Change Score =
0.40 × Target State Match
+ 0.30 × Content Usability
+ 0.20 × Format Quality
+ 0.10 × Boundary Control

| Metric | Meaning | Evaluation Focus |
| --- | --- | --- |
| Target State Match | Whether the final artifact or external record reached the expected target state. | Correctness of state mutation |
| Content Usability | Whether the content is useful, readable, logically organized, and aligned with user intent. | Practical usability |
| Format Quality | Whether formatting, tables, structure, Markdown, JSON, DOCX, PPTX, or report layout meet expectations. | Presentation quality |
| Boundary Control | Whether the agent avoided irrelevant expansion, unsupported additions, unrelated edits, or over-promising. | Scope control |

12. State Result Table

CREATE TABLE eval_state_result (
  eval_id                         TEXT NOT NULL,
  trace_id                        TEXT NOT NULL,
  state_object_id                 TEXT NOT NULL,

  state_object_type               TEXT,
  state_object_path               TEXT,
  state_action                    TEXT, -- create / modify / delete / submit / send / write

  exists_after_run                BOOLEAN,
  readable_after_run              BOOLEAN,
  non_empty_after_run             BOOLEAN,
  expected_state_match            BOOLEAN,
  side_effect_detected            BOOLEAN,

  target_state_match_score        NUMERIC,
  content_usability_score         NUMERIC,
  format_quality_score            NUMERIC,
  boundary_control_score          NUMERIC,
  state_change_score              NUMERIC,

  failure_reason_codes            JSONB,
  diff_summary                    JSONB,

  created_at                      TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

Part III — Trace Evaluation Layer

13. Goal of Trace Evaluation

The Trace Evaluation Layer does not decide whether the task passes or fails.

It builds an execution ledger.

It answers:

1. How many tokens did the agent consume in total?
2. Which runtime states produced those tokens?
3. How many were uncached input tokens, cached input tokens, output tokens, and reasoning tokens?
4. How much did each token type cost?
5. How much did tools, retrieval, APIs, databases, scripts, file parsing, and file writing cost?
6. Which states dominated the total cost?
7. Which context sources caused input-token amplification?
8. Which tool results entered the next LLM context?
9. Which steps are the best targets for optimization?

The output is not a score.

It is a bill of execution.


14. Agent Runtime as a State Machine

A real agent loop is not linear.

It is not:

retrieve -> read -> answer

It is closer to:

OBSERVE
-> THINK / DECIDE
-> ACTION
-> OBSERVE
-> THINK / DECIDE
-> ACTION
-> ...
-> FINALIZE

We normalize the runtime into a sequence of typed states.

stateDiagram-v2
    [*] --> OBSERVE
    OBSERVE --> THINK
    THINK --> RETRIEVE
    THINK --> MCP_CALL
    THINK --> API_CALL
    THINK --> DB_QUERY
    THINK --> SCRIPT_EXEC
    THINK --> FILE_READ
    THINK --> FILE_WRITE
    THINK --> MEMORY_READ
    THINK --> VALIDATE
    THINK --> FINALIZE

    RETRIEVE --> OBSERVE
    MCP_CALL --> OBSERVE
    API_CALL --> OBSERVE
    DB_QUERY --> OBSERVE
    SCRIPT_EXEC --> OBSERVE
    FILE_READ --> OBSERVE
    FILE_WRITE --> OBSERVE
    MEMORY_READ --> OBSERVE
    VALIDATE --> REFINE
    REFINE --> OBSERVE
    FINALIZE --> [*]

Each state becomes a trace_step.

A full run becomes a trace_run.

Example:

trace_run
  ├── step_1: OBSERVE
  ├── step_2: THINK
  ├── step_3: RETRIEVE
  ├── step_4: OBSERVE
  ├── step_5: THINK
  ├── step_6: DB_QUERY
  ├── step_7: OBSERVE
  ├── step_8: THINK
  ├── step_9: VALIDATE
  ├── step_10: REFINE
  ├── step_11: FINALIZE

15. Runtime State Types

Recommended state taxonomy:

OBSERVE       Read context, environment state, tool results, or intermediate artifacts.
THINK         Plan, reason, decide next action, or generate tool call arguments.
RETRIEVE      Search, retrieval, knowledge-base query, or web search.
MCP_CALL      MCP server tool call.
API_CALL      External API call.
DB_QUERY      SQL, graph query, vector query, or hybrid database access.
SCRIPT_EXEC   Code execution, data processing, sandbox execution.
FILE_READ     File parsing or reading.
FILE_WRITE    File creation, update, append, or export.
MEMORY_READ   Read long-term memory, task memory, or historical preference.
MEMORY_WRITE  Write memory or compressed experience.
VALIDATE      Check, review, fact-verify, format-verify, or compliance-review.
REFINE        Revise, rewrite, repair, or optimize based on validation.
FINALIZE      Produce final answer or submit final artifact.

16. Common Trace Step Schema

Every state should record a shared envelope:

{
  "trace_id": "task_20260428_001",
  "step_id": 3,
  "parent_step_id": 2,
  "state_type": "RETRIEVE",
  "agent_id": "search_agent",
  "skill_id": "clinical_evidence_search",
  "model_name": "model_x",
  "start_time": "2026-04-28T10:00:00Z",
  "end_time": "2026-04-28T10:00:04Z",
  "latency_ms": 4200,
  "status": "success"
}

| Field | Meaning |
| --- | --- |
| trace_id | Unique ID for one agent run. |
| step_id | Step ID inside the run. |
| parent_step_id | Parent step or previous step. |
| state_type | Runtime state type. |
| agent_id | Agent that executed this step. |
| skill_id | Skill or workflow package used by the agent. |
| model_name | Model used in this step, if applicable. |
| start_time | Step start timestamp. |
| end_time | Step end timestamp. |
| latency_ms | Step latency. |
| status | success, error, timeout, cancelled, etc. |
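
In code, the shared envelope can be carried by one small record type, with state-specific fields kept in an open payload. This is a sketch of one possible representation, not a required schema.

from dataclasses import dataclass, field
from datetime import datetime
from typing import Any, Optional


@dataclass
class TraceStep:
    """Shared envelope for one runtime state inside a trace_run."""
    trace_id: str
    step_id: int
    state_type: str                               # OBSERVE, THINK, RETRIEVE, ...
    start_time: datetime
    end_time: datetime
    status: str = "success"
    parent_step_id: Optional[int] = None
    agent_id: Optional[str] = None
    skill_id: Optional[str] = None
    model_name: Optional[str] = None
    payload: dict[str, Any] = field(default_factory=dict)  # state-specific fields

    @property
    def latency_ms(self) -> int:
        return int((self.end_time - self.start_time).total_seconds() * 1000)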

17. Token Accounting Schema

If a state invokes a model, record the token ledger:

{
  "input_tokens_total": 12800,
  "input_tokens_uncached": 4200,
  "input_tokens_cached": 8600,
  "output_tokens": 320,
  "reasoning_tokens": 0,
  "total_tokens": 13120
}

| Field | Meaning |
| --- | --- |
| input_tokens_total | Total input tokens for this model call. |
| input_tokens_uncached | Input tokens not served from cache. |
| input_tokens_cached | Input tokens served from cache. |
| output_tokens | Visible model output tokens. |
| reasoning_tokens | Reasoning tokens if exposed and billable. |
| total_tokens | Total token usage for this step. |

A key accounting rule:

Tool responses are not model output tokens.
But once tool responses enter the next model context, they become input tokens.

Therefore, trace evaluation must track not only model tokens, but also the flow of tool outputs into later model contexts.


18. Context Provenance Breakdown

To understand why input tokens explode, split input tokens by source:

{
  "input_token_breakdown": {
    "system_prompt_tokens": 1200,
    "skill_instruction_tokens": 2600,
    "user_instruction_tokens": 300,
    "history_tokens": 1800,
    "memory_tokens": 600,
    "tool_result_tokens": 4200,
    "retrieved_context_tokens": 1800,
    "artifact_context_tokens": 300,
    "other_context_tokens": 0
  }
}

| Field | Meaning |
| --- | --- |
| system_prompt_tokens | Tokens from system prompt. |
| skill_instruction_tokens | Tokens from Skill files, SOPs, and tool specs. |
| user_instruction_tokens | Tokens from the user's original task. |
| history_tokens | Tokens from previous conversation or previous trace context. |
| memory_tokens | Tokens loaded from memory. |
| tool_result_tokens | Tokens from previous tool outputs entering the context. |
| retrieved_context_tokens | Tokens from retrieved documents or search results. |
| artifact_context_tokens | Tokens from files, tables, intermediate artifacts, or generated outputs. |
| other_context_tokens | Other context sources. |

This makes cost attribution concrete:

The model is not expensive.
The context is expensive.

The user query is not large.
The retrieved evidence is large.

The final answer is not costly.
The repeated validation-refinement loop is costly.

The task is not inherently expensive.
The Skill instruction and history context are too heavy.
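
Attribution of this kind is a one-line aggregation once the breakdown exists. A minimal sketch that ranks context sources by their share of a step's input tokens:

def context_shares(breakdown: dict) -> list:
    """Rank context sources by their share of this step's input tokens."""
    total = sum(breakdown.values())
    if total == 0:
        return []
    return sorted(
        ((source, count / total) for source, count in breakdown.items()),
        key=lambda item: item[1],
        reverse=True,
    )

# For the example breakdown above, tool_result_tokens dominates
# at 4200 / 12800, roughly one third of the step's input.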

19. State-Specific Schemas

19.1 OBSERVE

OBSERVE reads current context, environment state, tool results, or intermediate artifacts.

{
  "state_type": "OBSERVE",
  "observed_sources": [
    "user_input",
    "previous_tool_result",
    "memory",
    "file_state"
  ],
  "observed_tokens": 5200,
  "tokens_sent_to_next_think": 4800
}

| Field | Meaning |
| --- | --- |
| observed_sources | Sources observed in this step. |
| observed_tokens | Raw tokens observed. |
| tokens_sent_to_next_think | Tokens preserved for the next thinking step. |

19.2 THINK

THINK plans, decides, or generates tool-call arguments.

{
  "state_type": "THINK",
  "decision_type": "call_tool",
  "next_action": "RETRIEVE",
  "input_tokens_total": 6800,
  "input_tokens_uncached": 2200,
  "input_tokens_cached": 4600,
  "output_tokens": 420,
  "tool_call_instruction_tokens": 90
}

| Field | Meaning |
| --- | --- |
| decision_type | answer, call_tool, validate, refine, etc. |
| next_action | Next planned runtime state. |
| tool_call_instruction_tokens | Tokens used to generate tool-call instructions. |
| output_tokens | Visible planning, decision, or tool argument tokens. |

If hidden reasoning is not exposed by the model or platform, record only visible planning, decision text, and tool-call arguments.

19.3 RETRIEVE

RETRIEVE covers search, knowledge-base retrieval, vector search, hybrid retrieval, or web search.

{
  "state_type": "RETRIEVE",
  "retriever_name": "knowledge_base_search",
  "query_count": 3,
  "query_tokens": 180,
  "top_k": 20,
  "returned_chunks": 60,
  "raw_result_tokens": 52000,
  "deduplicated_tokens": 36000,
  "selected_tokens": 8000,
  "tokens_sent_to_next_llm": 5000,
  "tool_latency_ms": 2800,
  "tool_cost": 0.02
}

| Field | Meaning |
| --- | --- |
| retriever_name | Name of the retriever. |
| query_count | Number of generated queries. |
| query_tokens | Tokens in retrieval queries. |
| top_k | Number of results returned per query. |
| returned_chunks | Raw returned chunks. |
| raw_result_tokens | Raw retrieved token volume. |
| deduplicated_tokens | Token count after deduplication. |
| selected_tokens | Token count after reranking or filtering. |
| tokens_sent_to_next_llm | Tokens that enter the next model context. |
| tool_latency_ms | Retrieval latency. |
| tool_cost | Retrieval cost. |

This is one of the most important states for search-heavy agents, because retrieval result inflation often dominates total token cost.
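
One concrete control is a hard token budget on what retrieval may push into the next model context. The sketch below is a naive greedy selector, assuming chunks arrive already ranked by relevance and carry a token count; a production pipeline would add deduplication and reranking before this step.

def select_within_budget(ranked_chunks: list, token_budget: int) -> list:
    """Keep the highest-ranked chunks until the context budget is spent.

    Each chunk is assumed to look like {"id": ..., "tokens": int, "text": ...}.
    """
    selected, used = [], 0
    for chunk in ranked_chunks:
        if used + chunk["tokens"] > token_budget:
            continue  # skip chunks that would overflow the budget
        selected.append(chunk)
        used += chunk["tokens"]
    return selected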

19.4 MCP_CALL

MCP_CALL records calls to external MCP servers.

{
  "state_type": "MCP_CALL",
  "mcp_server": "clinical_trial_mcp",
  "tool_name": "search_trial_registry",
  "request_tokens": 260,
  "response_tokens_raw": 12000,
  "response_tokens_selected": 3000,
  "tokens_sent_to_next_llm": 2400,
  "tool_latency_ms": 3500,
  "tool_cost": 0.05
}

19.5 API_CALL

API_CALL records external API invocations.

{
  "state_type": "API_CALL",
  "api_name": "drug_database_api",
  "endpoint": "/v1/drug/trials",
  "request_tokens": 180,
  "response_tokens_raw": 9000,
  "response_tokens_selected": 2500,
  "tokens_sent_to_next_llm": 1800,
  "api_latency_ms": 1600,
  "api_cost": 0.03,
  "http_status": 200
}

19.6 DB_QUERY

DB_QUERY records SQL, graph, vector, or hybrid queries.

{
  "state_type": "DB_QUERY",
  "database_name": "clinical_kb",
  "query_type": "sql",
  "query_tokens": 220,
  "rows_returned": 128,
  "raw_result_tokens": 16000,
  "selected_rows": 12,
  "selected_tokens": 2600,
  "tokens_sent_to_next_llm": 2000,
  "db_latency_ms": 900,
  "db_cost": 0.01
}

19.7 SCRIPT_EXEC

SCRIPT_EXEC records code execution, sandbox runtime, data transformation, and computational workloads.

{
  "state_type": "SCRIPT_EXEC",
  "runtime": "python",
  "script_input_tokens": 900,
  "code_tokens": 1200,
  "stdout_tokens": 1800,
  "stderr_tokens": 0,
  "generated_file_count": 2,
  "tokens_sent_to_next_llm": 1200,
  "execution_time_ms": 5200,
  "cpu_seconds": 4.8,
  "gpu_seconds": 0,
  "compute_cost": 0.04
}

19.8 FILE_READ

FILE_READ records file parsing and content extraction.

{
  "state_type": "FILE_READ",
  "file_path": "/input/protocol.pdf",
  "file_type": "pdf",
  "file_size_bytes": 1839200,
  "raw_extracted_tokens": 48000,
  "selected_tokens": 6000,
  "tokens_sent_to_next_llm": 5000,
  "parse_latency_ms": 2300,
  "parse_cost": 0.02
}

19.9 FILE_WRITE

FILE_WRITE records file generation or modification.

{
  "state_type": "FILE_WRITE",
  "file_path": "/output/report.docx",
  "file_type": "docx",
  "write_mode": "create",
  "content_tokens_written": 8600,
  "formatting_tokens": 1200,
  "generated_file_size_bytes": 236000,
  "write_latency_ms": 1800,
  "write_cost": 0.01
}

19.10 MEMORY_READ

MEMORY_READ records retrieval from long-term memory, task memory, or preference memory.

{
  "state_type": "MEMORY_READ",
  "memory_type": "long_term_memory",
  "query_tokens": 80,
  "raw_memory_tokens": 12000,
  "selected_memory_tokens": 1800,
  "tokens_sent_to_next_llm": 1500,
  "memory_latency_ms": 500
}

19.11 MEMORY_WRITE

MEMORY_WRITE records memory persistence or experience compression.

{
  "state_type": "MEMORY_WRITE",
  "memory_type": "task_memory",
  "raw_content_tokens": 4200,
  "summary_tokens": 600,
  "tokens_written": 600,
  "write_latency_ms": 400
}

19.12 VALIDATE

VALIDATE records review, checking, fact verification, formatting verification, or compliance validation.

{
  "state_type": "VALIDATE",
  "validator_name": "evidence_validator",
  "validation_input_tokens": 9000,
  "input_tokens_uncached": 3000,
  "input_tokens_cached": 6000,
  "validation_output_tokens": 1200,
  "issues_found": 3,
  "issues_fixed_later": 2,
  "validation_latency_ms": 3600,
  "validation_cost": 0.08
}

19.13 REFINE

REFINE records revision, rewriting, repair, or optimization based on validation feedback.

{
  "state_type": "REFINE",
  "refine_reason": "fix_missing_citation",
  "input_tokens_total": 7600,
  "input_tokens_uncached": 2600,
  "input_tokens_cached": 5000,
  "output_tokens": 1800,
  "modified_content_tokens": 1200,
  "refine_latency_ms": 4200,
  "refine_cost": 0.10
}

19.14 FINALIZE

FINALIZE records final answer generation or final artifact submission.

{
  "state_type": "FINALIZE",
  "final_output_type": "answer_with_citations",
  "input_tokens_total": 11000,
  "input_tokens_uncached": 4000,
  "input_tokens_cached": 7000,
  "output_tokens": 3600,
  "final_answer_tokens": 3200,
  "citation_tokens": 400,
  "finalize_latency_ms": 5000,
  "finalize_cost": 0.18
}

Part IV — Cost Model

20. LLM Cost Formula

Each model call should be billed by token type:

1. Uncached input tokens
2. Cached input tokens
3. Output tokens
4. Reasoning tokens, if separately exposed and billed

The cost of a model call:

LLM Cost =
(input_tokens_uncached / 1,000,000) × P_input
+ (input_tokens_cached / 1,000,000) × P_cached_input
+ (output_tokens / 1,000,000) × P_output
+ (reasoning_tokens / 1,000,000) × P_reasoning

If reasoning tokens are not exposed or not separately billed, set:

reasoning_tokens = 0
P_reasoning = 0
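
The formula translates directly into code. A minimal sketch, assuming usage fields follow the token accounting schema in section 17 and prices come from the per-trace price snapshot described in section 23:

def llm_call_cost(usage: dict, prices: dict) -> float:
    """Cost of one model call, billed per million tokens by token type."""
    per_million = 1_000_000
    return (
        usage["input_tokens_uncached"] / per_million * prices["price_input_per_million"]
        + usage["input_tokens_cached"] / per_million * prices["price_cached_input_per_million"]
        + usage["output_tokens"] / per_million * prices["price_output_per_million"]
        + usage.get("reasoning_tokens", 0) / per_million * prices.get("price_reasoning_per_million", 0)
    )

Applied to the token ledger example in section 17 with the price snapshot in section 23, the call costs about 0.073 RMB, most of it from the 4,200 uncached input tokens.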

21. State Cost Formula

A state may include model cost, tool cost, API cost, database cost, compute cost, parsing cost, and writing cost.

State Cost =
LLM Cost
+ Tool Cost
+ API Cost
+ DB Cost
+ Compute Cost
+ Parse Cost
+ Write Cost

22. Task Cost Formula

Total Task Cost =
Σ State Cost

23. Price Snapshot

Price data must be snapshotted at execution time.

{
  "model_name": "model_x",
  "price_input_per_million": 10,
  "price_cached_input_per_million": 2.5,
  "price_output_per_million": 30,
  "price_reasoning_per_million": 30,
  "currency": "RMB",
  "price_version": "2026-04-28"
}

Without a price snapshot, old traces become economically unreplayable after model providers change prices.


Part V — Trace Output

24. Cost Profile Example

Trace Evaluation produces a cost profile, not a pass/fail verdict.

{
  "trace_id": "task_20260428_001",
  "total_tokens": 186000,
  "total_input_tokens": 142000,
  "total_uncached_input_tokens": 58000,
  "total_cached_input_tokens": 84000,
  "total_output_tokens": 44000,
  "total_reasoning_tokens": 0,
  "total_cost": 3.82,
  "currency": "RMB",
  "total_latency_ms": 126000,

  "cost_by_state": {
    "THINK": 0.42,
    "RETRIEVE": 1.28,
    "DB_QUERY": 0.36,
    "VALIDATE": 0.74,
    "REFINE": 0.61,
    "FINALIZE": 0.41
  },

  "token_by_state": {
    "THINK": 22000,
    "RETRIEVE": 64000,
    "DB_QUERY": 18000,
    "VALIDATE": 38000,
    "REFINE": 26000,
    "FINALIZE": 18000
  },

  "main_cost_sources": [
    "RETRIEVE",
    "VALIDATE",
    "REFINE"
  ],

  "conclusion": "The run cost 3.82 RMB. The dominant cost sources were retrieved context entering the LLM context, validation, and refinement. Cached input tokens represented 59.1% of total input tokens, indicating that fixed Skill instructions and historical context benefited from caching."
}

Part VI — Data Model

25. trace_run: Run-Level Ledger

CREATE TABLE trace_run (
  trace_id                         TEXT PRIMARY KEY,
  task_id                          TEXT,
  agent_id                         TEXT,
  skill_id                         TEXT,

  start_time                       TIMESTAMP,
  end_time                         TIMESTAMP,
  total_latency_ms                 INTEGER,

  total_input_tokens               INTEGER,
  total_uncached_input_tokens      INTEGER,
  total_cached_input_tokens        INTEGER,
  total_output_tokens              INTEGER,
  total_reasoning_tokens           INTEGER,
  total_tokens                     INTEGER,

  total_llm_cost                   NUMERIC,
  total_tool_cost                  NUMERIC,
  total_compute_cost               NUMERIC,
  total_cost                       NUMERIC,
  currency                         TEXT,

  main_cost_state                  TEXT,
  created_at                       TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

26. trace_step: State-Level Ledger

CREATE TABLE trace_step (
  trace_id                         TEXT NOT NULL,
  step_id                          INTEGER NOT NULL,
  parent_step_id                   INTEGER,
  state_type                       TEXT NOT NULL,

  agent_id                         TEXT,
  skill_id                         TEXT,
  model_name                       TEXT,

  start_time                       TIMESTAMP,
  end_time                         TIMESTAMP,
  latency_ms                       INTEGER,

  input_tokens_total               INTEGER,
  input_tokens_uncached            INTEGER,
  input_tokens_cached              INTEGER,
  output_tokens                    INTEGER,
  reasoning_tokens                 INTEGER,
  total_tokens                     INTEGER,

  llm_cost                         NUMERIC,
  tool_cost                        NUMERIC,
  compute_cost                     NUMERIC,
  state_cost                       NUMERIC,

  status                           TEXT,
  error_type                       TEXT,

  PRIMARY KEY (trace_id, step_id)
);

27. trace_context_breakdown: Input Provenance

CREATE TABLE trace_context_breakdown (
  trace_id                         TEXT NOT NULL,
  step_id                          INTEGER NOT NULL,

  system_prompt_tokens             INTEGER,
  skill_instruction_tokens         INTEGER,
  user_instruction_tokens          INTEGER,
  history_tokens                   INTEGER,
  memory_tokens                    INTEGER,
  tool_result_tokens               INTEGER,
  retrieved_context_tokens         INTEGER,
  artifact_context_tokens          INTEGER,
  other_context_tokens             INTEGER,

  PRIMARY KEY (trace_id, step_id)
);

28. trace_tool_event: Tool Event Ledger

CREATE TABLE trace_tool_event (
  trace_id                         TEXT NOT NULL,
  step_id                          INTEGER NOT NULL,

  tool_type                        TEXT,
  tool_name                        TEXT,

  request_tokens                   INTEGER,
  response_tokens_raw              INTEGER,
  response_tokens_selected         INTEGER,
  tokens_sent_to_next_llm          INTEGER,

  tool_latency_ms                  INTEGER,
  tool_cost                        NUMERIC,
  status                           TEXT,

  created_at                       TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

29. trace_price_snapshot: Price Versioning

CREATE TABLE trace_price_snapshot (
  trace_id                         TEXT NOT NULL,
  model_name                       TEXT NOT NULL,

  price_input_per_million          NUMERIC,
  price_cached_input_per_million   NUMERIC,
  price_output_per_million         NUMERIC,
  price_reasoning_per_million      NUMERIC,

  currency                         TEXT,
  price_version                    TEXT,
  created_at                       TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

Part VII — Derived Metrics

30. Total Token Consumption

Total Tokens =
Σ input_tokens_total
+ Σ output_tokens
+ Σ reasoning_tokens

31. Total Cost

Total Cost =
Σ State Cost

32. Cache Hit Ratio

Cache Hit Ratio =
total_cached_input_tokens / total_input_tokens

33. Cache Savings

Cache Saving =
(total_cached_input_tokens / 1,000,000)
× (P_input - P_cached_input)

34. State Cost Share

State Cost Share =
State Cost / Total Task Cost

Example:

RETRIEVE Cost Share =
RETRIEVE Cost / Total Task Cost

35. Tool Context Ratio

Tool Context Ratio =
tokens_sent_to_next_llm / response_tokens_raw

This tells us how much raw tool output entered the next LLM context.

A high ratio may indicate insufficient compression, filtering, deduplication, or reranking.

36. Input Amplification Ratio

Input Amplification Ratio =
total_input_tokens / user_instruction_tokens

This metric shows how much the agent expands a small user instruction into runtime context.

Example:

User instruction: 300 tokens
Total input tokens: 142,000 tokens

Input Amplification Ratio = 473.3×

This is one of the most important metrics for agent cost engineering.

37. Validation Repair Rate

Validation Repair Rate =
issues_fixed_later / issues_found

This measures whether validation is actually useful or merely generating noise.

38. Refinement Efficiency

Refinement Efficiency =
modified_content_tokens / refine_output_tokens

If refinement output is large but few tokens are actually modified, the system may be over-generating during repair.

39. Retrieval Compression Ratio

Retrieval Compression Ratio =
tokens_sent_to_next_llm / raw_result_tokens

A lower ratio usually means better retrieval filtering, assuming answer quality is preserved.
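
Most of these ratios fall out of simple aggregations over trace_run and trace_tool_event rows. A minimal sketch, assuming the rows are available as dictionaries:

def derived_metrics(run: dict, tool_events: list, user_instruction_tokens: int) -> dict:
    """Compute the headline trace ratios for one run."""
    raw_tool_tokens = sum(e["response_tokens_raw"] for e in tool_events)
    sent_tool_tokens = sum(e["tokens_sent_to_next_llm"] for e in tool_events)
    return {
        "cache_hit_ratio": run["total_cached_input_tokens"] / run["total_input_tokens"],
        "input_amplification_ratio": run["total_input_tokens"] / user_instruction_tokens,
        # Guard against runs that called no tools at all.
        "tool_context_ratio": sent_tool_tokens / raw_tool_tokens if raw_tool_tokens else 0.0,
    }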


Part VIII — Implementation Blueprint

40. Event Stream First

A production implementation should instrument the agent runtime as an event stream.

Recommended event categories:

run.started
step.started
model.called
tool.called
context.compiled
artifact.created
artifact.modified
state.changed
validator.called
step.completed
run.completed

Every event should include:

{
  "trace_id": "task_20260428_001",
  "step_id": 7,
  "event_type": "model.called",
  "timestamp": "2026-04-28T10:00:07Z",
  "payload": {}
}

The event stream is then normalized into relational tables or analytical storage.
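
Normalization is mostly a fold over the stream: group events by trace and step, collapse each group into one trace_step row, then roll step rows up into trace_run. A minimal sketch of the grouping and token-summing stages, assuming events are dictionaries shaped like the envelope above:

from collections import defaultdict


def group_events_by_step(events: list) -> dict:
    """Bucket raw runtime events by (trace_id, step_id) for normalization."""
    steps = defaultdict(list)
    for event in events:
        steps[(event["trace_id"], event["step_id"])].append(event)
    return steps


def step_token_totals(step_events: list) -> dict:
    """Sum token usage over every model.called event observed in one step."""
    totals = {"input_tokens_total": 0, "output_tokens": 0}
    for event in step_events:
        if event["event_type"] != "model.called":
            continue
        usage = event.get("payload", {})
        totals["input_tokens_total"] += usage.get("input_tokens_total", 0)
        totals["output_tokens"] += usage.get("output_tokens", 0)
    return totals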

41. Runtime Instrumentation Points

Instrumentation should happen at the following boundaries:

| Boundary | What to Capture |
| --- | --- |
| Context compiler | Input token provenance, cacheability, selected context segments |
| Model gateway | Model name, token usage, latency, cost, price snapshot |
| Tool router | Tool name, request size, raw response size, selected response size |
| Retrieval system | Query count, top-k, returned chunks, raw tokens, selected tokens |
| File subsystem | File reads, writes, parses, exports, visual/render checks |
| Database gateway | Query type, rows returned, selected rows, latency, cost |
| Validator runner | Validator input, output, score, confidence, issues found |
| State diff engine | Pre/post state snapshots, expected mutation, side effects |

42. Evaluation Pipeline

flowchart TD
    A[Agent Run] --> B[Runtime Event Stream]
    B --> C[Trace Normalizer]
    C --> D[Token & Cost Ledger]

    A --> E[Final Output]
    A --> F[State Snapshot After Run]

    E --> G[Validator Mesh]
    F --> H[State Diff Engine]

    G --> I[Outcome Result]
    H --> J[State Result]
    D --> K[Trace Result]

    I --> W[Evaluation Warehouse]
    J --> W
    K --> W

    W --> L[Failure Dashboard]
    W --> M[Cost Dashboard]
    W --> N[Optimization Engine]

Part IX — Optimization Surfaces

Trace Evaluation reveals where optimization should happen.

Typical findings:

| Cost Pattern | Likely Cause | Optimization Direction |
| --- | --- | --- |
| High skill_instruction_tokens | Skill package too verbose | Skill compression, modular Skill loading |
| High retrieved_context_tokens | Retrieval returns too much context | Better reranking, deduplication, budgeted context selection |
| High tool_result_tokens | Tool outputs are copied into context too aggressively | Tool result summarization, schema extraction |
| High history_tokens | Conversation history is not compressed | Rolling summary, hierarchical memory |
| High VALIDATE cost | Too many validation rounds | Validator routing, selective validation |
| High REFINE cost | Large rewrites for small fixes | Patch-based refinement |
| Low cache hit ratio | Context is unstable | Stable prompt prefixing, Skill cache segmentation |
| High input amplification ratio | Agent loop expands too aggressively | State pruning, tool result filtering, context budgets |

The point is not simply to reduce tokens.

The point is to reduce useless tokens without reducing task success.


Part X — Future: Self-Optimizing Agents

Once the trace ledger is reliable, it can become input to a self-optimizing agent.

The optimizer can read historical traces and propose improvements such as:

- compress long Skill instructions;
- split large Skills into cacheable modules;
- reduce top-k retrieval dynamically;
- tune context selection budgets;
- summarize tool responses before reinserting them into context;
- remove repeated validation steps;
- convert full rewrites into patch-based edits;
- improve cache hit ratio through stable prompt layouts;
- detect high-cost state patterns and rewrite workflow policy.

This creates a closed loop:

flowchart LR
    A[Agent Run] --> B[Trace Ledger]
    B --> C[Cost Attribution]
    C --> D[Optimization Agent]
    D --> E[Skill / Workflow / Context Policy Patch]
    E --> A

The current phase should not optimize prematurely.

The first milestone is:

Capture the trace completely.
Compute the cost correctly.
Attribute the cost to states and context sources.
Make every failure and every yuan explainable.

43. Engineering Principles

A production-grade agent evaluation system should follow these principles:

Principle 1 — No Self-Certified Success

The agent cannot declare itself successful.

Success must be verified by external validators, evidence checks, state checks, and execution records.

Principle 2 — Outcome and Cost Are Separate

A task can be correct and expensive.

A task can be cheap and wrong.

Both dimensions must be measured independently.

Principle 3 — External State Is First-Class

Files, databases, tickets, appointments, orders, emails, and workflow submissions are not side details.

They are part of the task result.

Principle 4 — Token Flow Must Be Attributed

Total token count is not enough.

Tokens must be attributed to source:

system prompt
Skill instruction
user instruction
history
memory
tool result
retrieved context
artifact context

Principle 5 — Failure Must Be Queryable

Failure messages should be structured into failure codes, validator outputs, and diagnostic metadata.

If failure cannot be aggregated, it cannot be improved.

Principle 6 — Price Must Be Snapshotted

Token prices change.

Every trace must store the price version used at execution time.

Principle 7 — Evaluation Data Should Feed Optimization

Evaluation is not only for reporting.

It should become the data foundation for workflow compression, Skill optimization, retrieval tuning, and future self-evolving agents.


Conclusion

Agent evaluation should be built like observability infrastructure, not like a subjective scoring prompt.

The proposed framework separates the problem into two layers:

Outcome Evaluation:
  Did the agent complete the task correctly?

Trace Evaluation:
  What did it cost to complete the task?

Outcome Evaluation uses contracts, hard gates, soft scores, validators, failure codes, and state-diff verification.

Trace Evaluation turns the agent runtime into an auditable execution ledger with per-state token accounting, context provenance, tool cost, model cost, latency, and optimization metrics.

The final product is not just a score.

It is an engineering substrate for building agents that are:

correct,
auditable,
cost-aware,
optimizable,
and eventually self-improving.

In production agent systems, evaluation is not the last step.

It is the control plane.