Experimental design
The design compares agent variants, memory policies, and procedures under repeated replays of the same scenarios. The goal is to isolate drift that arises from agent behavior rather than from changes to the underlying model.
Agent variants
Prompt-only baseline
Single prompt, no external procedure artifact, memory toggled independently.
Skill-executing agent
Loads SKILL.md and thresholds.yaml, follows ordered steps, logs branch IDs.
Verifier extensions
Optional rule-based or LLM critics for escalation checks.
Memory policies
- No-memory and rolling-window baselines
- TTL-bounded memory, limited to the last N runs or the last T minutes
- Reset-on-regime to avoid cross-regime leakage
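The TTL-bounded and reset-on-regime policies above could be sketched roughly as follows. This is a minimal illustration, not the project's implementation: the `MemoryEntry` and `Memory` classes, their field names, and the regime labels are all hypothetical, and the sketch assumes entries are written before the run that reads them.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class MemoryEntry:
    run_index: int     # replay counter at the time the entry was written
    timestamp: float   # seconds on any consistent clock
    regime: str        # hypothetical labels, e.g. "calm", "vol_spike", "drawdown"
    note: str


@dataclass
class Memory:
    max_runs: Optional[int] = None      # TTL expressed as "last N runs"
    max_age_s: Optional[float] = None   # TTL expressed as "last T minutes", in seconds
    reset_on_regime: bool = False
    entries: List[MemoryEntry] = field(default_factory=list)

    def add(self, entry: MemoryEntry) -> None:
        # Reset-on-regime: clear stored entries when the regime label changes,
        # so nothing from the previous regime leaks into the new one.
        if (
            self.reset_on_regime
            and self.entries
            and self.entries[-1].regime != entry.regime
        ):
            self.entries.clear()
        self.entries.append(entry)

    def visible(self, current_run: int, now: float) -> List[MemoryEntry]:
        # Apply whichever TTL bounds are configured; unset bounds are ignored.
        out = self.entries
        if self.max_runs is not None:
            out = [e for e in out if current_run - e.run_index <= self.max_runs]
        if self.max_age_s is not None:
            out = [e for e in out if now - e.timestamp <= self.max_age_s]
        return list(out)
```

Keeping both TTL bounds optional lets one class cover the "last N runs" and "last T minutes" variants; the no-memory baseline would simply not attach a `Memory` at all.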
Metrics snapshot
DDR (Decision Disagreement Rate) measures how often identical scenarios yield different decisions across replays. SR (Switch Rate) measures how often the decision label changes between adjacent replays.
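One possible way to compute the two metrics is sketched below. The text does not give exact formulas, so this sketch makes two assumptions: DDR is taken as the fraction of scenarios whose replays do not all agree, and SR as the fraction of adjacent replay pairs whose labels differ. The function names and the example labels are illustrative only.

```python
from typing import Dict, Hashable, Sequence


def switch_rate(decisions: Sequence[Hashable]) -> float:
    """Fraction of adjacent replay pairs whose decision label differs (assumed SR)."""
    if len(decisions) < 2:
        return 0.0
    switches = sum(a != b for a, b in zip(decisions, decisions[1:]))
    return switches / (len(decisions) - 1)


def decision_disagreement_rate(replays: Dict[str, Sequence[Hashable]]) -> float:
    """Fraction of scenarios whose replays are not all identical (assumed DDR)."""
    if not replays:
        return 0.0
    disagreeing = sum(len(set(d)) > 1 for d in replays.values())
    return disagreeing / len(replays)


# Hypothetical replay logs: scenario name -> decision label per replay, in order.
replays = {
    "calm": ["hold", "hold", "hold", "hold"],
    "vol_spike": ["hold", "escalate", "escalate", "hold"],
}

ddr = decision_disagreement_rate(replays)                    # 1 of 2 scenarios disagree
sr = {name: switch_rate(d) for name, d in replays.items()}   # per-scenario switch rate
```

In this example the calm scenario never changes label, while the volatility-spike scenario switches on two of its three adjacent pairs. A pairwise definition of DDR (disagreement over all replay pairs) would be an equally plausible reading and would give different values.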
Finance-native scenarios
Calm regime
Stable volatility with routine monitoring and low escalation.
Volatility spike
Transient jumps in risk that put pressure on the escalation logic.
Drawdown
A prolonged drawdown that tests persistence and narrative consistency.
Run matrix
Each agent variant runs across multiple regimes with repeated replays. The run matrix fixes the seeds, scenario orderings, and replay counts so that drift over time can be quantified.
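A run matrix of this kind could be enumerated as a Cartesian product over the controlled factors. The sketch below is an assumption about how such a matrix might be built, not the actual harness: the `Run` record, `build_run_matrix`, and all variant, policy, and regime names are hypothetical.

```python
import random
from dataclasses import dataclass
from itertools import product
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class Run:
    agent: str
    memory_policy: str
    regime_order: Tuple[str, ...]   # order in which scenarios are presented
    seed: int
    replay: int


def build_run_matrix(
    agents: Sequence[str],
    memory_policies: Sequence[str],
    regimes: Sequence[str],
    seeds: Sequence[int],
    n_replays: int,
    n_orderings: int,
    order_seed: int = 0,
) -> List[Run]:
    # A dedicated RNG keeps the scenario orderings reproducible across builds.
    rng = random.Random(order_seed)
    orderings = []
    for _ in range(n_orderings):
        order = list(regimes)
        rng.shuffle(order)
        orderings.append(tuple(order))
    # Every combination of the factors becomes one run.
    return [
        Run(a, m, o, s, r)
        for a, m, o, s, r in product(
            agents, memory_policies, orderings, seeds, range(n_replays)
        )
    ]


matrix = build_run_matrix(
    agents=["prompt_only", "skill"],
    memory_policies=["none", "ttl"],
    regimes=["calm", "vol_spike", "drawdown"],
    seeds=[0, 1],
    n_replays=3,
    n_orderings=2,
)
```

With these example inputs the matrix holds 2 x 2 x 2 x 2 x 3 = 48 runs. Fixing `order_seed` means two builds with the same arguments produce identical matrices, which keeps differences between replays attributable to the agent rather than to the harness.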