Study blueprint

Experimental design

The design compares agent variants, memory policies, and procedures under repeated replays of the same scenarios. The goal is to isolate drift caused by agent behavior rather than by changes to the underlying model.

Agent variants

Prompt-only baseline

Single prompt, no external procedure artifact, memory toggled independently.

Skill-executing agent

Loads SKILL.md and thresholds.yaml, follows ordered steps, logs branch IDs.

Verifier extensions

Optional rule-based or LLM critics for escalation checks.

Memory policies

No-memory and rolling-window baselines
TTL-bounded memory, limited to the last N runs or the last T minutes
Reset on regime change, to avoid cross-regime leakage
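
The policies above can be sketched as a single bounded store. This is a minimal illustration, not the study's implementation: the class name TTLMemory, its fields, and the use of minutes as a plain float timestamp are all assumptions made for the example.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Deque, List, Optional, Tuple


@dataclass
class TTLMemory:
    """Hypothetical memory store; names and fields are illustrative only."""
    max_runs: Optional[int] = None        # keep at most the last N runs
    max_age_min: Optional[float] = None   # keep entries newer than T minutes
    reset_on_regime: bool = False         # clear memory when the regime changes
    _items: Deque[Tuple[float, str, str]] = field(default_factory=deque)
    _regime: Optional[str] = None

    def add(self, t_min: float, regime: str, note: str) -> None:
        # Drop everything on a regime change to avoid cross-regime leakage.
        if self.reset_on_regime and self._regime is not None and regime != self._regime:
            self._items.clear()
        self._regime = regime
        self._items.append((t_min, regime, note))
        self._evict(t_min)

    def _evict(self, now_min: float) -> None:
        # Count bound: discard the oldest entries beyond the last N runs.
        if self.max_runs is not None:
            while len(self._items) > self.max_runs:
                self._items.popleft()
        # Age bound: discard entries older than T minutes.
        if self.max_age_min is not None:
            while self._items and now_min - self._items[0][0] > self.max_age_min:
                self._items.popleft()

    def recall(self, now_min: float) -> List[str]:
        self._evict(now_min)
        return [note for _, _, note in self._items]
```

Under these assumptions, setting max_runs=0 behaves like the no-memory baseline, and setting only max_runs gives a rolling window, so one class can cover every policy in the list.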

Metrics snapshot

DDR (Decision Disagreement Rate) measures how often identical scenarios yield different decisions across replays. SR (Switch Rate) captures label changes between adjacent replays.
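
The text does not give exact formulas, so the following is one plausible reading rather than the study's definition. It assumes DDR is the fraction of replay pairs of the same scenario that disagree, and SR is the fraction of adjacent replays whose label changes. The function names and the input layout (scenario name mapped to an ordered list of decisions) are hypothetical.

```python
from itertools import combinations
from typing import Dict, List


def ddr(decisions: Dict[str, List[str]]) -> float:
    """Assumed DDR: share of same-scenario replay pairs with different decisions."""
    pairs = 0
    disagreements = 0
    for labels in decisions.values():
        for a, b in combinations(labels, 2):
            pairs += 1
            disagreements += a != b
    return disagreements / pairs if pairs else 0.0


def switch_rate(decisions: Dict[str, List[str]]) -> float:
    """Assumed SR: share of adjacent replays where the label changes."""
    adjacent = 0
    switches = 0
    for labels in decisions.values():
        for prev, cur in zip(labels, labels[1:]):
            adjacent += 1
            switches += prev != cur
    return switches / adjacent if adjacent else 0.0


# Illustrative input only; these labels are made up for the example.
runs = {
    "calm": ["hold", "hold", "hold"],
    "volatility_spike": ["escalate", "hold", "escalate"],
}
print(round(ddr(runs), 3), switch_rate(runs))
```

With this reading the two metrics can diverge: a decision that flips once and stays flipped has a low SR but a comparatively high DDR, while a decision that oscillates on every replay pushes SR toward 1.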

Finance-native scenarios

Calm regime

Stable volatility with routine monitoring and low escalation.

Volatility spike

Transient risk jumps that pressure escalation logic.

Drawdown

Extended drawdown that tests persistence and narrative consistency.

Run matrix

Each agent variant runs across multiple regimes with repeated replays. The run matrix defines controlled seeds, scenario orderings, and replay counts to quantify drift over time.
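
A run matrix of this kind could be enumerated as below. This is a sketch under stated assumptions: RunSpec, build_run_matrix, and the string labels for agents, policies, and scenarios are hypothetical names, and the choice to derive each scenario ordering from the seed is one way of keeping orderings controlled, not necessarily the one used in the study.

```python
import random
from dataclasses import dataclass
from itertools import product
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class RunSpec:
    """One cell of the hypothetical run matrix."""
    agent: str
    memory_policy: str
    seed: int
    replay: int
    scenario_order: Tuple[str, ...]


def build_run_matrix(
    agents: Sequence[str],
    memory_policies: Sequence[str],
    scenarios: Sequence[str],
    seeds: Sequence[int],
    replays: int,
) -> List[RunSpec]:
    runs: List[RunSpec] = []
    for agent, policy, seed in product(agents, memory_policies, seeds):
        # The ordering depends only on the seed, so every agent and policy
        # sees the same scenario sequence for a given seed.
        order = list(scenarios)
        random.Random(seed).shuffle(order)
        for replay in range(replays):
            runs.append(RunSpec(agent, policy, seed, replay, tuple(order)))
    return runs


matrix = build_run_matrix(
    agents=["prompt_only", "skill_executing"],
    memory_policies=["none", "rolling_window", "ttl"],
    scenarios=["calm", "volatility_spike", "drawdown"],
    seeds=[0, 1],
    replays=3,
)
print(len(matrix))  # 2 agents x 3 policies x 2 seeds x 3 replays = 36
```

Because every source of variation is a field of RunSpec, any decision disagreement between two runs that share agent, policy, seed, and ordering can be attributed to the replay itself.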