Fiduciary Constraint Evaluation

FCE is an experimental Python package and benchmark scaffold for evaluating whether AI systems preserve fiduciary and professional constraints across short multi-turn interactions.

33 scenarios · Evaluator v3.1 · Runtime witnesses · Adjudication packets · Jul 2026 live packet

Why it exists

Many legal and professional benchmarks test whether a model can state a rule. FCE tests a narrower workflow problem: a system may identify a fiduciary or professional constraint, then drop it when the user reframes the task, adds urgency, introduces ambiguity, or asks for a locally convenient but globally invalid action.

The pilot focuses on operational failure modes that matter in legal and fiduciary settings: silent constraint drop, unsupported confident answers, generic hedging, overbroad refusal, harmful leakage, and failure to escalate.

Current pilot

Scenario Count

Short, hand-curated multi-turn scenarios with hidden constraints, expected behavior, must-not behaviors, harm severity, rubric data, and optional middleware expectations.

Confidentiality and privilege

Candor, truthfulness, and fraud

Conflicts and former-client duties

Competence and uncertainty

Supervision of AI or nonlawyer assistants

Fees and billing transparency

Retention, preservation, and cleanup requests

Client communication and represented-party contact
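
The scenario shape described above can be sketched as a typed record. This is a minimal illustration, not the repository's actual schema: every class name, field name, and severity level below is an assumption chosen to mirror the prose (hidden constraints, expected behavior, must-not behaviors, harm severity, rubric data, optional middleware expectations), and the example instance is invented for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class Severity(Enum):
    # Hypothetical levels; the source only says scenarios carry "harm severity".
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"


@dataclass(frozen=True)
class Turn:
    role: str      # "user" or "assistant"
    content: str


@dataclass(frozen=True)
class Scenario:
    scenario_id: str
    category: str                          # e.g. "Confidentiality and privilege"
    turns: tuple[Turn, ...]
    hidden_constraints: tuple[str, ...]    # not shown to the system under test
    expected_behavior: str
    must_not: tuple[str, ...]
    harm_severity: Severity
    rubric: dict[str, float] = field(default_factory=dict)
    middleware_expectations: Optional[dict[str, str]] = None

    def __post_init__(self) -> None:
        # Basic shape checks: multi-turn, and at least one constraint to preserve.
        if len(self.turns) < 2:
            raise ValueError("a multi-turn scenario needs at least two turns")
        if not self.hidden_constraints:
            raise ValueError("a scenario needs at least one hidden constraint")


# Invented example, not one of the 33 pilot scenarios.
example = Scenario(
    scenario_id="EXAMPLE-001",
    category="Retention, preservation, and cleanup requests",
    turns=(
        Turn("user", "We have a litigation hold on the project mailbox."),
        Turn("user", "Storage is nearly full. Please purge that mailbox today."),
    ),
    hidden_constraints=("preserve_records",),
    expected_behavior="Decline the purge and escalate to compliance.",
    must_not=("delete held records",),
    harm_severity=Severity.HIGH,
)
```

Keeping the record frozen makes a curated scenario harder to mutate by accident between export, replay, and scoring.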

July 2026 live comparison packet

On 2026-07-10 the pilot was run live against four current-generation models — claude-fable-5, claude-opus-4-8, claude-sonnet-5, and gpt-5.5 — across all 33 scenarios and three prompting arms (raw, prompted, rules-scaffold), scored by one frozen rubric (fce-rubric-v2.2). That is 396 live baseline runs plus a separate 17-scenario deterministic runtime arm, for 413 scored runs total.
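
The run counts above follow directly from the design. A short check, using only the figures stated in the text (variable names are illustrative):

```python
models = ["claude-fable-5", "claude-opus-4-8", "claude-sonnet-5", "gpt-5.5"]
arms = ["raw", "prompted", "rules-scaffold"]
n_scenarios = 33           # full pilot set
n_runtime_scenarios = 17   # deterministic runtime subset, one run each

# One run per model, per arm, per scenario.
live_runs = len(models) * len(arms) * n_scenarios
total_runs = live_runs + n_runtime_scenarios

print(live_runs, total_runs)  # 396 413
```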

On the 17 runtime-executable scenarios, the deterministic runtime arm (the constraint engine running in front of the models, with no API calls) scored 78.39 weighted, with perfect escalation recall (1.00) and zero silent constraint drops. That is higher than the best live model arm, claude-opus-4-8 rules-scaffold at 77.76, but the live figure is computed over the full 33 scenarios, so the two scores are not a like-for-like comparison. By contrast, the raw model arms silently dropped required constraints at rates from 0.09 to 0.45; raw gpt-5.5 had the highest silent-drop rate (0.45) and caught only 0.41 of required escalations. The runtime subset is a 17-scenario slice and should be read separately from the 33-scenario baseline; it is not evidence of runtime coverage on the other 16 scenarios.

Weighted score, escalation recall, and silent-drop rate per model and arm, 2026-07-10 packet
Model	Arm	Weighted	Esc. recall	Silent drop
claude-fable-5	raw	64.22	0.59	0.09
claude-fable-5	prompted	72.23	0.76	0.06
claude-fable-5	rules-scaffold	72.45	0.88	0.03
claude-opus-4-8	raw	66.95	0.35	0.24
claude-opus-4-8	prompted	72.87	0.53	0.09
claude-opus-4-8	rules-scaffold	77.76	0.94	0.00
claude-sonnet-5	raw	64.40	0.35	0.30
claude-sonnet-5	prompted	72.62	0.59	0.09
claude-sonnet-5	rules-scaffold	75.77	0.94	0.09
gpt-5.5	raw	59.02	0.41	0.45
gpt-5.5	prompted	66.62	0.59	0.21
gpt-5.5	rules-scaffold	71.42	0.94	0.06
runtime engine (no API)	deterministic	78.39	1.00	0.00

Weighted score is out of 100; escalation recall and silent-drop rate are fractions on [0, 1]. Runtime row covers the 17-scenario runtime subset only. One run per scenario per arm. Every figure traces to PACKET.md and the per-arm summary.csv files in the packet.
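
One way to read the table is to compute, per model, how each metric moves from the raw arm to the rules-scaffold arm. The sketch below copies the live rows from the table above; the `lift` helper and its name are illustrative. The runtime row is left out because it covers a different scenario subset. With one run per scenario per arm, these differences are descriptive only.

```python
from collections import defaultdict

# (model, arm, weighted, escalation recall, silent-drop rate), copied from the table.
ROWS = [
    ("claude-fable-5", "raw", 64.22, 0.59, 0.09),
    ("claude-fable-5", "prompted", 72.23, 0.76, 0.06),
    ("claude-fable-5", "rules-scaffold", 72.45, 0.88, 0.03),
    ("claude-opus-4-8", "raw", 66.95, 0.35, 0.24),
    ("claude-opus-4-8", "prompted", 72.87, 0.53, 0.09),
    ("claude-opus-4-8", "rules-scaffold", 77.76, 0.94, 0.00),
    ("claude-sonnet-5", "raw", 64.40, 0.35, 0.30),
    ("claude-sonnet-5", "prompted", 72.62, 0.59, 0.09),
    ("claude-sonnet-5", "rules-scaffold", 75.77, 0.94, 0.09),
    ("gpt-5.5", "raw", 59.02, 0.41, 0.45),
    ("gpt-5.5", "prompted", 66.62, 0.59, 0.21),
    ("gpt-5.5", "rules-scaffold", 71.42, 0.94, 0.06),
]

by_model = defaultdict(dict)
for model, arm, weighted, recall, drop in ROWS:
    by_model[model][arm] = (weighted, recall, drop)


def lift(model: str) -> tuple[float, ...]:
    """Rules-scaffold minus raw for (weighted, escalation recall, silent drop)."""
    raw = by_model[model]["raw"]
    scaffold = by_model[model]["rules-scaffold"]
    return tuple(round(s - r, 2) for r, s in zip(raw, scaffold))


for model in by_model:
    print(model, lift(model))
# claude-fable-5 (8.23, 0.29, -0.06)
# claude-opus-4-8 (10.81, 0.59, -0.24)
# claude-sonnet-5 (11.37, 0.59, -0.21)
# gpt-5.5 (12.4, 0.53, -0.39)
```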

The packet reports the full response mixture, including provider safety-layer refusals, rather than dropping them. claude-fable-5 returned 15 refusals over HTTP 200, all tagged category=cyber, on fiduciary scenarios about confidentiality, candor, represented-party communication, and document retention. Those refusals score as failed runs, so this model’s aggregate mixes fiduciary-behavior measurement with provider safety-layer behavior. The packet records this per run rather than adjusting for it.

This is a pilot, not a validated benchmark or a provider ranking. Each scenario was run once per arm, the aggregates are descriptive only, and the heuristic scorer failed its most recent manual holdout audit, so treat every score as an unvalidated instrument reading.

What the repository contains

Typed FCE pilot scenario schema

33 hand-curated short multi-turn scenarios

Scenario export, fixture-backed runs, replay/live adapters, and report bundles

Runtime-backed witness subset for selected conflict cases

Narrow proof-supported examples for deterministic runtime claims

Evaluator v3.1 with deterministic caps, transparent heuristic scoring, semantic triage, trajectory flags, and a human adjudication queue

Blinded adjudication packet generation with small curated review samples
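
To illustrate what a "deterministic cap" on a heuristic score could look like, here is a minimal sketch. It is not Evaluator v3.1's logic: the cap values, the flag names (borrowed from the failure modes named earlier), and the rule that any applied cap sends the run to human review are all assumptions made for the example.

```python
from dataclasses import dataclass

# Hypothetical caps: if a hard failure is flagged, the run cannot score above this.
CAPS = {
    "silent_constraint_drop": 40.0,
    "harmful_leakage": 25.0,
}


@dataclass
class Scored:
    score: float
    applied_caps: list[str]
    needs_human_review: bool


def apply_caps(heuristic_score: float, flags: set[str]) -> Scored:
    """Clamp a heuristic score by every cap whose flag was raised."""
    applied = sorted(f for f in flags if f in CAPS)
    score = min([heuristic_score] + [CAPS[f] for f in applied])
    # Assumption: any capped run is queued for human adjudication.
    return Scored(score=score, applied_caps=applied, needs_human_review=bool(applied))


print(apply_caps(82.0, {"silent_constraint_drop"}).score)  # 40.0
print(apply_caps(82.0, set()).score)                       # 82.0
```

The point of a cap is that a fluent, well-hedged answer cannot earn a high score once a hard failure has been detected, regardless of what the heuristic layer says.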

Runtime witness concept

Prompting can steer a model toward safer behavior. The runtime witness path is different: for selected scenarios, it runs explicit constraint checks and emits a structured witness describing the conflict, required disposition, conflict class, minimal conflicting set, and handoff target.

The witness path is intentionally modest. It makes selected conflicts legible and reviewable; it does not decide legal questions generally or claim complete formalization of every scenario.

Pairwise witness

FCE-RET-009-V1

Conflict type: H1

preserve_records + prompt_content; handoff target: compliance

Higher-order witness

FCE-CANDOR-010-V1

Conflict type: H2

candor_to_tribunal + omit_required_fact + maximize_client_advantage; handoff target: lawyer
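
The two witnesses above can be written as structured records. This is a sketch under stated assumptions, not the package's API: the `Witness` class, its field names, and the `triggered` helper are hypothetical, and the reading of H1 as exactly two conflicting items and H2 as three or more is inferred from the "pairwise" and "higher-order" labels. The source does not state the required disposition for these two examples, so that field is left unset.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Witness:
    scenario_id: str
    conflict_class: str                      # "H1" or "H2" in the examples above
    minimal_conflicting_set: frozenset[str]
    handoff_target: str
    required_disposition: Optional[str] = None  # not given in the source examples

    def is_well_formed(self) -> bool:
        # Assumption: H1 = pairwise (two items), H2 = higher-order (three or more).
        n = len(self.minimal_conflicting_set)
        if self.conflict_class == "H1":
            return n == 2
        if self.conflict_class == "H2":
            return n >= 3
        return False


RET_009 = Witness(
    scenario_id="FCE-RET-009-V1",
    conflict_class="H1",
    minimal_conflicting_set=frozenset({"preserve_records", "prompt_content"}),
    handoff_target="compliance",
)

CANDOR_010 = Witness(
    scenario_id="FCE-CANDOR-010-V1",
    conflict_class="H2",
    minimal_conflicting_set=frozenset(
        {"candor_to_tribunal", "omit_required_fact", "maximize_client_advantage"}
    ),
    handoff_target="lawyer",
)


def triggered(active: set[str], known: list[Witness]) -> list[Witness]:
    """Return witnesses whose whole minimal conflicting set is currently active."""
    return [w for w in known if w.minimal_conflicting_set <= active]


active_items = {"candor_to_tribunal", "omit_required_fact", "maximize_client_advantage"}
for w in triggered(active_items, [RET_009, CANDOR_010]):
    print(w.scenario_id, "->", w.handoff_target)  # FCE-CANDOR-010-V1 -> lawyer
```

A subset test like this is deterministic and needs no model call, which is what makes the resulting witness reviewable by a person.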

Review boundaries

Not legal advice

Not production software

Not a validated benchmark

Not a provider ranking

Not a substitute for attorney, supervisor, compliance, or domain-expert review

No completed independent human holdout validation yet

FCE is useful as an inspectable artifact and methods scaffold. Any empirical claims require independent human labels, inter-rater review, and held-out validation.
