Phase 0 · Foundation Assessment — BEA-28 AI Validation Engine

The permanent base layer (FAD) for all later phases. One-liner: "Stress-test creative concepts against synthetic audience personas before launch." Category: Content Generation. Build type: ai-ml-generation. Owner: Buku. Written as the foundation that Phases 1–5 build on. It is deliberately consistent with the Phase-5 SPLIT-but-converging consensus and the Decision Gate REFINE verdict; neither is normally known when a foundation is written, but both are honored here so the dossier stack stays coherent.

1. Success criteria (measurable, time-bound)

The honest framing carried through every later phase is kill-only triage, not hit-prediction — predicting winners is near-impossible and burns trust on the first miss (Phase 2.5). The criteria below are written to that framing, not to a "predict the hit" promise.

[ ] Discrimination beats a dumb baseline (the gate-zero number). On an assembled back-test corpus of ≥30 outcome-labeled past Beats concepts, the synthetic triage instrument's kill/keep decision must beat a keyword-blocklist + coin-flip baseline by a pre-registered numeric margin (target: AUC ≥ 0.70 or rank-correlation ρ ≥ 0.35 against known outcomes) within 8 weeks of corpus assembly. No corpus / no number → no build. (This is Architecture A's existing guardrail promoted to a program-level kill criterion in Phase 3.5/5.)
[ ] Corpus exists and is clean. A back-test corpus of ≥30 past Beats concepts with clean, outcome-labeled results (panel reaction and/or launch outcome, with confounds like media spend and timing recorded) is assembled within the first quarter. Below 30 clean cases by end of Q1 → kill criterion trips. (Cold-start is the shared systemic risk across all three architectures, Phase 3.5.)
[ ] Brand-safety catch rate on a held-out landmine set. On a held-out set of known brand-safety / cultural-landmine concepts, the instrument flags ≥90% of them before any human review, measured at first production run and re-checked quarterly. (Tuned toward over-sensitivity per Principle 9; the asymmetric-cost objective.)
[ ] Oracle operational within 8 weeks (Architecture A path). A demographically-matched, legally-cleared aggregate-triage panel (≥20 retained, or an internal/vendor-NDA proxy panel showing abstracted stimuli) is reacting to concepts within 8 weeks, with a running synthetic-vs-oracle agreement ledger. No operational oracle by week 8 → Architecture A has no ground truth and collapses (Phase 3.5 A-kill criterion).
[ ] Build-vs-buy decision recorded before any internal staffing. An explicit comparison of the internal build (modeled NPV ≈ $0; Architecture B = +$0.05M) against an off-the-shelf synthetic-audience vendor license is documented and signed off before any engineer is staffed past the gate-zero experiment. (Validator-mandated, Whitlock, Phase 5.)
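
The gate-zero and catch-rate criteria above reduce to a few lines of arithmetic. The following is a minimal sketch, not the program's actual harness: it covers only the AUC branch of the gate (the ρ ≥ 0.35 alternative is omitted), the function names are hypothetical, and `margin` stands in for the pre-registered margin, whose value this dossier does not state.

```python
from itertools import product


def auc(kill_scores, failed):
    """AUC by pairwise comparison: the chance that a randomly chosen failed
    concept gets a higher kill-score than a successful one (ties count half)."""
    pos = [s for s, f in zip(kill_scores, failed) if f]
    neg = [s for s, f in zip(kill_scores, failed) if not f]
    if not pos or not neg:
        raise ValueError("corpus needs both failed and successful concepts")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p, n in product(pos, neg))
    return wins / (len(pos) * len(neg))


def catch_rate(flagged):
    """Share of held-out landmine concepts the instrument flagged."""
    return sum(flagged) / len(flagged)


def gate_zero(instrument, baseline, failed, margin, min_cases=30, min_auc=0.70):
    """Return (passed, instrument_auc, baseline_auc).

    `margin` is the pre-registered lead over the baseline; it is an input
    here because the source does not fix a number.
    """
    if len(failed) < min_cases:
        return False, None, None  # no corpus / no number -> no build
    a_i, a_b = auc(instrument, failed), auc(baseline, failed)
    return (a_i >= min_auc and a_i - a_b >= margin), a_i, a_b
```

A call such as `gate_zero(instrument_scores, baseline_scores, failed_labels, margin=0.10)` would use 0.10 purely as a placeholder; the real margin must be fixed before the corpus is scored. `catch_rate(flags) >= 0.90` expresses the landmine criterion the same way.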

2. Decision-maker profile (Three Ledgers)

Public ledger (what we pitch): "An AI engine that stress-tests every Beats creative concept against a panel of synthetic audience personas before launch, catching brand-safety landmines and weak concepts early, scaling concept review beyond what human panels can handle, and giving brand-review boards an auditable validation artifact." The slide-ready innovation story.
Shadow ledger (what the decision-maker actually optimizes for): Buku, as owner inside an Apple/Beats org, is optimizing for (1) a visible, defensible AI win that demonstrates the team shipped real ML capability (not a wrapper), (2) brand-safety cover — being able to show diligence was done before a campaign went out, and crucially avoiding the career-risk of an AI that approved a culturally tone-deaf campaign, and (3) avoiding the "we built an AI but humans still decide" perception that reads as underwhelming to leadership. The tension is real: the honest product (kill-only triage with a named human decider) directly conflicts with the desire for an impressive hit-predictor — and the accountability-laundering trap (Iyer, Phase 5) means a citable "pass" is a liability dressed as the very cover the shadow ledger wants.
True ledger (what will actually get built): Given the REFINE verdict and the staged-option sequencing, what actually gets built first is far thinner than the pitch: a single LLM triage classifier (kill / hold / escalate + one named failure mode), a back-test corpus, a synthetic-vs-oracle agreement ledger, and the near-free cross-model correlation experiment — before any cascade, certificate engine, multi-model jury, or regime breaker. Realistically the program either (a) proves the validity number and grows into Architecture B, or (b) fails the number and is correctly killed / replaced by a bought vendor layer. The grand multi-model vision (C) is a research option, not a build.
Owner / sponsor: Buku.
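
The thin v1 output described in the true ledger (kill / hold / escalate plus one named failure mode, always with a named human decider) can be pictured as a small record type. This is an illustrative sketch only; the class and field names are hypothetical, and the rule that a kill must carry a failure mode is an assumption drawn from the wording above rather than a stated requirement.

```python
from dataclasses import dataclass, replace
from enum import Enum
from typing import Optional


class Verdict(Enum):
    KILL = "kill"
    HOLD = "hold"
    ESCALATE = "escalate"
    # Deliberately no PASS member: the instrument never emits a citable approval.


@dataclass(frozen=True)
class TriageResult:
    concept_id: str
    verdict: Verdict
    failure_mode: Optional[str]  # the one named failure mode, if any
    human_decider: str           # the named human who owns the decision
    overturned: bool = False

    def __post_init__(self):
        if not self.human_decider.strip():
            raise ValueError("every verdict needs a named human decider")
        if self.verdict is Verdict.KILL and not self.failure_mode:
            raise ValueError("a kill must name its failure mode")


def overturn(result, decider):
    """Return a copy of the result marked as overturned by the named human."""
    return replace(result, overturned=True, human_decider=decider)
```

Leaving `PASS` out of the enum makes the out-of-scope "bare pass" unrepresentable instead of relying on a convention.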

3. Constraint inventory

Constraint	Type (hard / soft / assumed)	Notes
Apple confidentiality — unreleased Beats creative must not leak via the validation process	hard	One pre-launch concept leak attributable to the process is terminal regardless of accuracy (Phase 3.5 A9). Caps what an external panel / corpus may see.
Apple Legal / Privacy consent path for external consumers viewing pre-launch CE creative	hard (timeline)	Realistically 6–9 months and frequently ends in "no" (Vinh, Phase 5). Forces fallback to internal/vendor-NDA proxy panels showing abstracted stimuli.
Apple internal AI-model / data-governance policy on which foundation models may process proprietary creative	hard	Constrains model choice, vendor terms, and whether two distinct foundation models (Architecture C) are even permissible enterprise-side.
No back-test corpus or measured human-synthetic validity number exists today	hard (cold-start)	The keystone gap named independently by three Phase-5 reviewers. The whole program is gated on producing this number; nothing downstream is trustworthy without it.
Engineering headcount — fully-burdened ≈ $300K/eng/yr; staffing 1.5–4 eng is a real opportunity cost	hard (budget)	Internal-build NPVs are ~$0 (B = +$0.05M); staffing must clear a build-vs-buy gate (Whitlock, Phase 5).
Synthetic-persona validity (does an LLM persona reaction correlate with a real human's?)	assumed	Modeled `r ~ Beta(2,3)`, mean ≈ 0.40 — asserted, never measured (Okafor, Phase 5). The single largest variance driver; only partially mitigatable (A backstops with humans; B/C inherit fully).
Panel size statistically too small to gate per-stratum	assumed → soft	A 20–50-person panel yields only 3–5 people per stratum cell, which is anecdote, not measurement (Vinh). Per-stratum "pass every stratum" gating (B) needs a differently sized oracle than aggregate triage (A).
Creative-team adoption — a kill-only / "refuses to judge" tool risks compliance theater and non-use	soft	Tomás Beckett (Phase 5): edgy is the product; a hall-monitor tool gets mandated usage and zero real influence. Mitigated by making every verdict overturnable with a named human decider.
Over-sensitivity bias systematically removes Beats' edgy brand equity	soft	Principle 9 tuning kills bold-but-fine concepts; route high-creative-confidence kills to human override (Phase 3.5 A7).
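
Two figures in the table above can be checked with standard formulas. The sketch below is a worked-arithmetic aid, not part of any model in the program: it computes the mean and standard deviation of the asserted `Beta(2,3)` validity prior and the annual cost range implied by 1.5–4 engineers at about $300K each. The function names are hypothetical.

```python
import math


def beta_mean_sd(a, b):
    """Mean and standard deviation of a Beta(a, b) distribution."""
    mean = a / (a + b)
    var = (a * b) / ((a + b) ** 2 * (a + b + 1))
    return mean, math.sqrt(var)


def annual_staffing_cost(engineers, burdened_cost=300_000):
    """Fully-burdened yearly cost for a given engineer headcount."""
    return engineers * burdened_cost


mean, sd = beta_mean_sd(2, 3)          # mean 0.40, sd 0.20
low = annual_staffing_cost(1.5)        # 450,000
high = annual_staffing_cost(4)         # 1,200,000
print(f"validity prior: mean={mean:.2f}, sd={sd:.2f}")
print(f"staffing: ${low:,.0f} to ${high:,.0f} per year")
```

A standard deviation of 0.20 around a mean of 0.40 shows how wide the asserted prior is, which fits the table's description of persona validity as the single largest variance driver.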

4. Scope boundary

In scope: Stress-testing Beats creative concepts — campaign ideas, ad copy, product positioning, packaging — against synthetic audience personas before launch, to (a) cheaply kill obvious losers, (b) flag brand-safety / cultural landmines, and (c) know when NOT to opine (escalate to human). The instrument's own self-validation (positive-control harness, agreement ledger) is in scope because the validator must be validated before the candidate (Principle 1). Assembling the back-test corpus and running the gate-zero validity + cross-model correlation experiments is the first in-scope work.
Out of scope: Predicting hits / forecasting revenue uplift (near-impossible, burns trust — Phase 2.5). Emitting a bare "pass" / a citable approval (accountability-laundering liability — Iyer; the product emits named-risks-with-a-named-human-decider instead). Validating finished/launched creative (this is pre-launch only). Replacing the human decision — a named human always decides and can overturn. The full multi-model adversarial jury, regime circuit-breaker, and version-bound certificate engine are deferred (escalation targets, not v1).
Recursion budget acknowledged: max 3 per phase. Current recursionCount = 0. The Decision-Gate REFINE loop is the first sanctioned re-pass (back to Phase 4 to harden the value model), well within budget.

5. Assumption log

#	Assumption	Confidence (0–1)	Validated?
1	LLM-simulated persona reactions correlate with real human reactions enough to triage (coarse valence: offensive / boring), even if useless for ranking good options.	0.45	No — this is the keystone gap; `r ~ Beta(2,3)` mean ≈ 0.40 is asserted, not measured. Gate-zero experiment must produce the number.
2	A clean, outcome-labeled back-test corpus of ≥30 past Beats concepts can be assembled within Q1.	0.55	No — modeled Bernoulli p = 0.55; historical labels were never cleanly recorded and outcomes are confounded by media spend/timing.
3	A demographically-matched real (or NDA proxy) panel can be stood up and legally cleared within ~8 weeks at Apple.	0.35	No — Vinh estimates 6–9 months for external consumers; fallback to internal/vendor-NDA proxy is the realistic path.
4	"Obvious-loser + landmine" detection is the easy 80% an LLM reliably beats a keyword blocklist on.	0.5	No — A3 names the risk that a dumb blocklist catches it just as well; only the gate-zero number settles this.
5	Creative leads and leadership will accept a kill-only / abstaining (non-predictive), overturnable tool as valuable rather than underwhelming.	0.4	No — Beckett (adoption) and the "AI but humans still decide" perception risk (Phase 3.5) both push against this; an expectation-management problem set at kickoff.
6	The positive-control self-validation harness actually detects persona-model drift (a degraded model fails the known-losers).	0.5	No — B1/B10 flag that drift can spare a fixed control set; requires ≥90% drift-catch in red-team injection before trusting.
7	Pre-specified persona strata are stable and independent enough that "pass every stratum" is signal, not noise or one correlated voice.	0.35	No — Nadar: LLM persona sycophancy/mode-collapse makes strata correlate far more than real demographics; must measure inter-stratum correlation first.
8	An internal build is justified over buying an off-the-shelf synthetic-audience vendor license.	0.3	No — Whitlock: NPVs ~$0; build-vs-buy comparison was never run and is now a mandated gate before staffing.
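
Assumption 7 calls for measuring inter-stratum correlation before trusting "pass every stratum" gating. One simple way to do that, sketched here under stated assumptions, is the mean pairwise Pearson correlation between strata's scores over the same set of concepts. The function names and input layout are hypothetical, and the source sets no threshold for "too correlated", so none is encoded.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation between two equal-length score lists."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two equal-length lists with at least 2 scores")
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant scores")
    return sxy / math.sqrt(sxx * syy)


def mean_inter_stratum_correlation(scores_by_stratum):
    """Average pairwise correlation across strata.

    `scores_by_stratum` maps a stratum name to that persona stratum's scores,
    one per concept, in the same concept order for every stratum.
    """
    pairs = list(combinations(scores_by_stratum.values(), 2))
    if not pairs:
        raise ValueError("need at least two strata")
    return sum(pearson(x, y) for x, y in pairs) / len(pairs)
```

A value near 1.0 would suggest the strata are one correlated voice, the failure mode Nadar describes, while values closer to those seen between real demographic groups would support treating each stratum as separate signal.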

Gate: success criteria defined, constraints mapped, decision-maker identified → set phaseGates.0-foundation = passed in manifest.json.
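
Recording the gate could look like the sketch below. It assumes, without confirmation from this dossier, that `manifest.json` stores gates as a nested object (`{"phaseGates": {"0-foundation": "passed"}}`) and that the value is the string `"passed"`; the helper names are hypothetical.

```python
import copy
import json
from pathlib import Path


def with_gate_passed(manifest, gate="0-foundation"):
    """Return a copy of the manifest dict with the given gate marked passed."""
    updated = copy.deepcopy(manifest)
    # Assumed layout: gates live under a nested "phaseGates" object.
    updated.setdefault("phaseGates", {})[gate] = "passed"
    return updated


def mark_gate_passed(manifest_path, gate="0-foundation"):
    """Read manifest.json (if present), mark the gate, and write it back."""
    path = Path(manifest_path)
    current = json.loads(path.read_text(encoding="utf-8")) if path.exists() else {}
    updated = with_gate_passed(current, gate)
    path.write_text(json.dumps(updated, indent=2) + "\n", encoding="utf-8")
    return updated
```

Keeping the dictionary update separate from the file I/O leaves other manifest fields, such as `recursionCount`, untouched and makes the change easy to review before it is written.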
