Phase 0 · Foundation Assessment — BEA-28 AI Validation Engine
The permanent base layer (FAD) for all later phases. One-liner: "Stress-test creative concepts against synthetic audience personas before launch." Category: Content Generation. Build type: ai-ml-generation. Owner: Buku. Written as the foundation that Phases 1–5 build on. It is deliberately consistent with the Phase-5 SPLIT-but-converging consensus and the Decision Gate REFINE verdict; neither would normally be known when a foundation is written, but both are honored here so the dossier stack stays coherent.
1. Success criteria (measurable, time-bound)
The honest framing carried through every later phase is kill-only triage, not hit-prediction — predicting winners is near-impossible and burns trust on the first miss (Phase 2.5). The criteria below are written to that framing, not to a "predict the hit" promise.
- [ ] Discrimination beats a dumb baseline (the gate-zero number). On an assembled back-test corpus of ≥30 outcome-labeled past Beats concepts, the synthetic triage instrument's kill/keep decision must beat a keyword-blocklist + coin-flip baseline by a pre-registered numeric margin (target: AUC ≥ 0.70 or rank-correlation ρ ≥ 0.35 against known outcomes) within 8 weeks of corpus assembly. No corpus / no number → no build. (This is Architecture A's existing guardrail promoted to a program-level kill criterion in Phase 3.5/5.)
- [ ] Corpus exists and is clean. A back-test corpus of ≥30 past Beats concepts with clean, outcome-labeled results (panel reaction and/or launch outcome, with confounds like media spend and timing recorded) is assembled within the first quarter. Below 30 clean cases by end of Q1 → kill criterion trips. (Cold-start is the shared systemic risk across all three architectures, Phase 3.5.)
- [ ] Brand-safety catch rate on a held-out landmine set. On a held-out set of known brand-safety / cultural-landmine concepts, the instrument flags ≥90% of them before any human review, measured at first production run and re-checked quarterly. (Tuned toward over-sensitivity per Principle 9; the asymmetric-cost objective.)
- [ ] Oracle operational within 8 weeks (Architecture A path). A demographically-matched, legally-cleared aggregate-triage panel (≥20 retained, or an internal/vendor-NDA proxy panel showing abstracted stimuli) is reacting to concepts within 8 weeks, with a running synthetic-vs-oracle agreement ledger. No operational oracle by week 8 → Architecture A has no ground truth and collapses (Phase 3.5 A-kill criterion).
- [ ] Build-vs-buy decision recorded before any internal staffing. An explicit comparison of internal-build cost (modeled NPV ~$0; Architecture B = +$0.05M) against an off-the-shelf synthetic-audience vendor license is documented and signed off before any engineer is staffed past the gate-zero experiment. (Validator-mandated, Whitlock, Phase 5.)
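The gate-zero criterion above can be made concrete with a small scoring sketch. It is illustrative only: the function names, the `flopped` label encoding (1 = the concept failed), and especially `min_margin=0.05` are assumptions, since the pre-registered margin is not stated in this dossier. Only the ≥30-case floor and the AUC ≥ 0.70 / ρ ≥ 0.35 targets come from the criteria.

```python
from math import sqrt
from typing import Sequence


def average_ranks(values: Sequence[float]) -> list[float]:
    """Rank values from 1..n; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def auc(kill_scores: Sequence[float], flopped: Sequence[int]) -> float:
    """ROC AUC via the rank-sum identity; flopped[i] is 1 if concept i failed."""
    n_pos = sum(flopped)
    n_neg = len(flopped) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUC needs at least one flop and one non-flop")
    ranks = average_ranks(kill_scores)
    pos_rank_sum = sum(r for r, y in zip(ranks, flopped) if y == 1)
    return (pos_rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rho: Pearson correlation computed on average ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        return 0.0  # rho is undefined for constant input; treated here as no signal
    return cov / sqrt(var_x * var_y)


def gate_zero(model_scores, baseline_scores, flopped,
              min_cases=30, min_auc=0.70, min_rho=0.35, min_margin=0.05):
    """Return (passed, details). min_margin is a placeholder, not a source value."""
    if len(flopped) < min_cases:
        return False, {"reason": "corpus below minimum size, no number, no build"}
    model_auc = auc(model_scores, flopped)
    baseline_auc = auc(baseline_scores, flopped)
    rho = spearman(model_scores, flopped)
    hits_target = model_auc >= min_auc or rho >= min_rho
    beats_baseline = (model_auc - baseline_auc) >= min_margin
    details = {"model_auc": model_auc, "baseline_auc": baseline_auc, "rho": rho}
    return hits_target and beats_baseline, details
```

Fixing `min_margin` and the thresholds before the corpus is scored is what makes the margin "pre-registered"; the sketch deliberately refuses to produce a number when fewer than 30 labeled cases exist.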
2. Decision-maker profile (Three Ledgers)
- Public ledger (what we pitch): "An AI engine that stress-tests every Beats creative concept against a panel of synthetic audience personas before launch, catching brand-safety landmines and weak concepts early, scaling concept review beyond what human panels can handle, and giving brand-review boards an auditable validation artifact." The slide-ready innovation story.
- Shadow ledger (what the decision-maker actually optimizes for): Buku, as owner inside an Apple/Beats org, is optimizing for (1) a visible, defensible AI win that demonstrates the team shipped real ML capability (not a wrapper), (2) brand-safety cover — being able to show diligence was done before a campaign went out, and crucially avoiding the career-risk of an AI that approved a culturally tone-deaf campaign, and (3) avoiding the "we built an AI but humans still decide" perception that reads as underwhelming to leadership. The tension is real: the honest product (kill-only triage with a named human decider) directly conflicts with the desire for an impressive hit-predictor — and the accountability-laundering trap (Iyer, Phase 5) means a citable "pass" is a liability dressed as the very cover the shadow ledger wants.
- True ledger (what will actually get built): Given the REFINE verdict and the staged-option sequencing, what actually gets built first is far thinner than the pitch: a single LLM triage classifier (kill / hold / escalate + one named failure mode), a back-test corpus, a synthetic-vs-oracle agreement ledger, and the near-free cross-model correlation experiment — before any cascade, certificate engine, multi-model jury, or regime breaker. Realistically the program either (a) proves the validity number and grows into Architecture B, or (b) fails the number and is correctly killed / replaced by a bought vendor layer. The grand multi-model vision (C) is a research option, not a build.
- Owner / sponsor: Buku.
3. Constraint inventory
| Constraint | Type (hard / soft / assumed) | Notes |
|---|---|---|
| Apple confidentiality — unreleased Beats creative must not leak via the validation process | hard | One pre-launch concept leak attributable to the process is terminal regardless of accuracy (Phase 3.5 A9). Caps what an external panel / corpus may see. |
| Apple Legal / Privacy consent path for external consumers viewing pre-launch CE creative | hard (timeline) | Realistically 6–9 months and frequently ends in "no" (Vinh, Phase 5). Forces fallback to internal/vendor-NDA proxy panels showing abstracted stimuli. |
| Apple internal AI-model / data-governance policy on which foundation models may process proprietary creative | hard | Constrains model choice, vendor terms, and whether two distinct foundation models (Architecture C) are even permissible enterprise-side. |
| No back-test corpus or measured human-synthetic validity number exists today | hard (cold-start) | The keystone gap named independently by three Phase-5 reviewers. The whole program is gated on producing this number; nothing downstream is trustworthy without it. |
| Engineering headcount — fully-burdened ≈ $300K/eng/yr; staffing 1.5–4 eng is a real opportunity cost | hard (budget) | Internal-build NPVs are ~$0 (B = +$0.05M); staffing must clear a build-vs-buy gate (Whitlock, Phase 5). |
| Synthetic-persona validity (does an LLM persona reaction correlate with a real human's?) | assumed | Modeled r ~ Beta(2,3), mean ≈ 0.40 — asserted, never measured (Okafor, Phase 5). The single largest variance driver; only partially mitigatable (A backstops with humans; B/C inherit fully). |
| Panel size statistically too small to gate per-stratum | assumed → soft | A 20–50 panel = 3–5 people per stratum cell; that is anecdote, not measurement (Vinh). Per-stratum "pass every stratum" gating (B) needs a different-sized oracle than aggregate triage (A). |
| Creative-team adoption — a kill-only / "refuses to judge" tool risks compliance theater and non-use | soft | Tomás Beckett (Phase 5): edgy is the product; a hall-monitor tool gets mandated usage and zero real influence. Mitigated by making every verdict overturnable with a named human decider. |
| Over-sensitivity bias systematically removes Beats' edgy brand equity | soft | Principle 9 tuning kills bold-but-fine concepts; route high-creative-confidence kills to human override (Phase 3.5 A7). |
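To illustrate why 3–5 people per stratum cell is "anecdote, not measurement", the sketch below computes a 95% Wilson score interval for an observed reaction rate at different cell sizes. The choice of the Wilson interval and the example 60% reaction rate are assumptions made for illustration; the dossier does not prescribe an interval method.

```python
from math import sqrt


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)


if __name__ == "__main__":
    # Same observed rate (60% negative reactions) at three hypothetical cell sizes.
    for n in (5, 50, 500):
        low, high = wilson_interval(round(0.6 * n), n)
        print(f"n={n:>3}: interval width = {high - low:.2f}")
```

With 3 negative reactions out of 5, the interval spans roughly 0.23 to 0.88, which is too wide to support a per-stratum pass/fail gate. That is the quantitative form of the point that Architecture B needs a larger oracle than aggregate triage does.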
4. Scope boundary
- In scope: Stress-testing Beats creative concepts — campaign ideas, ad copy, product positioning, packaging — against synthetic audience personas before launch, to (a) cheaply kill obvious losers, (b) flag brand-safety / cultural landmines, and (c) know when NOT to opine (escalate to human). The instrument's own self-validation (positive-control harness, agreement ledger) is in scope because the validator must be validated before the candidate (Principle 1). Assembling the back-test corpus and running the gate-zero validity + cross-model correlation experiments is the first in-scope work.
- Out of scope: Predicting hits / forecasting revenue uplift (near-impossible, burns trust — Phase 2.5). Emitting a bare "pass" / a citable approval (accountability-laundering liability — Iyer; the product emits named-risks-with-a-named-human-decider instead). Validating finished/launched creative (this is pre-launch only). Replacing the human decision — a named human always decides and can overturn. The full multi-model adversarial jury, regime circuit-breaker, and version-bound certificate engine are deferred (escalation targets, not v1).
- Recursion budget acknowledged: max 3 per phase. Current recursionCount = 0. The Decision-Gate REFINE loop is the first sanctioned re-pass (back to Phase 4 to harden the value model), well within budget.
5. Assumption log
| # | Assumption | Confidence (0–1) | Validated? |
|---|---|---|---|
| 1 | LLM-simulated persona reactions correlate with real human reactions enough to triage (coarse valence: offensive / boring), even if useless for ranking good options. | 0.45 | No — this is the keystone gap; r ~ Beta(2,3) mean ≈ 0.40 is asserted, not measured. Gate-zero experiment must produce the number. |
| 2 | A clean, outcome-labeled back-test corpus of ≥30 past Beats concepts can be assembled within Q1. | 0.55 | No — modeled Bernoulli p = 0.55; historical labels were never cleanly recorded and outcomes are confounded by media spend/timing. |
| 3 | A demographically-matched real (or NDA proxy) panel can be stood up and legally cleared within ~8 weeks at Apple. | 0.35 | No — Vinh estimates 6–9 months for external consumers; fallback to internal/vendor-NDA proxy is the realistic path. |
| 4 | "Obvious-loser + landmine" detection is the easy 80% an LLM reliably beats a keyword blocklist on. | 0.5 | No — A3 names the risk that a dumb blocklist catches it just as well; only the gate-zero number settles this. |
| 5 | Creative leads and leadership will accept a kill-only / abstaining (non-predictive), overturnable tool as valuable rather than underwhelming. | 0.4 | No — Beckett (adoption) and the "AI but humans still decide" perception risk (Phase 3.5) both push against this; an expectation-management problem set at kickoff. |
| 6 | The positive-control self-validation harness actually detects persona-model drift (a degraded model fails the known-losers). | 0.5 | No — B1/B10 flag that drift can spare a fixed control set; requires ≥90% drift-catch in red-team injection before trusting. |
| 7 | Pre-specified persona strata are stable and independent enough that "pass every stratum" is signal, not noise or one correlated voice. | 0.35 | No — Nadar: LLM persona sycophancy/mode-collapse makes strata correlate far more than real demographics; must measure inter-stratum correlation first. |
| 8 | An internal build is justified over buying an off-the-shelf synthetic-audience vendor license. | 0.3 | No — Whitlock: NPVs ~$0; build-vs-buy comparison was never run and is now a mandated gate before staffing. |
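The Beta(2,3) prior on persona validity cited in assumption 1 can be unpacked with a short sketch that shows how much uncertainty the "mean ≈ 0.40" figure hides. The helper names are hypothetical, and using 0.35 as a cut point is an assumption borrowed from the ρ target in the success criteria purely for illustration; the dossier does not state that r and ρ are the same quantity.

```python
import random
from math import sqrt


def beta_mean(a: float, b: float) -> float:
    """Mean of Beta(a, b)."""
    return a / (a + b)


def beta_sd(a: float, b: float) -> float:
    """Standard deviation of Beta(a, b)."""
    return sqrt(a * b / ((a + b) ** 2 * (a + b + 1)))


def prob_at_least(threshold: float, a: float, b: float,
                  draws: int = 100_000, seed: int = 0) -> float:
    """Monte Carlo estimate of P(r >= threshold) for r ~ Beta(a, b)."""
    rng = random.Random(seed)
    hits = sum(rng.betavariate(a, b) >= threshold for _ in range(draws))
    return hits / draws


if __name__ == "__main__":
    print(f"mean = {beta_mean(2, 3):.2f}, sd = {beta_sd(2, 3):.2f}")
    print(f"P(r >= 0.35) ~ {prob_at_least(0.35, 2, 3):.2f}")
```

The prior has mean 0.40 but a standard deviation of 0.20, and only about 0.56 of its mass sits at or above 0.35. In other words, the asserted prior itself leaves the validity question close to a coin flip, which is consistent with treating the gate-zero measurement as the keystone.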
Gate: success criteria defined, constraints mapped, decision-maker identified → set phaseGates.0-foundation = passed in manifest.json.
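A minimal sketch of recording the gate, assuming manifest.json is a JSON object with a `phaseGates` mapping keyed by phase id (inferred from the dotted path above, not confirmed). The check names and both function names are hypothetical.

```python
import json
from pathlib import Path

# Hypothetical names for the three conditions listed in the gate line.
REQUIRED_CHECKS = (
    "success_criteria_defined",
    "constraints_mapped",
    "decision_maker_identified",
)


def mark_foundation_gate(manifest: dict, checks: dict) -> dict:
    """Return a copy of the manifest with the Phase 0 gate set to 'passed'.

    Raises ValueError if any required check is missing or false, so the
    gate cannot be recorded as passed by accident.
    """
    missing = [name for name in REQUIRED_CHECKS if not checks.get(name)]
    if missing:
        raise ValueError(f"gate not satisfied, missing: {', '.join(missing)}")
    updated = dict(manifest)
    gates = dict(updated.get("phaseGates", {}))
    gates["0-foundation"] = "passed"
    updated["phaseGates"] = gates
    return updated


def update_manifest_file(path: str, checks: dict) -> None:
    """Read manifest.json, apply the gate, and write it back."""
    manifest_path = Path(path)
    manifest = json.loads(manifest_path.read_text(encoding="utf-8"))
    updated = mark_foundation_gate(manifest, checks)
    manifest_path.write_text(json.dumps(updated, indent=2) + "\n", encoding="utf-8")
```

Keeping the pure function separate from the file I/O means the gate logic can be exercised without touching the real manifest.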