Phase 0 · Foundation Assessment — BEA-17 QA Tool / Assistant
The permanent base layer (FAD) for all downstream work. Establishes what "done" means, who actually decides, what cannot move, where the boundary sits, and what we are still guessing — before any decomposition, research, or architecture. Everything in Phases 1–5 (three architectures A/B/C, the registry/L10n/privacy dependency map, the gate paradox, the A=74 / B=58 / C=23 executability ranking, and the Refine verdict) is consistent with — and traces back to — this foundation.
Project: automate web QA across four independent correctness domains — link/CID integrity, language/locale rendering, accessibility (headings/alt/aria), and structural markup — over a large, live, frequently-changing apple.com-class surface. Category: Production Automation. Build type: ui-forward-app. Owner/sponsor: Sharon Chan.
1. Success criteria (measurable, time-bound)
- [ ] Criterion 1: Deterministic coverage of all four domains on a real surface, honestly labeled. Within 12 weeks of build start, the tool runs link/CID-reachability, locale-render, automated-a11y-subset (axe-core-class), and heading-hierarchy passes across a curated ≥500-page apple.com-class pilot subset, and every report explicitly labels accessibility output as "automated subset, not full compliance" and tracks unverified-unknown pages as a distinct state (never silently counted as pass). Metric: 4/4 domains executing on the pilot subset with both honesty-of-coverage labels present in 100% of generated reports.
- [ ] Criterion 2: Trust-grade signal-to-noise. Within 16 weeks, median findings-per-page after default advisory-tier suppression is < 10 and the rate of findings dismissed-without-action by pilot users is < 30% over a 4-week observation window. (Directly targets the Phase 3.5 #3 / Theo / Sam false-positive trust-erosion failure mode.)
- [ ] Criterion 3 — Durable, in-workflow adoption (not side-of-desk). By 6 months post-pilot, ≥ 4 of 6 onboarded teams trigger or consume the tool at least weekly AND the tool is referenced as a step in ≥ 1 documented release runbook/checklist (not merely generated and ignored). Metric: weekly-active-run-per-onboarded-team ≥ 0.66, plus ≥ 1 runbook citation. (Targets the Sandy/Sam "lives beside the workflow" non-adoption disease.)
- [ ] Criterion 4 — Verifier integrity is a hard gate, not a hope. From first release onward, scanner/dependency versions are pinned and a synthetic known-good / known-bad reference suite runs on 100% of dependency or rule-set bumps before any promotion to production grading; zero unreviewed rule changes reach the compliance-facing report. (Operationalizes Aileen's "verify the verifier" paradigm requirement.)
- [ ] Criterion 5 — External dependency truth established early. By end of discovery week 2, the existence/owner/queryable-API status of the canonical CID/link registry and the L10n source-string store are each confirmed or formally declared absent — and any claim ("CID verification", "translation verification") is renamed to its honest fallback ("reachability", "renders-without-mojibake") if the source does not exist. Metric: signed-off dependency status for both sources before any value claim ships.
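
The thresholds in Criteria 2 and 3 are simple enough to check mechanically. The sketch below is one possible reading, not a specified implementation: the `PageFindings` shape and both function names are hypothetical, and it assumes the dismissal rate is pooled across all pilot pages (the criterion does not say whether it is pooled or averaged per page).

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class PageFindings:
    """Per-page counts after default advisory-tier suppression (hypothetical shape)."""
    url: str
    shown: int      # findings still shown to the user
    dismissed: int  # of those, dismissed without action


def criterion_2_met(pages: list[PageFindings]) -> bool:
    """Median findings-per-page < 10 and dismissed-without-action rate < 30%."""
    if not pages:
        return False  # no observations, so the criterion cannot be shown to hold
    median_shown = median(p.shown for p in pages)
    total_shown = sum(p.shown for p in pages)
    total_dismissed = sum(p.dismissed for p in pages)
    # Assumption: the rate is pooled over all findings, not averaged per page.
    dismiss_rate = total_dismissed / total_shown if total_shown else 0.0
    return median_shown < 10 and dismiss_rate < 0.30


def criterion_3_met(weekly_active_teams: int, onboarded_teams: int,
                    runbook_citations: int) -> bool:
    """Weekly-active ratio >= 0.66 plus at least one runbook citation."""
    if onboarded_teams == 0:
        return False
    return (weekly_active_teams / onboarded_teams >= 0.66
            and runbook_citations >= 1)
```

With six onboarded teams, the 0.66 threshold is equivalent to the "4 of 6" wording: three active teams give 0.50 and fail, four give roughly 0.667 and pass.
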
2. Decision-maker profile (Three Ledgers)
- Public ledger (what we pitch): "An AI-assisted QA Assistant that automatically verifies apple.com pages for broken links/CIDs, correct language rendering, and accessibility before publish — catching defects engineers and content publishers miss, at scale." The word "Assistant" headlines the pitch and implies an intelligent, conversational, semantic capability (Architecture C).
- Shadow ledger (what the decision-maker actually optimizes for): A credible, low-risk, shippable-this-cycle win that demonstrably reduces pre-publish defect escapes and manual QA toil without triggering a privacy/legal review, an unbounded inference bill, or an a11y-org turf fight — and that survives a budget review and the sponsor's own continued tenure. In practice the sponsor optimizes for "a thing that ships, gets adopted, and doesn't blow up in a compliance complaint," i.e. the deterministic A floor, even though the pitch leans on the "Assistant" framing.
- True ledger (what will actually get built): Architecture A — the "Checklist Runner + Report" deterministic engine (link/CID-reachability + locale-render + structural-markup + automated-a11y-subset), hardened with the two mandatory honesty-of-coverage refinements and verify-the-verifier as a hard gate. B's stateful diff-gated pipeline is staged behind A and unlocked only on a signed pipeline-owner + baseline-owner commitment; C's LLM/agentic layer is held as a deferred option, exercised only after an offline precision/recall bar is cleared and the privacy data-path is approved. The gap between the public "Assistant" ledger and this true ledger is itself the project's single largest theater risk and the main reason A lands at Refine (74), not Execute.
- Owner / sponsor: Sharon Chan. Single champion; the project's org commitments (pipeline integration, baseline ownership, a11y-org acceptance) currently ride on her relationships rather than written agreements — flagged downstream as B's catastrophic key-person-departure failure mode.
3. Constraint inventory
| Constraint | Type (hard / soft / assumed) | Notes |
|---|---|---|
| Canonical CID/link registry with a queryable API and a real owner must exist for "CID verification" to mean anything beyond a 404 check | assumed | The #1 cross-architecture single point of failure. If absent, the distinctive cross-domain value collapses to a commodity link checker in all three architectures. Must be validated in discovery week 1–2 (Criterion 5). |
| L10n source-of-truth strings must be exportable for true language/locale verification | assumed | If inaccessible, the language-rendering domain degrades from "translation correctness" to "renders without mojibake / missing glyph" — one of four headline domains becomes hollow. |
| Apple privacy / data-handling policy on routing page content (esp. unreleased pages) through any LLM inference path | hard | A non-technical kill switch for Architecture C. No amount of model quality overrides a legal "no". Gates the entire semantic layer; irrelevant to A/B which run no inference. |
| An a11y tooling standard (sanctioned scanner / internal axe fork / contracted vendor) likely already exists at Apple | assumed | If so, BEA-17 must be an aggregator over the sanctioned tool's output, not a competing scanner, or it fails procurement/standardization review and its findings won't count as compliance evidence. |
| Headcount / budget — small internal-tooling team | hard | A is sized at ~1.5 FTE over ~9 weeks; B at ~2.5 FTE over ~15 weeks; C at ~4 FTE incl. a scarce ML-eval specialist over ~6–9 months. C's ML-eval skill is the least bus-factor-resilient dependency. |
| A blocking publish gate requires CI/publish-pipeline-owner buy-in (organizational, not technical) | hard (for B) | B's defining feature cannot land without a written integration commitment from a pipeline owner; without it B degrades to a non-gating side dashboard barely distinguishable from A. |
| Output must not overstate what was actually verified (honesty-of-coverage) | hard | Unanimous validator requirement: "automated subset" labeling + explicit unverified-unknown tracking. A clean report that hides un-crawled pages is unfalsifiable assurance and a liability when a complaint lands. |
| apple.com-scale, CDN-fronted, personalized / A/B-tested surface | hard | Synchronous single-threaded fetch is a non-starter (it breaks down at roughly 30k pages); the crawl requires bounded concurrency and normalized-DOM hashing keyed on content ID, not raw bytes / URL. Real engineering, not a config flag. |
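
The honesty-of-coverage constraint in the table above can be made concrete with a three-state page status. The following is a minimal sketch under stated assumptions: `PageStatus`, `summarize`, and the summary field names are hypothetical, and the only values taken from this document are the "unverified-unknown" state and the "automated subset, not full compliance" label.

```python
from collections import Counter
from enum import Enum


class PageStatus(Enum):
    PASS = "pass"
    FAIL = "fail"
    UNVERIFIED_UNKNOWN = "unverified-unknown"


# Label text required on every report by Criterion 1.
A11Y_LABEL = "automated subset, not full compliance"


def summarize(expected_pages: list[str],
              results: dict[str, PageStatus]) -> dict:
    """Count statuses over the expected page set, never over the crawled set.

    A page with no recorded result is counted as unverified-unknown,
    so an un-crawled page can never be silently reported as a pass.
    """
    counts = Counter(
        results.get(url, PageStatus.UNVERIFIED_UNKNOWN) for url in expected_pages
    )
    return {
        "a11y_label": A11Y_LABEL,
        "total": len(expected_pages),
        "pass": counts[PageStatus.PASS],
        "fail": counts[PageStatus.FAIL],
        "unverified_unknown": counts[PageStatus.UNVERIFIED_UNKNOWN],
    }
```

The design choice worth noting is that the denominator is the expected page list, so the three counts always sum to `total` and missing pages stay visible in the report.
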
4. Scope boundary
- In scope: Automated pre-publish/ongoing verification across four correctness domains: (1) link/CID integrity (reachability + registry resolution where a registry exists), (2) language/locale rendering, (3) accessibility headings/alt/aria (explicitly the automated subset covered by axe-core-class checks), (4) structural markup / heading-hierarchy. A ui-forward-app (Vue / Vibe Elements) front end presents a severity-tiered, traceable findings report. Honesty-of-coverage (unverified-unknown state + "automated subset" labeling) and verify-the-verifier (version pinning + synthetic reference suite) are in scope for any shipped architecture.
- Out of scope: LLM/agentic semantic judgment (meaningful-alt-text, translation-fidelity, label/section coherence), autonomous adaptive scheduling, and the conversational "assistant" interface, all deferred to Architecture C, which is an option to exercise later, not a Phase-0 commitment. Self-healing / auto-PR'd fixes and per-page "airworthiness certificates" (Phase 2.5 "impossible but worth exploring") are explicitly out. Replacing or competing with any sanctioned enterprise a11y scanner is out: position as aggregator, not replacement. Full WCAG/EN 301 549 compliance certification is out (the tool verifies an automated subset and must say so).
- Recursion budget acknowledged: max 3 per phase. (Phase 5 validation explicitly required no recursion — the blind expert panel reproduced the Phase 3.5 / Phase 4 conclusions cold, the strongest possible confirmation — so the project proceeded straight to the Decision Gate without consuming recursion budget.)
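
The verify-the-verifier item listed in scope above (a synthetic known-good / known-bad reference suite run before promotion) could look roughly like the sketch below. Everything named here is illustrative: `ReferenceCase`, `toy_scan`, and `may_promote` are hypothetical, the toy scanner stands in for whatever pinned scanner is actually used, and version pinning itself (for example via a lockfile) is not shown.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class ReferenceCase:
    """One synthetic page with a known verdict (hypothetical fixture shape)."""
    name: str
    html: str
    expect_findings: bool  # True = known-bad, False = known-good


def toy_scan(html: str) -> list[str]:
    """Stand-in scanner: flags markup with an <img> and no alt attribute."""
    return ["img-missing-alt"] if "<img" in html and "alt=" not in html else []


CASES = [
    ReferenceCase("img-with-alt", '<img src="logo.png" alt="Logo">', False),
    ReferenceCase("img-without-alt", '<img src="logo.png">', True),
]


def reference_suite_failures(scan: Callable[[str], list[str]],
                             cases: list[ReferenceCase]) -> list[str]:
    """Names of cases where the scanner's verdict disagrees with the known one."""
    return [c.name for c in cases if bool(scan(c.html)) != c.expect_findings]


def may_promote(scan: Callable[[str], list[str]],
                cases: list[ReferenceCase], reviewed: bool) -> bool:
    """A dependency or rule-set bump is promotable only if it was reviewed
    and the reference suite reports no disagreement."""
    return reviewed and not reference_suite_failures(scan, cases)
```

A scanner that silently stops reporting (for instance after a rule-set bump disables a rule) fails the known-bad case, which is the regression this gate is meant to catch before it reaches the compliance-facing report.
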
5. Assumption log
| # | Assumption | Confidence (0–1) | Validated? |
|---|---|---|---|
| 1 | A queryable canonical CID/link registry with a real owner exists and is accessible at Apple | 0.40 | No — must validate in discovery week 1–2; named the #1 systemic SPOF. Likely (a) an unmaintained CMS export or (b) several competing systems. |
| 2 | L10n source strings are exportable so language/locale verification means translation correctness, not just "no mojibake" | 0.45 | No — unconfirmed; L10n source-of-truth likely lives in a separate system with no clean export path. |
| 3 | An LLM can judge alt-text quality / translation fidelity at a precision high enough to keep the human review queue small | 0.35 | No — load-bearing for C; requires a labeled eval set + published precision/recall bar built before any production use. Constant failure mode: 55% production precision after an easy demo set. |
| 4 | Apple privacy/legal will approve routing page content (incl. unreleased pages) through some inference path | 0.30 | No — a hard, non-technical gate; resolve before writing a single C prompt. If "no," C is dead on arrival regardless of model quality. |
| 5 | Teams will adopt the tool in their workflow rather than letting reports go generated-but-unread | 0.40 | No — Sandy buried six such tools; durable usage needs a required, in-pipeline step. Highest-leverage assumption for lifting A's adoption/Probability factor. |
| 6 | A CI/publish-pipeline owner will accept a blocking external gate (in writing) | 0.45 | No — organizational, not technical; without a signed commitment B's defining value cannot land and it reverts to a side dashboard. |
| 7 | Semantic QA (meaningful alt, translation fidelity) is actually where material human QA hours are spent at Apple | 0.45 | No — asserted by C's premise; needs real QA-team time-allocation data, or C optimizes a non-bottleneck beautifully. |
| 8 | apple.com page identity is resolvable via a stable content key (not a volatile URL/byte hash) on a personalized/A-B/edge-rendered surface | 0.55 | No — solvable with normalized-DOM hashing keyed on content ID, but real engineering; if skipped, B's diff thrashes and is strictly worse than A. |
| 9 | No sanctioned enterprise a11y scanner already owns the compliance system-of-record role this would duplicate | 0.40 | No — if one exists, BEA-17 must aggregate over it or be blocked at standardization review. |
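
Assumption 8 turns on normalized-DOM hashing keyed on content ID. The sketch below shows one way that idea could work using only the Python standard library; it is an illustration, not the project's design. The `VOLATILE_ATTRS` set is hypothetical (the real list of per-request attributes on the target surface is unknown), and the content IDs are assumed to come from some upstream source that this document has not yet confirmed.

```python
import hashlib
from html.parser import HTMLParser

# Hypothetical examples of attributes that vary per request or experiment arm.
VOLATILE_ATTRS = {"nonce", "data-session", "data-experiment"}
SKIPPED_TAGS = {"script", "style"}


class _Normalizer(HTMLParser):
    """Reduces markup to a stable token stream: sorted attributes,
    collapsed whitespace, no script/style bodies, no volatile attributes."""

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self.tokens: list[str] = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self._skip_depth += 1
            return
        if self._skip_depth:
            return
        kept = sorted((k, v or "") for k, v in attrs if k not in VOLATILE_ATTRS)
        self.tokens.append("<" + tag + "".join(f' {k}="{v}"' for k, v in kept) + ">")

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS:
            self._skip_depth = max(0, self._skip_depth - 1)
            return
        if not self._skip_depth:
            self.tokens.append(f"</{tag}>")

    def handle_data(self, data):
        text = " ".join(data.split())
        if text and not self._skip_depth:
            self.tokens.append(text)


def normalized_dom_hash(html: str) -> str:
    """SHA-256 over the normalized token stream rather than the raw bytes."""
    parser = _Normalizer()
    parser.feed(html)
    parser.close()
    return hashlib.sha256("\n".join(parser.tokens).encode("utf-8")).hexdigest()


def changed_content_ids(baseline: dict[str, str],
                        current_pages: dict[str, str]) -> set[str]:
    """Content IDs that are new or whose normalized hash differs from baseline.

    baseline maps content ID to stored hash; current_pages maps content ID to HTML.
    """
    return {cid for cid, html in current_pages.items()
            if baseline.get(cid) != normalized_dom_hash(html)}
```

Two renders of the same page that differ only in attribute order, whitespace, a nonce, or inline script content hash identically here, which is the property a diff-gated pipeline would need to avoid thrashing.
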
Gate: success criteria defined (5 measurable, time-bound), constraints mapped (8, typed hard/soft/assumed), decision-maker identified (Sharon Chan, Three Ledgers) → set phaseGates.0-foundation = passed in manifest.json.
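
Recording the gate could be done by hand or with a small helper such as the sketch below. The manifest schema is not given in this document, so the sketch assumes `phaseGates` is a top-level JSON object keyed by gate name with the string value "passed"; the function name `mark_gate_passed` is hypothetical.

```python
import json
from pathlib import Path


def mark_gate_passed(manifest_path: Path, gate: str = "0-foundation") -> dict:
    """Set phaseGates[gate] = "passed", preserving every other manifest key.

    Assumes the manifest is a JSON object; creates it if the file is missing.
    """
    if manifest_path.exists():
        manifest = json.loads(manifest_path.read_text(encoding="utf-8"))
    else:
        manifest = {}
    manifest.setdefault("phaseGates", {})[gate] = "passed"
    manifest_path.write_text(json.dumps(manifest, indent=2) + "\n", encoding="utf-8")
    return manifest
```

A call such as `mark_gate_passed(Path("manifest.json"))` would update the file in place and return the resulting object.
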