Live Sandbox Validation

This playbook is the operator-friendly companion to flow/specs/flow-acceptance-sandbox-contract.md.

Use the disposable sandbox repository code4focus/test-salp unless the canonical contract is updated to name a different sandbox.

Treat code4focus/test-salp as clean sandbox infra on its default branch. It should keep the reviewer workflow under .github/**, ignore rules for generated .salp/flow/state/ and .salp/flow/tmp/, and minimal repo metadata, but it should not track flow/**. The current source repository remains authoritative for the bootstrapped repo pack and the installed wap-flow binary during the run.

The canonical live gate is one fresh compound proof sample, not a four-row matrix.

That sample starts from an untouched clean-infra sandbox clone and proves the clone is flow-not-configured before bootstrap. It then bootstraps the current repo pack through wap-flow repo-pack bootstrap, carries one tiny seeded app through a bounded local-private development phase, and carries the same seeded sample family through a bounded GitHub-visible issue/branch/PR/review/merge phase.

The canonical sample proves:

  • installed repo-pack bootstrap on a disposable consumer repo
  • local-private task lifecycle, worktree flow, and blocking commit/push gates
  • GitHub-visible issue, branch, PR, review, repair, finalize, merge, and reconcile flow

It does not prove local packaging health, general product behavior, or arbitrary large-repo cost. Those stay in tools/flow-release-check.

As of April 26, 2026, canonical live-sandbox status is derived from the generated index at .salp/flow/tmp/live-sandbox/index.json. Treat the latest flow/live-sandbox.yml entry in that index as the current local proof verdict; a latest verdict of pass means the earlier April 23 missing-proof blocker is closed for that source checkout.
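Reading the latest canonical verdict from the index can be sketched as follows. This is a minimal illustration, not the runner's actual logic: it assumes the index is a JSON array in append order and that each entry carries hypothetical `config` and `verdict` keys. Adapt the key names to the real generated file.

```python
import json
from typing import Optional


def latest_verdict(entries: list[dict], config: str) -> Optional[str]:
    """Return the verdict of the most recently appended entry for `config`, or None."""
    for entry in reversed(entries):
        # "config" and "verdict" are assumed key names, not a documented schema.
        if entry.get("config") == config:
            return entry.get("verdict")
    return None


# Illustrative in-memory index; a real check would json.load() the generated
# .salp/flow/tmp/live-sandbox/index.json instead.
sample_index = json.loads("""
[
  {"config": "flow/live-sandbox.yml", "verdict": "fail"},
  {"config": "flow/live-sandbox.mock.yml", "verdict": "pass"},
  {"config": "flow/live-sandbox.yml", "verdict": "pass"}
]
""")

print(latest_verdict(sample_index, "flow/live-sandbox.yml"))
```

Filtering by config first matters because mock and Codex entries share the same index but never count as the canonical verdict.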

The durable sandbox-repo facts are currently aligned with this contract:

  • the code4focus/test-salp default branch is infra-only rather than a source-repo copy of flow/**
  • the default branch tracks no flow/**
  • the default branch ignores generated .salp/flow/state/ and .salp/flow/tmp/
  • the default branch currently exposes the GitHub reviewer workflow on main with the required issue_comment trigger
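The "tracks no flow/**" fact can be spot-checked from a fresh clone. The sketch below is one possible check, not the runner's implementation: the helper names are hypothetical, and it assumes the tracked paths come from `git ls-files`.

```python
import subprocess
from pathlib import PurePosixPath


def list_tracked(repo_dir: str) -> list[str]:
    """List the paths git tracks in `repo_dir` (assumes git is installed)."""
    result = subprocess.run(
        ["git", "-C", repo_dir, "ls-files"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.splitlines()


def tracked_flow_paths(tracked: list[str]) -> list[str]:
    """Return tracked paths under the top-level flow/ directory."""
    return [p for p in tracked if PurePosixPath(p).parts[:1] == ("flow",)]


# Illustrative input; .salp/flow/... is not top-level flow/ and is not flagged.
example = [".github/workflows/review.yml", "README.md", "flow/specs/a.md"]
print(tracked_flow_paths(example))
```

An empty result for the untouched clone is what the clean-infra contract expects; any hit means the sandbox default branch has drifted.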

Treat operator-local prerequisites as verify-before-run checks, not as repo-tracked done state. GitHub auth, installed CLIs, and API keys must still be confirmed by the operator before each live run.
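A small preflight helper can make the installed-CLI part of that check repeatable. This is a sketch under stated assumptions: the tool names passed in are the operator's choice (`git` and `wap-flow` here are only examples), and GitHub auth and API keys still need their own confirmation.

```python
import shutil
from typing import Callable, Iterable, Optional


def missing_tools(
    names: Iterable[str],
    which: Callable[[str], Optional[str]] = shutil.which,
) -> list[str]:
    """Return the tool names that cannot be found on PATH."""
    return [name for name in names if which(name) is None]


if __name__ == "__main__":
    # Example tool list; extend it with whatever CLIs the live run needs.
    missing = missing_tools(["git", "wap-flow"])
    if missing:
        print("missing before live run:", ", ".join(missing))
```

The `which` parameter is injectable so the helper can be tested without depending on the local machine.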

This dated status block is maintainer guidance. Update it whenever the current blocker or the latest canonical live-sandbox verdict changes.

| Config | Intended use | What it can prove | What it cannot prove |
| --- | --- | --- | --- |
| flow/live-sandbox.yml | Canonical GA gate | One real-provider compound sample against the existing Kimi-backed Claude sandbox reviewer path | Anything outside the disposable sandbox scope |
| flow/live-sandbox.mock.yml | Zero-cost frequent tier | The same compound runner model and lifecycle choreography without real reviewer spend | Real provider behavior or latency |
| flow/live-sandbox.codex.yml | Codex parity canary | A supplemental GitHub-visible slice of the same compound sample model | The canonical Kimi-backed GA proof or the full local-private lifecycle |

Only flow/live-sandbox.yml can satisfy the GitHub-backed live proof requirement for Repo-Pack GA.

flowchart TD
A["clean-infra sandbox clone"] --> B["repo-shape proof + doctor fail"]
B --> C["repo-pack bootstrap + doctor"]
C --> D["local task + worktree + blocked commit/push"]
D --> E["local review + strict local finalize"]
E --> F["GitHub issue + branch + PR"]
F --> G["review findings loop"]
G --> H["strict PR finalize + merge + reconcile"]
H --> I["compound proof passes"]

The canonical config requires all of these components in one run:

  • flow_not_configured_bootstrap
  • installed_repo_pack_bootstrap
  • local_task_lifecycle
  • worktree_happy_path
  • blocked_commit_before_local_review
  • successful_commit_after_local_review
  • blocked_push_before_second_local_review
  • successful_push_after_second_local_review
  • local_acceptance_finalize
  • github_issue_branch_lifecycle
  • pr_creation_and_binding
  • github_review_loop
  • pr_acceptance_finalize
  • pr_merge
  • pr_reconcile

Use --rows only as a compatibility selector for component ids. The runner resolves prerequisite components automatically.
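Automatic prerequisite resolution can be pictured as a depth-first walk over a dependency map. The sketch below is illustrative only: the actual prerequisite graph is not given in this playbook, so the `prerequisites` mapping in the example is hypothetical.

```python
def resolve_components(selected, prerequisites):
    """Return selected component ids plus their prerequisites, prerequisites first."""
    ordered = []
    done = set()

    def visit(component_id, ancestors):
        if component_id in done:
            return
        if component_id in ancestors:
            raise ValueError(f"prerequisite cycle at {component_id}")
        for dep in prerequisites.get(component_id, ()):
            visit(dep, ancestors + (component_id,))
        done.add(component_id)
        ordered.append(component_id)

    for component_id in selected:
        visit(component_id, ())
    return ordered


# Hypothetical dependency edges, for illustration only.
example_prereqs = {
    "pr_merge": ["pr_acceptance_finalize"],
    "pr_acceptance_finalize": ["github_review_loop"],
}
print(resolve_components(["pr_merge"], example_prereqs))
```

Selecting a single late component therefore still pulls in everything it depends on, which is why --rows cannot be used to skip earlier phases.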

Before the live run:

  • start from the current source repository checkout you want to validate
  • confirm the sandbox repo default branch already carries the workflow that handles the GitHub-visible reviewer path
  • confirm the sandbox repo default branch tracks no flow/**
  • confirm the sandbox reviewer path is still the existing Claude sandbox reviewer backed by Kimi
  • keep the sample app small and bounded by the template catalog
  • keep the helper bounded to kimi-for-coding, medium, and the configured file/line/retry/timeout caps
  • keep the sandbox local claude_local review lane bounded to kimi-for-coding, medium, and 1200s
  • keep remote cleanup at close unless you are debugging a failure

tools/flow-live-sandbox-check --config flow/live-sandbox.yml runs the canonical gate.

At a high level it will:

  1. Clone code4focus/test-salp.
  2. Prove the untouched clone has zero tracked flow/**.
  3. Prove doctor --json --strict fails before bootstrap.
  4. Install the current wap-flow binary from the source repository.
  5. Run wap-flow repo-pack bootstrap --source <source-root> --target <clone> --json.
  6. Prove the bootstrap report copied the repo pack, installed hooks, pinned the binary, and passed strict doctor.
  7. Materialize one seeded tiny app template.
  8. Run the local-private lifecycle:
    • task local init with a run-specific key to avoid colliding with branches preserved from failed runs
    • start --task ...
    • plan record
    • worktree create/edit/promote/select/reconcile
    • authoritative-branch follow-up edit
    • blocked commit
    • accept review --agent claude_local
    • successful commit
    • blocked push
    • second local review
    • successful push
    • accept check --strict
    • accept finalize
  9. Run the GitHub-visible lifecycle:
    • create parent and task issues
    • create the issue branch
    • start --issue ...
    • plan record
    • commit and push one bounded reviewer-visible defect
    • create and bind the PR
    • complete
    • initialize acceptance
    • request GitHub review through the existing Kimi-backed Claude sandbox reviewer path
    • wait for claim and result
    • sync findings
    • apply the repair
    • resolve findings
    • rerequest review
    • wait and sync again
    • accept check --strict
    • accept finalize
    • merge the PR
    • pr reconcile --strict

Every run must emit a comparable proof bundle under .salp/flow/tmp/live-sandbox/<run-id>/.

Every run also appends to .salp/flow/tmp/live-sandbox/index.json and refreshes .salp/flow/tmp/live-sandbox/index.md so release readiness can cite a rolling verdict history instead of only latest.*.

The bundle must record:

  • template id, variant id, seed, and task prompt
  • fresh-clone repo shape for each executed lane
  • local task id, worktree ids, reviewed commits, and changed files
  • GitHub issue ids, PR id, trigger timestamps, and remote artifact ids
  • reviewer provenance for the local lane, GitHub lane, and helper
  • per-component status plus final verdict
  • cleanup outcome
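A completeness check over the bundle might look like the sketch below. The section names are hypothetical stand-ins for the groups listed above, not the bundle's real schema; map them to the actual keys before relying on it.

```python
# Hypothetical section names, one per group the bundle must record.
REQUIRED_SECTIONS = (
    "template",             # template id, variant id, seed, task prompt
    "repo_shape",           # fresh-clone repo shape per executed lane
    "local",                # local task id, worktree ids, reviewed commits, changed files
    "github",               # issue ids, PR id, trigger timestamps, remote artifact ids
    "reviewer_provenance",  # local lane, GitHub lane, helper
    "components",           # per-component status
    "verdict",              # final verdict
    "cleanup",              # cleanup outcome
)


def missing_sections(bundle: dict) -> list[str]:
    """Return required sections that are absent or empty in the bundle."""
    return [name for name in REQUIRED_SECTIONS if not bundle.get(name)]


partial_bundle = {"template": {"id": "example"}, "verdict": "pass", "cleanup": ""}
print(missing_sections(partial_bundle))
```

Treating empty values as missing keeps a bundle from looking comparable when a lane silently recorded nothing.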

A pass verdict is valid only when every required component passes in the same fresh run.

  • A single passing component is not enough for the canonical config.
  • The mock and Codex configs are support signals only.
  • If the sandbox environment is ambiguous or broken, fix the environment before expanding coverage.
  • Keep remote artifacts only when the run fails or when --cleanup keep is explicitly requested.
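The all-or-nothing rule can be expressed as a short check over per-component statuses. This is a sketch, not the runner's code: the component ids come from the canonical list in this playbook, while the `"pass"` and `"fail"` status strings and the dictionary shape are assumptions.

```python
REQUIRED_COMPONENTS = (
    "flow_not_configured_bootstrap",
    "installed_repo_pack_bootstrap",
    "local_task_lifecycle",
    "worktree_happy_path",
    "blocked_commit_before_local_review",
    "successful_commit_after_local_review",
    "blocked_push_before_second_local_review",
    "successful_push_after_second_local_review",
    "local_acceptance_finalize",
    "github_issue_branch_lifecycle",
    "pr_creation_and_binding",
    "github_review_loop",
    "pr_acceptance_finalize",
    "pr_merge",
    "pr_reconcile",
)


def final_verdict(statuses: dict[str, str]) -> str:
    """Return "pass" only if every required component reported "pass" in this run."""
    all_passed = all(statuses.get(cid) == "pass" for cid in REQUIRED_COMPONENTS)
    return "pass" if all_passed else "fail"


all_pass = {cid: "pass" for cid in REQUIRED_COMPONENTS}
print(final_verdict(all_pass))
```

A component that is missing from the statuses is treated the same as a failed one, so a partial run can never be reported as a canonical pass.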