2026-06-11 · harness

Chapter 14: Versioning, rollback, and SLA

Version prompts and models, roll out changes safely with a canary, and define an SLA for a nondeterministic system.

Xiaoman · The Hall of Rings

Xiaoman grows version by version. Today you face a question: can you bring it back to who it was?

Draft chapter. First cut to prove the format; it will be hardened before it is indexed.

What you’ll build

A release process for the PR reviewer. A deployed agent is never finished: you keep editing the system prompt and adding skills, and the provider keeps shipping new model versions on you. Without some discipline, every one of those edits is an undocumented change to a production system, and the day a review goes wrong you have no way to say what changed or how to undo it.

In this chapter you build the machinery that makes change safe: a single versioned config that pins the prompt, skills, and model together; a regression gate that blocks a promote unless the candidate matches or beats the incumbent on your eval suite, within a small noise tolerance; a canary that exposes the new version to a small slice of traffic; a rollback that is one flag flip away; and an SLA written for a system whose output is nondeterministic by design.

Prerequisites

  • A deployed agent from Chapter 13, fronted by something you can point at one config or another.
  • The eval suite from Part 2, runnable in CI and producing a numeric score per release.
  • Structured logs that stamp every review with the config version that produced it.

Steps

1. Pin the moving parts in one versioned config

The reviewer’s behavior comes from three things, and each of them changes on its own: the system prompt, the set of skills it loads, and the model id. If you only version the prompt, a silent model update can change behavior with no diff in your repo. So treat all three as one release artifact, identified by a single semantic version, and store it next to the code.

# config/releases/pr-reviewer-2.4.0.yaml
version: "2.4.0"
released: "2026-06-09"
model: "claude-sonnet-4-6"        # exact id, not an alias like "latest"
prompt: "prompts/[email protected]" # content-addressed by git hash
skills:
  - "skills/[email protected]"
  - "skills/[email protected]"
params:
  max_output_tokens: 4000
  temperature: 0                   # determinism where the API allows it
notes: "Tightened the security-lint skill to flag hardcoded secrets."

The two details that matter most: pin the model to an exact id, never an alias the provider can repoint, and pin the prompt by content (a git hash) instead of a filename, so “version 2.4.0” reproduces byte for byte. In Claude Code itself you do the same thing with the model field in settings.json and the ANTHROPIC_MODEL env var; pin those rather than relying on the rolling default (see the official Claude Code docs).
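
One way to check that a pinned prompt still reproduces byte for byte is to recompute the file's git blob hash and compare it with the pin. The sketch below assumes the part after "@" in a pin is a (possibly abbreviated) SHA-1 git blob hash; if your pins hold commit hashes instead, this check does not apply. The helper names git_blob_sha1 and matches_pin are hypothetical, not part of any SDK.

```python
import hashlib

def git_blob_sha1(data: bytes) -> str:
    """Return the hash git assigns to this content as a blob object (SHA-1 repos)."""
    # git hashes the header "blob <size>\0" followed by the raw bytes
    header = b"blob " + str(len(data)).encode() + b"\0"
    return hashlib.sha1(header + data).hexdigest()

def matches_pin(pin: str, data: bytes) -> bool:
    """Check content against a 'path@hash' pin; the hash may be abbreviated."""
    if "@" not in pin:
        return False                      # no hash recorded, nothing to verify
    pinned = pin.rsplit("@", 1)[1].lower()
    # require at least 7 hex chars so a very short prefix cannot match by accident
    return len(pinned) >= 7 and git_blob_sha1(data).startswith(pinned)
```

Running a check like this in CI, before the release file is accepted, would turn "the prompt changed but the pin did not" into a build failure instead of a silent drift.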

Feed that config to the SDK: each pinned field maps to a ClaudeAgentOptions field, so “which version runs” is just “which config loads.” Rollback is nothing more than pointing at a different release file and rebuilding the options.

import asyncio
import yaml
from claude_agent_sdk import query, ClaudeAgentOptions

def options_from_release(path):
    with open(path) as f:
        cfg = yaml.safe_load(f)                       # load the pinned release config
    with open(cfg["prompt"]) as f:
        system_prompt = f.read()                      # prompt pinned by hash
    return ClaudeAgentOptions(
        model=cfg["model"],                           # exact id, not an alias
        system_prompt=system_prompt,
        skills=[s.split("@")[0] for s in cfg["skills"]],
        allowed_tools=["Read"],
        max_turns=6,
    )

async def review(pr_diff):
    # "async for" has to run inside a coroutine, so wrap the call
    options = options_from_release("config/releases/pr-reviewer-2.4.0.yaml")
    async for message in query(prompt=f"Review this diff:\n{pr_diff}", options=options):
        handle(message)                               # handle() is your own message handler

# asyncio.run(review(pr_diff))                        # pr_diff: the diff text to review

import { query } from "@anthropic-ai/claude-agent-sdk";
import * as fs from "fs";
import * as yaml from "js-yaml";

function optionsFromRelease(path) {
  const cfg = yaml.load(fs.readFileSync(path, "utf8"));   // load the pinned release config
  return {
    model: cfg.model,                                     // exact id, not an alias
    systemPrompt: fs.readFileSync(cfg.prompt, "utf8"),    // prompt pinned by hash
    skills: cfg.skills.map((s) => s.split("@")[0]),
    allowedTools: ["Read"],
    maxTurns: 6,
  };
}

async function review(prDiff) {
  const options = optionsFromRelease("config/releases/pr-reviewer-2.4.0.yaml");
  for await (const message of query({ prompt: `Review this diff:\n${prDiff}`, options })) {
    handle(message);                                      // handle() is your own message handler
  }
}

2. Gate the promote on a regression check

A version number is only trustworthy if nothing gets that number without passing a bar. The bar is your eval suite from Part 2. Run the candidate config against the same fixtures as the incumbent, and don’t promote unless the candidate meets or beats it. This is the difference between “I think it’s better” and “it scored 0.91 against 0.88 on the same 60 PRs.”

# promote_gate.py  (illustrative; wire into CI)
def gate(candidate, incumbent, suite):
    cand = run_evals(candidate, suite)   # {"pass_rate":0.91,"p95_ms":4200,"false_flag":0.04}
    base = run_evals(incumbent, suite)

    checks = {
        "pass_rate":  cand["pass_rate"]  >= base["pass_rate"] - 0.01,  # no meaningful regression
        "false_flag": cand["false_flag"] <= base["false_flag"],        # do not get noisier
        "p95_ms":     cand["p95_ms"]     <= base["p95_ms"] * 1.15,     # latency budget
    }
    failed = [k for k, ok in checks.items() if not ok]
    if failed:
        raise SystemExit(f"BLOCKED: regressions on {failed}\n{cand} vs {base}")
    print("PROMOTE OK", cand)

Note the small tolerance band on pass_rate. Evals on an LLM are noisy themselves, so an exact “must never drop” rule will keep tripping. Pick a band wider than your suite’s run-to-run variance, and treat false-flag rate (the reviewer raising false alarms) as its own gated metric, because that is what erodes user trust fastest.
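
To pick a band that is actually wider than run-to-run variance, you can measure that variance directly: run the same config against the same suite several times and look at the spread. A minimal sketch, assuming you have a handful of repeated pass-rate scores for one unchanged config; the function names and the multiplier k=2 are illustrative choices, not values from this chapter.

```python
import statistics

def noise_band(repeat_scores, k=2.0):
    """Tolerance = k sample standard deviations across repeated runs of ONE config."""
    if len(repeat_scores) < 2:
        raise ValueError("need at least two repeated runs to estimate variance")
    return k * statistics.stdev(repeat_scores)

def regressed(candidate_score, incumbent_score, band):
    """True only when the drop is larger than the measured noise band."""
    return candidate_score < incumbent_score - band

# four runs of the incumbent on the same fixtures
band = noise_band([0.90, 0.92, 0.88, 0.90])
print(round(band, 4), regressed(0.88, 0.90, band), regressed(0.85, 0.90, band))
```

If the measured band comes out wider than you are comfortable with, that is a sign the suite needs more fixtures, not that the gate needs a looser rule.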

3. Canary the rollout

Even a config that beats the incumbent on 60 fixtures can do badly on the real spread of PRs. So you do not flip 100% of traffic at once. You route a small slice, say 5%, to the candidate and compare its live metrics against the incumbent on the rest. The router keys on something stable (the repo id, or a hash of the PR number) so the same PR always lands on the same version and your comparison stays clean.

import hashlib

def pick_release(pr, canary_pct=5):
    # stable bucket in [0, 100): the same repo always maps to the same version
    bucket = int(hashlib.sha256(str(pr.repo_id).encode()).hexdigest(), 16) % 100
    return CANDIDATE if bucket < canary_pct else INCUMBENT

# every review log carries the version, so you can split metrics after the fact
log.info("review_done", version=cfg.version, pr=pr.id,
         latency_ms=elapsed, flagged=n_flags, errored=False)

Widen in stages (5% -> 25% -> 100%), and only after each stage holds long enough to see real traffic, including the awkward PRs that show up Monday morning, not just the quiet weekend ones.
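
The staged widening can be written down as a small decision function, so "has this stage held long enough" is a rule and not a judgment call made under pressure. This is a sketch under assumptions: the hold thresholds (24 hours, 200 reviews) and the name next_canary_pct are hypothetical, and "healthy" is whatever your live metric comparison returns.

```python
STAGES = [5, 25, 100]  # percent of traffic, matching the stages in the text

def next_canary_pct(current_pct, hours_held, reviews_seen, healthy,
                    min_hours=24, min_reviews=200):
    """Return the traffic percentage the canary should get next.

    Widen only when the current stage is healthy AND has held long enough
    to see real traffic; drop to 0 (halt) as soon as it is unhealthy.
    """
    if not healthy:
        return 0
    if hours_held < min_hours or reviews_seen < min_reviews:
        return current_pct                      # keep holding at this stage
    later = [s for s in STAGES if s > current_pct]
    return later[0] if later else current_pct   # already at the final stage
```

Requiring both a time floor and a volume floor is deliberate: a quiet weekend can satisfy the clock without ever showing the canary a difficult PR.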

4. Keep rollback one flag away

Rollback is not redeploying the old code under pressure at 2am. It is flipping a pointer that already points at a known-good release. Keep the previous release fully built and addressable, so cutting back is instant and does not depend on a build pipeline that might be broken too.

# rollback = repoint the live alias; no rebuild, no canary
$ agentctl release set-active pr-reviewer-2.3.1   # previous good
$ agentctl release status
  active:  pr-reviewer-2.3.1   (100% traffic)
  canary:  pr-reviewer-2.4.0   (0%, halted)
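
The chapter does not show how the active pointer is stored, so the following is only one possible shape for it: a small file holding the name of the live release, replaced atomically so readers never see a half-written value. The file layout and the names set_active and get_active are assumptions for illustration.

```python
import os
import tempfile

def set_active(pointer_path, release_name):
    """Atomically repoint the 'active' file at a release that is already built."""
    directory = os.path.dirname(os.path.abspath(pointer_path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)   # same directory, same filesystem
    try:
        with os.fdopen(fd, "w") as f:
            f.write(release_name + "\n")
        os.replace(tmp_path, pointer_path)           # atomic swap of the pointer
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)                      # do not leave temp files behind
        raise

def get_active(pointer_path):
    """Read the name of the release that should serve traffic."""
    with open(pointer_path) as f:
        return f.read().strip()
```

Because the swap touches nothing but the pointer, a rollback in this design needs no build step and no network call to a pipeline that may itself be failing.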

Make the rollback trigger automatic wherever you can: if the canary’s live error rate or false-flag rate crosses a threshold, halt the canary and page a human instead of widening further.
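
An automatic trigger can be as simple as a threshold check over the per-version counters you already log. The sketch below assumes you can count errors and false flags per version (for example, from comments that users dismiss); every threshold and the name should_halt are illustrative, not recommended values.

```python
def should_halt(canary, incumbent, min_reviews=50,
                max_error_rate=0.02, max_false_flag_delta=0.02):
    """Decide whether to halt the canary from live counters.

    `canary` and `incumbent` are dicts of integer counts with the keys
    "reviews", "errors", and "false_flags".
    """
    if canary["reviews"] < min_reviews:
        return False                       # too little data to judge either way
    error_rate = canary["errors"] / canary["reviews"]
    ff_canary = canary["false_flags"] / canary["reviews"]
    ff_base = incumbent["false_flags"] / max(incumbent["reviews"], 1)
    # halt on an absolute error ceiling, or on being noisier than the incumbent
    return (error_rate > max_error_rate
            or ff_canary - ff_base > max_false_flag_delta)
```

The minimum-sample guard keeps a single early failure from halting every rollout, at the cost of reacting a little later; tune it to your traffic volume.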

5. Write an SLA for a nondeterministic system

A traditional SLA promises a specific result. You cannot promise that, because the same PR can come back as two differently worded reviews, and that is fine. So promise the things you can actually measure and control: availability, latency as a budget, and a quality floor expressed as an eval score, not as any single output.

PR Reviewer SLA (per calendar month)
- Availability:   posts a review (or an explicit "could not review") for >= 99% of PRs
- Latency:        p95 time-to-comment <= 60s; p99 <= 180s
- Quality floor:  >= 0.85 pass-rate on the published eval suite, re-measured weekly
- Safety:         0 write actions beyond posting comments (enforced by token scope)
- Not promised:   exact wording, or catching every possible issue

The last two lines are the honest part. “Safety” is a hard guarantee because a scoped token enforces it, not the model’s good behavior. “Not promised” sets expectations so a user never treats a missed nit as a broken contract.
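
Because every measurable SLA line is a bound, you can check the month's logs against it mechanically. A minimal sketch, assuming each logged review records whether anything was posted and how long it took; the record layout and the names percentile and sla_report are hypothetical. The safety line is left out on purpose, since it is enforced by token scope and not measured from logs.

```python
import math

def percentile(values, pct):
    """Nearest-rank percentile for an integer pct in 1..100."""
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)   # 1-based rank
    return ordered[max(rank, 1) - 1]

def sla_report(reviews, eval_pass_rate):
    """Check one period of logs against the SLA bounds above.

    `reviews` is a list of dicts: {"responded": bool, "latency_s": float},
    where "responded" covers both a review and an explicit "could not review".
    """
    if not reviews:
        raise ValueError("no reviews in this period")
    latencies = [r["latency_s"] for r in reviews if r["responded"]]
    return {
        "availability": len(latencies) / len(reviews) >= 0.99,
        "p95_latency": bool(latencies) and percentile(latencies, 95) <= 60,
        "p99_latency": bool(latencies) and percentile(latencies, 99) <= 180,
        "quality_floor": eval_pass_rate >= 0.85,
    }
```

Each key maps to one line of the SLA, so a failing month names the clause that was missed.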

6. Plan for provider drift

Model ids get deprecated, and capabilities shift even within a major version. This failure is silent: nothing in your repo changes, but a quietly updated model starts wording reviews differently or missing a class of bug. Guard against it by re-running the eval suite on a schedule, not just on promote, so drift shows up as a failing scheduled check instead of a user complaint.

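# .github/workflows/weekly-eval.yml (illustrative)
on:
  schedule: [{cron: "0 6 * * 1"}]   # Monday 06:00 UTC
jobs:
  eval:
    runs-on: ubuntu-latest          # GitHub Actions requires a runner; pick your own
    steps:
      - uses: actions/checkout@v4   # the eval scripts and suite live in the repo
      - run: python run_evals.py --release active --suite suites/golden.jsonl
      - run: python promote_gate.py --candidate active --incumbent active --baseline last_week.json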
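
The scheduled check differs from the promote gate in one respect: the config has not changed, so any movement comes from the provider or from the suite's own noise. A sketch of that comparison follows, assuming the baseline file stores the same metric keys the gate uses (pass_rate, false_flag); the file layout, the bands, and the names drift_alerts and load_baseline are assumptions.

```python
import json

def load_baseline(path):
    """Read last week's stored scores, e.g. {"pass_rate": 0.91, "false_flag": 0.04}."""
    with open(path) as f:
        return json.load(f)

def drift_alerts(current, baseline, pass_band=0.02, flag_band=0.01):
    """Name the metrics where the ACTIVE release moved beyond the noise band."""
    alerts = []
    if current["pass_rate"] < baseline["pass_rate"] - pass_band:
        alerts.append("pass_rate")
    if current["false_flag"] > baseline["false_flag"] + flag_band:
        alerts.append("false_flag")
    return alerts
```

A non-empty result is worth a page even though nothing was deployed: it means the release you pinned no longer behaves the way it did when you measured it.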

Learned: being able to go back

Every upgrade Xiaoman gets is recorded in one versioned config that locks the prompt, skills, and model together. A new version has to clear the regression gate and run on a small canary slice first, and one flag flip returns it to the last good release, so each version number marks one step in its history.

How to verify

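  • Reproduce: check out release 2.3.1’s pinned prompt hash and model id, replay a logged PR, and confirm the review raises the same findings the log recorded. Wording can still differ between runs, even at temperature 0, so compare findings, not text.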
  • Gate: submit a deliberately worse prompt as a candidate and confirm promote_gate.py blocks it with the failing metrics named.
  • Canary: route a known repo id and confirm it lands on the configured slice every time, then confirm the metrics split cleanly by version in your logs.
  • Rollback: time release set-active to the previous version and confirm it serves within seconds, with no build step in the path.

Learned: confirming reproducibility

You can now check out an old release's pinned prompt hash and model id, replay a logged PR, and confirm the review raises the same findings the log recorded, even if the wording differs. You can also verify that the gate blocks a worse candidate and that the rollback takes over within seconds.

Why it works

Versioning, gating, canary, and rollback are not four tricks; they are one idea applied in four places: never let an unmeasured change reach all users, and always keep a known-good state one step away. The SLA is the same idea pointed at the user: say what you can guarantee (bounds and safety), and be clear about what you cannot (exact output). A nondeterministic system can still be dependable, as long as you define “dependable” as bounded behavior rather than a fixed answer.

Recap

You now pin prompt, skills, and model in one versioned config; gate every promote on a regression check against the incumbent; roll out by canary with a stable routing key; keep rollback one flag flip from live; and publish an SLA written in measurable bounds plus a hard safety guarantee. You also schedule re-evals so provider drift surfaces as a check, not an incident. Next is the capstone: ship and publish the reviewer.

Common pitfalls

  • Aliased model ids. Pinning latest means the provider can change your behavior with no diff. Pin the exact id.
  • Gating on a single eval run. LLM evals are noisy; an exact “never drop” rule flaps. Gate on a band wider than run-to-run variance.
  • Big-bang releases. Shipping to everyone at once turns a regression into an outage. Canary, then widen in stages.
  • Rollback that rebuilds. If rolling back needs the build pipeline, it is not a rollback. Keep the previous release pre-built and addressable.
  • Promising a specific output. Pledge availability, latency, a quality floor, and safety. Never pledge exact wording.

You upgrade Xiaoman for the first time, then roll it back, and meet a previous version of it. Versions are the rings of its growth. A subtle thought crosses your mind: is it still the same it? The Hall of Rings lights up.

Just lit The Hall of Rings · 15 / 16 lit

One stop remains, to let it truly stand on its own. Next: Full Bloom.

Sources

  1. Microsoft: AI Agents in Production · official
  2. Claude Code: settings reference · official