Chapter 13: Deployment shapes
Choose where your agent runs and the trade-offs each shape makes for concurrency, secrets, and permissions.
Xiaoman · The World Outside
All along, Xiaoman has worked at your side. Today it faces strangers on its own.
Draft chapter. First cut to prove the format; it will be hardened before it is indexed.
What you’ll build
A deployment plan for the PR reviewer. Up to now it ran on your laptop, triggered by you, with your personal credentials. Here you decide where it runs for real and write down the constraints that come with that choice: how many reviews run at once, how you stay under model rate limits, where secrets live, and what the agent is allowed to touch. By the end you will have a comparison table of the four common shapes, a chosen shape (a queue with workers), and concrete pseudocode for a webhook-triggered worker.
The deployment shape is not a cosmetic decision. It sets your latency profile, your cost ceiling, your failure behavior under load, and your blast radius if a key leaks. Pick it carefully and the agent holds up during a traffic spike; pick it carelessly and it falls over.
Prerequisites
- A working agent from Part 3 with guardrails and a sandbox.
- API keys for the model and for the code host (for example a GitHub token).
- A rough idea of your traffic: reviews per hour, and how bursty it is (ten PRs landing in one minute is a very different shape from ten spread over an hour).
Steps
1. List the shapes and their trade-offs
There are four common shapes. Each trades latency against cost and operational weight differently. One rule applies to all four: a production agent should hold no important state in process memory, because any instance can be killed and replaced at any time. State lives in the queue, the database, and the code host.
| Shape | Trigger | Latency | Concurrency control | Best for |
|---|---|---|---|---|
| Sync API service | HTTP request, caller waits | Lowest, but caller blocks | Connection/thread pool | Interactive, fast calls |
| Job queue + workers | Enqueue now, workers pull | Seconds to minutes | Worker count | Webhook-driven background work |
| Cron / batch | Scheduled tick | Whenever it runs | Batch size | Nightly sweeps, digests |
| Serverless function | Event, function spins up | Cold-start penalty | Platform concurrency cap | Spiky, short, infrequent work |
A PR review takes tens of seconds to minutes (it reads files and makes several model calls), so making a webhook caller wait synchronously is a poor fit: the code host will time the webhook out. That points away from the sync service.
2. Match shape to trigger
The PR reviewer reacts to webhooks from the code host. The right pattern is: the webhook handler does almost nothing except validate the payload, enqueue a job, and return 200 immediately; workers pull jobs and do the slow review work out of band. Cron suits scheduled batch jobs (for example, a weekly summary of review quality). Serverless suits spiky, short, infrequent reviews where you do not want an always-on worker and can accept the cold-start cost. For steady webhook traffic, the queue-with-workers shape wins on control and cost predictability, so that is what we build.
3. Bound concurrency
Set a fixed worker count and a per-worker in-flight limit so you never fan out unbounded model requests. This is your first defense against both rate limits and runaway cost: N workers each making at most one model call at a time caps your peak load at a number you choose, no matter how many webhooks arrive. Microsoft's lesson 10 guidance on AI agents in production (see Sources) frames this as keeping the agent from becoming a black box; the observable metrics it lists include request errors, which is exactly what unbounded fan-out produces.
review_pr is the agent loop wrapped in a function: it calls query() inside the worker, triggered by an event, and returns once a review completes.
# review_pr: the agent loop wrapped in a function a worker can call
from claude_agent_sdk import query, ClaudeAgentOptions, AssistantMessage, TextBlock

# REVIEW_CONTRACT (the reviewer's system prompt) is assumed to be defined elsewhere.
async def review_pr(job):
    options = ClaudeAgentOptions(
        system_prompt=REVIEW_CONTRACT,
        allowed_tools=["Read"],  # read-only: stateless, no side effects
        max_turns=6,
        cwd=job["repo"],
    )
    out = ""
    async for message in query(prompt=f"Review PR #{job['pr']} ({job['sha']}).", options=options):
        if isinstance(message, AssistantMessage):
            for block in message.content:
                if isinstance(block, TextBlock):
                    out += block.text  # collect only the text parts of the reply
    return out

# Worker pool: N workers, each pulls one job at a time -> peak concurrency = N
# connect_queue and handle_failure are placeholders for your queue client and retry policy.
import asyncio

async def worker(name, queue):
    while True:
        job = await queue.get()  # blocks until work arrives
        try:
            await review_pr(job)  # one review in flight per worker
        except Exception as e:
            await handle_failure(job, e)  # retry/backoff or dead-letter
        finally:
            queue.task_done()

async def main(concurrency=4):
    queue = await connect_queue()
    await asyncio.gather(*[worker(f"w{i}", queue) for i in range(concurrency)])
// review_pr: the agent loop wrapped in a function a worker can call
import { query } from "@anthropic-ai/claude-agent-sdk";

async function reviewPr(job) {
  const options = {
    systemPrompt: REVIEW_CONTRACT,
    allowedTools: ["Read"], // read-only: stateless, no side effects
    maxTurns: 6,
    cwd: job.repo,
  };
  let out = "";
  for await (const message of query({ prompt: `Review PR #${job.pr} (${job.sha}).`, options })) {
    if (message.type === "assistant") {
      for (const block of message.message.content) {
        if (block.type === "text") out += block.text;
      }
    }
  }
  return out;
}

// Worker pool: N workers, each pulls one job at a time -> peak concurrency = N
async function worker(name, queue) {
  while (true) {
    const job = await queue.get(); // blocks until work arrives
    try {
      await reviewPr(job); // one review in flight per worker
    } catch (e) {
      await handleFailure(job, e); // retry/backoff or dead-letter
    } finally {
      queue.taskDone();
    }
  }
}

async function main(concurrency = 4) {
  const queue = await connectQueue();
  await Promise.all(
    Array.from({ length: concurrency }, (_, i) => worker(`w${i}`, queue)),
  );
}
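A rough capacity estimate can help when picking the worker count. The sketch below assumes each worker runs one review at a time and that reviews take a roughly constant time; the function names and parameters are illustrative and not part of any SDK.

```python
import math

def sustained_reviews_per_hour(workers: int, avg_review_seconds: float) -> float:
    """Upper bound on throughput when each worker runs one review at a time."""
    return workers * 3600.0 / avg_review_seconds

def burst_drain_seconds(burst_size: int, workers: int, avg_review_seconds: float) -> float:
    """Approximate time to clear a burst that arrives all at once.

    Jobs are processed in waves of `workers` at a time, so the last job
    finishes after ceil(burst_size / workers) waves.
    """
    waves = math.ceil(burst_size / workers)
    return waves * avg_review_seconds

# Example: 4 workers, reviews averaging 60 seconds each
print(sustained_reviews_per_hour(4, 60))   # 240.0
print(burst_drain_seconds(10, 4, 60))      # 180
```

Under these assumptions, ten PRs landing in one minute take about three minutes to clear with four workers. Real review times vary, so treat the numbers as a starting point for load testing, not a guarantee.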
4. Build the webhook-to-worker path
Tie it together: a thin webhook endpoint that returns fast, a durable queue, and the worker pool above. The endpoint must verify the webhook signature before trusting the payload (an endpoint with no auth lets anyone post fake events), then enqueue and return. Doing the review inside the request handler is the classic mistake: the handler blocks for a minute, the code host gives up and retries, and you review the same PR twice.
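# Webhook endpoint: validate, enqueue, return. Never review inline.
# Sketch in a FastAPI-style framework (an assumption; adapt to your own).
# valid_signature, parse, queue, and WEBHOOK_SECRET are assumed to be defined elsewhere.
from fastapi import FastAPI, Request, Response

app = FastAPI()

@app.post("/webhook/pr")
async def on_pr_event(request: Request):
    body = await request.body()
    # A missing header becomes "" so the check rejects it instead of crashing.
    signature = request.headers.get("X-Hub-Signature-256", "")
    if not valid_signature(body, signature, WEBHOOK_SECRET):
        return Response(status_code=401)  # reject forged events
    event = parse(body)
    if event.action in ("opened", "synchronize"):
        await queue.put({
            "pr": event.pr_number,
            "repo": event.repo,
            "sha": event.head_sha,
            "idempotency_key": f"{event.repo}:{event.pr_number}:{event.head_sha}",
        })
    return Response(status_code=200)  # return immediately, work happens later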
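The signature check used by the endpoint could look like the sketch below. It assumes a GitHub-style X-Hub-Signature-256 header, which carries the HMAC-SHA256 hex digest of the raw request body prefixed with "sha256="; other code hosts sign differently, so check your host's documentation before relying on it.

```python
import hashlib
import hmac

def valid_signature(body: bytes, header: str, secret: str) -> bool:
    """Check a signature header of the form 'sha256=<hex digest>' against the raw body."""
    if not header or not header.startswith("sha256="):
        return False
    expected = "sha256=" + hmac.new(secret.encode(), body, hashlib.sha256).hexdigest()
    # compare_digest takes the same time whether or not the values match,
    # so an attacker cannot learn the signature one character at a time.
    return hmac.compare_digest(expected.encode(), header.encode())
```

The check must run on the raw bytes as received. Parsing the JSON first and re-serializing it can change whitespace or key order and make a genuine signature fail.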
The idempotency_key matters: webhooks can be delivered more than once, and workers can crash mid-job and retry. Keying on repo plus PR plus commit SHA lets the worker skip a review it has already posted for that exact commit, so a duplicate delivery does not produce a duplicate comment.
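One way the worker might use the key is sketched below. ReviewLedger and handle_job are hypothetical names, and the in-memory set is only a stand-in: a real deployment needs a durable store shared by all workers, since workers themselves hold no state.

```python
class ReviewLedger:
    """Records which idempotency keys already have a posted review.

    In-memory stand-in for illustration only; production needs a durable,
    shared store (for example a database table with a unique key).
    """

    def __init__(self) -> None:
        self._done: set[str] = set()

    def already_posted(self, key: str) -> bool:
        return key in self._done

    def mark_posted(self, key: str) -> None:
        self._done.add(key)


def handle_job(job: dict, ledger: ReviewLedger, post_review) -> bool:
    """Post a review unless one exists for this key. Returns True if a review was posted."""
    key = job["idempotency_key"]
    if ledger.already_posted(key):
        return False  # duplicate delivery or retry: skip silently
    post_review(job)
    ledger.mark_posted(key)
    return True
```

With several workers, the check and the mark should be a single atomic operation in the shared store (such as an insert that fails on a duplicate key); otherwise two workers can both pass the check for the same commit.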
5. Plan for rate limits
Workers will sometimes hit the model provider’s per-minute limits, especially during a burst. Add retry with exponential backoff and jitter around model calls, and let the bounded queue absorb the burst rather than slamming the API. The queue acts as a buffer: incoming webhooks pile up safely while workers work through them at a rate they can sustain. See the model provider’s official docs for current limits. Never hardcode a limit as a fact; look it up.
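import asyncio
import random

# model.call, RateLimited, and PermanentFailure are placeholders for your client and error types.
async def call_model_with_backoff(req, max_tries=5):
    delay = 1.0
    for attempt in range(max_tries):
        try:
            return await model.call(req)
        except RateLimited as e:
            if attempt == max_tries - 1:
                break  # out of retries: do not sleep before giving up
            sleep_for = e.retry_after or (delay + random.random())  # honor server hint, else backoff + jitter
            await asyncio.sleep(sleep_for)
            delay *= 2  # exponential backoff
    raise PermanentFailure("rate limit not clearing")  # fail closed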
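To see what a retry schedule looks like before wiring it to a real client, the delays can be computed on their own. This sketch uses a capped "full jitter" variant (each delay drawn uniformly between zero and the current exponential ceiling), which is a swapped-in alternative to the additive jitter above; backoff_delays, base, and cap are illustrative names and values.

```python
import random

def backoff_delays(max_tries: int = 5, base: float = 1.0, cap: float = 30.0, rng=None):
    """Yield one sleep duration per retry: exponential ceiling, capped, with full jitter."""
    rng = rng or random.Random()
    for attempt in range(max_tries):
        ceiling = min(cap, base * (2 ** attempt))  # 1, 2, 4, 8, ... up to cap
        yield rng.uniform(0, ceiling)              # spread workers out so they do not retry in lockstep

# Example: inspect a schedule (values differ on every run)
for i, delay in enumerate(backoff_delays()):
    print(f"retry {i}: sleep up to {min(30.0, 2 ** i):.0f}s, chose {delay:.2f}s")
```

The cap keeps a long outage from producing multi-minute sleeps, and the randomness keeps several workers that were rate-limited at the same moment from all retrying at the same moment.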
6. Place secrets and draw the permission boundary
Keep keys in a secret manager or platform environment, never in code, never in logs. Give each environment (dev, staging, prod) its own keys so a leak in staging cannot touch production. Then scope the host token to exactly what the agent needs: read the diff and post a review comment, nothing more. A token that can also push code or merge turns a prompt-injection bug into a security incident. How much damage the agent can do is capped by its permissions, so keep them as narrow as you can.
# Per-environment, least-privilege configuration (illustrative).
env: production
secrets:                      # injected from the secret manager, not in this file
  MODEL_API_KEY: ${vault:prod/model_key}
  GITHUB_TOKEN: ${vault:prod/gh_review_token}
github_token_scopes:          # only what the reviewer needs
  - pull_requests: read       # read the diff
  - pull_requests: write      # post the review comment
  # NO contents:write, NO workflow, NO admin
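On the worker side, the injected secrets can be read at startup in a way that fails closed and keeps values out of logs. This is a minimal sketch assuming the secret manager exposes the two keys above as environment variables; load_secrets, redact, and MissingSecret are hypothetical helpers, not part of any library.

```python
import os

REQUIRED_SECRETS = ("MODEL_API_KEY", "GITHUB_TOKEN")

class MissingSecret(RuntimeError):
    """Raised at startup so the worker stops instead of running half-configured."""

def load_secrets(environ=None) -> dict:
    """Read required secrets from the environment, failing closed if any is absent."""
    environ = os.environ if environ is None else environ
    missing = [name for name in REQUIRED_SECRETS if not environ.get(name)]
    if missing:
        # Report the variable names only, never their values.
        raise MissingSecret(f"missing secrets: {', '.join(missing)}")
    return {name: environ[name] for name in REQUIRED_SECRETS}

def redact(text: str, secrets: dict) -> str:
    """Replace any secret value that appears in a log line or error trace."""
    for value in secrets.values():
        text = text.replace(value, "[REDACTED]")
    return text
```

Passing error messages through redact before logging is a backstop, not a substitute for keeping secrets out of exception text in the first place.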
Learned: standing on its own
Xiaoman moves off your laptop for the first time and runs on a real server taking traffic you did not send. It is triggered by a webhook and works inside a worker pool, with concurrency bounded, secrets isolated per environment, and its token scoped to read-and-comment, so a traffic spike does not take it down.
How to verify
- Fire ten webhooks at once and confirm in-flight reviews stay at your worker count, not ten. Watch the queue depth rise and drain.
- Deliver the same webhook twice and confirm the idempotency key prevents a duplicate comment.
- Revoke a key in staging and confirm the agent fails closed (errors and stops), not silently posting nothing.
- Grep your logs and confirm no secret ever appears, even in error traces.
Learned: holding up under load
You can now fire ten webhooks at once and confirm that in-flight reviews stay at the worker count instead of fanning out unbounded, that a duplicate delivery posts no second comment, that a revoked key makes the agent fail closed rather than go silent, and that no secret leaks into the logs.
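The first check, that in-flight reviews never exceed the worker count, can be rehearsed locally before touching real infrastructure. The sketch below uses an in-process asyncio.Queue and a fake review that only sleeps; both are stand-ins for the durable queue and the real review_pr.

```python
import asyncio

async def measure_peak_in_flight(jobs: int = 10, concurrency: int = 4) -> int:
    """Run `jobs` fake reviews through `concurrency` workers; return the peak in flight."""
    queue: asyncio.Queue = asyncio.Queue()
    in_flight = 0
    peak = 0

    async def fake_review(job):
        nonlocal in_flight, peak
        in_flight += 1
        peak = max(peak, in_flight)
        await asyncio.sleep(0.01)  # stand-in for the slow review
        in_flight -= 1

    async def worker():
        while True:
            job = await queue.get()
            try:
                await fake_review(job)
            finally:
                queue.task_done()

    for i in range(jobs):
        queue.put_nowait({"pr": i})  # the whole burst arrives at once
    workers = [asyncio.create_task(worker()) for _ in range(concurrency)]
    await queue.join()               # wait until every job has been processed
    for w in workers:
        w.cancel()                   # workers loop forever, so stop them explicitly
    await asyncio.gather(*workers, return_exceptions=True)
    return peak

if __name__ == "__main__":
    print(asyncio.run(measure_peak_in_flight(jobs=10, concurrency=4)))
```

With ten jobs and four workers, the reported peak should equal the worker count. A larger number would mean something is fanning out past the pool.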
Why it works
The queue decouples the fast, untrusted webhook from the slow, expensive review. That one decoupling takes care of most of the rest: concurrency control becomes “how many workers,” rate-limit absorption becomes “how deep the queue can grow,” and retries become safe because jobs are idempotent and durable. Because workers hold no state, you can throw any of them away, scaling up under load and killing them when idle without losing work.
Recap
You compared four deployment shapes, chose a queue with workers to match a webhook trigger, then bounded concurrency, made jobs idempotent, handled rate limits with backoff, isolated secrets per environment, and scoped the host token to read-and-comment only. Shape is not just where code runs: it sets your cost, latency, and blast radius. Next we cover versioning and rollback.
Common pitfalls
- Reviewing inline in the webhook handler, so the code host times out and retries, doubling your work.
- Unbounded fan-out: no concurrency cap means one busy hour exhausts your rate limit or budget.
- No idempotency key, so a redelivered webhook posts a second review on the same commit.
- Shared keys across environments: a leak in staging then compromises production.
- Over-scoped tokens: a token that can push code, not just comment, turns a bug into an incident.
Xiaoman leaves your machine for the first time and runs on a real server, facing users it has never met. For the first batches it checks itself over and over, slow to hand anything off, plainly nervous. On your side of the screen, you feel like a parent on the first day of school. The World Outside lights up.
Sources
- Microsoft: AI Agents in Production · official
- Microsoft: AI Agents in Production (lesson 10) · official