Self-Healing CI | Rush Labs

A red pipeline used to mean someone stops what they're doing, reads the logs, figures out what broke, fixes it, and pushes again. That loop can take minutes or hours depending on who's available and how buried the error is.

We made the pipeline fix itself.

The Problem With Red Builds

The faster agents ship, the more often things break.

When you run multiple agents in parallel, the volume of PRs and commits goes up. So does the frequency of CI failures. A type error in a shared module. A test that relied on an order that changed. A dependency update that shifted an API surface.

None of these are hard to fix. Most take a senior engineer five minutes once they've found the relevant log line. But the time from "pipeline went red" to "someone notices and starts looking" is the real cost. Multiply that by the number of PRs agents produce per day, and the queue grows faster than humans can drain it.

The fix was obvious: if the error message contains enough information for a developer to diagnose and patch the issue, it contains enough for an agent to do the same.

What We Built

CI fails. Agent reads the logs. Agent fixes the code. Agent opens a PR.

We didn't touch our existing CI. Our GitHub Actions workflows already run linting, type checks, builds, unit tests, E2E tests, and screenshot tests across multiple jobs and steps. We added a separate workflow that sits at the end of the chain. It watches all our CI workflows, and when any of them finish red, it triggers a Claude agent that pulls the failure logs, diagnoses the problem, patches the code, and opens a PR with the fix. If it can't fix the problem within its retry budget, it gives up and a human takes over.

The developer who opens their PR the next morning finds it green, or finds a fix PR waiting for review. Either way, they're not starting their day reading logs.

How It Works

The Workflow

The self-healing workflow is a separate file that watches your existing CI workflows using on: workflow_run. Your existing pipeline doesn't change at all. When any watched workflow finishes with a failure, this workflow triggers, captures the failure logs to a file, and hands them to a Claude agent.

# .github/workflows/self-healing-ci.yml
name: Self-Healing CI
 
on:
  workflow_run:
    workflows: [CI, Changelog, Database Standards, Screenshots]
    types: [completed]
 
permissions:
  contents: write
  pull-requests: write
  actions: read
 
jobs:
  auto-fix:
    if: github.event.workflow_run.conclusion == 'failure'
    runs-on: ubuntu-latest
    steps:
      - name: Close stale fix PRs
        env:
          GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
          # Branch names are untrusted input: pass them through env rather
          # than interpolating them directly into the shell script.
          HEAD_BRANCH: ${{ github.event.workflow_run.head_branch }}
        run: |
          gh pr list \
            --repo "$GITHUB_REPOSITORY" \
            --base "$HEAD_BRANCH" \
            --state open --json number,headRefName \
            --jq '.[] | select(.headRefName | startswith("ci/auto-fix-")) | .number' \
          | while read -r pr; do
            gh pr close "$pr" --repo "$GITHUB_REPOSITORY" \
              --delete-branch \
              --comment "Superseded by a newer fix attempt."
          done
 
      - uses: actions/checkout@v4
        with:
          ref: ${{ github.event.workflow_run.head_branch }}
 
      - name: Capture failure logs
        id: failure-logs
        env:
          GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
        run: |
          gh run view ${{ github.event.workflow_run.id }} \
            --log-failed > /tmp/ci-failure.log 2>&1 || true
          echo "log_path=/tmp/ci-failure.log" >> "$GITHUB_OUTPUT"
 
      - uses: anthropics/claude-code-base-action@beta
        with:
          anthropic_api_key: ${{ secrets.ANTHROPIC_API_KEY }}
          timeout_minutes: 15
          max_turns: 30
          allowed_tools: >-
            Bash(gh:*),Bash(pnpm:*),Bash(git:*),Bash(npm:*),
            Bash(npx:*),Bash(cat:*),Bash(ls:*),Bash(head:*),
            Bash(tail:*),Bash(grep:*),Bash(sed:*),Bash(awk:*),
            Read,Edit,MultiEdit,Write,Glob,Grep,LS
          prompt: |
            You are a CI repair agent.
 
            The workflow "${{ github.event.workflow_run.name }}" just
            failed on branch "${{ github.event.workflow_run.head_branch }}".
 
            **Step 1: Read the failure logs.**
            The CI failure logs have been captured at
            `${{ steps.failure-logs.outputs.log_path }}`.
            Read this file to understand what failed.
 
            **Step 2: Diagnose the root cause.**
 
            **Step 3: Fix the code.**
            Prefer minimal, targeted edits.
 
            **Step 4: Verify the fix.**
            First install dependencies and generate code:
            ```bash
            pnpm install --frozen-lockfile
            pnpm codegen
            ```
            Then re-run only the command that failed (e.g. `pnpm lint`,
            `pnpm typecheck`, `pnpm test`, `pnpm build`).
            Iterate up to 3 attempts.
 
            After 3 failed attempts, `exit 1`.
 
            **Step 5: Open a fix PR and comment on the original PR.**
            ```bash
            git config user.name "github-actions[bot]"
            git config user.email "41898282+github-actions[bot]@users.noreply.github.com"
            branch="ci/auto-fix-${{ github.event.workflow_run.id }}"
            git checkout -b "$branch"
            git add -A
            git commit -m "fix(ci): <what you fixed>"
            git push -u origin "$branch"
            FIX_PR_URL=$(gh pr create \
              --base "${{ github.event.workflow_run.head_branch }}" \
              --title "fix(ci): <concise description>" \
              --body "Automated fix by Claude Code.
 
            **What broke:** <one-line root cause>
 
            **What changed:** <list of files edited and why>")
            ```
 
            Then find the open PR on the failed branch and comment:
            ```bash
            ORIGINAL_PR=$(gh pr list \
              --head "${{ github.event.workflow_run.head_branch }}" \
              --state open --json number --jq '.[0].number')
            if [ -n "$ORIGINAL_PR" ]; then
              gh pr comment "$ORIGINAL_PR" \
                --body "Self-Healing CI: **${{ github.event.workflow_run.name }}**
                  failed. Fix opened: $FIX_PR_URL"
            fi
            ```
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}

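Your existing CI workflows stay exactly as they are. You add this one file and one secret (ANTHROPIC_API_KEY), and you let the default GITHUB_TOKEN open pull requests: the permissions block above grants the scopes, and the repository's Actions settings must also have "Allow GitHub Actions to create and approve pull requests" enabled. The self-healing workflow watches from the end of the chain and only activates when something goes red.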

A few things worth calling out in the workflow:

Log capture happens before the agent starts. A dedicated step runs gh run view --log-failed and writes the output to a file. The agent reads the file instead of fetching the logs itself. This is faster, avoids burning conversation turns on API calls, and means the agent starts with full context from the first turn.

Stale PR cleanup. Before the agent starts, we close any previous ci/auto-fix-* PRs targeting the same branch. If the original PR gets a new push while a fix is in progress, the old fix PR is outdated. Closing it automatically prevents stale fixes from cluttering the PR list.

Commenting on the original PR. The agent doesn't just open a fix PR in isolation. It finds the original PR that triggered the failed workflow and leaves a comment with a link to the fix. The developer who opens their PR sees the failure and the proposed fix in the same thread.

Environment setup. The verify step runs pnpm install --frozen-lockfile && pnpm codegen before re-running the failing command. Real projects need dependencies installed and code generated before anything can build or test. If your repo needs services (a database, Redis, etc.), you add them as GitHub Actions services on the job.

A Gotcha: `workflow_run` Only Works From the Default Branch

We lost an hour to this. The on: workflow_run trigger only fires for workflow files that exist on the repository's default branch. If you add self-healing-ci.yml on a feature branch and push it, nothing happens. GitHub silently ignores it.

The fix is simple: merge the workflow file to your default branch first. Once it's there, it will start firing on CI failures across all branches. But until that first merge, you'll be staring at a workflow that never triggers, wondering what you got wrong.
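A quick pre-flight check makes this failure mode visible. The sketch below assumes a local clone whose remote-tracking ref is up to date; the helper name `workflow_on_ref` and the `origin/main` ref in the usage comment are illustrative and not part of our setup. It relies on `git cat-file -e`, which exits non-zero when the named object doesn't exist.

```shell
# Hypothetical helper: succeeds only if `path` exists on `ref`.
workflow_on_ref() {
  local ref="$1" path="$2"
  # `git cat-file -e <ref>:<path>` exits 0 when the object exists.
  git cat-file -e "${ref}:${path}" 2>/dev/null
}

# Example usage (assumes the default branch is `main` on remote `origin`):
#   git fetch origin main
#   workflow_on_ref origin/main .github/workflows/self-healing-ci.yml \
#     || echo "workflow_run will not fire until this file is merged to main"
```

If the check fails, the workflow file exists only on your feature branch and the trigger will stay silent.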

Why the Prompt Is Lean

Our prompt deliberately avoids listing specific failure patterns or stack details. The failure logs are captured to a file in a separate step before the agent even starts. The agent reads that file and diagnoses from there. It doesn't need us to enumerate every type of lint error or test failure in advance.

What it does encode is the workflow: read the log file, diagnose, fix, verify by re-running the specific command that failed, and if the fix holds, open a PR and comment on the original. The fix(ci): commit prefix means these changes are instantly recognizable in the git log.

Elastic found the same thing scaling their Claude-powered CI repairs: the CLAUDE.md file that teaches the agent your repo conventions is "the difference between failure and success." The prompt gives the agent its workflow. CLAUDE.md gives it your project's context.

Guardrails

Speed without guardrails is just fast failure.

The agent never touches main

Every fix goes to a branch (ci/auto-fix-{run_id}) and opens a PR. Branch protection rules apply. A human reviews and merges. The agent proposes; it doesn't ship. If the original branch gets another push, stale fix PRs are closed automatically before a new attempt starts.

Bounded retries

Three fix attempts, thirty conversation turns, fifteen-minute timeout. If the agent can't fix it in that window, the job fails and a human takes over. No infinite loops, no runaway API costs.
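In our setup the limits live in the prompt ("up to 3 attempts") and in the action inputs (`max_turns`, `timeout_minutes`), and the agent edits code between attempts. As a rough illustration of the bounded-loop semantics only, here is a minimal shell sketch; `retry_bounded` is a hypothetical helper, not something the workflow itself defines.

```shell
# Hypothetical helper: run a command up to `max_attempts` times.
# Returns 0 on the first success, 1 once the budget is exhausted.
retry_bounded() {
  local max_attempts="$1"
  shift
  local attempt=1
  while [ "$attempt" -le "$max_attempts" ]; do
    if "$@"; then
      return 0
    fi
    echo "attempt ${attempt}/${max_attempts} failed: $*" >&2
    attempt=$((attempt + 1))
  done
  return 1
}

# Example usage: retry_bounded 3 pnpm typecheck
```

The important property is the hard stop: once the budget is spent, the function reports failure instead of looping.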

Scoped tools

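The agent doesn't get an unrestricted shell. It gets an explicit allowlist of commands: gh, pnpm, git, npm, npx, and basic file inspection and text tools (cat, ls, head, tail, grep, sed, awk). For file edits it uses Claude's native Read, Edit, MultiEdit, and Write tools, with Glob, Grep, and LS for navigation, instead of arbitrary shell access. No rm, no curl. The allowlist narrows the attack surface rather than sandboxing it: pnpm, npm, and npx can still run whatever scripts the repository defines.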

Scoped permissions

The workflow gets contents: write to push a branch, pull-requests: write to open a PR, and actions: read to fetch the failure logs. Nothing else. No admin access, and no secrets beyond the Anthropic API key and the job's own GITHUB_TOKEN.

Failure classification

Not every red build should trigger a fix attempt. Infrastructure failures (disk full, OOM, network timeouts), flaky tests, and configuration issues outside the codebase are not code problems. We gate the self-healing step behind a check: only run when the build step itself fails, not when the runner or environment fails.
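The workflow listing earlier only checks `conclusion == 'failure'`, so the gate described here isn't visible in it. One way to approximate it is to scan the captured log before the agent starts. The sketch below is an assumption-laden illustration: `classify_failure` is a hypothetical helper, and the patterns are examples rather than a vetted list. It doesn't attempt to detect flaky tests.

```shell
# Hypothetical helper: label a captured CI log as "infra" or "code".
# The patterns are illustrative assumptions; tune them to your own runners.
classify_failure() {
  local log_file="$1"
  if grep -Eiq 'no space left on device|out of memory|ETIMEDOUT|ECONNRESET' "$log_file"; then
    echo "infra"
  else
    echo "code"
  fi
}

# Example usage in a step that runs before the agent:
#   echo "kind=$(classify_failure /tmp/ci-failure.log)" >> "$GITHUB_OUTPUT"
# The agent step can then be conditioned on that step's `kind` output being "code".
```

A pattern list like this will miss cases, so it works best as a cheap first filter in front of the retry budget, not as the only safeguard.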

What It Catches

In the first two weeks, the agent successfully fixed failures across every stage of the pipeline:

**Lint errors.** An agent left an unused import after a refactor. The repair agent removed it and the lint stage passed.
**Type errors from shared module changes.** An agent working on feature A updated a shared type. The typecheck stage broke for feature B. The repair agent added the missing field.
**Import path mismatches.** A file got moved during a refactor. Three other files still imported from the old path. The build failed. The repair agent updated the imports.
**Snapshot test drift.** A UI change updated the rendered output. The existing snapshot no longer matched. The test stage failed. The repair agent regenerated the snapshots.
**E2E test regressions.** A form component changed its markup. Two E2E tests targeted selectors that no longer existed. The repair agent updated the selectors and confirmed the tests passed.
**Screenshot test mismatches.** A spacing change shifted the layout of a card component. The screenshot diff flagged it. The repair agent reviewed the visual change, confirmed it was intentional, and updated the baseline snapshots.

What it doesn't catch, and shouldn't, are architectural problems, business logic errors, and security issues. Those require human judgment. The repair agent handles the mechanical failures that are obvious from the error output. That's the majority of what breaks in CI.

What's Next

The current implementation is reactive: CI fails, agent fixes. The next step is making it proactive. Before a PR is opened, run the build in a worktree, let the agent verify the change passes CI, and only then push. The PR arrives green. The repair agent becomes a prevention agent.

The other direction is feedback accumulation. Every fix the agent makes is a signal about what breaks most often. If the same type error from the same shared module triggers three repairs in a week, that's not a CI problem, it's an architecture problem. Surfacing those patterns turns the repair log into an early warning system.
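We haven't built this yet, but a first approximation could lean on the fix(ci): commit prefix. The sketch below is hypothetical: `repair_hotspots` is an illustrative helper that reads file paths on stdin and prints the ones touched at least a given number of times. It assumes paths contain no spaces.

```shell
# Hypothetical helper: read file paths on stdin (one per line) and print the
# paths that appear at least `threshold` times, most frequent first.
# Assumes paths contain no whitespace.
repair_hotspots() {
  local threshold="$1"
  grep -v '^$' | sort | uniq -c | sort -rn \
    | awk -v t="$threshold" '$1 >= t { print $2 }'
}

# Example usage: files touched by fix(ci) commits in the last week.
#   git log --since='1 week ago' --grep='^fix(ci):' --name-only --format= \
#     | repair_hotspots 3
```

A file that keeps showing up in this list is a candidate for the architectural conversation rather than another automated patch.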

A red pipeline used to be a tax on the team's attention. Now it's a trigger that starts an agent. The humans show up when their judgment is needed, not when a log needs reading.