Software for Days

January 14, 2026

Making GitHub Actions Suck a Little Less

A Simple Auto-Retry Workflow for Transient Failures

Do you find yourself babysitting your CI environment because of transient failures like ETIMEDOUT, ECONNRESET, npm ERR! network, or 502 Bad Gateway? Yeah, us too.

Our problem has gotten even worse lately because of all the AI coding agents under our command. Great throughput requires great responsibility.

The Solution: A Simple Auto-Retry Workflow #

The fix is a small second workflow that watches the main one and re-runs it when the failure looks transient:

name: Auto-Retry Failed Workflows

on:
  workflow_run:
    workflows: ["Deploy"]  # Your main workflow name
    types: [completed]
    branches: [main, dev]

permissions:
  actions: write

jobs:
  check-and-retry:
    runs-on: ubuntu-latest
    if: github.event.workflow_run.conclusion == 'failure'
    steps:
      - name: Check retry count
        id: check
        env:
          GH_TOKEN: ${{ secrets.GH_TOKEN }}  # whatever secret holds your GitHub token
        run: |
          ATTEMPT=${{ github.event.workflow_run.run_attempt }}
          echo "attempt=$ATTEMPT" >> $GITHUB_OUTPUT

          # Max 3 total attempts (1 original + 2 retries)
          if [ "$ATTEMPT" -ge 3 ]; then
            echo "should_retry=false" >> $GITHUB_OUTPUT
          else
            echo "should_retry=true" >> $GITHUB_OUTPUT
          fi

      - name: Download and analyze logs
        if: steps.check.outputs.should_retry == 'true'
        id: analyze
        env:
          GH_TOKEN: ${{ secrets.GH_TOKEN }}
        run: |
          gh run view ${{ github.event.workflow_run.id }} \
            --repo ${{ github.repository }} \
            --log-failed > failed_logs.txt 2>&1 || true

          # Transient error patterns
          TRANSIENT_PATTERNS="ETIMEDOUT|ECONNRESET|ENOTFOUND|rate limit|socket hang up|npm ERR! network|fetch failed|503 Service|502 Bad Gateway|504 Gateway|Connection reset|CERT_HAS_EXPIRED"

          if grep -qiE "$TRANSIENT_PATTERNS" failed_logs.txt; then
            echo "is_transient=true" >> $GITHUB_OUTPUT
            MATCHED=$(grep -oiE "$TRANSIENT_PATTERNS" failed_logs.txt | head -1)
            echo "matched_pattern=$MATCHED" >> $GITHUB_OUTPUT
          else
            echo "is_transient=false" >> $GITHUB_OUTPUT
          fi

      - name: Re-run failed jobs
        if: steps.check.outputs.should_retry == 'true' && steps.analyze.outputs.is_transient == 'true'
        env:
          GH_TOKEN: ${{ secrets.GH_TOKEN }}
        run: |
          gh run rerun ${{ github.event.workflow_run.id }} \
            --failed \
            --repo ${{ github.repository }}

That’s it. Drop this in .github/workflows/auto-retry.yml and you’re mostly done. There’s not much in here specific to our infra besides the name of our GitHub token in the secrets config, the “main” workflow we want watched, and some branch names.

By the way, this entire workflow was vibecoded. I described the problem to Claude, it wrote the workflow, I reviewed and merged… case in point about the compounding nature of the problem.

How It Works #

Event-driven trigger. The workflow_run event fires immediately when your main workflow completes. The connection is by name, not by filename. In our case, workflows: ["Deploy"] matches the name: Deploy field in our deploy workflow. When any of the child jobs fails, the entire workflow is marked as failed, and the retry workflow fires.

# deploy.yml
name: Deploy  # <-- This is what workflow_run matches on

on:
  push:
    branches: [main, dev]

jobs:
  lint:
    runs-on: ubuntu-latest
    steps: [...]

  typecheck:
    runs-on: ubuntu-latest
    steps: [...]

  deploy:
    needs: [lint, typecheck]  # Runs after lint & typecheck pass
    runs-on: ubuntu-latest
    steps: [...]

All these jobs - lint, typecheck, deploy - are part of the single “Deploy” workflow. If lint fails due to ETIMEDOUT, the retry workflow sees the whole “Deploy” workflow failed and can surgically re-run just the lint job.

Smart retry logic. GitHub tracks run_attempt automatically. We check if we’re under 3 total attempts before retrying.
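If you want to poke at that counter yourself, the same field comes back from the workflow runs API. A quick sketch with the GitHub CLI; OWNER/REPO and RUN_ID are placeholders:

# Prints the run_attempt for a given run (1 = original run, 2 = first re-run, ...)
gh api repos/OWNER/REPO/actions/runs/RUN_ID --jq '.run_attempt'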

Log analysis. Downloads the failed job logs and greps for known transient patterns. If it finds ETIMEDOUT or 502 Bad Gateway, it’s probably worth retrying. If it finds something like error TS2345: Argument of type 'string'..., we don’t retry.
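A quick way to sanity-check the classification locally (not part of the workflow; the log lines are made up and the pattern list is a trimmed-down subset of the full one above):

TRANSIENT_PATTERNS="ETIMEDOUT|ECONNRESET|npm ERR! network|502 Bad Gateway"

# A network blip matches, so this run would be retried
echo "npm ERR! network request failed, reason: ETIMEDOUT" \
  | grep -qiE "$TRANSIENT_PATTERNS" && echo "transient -> retry"

# A type error doesn't match, so this run stays failed
echo "error TS2345: Argument of type 'string' is not assignable" \
  | grep -qiE "$TRANSIENT_PATTERNS" || echo "not transient -> no retry"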

Surgical retry. gh run rerun --failed only re-runs the jobs that failed, not the entire workflow.
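You can try the same command by hand against any failed run before wiring this up; the run ID and repo below are placeholders:

# Re-runs only the failed jobs of run 1234567890, leaving the passing jobs alone
gh run rerun 1234567890 --failed --repo your-org/your-repo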

Slack notifications. Sends a message when retrying (so you know it’s on it) and when it stops, either because it hit max attempts or because the failure doesn’t look transient (so you know to take a look). Add the steps below and put your Slack webhook in your repo secrets.

- name: Notify Slack - Retrying
  if: steps.check.outputs.should_retry == 'true' && steps.analyze.outputs.is_transient == 'true'
  uses: 8398a7/action-slack@v3
  with:
    status: custom
    custom_payload: |
      {
        "text": ":recycle: Auto-retrying failed workflow (attempt ${{ steps.check.outputs.attempt }}/3)",
        "attachments": [{
          "color": "warning",
          "fields": [
            { "title": "Branch", "value": "${{ github.event.workflow_run.head_branch }}", "short": true },
            { "title": "Error", "value": "${{ steps.analyze.outputs.matched_pattern }}", "short": true },
            { "title": "Run", "value": "<${{ github.event.workflow_run.html_url }}|View Failed Run>", "short": false }
          ]
        }]
      }
  env:
    SLACK_WEBHOOK_URL: ${{ secrets.SLACK_WEBHOOK_URL }}

- name: Notify Slack - Final Failure
  if: steps.check.outputs.should_retry == 'false' || steps.analyze.outputs.is_transient == 'false'
  uses: 8398a7/action-slack@v3
  with:
    status: custom
    # "Reason" reports whether we stopped at max attempts or saw a non-transient error; word it however you like
    custom_payload: |
      {
        "text": ":x: Workflow failed after ${{ steps.check.outputs.attempt }} attempts (not auto-retrying)",
        "attachments": [{
          "color": "danger",
          "fields": [
            { "title": "Branch", "value": "${{ github.event.workflow_run.head_branch }}", "short": true },
            { "title": "Reason", "value": "${{ steps.check.outputs.should_retry == 'false' && 'max attempts reached' || 'non-transient error' }}", "short": true },
            { "title": "Run", "value": "<${{ github.event.workflow_run.html_url }}|View Failed Run>", "short": false }
          ]
        }]
      }
  env:
    SLACK_WEBHOOK_URL: ${{ secrets.SLACK_WEBHOOK_URL }}

The Transient Pattern List #

These are the patterns that trigger a retry:

Pattern              What It Catches
ETIMEDOUT            Network timeout
ECONNRESET           Connection reset by peer
ENOTFOUND            DNS resolution failure
npm ERR! network     Any npm network error
rate limit           GitHub/npm rate limiting
socket hang up       Connection dropped
fetch failed         Generic fetch failure
502 Bad Gateway      Upstream server error
503 Service          Service temporarily unavailable
504 Gateway          Gateway timeout
CERT_HAS_EXPIRED     TLS certificate issues

When you discover a new transient pattern in the wild, just add it to the regex.
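For example, if you start seeing ECONNREFUSED from a flaky registry (a hypothetical addition, not in the list above), the change is one more alternation on the end of TRANSIENT_PATTERNS in the analyze step:

TRANSIENT_PATTERNS="ETIMEDOUT|ECONNRESET|ENOTFOUND|rate limit|socket hang up|npm ERR! network|fetch failed|503 Service|502 Bad Gateway|504 Gateway|Connection reset|CERT_HAS_EXPIRED|ECONNREFUSED"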

Why Doesn’t GitHub Just Do This Internally? #

We’re not sure.

Closing Thoughts #

Honestly, this is a lot better: transient failures retry themselves, real failures ping Slack, and we spend a lot less time babysitting CI.