January 14, 2026
Making GitHub Actions Suck a Little Less
A Simple Auto-Retry Workflow for Transient Failures
Do you find yourself babysitting your CI environment because of transient failures like `ETIMEDOUT`, `ECONNRESET`, `npm ERR! network`, or `502 Bad Gateway`? Yeah, us too.
Our problem has gotten even worse lately because of all the AI coding agents under our command. Great throughput requires great responsibility.
## The Solution: A Simple Auto-Retry Workflow
This retry workflow:
```yaml
name: Auto-Retry Failed Workflows

on:
  workflow_run:
    workflows: ["Deploy"] # Your main workflow name
    types: [completed]
    branches: [main, dev]

permissions:
  actions: write

jobs:
  check-and-retry:
    runs-on: ubuntu-latest
    if: github.event.workflow_run.conclusion == 'failure'
    steps:
      - name: Check retry count
        id: check
        env:
          GH_TOKEN: ${{ secrets.GH_TOKEN }} # Your token secret name
        run: |
          ATTEMPT=${{ github.event.workflow_run.run_attempt }}
          echo "attempt=$ATTEMPT" >> $GITHUB_OUTPUT

          # Max 3 total attempts (1 original + 2 retries)
          if [ "$ATTEMPT" -ge 3 ]; then
            echo "should_retry=false" >> $GITHUB_OUTPUT
          else
            echo "should_retry=true" >> $GITHUB_OUTPUT
          fi

      - name: Download and analyze logs
        if: steps.check.outputs.should_retry == 'true'
        id: analyze
        env:
          GH_TOKEN: ${{ secrets.GH_TOKEN }}
        run: |
          # Pull only the logs of the failed jobs from the run that just completed
          gh run view ${{ github.event.workflow_run.id }} \
            --repo ${{ github.repository }} \
            --log-failed > failed_logs.txt 2>&1 || true

          # Transient error patterns
          TRANSIENT_PATTERNS="ETIMEDOUT|ECONNRESET|ENOTFOUND|rate limit|socket hang up|npm ERR! network|fetch failed|503 Service|502 Bad Gateway|504 Gateway|Connection reset|CERT_HAS_EXPIRED"

          if grep -qiE "$TRANSIENT_PATTERNS" failed_logs.txt; then
            echo "is_transient=true" >> $GITHUB_OUTPUT
            MATCHED=$(grep -oiE "$TRANSIENT_PATTERNS" failed_logs.txt | head -1)
            echo "matched_pattern=$MATCHED" >> $GITHUB_OUTPUT
          else
            echo "is_transient=false" >> $GITHUB_OUTPUT
          fi

      - name: Re-run failed jobs
        if: steps.check.outputs.should_retry == 'true' && steps.analyze.outputs.is_transient == 'true'
        env:
          GH_TOKEN: ${{ secrets.GH_TOKEN }}
        run: |
          gh run rerun ${{ github.event.workflow_run.id }} \
            --failed \
            --repo ${{ github.repository }}
```
That's it. Drop this in `.github/workflows/auto-retry.yml` and you're mostly done: there's not much in here specific to our infra besides the name of our GitHub token in the secrets config, the "main" workflow we want watched, and some branch names.
By the way, this entire workflow was vibecoded. I described the problem to Claude, it wrote the workflow, I reviewed and merged… case in point about the compounding nature of the problem.
## How It Works
**Event-driven trigger.** The `workflow_run` event fires immediately when your main workflow completes. The connection is by name, not filename: in our case, `workflows: ["Deploy"]` matches the `name: Deploy` field in our deploy workflow. When any of its jobs fail, the entire workflow is marked as failed, and the retry workflow fires.
```yaml
# deploy.yml
name: Deploy # <-- This is what workflow_run matches on

on:
  push:
    branches: [main, dev]

jobs:
  lint:
    runs-on: ubuntu-latest
    steps: [...]
  typecheck:
    runs-on: ubuntu-latest
    steps: [...]
  deploy:
    needs: [lint, typecheck] # Runs after lint & typecheck pass
    runs-on: ubuntu-latest
    steps: [...]
```

All these jobs (`lint`, `typecheck`, `deploy`) are part of the single "Deploy" workflow. If `lint` fails due to `ETIMEDOUT`, the retry workflow sees the whole "Deploy" workflow failed and can surgically re-run just the `lint` job.
**Smart retry logic.** GitHub tracks `run_attempt` automatically. We check that we're under 3 total attempts before retrying.
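If you want to poke at this yourself, `run_attempt` is visible on any run via the API (`OWNER/REPO/RUN_ID` below are placeholders):

```bash
# Inspect how many attempts a given run has had (starts at 1).
gh api repos/OWNER/REPO/actions/runs/RUN_ID --jq '.run_attempt'
```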
**Log analysis.** Downloads the failed job logs and greps them for known transient patterns. If it finds `ETIMEDOUT` or `502 Bad Gateway`, the failure is probably worth retrying. If it finds, say, `error TS2345: Argument of type 'string'...`, we don't retry: no amount of re-running fixes a type error.
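A quick way to sanity-check the pattern list against a real log you've saved locally (same regex as the workflow; `failed_logs.txt` stands in for whatever log file you have on hand):

```bash
# Count which transient patterns appear in a saved CI log, most frequent first.
TRANSIENT_PATTERNS="ETIMEDOUT|ECONNRESET|ENOTFOUND|rate limit|socket hang up|npm ERR! network|fetch failed|503 Service|502 Bad Gateway|504 Gateway|Connection reset|CERT_HAS_EXPIRED"
grep -oiE "$TRANSIENT_PATTERNS" failed_logs.txt | sort | uniq -c | sort -rn
```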
**Surgical retry.** `gh run rerun --failed` re-runs only the jobs that failed, not the entire workflow.
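The same command works from your terminal, which is handy for testing; the run ID and repo slug below are made up:

```bash
# Re-run only the failed jobs of a specific run.
gh run rerun 1234567890 --failed --repo your-org/your-repo
```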
**Slack notifications.** Sends a message when retrying (so you know it's on it) and when it gives up after max attempts (so you know to look). Add the steps below and set your Slack webhook in your repo secrets.
```yaml
- name: Notify Slack - Retrying
  if: steps.check.outputs.should_retry == 'true' && steps.analyze.outputs.is_transient == 'true'
  uses: 8398a7/action-slack@v3
  with:
    status: custom
    custom_payload: |
      {
        "text": ":recycle: Auto-retrying failed workflow (attempt ${{ steps.check.outputs.attempt }}/3)",
        "attachments": [{
          "color": "warning",
          "fields": [
            { "title": "Branch", "value": "${{ github.event.workflow_run.head_branch }}", "short": true },
            { "title": "Error", "value": "${{ steps.analyze.outputs.matched_pattern }}", "short": true },
            { "title": "Run", "value": "<${{ github.event.workflow_run.html_url }}|View Failed Run>", "short": false }
          ]
        }]
      }
  env:
    SLACK_WEBHOOK_URL: ${{ secrets.SLACK_WEBHOOK_URL }}

- name: Notify Slack - Final Failure
  if: steps.check.outputs.should_retry == 'false' || steps.analyze.outputs.is_transient == 'false'
  uses: 8398a7/action-slack@v3
  with:
    status: custom
    custom_payload: |
      {
        "text": ":x: Workflow failed after ${{ steps.check.outputs.attempt }} attempts (not auto-retrying)",
        "attachments": [{
          "color": "danger",
          "fields": [
            { "title": "Branch", "value": "${{ github.event.workflow_run.head_branch }}", "short": true },
            { "title": "Reason", "value": "${{ steps.analyze.outputs.is_transient == 'false' && 'non-transient error' || 'max attempts reached' }}", "short": true },
            { "title": "Run", "value": "<${{ github.event.workflow_run.html_url }}|View Failed Run>", "short": false }
          ]
        }]
      }
  env:
    SLACK_WEBHOOK_URL: ${{ secrets.SLACK_WEBHOOK_URL }}
```
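Setting the webhook secret is one command with the `gh` CLI (the secret name matches the env var above; the repo slug is a placeholder):

```bash
# Paste the webhook URL when prompted; it's stored as an encrypted repo secret.
gh secret set SLACK_WEBHOOK_URL --repo your-org/your-repo
```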
## The Transient Pattern List
These are the patterns that trigger a retry:
| Pattern | What It Catches |
|---|---|
| `ETIMEDOUT` | Network timeout |
| `ECONNRESET` | Connection reset by peer |
| `ENOTFOUND` | DNS resolution failure |
| `npm ERR! network` | Any npm network error |
| `rate limit` | GitHub/npm rate limiting |
| `socket hang up` | Connection dropped |
| `fetch failed` | Generic fetch failure |
| `502 Bad Gateway` | Upstream server error |
| `503 Service` | Service temporarily unavailable |
| `504 Gateway` | Gateway timeout |
| `CERT_HAS_EXPIRED` | TLS certificate issues |
When you discover a new transient pattern in the wild, just add it to the regex.
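Before committing a new pattern, it's worth a quick local check that the extended regex still matches what you expect (the `ETELEPATHY` error below is made up):

```bash
# Extend the pattern list and smoke-test it against a sample log line.
TRANSIENT_PATTERNS="ETIMEDOUT|ECONNRESET|ENOTFOUND|rate limit|socket hang up|npm ERR! network|fetch failed|503 Service|502 Bad Gateway|504 Gateway|Connection reset|CERT_HAS_EXPIRED"
TRANSIENT_PATTERNS="$TRANSIENT_PATTERNS|ETELEPATHY"
echo "npm ERR! ETELEPATHY: lost contact with registry" | grep -qiE "$TRANSIENT_PATTERNS" && echo "would retry"
```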
## Why Doesn't GitHub Just Do This Internally?
We're honestly not sure.
## Closing Thoughts
Honestly, this is a lot better. Transient failures now retry themselves, and Slack only pings us when a failure actually needs a human.