Linux Service Incident Response Runbook: From Alert to Root Cause in 30 Minutes

Published on March 27, 2026

An alert fires. A service is down. The clock starts. This runbook gives you a repeatable 30-minute workflow for triaging systemd service incidents on Linux, from the first shell prompt to a documented root cause.

Minutes 0-5: First-Response Triage

Your only goal in the first five minutes is to confirm the scope and severity. Do not fix anything yet.

  • Confirm the service state:
systemctl status <service>
systemctl is-active <service>
  • Check how long it has been down and whether systemd is restarting it in a loop:
systemctl show <service> -p ActiveEnterTimestamp -p NRestarts -p Result
  • Grab the last crash output immediately:
journalctl -u <service> -n 50 --no-pager -o short-iso
  • Check for broader system trouble (OOM, disk, CPU):
dmesg -T | tail -30
df -h
free -m

At the five-minute mark you should know: is this a single-service crash, a restart loop, or a system-wide resource problem?
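The five-minute triage above can be collected in one pass. A minimal sketch, using `example.service` as a placeholder unit name; it writes everything to a timestamped file so the evidence survives any restart you do later:

```shell
#!/usr/bin/env bash
# One-pass triage collector: gathers the minutes 0-5 evidence into a single
# timestamped file. "example.service" is a placeholder; pass your unit name.
svc="${1:-example.service}"
out="/tmp/triage-${svc}-$(date +%Y%m%dT%H%M%S).txt"
{
  echo "== status =="
  systemctl status "$svc" --no-pager -l
  echo "== restart stats =="
  systemctl show "$svc" -p ActiveEnterTimestamp -p NRestarts -p Result
  echo "== last 50 log lines =="
  journalctl -u "$svc" -n 50 --no-pager -o short-iso
  echo "== kernel tail =="
  dmesg -T | tail -30
  echo "== disk =="
  df -h
  echo "== memory =="
  free -m
} > "$out" 2>&1
echo "evidence saved to $out"
```

Running this first means the restart counters and crash output are captured even if a later restart resets them.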

Minutes 5-15: Log Analysis Deep Dive

Now dig into the journal. Filter by the service unit and the time window surrounding the incident.

  • View logs since the last boot, errors only:
journalctl -u <service> -b -p err --no-pager
  • Narrow to a specific time window when the alert fired:
journalctl -u <service> --since "2026-03-27 14:00" --until "2026-03-27 14:30" --no-pager
  • Correlate with kernel and other units that may share dependencies:
journalctl -b -p warning --no-pager | grep -iE "oom|segfault|killed|timeout"
  • If the service writes its own log files outside the journal, check those too:
tail -n 200 /var/log/<service>/*.log | less

Look for the pattern: what was the last successful operation before the failure line? That transition point is usually where the root cause lives.
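If you do not know the exact window, systemd can tell you. A sketch, using `example.service` as a placeholder: `InactiveEnterTimestamp` records when the unit last stopped, and the journal is dumped for the two minutes leading up to that point:

```shell
# Dump the journal for the 2 minutes before the unit's last stop.
# "example.service" is a placeholder; pass your unit name.
svc="${1:-example.service}"
ts="$(systemctl show "$svc" -p InactiveEnterTimestamp --value 2>/dev/null)"
if [ -n "$ts" ] && [ "$ts" != "n/a" ]; then
  # compute "two minutes before the stop" via epoch seconds to avoid
  # timezone-parsing surprises in date(1)
  since="$(date -d "@$(( $(date -d "$ts" +%s) - 120 ))" '+%Y-%m-%d %H:%M:%S')"
  journalctl -u "$svc" --since "$since" --until "$ts" --no-pager
else
  echo "no stop timestamp recorded for $svc this boot"
fi
```

On a unit that has never stopped this boot, the property reads `n/a` and the script says so instead of dumping an empty journal.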

Minutes 15-20: Dependency and Configuration Validation

Many service failures are caused by something the service depends on, not the service itself.

  • List what this service requires and what it is ordered after:
systemctl list-dependencies <service>
systemctl cat <service> | grep -iE "requires|after|wants|bindsto"
  • Verify those dependencies are healthy:
systemctl is-active postgresql nginx redis    # substitute the units yours actually depends on
  • Check for recent config changes (a leading cause of incidents):
find /etc -name "*.conf" -mmin -120 -ls    # modified in the last two hours
systemctl cat <service>
diff <(systemctl cat <service>) /path/to/known-good-backup.service
  • Validate environment files and secrets the unit references:
systemctl show <service> -p EnvironmentFiles -p Environment
cat /etc/default/<service>
  • Check listening ports and connectivity for network services:
ss -tlnp | grep <expected_port>
curl -sf http://localhost:<port>/health || echo "health check failed"
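The dependency checks can also be automated rather than naming units by hand. A sketch, using `example.service` as a placeholder; it walks the `systemctl list-dependencies` output and flags anything that is not active:

```shell
# Flag inactive dependencies of a unit. "example.service" is a placeholder.
svc="${1:-example.service}"
failed=0
# --plain drops the tree-drawing characters; the first line is the unit itself
for dep in $(systemctl list-dependencies "$svc" --plain --no-pager 2>/dev/null | tail -n +2); do
  state="$(systemctl is-active "$dep" 2>/dev/null)" || state="${state:-unknown}"
  if [ "$state" != "active" ]; then
    echo "DEGRADED: $dep is $state"
    failed=1
  fi
done
if [ "$failed" -eq 0 ]; then echo "all listed dependencies are active"; fi
```

Note that `systemctl is-active` exits nonzero for anything other than `active`, so the fallback assignment keeps the printed state (e.g. `inactive`, `failed`) when there is one.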

Minutes 20-25: Rollback and Mitigation Decisions

By now you should have a hypothesis. Choose the right action based on what you found.

If a config change caused the failure:

# Revert the config
cp /etc/<service>/config.bak /etc/<service>/config
systemctl daemon-reload
systemctl restart <service>
systemctl status <service>

If a package update broke the service:

# Check recent package changes
rpm -qa --last | head -20        # RHEL/CentOS
grep -A5 "Start-Date: $(date +%F)" /var/log/apt/history.log    # Debian/Ubuntu (older entries: zcat /var/log/apt/history.log.*.gz)

# Roll back the specific package
apt install <package>=<previous_version>    # Debian/Ubuntu
dnf history undo last                         # RHEL/Fedora

If the service is in a restart loop and you need to stop the bleeding:

systemctl stop <service>
systemctl reset-failed <service>
# Fix the underlying issue, then:
systemctl start <service>

If resource exhaustion caused the crash (OOM, disk full):

# Free up disk space
journalctl --vacuum-size=500M
find /var/log -name "*.gz" -mtime +30 -delete

# Check OOM kills
dmesg -T | grep -i "out of memory"
journalctl -k | grep -i oom
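For the OOM case specifically, it helps to know which process the kernel actually killed, since it is not always the service that alerted. A small sketch; the pattern matches the kernel's "Out of memory: Killed process" log line:

```shell
# Report the most recent OOM victim this boot, if any. The pattern matches
# the kernel's "Out of memory: Killed process <pid> (<name>)" message.
victim="$(journalctl -k --no-pager 2>/dev/null | grep -i 'killed process' | tail -1)"
if [ -n "$victim" ]; then
  echo "last OOM kill: $victim"
else
  echo "no OOM kills recorded this boot"
fi
```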

Minutes 25-30: Verify Recovery and Document

Confirm the service is stable, not just running.

# Confirm it stays up for at least 60 seconds
systemctl restart <service>
sleep 60 && systemctl is-active <service>
# Watch logs in real time for follow-on errors (Ctrl-C to stop)
journalctl -u <service> -f
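A single sleep-and-check only proves one moment in time. A sketch of a slightly stronger check, using `example.service` as a placeholder: poll the unit every five seconds for a minute and report the first flap:

```shell
# Poll a unit every 5s for 60s and report the first time it leaves "active".
# "example.service" is a placeholder; pass your unit name as the first argument.
svc="${1:-example.service}"
stable=yes
for i in $(seq 1 12); do
  state="$(systemctl is-active "$svc" 2>/dev/null)" || state="${state:-unknown}"
  if [ "$state" != "active" ]; then
    stable=no
    echo "unit $svc left active state within $((i * 5))s (state: $state)"
    break
  fi
  sleep 5
done
if [ "$stable" = "yes" ]; then echo "unit $svc stayed active for 60 seconds"; fi
```

This catches the slow-motion restart loop that a single `is-active` call right after restart would miss.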

Post-Incident Template

Copy this into your incident tracker before you close the alert. Fill it in while the details are fresh.

## Incident Report
**Service:**
**Alert fired at:**
**Acknowledged at:**
**Resolved at:**
**Duration:**
**Severity:** (P1/P2/P3/P4)
**On-call responder:**

## Timeline
- HH:MM - Alert received
- HH:MM - Triage started, confirmed [service] down/degraded
- HH:MM - Root cause identified: [description]
- HH:MM - Mitigation applied: [action taken]
- HH:MM - Service confirmed stable

## Root Cause
[One paragraph: what failed and why]

## What Changed
[Package update / config change / traffic spike / upstream dependency / unknown]

## Mitigation Applied
[Rollback / restart / config revert / resource cleanup]

## Prevention Actions
- [ ] [Action item with owner and due date]
- [ ] [Action item with owner and due date]
- [ ] [Add monitoring/alerting for the gap that allowed this]

Key Principles

  • Triage first, fix second. Understand the blast radius before you touch anything.
  • Preserve evidence. Copy logs and timestamps before restarting services.
  • One change at a time. If you change two things and the service recovers, you do not know which one fixed it.
  • Write it down while it is fresh. The post-incident report written 10 minutes after resolution is worth ten times more than one written next week.
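"Preserve evidence" can be a one-command habit. A sketch, using `example.service` as a placeholder, that archives unit state, properties, and logs before any restart wipes them:

```shell
# Snapshot unit state, properties, and journal into a tarball BEFORE a
# restart destroys the evidence. "example.service" is a placeholder.
svc="${1:-example.service}"
dir="$(mktemp -d /tmp/evidence-XXXXXX)"
systemctl status "$svc" --no-pager -l > "$dir/status.txt" 2>&1
systemctl show "$svc"                 > "$dir/properties.txt" 2>&1
journalctl -u "$svc" -b --no-pager    > "$dir/journal.txt" 2>&1
tar -czf "$dir.tar.gz" -C "$(dirname "$dir")" "$(basename "$dir")"
echo "evidence archived at $dir.tar.gz"
```

Attach the tarball to the incident report; it is what makes the root-cause paragraph defensible a week later.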