An alert fires. A service is down. The clock starts. This runbook gives you a repeatable 30-minute workflow for triaging systemd service incidents on Linux, from the first shell prompt to a documented root cause.
## Minutes 0-5: First-Response Triage
Your only goal in the first five minutes is to confirm the scope and severity. Do not fix anything yet.
- Confirm the service state:

  ```shell
  systemctl status <service>
  systemctl is-active <service>
  ```

- Check how long it has been down and whether systemd is restarting it in a loop:

  ```shell
  systemctl show <service> -p ActiveEnterTimestamp -p NRestarts -p Result
  ```

- Grab the last crash output immediately:

  ```shell
  journalctl -u <service> -n 50 --no-pager -o short-iso
  ```

- Check for broader system trouble (OOM, disk, CPU):

  ```shell
  dmesg -T | tail -30
  df -h
  free -m
  ```

At the five-minute mark you should know: is this a single-service crash, a restart loop, or a system-wide resource problem?
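If you triage often, the first pass is easy to script. A minimal sketch (the function name and the 5-restart loop threshold are my own examples, not standard tooling):

```shell
# Quick first-pass triage for a systemd unit; flags a possible
# restart loop. The threshold of 5 is an arbitrary example.
triage_unit() {
    unit="$1"
    state=$(systemctl is-active "$unit")
    restarts=$(systemctl show "$unit" -p NRestarts --value)
    echo "unit=$unit state=$state restarts=$restarts"
    if [ "${restarts:-0}" -gt 5 ]; then
        echo "WARNING: possible restart loop (NRestarts=$restarts)"
    fi
}
# Usage: triage_unit myapp.service
```

One line of output per unit keeps this easy to paste into the incident timeline later.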
## Minutes 5-15: Log Analysis Deep Dive
Now dig into the journal. Filter by the service unit and the time window surrounding the incident.
- View logs since the last boot, errors only:

  ```shell
  journalctl -u <service> -b -p err --no-pager
  ```

- Narrow to a specific time window when the alert fired:

  ```shell
  journalctl -u <service> --since "2026-03-27 14:00" --until "2026-03-27 14:30" --no-pager
  ```

- Correlate with kernel and other units that may share dependencies:

  ```shell
  journalctl -b -p warning --no-pager | grep -iE "oom|segfault|killed|timeout"
  ```

- If the service writes its own log files outside the journal, check those too:

  ```shell
  tail -200 /var/log/<service>/*.log | less
  ```

Look for the pattern: what was the last successful operation before the failure line? That transition point is usually where the root cause lives.
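Finding that transition point by eye is tedious in a noisy log. A grep sketch that prints the first error-level line along with the lines that preceded it (the function name, error pattern, and context count are examples, not from the runbook):

```shell
# Print the first line matching an error pattern, plus the 10 lines
# of context before it; pass "-" to read from a pipe.
find_failure_transition() {
    grep -iE -m1 -B10 "error|fatal|panic" "$1"
}
# Typical use against the journal (unit name is a placeholder):
#   journalctl -u myapp.service -b --no-pager | find_failure_transition -
```

The `-m1` stops at the first match, so the last context line shown is usually the final successful operation.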
## Minutes 15-20: Dependency and Configuration Validation
Many service failures are caused by something the service depends on, not the service itself.
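That makes a quick health sweep of every declared dependency worthwhile before you stare at the service itself. A sketch (the function name and unit name are placeholders):

```shell
# Report the health of every declared dependency of a unit,
# one "dependency  state" line per unit.
check_deps() {
    systemctl list-dependencies --plain --no-pager "$1" \
        | while read -r dep; do
            printf '%-40s %s\n' "$dep" "$(systemctl is-active "$dep")"
        done
}
# Usage: check_deps myapp.service
```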
- List what this service requires and what it is ordered after:

  ```shell
  systemctl list-dependencies <service>
  systemctl cat <service> | grep -iE "requires|after|wants|bindsto"
  ```

- Verify those dependencies are healthy:

  ```shell
  systemctl is-active postgresql nginx redis
  ```

- Check for recent config changes (the number-one cause of incidents):

  ```shell
  find /etc -name "*.conf" -newer /var/run/utmp -mmin -120 -ls
  systemctl cat <service>
  diff <(systemctl cat <service>) /path/to/known-good-backup.service
  ```

- Validate environment files and secrets the unit references:

  ```shell
  systemctl show <service> -p EnvironmentFiles -p Environment
  cat /etc/default/<service>
  ```

- Check listening ports and connectivity for network services:

  ```shell
  ss -tlnp | grep <expected_port>
  curl -sf http://localhost:<port>/health || echo "health check failed"
  ```

## Minutes 20-25: Rollback and Mitigation Decisions
By now you should have a hypothesis. Choose the right action based on what you found.
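Before applying any fix, snapshot the evidence: restarting or stopping a unit can destroy state you will want for the post-incident report. A sketch, with an example `/tmp` destination and a placeholder unit name:

```shell
# Snapshot recent logs and unit state before mutating anything;
# prints the directory the evidence landed in.
preserve_evidence() {
    unit="$1"
    dir="/tmp/incident-$(date +%Y%m%d-%H%M%S)"
    mkdir -p "$dir"
    journalctl -u "$unit" -n 1000 --no-pager > "$dir/journal.log"
    systemctl status "$unit" --no-pager > "$dir/status.txt" 2>&1 || true
    echo "$dir"
}
# Usage: evidence_dir=$(preserve_evidence myapp.service)
```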
If a config change caused the failure:

```shell
# Revert the config
cp /etc/<service>/config.bak /etc/<service>/config
systemctl daemon-reload
systemctl restart <service>
systemctl status <service>
```

If a package update broke the service:
```shell
# Check recent package changes
rpm -qa --last | head -20                                          # RHEL/CentOS
zgrep -A5 "Start-Date: $(date +%Y-%m-%d)" /var/log/apt/history.log*  # Debian/Ubuntu

# Roll back the specific package
apt install <package>=<previous_version>   # Debian/Ubuntu
dnf history undo last                      # RHEL/Fedora
```

If the service is in a restart loop and you need to stop the bleeding:
```shell
systemctl stop <service>
systemctl reset-failed <service>
# Fix the underlying issue, then:
systemctl start <service>
```

If resource exhaustion caused the crash (OOM, disk full):
```shell
# Free up disk space
journalctl --vacuum-size=500M
find /var/log -name "*.gz" -mtime +30 -delete

# Check OOM kills
dmesg -T | grep -i "out of memory"
journalctl -k | grep -i oom
```

## Minutes 25-30: Verify Recovery and Document
Confirm the service is stable, not just running.
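The commands below give a quick spot check. For stronger confidence, this sketch actively watches for flapping over a full minute (the function name, unit name, and 60-second window are my own examples):

```shell
# Poll a unit every 5 seconds for a minute; fail fast if it flaps.
watch_stability() {
    unit="$1"; i=0
    while [ "$i" -lt 12 ]; do
        sleep 5
        i=$((i + 1))
        state=$(systemctl is-active "$unit")
        echo "t=$((i * 5))s state=$state"
        [ "$state" = "active" ] || { echo "unit flapped"; return 1; }
    done
    echo "stable for 60s"
}
# Usage: systemctl restart myapp.service && watch_stability myapp.service
```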
```shell
# Confirm it stays up for at least 60 seconds
systemctl restart <service>
sleep 60 && systemctl is-active <service>

# Watch logs in real time for follow-on errors
journalctl -u <service> -f
```

## Post-Incident Template
Copy this into your incident tracker before you close the alert. Fill it in while the details are fresh.
## Incident Report
**Service:**
**Alert fired at:**
**Acknowledged at:**
**Resolved at:**
**Duration:**
**Severity:** (P1/P2/P3/P4)
**On-call responder:**
## Timeline
- HH:MM - Alert received
- HH:MM - Triage started, confirmed [service] down/degraded
- HH:MM - Root cause identified: [description]
- HH:MM - Mitigation applied: [action taken]
- HH:MM - Service confirmed stable
## Root Cause
[One paragraph: what failed and why]
## What Changed
[Package update / config change / traffic spike / upstream dependency / unknown]
## Mitigation Applied
[Rollback / restart / config revert / resource cleanup]
## Prevention Actions
- [ ] [Action item with owner and due date]
- [ ] [Action item with owner and due date]
- [ ] [Add monitoring/alerting for the gap that allowed this]

## Key Principles
- Triage first, fix second. Understand the blast radius before you touch anything.
- Preserve evidence. Copy logs and timestamps before restarting services.
- One change at a time. If you change two things and the service recovers, you do not know which one fixed it.
- Write it down while it is fresh. The post-incident report written 10 minutes after resolution is worth ten times more than one written next week.
This work is licensed under a Creative Commons Attribution-NonCommercial 2.5 License.