DevOps · 2026-05-18 · 10 min read

Sample incident response runbook for startups, 2026

A complete P1-P4 runbook template: severity definitions, on-call rotation, communication channels, escalation matrix, and postmortem template - ready to apply at Vietnamese startups of 5-50 people.

TL;DR

A good runbook has five parts: a severity matrix, roles and on-call, a communication plan, a response checklist, and a postmortem template. Without a runbook, every incident turns into a crisis. Copy the template below, fill in your own details, and roll it out within a day. Alternatively, use TND Server Management to get a ready-made runbook and on-call team.

Why every startup needs an incident response runbook

In 2026, Vietnamese B2B SaaS customers typically ask for an uptime SLA of ≥ 99.9%, which allows only about 43 minutes of downtime per month. Without a runbook:

Every incident is handled ad hoc - 30-60 minutes are lost just working out who does what
Customers wait 3-4 hours without an update → trust is lost
The founder/CTO gets called for every incident, so the team cannot scale
The same bug recurs because there is no postmortem
SOC 2 / ISO 27001 audits fail for lack of a documented incident process
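
The 43-minute figure follows directly from the SLA percentage. A minimal sketch of the arithmetic, assuming a 30-day month (the function name is illustrative, not part of any tool):

```python
def downtime_budget_minutes(sla_percent: float, days: int = 30) -> float:
    """Return the downtime allowed, in minutes, over a period of `days` days."""
    total_minutes = days * 24 * 60
    return total_minutes * (1 - sla_percent / 100)


# Compare a few common SLA levels over a 30-day month.
for sla in (99.0, 99.9, 99.99):
    print(f"{sla}% -> {downtime_budget_minutes(sla):.1f} min / 30 days")
```

At 99.9% the budget is about 43.2 minutes, so a single slow incident can use up the whole month.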

"A runbook is not for handling an incident once it happens - it is for keeping an incident from becoming a crisis. The difference between 30 minutes and 6 hours of downtime is often a two-page runbook." - SRE lead, TND.

Part 1 - Severity matrix (P1/P2/P3/P4)

Classifying severity correctly means matching the response to the right resources. Do not treat every alert as a P1:

Severity	Definition	Response SLA	Resolution SLA	Examples
P1 - Critical	Entire system down, data loss, security breach	15 minutes	4 hours	Production API returns 500 for all requests, database master crash, ransomware
P2 - High	A core feature is down, > 30% of users affected	30 minutes	8 hours	Login flow fails, payment gateway errors, search returns empty results
P3 - Medium	A secondary feature is down, < 10% of users affected	2 business hours	2 business days	Delayed email notifications, slow report dashboard, intermittent image upload failures
P4 - Low	Cosmetic or UX bug, no business impact	1 business day	2 weeks	Typo, button alignment, missing icon
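
To make the matrix actionable, the response and resolution targets can be turned into concrete deadlines at the moment an incident is detected. The sketch below is one possible encoding, not a prescribed implementation: the names `SeveritySla`, `SLA`, and `deadlines` are hypothetical, and only P1 and P2 are covered because P3 and P4 are measured in business time and would need a working calendar.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class SeveritySla:
    response: timedelta
    resolve: timedelta


# Wall-clock targets copied from the severity matrix.
# P3/P4 are omitted: their targets are in business hours/days.
SLA = {
    "P1": SeveritySla(response=timedelta(minutes=15), resolve=timedelta(hours=4)),
    "P2": SeveritySla(response=timedelta(minutes=30), resolve=timedelta(hours=8)),
}


def deadlines(severity: str, detected_at: datetime) -> tuple[datetime, datetime]:
    """Return (respond_by, resolve_by) for a P1 or P2 incident."""
    sla = SLA[severity]
    return detected_at + sla.response, detected_at + sla.resolve


respond_by, resolve_by = deadlines("P1", datetime(2026, 5, 18, 3, 12))
print(f"respond by {respond_by:%H:%M}, resolve by {resolve_by:%H:%M}")
```

Posting these two timestamps in the incident channel gives everyone the same clock to work against.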

Rules for downgrading/upgrading severity

When an incident is first detected, default to the higher severity - downgrade once more information is available
If more than 3 customers complain within 30 minutes → upgrade to P1
If only 1-2 isolated users are affected and a workaround exists → keep it at P3
The Incident Commander (IC) may override the severity based on business context
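
These rules can be written down as a small decision function so that triage is consistent across on-call engineers. This is a sketch under assumptions: the function name, the parameter names, and the order of precedence (IC override first, then the complaint rule, then the workaround rule) are illustrative choices, since the rules above do not state which one wins when several apply.

```python
from typing import Optional


def triage(
    initial: str,
    complaints_last_30_min: int,
    affected_users: int,
    workaround_exists: bool,
    ic_override: Optional[str] = None,
) -> str:
    """Return the severity after applying the upgrade/downgrade rules."""
    # Assumed precedence: the IC's call beats every automatic rule.
    if ic_override is not None:
        return ic_override
    # More than 3 customer complaints within 30 minutes -> P1.
    if complaints_last_30_min > 3:
        return "P1"
    # 1-2 isolated users with a workaround -> P3.
    if 1 <= affected_users <= 2 and workaround_exists:
        return "P3"
    # Otherwise keep the (deliberately high) initial severity.
    return initial


print(triage("P2", complaints_last_30_min=4, affected_users=50, workaround_exists=False))
```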

Part 2 - Roles + on-call rotation

Every incident needs clearly assigned roles. Avoid the situation where "everyone is typing commands and nobody is updating customers":

Role	Responsibility	Who takes it
Incident Commander (IC)	Decides priority, escalates, does not fix hands-on	Tech lead / Senior engineer
Tech Lead	Diagnoses root cause, deploys the fix	On-call engineer
Communications Lead	Updates the status page, emails customers, posts to internal Slack	CSM / PM / Marketing
Scribe	Records the timeline and logs every action - used for the postmortem	Junior engineer / Designated person

On-call rotation template (1 week per person)

# Schedule example - 4-engineer rotation
# Each person is on call for 1 week (Mon-Sun); handoff is at 9:00 on Monday

Week 1 (May 19-25):    Alice   (primary) / Bob     (secondary)
Week 2 (May 26-Jun 1): Bob     (primary) / Charlie (secondary)
Week 3 (Jun 2-8):      Charlie (primary) / Dave    (secondary)
Week 4 (Jun 9-15):     Dave    (primary) / Alice   (secondary)

# Compensation
- $40-100/week on-call allowance (by seniority)
- 1 day off in lieu if paged more than 3 times outside working hours that week
- If paged after 22:00, Monday of the following week is a day off

# Handoff checklist (every Monday at 9:00)
- Review open incidents from the previous week
- Brief the incoming engineer on which alerts are currently noisy (false positives)
- Update the docs if any new service was deployed
- Verify the on-call rotation in PagerDuty/Opsgenie
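
The rotation follows a simple pattern: each week's secondary becomes the next week's primary. A minimal sketch that generates such a schedule for any team size is shown here; the function name is hypothetical and the start date is just an arbitrary Monday chosen for illustration, so it does not need to match the sample dates in the template.

```python
from datetime import date, timedelta


def build_rotation(engineers: list[str], first_monday: date, weeks: int):
    """Yield (week_start, week_end, primary, secondary) for each week."""
    if first_monday.weekday() != 0:
        raise ValueError("rotation must start on a Monday")
    for i in range(weeks):
        start = first_monday + timedelta(weeks=i)
        end = start + timedelta(days=6)  # Sunday of the same week
        primary = engineers[i % len(engineers)]
        # This week's secondary is next week's primary.
        secondary = engineers[(i + 1) % len(engineers)]
        yield start, end, primary, secondary


team = ["Alice", "Bob", "Charlie", "Dave"]
for start, end, primary, secondary in build_rotation(team, date(2026, 5, 18), 4):
    print(f"{start:%b %d} - {end:%b %d}: {primary} (primary) / {secondary} (secondary)")
```

Generating the schedule rather than typing it by hand avoids gaps or overlaps when someone joins or leaves the rotation; the output still has to be entered into (or checked against) the paging tool.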

Part 3 - Communication plan

Poor communication is one of the main reasons customers move to a competitor after an incident.

Internal channels

#incidents: public Slack channel; every incident starts here
#incident-{number}: ephemeral Slack channel for each P1/P2 - created automatically by a bot
Voice bridge: a fixed Google Meet / Zoom link for P1s - the IC opens it immediately
PagerDuty/Opsgenie: pages the on-call engineer via SMS + voice call

External communication

Status page: status.yourdomain.com - update every 15-30 minutes during a P1
Customer email (P1 only): within 1 hour of confirming the incident
Twitter/Facebook: mirror every status page update with a post
Public postmortem: within 5 business days of resolution

Status page message templates

# Investigating (first update)
We are investigating reports of [symptom]. Affected: [scope].
We will update in 30 minutes.

# Identified
We have identified the root cause as [cause]. We are working on a fix.
Estimated time to resolution: [time].

# Monitoring
A fix has been deployed. We are monitoring for stability.

# Resolved
This incident was fully resolved at [time].
Duration: [X] minutes. Impact: [scope].
A postmortem will be published within 5 business days.
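
Filling in the bracketed placeholders by hand at 3 a.m. is error-prone, so the templates can be rendered from code. The sketch below uses Python's standard `string.Template`; the stage keys, placeholder names, and the `render_status` helper are illustrative assumptions, and the wording is adapted from the templates above.

```python
from string import Template

# One template per status page stage; placeholder names are illustrative.
STATUS_TEMPLATES = {
    "investigating": Template(
        "We are investigating reports of $symptom. Affected: $scope.\n"
        "We will update in $next_update_min minutes."
    ),
    "identified": Template(
        "We have identified the root cause as $cause. We are working on a fix.\n"
        "Estimated time to resolution: $eta."
    ),
    "monitoring": Template(
        "A fix has been deployed. We are monitoring for stability."
    ),
    "resolved": Template(
        "This incident was fully resolved at $resolved_at.\n"
        "Duration: $duration_min minutes. Impact: $scope.\n"
        "A postmortem will be published within 5 business days."
    ),
}


def render_status(stage: str, **fields: object) -> str:
    """Render the message for a stage.

    substitute() raises KeyError when a placeholder is missing,
    so an incomplete update cannot be rendered by accident.
    """
    return STATUS_TEMPLATES[stage].substitute(**fields)


print(render_status(
    "investigating",
    symptom="elevated API errors",
    scope="all customers",
    next_update_min=30,
))
```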

Part 4 - Response checklist (P1 workflow)

When an alert pages at 3 a.m., the on-call engineer should NOT rely on memory. Follow the checklist:

00:00 - Acknowledge: Ack the page within 5 minutes. If the primary cannot, the secondary is paged automatically after 5 minutes.
00:01 - Assess: Open the Grafana/Zabbix dashboard. Confirm whether this is a real incident or a false positive.
00:05 - Declare: Post in #incidents: "P1 incident declared. Topic: [X]. Joining bridge."
00:05 - Page IC: Page the Incident Commander (usually the tech lead). The IC opens the voice bridge.
00:10 - Communication: The Comms Lead posts "Investigating" on the status page.
00:15 - Diagnose: The Tech Lead runs the diagnostic scripts. Document every command in the incident channel.
00:30 - Mitigate: Apply a quick mitigation (rollback, scale up, traffic shift) - the permanent fix can wait.
00:45 - Monitor: Confirm that metrics stay at normal levels for 15 consecutive minutes.
01:00 - Resolve: Update the status page to "Resolved". Page out.
+24h - Postmortem: Draft the postmortem within 24 hours and publish it within 5 business days.
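
The checklist offsets can be converted into wall-clock targets when the page fires, and the 5-minute acknowledgement rule can be checked mechanically. The following is a sketch only: `P1_CHECKLIST`, `schedule`, and `should_page_secondary` are hypothetical names, and in practice the escalation timeout is normally configured in the paging tool itself rather than in custom code.

```python
from datetime import datetime, timedelta
from typing import Optional

# (minutes after the page, step) taken from the P1 checklist.
P1_CHECKLIST = [
    (0, "Acknowledge"), (1, "Assess"), (5, "Declare"), (5, "Page IC"),
    (10, "Communication"), (15, "Diagnose"), (30, "Mitigate"),
    (45, "Monitor"), (60, "Resolve"),
]

ACK_TIMEOUT = timedelta(minutes=5)


def schedule(paged_at: datetime) -> list[tuple[datetime, str]]:
    """Return the target wall-clock time for each checklist step."""
    return [(paged_at + timedelta(minutes=m), step) for m, step in P1_CHECKLIST]


def should_page_secondary(
    paged_at: datetime, acked_at: Optional[datetime], now: datetime
) -> bool:
    """True when nobody has acknowledged and the 5-minute window has passed."""
    return acked_at is None and now - paged_at >= ACK_TIMEOUT


paged = datetime(2026, 5, 18, 3, 12)
for target, step in schedule(paged):
    print(f"{target:%H:%M}  {step}")
```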

Part 5 - Postmortem template

# Postmortem - INC-{number}: {Brief title}

**Date:** YYYY-MM-DD
**Duration:** X minutes (start: HH:MM, resolve: HH:MM)
**Severity:** P1
**Author:** {name}

## Summary (1 paragraph, < 100 words)
Brief description of what happened and the impact.

## Impact
- Customers affected: N (% of total)
- Revenue impact: $X
- SLA credit owed: $Y

## Timeline (chronological, all timestamps UTC+7)
- 03:12 - PagerDuty alert: API 5xx rate > 5%
- 03:14 - On-call engineer (Alice) acknowledges
- 03:18 - IC (Bob) joins the bridge
- 03:25 - Identified: database connection pool exhausted
- 03:32 - Mitigation: scale up RDS instance, restart app servers
- 03:48 - Metrics return to normal
- 03:55 - Incident declared resolved

## Root cause
Detailed technical explanation, 2-4 paragraphs.
What chain of events led here? Why didn't existing safeguards catch it?

## What went well
- Quick acknowledgement (2 min)
- Clear communication on status page
- Existing rollback script worked

## What went poorly
- Alert threshold too high - missed early warning
- No runbook for database connection issues
- Three engineers tried to fix it in parallel, with conflicting changes

## Action items
| # | Action | Owner | Due | Status |
|---|--------|-------|-----|--------|
| 1 | Lower 5xx alert threshold to 1% | Alice | 2026-05-25 | Open |
| 2 | Write runbook for DB issues | Bob | 2026-05-30 | Open |
| 3 | Add chaos engineering DB test | Charlie | 2026-06-15 | Open |
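
The Duration field and the time-to-acknowledge figure in the template can be derived from the timeline rather than estimated. A minimal sketch using the sample timestamps from the template; the event labels and the `minutes_between` helper are hypothetical, and all events are assumed to fall on the same day.

```python
from datetime import datetime

# Sample timeline from the postmortem template (same day, UTC+7).
TIMELINE = [
    ("03:12", "alert"),
    ("03:14", "ack"),
    ("03:25", "identified"),
    ("03:32", "mitigation"),
    ("03:55", "resolved"),
]


def minutes_between(timeline, start_event: str, end_event: str) -> int:
    """Return whole minutes elapsed between two named timeline events."""
    times = {event: datetime.strptime(ts, "%H:%M") for ts, event in timeline}
    return int((times[end_event] - times[start_event]).total_seconds() // 60)


print("time to ack:", minutes_between(TIMELINE, "alert", "ack"), "min")
print("duration:", minutes_between(TIMELINE, "alert", "resolved"), "min")
```

For the sample timeline this gives 2 minutes to acknowledge and a 43-minute incident, which is roughly the entire monthly downtime allowance under a 99.9% SLA.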

Minimum tool stack for startups

Layer	Tool	Cost/month	Free alternative
Monitoring	Datadog / Grafana Cloud	$15-200	Self-hosted Prometheus + Grafana on a TND VPS
Alerting	PagerDuty / Opsgenie	$19-29/user	Grafana Alerting + Telegram bot
Status page	Statuspage.io / Better Stack	$29-99	Self-hosted Cachet or Uptime Kuma
Incident management	FireHydrant / Rootly	$0-99/user	Slack workflow + Notion template
Postmortem	Jeli / Blameless	$0-50/user	Google Docs template + Linear

For more, read the Zabbix vs Grafana+Prometheus comparison to pick a suitable monitoring stack.

3 common mistakes when writing a runbook

Too detailed for anyone to read: a runbook over 20 pages is dead on arrival. Each P1 runbook should be under 2 pages and action-oriented.
No drills: an untested runbook rests on unverified assumptions. Run a game day every quarter - the IC simulates a P1 and the team responds for real.
No updates after the postmortem: every postmortem action item needs an owner, a deadline, and verification. Without tracking, the runbook is outdated within 6 months.
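
Tracking action items does not need a dedicated tool to start with. As a rough sketch, a weekly check for overdue items could look like the following; the `ActionItem` class, the "Done" status value, and the `overdue` helper are assumptions for illustration, and the sample rows reuse the action items from the postmortem template.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class ActionItem:
    action: str
    owner: str
    due: date
    status: str = "Open"  # "Done" is an assumed closing status


def overdue(items: list[ActionItem], today: date) -> list[ActionItem]:
    """Return items that are past their due date and not yet done."""
    return [item for item in items if item.status != "Done" and item.due < today]


ITEMS = [
    ActionItem("Lower 5xx alert threshold to 1%", "Alice", date(2026, 5, 25)),
    ActionItem("Write runbook for DB issues", "Bob", date(2026, 5, 30)),
    ActionItem("Add chaos engineering DB test", "Charlie", date(2026, 6, 15)),
]

for item in overdue(ITEMS, today=date(2026, 6, 1)):
    print(f"OVERDUE: {item.action} (owner: {item.owner}, due {item.due})")
```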

"Test the runbook when there is no incident. If you only read it while firefighting, you will find its gaps at the worst possible moment." - Google SRE handbook.

Outsourcing on-call to TND

If your startup does not yet have enough people for a 4-engineer rotation, you can outsource after-hours on-call to TND:

The TND SRE team takes pages outside working hours (18:00-9:00, weekends, public holidays and Tet)
L1 triages within 15 minutes - known issues are fixed by following the runbook
L2 escalates to the in-house team by phone when product knowledge is needed
Cost: from 8 million VND/month for 24/7 coverage - far cheaper than hiring a night-shift engineer

See the details of the TND Server Management and Business Hosting plans, which bundle on-call.
