DevOps/SRE Interviews: Incident & Debug Question Bank
February 14, 2026

TL;DR
Most DevOps and SRE interviews are not trivia—they’re incident simulations. Interviewers listen for how you triage, communicate, reduce blast radius, validate hypotheses, and choose safe fixes under uncertainty. This question bank gives you incident-style prompts (latency spikes, error floods, failed deploys, Kubernetes outages, database pain, networking weirdness) plus reusable answer templates for triage, debugging, rollbacks, and postmortems. Practice with one loop: confirm impact, stabilize first, narrow scope, test one hypothesis at a time, then end with prevention and learning.
What makes SRE and DevOps interviews different
In many SWE interviews, “correctness” means the function returns the right output. In SRE and DevOps interviews, “correctness” also includes: the service stays up while you troubleshoot, you reduce impact fast, you don’t make the incident worse, you communicate clearly, and you leave the system safer than you found it.
A good answer is calm, structured, and safety-first. If you’re unsure what to say next, fall back to: impact → stabilize → narrow scope → test → smallest safe change → verify → prevent.
The core answer templates
Think of these as your default talk tracks. They keep you from rambling or freezing.
Template A: The TRIAGE loop
Use this whenever you hear: “Production is down. What do you do?”
Talk track
- Confirm impact and scope
- Stabilize and stop the bleeding
- Narrow blast radius
- Form a hypothesis
- Validate with the fastest signal
- Apply the smallest safe fix
- Verify and monitor
- Capture timeline and action items
Short version you can say out loud:
First I confirm user impact and scope. Then I stabilize, reduce blast radius, and test hypotheses with the fastest signals before applying the smallest safe change. After recovery, I capture learnings and prevention steps.
Template B: Signals → Hypothesis → Test
Use this when the interviewer asks: “How would you debug this?”
- Signals: what changed (latency, errors, saturation, traffic)
- Hypothesis: the most likely failure mode
- Test: one quick experiment to confirm or falsify
- Next: if falsified, move to the next hypothesis
Template C: Rollback and risk
Use this for deploy incidents.
- Can we roll back safely?
- What is the risk of rollback vs forward-fix?
- What guardrails would have prevented this?
Template D: Postmortem summary
Use this to end strong.
- What happened (timeline)
- Why it happened (contributing factors)
- What we changed immediately
- What we will change to prevent recurrence
- What we will measure to prove it
A map of incident types and what your answer should contain
| Incident type | First signals to check | Most common root-cause buckets | What a strong answer includes |
|---|---|---|---|
| Latency spike | p95/p99, saturation, downstream latency | load, dependency, lock/GC, cache misses | critical path, isolation, quick mitigation |
| Error rate jump | logs, status codes, deploy diff | bad deploy, config, dependency outage | rollback plan, canary, blast radius |
| Kubernetes outage | restarts, events, probes | bad rollout, limits, DNS, CNI | events → probes → safe rollback |
| DB pain | lag, slow queries, locks | missing index, hot key, pool | plan checks, throttling, caching |
| Networking weirdness | DNS errors, timeouts, loss | DNS, LB, TLS, routing | layer-by-layer isolation |
| Capacity incident | CPU/mem, queue depth, autoscaler | traffic spike, leak, scaling limits | shedding, backpressure, capacity model |
DevOps / SRE interview question bank
Use these as rotations. Don’t try to “cover everything once.” Repeat until your structure is automatic.
Category 1: Incident response fundamentals
- Walk me through your first five minutes during a Sev-1 incident.
- How do you decide whether to page more people or keep the response small?
- What does “stabilize first” mean in practice?
- How do you communicate status updates to non-engineers during an outage?
- What is your checklist before you declare the incident resolved?
Answer template to practice:
Impact → stabilize → narrow scope → hypothesis and test → smallest safe change → verify → postmortem.
Incident mini-story: Latency spike after a traffic surge
Scenario
A major customer campaign launches. Traffic doubles. Your service is “up” but p95 latency is now several seconds.
Questions interviewers love
What do you check first? How do you separate “more traffic” from “less capacity”? When do you shed load? What buys you time?
Answer template
Confirm whether saturation is local (CPU, memory, thread pool) or downstream (DB, cache, external API). Identify the critical path and the slowest dependency. Apply mitigation (caching, rate limiting, queueing, feature degrade, traffic shaping). Verify recovery and watch for secondary failures.
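One of the "buy time" mitigations above is rate limiting. A minimal token-bucket sketch of the idea (class and parameter names are illustrative, not from any specific library):

```python
import time

class TokenBucket:
    """Minimal token-bucket rate limiter: refills `rate` tokens/sec, bursts up to `capacity`."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

Requests beyond the burst are rejected (shed) instead of queueing behind a saturated dependency, which is what keeps p95 from collapsing further.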
Category 2: Observability and debugging
- If you only had one dashboard for a service, what would you put on it and why?
- How do you debug when logs are noisy and metrics are incomplete?
- How do you decide whether you need tracing vs logs vs metrics?
- What’s the fastest way to isolate whether the issue is in your service or a dependency?
- How do you prevent alert fatigue while still catching real incidents?
Answer template:
Start with user-impact signals, then walk “golden signals” style: latency, traffic, errors, saturation. Correlate to the last change and narrow by service boundaries.
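The golden-signals walk above can be sketched as a tiny aggregation over request records (the record shape and field names are illustrative):

```python
def golden_signals(requests):
    """Summarize latency, traffic, and errors from request records.

    Each record is a (latency_seconds, status_code) tuple; shape is illustrative.
    """
    if not requests:
        return {"traffic": 0, "error_rate": 0.0, "p95_latency": 0.0}
    latencies = sorted(lat for lat, _ in requests)
    errors = sum(1 for _, status in requests if status >= 500)
    # Nearest-rank p95: the value at the 95th-percentile position.
    p95_index = min(len(latencies) - 1, int(0.95 * len(latencies)))
    return {
        "traffic": len(requests),
        "error_rate": errors / len(requests),
        "p95_latency": latencies[p95_index],
    }
```

In practice a metrics backend computes these for you; the point of the sketch is that p95 and error rate, not averages, are the first numbers to read off a dashboard.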
Incident mini-story: Error rate jump after a deploy
Scenario
A new deploy goes out. Within minutes, error rate spikes and some regions are worse than others.
Questions interviewers love
Do you roll back immediately? What would stop you? How do you prove it’s the deploy? What do you check first in the diff?
Answer template
Check deploy timeline vs metric change. Compare healthy vs unhealthy regions (config, dependencies, traffic). If rollback is safe, roll back first to stop harm. If rollback is risky, isolate with feature flag disable, traffic shift, or canary pause. Prevent with canary analysis, rollback thresholds, safer migrations, and better tests.
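A canary gate like the one mentioned above can be sketched as a simple decision function (thresholds are illustrative; real gates usually apply statistical tests rather than a fixed ratio):

```python
def canary_verdict(baseline_errors, baseline_total, canary_errors, canary_total,
                   max_ratio=2.0, min_requests=100):
    """Decide whether to wait, roll back, or promote a canary by relative error rate."""
    if canary_total < min_requests:
        return "wait"          # not enough traffic for a meaningful signal yet
    baseline_rate = baseline_errors / max(baseline_total, 1)
    canary_rate = canary_errors / max(canary_total, 1)
    # Roll back if the canary is clearly worse than baseline (with a small floor
    # so a near-zero baseline doesn't make any error trigger rollback).
    if canary_rate > max(baseline_rate * max_ratio, 0.01):
        return "rollback"
    return "promote"
```

The key interview point this encodes: the rollback decision is pre-committed as a metric threshold, so nobody has to argue about it mid-incident.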
Category 3: Deployments and release engineering
- Explain canary vs blue-green and when you choose each.
- How do you design health checks so they actually prevent bad releases?
- How do you handle database migrations safely?
- What does “progressive delivery” mean in your day-to-day work?
- What would you include in a rollback runbook?
Answer template:
Define the safety gate, define the rollback path, and define what metrics block promotion.
Incident mini-story: Kubernetes rollout caused downtime
Scenario
A deployment rolls out. Pods keep restarting. Service becomes unavailable intermittently.
Questions interviewers love
What do you check first: events, logs, or probes? How do probes change incident behavior? What’s your rollback strategy?
Answer template
Confirm whether pods are crash looping or failing readiness. Check events for probe failures, image pull issues, OOM, scheduling. Validate resource requests/limits and startup time. Roll back or pause rollout. Prevent with correct probes, startup probes, sane limits, gradual rollout, and circuit breakers.
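The events-first triage above amounts to mapping common event reasons to root-cause buckets. A hedged sketch (the mapping is illustrative and deliberately incomplete; reasons appear in `kubectl get events` and container statuses):

```python
def classify_pod_event(reason: str) -> str:
    """Map common Kubernetes event/status reasons to triage buckets (illustrative)."""
    buckets = {
        "OOMKilled": "resources: raise memory limit or fix a leak",
        "ImagePullBackOff": "image: bad tag, registry auth, or missing image",
        "ErrImagePull": "image: bad tag, registry auth, or missing image",
        "Unhealthy": "probes: readiness/liveness failing, check probe config and startup time",
        "CrashLoopBackOff": "app: container exits repeatedly, read container logs",
        "FailedScheduling": "capacity: no node satisfies requests/affinity",
    }
    return buckets.get(reason, "unknown: read the full event message and pod logs")
```

Saying these buckets out loud (resources, image, probes, app, scheduling) shows the interviewer you have a checklist rather than a grab bag of commands.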
Category 4: Linux and systems fundamentals
- A host is swapping heavily. What do you check first?
- CPU is high but request rate is normal. How do you reason about this?
- Disk is full on a node. What are safe actions and unsafe actions?
- A process is hanging. How do you inspect it?
- What does it mean when load average is high but CPU is not saturated?
Answer template:
Confirm whether it is CPU, memory, IO, or locks. Use the smallest diagnostic tool first, then go deeper.
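The "high load but CPU not saturated" question above has a crisp answer: on Linux, load average counts runnable and uninterruptible (D-state, usually I/O-bound) tasks, so high load with idle CPU often points at disk/NFS I/O or lock waits. A rough first-pass heuristic (thresholds are illustrative):

```python
def load_verdict(loadavg_1min: float, cpu_count: int) -> str:
    """Rough read of 1-minute load average vs CPU count (illustrative thresholds)."""
    per_cpu = loadavg_1min / max(cpu_count, 1)
    if per_cpu < 0.7:
        return "healthy"
    if per_cpu < 1.5:
        return "busy: check run queue and top consumers"
    # High load with idle CPU usually means tasks blocked in uninterruptible
    # sleep (D state): suspect disk/NFS I/O or lock contention, not CPU.
    return "saturated or blocked: if CPU is idle, suspect I/O wait (D-state tasks)"
```

On a live Linux host you would feed this from `os.getloadavg()[0]` and `os.cpu_count()`, then confirm with the smallest tool first (e.g. checking for D-state processes) before reaching for deeper profiling.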
Incident mini-story: Database replication lag and timeouts
Scenario
Your read replicas lag behind. Some requests time out. The app retries, making the DB even hotter.
Questions interviewers love
How do you reduce blast radius on the database? Do you disable retries or tune them? What is your “buy time” mitigation?
Answer template
Stabilize: stop retry storms, rate-limit expensive endpoints, shed load. Check: slow query log / query plan, locks, pool saturation, replica health. Mitigate: cache hot reads, add an index carefully, route reads, pause heavy jobs. Prevent: safer retry strategy, backpressure, capacity planning, query budgets.
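The "safer retry strategy" in the prevention list usually means capped exponential backoff with full jitter, paired with a retry budget so a struggling database sees fewer retries, not more. A minimal sketch (parameter values are illustrative):

```python
import random

def backoff_schedule(attempts: int, base: float = 0.1, cap: float = 5.0, seed=None):
    """Full-jitter exponential backoff delays, capped at `cap` seconds."""
    rng = random.Random(seed)
    delays = []
    for attempt in range(attempts):
        ceiling = min(cap, base * (2 ** attempt))
        # Full jitter: pick uniformly in [0, ceiling] so retries from many
        # clients spread out instead of arriving in synchronized waves.
        delays.append(rng.uniform(0, ceiling))
    return delays
```

The jitter is what breaks the retry storm: without it, every timed-out client retries at the same instant and re-creates the load spike that caused the timeouts.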
Category 5: Networking and DNS troubleshooting
- How do you debug intermittent timeouts between services?
- What are the top causes of a sudden spike in DNS failures?
- How do you validate whether TLS or certificate issues are causing outages?
- How do you isolate LB vs service vs network issues?
- What tools do you use to trace the path and confirm packet loss?
Answer template:
Work layer by layer: DNS, TLS, routing, LB, service. Compare a known-good path with a failing path.
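The layer-by-layer idea can be sketched with the first two layers in code: resolve the name, then attempt a TCP connect, stopping at the first failing layer. This is a sketch only; real debugging continues with TLS, HTTP, and path checks (dig, curl -v, traceroute/mtr):

```python
import socket

def check_layers(host: str, port: int, timeout: float = 2.0) -> dict:
    """Walk the first layers: DNS resolution, then TCP connect."""
    result = {"dns": False, "tcp": False}
    try:
        socket.getaddrinfo(host, port, proto=socket.IPPROTO_TCP)
        result["dns"] = True
    except socket.gaierror:
        return result          # DNS failed: fix name resolution before anything else
    try:
        with socket.create_connection((host, port), timeout=timeout):
            result["tcp"] = True
    except OSError:
        pass                   # DNS ok but connect failed: suspect LB, firewall, or service
    return result
```

The value of the structure is the stopping rule: each layer that passes eliminates a whole class of causes before you move on.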
Incident mini-story: “It works in one region but not the other”
Scenario
Region A is fine. Region B has elevated errors. Deploy is the same version.
Questions interviewers love
What do you compare first? How do you validate config drift? When do you declare a regional dependency outage?
Answer template
Compare config, secrets, flags, dependencies, quotas, capacity. Validate drift detection and recent changes. Mitigate with traffic shift, region failover, or disabling region-specific features. Prevent with stronger drift controls and consistent release processes.
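The "compare config between regions" step can be sketched as a key-by-key diff over flattened config maps (the key names and `<missing>` sentinel are illustrative):

```python
def config_drift(region_a: dict, region_b: dict) -> dict:
    """Report keys whose values differ (or exist in only one region)."""
    drift = {}
    for key in sorted(set(region_a) | set(region_b)):
        a = region_a.get(key, "<missing>")
        b = region_b.get(key, "<missing>")
        if a != b:
            drift[key] = {"region_a": a, "region_b": b}
    return drift
```

In an interview, the point to land is that drift is found by diffing the healthy region against the unhealthy one, not by reading either config in isolation.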
Category 6: Cloud and infrastructure
- How do you choose instance types and autoscaling policies for a service?
- What metrics drive scaling decisions in your system?
- How do you design multi-zone vs multi-region trade-offs?
- What is your strategy for secrets management and rotation?
- How do you handle rate limits and quotas in cloud environments?
Answer template:
Tie scaling to workload shape. Tie reliability to failure domains. Make trade-offs explicit.
Category 7: CI/CD, automation, and tooling
- What would you automate first if you joined a team with lots of toil?
- How do you prevent “pipeline success” from shipping broken releases?
- What is your strategy for infrastructure as code reviews?
- How do you handle rollback automation safely?
- What is your approach to standardizing environments?
Answer template:
Automate repeatable risk. Keep manual override. Use guardrails and reviews for changes with blast radius.
Category 8: On-call and operational excellence
- How do you define and measure toil?
- What would you do if on-call load is unsustainable?
- How do you decide what to alert on vs what to log?
- How do you create runbooks that people actually use?
- What does “blameless” mean in an incident review?
Answer template:
Reduce toil by improving automation, alert quality, and system safety. Use blameless postmortems to learn and prevent repeats.
Mini drills: practice like an incident, not like a quiz
Pick one mini-story. Give yourself a short timebox. Speak out loud.
Drill one: “Latency spike with rising CPU”
Say your first three checks, name your first hypothesis, name your first test, and name your “buy time” mitigation.
Drill two: “Error jump after deploy”
State when you would roll back, name the metric gate that blocks promotion, and name the fastest mitigation if rollback is risky.
Drill three: “Kubernetes crash loop”
Identify what you check in events, name the probe misconfig you suspect, and state the safe rollback action.
If you want to rehearse out loud under pressure, try Beyz Solo Practice and keep a one-page checklist in Interview cheat sheets.
Incident Snapshot: an answer that interviews reward
A critical API started returning errors right after a config change. Customers were blocked, the on-call channel was noisy, and the quickest risk was making things worse by “thrashing” changes.
I started by confirming impact and scoping affected endpoints, then stabilized by pausing the rollout and shifting traffic away from the failing path. With the blast radius shrinking, I compared a healthy region to an unhealthy one to look for config drift, validated the hypothesis with logs plus a targeted test request, and applied the smallest safe change to restore normal behavior.
After recovery, I wrote a blameless postmortem: a tight timeline, contributing factors (guardrails that failed, signals we didn’t have), and prevention actions. The most important follow-up was making config safer: stricter validation, stronger canary gates, and a rollback runbook that’s fast to execute under stress.
How Beyz + IQB fit into a DevOps and SRE prep loop
A tool shouldn’t “answer incidents” for you. It should help you practice structured thinking.
- Use Interview Questions & Answers to rotate across role-specific prompts.
- Use Interview cheat sheets to store your default talk tracks (triage, hypothesis testing, rollback, postmortem).
- Use Beyz Solo Practice to rehearse incident answers out loud with a timebox.
- Use Beyz Interview Assistant as a lightweight structure reminder, so you keep your answers constraint-driven and safety-first.
- For curated prompt sets you can rotate, use the IQB interview question bank.
If you want your incident answers to feel calm and structured, pick one mini-story above and rehearse it in Beyz Solo Practice with your checklist pinned in Interview cheat sheets. Your goal isn’t to name every tool. It’s to show safe triage, clear hypotheses, and credible trade-offs.
References
- Google SRE — Incident management guide
- Google SRE Book — Effective troubleshooting
- Google SRE Book — Postmortem culture: learning from failure
- DevOps Interview Questions (GitHub)
Frequently Asked Questions
What should I expect in an SRE interview?
Expect scenario questions that test incident response, troubleshooting under uncertainty, and reliability trade-offs. Interviewers want to see a structured triage loop, clear communication, and safety-first operational habits.
How do I answer incident response questions clearly?
Use a consistent template: confirm impact, stabilize first, narrow the blast radius, form a hypothesis, validate with signals, and apply the smallest safe change. End by explaining what you’d do to prevent repeats.
How do I talk about postmortems without blaming people?
Focus on contributing factors and system gaps, not individuals. Describe missing signals, failed guardrails, and the concrete changes you’d make to reduce recurrence. Keep the tone blameless and learning-focused.