Effective Troubleshooting: An Excerpt

An excerpt from the "Effective Troubleshooting" section of the Google SRE book.

Written by Chris Jones

Common Pitfalls

Ineffective troubleshooting sessions are plagued by problems at the Triage, Examine, and Diagnose steps, often because of a lack of deep system understanding. The following are common pitfalls to avoid:

Fixing the first and second common pitfalls is a matter of learning the system in question and becoming experienced with the common patterns used in distributed systems. The third trap is a set of logical fallacies that can be avoided by remembering that not all failures are equally probable. As doctors are taught, “when you hear hoofbeats, think of horses not zebras.” Also remember that, all things being equal, we should prefer simpler explanations.
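
The advice to think of horses before zebras can be made concrete by weighting each candidate cause by how often it actually occurs. The sketch below is an illustration only and does not come from the SRE book: the hypothesis names, the prior probabilities, and the likelihoods are all made-up numbers, and `rank_hypotheses` is a hypothetical helper that applies Bayes' rule to them.

```python
def rank_hypotheses(priors, likelihoods):
    """Rank candidate causes by posterior probability, highest first.

    priors:      how often each cause occurs in general (hypothetical values).
    likelihoods: how likely the observed symptom is if that cause is present.
    """
    # Bayes' rule: posterior is proportional to prior * likelihood.
    unnormalized = {h: priors[h] * likelihoods[h] for h in priors}
    total = sum(unnormalized.values())
    posteriors = {h: weight / total for h, weight in unnormalized.items()}
    return sorted(posteriors.items(), key=lambda item: item[1], reverse=True)


# All numbers below are assumptions chosen for illustration.
PRIORS = {
    "bad config push": 0.60,      # the "horse": common
    "overloaded backend": 0.35,
    "cosmic-ray bit flip": 0.05,  # the "zebra": rare
}
LIKELIHOODS = {
    "bad config push": 0.5,
    "overloaded backend": 0.4,
    "cosmic-ray bit flip": 0.9,   # explains the symptom well, but is rare
}

if __name__ == "__main__":
    for hypothesis, posterior in rank_hypotheses(PRIORS, LIKELIHOODS):
        print(f"{hypothesis}: {posterior:.2f}")
```

Even though the rare cause explains the symptom best in this made-up scenario, its low prior keeps it at the bottom of the ranking, so the common causes are worth testing first.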

Finally, we should remember that correlation is not causation: some correlated events, say packet loss within a cluster and failed hard drives in the cluster, share common causes (in this case, a power outage), though network failure clearly doesn’t cause the hard drive failures nor vice versa. Even worse, as systems grow in size and complexity and as more metrics are monitored, it’s inevitable that there will be events that happen to correlate well with other events, purely by coincidence.
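
The claim that monitoring more metrics makes coincidental correlations inevitable can be checked with a small simulation. The following sketch is an illustration under stated assumptions, not something taken from the original text: independent Gaussian noise series stand in for unrelated metrics, and `pearson` and `max_abs_correlation` are hypothetical helper names. Because the series are independent by construction, any strong correlation it finds is pure coincidence.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def max_abs_correlation(num_metrics, num_samples, seed=0):
    """Strongest |r| between any two of num_metrics independent noise series."""
    rng = random.Random(seed)
    # Each "metric" is independent noise: no metric influences any other.
    series = [
        [rng.gauss(0.0, 1.0) for _ in range(num_samples)]
        for _ in range(num_metrics)
    ]
    best = 0.0
    for i in range(num_metrics):
        for j in range(i + 1, num_metrics):
            best = max(best, abs(pearson(series[i], series[j])))
    return best


if __name__ == "__main__":
    # Assumed sizes: 20 samples per metric, growing numbers of metrics.
    for count in (5, 50, 200):
        strongest = max_abs_correlation(count, num_samples=20)
        print(f"{count} metrics: strongest coincidental |r| = {strongest:.2f}")
```

With a fixed seed, the smaller runs reuse the first series of the larger ones, so the strongest coincidental correlation can only stay the same or grow as metrics are added. A high correlation between two dashboards is therefore weak evidence on its own.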

Understanding failures in our reasoning process is the first step to avoiding them and becoming more effective in solving problems. A methodical approach to knowing what we do know, what we don’t know, and what we need to know makes it simpler and more straightforward to figure out what’s gone wrong and how to fix it.
