To tolerate faults, the first step is to detect them. Most systems don’t have an accurate mechanism of detecting whether a node has failed so most distributed algorithms rely on timeouts to determine if a node is still available. Also, variable network delays sometimes causes a node to be falsely suspected of crashing.
Once a fault is detected, making a system tolerate it is not easy due to no global variables, shared memory, or common knowledge.