Our systems fail. They end up in an inconsistent state. The reason for the failures is unclear, and possibly unknowable.
The first question is whether the failures are inevitable and need to be designed for.
The second question is how to design for failures.
Is there an acceptable level of granularity for failures? In other words, if the unit of work is very large, is it acceptable for it to fail as long as you can retry the entire unit? Or is it better to have smaller units of work and more granular workflow fixes?
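The granularity trade-off can be made concrete with a small sketch. Everything here is hypothetical and only for illustration: `retry`, `process_coarse`, `process_fine`, and `make_flaky_step` are made-up names, and the sketch assumes each step is safe to run more than once (idempotent), which real workloads may not guarantee.

```python
from typing import Callable, Dict, List, Optional, Tuple, TypeVar

R = TypeVar("R")


def retry(fn: Callable[[], R], attempts: int = 3) -> R:
    """Call fn until it succeeds or the attempts are used up."""
    last_error: Optional[Exception] = None
    for _ in range(attempts):
        try:
            return fn()
        except Exception as exc:  # the cause may be unknowable, so catch broadly
            last_error = exc
    raise RuntimeError(f"gave up after {attempts} attempts") from last_error


def process_coarse(items: List[int], step: Callable[[int], int]) -> List[int]:
    # One big unit of work: any failure re-runs every item from the start.
    return retry(lambda: [step(x) for x in items])


def process_fine(items: List[int], step: Callable[[int], int]) -> List[int]:
    # One unit of work per item: only the item that failed is retried.
    return [retry(lambda x=x: step(x)) for x in items]


def make_flaky_step() -> Tuple[Callable[[int], int], Dict[str, int]]:
    """Build a step that fails the first time it sees the value 3 (simulated fault)."""
    seen = set()
    calls = {"n": 0}

    def step(x: int) -> int:
        calls["n"] += 1
        if x == 3 and x not in seen:
            seen.add(x)
            raise ValueError("transient failure")
        return x * 2

    return step, calls


step, calls = make_flaky_step()
coarse_result = process_coarse([1, 2, 3, 4], step)
coarse_calls = calls["n"]  # 7: three calls before the failure, then four on the retry

step, calls = make_flaky_step()
fine_result = process_fine([1, 2, 3, 4], step)
fine_calls = calls["n"]  # 5: only the failed item is repeated

print(coarse_result, coarse_calls)
print(fine_result, fine_calls)
```

Both variants produce the same output; the difference is how much work is repeated after a failure. The coarse version is simpler to reason about, while the fine version wastes less work but needs per-item bookkeeping.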
The dependencies are actually between the data, not between the computation processes that generate the data.
These systems achieve their scalability and fault tolerance by providing a programming model where the user creates acyclic data flow graphs to pass input data through a set of operators. This allows the underlying system to manage scheduling and to react to faults without user intervention. (#)
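A minimal sketch of that programming model, under stated assumptions: the graph is declared in terms of datasets and the datasets they depend on, and a toy scheduler (the hypothetical `run_graph`) decides the execution order and re-runs a failed operator without the caller being involved. The dataset names and operators are invented for illustration, every dataset is assumed to appear as a key in the graph, and operators are assumed to be deterministic and free of side effects so that re-running them is safe. Real systems distribute this across machines; this version runs in one process.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+
from typing import Any, Callable, Dict, Tuple

# dataset name -> (names of the datasets it is derived from, operator that produces it)
Graph = Dict[str, Tuple[Tuple[str, ...], Callable[..., Any]]]


def run_graph(graph: Graph, attempts: int = 3) -> Dict[str, Any]:
    """Produce every dataset in dependency order, retrying failed operators."""
    dependencies = {name: deps for name, (deps, _) in graph.items()}
    results: Dict[str, Any] = {}
    for name in TopologicalSorter(dependencies).static_order():
        deps, operator = graph[name]
        inputs = [results[d] for d in deps]  # upstream data is already materialized
        for attempt in range(1, attempts + 1):
            try:
                results[name] = operator(*inputs)
                break
            except Exception:
                if attempt == attempts:
                    raise  # out of attempts: surface the fault to the caller
    return results


attempts_seen = {"n": 0}


def flaky_sum(values):
    # Simulated fault: the first attempt fails, later attempts succeed.
    attempts_seen["n"] += 1
    if attempts_seen["n"] == 1:
        raise RuntimeError("simulated worker failure")
    return sum(values)


graph: Graph = {
    "raw": ((), lambda: [1, 2, 3]),
    "doubled": (("raw",), lambda raw: [2 * x for x in raw]),
    "total": (("doubled",), flaky_sum),
}

results = run_graph(graph)
print(results["total"], attempts_seen["n"])
```

Because the graph records which data each dataset is derived from, the scheduler can recompute only the failed dataset from its already-available inputs rather than restarting the whole job.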