Made to address the shortcomings of Map Reduce. The main benefit of new data flow engines are:
- Expensive work like sorting only needs to be performed in places where it is required
- There are no unnecessary map tasks
- Locality optimization can be made since joins and dependencies are explicitly declared
- Intermediate states are kept in memory
- Operators can execute as soon as input is ready rather than waiting for a preceding stage to finish Since Spark avoids writing to the HDFS, it achieves durability by keeping tack of how a piece of data was computed for re-computation from earlier steps instead of materializing intermediate data. Therefore, it is important to make operators deterministic.
Pregel
Spark uses the Pregel processing model which is an optimization for batch processing graphs where each vertex can send messages to other vertices along the edges of the graph. Each vertex remembers its state in memory from one iteration to the next so the function only needs to process new messages. Pregel achieves fault tolerance by checkpointing full node states.