Spark

Made to address the shortcomings of Map Reduce. The main benefit of new data flow engines are:

Expensive work like sorting only needs to be performed in places where it is required
There are no unnecessary map tasks
Locality optimization can be made since joins and dependencies are explicitly declared
Intermediate states are kept in memory
Operators can execute as soon as input is ready rather than waiting for a preceding stage to finish Since Spark avoids writing to the HDFS, it achieves durability by keeping tack of how a piece of data was computed for re-computation from earlier steps instead of materializing intermediate data. Therefore, it is important to make operators deterministic.

Pregel

Spark uses the Pregel processing model which is an optimization for batch processing graphs where each vertex can send messages to other vertices along the edges of the graph. Each vertex remembers its state in memory from one iteration to the next so the function only needs to process new messages. Pregel achieves fault tolerance by checkpointing full node states.

🤖 Dan Huynh

Recent Notes

All-Pairs Shortest Paths

Dynamic Programming

Knapsack

Linear Programming

Memoization

Explorer