A batch processing system takes a large amount of input data, runs a job to process it, and produces some output data. These jobs can take a while, and there usually isn’t a user waiting for them to finish. Instead, they are scheduled to run periodically. Performance is measured by throughput, that is, how long it takes to process an input of a given size. Unix tools are very powerful here.
The output of a batch workflow is often not the kind of result an analytic query produces, but it is not transaction processing either. Common uses include building search indexes and databases, often in the form of key-value stores. In this sense, MapReduce and batch processing are like a general-purpose operating system that can run arbitrary programs.
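As a rough illustration of a batch job whose output is an index, here is a minimal in-memory sketch of the map, shuffle, and reduce steps. The function names (`map_doc`, `reduce_term`, `build_index`) and the sample documents are made up for this example; a real framework would distribute these steps across machines and files.

```python
from collections import defaultdict


def map_doc(doc_id, text):
    """Map step: emit a (term, doc_id) pair for every distinct word in a document."""
    for term in set(text.lower().split()):
        yield term, doc_id


def reduce_term(term, doc_ids):
    """Reduce step: collapse all doc ids for one term into a sorted posting list."""
    return term, sorted(doc_ids)


def build_index(docs):
    """Run map, shuffle (group by key), and reduce over an in-memory dict of documents."""
    grouped = defaultdict(list)
    for doc_id, text in docs.items():
        for term, emitted_id in map_doc(doc_id, text):
            grouped[term].append(emitted_id)  # shuffle: gather values by key
    return dict(reduce_term(term, ids) for term, ids in grouped.items())


# Hypothetical input: document id -> text.
docs = {1: "batch jobs build indexes", 2: "batch jobs run periodically"}
index = build_index(docs)
print(index["batch"])  # [1, 2]
```

The resulting dictionary behaves like a small key-value store: each term is a key and its posting list is the value.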
Philosophy
- You can always roll back the code and rerun the job to get the correct output because…
- Inputs are immutable and outputs are produced atomically, so rerunning a job is idempotent
- The same set of files can be used as input for different jobs
- The logic is separate from the wiring
- Raw data is often preferable: different consumers have different priorities and different views of the same data, and there may not be one ideal data model, so a “schema on read” approach makes sense
- The MapReduce approach is suited to large jobs that process so much data and run for so long that at least one task is likely to fail along the way
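The points about immutable inputs, atomic outputs, and rerunning failed work can be sketched on a single machine as follows. This is only an illustration under simple assumptions: `run_job` and `run_with_retries` are hypothetical helpers, the input is a plain text file, and atomicity comes from writing to a temporary file and renaming it with `os.replace`.

```python
import os
import tempfile


def run_job(input_path, output_path, transform):
    """Read an input file without modifying it and publish the output atomically."""
    out_dir = os.path.dirname(os.path.abspath(output_path))
    # Write to a temporary file in the output directory so the final
    # rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=out_dir)
    try:
        with os.fdopen(fd, "w") as tmp, open(input_path) as src:
            for line in src:
                tmp.write(transform(line))
        # Swap the finished file into place in a single step.
        os.replace(tmp_path, output_path)
    except BaseException:
        # A failed run leaves no partial output; any earlier output is untouched.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def run_with_retries(job, attempts=3):
    """Rerun a failed job from scratch, which is safe because reruns are idempotent."""
    for attempt in range(1, attempts + 1):
        try:
            return job()
        except Exception:
            if attempt == attempts:
                raise
```

Because the input is never changed and the output only appears once it is complete, `run_with_retries(lambda: run_job(src, out, str.upper))` can be called any number of times and always leaves the same result in `out`.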
Tip
To better understand a tool or philosophy, go back and examine the history and context in which it was developed