Before we look at the intermediate data produced by the mapper, it is worth seeing how Hadoop provides fault tolerance during MapReduce processing.
Once the Name Node (NN) receives the data files to be processed, it splits them and assigns the pieces to Data Nodes (DN). The assignment depends on the number of DNs available at allocation time and on the configuration parameter “Replication Factor (RF)”.
Let’s define a scenario to illustrate the approach a Hadoop cluster uses.
- Consider a Hadoop cluster of 6 Data Nodes.
- Of these 6 nodes, one is dead. The dead node is marked in red. Hence, 5 nodes are currently available to take part in data processing.
- The data is 612 MB and the default split size is 64 MB. Hence there are 9 packets of 64 MB (576 MB) and a 10th packet of 36 MB.
- Since the data is split into 10 packets (pkts) and 5 DNs can hold data for processing, each node is assigned 2 pkts.
- RF, for our scenario, is set to 2. With this configuration, the NN always tries to keep at least 2 copies of each pkt, so that no data is lost if a node goes offline during processing. Given this RF and the number of DNs, a possible allocation would be:
The total size of the data in the cluster is now 612 MB × 2 = 1224 MB.
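The split and replication arithmetic above can be checked with a few lines of Python. This is only a sketch of the calculation, not how Hadoop itself computes splits; the function name `split_into_packets` is made up for illustration. Note that 612 MB leaves a final packet of 36 MB after nine full 64 MB packets.

```python
def split_into_packets(total_mb, packet_mb=64):
    """Return the packet sizes (in MB) for a file of total_mb megabytes."""
    full, remainder = divmod(total_mb, packet_mb)
    sizes = [packet_mb] * full      # the full-size packets
    if remainder:
        sizes.append(remainder)     # the last, smaller packet
    return sizes


packets = split_into_packets(612)   # scenario: 612 MB of data, 64 MB splits
replication_factor = 2              # RF from the scenario
live_nodes = 5                      # 6 Data Nodes, one of them dead

stored_mb = sum(packets) * replication_factor
packets_per_node = len(packets) // live_nodes

print(f"{len(packets)} packets, last one {packets[-1]} MB")
print(f"{packets_per_node} packets to process per node")
print(f"{stored_mb} MB stored with RF={replication_factor}")
```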
If a node goes down while the cluster is processing data, for example Node-4, the fault-tolerance feature comes into play.
With this node down, the total size of the data in the cluster is 1224 MB - 64 MB × 4 (the packets on DN-4) = 968 MB.
The first thing Hadoop does is check whether any node that was down earlier has come back. If no such node is found, it redistributes the data packets lost with Node-4 to the available nodes of the cluster.
We have lost two things here: the packets DN-4 was using for data processing, and the packets it held to fulfil the RF.
- PKT-3 & 8, which it was processing, and
- PKT-1 & 6, which it held under the RF maintenance policy.
It is solely at the NN's discretion to decide which nodes take over the lost packets of DN-4; it then assigns those packets to the chosen DNs.
- For our scenario, the NN has picked DN-1 for Packet 3. A copy of Packet 3 can be found on DN-6, where it was placed to maintain the RF. The green arrow shows this.
- Packet 8 is assigned to DN-5. A copy of this packet was kept on DN-6 under the RF policy. This is shown in purple.
- Packets 1 & 6, which were placed on DN-4 to maintain the RF, can be picked up from DN-1; this is shown in yellow. Packet 1 is assigned to DN-3 and Packet 6 is assigned to DN-5; this assignment is shown in blue.
After the required data packets are assigned to DNs, the total data size is back to 1224 MB.
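The recovery steps above can be replayed as a small simulation. This is a toy model under stated assumptions, not Hadoop's actual re-replication code. Only DN-4's packets, the copies of packets 3 and 8 on DN-6, and the copies of packets 1 and 6 on DN-1 come from the scenario; the rest of the layout, including which node was dead from the start (assumed here to be DN-2), is filled in purely for illustration. The names `layout`, `stored_mb`, and `copies` are hypothetical.

```python
# Packet sizes in MB: nine 64 MB packets plus a final 36 MB packet (612 MB).
PACKET_MB = {p: 64 for p in range(1, 10)}
PACKET_MB[10] = 36

# Hypothetical layout with RF = 2. Only the DN-4 row, packets 3 and 8 on
# DN-6, and packets 1 and 6 on DN-1 are taken from the scenario.
layout = {
    "DN-1": {1, 6, 5, 10},
    "DN-3": {2, 7, 4, 9},
    "DN-4": {3, 8, 1, 6},
    "DN-5": {4, 9, 2, 7},
    "DN-6": {5, 10, 3, 8},
}


def stored_mb(layout):
    """Total MB held across all live nodes, counting every copy."""
    return sum(PACKET_MB[p] for pkts in layout.values() for p in pkts)


def copies(layout, packet):
    """Names of the live nodes that hold a copy of the packet."""
    return [node for node, pkts in layout.items() if packet in pkts]


before = stored_mb(layout)

# DN-4 goes down: every copy it held disappears from the cluster.
lost = layout.pop("DN-4")
after_failure = stored_mb(layout)

# Targets picked by the NN in the scenario.
targets = {3: "DN-1", 8: "DN-5", 1: "DN-3", 6: "DN-5"}

for packet in sorted(lost):
    sources = copies(layout, packet)
    if not sources:
        raise RuntimeError(f"packet {packet} has no surviving copy")
    layout[targets[packet]].add(packet)  # copy from the surviving replica
    print(f"packet {packet}: {sources[0]} -> {targets[packet]}")

after_recovery = stored_mb(layout)
print(before, after_failure, after_recovery)
```

Every lost packet has exactly one surviving copy in this layout, which is why an RF of 2 is enough to survive a single node failure but not two simultaneous failures on nodes that share a packet.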
Meanwhile, if a dead node comes back to life, it is kept as a redundant node that may be used if other nodes go down. The NN does not assign any task to such a node immediately. The node shown in yellow represents a newly alive node.