Reduce in MapReduce … Unwinding

In our previous blogs we have studied about Big data, Hadoop.  We have also explained MapReduce internal workings like how Map works using short and shuffle.  This blog is dedicated to Reduce in MapReduce.

Once this shuffling completed, it is where Reduce in MapReduce come into action.

Its task is to process the input provided by SHUFFLE so that user can understand what is the result of the file processed by hadoop.

Reduce in MapReduce:

After shuffling completed, it is clear that one word will be processed by only one DN and not multiple DNs.  Hence, to find a count of one particular word Reduce in MapReduce has to search that particular word at one specific  DN only.

The task of Reduce  is to pick a word on a DN and search for all occurrences of that particular word on that DN and clubbed its number.

Elegancy of Reduce:

We will explain this with the help of tables.  If we look at the picture of tables below, we can easily understand how elegantly Reduce in MapReduce execute this task.

Reduce pick a word, for example, we have taken “can” on DN -1.  What it does, it searches for all “can” in this node and clubbed together at one place.

Reduce Explained:

In the picture below, it found a total 4 “can” on the DN -1.  It marked all 3 in red (encircled by green circle to remove this) and add 1 to one “can” (marked in brown rectangle).

Cell Marked in red color will be reduced and clubbed at cell not marked in any color for the same word.

Marking for Clubbing of word on DN-1:

NODE – 1 [ L (K, L(V))]

(K)

V

 

(K)

V

achieved

1

 

accommodate

1

adding

1

 

also

1

allow

1

 

an

1

and

1,1

 

analysis

1

any

1

 

and

1

at

1

 

be

1

be

1,1

 

but

1

best

1

 

By

1

by

1,1,1

 

can

1

By

1

 

can

1

can

1,1,1,1

 

can

1

capability

1

 

commodity

1

capture

1

 

configured

1

commodity

1,1

 

cost

1

data

1,1,1,1,1

 

data

1

data

1

 

data

1

decentralize

1,1

 

data

1

design

1

 

decentralized

1

distributed

1,1

 

enabling

1

distributed

1

 

failure

1

enables

1

 

falut

1

ETL

1

 

for

1,1

every

1

 

for

1

Marking for Clubbing of word on DN-2:

NODE – 2 [ L (K, L(V))]

 

 

 

 

 

 

 

 

 

 

Marking for Clubbing of word on DN-3:

NODE – 3 [ L (K,L(
V))]

(K)

V

 

(K)

V

get

1

 

granular

1

Hadoop

1,1,1

 

hardware

1

Hadoop

1

 

horizontal

1,1

Hadoop

1

 

Horizontal

1

hardware

1,1

 

hours

1

harnessing

1

 

in

1

hit

1

 

it

1

huge

1

 

its

1

is

1

 

less

1

its

1,1,1

 

level

1

its

1

 

low

1

limitations

1

 

 

 

low

1,1

 

 

 

Marking for Clubbing of word on DN-4:

NODE – 4 [ L (K, L(V))]

(K)

V

 

(K)

V

,

1,1,1

 

,

1

.

1,1,1

 

.

1

.

1

 

machine

1

.

1

 

machines

1

machines

1,1,1

 

maximum

1

minute

1

 

more

1

much

1

 

new

1

of

1,1,1,1,1

 

not

1

of

1

 

on

1

of

1

 

only

1

of

1

 

or

1

of

1

 

organization

1

on

1,1

 

overcome

1

one

1

 

parallel

1,1

organization

1,1

 

parallel

1

part

1

 

performance

1

 

1,1,1

 

processing

1,1,1

 

1

 

processing

1

 

1

 

processing

1

Marking for Clubbing of word on DN-5:

NODE – 5 [ L (K,L(V))]

(K)

V

 

(K)

V

speeds

1

 

relies

1

the

1,1,1,1

 

scaling

1,1

the

1

 

scaling

1

the

1

 

scenarios

1

the

1

 

technique

1

them

1

 

these

1

this

1,1,1

 

This

1

This

1

 

to

1

to

1,1,1

 

tolerent

1

to

1

 

tradition

1

true

1

 

type

1

Marking for Clubbing of word on DN-6:

NODE – 6 [ L (K, L(V))]

(K)

V

 

(K)

V

up

1

 

using

1

use

1

 

very

1

With

1,1,1

 

waiting

1

with

1

 

was

1

within

1

 

which

1

without

1,1

 

with

1

without

1

 

 

 

Next step is to remove words which is clubbed (as marked in RED cells)

The output would be [L<K, L<V> >]

Reducing of word on DN -1 & DN -2.  You can see here the cell marked in RED is removed now.

Reduce on DN-1:

NODE – 1 [ L (K, L(V))]

(K)

V

 

(K)

V

accommodate

1

 

also

1

achieved

1

 

an

1

adding

1

 

analysis

1

allow

1

 

but

1

and

1,1

 

configured

1

any

1

 

cost

1

at

1

 

decentralize

1,1

be

1,1

 

design

1

best

1

 

distributed

1,1

by

1,1,1

 

enables

1

can

1,1,1,1

 

enabling

1

capability

1

 

ETL

1

capture

1

 

every

1

commodity

1,1

 

failure

1

data

1,1,1,1,1

 

falut

1

for

1,1

 

 

 

Reduce on DN-2:

NODE – 2 [ L (K, L(V))]

 

 

 

 

 

 

 

 

 

 

Reduce on DN-3:

NODE – 3 [ L (K,L(
V))]

(K)

V

 

(K)

V

get

1

 

granular

1

Hadoop

1,1,1

 

horizontal

1,1

hardware

1,1

 

hours

1

harnessing

1

 

in

1

hit

1

 

it

1

huge

1

 

less

1

is

1

 

level

1

its

1,1,1

 

limitations

1

low

1,1

 

 

 

Reduce on DN-4:

NODE – 4 [ L (K, L(V))]

(K)

V

 

(K)

V

,

1,1

 

maximum

1

.

1,1,1,1

 

more

1

machines

1,1,1

 

new

1

minute

1

 

not

1

much

1

 

only

1

of

1,1,1,1,1

 

or

1

on

1,1

 

overcome

1

one

1

 

parallel

1,1

organization

1,1

 

performance

1

part

1

 

processing

1,1,1

 

1,1,1

 

 

 

Reduce on DN-5:

NODE – 5 [ L (K,L(V))]

(K)

V

 

(K)

V

speeds

1

 

relies

1

the

1,1,1,1

 

scaling

1,1

them

1

 

scenarios

1

this

1,1,1

 

technique

1

to

1,1,1

 

these

1

tolerent

1

 

true

1

tradition

1

 

type

1

Reduce on DN-6:

NODE – 6 [ L (K, L(V))]

(K)

V

 

(K)

V

up

1

 

using

1

use

1

 

very

1

With

1,1,1

 

waiting

1

within

1

 

was

1

without

1,1

 

which

1

Calculating words count:

Now reduce will work to reduce the output above.

The list of counts ( L<V> )mentioned in the form of 1,1,1 etc will be converted into one digit ( <V> ).

The output would be [L<K, V >]

Finally, Reduced output will be provided to the user.

WORD

COUNT

 

WORD

COUNT

 

WORD

COUNT

 

WORD

COUNT

 

WORD

COUNT

 

WORD

COUNT

,

2

 

by

3

 

failure

1

 

its

3

 

only

1

 

these

1

.

4

 

can

4

 

falut

1

 

less

1

 

or

1

 

this

3

accommodate

1

 

capability

1

 

for

2

 

level

1

 

organization

2

 

to

3

achieved

1

 

capture

1

 

get

1

 

limitations

1

 

overcome

1

 

tolerent

1

adding

1

 

commodity

2

 

granular

1

 

low

2

 

parallel

2

 

tradition

1

allow

1

 

configured

1

 

Hadoop

3

 

machines

3

 

part

1

 

true

1

also

1

 

cost

1

 

hardware

2

 

maximum

1

 

performance

1

 

type

1

an

1

 

data

5

 

harnessing

1

 

minute

1

 

processing

3

 

up

1

analysis

1

 

decentralize

2

 

hit

1

 

more

1