MapReduce Unwinding … Reduce

Once shuffling has completed, REDUCE comes into action.  Its task is to turn the input handed over by SHUFFLE into output the user can read, that is, the result of the file processed by Hadoop.

After shuffling has completed, each word is processed by exactly one DN and not by multiple DNs.  Hence, to count one particular word, REDUCE has to search for that word on one specific DN only.
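The guarantee that a word lands on a single DN can be sketched in a few lines of Python. This is only an illustration under assumptions: the six-node count, the function names `node_for` and `shuffle`, and the CRC32-based rule are all made up for the sketch, and the rule that actually placed the words in this article's example is not shown here. The point is only that any deterministic function of the word sends every occurrence to the same node.

```python
import zlib

NUM_NODES = 6  # assumption: six DNs, numbered 1 to 6


def node_for(word, num_nodes=NUM_NODES):
    """Pick one DN for a word; the same word always gets the same DN."""
    # CRC32 is a stand-in rule chosen for this sketch, not Hadoop's own.
    return zlib.crc32(word.encode("utf-8")) % num_nodes + 1


def shuffle(pairs, num_nodes=NUM_NODES):
    """Send every (word, 1) pair to the DN chosen for its word."""
    nodes = {n: [] for n in range(1, num_nodes + 1)}
    for word, count in pairs:
        nodes[node_for(word, num_nodes)].append((word, count))
    return nodes


mapped = [("can", 1), ("data", 1), ("can", 1), ("Hadoop", 1), ("can", 1)]
nodes = shuffle(mapped)

# Every "can" pair sits on exactly one DN.
holders = [n for n, pairs in nodes.items() if ("can", 1) in pairs]
print(holders)
```

Because the node is computed from the word alone, REDUCE never has to look at a second DN to finish counting a word.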

From here, the REDUCE task is to pick a word on a DN, find all occurrences of that word on that DN, and club their counts together.  The picture below shows how elegantly REDUCE executes this task.

Reduce picks a word; for example, we have taken “can” on DN-1.  It searches for every “can” on this node and clubs them together in one place.  In the picture below, it found a total of 4 “can” entries on DN-1.  It marked 3 of them in red (encircled in green, to be removed) and added their 1s to the one remaining “can” (marked with a brown rectangle), whose value list becomes 1,1,1,1.

Cells marked in red will be removed, and their values are clubbed into the uncolored cell for the same word.
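The marking step described above can be sketched roughly as follows. This is a minimal illustration, not how Hadoop does it internally: the function name `mark_for_clubbing` and the `is_red` flag are invented here, and keeping the first occurrence as the uncolored cell is an arbitrary choice (in the picture the kept cell may be a different one).

```python
from collections import defaultdict


def mark_for_clubbing(pairs):
    """Club repeated words into one cell and mark the other cells red.

    pairs: list of (word, 1) tuples held by a single DN.
    Returns (word, values, is_red) cells in the original order.
    """
    clubbed = defaultdict(list)
    for word, value in pairs:
        clubbed[word].append(value)

    cells = []
    seen = set()
    for word, value in pairs:
        if word in seen:
            # Repeat of a word already kept: red, to be removed later.
            cells.append((word, [value], True))
        else:
            # First occurrence: uncolored, carries all clubbed values.
            seen.add(word)
            cells.append((word, clubbed[word], False))
    return cells


dn_1 = [("can", 1), ("capability", 1), ("can", 1), ("can", 1), ("can", 1)]
for cell in mark_for_clubbing(dn_1):
    print(cell)
# ('can', [1, 1, 1, 1], False)
# ('capability', [1], False)
# ('can', [1], True)
# ('can', [1], True)
# ('can', [1], True)
```

As in the “can” example, four entries end up as one uncolored cell holding 1,1,1,1 and three red cells.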

Marking for Clubbing of word on DN-1:

NODE – 1 [ L (K, L(V))]

(K)             V             (K)             V
achieved        1             accommodate     1
adding          1             also            1
allow           1             an              1
and             1,1           analysis        1
any             1             and             1
at              1             be              1
be              1,1           but             1
best            1             By              1
by              1,1,1         can             1
By              1             can             1
can             1,1,1,1       can             1
capability      1             commodity       1
capture         1             configured      1
commodity       1,1           cost            1
data            1,1,1,1,1     data            1
data            1             data            1
decentralize    1,1           data            1
design          1             decentralized   1
distributed     1,1           enabling        1
distributed     1             failure         1
enables         1             falut           1
ETL             1             for             1,1
every           1             for             1

Marking for Clubbing of word on DN-2:

NODE – 2 [ L (K, L(V))]


Marking for Clubbing of word on DN-3:

NODE – 3 [ L (K, L(V))]

(K)             V             (K)             V
get             1             granular        1
Hadoop          1,1,1         hardware        1
Hadoop          1             horizontal      1,1
Hadoop          1             Horizontal      1
hardware        1,1           hours           1
harnessing      1             in              1
hit             1             it              1
huge            1             its             1
is              1             less            1
its             1,1,1         level           1
its             1             low             1
limitations     1
low             1,1

Marking for Clubbing of word on DN-4:

NODE – 4 [ L (K, L(V))]

(K)             V             (K)             V
,               1,1,1         ,               1
.               1,1,1         .               1
.               1             machine         1
.               1             machines        1
machines        1,1,1         maximum         1
minute          1             more            1
much            1             new             1
of              1,1,1,1,1     not             1
of              1             on              1
of              1             only            1
of              1             or              1
of              1             organization    1
on              1,1           overcome        1
one             1             parallel        1,1
organization    1,1           parallel        1
part            1             performance     1
(blank)         1,1,1         processing      1,1,1
(blank)         1             processing      1
(blank)         1             processing      1

Marking for Clubbing of word on DN-5:

NODE – 5 [ L (K,L(V))]

(K)             V             (K)             V
speeds          1             relies          1
the             1,1,1,1       scaling         1,1
the             1             scaling         1
the             1             scenarios       1
the             1             technique       1
them            1             these           1
this            1,1,1         This            1
This            1             to              1
to              1,1,1         tolerent        1
to              1             tradition       1
true            1             type            1

Marking for Clubbing of word on DN-6:

NODE – 6 [ L (K, L(V))]

(K)             V             (K)             V
up              1             using           1
use             1             very            1
With            1,1,1         waiting         1
with            1             was             1
within          1             which           1
without         1,1           with            1
without         1

The next step is to remove the words that have been clubbed (the cells marked in RED).

The output would be [L<K, L<V> >]
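Dropping the red cells amounts to a simple filter. The sketch below assumes each cell is a hypothetical `(word, values, is_red)` tuple, where `is_red` stands for the RED marking; the function name `remove_red_cells` is likewise invented for the illustration.

```python
def remove_red_cells(cells):
    """Keep only the uncolored cells, giving a list of (K, L(V)) entries."""
    return [(word, values) for word, values, is_red in cells if not is_red]


# A few marked cells in the style of DN-1 (True means the cell is red).
marked = [
    ("by", [1, 1, 1], False),
    ("can", [1], True),
    ("can", [1, 1, 1, 1], False),
    ("can", [1], True),
    ("capability", [1], False),
]

print(remove_red_cells(marked))
# [('by', [1, 1, 1]), ('can', [1, 1, 1, 1]), ('capability', [1])]
```

After this step every word appears once on its node, paired with the full list of its 1s.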

Reducing words on DN-1 & DN-2.  You can see here that the cells marked in RED have now been removed.

Reduce on DN-1:

NODE – 1 [ L (K, L(V))]

(K)             V             (K)             V
accommodate     1             also            1
achieved        1             an              1
adding          1             analysis        1
allow           1             but             1
and             1,1           configured      1
any             1             cost            1
at              1             decentralize    1,1
be              1,1           design          1
best            1             distributed     1,1
by              1,1,1         enables         1
can             1,1,1,1       enabling        1
capability      1             ETL             1
capture         1             every           1
commodity       1,1           failure         1
data            1,1,1,1,1     falut           1
for             1,1

Reduce on DN-2:

NODE – 2 [ L (K, L(V))]


Reduce on DN-3:

NODE – 3 [ L (K, L(V))]

(K)             V             (K)             V
get             1             granular        1
Hadoop          1,1,1         horizontal      1,1
hardware        1,1           hours           1
harnessing      1             in              1
hit             1             it              1
huge            1             less            1
is              1             level           1
its             1,1,1         limitations     1
low             1,1

Reduce on DN-4:

NODE – 4 [ L (K, L(V))]

(K)             V             (K)             V
,               1,1           maximum         1
.               1,1,1,1       more            1
machines        1,1,1         new             1
minute          1             not             1
much            1             only            1
of              1,1,1,1,1     or              1
on              1,1           overcome        1
one             1             parallel        1,1
organization    1,1           performance     1
part            1             processing      1,1,1
(blank)         1,1,1

Reduce on DN-5:

NODE – 5 [ L (K,L(V))]

(K)             V             (K)             V
speeds          1             relies          1
the             1,1,1,1       scaling         1,1
them            1             scenarios       1
this            1,1,1         technique       1
to              1,1,1         these           1
tolerent        1             true            1
tradition       1             type            1

Reduce on DN-6:

NODE – 6 [ L (K, L(V))]

(K)             V             (K)             V
up              1             using           1
use             1             very            1
With            1,1,1         waiting         1
within          1             was             1
without         1,1           which           1

Now REDUCE works on the output above.  Each list of counts ( L<V> ), written in the form 1,1,1 etc., is converted into a single number ( <V> ) by adding up its entries.

The output would be [L<K, V >]
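Converting each list of 1s into a single count is just a sum per word. Here is a minimal sketch using a few of the clubbed entries from DN-1; the function name `reduce_counts` is an assumption made for this illustration.

```python
def reduce_counts(clubbed):
    """Turn each (word, [1, 1, ...]) entry into (word, total)."""
    return [(word, sum(values)) for word, values in clubbed]


dn_1 = [
    ("by", [1, 1, 1]),
    ("can", [1, 1, 1, 1]),
    ("commodity", [1, 1]),
    ("data", [1, 1, 1, 1, 1]),
]

print(reduce_counts(dn_1))
# [('by', 3), ('can', 4), ('commodity', 2), ('data', 5)]
```

Since each word lives on one DN only, the per-node results can simply be put together to form the final word count.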

Finally, the reduced output is provided to the user.

WORD          COUNT   WORD          COUNT   WORD          COUNT   WORD          COUNT   WORD          COUNT   WORD          COUNT
,             2       by            3       failure       1       its           3       only          1       these         1
.             4       can           4       falut         1       less          1       or            1       this          3
accommodate   1       capability    1       for           2       level         1       organization  2       to            3
achieved      1       capture       1       get           1       limitations   1       overcome      1       tolerent      1
adding        1       commodity     2       granular      1       low           2       parallel      2       tradition     1
allow         1       configured    1       Hadoop        3       machines      3       part          1       true          1
also          1       cost          1       hardware      2       maximum       1       performance   1       type          1
an            1       data          5       harnessing    1       minute        1       processing    3       up            1

analysis

1

 

decentralize

2

 

hit

1

 

more

1

 

relies