How to filter record values of files in Hadoop MapReduce?
I am working on a program in MapReduce. I have two files and I want to delete the information in file1 that also exists in file2. Every line has an id as its key and a list of numbers (separated by commas) as its value.
file1:
1 1,2,10
2 2,7,8,5
3 3,9,12
and
file2:
1 1
2 2,5
3 3,9
I want the output to be like this:
output:
1 2,10
2 7,8
3 12
I want to delete the values of file1 that have the same key in file2. One way is to take both files as input files and, in the map step, produce (id, line). In the reduce step I would then filter the values. But the files are very large, so I can't do it this way.
Or, would it be efficient to take file1 as the input file and, in the map step, open file2, seek to the matching line, and compare the values? But I have a million keys, and for every key I would have to open file2, so I think that would cause excessive I/O.
What can I do?
You can make both file1 and file2 inputs of the mapper. In the mapper you would add the source (file1 or file2) to each record. Then use a secondary sort to make sure the records from file2 come first. The combined input to the reducer would then look like this:
1 file2,1
1 file1,1,2,10
2 file2,2,5
2 file1,2,7,8,5
3 file2,3,9
3 file1,3,9,12
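A minimal sketch of such a mapper in Java, based on the two-column line format in the examples above. The class name TagSourceMapper and the filename-based tagging are my own assumptions; it also assumes both files are added to the job through plain FileInputFormat (with MultipleInputs the input split is wrapped in a different class, so the FileSplit cast would fail):

    import java.io.IOException;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileSplit;

    public class TagSourceMapper extends Mapper<LongWritable, Text, Text, Text> {
        private final Text outKey = new Text();
        private final Text outValue = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            // Tag each record with the file it came from (assumes the input
            // paths are literally named "file1" and "file2").
            String fileName = ((FileSplit) context.getInputSplit()).getPath().getName();
            String tag = fileName.contains("file2") ? "file2" : "file1";

            // Each line looks like "1 1,2,10": id, whitespace, comma-separated values.
            String[] parts = line.toString().split("\\s+", 2);
            if (parts.length < 2) {
                return; // skip malformed lines
            }
            outKey.set(parts[0]);
            outValue.set(tag + "," + parts[1]);
            context.write(outKey, outValue);
        }
    }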
You can take the design of the reducer from here.
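And a matching reducer sketch (FilterReducer is a hypothetical name). For simplicity it buffers both tagged records per key instead of relying on the secondary sort; that is safe here because each key carries at most one line per file. A full secondary-sort version would add a composite key plus a custom partitioner and grouping comparator so the file2 record always arrives first and nothing needs buffering:

    import java.io.IOException;
    import java.util.HashSet;
    import java.util.LinkedHashSet;
    import java.util.Set;

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    public class FilterReducer extends Reducer<Text, Text, Text, Text> {
        private final Text outValue = new Text();

        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            Set<String> toRemove = new HashSet<>();   // numbers seen in file2
            Set<String> kept = new LinkedHashSet<>(); // numbers seen in file1, in order

            for (Text value : values) {
                // Each value looks like "file2,2,5": source tag, then the numbers.
                String[] parts = value.toString().split(",");
                boolean fromFile2 = "file2".equals(parts[0]);
                for (int i = 1; i < parts.length; i++) {
                    if (fromFile2) {
                        toRemove.add(parts[i]);
                    } else {
                        kept.add(parts[i]);
                    }
                }
            }

            // Keep only the file1 numbers that file2 does not contain.
            kept.removeAll(toRemove);
            if (!kept.isEmpty()) {
                outValue.set(String.join(",", kept));
                context.write(key, outValue);
            }
        }
    }

Wiring this up only needs FileInputFormat.addInputPath for both files plus job.setMapperClass and job.setReducerClass; no distributed cache or map-side lookups, so neither file has to fit in memory.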