How to filter record values of files in Hadoop MapReduce?
I am working on a program in MapReduce. I have two files and I want to delete the information in file1 that also exists in file2. Every line has an id as its key and a list of numbers (separated by commas) as its value.
file1:
1 1,2,10
2 2,7,8,5
3 3,9,12
and
file2:
1 1
2 2,5
3 3,9
I want the output to be like this:
output:
1 2,10
2 7,8
3 12
I want to delete the values of file1 that have the same key in file2. One way is to take both files as input files and, in the map step, produce (id, line). In the reduce step I would then filter the values. But the files are very large, so I can't do it this way.
Or, would it be efficient to take file1 as the input file and, in the map step, open file2, seek to the matching line, and compare the values? But I have a million keys, and for every key I would have to open file2, so I think that would cause excessive I/O.
What can I do?
You can make both file1 and file2 inputs of the mapper. In the mapper you would add the source (file1 or file2) to each record. Then use a secondary sort to make sure the records from file2 come first. The combined input to the reducer would then look like this:
1 file2,1
1 file1,1,2,10
2 file2,2,5
2 file1,2,7,8,5
3 file2,3,9
3 file1,3,9,12
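A minimal sketch of such a mapper in Java, based on the two-column line format in the examples above. The class name TagSourceMapper and the filename-based tagging are my own assumptions; it also assumes both files are added to the job through plain FileInputFormat (with MultipleInputs the input split is wrapped in a different class, so the FileSplit cast would fail):

    import java.io.IOException;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileSplit;

    public class TagSourceMapper extends Mapper<LongWritable, Text, Text, Text> {
        private final Text outKey = new Text();
        private final Text outValue = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            // Tag each record with the file it came from (assumes the input
            // paths are literally named "file1" and "file2").
            String fileName = ((FileSplit) context.getInputSplit()).getPath().getName();
            String tag = fileName.contains("file2") ? "file2" : "file1";

            // Each line looks like "1 1,2,10": id, whitespace, comma-separated values.
            String[] parts = line.toString().split("\\s+", 2);
            if (parts.length < 2) {
                return; // skip malformed lines
            }
            outKey.set(parts[0]);
            outValue.set(tag + "," + parts[1]);
            context.write(outKey, outValue);
        }
    }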
You can take the design of the reducer from here.
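And a matching reducer sketch (FilterReducer is a hypothetical name). For simplicity it buffers both tagged records per key instead of relying on the secondary sort; that is safe here because each key carries at most one line per file. A full secondary-sort version would add a composite key plus a custom partitioner and grouping comparator so the file2 record always arrives first and nothing needs buffering:

    import java.io.IOException;
    import java.util.HashSet;
    import java.util.LinkedHashSet;
    import java.util.Set;

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    public class FilterReducer extends Reducer<Text, Text, Text, Text> {
        private final Text outValue = new Text();

        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            Set<String> toRemove = new HashSet<>();   // numbers seen in file2
            Set<String> kept = new LinkedHashSet<>(); // numbers seen in file1, in order

            for (Text value : values) {
                // Each value looks like "file2,2,5": source tag, then the numbers.
                String[] parts = value.toString().split(",");
                boolean fromFile2 = "file2".equals(parts[0]);
                for (int i = 1; i < parts.length; i++) {
                    if (fromFile2) {
                        toRemove.add(parts[i]);
                    } else {
                        kept.add(parts[i]);
                    }
                }
            }

            // Keep only the file1 numbers that file2 does not contain.
            kept.removeAll(toRemove);
            if (!kept.isEmpty()) {
                outValue.set(String.join(",", kept));
                context.write(key, outValue);
            }
        }
    }

Wiring this up only needs FileInputFormat.addInputPath for both files plus job.setMapperClass and job.setReducerClass; no distributed cache or map-side lookups, so neither file has to fit in memory.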