How to filter record values of files in Hadoop MapReduce?

I am working on a MapReduce program. I have two files, and I want to delete from file1 the information that also exists in file2. Every line has an ID as its key and a comma-separated list of numbers as its value.

file1:
1    1,2,10
2    2,7,8,5
3    3,9,12


file2:
1    1
2    2,5
3    3,9

I want the output to look like this:

output:
1    2,10
2    7,8
3    12

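That is, for each key I want to delete the values of file1 that appear under the same key in file2. One way is to use both files as input and have the map step emit (id, line) pairs, then filter the values in the reduce step. But the files are very large, so I don't think I can do it that way.

Alternatively, would it be efficient to use file1 as the only input and, in the map step, open file2, seek to the matching line, and compare the values? I have millions of keys, and for every key I would have to open file2, so I think this would cause excessive I/O.

What can I do?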

You can make both file1 and file2 inputs of the mapper. In the mapper, you'd add the source (file1 or file2) to each record. Then use a secondary sort to make sure that, for each key, the records from file2 come first. The combined input to the reducer would look like this:

1    file2,1
1    file1,1,2,10
2    file2,2,5
2    file1,2,7,8,5
3    file2,3,9
3    file1,3,9,12
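As a rough illustration of the tagging and ordering described above, here is a minimal plain-Java sketch that simulates the map and shuffle phases in memory. It uses no Hadoop classes; the names `TaggedSortSketch`, `Tagged`, `tag`, `SORT_ORDER`, and `shuffle` are hypothetical, and it assumes each line is `id<whitespace>values`. In a real job the source tag would probably come from the input split, and the ordering from a composite key with a custom sort and grouping comparator.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// In-memory simulation only; none of these names come from Hadoop.
class TaggedSortSketch {

    // Hypothetical composite record: join key, source tag, comma-separated values.
    static final class Tagged {
        final String id, source, values;

        Tagged(String id, String source, String values) {
            this.id = id;
            this.source = source;
            this.values = values;
        }

        String id() { return id; }

        @Override
        public String toString() { return id + "\t" + source + "," + values; }
    }

    // "Map" step: split the line into id and values, and attach the source name.
    static Tagged tag(String source, String line) {
        String[] parts = line.trim().split("\\s+", 2);
        return new Tagged(parts[0], source, parts[1]);
    }

    // "Secondary sort": order by id (as a string), then file2 before file1.
    static final Comparator<Tagged> SORT_ORDER =
        Comparator.comparing(Tagged::id)
                  .thenComparingInt(t -> t.source.equals("file2") ? 0 : 1);

    // Tag every line from both inputs, sort, and render as "id<TAB>source,values".
    static List<String> shuffle(List<String> file1, List<String> file2) {
        List<Tagged> all = new ArrayList<>();
        for (String line : file1) all.add(tag("file1", line));
        for (String line : file2) all.add(tag("file2", line));
        all.sort(SORT_ORDER);
        List<String> out = new ArrayList<>();
        for (Tagged t : all) out.add(t.toString());
        return out;
    }

    public static void main(String[] args) {
        List<String> file1 = Arrays.asList("1\t1,2,10", "2\t2,7,8,5", "3\t3,9,12");
        List<String> file2 = Arrays.asList("1\t1", "2\t2,5", "3\t3,9");
        shuffle(file1, file2).forEach(System.out::println);
    }
}
```

The sketch sorts everything in one list, which only works for small inputs; the point is the ordering rule, which the framework would apply per partition during the shuffle.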

You can take the design of the reducer from here.
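One possible shape for that reducer logic, sketched in plain Java without Hadoop types: since the file2 record arrives first for each key, its values can be collected into a set, and the file1 values that follow can be filtered against it. The class name `FilterReducerSketch` and the `reduce` signature are hypothetical, and the tagged value format `source,v1,v2,...` is assumed from the combined reducer input shown in the answer.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Simulates the body of a reduce call for a single key; not Hadoop API code.
class FilterReducerSketch {

    // taggedValues: all values for one key, with file2 records ordered first.
    static String reduce(List<String> taggedValues) {
        Set<String> exclude = new HashSet<>();   // values seen in file2 for this key
        List<String> kept = new ArrayList<>();   // file1 values that survive the filter
        for (String tagged : taggedValues) {
            String[] parts = tagged.split(",");
            boolean fromFile2 = parts[0].equals("file2");
            for (int i = 1; i < parts.length; i++) {
                if (fromFile2) {
                    exclude.add(parts[i]);
                } else if (!exclude.contains(parts[i])) {
                    kept.add(parts[i]);
                }
            }
        }
        return String.join(",", kept);
    }

    public static void main(String[] args) {
        System.out.println("1\t" + reduce(Arrays.asList("file2,1", "file1,1,2,10")));
        System.out.println("2\t" + reduce(Arrays.asList("file2,2,5", "file1,2,7,8,5")));
        System.out.println("3\t" + reduce(Arrays.asList("file2,3,9", "file1,3,9,12")));
    }
}
```

With this ordering, only the file2 values for the current key need to be held in memory while the file1 values stream through. A key that appears only in file2 yields an empty string here; a real reducer would likely skip emitting it.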
