java - Merge tab delimited files by key


I have three MapReduce jobs that generate tab-delimited files, all working on the same input file. The first value on each line is the key in the output of all three MR jobs.

What I want to do now is use these files together and "stitch" the values together by key. What would be the best mapper output and reducer input? I tried to use ArrayWritable, but because of the shuffle, for some records the value from the first file ends up in third place instead of first.

I need this:

  key \t value-from-first-MR-job \t value-from-second-MR-job \t value-from-third-MR-job

And the order should be identical for all records but, as I said, due to the shuffle, for some records I get:

  key \t value-from-third-MR-job \t value-from-first-MR


This is possible with simple tagging of the emitted value, because only three types of files are involved. In the map, get the path of the input split, identify which job's output it belongs to, and add a suitable prefix to the value. For clarity, say the outputs are in three directories:

  1. path1/mr_out_1
  2. path2/mr_out_2
  3. path3/mr_out_3

    Using TextInputFormat for all these paths, in the map you would do:

      // Split the line into key and the rest of the line (the value).
      String[] keyVal = value.toString().split("\t", 2);
      // Find out which job's output directory this split comes from.
      Path filePath = ((FileSplit) context.getInputSplit()).getPath();
      String dirName = filePath.getParent().getName();
      Text outValue = new Text();
      // Prefix the value with a tag that encodes its target position.
      if (dirName.equals("mr_out_1")) {
          outValue.set("1_" + keyVal[1]);
      } else if (dirName.equals("mr_out_2")) {
          outValue.set("2_" + keyVal[1]);
      } else {
          outValue.set("3_" + keyVal[1]);
      }
      context.write(new Text(keyVal[0]), outValue);

    If all the files are in the same directory, use fileName instead of dirName, and identify the tag based on the file name (a regex match may be appropriate):

      String fileName = filePath.getName();
      if (fileName.matches("regex")) {...}
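    As a rough illustration of the regex idea, the helper below derives the tag from a file name. It is only a sketch: the class name `FileTagger`, the method `tagFor`, and the assumption that every file name contains `mr_out_<n>` are hypothetical and not part of the setup described above, so adapt the pattern to your real file names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FileTagger {
    // Assumption: each job's output file name contains "mr_out_<n>".
    private static final Pattern JOB_NUMBER = Pattern.compile("mr_out_(\\d+)");

    /** Returns the tag prefix (for example "2_") for a file name, or null if nothing matches. */
    public static String tagFor(String fileName) {
        Matcher m = JOB_NUMBER.matcher(fileName);
        // find() searches anywhere in the name, unlike String.matches(), which must match the whole string.
        return m.find() ? m.group(1) + "_" : null;
    }

    public static void main(String[] args) {
        System.out.println(tagFor("mr_out_2-part-r-00000")); // prints 2_
    }
}
```

    In the mapper you would then call something like `outValue.set(tagFor(fileName) + keyVal[1])`, after deciding how to handle a file name that does not match.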

    In the reducer, just put the incoming values into a list and sort them. The rest is quite simple:

      // Collect the tagged values; sorting puts them in tag order (1_, 2_, 3_).
      List<String> list = new ArrayList<String>(3);
      for (Text v : values) {
          list.add(v.toString());
      }
      Collections.sort(list);
      // Strip the two-character tag and join the values with tabs.
      StringBuilder builder = new StringBuilder();
      for (String s : list) {
          builder.append(s.substring(2) + "\t");
      }
      context.write(key, new Text(builder.toString().trim()));

I think this will serve the purpose. Keep in mind that if there are more than 9 files, the Collections.sort strategy will fail, because the tags are sorted alphabetically rather than numerically. In that case you can extract the tag separately, parse it into an Integer, and use a TreeMap<Integer, String> that maps each tag to its actual string.
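A minimal sketch of that TreeMap variant, written as plain Java so it can be run outside Hadoop. The class name `TagOrderedJoin` and the method `join` are made up for illustration, and it assumes every value has the form `<n>_<value>` with exactly one value per tag:

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class TagOrderedJoin {
    /** Joins tagged values ("<n>_<value>") in numeric tag order, separated by tabs. */
    public static String join(List<String> taggedValues) {
        // TreeMap keeps its keys sorted, here numerically by tag.
        Map<Integer, String> byTag = new TreeMap<>();
        for (String tagged : taggedValues) {
            int sep = tagged.indexOf('_'); // the first underscore ends the tag
            int tag = Integer.parseInt(tagged.substring(0, sep));
            byTag.put(tag, tagged.substring(sep + 1));
        }
        return String.join("\t", byTag.values());
    }

    public static void main(String[] args) {
        // Values arrive in arbitrary order after the shuffle.
        List<String> shuffled = Arrays.asList("10_j", "2_b", "1_a");
        System.out.println(join(shuffled)); // prints a, b, j separated by tabs
    }
}
```

In the reducer, the loop body would take `v.toString()` for each incoming Text value. A plain string sort would place "10_j" before "1_a", which is why the tag is parsed as a number here.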

NB: All the above snippets use the new API. I did not use an IDE to write them, so some syntax errors may exist. Also, I did not follow proper conventions when writing them. For example, the map output key can be a Text object held as a class member, and calling outKey.set(keyVal[0]) on it removes the object-creation overhead.

