hadoop - Pig cross product reducer key
When performing a cross product operation (followed by a filter), the reducer sizes are badly imbalanced: some reducers write 0 output while others take several hours to complete. A basic example is the following code:
crossproduct = cross tweets, clients;
result = filter crossproduct by text matches CONCAT('.*', CONCAT(keyword, '.*'));
store result into 'result' using PigStorage(' ');
In this case, what is the reduce key?
This is a difficult question to answer well. CROSS is implemented in Pig as a join on synthetic keys. The best resource for understanding CROSS is Programming Pig, page 68.
In your example, the cross is rewritten to something like:
a = foreach tweets generate flatten(GFCross(0, 2)), flatten(*);
b = foreach clients generate flatten(GFCross(1, 2)), flatten(*);
c = cogroup a by ($0, $1), b by ($0, $1);
crossproduct = foreach c generate flatten(a), flatten(b);
As explained in the book, GFCross is an internal UDF. Its first argument is the input number, and its second argument is the total number of inputs. In this example, the UDF generates records with the schema (int, int). The field whose position matches the first argument holds a random number between 0 and 3; the other field counts from 0 to 3. So, if we assume the first record in a draws the random number 3, and the first record in b draws the random number 2, the following 4 tuples are generated by the UDF for each input:
a: {(3,0), (3,1), (3,2), (3,3)}
b: {(0,2), (1,2), (2,2), (3,2)}
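To make the mechanics concrete, here is a minimal Java sketch of this key-generation scheme for the two-input case. It is only an illustration of the idea described above, not Pig's actual GFCross source (the real UDF lives in org.apache.pig.impl.builtin.GFCross), and it assumes 4 key values per dimension to match the example:

import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Sketch of GFCross-style synthetic key generation for a 2-input cross.
public class GFCrossSketch {
    private static final int KEYS_PER_DIM = 4;   // values 0..3, as in the example
    private static final Random RAND = new Random();

    // myInput corresponds to the first GFCross argument (0 or 1).
    static List<int[]> syntheticKeys(int myInput) {
        int fixed = RAND.nextInt(KEYS_PER_DIM);  // one random draw per record
        List<int[]> keys = new ArrayList<>();
        for (int other = 0; other < KEYS_PER_DIM; other++) {
            int[] key = new int[2];
            key[myInput] = fixed;                // this input's field is fixed
            key[1 - myInput] = other;            // the other field counts 0..3
            keys.add(key);
        }
        return keys;
    }

    public static void main(String[] args) {
        // Input 0 (tweets) fixes the first field; input 1 (clients) the second.
        for (int[] k : syntheticKeys(0)) System.out.println("a: (" + k[0] + "," + k[1] + ")");
        for (int[] k : syntheticKeys(1)) System.out.println("b: (" + k[0] + "," + k[1] + ")");
    }
}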
When the join is performed, the (3,2) tuple from a is joined with the (3,2) tuple from b. For every pair of records across the two inputs, there is guaranteed to be one and only one pair of matching artificial keys, so each pair produces exactly one output record.
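This one-and-only-one guarantee is easy to check for the tuples above: a's keys all share their first field and b's keys all share their second, so the only possible overlap is the key combining the two random draws. A small self-contained check, using the values 3 and 2 assumed in the example:

import java.util.HashSet;
import java.util.Set;

// Verifies that exactly one synthetic key is shared between the two
// example records: a fixes 3 in field 0, b fixes 2 in field 1.
public class OneMatchCheck {
    public static void main(String[] args) {
        int ra = 3, rb = 2;                  // random draws from the example
        Set<String> aKeys = new HashSet<>();
        Set<String> bKeys = new HashSet<>();
        for (int i = 0; i < 4; i++) {
            aKeys.add(ra + "," + i);         // (3,0) (3,1) (3,2) (3,3)
            bKeys.add(i + "," + rb);         // (0,2) (1,2) (2,2) (3,2)
        }
        aKeys.retainAll(bKeys);              // set intersection
        System.out.println(aKeys);           // prints [3,2] -- exactly one key
    }
}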
So, to answer your question about the reduce key: the reduce key is the synthetic key generated by GFCross. Since the random numbers are chosen independently for each record, the resulting join work should be spread across a distribution of reducers.